Multilayer language resource set for semantic analysis and synthesis of text in Latvian (Q3056314)

Teksta viedā automātiskā rezumēšana ir aktuāla tēma dabiskās valodas sapratnē (NLU) un tekstradē (NLG). Pretstatā virspusējai rezumēšanai, kurā no teksta tiek atlasīti daži informatīvi teikumi, viedajai rezumēšanai nepieciešama pilna gramatiskā un semantiskā analīze, nozīmīgākās informācijas identificēšana un saistītu parafrāžu sintēze (ģenerēšana). Projekta industriālajam partnerim – Nacionālajai informācijas aģentūrai LETA – vieda automātiskā rezumēšana nepieciešama mediju monitorēšanā, savukārt zinātniskajam partnerim – LU MII – ir ievērojama pieredze progresīvā semantiskajā analīzē un anotētu valodas resursu izveidē. Projektā tiks izveidoti daudzslāņu semantiski anotēti latviešu valodas resursi, izmantojot un attīstot pasaulē atzītas multilingvālas reprezentācijas (AMR, PropBank, FrameNet, Universal Dependencies, Grammatical Framework, BabelNet, DBpedia). Iegūto resursu izmantošanas iespējas tiks demonstrētas, izstrādājot konceptuālu teksta viedās rezumēšanas tehnoloģijas prototipu, kura potenciāls tiks novērtēts gan mediju monitorēšanas kontekstā, gan izmantojot ROUGE un citas metrikas. Projekts būtiski sekmēs pētījumus un inovācijas latviešu valodas automātiskā sapratnē un tekstradē.

Atslēgvārdi: datorlingvistika, valodas tehnoloģijas, valodas resursi, leksiskā semantika, gramatiskā analīze, semantiskā analīze, teksta rezumēšana.

Šī starpdisciplinārā projekta vispārīgais mērķis ir attīstīt pētniecību un inovācijas valodas automātiskā sapratnē un tekstradē, nostiprinot latviešu valodas tehnoloģisko atbalstu Eiropas daudzvalodu digitālajā vienotajā tirgū. Lai nodrošinātu tam pamatu, projekta specifiskais mērķis ir izveidot jaunu, fundamentālu daudzslāņu latviešu valodas resursu kopu un nodemonstrēt šo resursu izmantošanas potenciālu jaunu, inovatīvu tehnoloģiju izstrādē valodas sapratnes un tekstrades lietojumiem.

Projektā tiks veikti rūpnieciskie pētījumi atbilstoši “Computer and information sciences” (FORD 1.2) un “Languages and literature” (FORD 6.2) zinātņu nozarēm. Projekts nav saistīts ar saimniecisko darbību.

Projektā ir plānotas piecas galvenās darbības. Pirmās trīs darbības ir saistītas ar mašīnlasāmu, sintaktiski un semantiski anotētu tekstu korpusu izveidi un novērtēšanu valodas sapratnes lietojumiem. Ceturtā darbība ir apjomīgas skaidrojošās un sinonīmu vārdnīcas formalizēšana un integrēšana semantiskā tīmekļa saistīto atvērto datu mākonī. No formalizētās vārdnīcas tiks atvasināti multilingvāli skaitļojamie leksikoni, kas nepieciešami tekstradē. Šie rezultāti tiks izmantoti piektajā darbībā, izstrādājot teksta automātiskās rezumēšanas tehnoloģijas laboratorisku prototipu (TRL 4).

Projektu īstenos LU MII Mākslīgā intelekta laboratorijas zinātniskie darbinieki – datorzinātnieki un valodnieki, t.sk. jaunie zinātnieki un doktoranti, – sadarbībā ar SIA LETA Pētniecības laboratorijas darbiniekiem. Ņemot vērā SIA LETA kompetenci mediju monitoringā un pieredzi pētniecības projektos, kas saistīti ar valodas automātisku semantisko analīzi, sadarbība paredzēta darbībās, kas attiecas uz semantiski anotētu datu sagatavošanu, mašīnmācīšanās metožu izstrādi un rezultātu novērtēšanu. SIA LETA kompetence ir ļoti būtiska arī konceptuālā prototipa specificēšanā, izstrādē un validēšanā.

Projekta īstenošana būtiski stiprinās LU MII kompetenci un kapacitāti valodas resursu un tehnoloģiju pētniecībā un inovācijās un pavērs jaunas iespējas sadarbībai ar komersantiem un ārvalstu zinātniskajām institūcijām.

Par pētījuma problemātiku un rezultātiem tiks publicēti 10 oriģināli zinātniskie raksti. Viens no galvenajiem projekta praktiskajiem rezultātiem būs brīvi pieejamās datu kopas ar augstu pievienoto vērtību: 5 anotēti tekstu korpusi un 5 skaitļojamie leksikoni. Otrs galvenais praktiskais rezultāts – jaunas valodas tehnoloģijas prototips.

Projekta kopējās izmaksas ir 649417.19 EUR, no kurām 550871.56 EUR ir ERAF atbalsts. Projektu plānots īstenot no 01.12.2016. līdz 31.11.2019. (Latvian)

Smart automatic text summarisation is a topical issue in natural language understanding (NLU) and natural language generation (NLG). In contrast to shallow summarisation, in which a few informative sentences are selected from the text, smart summarisation requires full grammatical and semantic analysis, identification of the most significant information, and synthesis (generation) of coherent paraphrases. The project's industrial partner, the National Information Agency LETA, needs smart automatic summarisation for media monitoring, while the scientific partner, LU MII, has considerable experience in advanced semantic analysis and in building annotated language resources. The project will create multilayer, semantically annotated Latvian language resources, using and extending internationally recognised multilingual representations (AMR, PropBank, FrameNet, Universal Dependencies, Grammatical Framework, BabelNet, DBpedia). The possible uses of the resulting resources will be demonstrated by developing a conceptual prototype of smart text summarisation technology, whose potential will be evaluated both in the context of media monitoring and using ROUGE and other metrics. The project will substantially advance research and innovation in the automatic understanding and generation of Latvian.

Keywords: computational linguistics, language technologies, language resources, lexical semantics, grammatical analysis, semantic analysis, text summarisation.

The general objective of this interdisciplinary project is to advance research and innovation in automatic language understanding and generation, strengthening the technological support for Latvian in Europe's multilingual digital single market. To provide a basis for this, the project's specific objective is to create a new, fundamental multilayer set of Latvian language resources and to demonstrate the potential of these resources in developing new, innovative technologies for language understanding and generation applications.

The project will carry out industrial research in the fields of “Computer and information sciences” (FORD 1.2) and “Languages and literature” (FORD 6.2). The project is not related to economic activity.

Five main activities are planned. The first three concern the creation and evaluation of machine-readable, syntactically and semantically annotated text corpora for language understanding applications. The fourth is the formalisation of a large explanatory and synonym dictionary and its integration into the Semantic Web's Linked Open Data cloud. Multilingual computational lexicons, which are needed for text generation, will be derived from the formalised dictionary. These results will be used in the fifth activity, which develops a laboratory prototype (TRL 4) of automatic text summarisation technology.

The project will be implemented by researchers of the LU MII Artificial Intelligence Laboratory (computer scientists and linguists, including young researchers and doctoral students) in cooperation with staff of the SIA LETA Research Laboratory. Given SIA LETA's competence in media monitoring and its experience in research projects on automatic semantic analysis of language, cooperation is planned in the activities concerning the preparation of semantically annotated data, the development of machine learning methods, and the evaluation of results. SIA LETA's competence is also essential in specifying, developing and validating the conceptual prototype.

Implementation of the project will substantially strengthen LU MII's competence and capacity in research and innovation on language resources and technologies, and will open new opportunities for cooperation with businesses and foreign research institutions.

Ten original scientific papers will be published on the research problems and results. One of the main practical results of the project will be freely available datasets with high added value: 5 annotated text corpora and 5 computational lexicons. The other main practical result is a prototype of a new language technology.

The total cost of the project is EUR 649417.19, of which EUR 550871.56 is ERDF support. The project is planned to be implemented from 01.12.2016 to 31.11.2019. (English)
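
The ROUGE metric named in the description compares a generated summary with a human-written reference by counting overlapping n-grams. The following is a minimal illustrative sketch of ROUGE-N recall only. The function name `rouge_n_recall`, the whitespace tokenisation, and the lowercasing are assumptions made for this example; they are not taken from the project, which does not state which ROUGE variant or implementation it uses.

```python
from collections import Counter


def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    """Illustrative ROUGE-N recall: the share of reference n-grams
    that also occur in the candidate summary (with clipped counts).

    Assumption: tokens are produced by lowercasing and splitting on
    whitespace; real evaluations usually apply proper tokenisation.
    """

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )

    cand_counts = ngrams(candidate)
    ref_counts = ngrams(reference)

    total_ref = sum(ref_counts.values())
    if total_ref == 0:
        # Reference too short to contain any n-gram of this size.
        return 0.0

    # Each reference n-gram is matched at most as often as it
    # appears in the candidate (clipped overlap).
    overlap = sum(
        min(count, cand_counts[gram]) for gram, count in ref_counts.items()
    )
    return overlap / total_ref


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    candidate = "the cat sat"
    print(rouge_n_recall(candidate, reference, n=1))
    print(rouge_n_recall(candidate, reference, n=2))
```

Recall is shown here because it asks how much of the reference content the summary covers; published ROUGE scores often also report precision and an F-measure, which this sketch leaves out.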
