NEW ESTONIAN WORDS AND SENSES: DETECTION AND DESCRIPTION
Margit Langemets, Jelena Kallas, Kaisa Norak, Indrek Hein
Institute of the Estonian Language
Globalex Workshop on Lexicography and Neologism, 8 May 2019
DSNA 22 – Indiana University, Bloomington, IN
New words in the dictionaries
• Grenzstein (1884)
  • 1,600 words
• Aavik (1919, 2nd ed. 1921)
  • 4,000 words
• Erelt, Kull, Meriste (1985)
  • 150 words (stems)
• separate dictionaries
  • (a) separate general dictionaries
  • (b) database of new words
• new words included into Ekilex (2019), a unified single resource
• User Interface: Sõnaveeb (Wordweb)
Sõnaveeb (Wordweb): computer and mobile view (Ver. 1.4.0, released in February 2019)
Unified single resource Ekilex
• enables constant updating of different data subsets
• NEW WORD in the database
  • provided (ideally) with
    • long definition (< general explanatory dictionary, large bilingual dictionary) – Detailed view
    • short/simpler definition (< learners' dictionary, bilingual dictionary) – Simple view
    • gloss/signpost (< orthological dictionary, bilingual dictionary) – Detailed/Simple view
    • terminological definition (< termbase)
    • prescriptive advice
    • morphological information
    • etymological information
    • usage examples (for L1, L2, prescriptive advice)
    • translation equivalents (different languages)
    • synonyms
    • etc.
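The information types listed above can be pictured as one record per new word. The following is a minimal Python sketch under stated assumptions: every field name is hypothetical and this is not the Ekilex data model. The Simple and Detailed views are modelled as a choice between the short and the long definition, with a fallback when only one of them exists.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NewWordEntry:
    """Illustrative record for one new word; all field names are assumptions."""
    lemma: str
    long_definition: Optional[str] = None       # intended for the Detailed view
    short_definition: Optional[str] = None      # intended for the Simple view
    gloss: Optional[str] = None                 # signpost, usable in both views
    terminological_definition: Optional[str] = None
    prescriptive_advice: Optional[str] = None
    morphology: Optional[str] = None
    etymology: Optional[str] = None
    usage_examples: List[str] = field(default_factory=list)
    translations: Dict[str, List[str]] = field(default_factory=dict)
    synonyms: List[str] = field(default_factory=list)

    def definition_for(self, view: str) -> Optional[str]:
        """Return the definition for 'simple' or 'detailed', falling back to the other."""
        if view == "simple":
            return self.short_definition or self.long_definition
        if view == "detailed":
            return self.long_definition or self.short_definition
        raise ValueError(f"unknown view: {view!r}")


# Toy entry; the English gloss 'laptop' for 'süler' is the one given in the slides.
entry = NewWordEntry(
    lemma="süler",
    short_definition="laptop",
    translations={"en": ["laptop"]},
)
print(entry.definition_for("detailed"))
```

Because every field except the lemma is optional, an entry can be registered as soon as the word is detected and filled in later as the different data subsets are updated.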
Methods used so far
• the workflow not yet automated
• Estonian National Corpus (NC)
  • started in the 1990s
  • monitor corpus (updated every two years since 2017)
  • Estonian NC 2017 – 1.1 billion tokens
  • Estonian NC 2019 (October)
• Sketch Engine
  • Wordlist function
  • ELEXIS Survey for Lexicographers (2019): 54.8% (of those using any of the 22 CQSs) use Sketch Engine (SkE)
  • there are many neologisms that will be missed (Kilgarriff et al. 2015)
An experimental study: detecting new words
• Exclusion Dictionary Architecture (Cartier, 2017)
  • extraction of novel forms from monitor corpora
  • using lexicographic resources as a reference exclusion dictionary to induce unknown words
  • filters to eliminate spelling errors and proper nouns
  • no tracking of new meanings (semantic neologisms)
5 stages
Kaisa Norak, Indrek Hein, lexicographers (February–April 2018)
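The core idea of an exclusion dictionary can be sketched in a few lines of Python. This is an illustration only, not the implementation used in the study or in Cartier (2017): the function name, the regular expression, the capitalisation heuristic for proper nouns and the toy word lists are all assumptions.

```python
import re
from typing import Iterable, Set

# Lowercase letters only (including Estonian õ, ä, ö, ü, š, ž), optionally hyphenated.
WORD_RE = re.compile(r"^[a-zõäöüšž]+(?:-[a-zõäöüšž]+)*$")


def candidate_neologisms(corpus_lemmas: Iterable[str],
                         exclusion: Set[str],
                         min_length: int = 3) -> Set[str]:
    """Return corpus lemmas that are absent from the exclusion dictionary."""
    candidates = set()
    for lemma in corpus_lemmas:
        if lemma[:1].isupper():
            continue  # crude proper-noun filter (assumption): capitalised lemma
        if len(lemma) < min_length or not WORD_RE.match(lemma):
            continue  # noise filter: too short, or contains digits/symbols
        if lemma in exclusion:
            continue  # already described in a lexicographic resource
        candidates.add(lemma)
    return candidates


# Toy data: 'maja' and 'raamat' stand in for a full dictionary headword list.
exclusion_dictionary = {"maja", "raamat"}
corpus = ["süler", "akrojooga", "maja", "Tallinn", "aitähh", "x2"]
result = candidate_neologisms(corpus, exclusion_dictionary)
print(sorted(result))
```

In this toy run the misspelling aitähh survives alongside süler and akrojooga, because a plain set difference cannot tell a spelling error from a new word. That is why the architecture needs separate spelling filters and a final lexicographic evaluation.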
Stages 1-2: extraction of novel word forms, filtering
• extraction of novel word forms from the Institute’s text collection (collected from 2016 to 2018)
  • single new words (not MWEs)
  • online news, TV subtitles, transcribed books (from heliraamat.eki.ee: text>audio)
  • 712,197 word forms that failed automatic morphological analysis
• filtering (first round)
  • Python 3 language and its library
    • EstNLTK 1.4.1 (for lemmatization and morphological tagging)
  • R and its library Tidyverse (for filtering and sorting)
  • Excel (sorting)
  • lemmatization
  • data selection and (multiple) cleaning of selected lemmas
5,290 lemmas
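A first filtering round of this kind might look roughly like the Python sketch below. The slides name EstNLTK 1.4.1 but not the calls that were used, so the lemmatizer is passed in as a plain function; `toy_lemmatize` is a hypothetical stand-in that strips two sample endings, and the cleaning rules are assumptions.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def first_round_filter(unknown_forms: Iterable[str],
                       lemmatize: Callable[[str], str],
                       min_freq: int = 1) -> Dict[str, int]:
    """Map each cleaned lemma to the summed frequency of its word forms."""
    # Case folding is a simplification; a proper-noun filter would need the original case.
    form_counts = Counter(f.strip().lower() for f in unknown_forms)
    lemma_counts: Counter = Counter()
    for form, count in form_counts.items():
        if not form or not all(ch.isalpha() or ch == "-" for ch in form):
            continue  # drop forms containing digits, punctuation or spaces
        lemma_counts[lemmatize(form)] += count
    return {lemma: n for lemma, n in lemma_counts.items() if n >= min_freq}


def toy_lemmatize(form: str) -> str:
    """Hypothetical stand-in for a real lemmatizer: strips two sample endings."""
    for ending in ("iga", "id"):
        if form.endswith(ending) and len(form) > len(ending) + 2:
            return form[: -len(ending)]
    return form


# Toy input: word forms that a morphological analyser did not recognise.
forms = ["sülerid", "Süleriga", "süler", "akrojooga", "http://x", "2018a"]
lemmas = first_round_filter(forms, toy_lemmatize)
print(lemmas)
```

Raising `min_freq` is one simple way to shrink the list before manual cleaning, at the cost of losing rare but genuine new words.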
Stage 3: reference exclusion word list
• lexicographic resources
  • Explanatory Dictionary of the Estonian Language (EKSS 2009)
  • Dictionary of Estonian (DicEst 2019)
  • Dictionary of Foreign Words (VL 2015)
  • Dictionary of Standard Estonian (ÕS 2013)
  • in-house database of new words
3,722 lemmas
• English-Estonian Machine Translation Dictionary
  • incl. 233 unadapted English loanwords
    • weekend, lite, backup, wallet
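Checking candidates against several reference resources in turn can be sketched as follows. The function name, the resource labels and the tiny word sets are illustrative assumptions, not the actual headword lists. Recording how many lemmas each resource removes makes it easy to see what each list contributes.

```python
from typing import Dict, List, Set, Tuple


def apply_exclusion_lists(lemmas: Set[str],
                          resources: List[Tuple[str, Set[str]]]
                          ) -> Tuple[Set[str], Dict[str, int]]:
    """Remove lemmas found in each resource in turn; report removals per resource."""
    remaining = set(lemmas)
    removed_by: Dict[str, int] = {}
    for name, headwords in resources:
        hits = remaining & headwords   # lemmas this resource already covers
        removed_by[name] = len(hits)
        remaining -= hits
    return remaining, removed_by


# Toy data standing in for real headword lists (assumption).
candidates = {"maja", "weekend", "backup", "süler", "akrojooga"}
reference_resources = [
    ("explanatory dictionary", {"maja"}),
    ("MT dictionary", {"weekend", "lite", "backup", "wallet"}),
]
remaining, removed_by = apply_exclusion_lists(candidates, reference_resources)
print(sorted(remaining), removed_by)
```

The order of the resources changes only the per-resource counts, not the final set of remaining lemmas.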
Stages 4-5: compilation, lexicographic evaluation
• more tokenizing errors
  • ganisatsioon ‘ganization’
• common spelling mistakes
  • aitähh ‘thank you’
• lemmatization errors, e.g. nouns in genitive and partitive
• direct loans from other languages
  • fer-de-lance, fouetté, bordereau, soentjie, societa, bueno, laissez-faire
• Estonian dialect words
  • tüdrik ‘girl’, mõlemi ‘both’
• words derived from proper nouns (ca 180)
  • lutsiferianism ‘Luciferianism’ and tarsanlik ‘Tarzan-like’
ca 200 new words
  • süler ‘laptop’, akrojooga ‘acrobatic yoga’
Registering and presenting new words
• in Ekilex + Sõnaveeb (Wordweb)
  • diakooniline ‘diaconic’
• in Ekilex for further examination
  • baklavaa ‘baklava’, blog ‘blog’, veelkord ‘once more’ – vs. the standardized lemma forms baklava, blogi, veel kord
• 5,000 words on the waiting list (since 2005)
• 1,500 registered annually (manually), incl. MWEs
• a lot of derivatives and semantically transparent compound words
  • digiteerimine ‘digitalizing’
Descriptive vs. prescriptive data: all in one?
• 100-year-long line of spelling or orthographic dictionaries (ÕS 2018, 2013 ... 1918)
• government regulations for literary norm (since 2006): printed (!) ÕS
• prescriptive data
  • orthography and pronunciation, marking the degree of quantity, stress and palatalization
  • inflection
  • specifying what belongs to standard Estonian and what does not
  • prescriptively pointing out good and bad style in language
  • ? meanings
  • ? usage examples, ? collocations
• descriptive data
  • ? orthography and pronunciation (variation)
  • meanings
  • usage examples, collocations
  • etymological information
Plans for the future
• more advanced tools for neologism detection
  • detection of multi-word expressions and new meanings
  • ? database of common spelling mistakes
  • ? ELEXIS tools
  • joining or implementing Néoveille, a web platform for neologism tracking (Cartier 2017)
• visualizing usage and frequency information on the basis of time-stamped corpora
• presenting both descriptive and prescriptive data
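Frequency information from a time-stamped corpus could be derived along the lines of the Python sketch below. It is a minimal illustration with assumed names and toy data: each document is taken to be a (year, list of lemmas) pair, and the result is a per-million relative frequency for each year, which is the kind of series a usage chart would plot.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def relative_frequency_by_year(docs: Iterable[Tuple[int, List[str]]],
                               lemma: str) -> Dict[int, float]:
    """Occurrences of `lemma` per million tokens for each year found in `docs`."""
    hits: Counter = Counter()
    totals: Counter = Counter()
    for year, tokens in docs:
        totals[year] += len(tokens)
        hits[year] += sum(1 for t in tokens if t == lemma)
    # Skip years with no tokens to avoid division by zero.
    return {year: 1_000_000 * hits[year] / totals[year]
            for year in sorted(totals) if totals[year]}


# Toy time-stamped corpus (assumption): one tiny lemmatized document per year.
toy_docs = [
    (2016, ["maja", "maja", "maja", "maja"]),
    (2017, ["süler", "maja", "maja", "maja"]),
    (2018, ["süler", "süler", "maja", "maja"]),
]
trend = relative_frequency_by_year(toy_docs, "süler")
print(trend)
```

Normalising by the number of tokens per year matters for a monitor corpus, since the amount of text collected is unlikely to be the same in every period.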
Thank you
Margit Langemets
Jelena Kallas
Kaisa Norak
Indrek Hein