IndiLem@FIRE-MET-2014 : An Unsupervised Lemmatizer for Indian Languages

Abhisek Chakrabarty

Indian Statistical Institute, Kolkata

December 6, 2014

Contents

Task of a lemmatizer and its need.

The proposed lemmatization approach.

Results and error analysis.

What is lemmatization and why is it needed?

Lemmatization is a process that returns the base or dictionary form of a word in context, which is known as the lemma of that word in that context.

A lemma of a word in a context is required to retrieve the meaning of that word in that context.

For example, ‘I retrieved that document.’

Here the lemma of retrieved is retrieve.

If we cannot map retrieved to retrieve, the meaning of retrieved is not accessible.

Need of lemmatization for Indian languages

Major Indian languages (Hindi, Bengali, Gujarati etc.) are morphologically very rich and suffixing in nature.

Knowledge resources (dictionary, WordNet) for those languages usually store root words with their morphological and semantic descriptions.

Raw texts such as stories, newspapers and poems very often contain many inflected word forms.

To obtain their meaning and morphological properties, we have to determine the appropriate root words, which is the task of a lemmatizer.

So lemmatization is necessary for building many NLP tools (WSD systems, translation systems etc.) for Indian languages.

Difference from a stemmer

A stemmer operates on a single word without knowledge of the context.

Usually a stemmer returns the common portion of the variant word forms, and the stem may be an invalid word.

In contrast, the lemma of a particular word may differ across contexts, and the lemma must be a valid word of the language.

For all of the words retrieved, retrieval and retrieving, a stemmer may return retriev as the stem, as it is the common portion of all the inflected forms.

But a lemmatizer should return retrieve.
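
The stemmer behaviour described above can be illustrated with a toy sketch. It is not a real stemmer: it only takes the longest common prefix of the three word forms, and the `vocabulary` set is a hypothetical mini-dictionary introduced for illustration.

```python
import os.path

# The three variant forms from the slide.
forms = ["retrieved", "retrieval", "retrieving"]

# Hypothetical mini-dictionary of valid words (an assumption of this sketch).
vocabulary = {"retrieve", "retrieval"}

# os.path.commonprefix compares strings character by character,
# so it returns the longest common prefix of the word forms.
stem = os.path.commonprefix(forms)

print(stem)                # retriev
print(stem in vocabulary)  # False: the stem is not a valid word
```

The common portion is not a dictionary word, whereas a lemma is required to be one.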

Proposed Lemmatization Algorithm

Our lemmatization algorithm requires a dictionary or WordNet for collecting the root words of a language.

At first, the root words are stored in a trie structure.

Each node in the trie corresponds to a Unicode character of the language.

The nodes that correspond to the final character of a root word are marked as “final” nodes.
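
A minimal sketch of such a trie, assuming one node per Unicode code point and a boolean flag for “final” nodes. The names `TrieNode`, `build_trie` and `contains_root` are illustrative and do not come from the IndiLem system.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # maps one Unicode character to the child node
        self.is_final = False  # True if a root word ends at this node


def build_trie(root_words):
    """Store every root word in a trie, one character per edge."""
    root = TrieNode()
    for word in root_words:
        node = root
        for ch in word:  # iterating a str yields Unicode code points
            node = node.children.setdefault(ch, TrieNode())
        node.is_final = True
    return root


def contains_root(trie, word):
    """Return True if `word` was stored as a root word."""
    node = trie
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_final


trie = build_trie(["অংশ", "অংশু", "অংশুক"])
print(contains_root(trie, "অংশ"))  # a stored root word
print(contains_root(trie, "অং"))   # only a prefix of the stored roots
```

Because Bengali vowel signs such as ‘◌ু’ are separate code points, they occupy their own nodes, which matches the character-by-character decomposition used in the examples.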

Examples

‘অংশ’/‘angsh’ = ‘অ’/‘a’ + ‘◌ং’/‘ng’ + ‘শ’/‘sh’

‘অংশু’/‘angshu’ = ‘অংশ’/‘angsh’ + ‘◌ু’/‘u’

‘অংশুক’/‘angshuk’ = ‘অংশু’/‘angshu’ + ‘ক’/‘k’

‘অংশুধর’/‘angshudhar’ = ‘অংশু’/‘angshu’ + ‘ধ’/‘dha’ + ‘র’/‘r’

‘অংশগত’/‘angshgata’ = ‘অংশ’/‘angsh’ + ‘গ’/‘ga’ + ‘ত’/‘ta’

Searching strategy in the trie

To find the lemma of a surface word, the trie is navigated starting from the initial node.

Navigation ends either when the word is completely found in the trie or when, after some portion of the word, there is no path left in the trie to navigate.

While navigating, different situations may occur, and the lemma is determined according to which one arises.

Searching strategy in the trie (continued)

If the surface word is itself a root word, then we will reach a final node.

If the surface word is not a root word, then the trie is navigated up to the node where the surface word completely ends or there is no path to navigate.

We call this node the end node.

Now two different cases may occur here.

1. If one or more final nodes are found in the path from the initial node to the end node, then pick the final node which is closest to the end node.

The word represented by the path from the initial node to the picked final node is considered as the lemma.
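
The first case can be sketched as follows, assuming a nested-dictionary trie in which the multi-character key `FINAL` marks a final node. The names `FINAL`, `build_trie` and `lemma_case_one` are hypothetical, and the root list is a two-word toy dictionary.

```python
FINAL = "<final>"  # marker key; cannot clash with a single-character edge


def build_trie(root_words):
    """Nested-dict trie: each key is one character, FINAL marks a root word."""
    trie = {}
    for word in root_words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[FINAL] = True
    return trie


def lemma_case_one(trie, surface):
    """Walk the trie along `surface` and return the root ending at the final
    node closest to the end node, or None if the path has no final node."""
    node = trie
    last_final = None  # length of the longest root seen on the path
    for depth, ch in enumerate(surface, start=1):
        if ch not in node:
            break  # no path left: the previous node is the end node
        node = node[ch]
        if FINAL in node:
            last_final = depth
    return surface[:last_final] if last_final is not None else None


trie = build_trie(["অংশ", "অংশীদার"])
print(lemma_case_one(trie, "অংশের"))      # expected root: অংশ
print(lemma_case_one(trie, "অংশীদারের"))  # expected root: অংশীদার
```

Since the walk moves away from the initial node, the last final node met on the path is the one closest to the end node. A surface word that is itself a root word is returned unchanged.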

Examples

Consider two inflected words ‘অংশের’/‘angsher’ and ‘অংশীদারের’/‘angshidaarer’.

‘অংশের’/‘angsher’ comes from ‘অংশ’/‘angsh’.

‘অংশীদারের’/‘angshidaarer’ comes from ‘অংশীদার’/‘angshidaar’.

2. If no root word is found in the path from the initial node to the end node, then find the final node in the trie which is closest to the end node.

If more than one final node is found at the closest distance, then pick all of them.

Now, generate the root word(s) represented by the path from the initial node to the picked final node(s).

Finally, among the generated root word(s), pick the root word(s) having the maximum overlapping prefix length with the surface word.

By the phrase ‘overlapping prefix length’ between two words, we mean the length of the longest common prefix between them.

If more than one root is still selected at this stage, then select any one of them arbitrarily as the lemma.

This is acceptable because it is very rare to have more than one root word at this stage, and if more than one root exists, then all are viable candidates.
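
A sketch of the second case under stated assumptions. The slides do not define how distance to the end node is measured, so this sketch assumes it is the number of edges in the trie treated as an undirected tree (moves to children and to the parent are both allowed). The arbitrary final choice is made deterministic here by taking the first candidate in sorted order. All names (`Node`, `build_trie`, `overlap`, `lemma_case_two`) are hypothetical, and the function is meant to be called only after the first case has found no final node on the path.

```python
class Node:
    def __init__(self, parent=None, prefix=""):
        self.parent = parent
        self.prefix = prefix  # string spelled by the path from the initial node
        self.children = {}
        self.is_final = False


def build_trie(root_words):
    root = Node()
    for word in root_words:
        node = root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = Node(node, node.prefix + ch)
            node = node.children[ch]
        node.is_final = True
    return root


def overlap(a, b):
    """Overlapping prefix length: length of the longest common prefix."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def lemma_case_two(root, surface):
    # Navigate to the end node.
    end = root
    for ch in surface:
        if ch not in end.children:
            break
        end = end.children[ch]
    # Expand level by level from the end node until final nodes appear.
    seen = {end}
    frontier = [end]
    candidates = []
    while frontier and not candidates:
        next_level = []
        for node in frontier:
            neighbours = list(node.children.values())
            if node.parent is not None:
                neighbours.append(node.parent)
            for nb in neighbours:
                if nb not in seen:
                    seen.add(nb)
                    next_level.append(nb)
        candidates = [n.prefix for n in next_level if n.is_final]
        frontier = next_level
    if not candidates:
        return None
    # Keep the roots with maximum overlapping prefix length, then pick one.
    best = max(overlap(c, surface) for c in candidates)
    return sorted(c for c in candidates if overlap(c, surface) == best)[0]


trie = build_trie(["শুনা", "শুনানি", "শুনানো"])
print(lemma_case_two(trie, "শুনে"))  # expected root: শুনা
```

For ‘শুনে’, navigation stops at the node for ‘শুন’, and the only final node one edge away is the one for ‘শুনা’. Under a different reading of "closest" (for example, searching only below the end node), the candidate set could differ.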

Examples

Consider the dictionary root words ‘শুনা’/‘shuna’, ‘শুনানি’/‘shunani’ and ‘শুনানো’/‘shunano’.

Now take an inflected word ‘শুনে’/‘shune’, which actually comes from ‘শুনা’/‘shuna’.

Results

Based on the evaluation of the Morpheme Extraction Task - FIRE 2014, the results obtained on Bengali data using our lemmatization system are given in the following table.

Total precision    Total recall    Total F1-measure
56.19%             65.08%          60.31%
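
As a quick consistency check, the reported F1-measure can be recomputed from the reported precision and recall, assuming the standard definition of F1 as their harmonic mean. The function name `f1_measure` is illustrative.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported for Bengali, in percent.
f1 = f1_measure(56.19, 65.08)
print(round(f1, 2))  # 60.31
```

The recomputed value agrees with the reported 60.31% to two decimal places.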

Source of Lemmatization Errors

Compound words and out-of-vocabulary words are not considered in our algorithm.

Root words are taken from a dictionary, so if the coverage of the dictionary used is not good, that will cause errors.

However, as there is no such good language-independent lemmatizer for Indian languages, we hope our effort is a positive contribution.

Questions?

Thank You
