IndiLem@FIRE-MET-2014 : An Unsupervised Lemmatizer for ...clia/slides/Abhisek_MET_fire14.pdf ·...


IndiLem@FIRE-MET-2014: An Unsupervised Lemmatizer for Indian Languages

Abhisek Chakrabarty

Indian Statistical Institute, Kolkata

December 6, 2014

Contents

The task of a lemmatizer and why it is needed.

The proposed lemmatization approach.

Results and error analysis.

What is lemmatization and why is it needed?

Lemmatization is the process of returning the base or dictionary form of a word in context; this form is known as the lemma of that word in that context.

The lemma of a word in a given context is required to retrieve the meaning of the word in that context.

For example, consider ‘I retrieved that document.’

Here the lemma of retrieved is retrieve.

If we cannot map retrieved to retrieve, the meaning of retrieved is not accessible.

Need for lemmatization in Indian languages

Major Indian languages (Hindi, Bengali, Gujarati, etc.) are morphologically very rich and suffixing in nature.

Knowledge resources (dictionaries, WordNet) for these languages usually store root words with their morphological and semantic descriptions.

Raw texts such as stories, newspapers, and poems very often contain many inflected word forms.

To obtain the meaning and morphological properties of these forms, we have to determine the appropriate root words, which is the task of a lemmatizer.

Lemmatization is therefore necessary for building many NLP tools (WSD systems, translation systems, etc.) for Indian languages.

Difference from a stemmer

A stemmer operates on a single word without knowledge of the context.

A stemmer usually returns the common portion of the variant word forms, and the stem may be an invalid word.

In contrast, the lemma of a particular word may differ across contexts, and the lemma must be a valid word of the language.

For all of the words retrieved, retrieval, and retrieving, a stemmer may return retriev as the stem, as it is the common portion of all the inflected forms.

A lemmatizer, however, should return retrieve.
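The contrast can be illustrated with a small Python sketch. It is only a toy: the "stemmer" here simply takes the common prefix of the variants, and `DICTIONARY` is a hypothetical one-entry lexicon, neither of which comes from the slides.

```python
def common_prefix(words):
    """Return the longest prefix shared by every word in `words`."""
    prefix = words[0]
    for word in words[1:]:
        # Shrink the prefix until the current word starts with it.
        while not word.startswith(prefix):
            prefix = prefix[:-1]
    return prefix


# Hypothetical one-entry dictionary of valid root words (an assumption).
DICTIONARY = {"retrieve"}

variants = ["retrieved", "retrieval", "retrieving"]
stem = common_prefix(variants)

print(stem)                      # retriev
print(stem in DICTIONARY)        # False: the stem is not a valid word
print("retrieve" in DICTIONARY)  # True: the lemma is a valid word
```

The stem is a convenient grouping key, but only the lemma can be looked up in a dictionary or WordNet.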

Proposed Lemmatization Algorithm

Our lemmatization algorithm requires a dictionary or WordNet for collecting the root words of a language.

First, the root words are stored in a trie.

Each node in the trie corresponds to a Unicode character of the language.

The node reached by the final character of a root word is marked as a “final” node.
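A minimal Python sketch of such a trie is given below. The names `TrieNode` and `build_trie` are hypothetical, and the three root words are borrowed from the Bengali examples in these slides purely for illustration; the original implementation may be organised differently.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps one Unicode character to the child node
        self.final = False   # True if a root word ends at this node


def build_trie(root_words):
    """Insert every root word into a trie, one Unicode code point per node."""
    root = TrieNode()
    for word in root_words:
        node = root
        for ch in word:  # a Python str iterates code point by code point
            node = node.children.setdefault(ch, TrieNode())
        node.final = True
    return root


# Illustrative root words only; a real lexicon would come from a dictionary or WordNet.
trie = build_trie(["অংশ", "অংশু", "অংশুক"])
```

Because Python strings iterate over code points, vowel signs and the anusvara each get a node of their own, which matches the one-character-per-node description above.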

Examples

‘অংশ’/‘angsh’ = ‘অ’/‘a’ + ‘◌ং’/‘ng’ + ‘শ’/‘sh’

‘অংশু’/‘angshu’ = ‘অংশ’/‘angsh’ + ‘◌ু’/‘u’

‘অংশুক’/‘angshuk’ = ‘অংশু’/‘angshu’ + ‘ক’/‘k’

‘অংশুধর’/‘angshudhar’ = ‘অংশু’/‘angshu’ + ‘ধ’/‘dha’ + ‘র’/‘r’

‘অংশগত’/‘angshgata’ = ‘অংশ’/‘angsh’ + ‘গ’/‘ga’ + ‘ত’/‘ta’
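The same character-level decomposition can be reproduced in Python, where iterating over a string yields one Unicode code point at a time. This is only an illustrative sketch; `code_points` is a hypothetical helper name.

```python
import unicodedata


def code_points(word):
    """Split a word into its individual Unicode code points."""
    return list(word)


for word in ["অংশ", "অংশু", "অংশগত"]:
    names = [unicodedata.name(ch) for ch in code_points(word)]
    print(word, len(names), names)

# The root 'angsh' is a prefix of 'angshu', so both words share one path in a trie.
print("অংশু".startswith("অংশ"))
```

Each printed name corresponds to one trie node on the path for that word.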

Searching strategy in the trie

To find the lemma of a surface word, the trie is navigated starting from the initial node.

Navigation ends either when the whole word has been found in the trie or when, after some portion of the word, no path remains in the trie to follow.

During navigation, several situations may arise, and the lemma is decided according to which of them occurs.

Searching strategy in the trie (continued)

If the surface word is itself a root word, navigation reaches a final node.

If the surface word is not a root word, the trie is navigated up to the node where the surface word ends completely or where there is no further path to navigate.

We call this node the end node.
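Navigation to the end node might be sketched as follows. The English root words and the names `Node`, `build_trie`, and `navigate` are assumptions made for illustration; they are not taken from the slides.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.final = False


def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.final = True
    return root


def navigate(root, surface):
    """Follow `surface` from the initial node; return (end_node, chars_consumed)."""
    node, consumed = root, 0
    for ch in surface:
        child = node.children.get(ch)
        if child is None:      # no path left in the trie
            break
        node, consumed = child, consumed + 1
    return node, consumed


trie = build_trie(["retrieve", "retrieval"])   # hypothetical toy lexicon

end, n = navigate(trie, "retrieve")    # surface word is itself a root
print(n, end.final)                    # 8 True

end, n = navigate(trie, "retrieved")   # path runs out after "retrieve"
print(n, end.final)                    # 8 True

end, n = navigate(trie, "retrieving")  # path runs out after "retriev"
print(n, end.final)                    # 7 False
```

In the last call the end node is not final, so the decision about the lemma depends on what lies on the path before it.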

Searching strategy in the trie (continued)

Two different cases may now occur.

1. If one or more final nodes are found on the path from the initial node to the end node, pick the final node that is closest to the end node.

The word represented by the path from the initial node to the picked final node is taken as the lemma.
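Case 1 amounts to choosing the longest root word that is a prefix of the surface word. A hedged Python sketch follows; the function name `lemma_case1` and the English toy lexicon are hypothetical, and the function returns `None` when no final node lies on the path, a situation this case does not cover.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.final = False


def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.final = True
    return root


def lemma_case1(root, surface):
    """Return the root word ending at the final node closest to the end node, or None."""
    node, best = root, None
    for i, ch in enumerate(surface, start=1):
        node = node.children.get(ch)
        if node is None:           # end node reached: no further path
            break
        if node.final:
            best = surface[:i]     # a later final node replaces an earlier one
    return best


trie = build_trie(["part", "partner"])   # hypothetical toy lexicon

print(lemma_case1(trie, "parts"))      # part
print(lemma_case1(trie, "partners"))   # partner
print(lemma_case1(trie, "xyz"))        # None
```

For "partners" both "part" and "partner" are final nodes on the path; the later one is closer to the end node and is therefore picked.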

Examples

Consider two inflected words, ‘অংশের’/‘angsher’ and ‘অংশীদারের’/‘angshidaarer’.

‘অংশের’/‘angsher’ comes from ‘অংশ’/‘angsh’.

‘অংশীদারের’/‘angshidaarer’ comes from ‘অংশীদার’/‘angshidaar’.
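The first case can be sketched in Python on the two example words. This is a minimal illustration, not the authors' implementation: it assumes the end node is the deepest trie node reached by following the surface word character by character, and it uses the romanized forms for readability. The names `TrieNode`, `build_trie` and `lemma_case_one` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.is_final = False  # True if the path from the initial node spells a root word


def build_trie(root_words):
    """Insert every dictionary root word and mark its last node as final."""
    root = TrieNode()
    for word in root_words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_final = True
    return root


def lemma_case_one(trie, surface_word):
    """Return the root at the final node closest to the end node, or None."""
    node = trie
    best = None
    for i, ch in enumerate(surface_word):
        if ch not in node.children:
            break  # assumed end node: no further character can be matched
        node = node.children[ch]
        if node.is_final:
            # Each later final node is closer to the end node than the previous one.
            best = surface_word[: i + 1]
    return best


trie = build_trie(["angsh", "angshidaar"])
print(lemma_case_one(trie, "angsher"))       # angsh
print(lemma_case_one(trie, "angshidaarer"))  # angshidaar
```

For ‘angshidaarer’ both ‘angsh’ and ‘angshidaar’ lie on the path, and the sketch keeps the later one because it is nearer to the end node. A return value of `None` corresponds to the second case, where no final node is found on the path.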


Searching strategy in the trie (continued)

2. If no root word is found on the path from the initial node to the end node, find the final node in the trie that is closest to the end node.

The word represented by the path from the initial node to the picked final node is taken as the lemma.

If more than one final node is found at the closest distance, pick all of them.

Now generate the root word(s) represented by the path(s) from the initial node to the picked final node(s).
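The second case can be sketched as a level-by-level search outward from the end node. The slides do not define "closest", so this sketch assumes it means the smallest number of trie edges, moving to either a child or the parent. All function and class names are hypothetical, and the words are romanized dictionary roots used only for illustration.

```python
class TrieNode:
    def __init__(self, parent=None, char=""):
        self.parent = parent   # needed to walk upwards and to rebuild the word
        self.char = char
        self.children = {}
        self.is_final = False  # True if this node ends a dictionary root word


def build_trie(root_words):
    root = TrieNode()
    for word in root_words:
        node = root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode(parent=node, char=ch)
            node = node.children[ch]
        node.is_final = True
    return root


def word_at(node):
    """Rebuild the word spelled by the path from the initial node to `node`."""
    chars = []
    while node.parent is not None:
        chars.append(node.char)
        node = node.parent
    return "".join(reversed(chars))


def end_node(trie, surface_word):
    """Follow the surface word as far as the trie allows (assumed end node)."""
    node = trie
    for ch in surface_word:
        if ch not in node.children:
            break
        node = node.children[ch]
    return node


def nearest_roots(trie, surface_word):
    """Return all root words whose final nodes are at minimum edge distance."""
    start = end_node(trie, surface_word)
    seen = {id(start)}
    frontier = [start]
    while frontier:
        found = [n for n in frontier if n.is_final]
        if found:
            return sorted(word_at(n) for n in found)  # ties: pick all of them
        next_frontier = []
        for n in frontier:
            neighbours = list(n.children.values())
            if n.parent is not None:
                neighbours.append(n.parent)
            for m in neighbours:
                if id(m) not in seen:
                    seen.add(id(m))
                    next_frontier.append(m)
        frontier = next_frontier
    return []


trie = build_trie(["shuna", "shunani", "shunano"])
print(nearest_roots(trie, "shune"))  # ['shuna']
```

Here the traversal of ‘shune’ stops after ‘shun’, which is not a root word. The final node for ‘shuna’ is one edge away, while those for ‘shunani’ and ‘shunano’ are three edges away.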


Searching strategy in the trie (continued)

Finally, among the generated root word(s), pick the one(s) with the maximum overlapping prefix length with the surface word.

By the ‘overlapping prefix length’ of two words, we mean the length of their longest common prefix.

If more than one root is still selected at this stage, choose any one of them arbitrarily as the lemma.

Such ties are very rare at this stage, and when they do occur, all the tied roots are viable candidates.
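The tie-breaking step can be sketched as follows. The function names are hypothetical, the candidate strings in the usage line are made up purely to show a case where the overlaps differ, and "arbitrarily" is implemented here as taking the first tied candidate in list order.

```python
def overlapping_prefix_length(a, b):
    """Length of the longest common prefix of two words."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def pick_lemma(candidates, surface_word):
    """Keep the candidates with maximum overlap, then choose one of them."""
    if not candidates:
        return None
    best = max(overlapping_prefix_length(c, surface_word) for c in candidates)
    tied = [c for c in candidates
            if overlapping_prefix_length(c, surface_word) == best]
    return tied[0]  # arbitrary choice among the tied roots: first in list order


# Made-up candidates: 'shuk' shares 3 characters with 'shune', 'shunak' shares 4.
print(pick_lemma(["shuk", "shunak"], "shune"))  # shunak
```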


Examples

Consider the dictionary root words ‘শুনা’/‘shuna’, ‘শুনানি’/‘shunani’ and ‘শুনানো’/‘shunano’.

Now take the inflected word ‘শুনে’/‘shune’, which actually comes from ‘শুনা’/‘shuna’.


Results

The results obtained by our lemmatization system on the Bengali data, as evaluated in the Morpheme Extraction Task at FIRE 2014, are given in the following table.

Total Precision    Total Recall    Total F1-measure
56.19%             65.08%          60.31%
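As a quick consistency check, the reported F1-measure agrees with the harmonic mean of the reported precision and recall. The sketch below assumes the task uses this standard definition of F1, and the function name `f1_measure` is hypothetical.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)


f1 = f1_measure(56.19, 65.08)
print(f"{f1:.2f}%")  # 60.31%
```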


Source of Lemmatization Errors

Compound words and out-of-vocabulary words are not handled by our algorithm.

Root words are taken from a dictionary, so poor dictionary coverage will cause errors.

However, as there is no good language-independent lemmatizer for Indian languages, we hope our effort is a positive contribution.


Questions?




Thank You