Marathi POS Tagger

Prof. Pushpak Bhattacharyya
Veena Dixit
Sachin Burange
Sushant Devlekar

IIT Bombay

About Marathi Language

• Marathi is the official language of Maharashtra, a state in western India.

• Marathi is spoken by more than 70 million people.

• It ranks about 15th among the world's languages by number of speakers.

• It belongs to the Indo-Aryan language family, with many influences from Dravidian languages.

Part of Speech Tagger

• The basic aim of part-of-speech tagging is to identify the correct word category (such as noun or verb) of each word in a given sentence.

• Part-of-speech (POS) tagging is a crucial step in any language processing system.

• Parsing, machine translation and information extraction all have to employ POS tagging in their initial stages.

• POS tagging has its own challenges, among them POS ambiguity, unknown words and proper nouns.

Part of Speech Tagger

POS tagging techniques overview:

[Figure: overview of POS tagging techniques]

In reality, this picture is more complicated.

Modules completed in Marathi POS Tagger

• Verb computation (examples: बसतो, खातात, खाशील)

• Conjunct computation (examples: आणि, पण, परंतु)

• Interjection computation (examples: अरे वा, बाप रे, etc.)

• Noun computation, currently in progress (examples: राहुल, बेडकाल, बाईला, खाऊ)

Marathi POS Tagger

Modules:

• Tokenization (common module)
• Stemmer
• Morphological analyser
• Tag generator (common module)

Marathi POS Tagger

Verb Basics

• Prathama taakhyaata (प्रथम ताख्यात)
• Dwitiya taakhyaata (द्वितीय ताख्यात)
• Laakhyaata (लाख्यात)
• Vaakhyaata (वाख्यात)
• ‘I:’ aakhyaata (ई-आख्यात)
• ‘U:’ aakhyaata (ऊ-आख्यात)
• ‘I:laa’khyaata (ईलाख्यात)
• Chaakhyaata/Aayachaakhyaata (चाख्यात)

POS Tagger : Indian Languages

Because Indian languages are morphologically rich, linguistic analysis plays an especially important role in processing them.

For example, रामाला segments as राम + आ + ला. This segmentation cannot be ignored, because each segment carries important information.

Way towards Marathi POS

We started with the verb, as it is the most important word category.

The Aakhyaata (आख्यात) theory has been implemented.

The Aakhyaata (आख्यात [7]) theory refers to the groups of suffixes that carry gender, number, person, tense, aspect and mood information (G_N_P_T_A_M).

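Pratham Takhyata (प्रथम ताख्यात)

पुरुष (Person) | एकवचन पुल्लिंग (Sing. Masc.) | एकवचन स्त्रीलिंग (Sing. Fem.) | एकवचन नपुंसकलिंग (Sing. Neut.) | अनेकवचन (Plural)
प्रथम (First) | त , तो | य, त, ते | (त) | त , तो
द्वितीय (Second) | तोस | येस, तेस, तीस | (तस) | तां, ता
तृतीय (Third) | तो | ये, ते, ती | त, ते तं(तS) | तात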

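Dvitiya Takhyata (द्वितीय ताख्यात)

ए. = एकवचन (singular), अ. = अनेकवचन (plural)

पुरुष (Person) | पुल्लिंग (Masc.) ए. | पुल्लिंग अ. | स्त्रीलिंग (Fem.) ए. | स्त्रीलिंग अ. | नपुंसकलिंग (Neut.) ए. | नपुंसकलिंग अ.
प्रथम (First) | त , तो | त , तो | य, त, ते | त , तो | (त) | (त )
द्वितीय (Second) | तास | तां, तेत, ता, तात | तीस | तां, याता, तात | (तस) | (तां तींत)
तृतीय (Third) | ता | ते | ती | या | त, तं(तS) | तीं, ती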

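Lakhyata (लाख्यात)

ए. = एकवचन (singular), अ. = अनेकवचन (plural)

पुरुष (Person) | पुल्लिंग (Masc.) ए. | पुल्लिंग अ. | स्त्रीलिंग (Fem.) ए. | स्त्रीलिंग अ. | नपुंसकलिंग (Neut.) ए. | नपुंसकलिंग अ.
प्रथम (First) | ल , लो | ल , लो | ये, ल, ले, ली | ल , लो | (ल) | (ल )
द्वितीय (Second) | लास | लां, लांत, लेत, लात | लीस | लां, लांत, या,त, लात | (लस) | (लां, लांत)
तृतीय (Third) | ला | ले | ली | या | ल, ले लं(लS) | लीं, ली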

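Vakhyata/Avakhyata (वाख्यात)

ए. = एकवचन (singular), अ. = अनेकवचन (plural)

पुरुष (Person) | पुल्लिंग (Masc.) ए. | पुल्लिंग अ. | स्त्रीलिंग (Fem.) ए. | स्त्रीलिंग अ. | नपुंसकलिंग (Neut.) ए. | नपुंसकलिंग अ.
प्रथम (First) | आवा | आवे | आवी | आ या | (आव) | (आवीं)
द्वितीय (Second) | आवास | आवे, आवेत, आवं(आवS) | आवीस | आ या, आ यात | (आवस) | (आवीं आवींत)
तृतीय (Third) | आवा | आवे, आवेत | आवी | आ या, आ यात | आव, आवं(आवS) | आवीं, आवींत, आवी, आवीत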

I-akhyata (ई-आख्यात)

पुरुष (Person) | एकवचन (Singular) | अनेकवचन (Plural)
प्रथम (First) | ई, ए | ऊं, ओं, ऊ, ओ
द्वितीय (Second) | स | आं, आ
तृतीय (Third) | ई, ए | त

Ilakhyata (ईलाख्यात)

पुरुष (Person) | एकवचन (Singular) | अनेकवचन (Plural)
प्रथम (First) | ईन, एन | ऊं, ऊ
द्वितीय (Second) | शील | आल
तृतीय (Third) | ईल, एल | तील

* Computation *

Tokenization is the process of separating text into individual tokens, for example “बसतो”, “;” and “(”. Tokenization can be done in various ways:

• the StringTokenizer class
• regular expressions
• the java.text.* package
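As a rough illustration of the regular-expression option, the sketch below splits Marathi text into word and punctuation tokens. The class name `MarathiTokenizer`, the pattern and the sample sentence are assumptions made for this example and are not taken from the tagger itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the original system's code.
final class MarathiTokenizer {
    // A word is a run of letters and combining marks (Devanagari vowel signs
    // are marks, so they must stay attached to their consonant). A run of
    // digits is one token. Any other character that is not ASCII whitespace
    // becomes a one-character token.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{M}]+|\\p{N}+|[^\\s\\p{L}\\p{M}\\p{N}]");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints the six tokens: तो, बसतो, ;, (, राम, )
        System.out.println(tokenize("तो बसतो; (राम)"));
    }
}
```

Including `\p{M}` in the word class matters for Devanagari: without it, a vowel sign such as ो would be split off from its consonant.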

* Computation *

Stemming is an important step in the system; the process is shown here with an example.

Suppose the input word is “बसतो” (“sits”). In “बसतो”, the matched suffix, “-तो” (-to), belongs to the verb category. The stem formed after removing this suffix is “बस” (“sit”). Searching the lexicon shows that “बस” is present.

Applications: multilingual search engines, POS taggers, etc.
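The suffix-stripping step can be sketched as follows, assuming a longest-suffix-first search and a lexicon of stems. The class `SuffixStripper`, its three-entry suffix list and its two-entry lexicon are stand-ins invented for this example; the real system's resources are far larger.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Illustrative only: tiny stand-in suffix list and lexicon.
final class SuffixStripper {
    private final List<String> suffixes;   // kept sorted, longest first
    private final Set<String> lexicon;

    SuffixStripper(List<String> suffixes, Set<String> lexicon) {
        this.suffixes = new ArrayList<>(suffixes);
        this.suffixes.sort((a, b) -> Integer.compare(b.length(), a.length()));
        this.lexicon = lexicon;
    }

    /** Returns {stem, suffix} for the longest suffix whose stem is in the lexicon. */
    Optional<String[]> split(String word) {
        for (String suffix : suffixes) {
            if (word.length() > suffix.length() && word.endsWith(suffix)) {
                String stem = word.substring(0, word.length() - suffix.length());
                if (lexicon.contains(stem)) {
                    return Optional.of(new String[] {stem, suffix});
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        SuffixStripper st = new SuffixStripper(
                List.of("तो", "तात", "शील"), Set.of("बस", "खा"));
        st.split("बसतो").ifPresent(p -> System.out.println(p[0] + " + " + p[1]));
    }
}
```

Trying longer suffixes first keeps a short suffix from matching prematurely when a longer one also fits the end of the word.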

* Computation *

Verb Module

[Figure: block diagram of the verb module. Components: IP, ST, VP, DR, RF1, RF2, RF3, Engine (P1, P2, P3), AT, TO.]

* Computation *

IP: Marathi text in Unicode.
ST: It identifies the longest suffix and the stem for the input word.
VP: It modifies the suffix wherever necessary.
DR: It is a dictionary of all types of root words (it contains more than 2000 verbs).
RF1: It consists of rules relating irregular stems to their root forms.
RF2: It consists of rules relating suffixes to the corresponding features (e.g. तो).
RF3: It consists of rules regarding the most frequent and the most deviated verb forms (e.g. होणे).
AT: It generates tags based on the results returned by the engine.
TO: The tagged verb form is returned as output.

Rule Files

RF1: Rule File 1:

It consists of rules relating irregular stems to their root forms. We analyzed around 25 irregular verbs; in total, 65 rules relate irregular stems to the corresponding root forms.

1. <w>कर_kar<s> के_karNe_to do<r>

Rule Files continued …

RF2: Rule File 2:

It consists of rules relating suffixes to the corresponding features. We have implemented over 1700 rules. The syntax of the rules is as follows.

<r>णे_Ne<c>तो_to<s>present<t>habitual<a>indicative<m>M<g>S<n>1<p>.

The rule states that if the suffix तो_to is separated, then the changing part णे_Ne is added to the regular stem and the resulting root is searched for in the dictionary DR. If the root is found, the respective TAM (tense, aspect, mood) and GNP (gender, number, person) information is extracted.
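To make the rule format concrete, here is a minimal sketch that parses one such rule and applies it to a word. The reading of the one-letter tags (c = changing part, s = suffix, t/a/m = tense, aspect, mood, g/n/p = gender, number, person) is inferred from the single example above, and the class `SuffixRule`, its method names and the one-entry dictionary are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only; field meanings are inferred from the one rule shown on the slide.
final class SuffixRule {
    // Each field is written as value<tag>, e.g. "present<t>".
    private static final Pattern FIELD = Pattern.compile("([^<>]+)<([a-z])>");

    /** Maps each one-letter tag to its value, e.g. "t" -> "present". */
    static Map<String, String> parse(String rule) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(rule);
        while (m.find()) {
            fields.put(m.group(2), m.group(1).trim());
        }
        return fields;
    }

    // Values such as "तो_to" pair Devanagari with a transliteration; keep the first part.
    private static String script(String value) {
        return value.split("_")[0];
    }

    /** Strips the rule's suffix, appends the changing part, and looks the root up. */
    static Optional<String> rootFor(String word, Map<String, String> rule,
                                    Set<String> dictionary) {
        String suffix = script(rule.get("s"));
        if (!word.endsWith(suffix)) {
            return Optional.empty();
        }
        String stem = word.substring(0, word.length() - suffix.length());
        String root = stem + script(rule.get("c"));
        return dictionary.contains(root) ? Optional.of(root) : Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> rule = parse(
                "<r>णे_Ne<c>तो_to<s>present<t>habitual<a>indicative<m>M<g>S<n>1<p>");
        System.out.println(rootFor("बसतो", rule, Set.of("बसणे")) + " " + rule);
    }
}
```

When the lookup succeeds, the remaining entries of the parsed map (t, a, m, g, n, p) are the features that would be attached to the word.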

Rule Files continued ..

RF3: Rule File 3:

It consists of rules regarding the most frequent and the most deviated verb forms, those of the root verbs होणे_hoNe_to become and नहोणे_na hone_not. The syntax of the rules is the same as that of the rules in RF1, which relate the deviated forms of the stems to the root. (The format of RF3 is the same as that of RF2 and RF1.)

Other Modules

Engine: It processes the suffixes according to their categories. We have implemented processes P1, P2 and P3 for the corresponding categories:

• Suffix without space, regular verb (P1), for example तो, तात etc.
• Suffix without space, irregular verb (P2), for example केले etc.
• Suffix with space (P3), for example लो_आहे etc.

Assign Tags (AT): This module generates the tags based on the results returned by the engine. The tags we have defined are listed in Table 6 in Appendix A. Tags are attached to the respective words.

Tagged Output (TO): The verb forms detected in the text are displayed along with their respective tags.

Evaluation

Total number of tagged verb forms: 2176
Total number of correctly detected tagged verb forms: 2166
Undetected verb forms: 97
Total number of verb forms present in the corpus: 2263

The following precision and recall values are with ambiguity.

Precision: 0.9995
Recall: 0.97
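For reference, the sketch below recomputes precision and recall from the counts above using the usual definitions (correct / tagged and correct / present in corpus), which is an assumption about how the figures were obtained. With these plain ratios the counts give roughly 0.9954 and 0.9571, which does not exactly match the reported values; the slide says its values are "with ambiguity" and does not spell out the counting convention. The class name `VerbEvaluation` is hypothetical.

```java
// Recomputes precision and recall from the counts reported on the slide,
// assuming the standard definitions (not confirmed by the source).
final class VerbEvaluation {
    static double precision(int correct, int tagged) {
        return (double) correct / tagged;
    }

    static double recall(int correct, int presentInCorpus) {
        return (double) correct / presentInCorpus;
    }

    public static void main(String[] args) {
        int tagged = 2176;
        int correct = 2166;
        int inCorpus = 2263;
        System.out.printf("precision=%.4f recall=%.4f%n",
                precision(correct, tagged), recall(correct, inCorpus));
    }
}
```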

Conjunct Computation

Modules:

• Tokenization (common module)
• Sorting the conjunct list
• Searching for the word using binary search
• Tag generator (common module)

(The tag is Conj.)
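The sort-then-search steps can be sketched with the standard `java.util.Arrays` methods. The class `ConjunctionTagger` and its three-word list are assumptions for illustration; the actual conjunct list is not given in the slides.

```java
import java.util.Arrays;
import java.util.Optional;

// Hypothetical sketch of the sorted-list lookup; the word list is a small sample.
final class ConjunctionTagger {
    private final String[] sorted;

    ConjunctionTagger(String... conjunctions) {
        sorted = conjunctions.clone();
        Arrays.sort(sorted);   // binary search requires sorted input
    }

    /** Returns the tag "Conj" if the token is in the list, otherwise empty. */
    Optional<String> tag(String token) {
        return Arrays.binarySearch(sorted, token) >= 0
                ? Optional.of("Conj")
                : Optional.empty();
    }

    public static void main(String[] args) {
        ConjunctionTagger tagger = new ConjunctionTagger("आणि", "पण", "परंतु");
        System.out.println(tagger.tag("पण"));
    }
}
```

Sorting once up front makes each later lookup logarithmic in the size of the list. A closed word class such as interjections could be handled the same way with a different list and tag.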

Interjection Computation

Modules:

• Tokenization (common module)
• Sorting the interjection list
• Searching for the word using binary search
• Tag generator (common module)

(The tag is Intej.)

Noun Computation

CM PP Case

0 0 Direct (राम,रान)

0 1 Direct (रामला)

1 0 *Direct/Oblique

1 1 Oblique ( रामाला )

10 Case

घोडा (N_M_S_D) Ex: घोडा पळाला.
घोडे (N_M_P_D) Ex: घोडे पळाले.
घोड्या (N_M_S_Voc) Ex: घोड्या इकडे ये.