Part-of-Speech Annotation Challenges in MarathiTitle: Part-of-Speech Annotation Challenges in...

6
1 P a r t o f S p e e c h A n n ot a t i on C h al l e n ge s i n M a r at h i G aj a n an R an e N i l e s h Jo s h i G e e t an j a l i R an e H an u m a n t R e d k ar M al h ar K u l k ar n i an d P u s h p ak B h at t a c h ar y ya C e n t e r F or I n di a n L a n g ua g e T e c h n ol og y I n d i a n I n s t i t ut e o f T e c hn ol o g y B om b a y M um b a i I ndi a { g kr a n e j o s hi n i l e s h g e e t a n j a l e e r a ne ha n u m a nt r e d k a r m a l ha r ku a n d pus h p a kbh } @ gm a i l c o m A b s t r a c t P a r t o f S p e e c h P O S a nn o t a t i on i s a s i g ni f i c a nt c h a l l e n ge i n na t ur a l l a n gu a ge p r o c e s s i ng T he p a p e r d i s c u s s e s i s s ue s a n d c ha l l e n g e s f a c e d i n t he p r oc e s s of P O S a n no t a t i o n of t h e M a r a t hi da t a f r o m f o u r d om a i ns v i z t our i s m he a l t h e n t e r t a i n m e n t a nd a g r i c ul t u r e D ur i ng P O S a n n ot a t i on a l ot o f i s s u e s w e r e e n c ou nt e r e d S o m e of t h e m a j or on e s a r e d i s c us s e d i n de t a i l i n t hi s pa pe r A l s ot h e t w o a p p r oa c he s v i z t h e l e x i c a l L a pp r oa c h a nd t he f u nc t i o na l F a pp r o a c h of P O S t a g gi n g h a ve b e e n di s c us s e d a n d p r e s e n t e dw i t h e xa m p l e s F ur t he r s om e a m b i g u o us c a s e s i nP O S a nn o t a t i on a r e p r e s e n t e di nt h e p a p e r K e yw or d s M a r a t h i P O S A n n o t a t i on P O S T a gg i ngL e x i c a l F un c t i ona l M a r a t hi P O S T a gs e t I L C I I n t r od u c t i o n I n a n y na t ur a l l a n g ua g e P a r t o f S p e e c h P O S s uc h a s noun pr onoun a dj e c t i v e v e r b a dv e r b de m ons t r a t i v e e t c f o r m s a n i n t e g r a l bui l di n g bl o c k o f a s e nt e nc e s t r uc t ur e P O S t a gg i ng i s one o f t he m a j o r a c t i vi t i e s i n N a t ur a l L a ngu a ge P r o c e s s i n g N L P I n c or p u s l i n g ui s t i c s P O S t a gg i n g i s t he p r o c e s s o f m a r ki n g a n no t a t i ng a w o r d i n a t e xt c or p u s w h i c h c o r r e s p o n ds t o a pa r t i c ul a r P O S T he a n n o t a t i o n i s done ba s e d o n i t s d e f i n i t i o n a nd i t s c o n t e x t i e i t s r e l a t i on s h i p w i t h a d j a c e n t a nd r e l a t e d w o r ds i n a p hr a s e s e nt e nc e o r p a r a g r a p h A s i m pl i f i e d f o r m o f t hi s i s c om m onl yt a ught t o s c h oo l a g e c hi l d r e n i n t he i de nt i f i c a t i o n o f w or d s a s n o u ns v e r bs a dj e c t i v e s a d v e r b s e t c T h e t e r m P a r t o f S pe e c h T a g g i ngi s a l s o k n o w n a s P O S t a gg i n g P O S T P O S a nn ot a t i on g r a m m a t i c a l t a ggi n g o r w o r d c a t e gor y d i s a m bi g ua t i on I n t hi s p a pe r t he c ha l l e nge s a n d i s s u e s i n P O S t a g gi ng w i t h s pe c i a l r e f e r e nc e t o M a r a t h i ha v e b e e n pr e s e nt e d T h e M a r a t h i l a n g ua g e i s o ne o f t he m a j o r l a ngu a ge s o f I n d i a I t b e l o n g s t o t he I nd o A r y a n L a n g u a g e f a m i l y w i t h a bou t u s e r s I t i s p r e d o m i n a nt l ys poke n i n t he s t a t e o f M a h a r a s h t r a i n W e s t e r n I nd i a C ha udha r i e t a l I n r e c e nt y e a r s m a n y r e s e a r c h i ns t i t u t i o n s a n d o r g a ni z a t i on s a r e i nv ol ve d i n de ve l opi n g t h e l e xi c a l r e s o ur c e s f o r M a r a t hi f or N L P a c t i v i t i e s M a r a t hi W or d n e t i s on e s u c h l e x i c a l r e s our c e de v e l o pe d a t I I T B om ba y P o pa l e a ndB ha t t a c h a r yy a T h e pa p e r i s o r ga ni z e d a s f ol l o w s S e c t i o n i n t r oduc e s P O S a nnot a t i o n s e c t i on p r o vi d e s i nf o r m a t i o n on M a r a t hi a n not a t e d c o r por a s e c t i on de s c r i b e s M a r a t h i t a g s e t s e c t i o n e xpl a i n s t a g gi ng a ppr o a c h e s s e c t i on p r e s e nt s a m bi guous b e ha v i o r s o f t he M a r a t hi w o r ds s e c t i o n p r e s e n t s a d i s c u s s i on o n s pe c i a l c a s e s a n d s e c t i o n c on c l ud e s t h e pa pe r w i t hf u t u r e w o r k ht t p l a ng u a g e w o r l dof c o m pu t i n g ne t p o s t a gg i ng p a r t s o f s p e e c h t a gg i n ght m l ht t p w w w i ndi a n m i r r or c om l a n gu a g e s m a r a t hil a n gu a g e ht m l ht t p w w w c e ns u s i n di a go v i n P ar t s O f S p e e c h A n n ot at i on I n N L P pi pe l i n e P O S t a g gi ng i s a n i m p o r t a nt a c t i v i t y w h i c h f o r m s t he ba s e o f va r i o u s l a ngua g e pr o c e s s i n g a p pl i c a t i o ns A n not a t i n g a t e xt w i t h P O S t a g s i s a s t a nda r d l o w l e v e l t e x t pr e pr oc e s s i n g s t e p be f o r e m o vi ng t o h i g he r l e v e l s i n t he pi pe l i ne l i k e c h u nk i n g d e pe nd e n c y p a r s i n g e t c B h a t t a c ha r y y a I de n t i f i c a t i o n o f t he p a r t s of s pe e c h s u c h a s n o u n s v e r bs a d j e c t i ve s a d v e r bs f o r e a c h w o r d t o k e n o f t he s e n t e nc e h e l ps i n a na l y z i ng t he r ol e o f e a c h w o r d i n a s e nt e nc e J u r a f s ky D e t a l I t r e pr e s e nt s a t oke n l e v e l a nn o t a t i on w h e r e i n i t a s s i gns a t o k e n w i t h P O S c a t e gor y M ar at h i A n n ot at e d C or p or a A i m o f P O S t a gg i ng i s t o c r e a t e a l a r ge a nnot a t e d c or p or a f o r na t u r a l l a n g ua g e pr oc e s s i n g s pe e c h r e c o gni t i on a n d ot h e r r e l a t e d a p pl i c a t i o ns A nn o t a t e d c or por a s e r v e a s a n i m p o r t a nt r e s ou r c e i n N L P a c t i v i t i e s I t p r ove s t o be a ba s i c bu i l di ng b l oc k f o r c ons t r uc t i n g s t a t i s t i c a l m od e l s f o r t he a u t o m a t i c pr o c e s s i n g o f na t ur a l l a n g u a g e s T he s i gni f i c a nc e o f l a r g e a nn o t a t e d c o r po r a i s w i d e l y a p pr e c i a t e d b y r e s e a r c he r s a n d a p pl i c a t i on de v e l op e r s V a r i ous r e s e a r c h i ns t i t u t e s i n I n d i a vi z I I T B om ba y I I I T H y d e r a ba d J N U N e w D e l hi a nd o t he r i n s t i t ut e s h a v e de v e l op e d a l a r g e c o r pu s of P O S t a gg e d da t a I n M a r a t hi t h e r e i s a r o u n d k a n n o t a t e d da t a de v e l o pe d a s a pa r t o f I n di a n L a n g u a g e s C o r p o r a I ni t i a t i v e I L C I p r o j e c t f u nde d b y M e i t Y N e w D e l hi T h i s I L C I c o r p u s c o ns i s t s o f f o ur d o m a i ns vi z T o ur i s m H e a l t h A g r i c u l t u r e a n d E nt e r t a i n m e n t T h i s t a g g e d da t a T ou r i s m K H e a l t h K A g r i c u l t ur e K E n t e r t a i n m e nt K G e n e r a l K i s us e df o r v a r i ou s a p p l i c a t i on s l i k e c hu n k i ng de p e n d e nc y t r e e ba n k i n g w o r d s e ns e d i s a m b i g ua t i on e t c T hi s I L C I ht t p w w w c f i l t i i t ba c i n ht t p s l t r c i i i t a c i n ht t p s w w w j nu a c i n h t t p s a n s k r i t j n u a c i n i l c i i n d e x j s p h t t p s m e i t y g ov i n

Transcript of Part-of-Speech Annotation Challenges in MarathiTitle: Part-of-Speech Annotation Challenges in...

  • Proceedings of the WILDRE5– 5th Workshop on Indian Language Data: Resources and Evaluation, pages 1–6Language Resources and Evaluation Conference (LREC 2020), Marseille, 11–16 May 2020

    c© European Language Resources Association (ELRA), licensed under CC-BY-NC

    1

    Part-of-Speech Annotation Challenges in Marathi

    Gajanan Rane, Nilesh Joshi, Geetanjali Rane, Hanumant Redkar, Malhar Kulkarni and Pushpak Bhattacharyya

    Center For Indian Language Technology Indian Institute of Technology Bombay, Mumbai, India

    {gkrane45, joshinilesh60, geetanjaleerane, hanumantredkar, malharku and pushpakbh}@gmail.com

    Abstract Part of Speech (POS) annotation is a significant challenge in natural language processing. The paper discusses issues and challenges faced in the process of POS annotation of the Marathi data from four domains viz., tourism, health, entertainment and agriculture. During POS annotation, a lot of issues were encountered. Some of the major ones are discussed in detail in this paper. Also, the two approaches viz., the lexical (L approach) and the functional (F approach) of POS tagging have been discussed and presented with examples. Further, some ambiguous cases in POS annotation are presented in the paper.

    Keywords: Marathi, POS Annotation, POS Tagging, Lexical, Functional, Marathi POS Tagset, ILCI

    1 Introduction In any natural language, Part of Speech (POS) such as noun, pronoun, adjective, verb, adverb, demonstrative, etc., forms an integral building block of a sentence struc-ture. POS tagging1 is one of the major activities in Natural Language Processing (NLP). In corpus linguistics, POS tagging is the process of marking/annotating a word in a text/corpus which corresponds to a particular POS. The annotation is done based on its definition and its context i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identifi-cation of words as nouns, verbs, adjectives, adverbs, etc. The term ‘Part-of-Speech Tagging’ is also known as POS tagging, POST, POS annotation, grammatical tagging or word-category disambiguation.

    In this paper, the challenges and issues in POS tagging with special reference to Marathi2 have been presented. The Marathi language is one of the major languages of India. It belongs to the Indo-Aryan Language family with about 71,936,894 users3. It is predominantly spoken in the state of Maharashtra in Western India (Chaudhari et. al., 2017). In recent years many research institutions and or-ganizations are involved in developing the lexical re-sources for Marathi for NLP activities. Marathi Wordnet is one such lexical resource developed at IIT Bombay (Popale and Bhattacharyya, 2017).

    The paper is organized as follows: Section 2 introduces POS annotation; section 3 provides information on Mara-thi annotated corpora; section 4 describes Marathi tag set; section 5 explains tagging approaches, section 6 presents ambiguous behaviors of the Marathi words, section 7 presents a discussion on special cases, and section 8 con-cludes the paper with future work.

    1http://language.worldofcomputing.net/pos-tagging/parts-of-speech-tagging.html 2http://www.indianmirror.com/languages/marathi-language.html 3http://www.censusindia.gov.in/

    2 Parts-Of-Speech Annotation In NLP pipeline POS tagging is an important activity which forms the base of various language processing ap-plications. Annotating a text with POS tags is a standard low-level text pre-processing step before moving to higher levels in the pipeline like chunking, dependency parsing, etc. (Bhattacharyya, 2015). Identification of the parts of speech such as nouns, verbs, adjectives, adverbs for each word (token) of the sentence helps in analyzing the role of each word in a sentence (Jurafsky D. et. al., 2016). It represents a token level annotation wherein it assigns a token with POS category.

    3 Marathi Annotated Corpora Aim of POS tagging is to create a large annotated corpora for natural language processing, speech recognition and other related applications. Annotated corpora serve as an important resource in NLP activities. It proves to be a ba-sic building block for constructing statistical models for the automatic processing of natural languages. The signi-ficance of large annotated corpora is widely appreciated by researchers and application developers. Various re-search institutes in India viz., IIT Bombay4, IIIT Hydera-bad5, JNU New Delhi6, and other institutes have devel-oped a large corpus of POS tagged data. In Marathi, there is around 100k annotated data developed as a part of In-dian Languages Corpora Initiative (ILCI)7 project funded by MeitY8, New Delhi. This ILCI corpus consists of four domains viz., Tourism, Health, Agriculture, and Enter-tainment. This tagged data (Tourism - 25K, Health - 25K, Agriculture - 10K, Entertainment - 10K, General – 30K) is used for various applications like chunking, dependency tree banking, word sense disambiguation, etc. This ILCI

    4http://www.cfilt.iitb.ac.in/ 5https://ltrc.iiit.ac.in/ 6https://www.jnu.ac.in/ 7http://sanskrit.jnu.ac.in/ilci/index.jsp 8https://meity.gov.in/

  • 2

    annotated data forms a baseline for Marathi POS tagging and is available for download at TDIL portal9.

    4 The Marathi POS Tag-Set The Bureau of Indian Standards (BIS)10 has come up with a standard set of tags for annotating data for Indian lan-guages. This tag-set is prepared for Hindi under the guid-ance of BIS. The BIS tag-set aims to ensure standardiza-tion in the POS tagging across the Indian languages. The tag sets of all Indian languages have been drafted by Dept. of Information Technology, MeitY and presented as Uni-fied POS standard in Indian languages11. Marathi POS tag-set has been prepared at IIT Bombay referring to the standard BIS POS Tag-set, IIIT Hyderabad guideline doc-ument (Bharati et al, 2006) and Konkani Tag-set (Vaz et. al., 2012). This Marathi POS Tag-set can be seen in Ap-pendix A.

    5 Lexical and Functional POS Tagging: Challenges and Discussions

    Lexical POS tagging (Lexical or L approach) deals with tagging of a word at a token level. Functional POS tag-ging (Functional or F approach) deals with tagging of a word as a syntactic function of a word in a sentence. In other words, a word can have two roles viz., grammatical role (lexical POS w.r.t. a dictionary entry) and functional role (contextual POS)12. For example, in the phrase ‘golf stick’, the POS tag of the word ‘golf’ could be determined as follows:

    Lexically it is a noun as per lexicon. Functionally it is an adjective as it is a modifier of

    succeeding noun. In the initial stage of ILCI data annotation, POS tagging was conducted using the lexical approach. However, over a while, POS tagging was done using the functional ap-proach only. The reason is that, by using the lexical ap-proach we do a general tagging, i.e., tagging at a surface level or token level and by using the functional approach we do a specific tagging, i.e., tagging at a semantic level. This eases the annotation process of chunking and parsing in the NLP pipeline.

    While performing POS annotation, many issues and chal-lenges were encountered, some of which are discussed be-low. Table 1 lists the occurrences of discussed words in the ILCI corpus.

    5.1 Subordinators which act as Adverbs There are three basic types of adverbs. They are time (N_NST), place (N_NST) and manner (RB). Traditional-ly, adverbs should be tagged as RB. Subordinators are conjunctions which are tagged as CCS.

    However, there are some subordinators in Marathi which act as adverbs. For example, ा माणे (jyApramANe, like- 9 https://www.tdil-dc.in/ 10http://www.bis.gov.in/ 11 http://tdil-dc.in/tdildcMain/articles/134692Draft POS Tag stan-dard.pdf 12https://www.cicling.org/2018/cicling18-hanoi-special-event-23mar18.pdf

    wise), ा माणे (tyApramANe, like that), ा माणे (hyA-pramANe, like this), जे ा (jevhA, when) and ते ा (tevhA, then). ा माणे (jyApramANe) and ा माण े (tyApramANe) are generated from pronominal stems viz., ा (jyA) and

    ा (hyA) hence they are lexically qualified as pronouns, however, they function as adverbs; hence to be functional-ly tagged as RB at the individual level. However, when these words appear as part of the clause then they should be functionally tagged as CCS.

    This distinction was also observed by noted Marathi grammarian, Damle13 (Damle, 1965) [p. 206-07].

    5.2 Words with Suffixes There are suffixes like मुळे (muLe, because of; due to), साठी (sAThI, for), बरोबर, (barobara, along with), etc. When these suffixes are attached to pronouns they func-tion as adverbs or conjunctions at a syntactic level. For example, words ामुळे (tyAmuLe, because of that), यामुळे (yAmuLe, because of this), यासाठी (yAsAThI, for this),

    ामुळे (hyAmuLe, because of this), ा ामुळे (hyAchyA-muLe, because of it/him), यां ामुळे (yAMchyAmuLe, be-cause of them), ाचबरोबर (=तसेच) (tyAchabarobara (=tasecha), further) are formed by attaching the above suffixes to pronouns. These words which are formed are lexically tagged as PRP. However, functionally these words act as conjunctions at the sentence level; therefore, they should be tagged as CCD. Also, consider the words

    ावेळी (tyAveLI, at that time), ावेळी (hyAveLI, at this time), ानंतर (hyAnaMtara, after this), ानंतर (tyAnaMta-ra, after that). Here, wherever the first string/morpheme appears as ा (tyA) and ा (hyA), the tag should be given as PRP, lexically. But functionally, all these words shall be tagged as N_NST (time adverb).

    5.3 Words which are Adjectives Adjectives are tagged as JJ. Consider the example below:

    ा ाम े ही कला परंपरागत चालत आली आहे (tyAchyAmadhye hI kalA paraMparAgata chAlataAlIAhe, this art has come to him by tradition). Lexically, the word परंपरागत (pa-raMparAgata, traditional) is an adjective, but, in the above sentence, it qualifies the verb चालत येणे (chAla-tayeNe, to be practiced). Hence functionally, the word परंपरागत (paraMparAgata) should be tagged as an RB. Similarly, a word वाईट (vAITa, bad) has a lexical POS as an adjective (Date-Karve, 1932). But in the sentence मला वाईट वाटते (malA vAITa vATate, I am feeling bad), it func-tions as an adverb, as it is qualifying the verb and not pre-ceding the pronoun मला (malA, I; me). Therefore, func-tionally word वाईट (vAITa) acts as adverb hence should be tagged as RB.

    5.4 Adnominal Suffixes Attached to Verbs The adnominal suffix जोगं (jogaM) and all its forms (जोगा, जोगी, जोगे, जो ा; jogA, jogI, joge, jogyA) are always at-tached to verbs. For example, word कर ाजो ा (karaNyA-jogyA, doable) is lexically tagged as a verb. However, word कर ा (karaNyA) is a Kridanta form of a verb करणे (karaNe, to do) and suffix जोगं (jogaM) is an adnominal suffix attached to Kridanta form; hence, a verb with all the 13http://www.cfilt.iitb.ac.in/damale/index.html

  • 3

    forms of जोगं (jogaM) should functionally be treated as ad-jectives. Therefore verbs with adnominal suffix should be tagged as JJ.

    5.5 Words जसे (jase) तसे (tase) As per Damle (1956), words जसे (jase, like this) and तसे (tase, like that) are tagged as adverbs. However, if they appear with nouns in a sentence, they are influenced by the inflection and gender property of that nominal stem. For example, words जसे (jase, like this) and तसे (tase, like that) have inflected forms like जसा (jasA, like him), जशी (jashI, like her), तसा (tasA, like this), तशी (tashI, like this), तसे (tase, like this), etc. All these words function as a rela-tive pronoun in a sentence. Hence, the words and their variations should be functionally tagged as PRL.

    5.6 Word तसेतर (tasetara) A word तसेतर (tasetara, as it is seen) is the same as तसे पािहले तर (tase pAhile tara, as it is seen). Lexically, it can be tagged as a particle (RPD) but since it has a function of conjunction; it should be tagged as CCD. For example, in a sentence तसेतर तणावामुळेही काळी वलय येतात (tasetara ta-NAvAmuLehI kALI valaya yetAta, as it is seen that black circles appear because of stress as well), word तसेतर (tase-tara) functions as conjunction and hence should be tagged as CCD instead of tagging it as RPD.

    5.7 Word अ था (anyathA) The standard dictionaries give POS of the word अ था (anyathA, otherwise; else; or) as an adverb/indeclinable. For example, consider a sentence अ था तो येणार नाही (any-athA to yeNAra nAhI, Otherwise, he will not come). Here, while annotating अ था (anyathA) there is a possibility that annotator can directly tag this word as an adverb at a lexical level. However, it behaves like conjunction at the sentence level and hence it should be tagged as CCD.

    5.8 Different Forms of कसा (kasA) As per BIS Tag-set, words कसा, कशी, कसे (kasA, kashI, kase; how) shall be tagged as PRQ. However, the PRQ tag is only for pronoun category and the word कसा (kasA) is not a pronoun; it can behave as an adverb or as a modifier. Consider the examples below:

    1. तो माणूस कसा आहे हे ा ाशी बोल ावरच कळेल (to mANUsa kasA Ahe he tyAchyAshI bolalyA-varacha kaLela, we will come to know about him only after talking to him) [adnominal]

    2. सरकारी ठरावाने काय ाचे कलम कसे र होणार (sarakArI TharAvAne kAyadyAche kalama kase radda hoNAra, How can this clause of law be prohibited by Government Resolution?) [adver-bial]

    In the 1st case, word कसा (kasA, how) functionally acts as a pronoun, hence to be tagged as PRQ. While, in the 2nd case, it acts as an adverb, hence to be functionally tagged as RB.

    5.9 Word मा (mAtra) A word मा (mAtra) is very ambiguous in its various usages; it is difficult to functionally identify the POS of this word at a sentence level. Various meanings of word मा (mAtra) are given in Data-Karve dictionary14. Some of the different senses of मा (mAtra) are discussed here:

    When the word मा (mAtra) conveys the mean-ing of ही, देखील, सु ा (hI, dekhIla, suddhA; also) then it should be tagged as RB functionally.

    When a word is related to the preceding word तेथे (tethe, there) and its function is an emphatic marker च (cha) then it should be tagged as RPD functionally.

    When word मा (mAtra) appears in the form of conjunction then it should be marked as CC functionally.

    If the word is modifying the succeeding noun, then it should be tagged as JJ functionally.

    If the word is modifying the preceding word, then the tag will be RPD as a particle functional-ly.

    Therefore, it is noticed that the word मा (mAtra) does not have one single POS tag functionally and it depends upon the appearance in a sentence. Hence, should be tagged as per the usage.

    Token Lexical Functional Occur-rences

    ा माणे Pronoun Adverb 94 ा माणे Pronoun Adverb 180 ा माणे Pronoun Adverb 8 जे ा Subordinator Time adverb 1496 ते ा Subordinator Time adverb 1577

    मा Adverb Post-position Conjunction

    Particle 426

    तसेतर Particle Conjunction 8 कसा Wh-word Adverb 269 ामुळे Pronoun Conjunction 530 ामुळे Pronoun Conjunction 424

    ा ामुळे Pronoun Conjunction 8 ावेळी Pronoun Time adverb 244 ावेळी Pronoun Time adverb 33 ानंतर Pronoun Time adverb 71 ानंतर Pronoun Time adverb 298

    परंपरागत Adjective Adverb 104 वाईट Adjective Adverb 246

    अ था Adverb Conjunction 24 जसे Relative pronoun Adverb 1007 तसे Relative pronoun Adverb 511

    कर ाजो ा Verb Adjective 97

    Table 1: Occurrences of discussed words and lexical v/s functional tags assigned to these words

    6 POS Ambiguity: Challenges and Discus-sions

    Ambiguity is a major open problem in NLP. Several POS level ambiguity issues were faced by annotators while an-notating the Marathi corpus. Following are some POS 14http://www.transliteral.org/dictionary/mr.kosh.maharasht

    ra/source

  • 4

    specific ambiguity problems encountered while annotat-ing.

    6.1 Ambiguous POS: Adjective or Noun? Examples: वय र (vayaskara, the aged)

    कुटंुबा ा वय र सद ांनी मतदान केले (kuTuM-bAchyA vayaskara sadasyAMnI matadAnakele, all the aged members of the family voted).

    सव वय रांनी मतदान केले (sarva vayaskarAMnI matadAna kele, all the aged people voted).

    In the above examples, the word वय र (vayaskarAMnI) lexically acts as an adjective as well as a noun. However, at the syntactic level, in the first example, it is functioning as adjective hence to be tagged as JJ, while in the second example it is functioning as a noun hence to be tagged as N_NN. This is one of the challenges while annotating ad-jectives appearing in nominal form. Annotators usually fail to disambiguate these types of words at the lexical level; therefore such words should be disambiguated at syntactic level. Hence, annotators need to take special care while annotating such cases.

    6.2 Ambiguous POS: Demonstrators While annotating demonstrators such as हा, ही, हे, तो, ती, ते ((hA, hI, he), this), (to, tI, te), that) annotators often get confused whether to tag them as DMD or DMR. Simple guideline can be followed is, if the demonstrator is direct-ly following noun, then tag it as DMD, otherwise tag it as DMR i.e., if the demonstrator is referring to previous noun/person.

    6.3 Ambiguous POS: Noun and Conjunction Example: word कारण (kAraNa, reason; because). At se-mantic level, the word कारण (kAraNa) has two meanings, one is ‘a reason’ which acts as a noun and another is ‘be-cause’ which acts as a conjunction. Annotators have to pay special attention while tagging such cases.

    6.4 Ambiguous Words: ते (te) and तेही (tehI) The word ते (te) has different grammatical categories like pronoun (they), demonstrator (that) and conjunction (to). Examples:

    ३० ते ४० (30 te 40, 30 to 40) The word ते (te) lexically and functionally acts as

    conjunction, hence to be tagged as CCD. ते णाले (te mhaNAle, they said) Here word ते (te) acts as personal pronoun, hence

    to be tagged as PR_PRP ते कुठे आहेत? (te kuThe Aheta?, where are they?) Here word ते (te) acts as relative demonstrator,

    hence to be tagged as DM_DMR राकेशने पोलीसांना फोन केला आिण ते दो ी चोर पकडले

    गेले (rAkeshane polIsAMnA phona kelA ANi te donhI chora pakaDale gele, Rakesh called police and those two thieves got caught).

    Here, word ते (te) is modifying its succeeding noun चोर (chora, thief) so it is Deictic demonstra-tor, hence to be tagged as DM_DMD.

    ांना हे कधीच पसंत न ते, ां ा मुलाने संगीत िशकावे आिण तेही नृ (tyAMnA he kadhIchapasaMtanav-hate, tyAMchyAmulAnesaMgItashikAveANite-hInRRitya, He never wanted his son to learn mu-sic and that too the dance form)

    Here, the word तेही (tehI) is an ambiguous word. It is mod-ifying succeeding noun or previous context. Here, ही (hI) is a bound morpheme and conveys the meaning ‘also’. Therefore word तेही (tehI) should be tagged as DM_DMR.

    6.5 Ambiguous word: उलटा (ulaTA) Examples:

    उलटे टांगून सुकवले जाते (ulaTe TAMgUna sukavale jAte). Here, उलटे (ulaTe, upside down is behav-ing as manner, not a noun, hence to be tagged as RB.

    उलटे भांडे सुलटे कर (ulaTe bhAMDe sulaTe kara). Here उलटे (ulaTe) it is modifying succeeding noun, hence it is an adjective, hence to be tagged as JJ.

    In the above examples, annotator should identify word behavior in the sentence and tag accordingly.

    6.6 Ambiguous words: िकतीही (kitIhI), ना का (nA kA) and असू दे ना का (asU de nA kA)

    Examples:

    संगणक हा िकतीही गत िकंवा चतुर असू दे ना का, तो केवळ तेच काम क शकतो ाची िवधी (प त) आप ाला त: मािहत आहे. (saMgaNaka hA kitIhI pragata kiMvA chatura asU de nA kA, to kevaLa techa kAma karU shakato jyAchI vidhI (paddha-ta) ApalyAlA svata: mAhita Ahe¸ The computer how much ever may be advanced and clever, it only does that work whose method we only know). Here, िकतीही (kitIhI, how much) is a quantifier, hence to be tagged as QTF.

    In the phrase असू दे ना का (asU de nA kA), the to-ken ना (nA) is a part of verb असू दे (asU de, let it be) and should be tagged as VM, hence the phrase should be tagged as VM, while the token का (kA) is acting as a particle in this phrase and not as a question marker, therefore का (kA) should be tagged as RPD.

    िकती माणसे जेवायला होती? (kitI mANase jevAyalA hotI, how many people were there for a meal?). Here, िकती (kitI, how many) is a question so it should be tagged as DMQ.

    6.7 Ambiguous word: तर (tara) Examples:

    Conjunction: जर मी वेळीच गेलो नसतो तर हा वाचला नसता (jara mI veLIcha gelo nasato tara hA vA-chalA nasatA, if I had not gone on time he would have not survived).

    Particle: 'हो! आता मी जातो तर!' = 'मी अिजबात जाणार नाही’('ho! AtA mI jAto tara!' = 'mI ajibAta jANA-

  • 5

    ra nAhI’, ‘yes! now I am leaving then’ = ‘I am not at all leaving’).

    In the above sentences, the word तर (tara) is used as a supplementary or stressable word so somewhat spe-cial as to give meaning in the sentence. (Date-Karve, 1932). Hence it should be treated as CCD.

    तु ी तर लाख पये मागतां व मी तर केवळ गरीब पडलो (tumhI tara lAkha rupaye mAgatAM va mI tara kevaLa garIba paDalo, you are asking for lakh rupees and I am a poor person). In this sentence, the word तर (tara) indicates opposition with re-spect to meaning between two connected sen-tences. (Date-Karve, 1932). Hence, it should be treated as a RPD.

    7 Discussions on Some Special Cases Words आ यु (Amlayukta), मलईरिहत (malaI-

    rahita), मेदरिहत (medarahita), दु ाळ (duSh-kALagrasta) are combinations of noun plus ad-jective suffix such as यु (yukta), (grasta) and रिहत (rahita). In such cases, even though noun is a head string and adjective part is a suf-fix, the whole word shall be tagged as JJ.

    Before tagging अभंग (abhaMga, verses), ओ ा (ovyA, stanzas), का (kAvya, poetry), etc., anno-tator shall first read between the lines; under-stand the meaning which it conveys and then de-cide upon the grammatical categories of each to-ken. For example, in sentence कळावे तयासी कळे अंतरीचे कारण ते साचे साच अंगी (kaLAve tayAsI kale aMtarIche kAraNa te sAche sAcha aMgI) the POS tagging should be done as साचे\V_VM साच\N_NN अंगी\N_NN, etc.

    Doubtful cases of word कोणता (koNatA)

    Examples:

    o कोणता मुलगा शार आहे (koNatA mulagA hu-shAra Ahe)?

    o वाहतुकी ा दर ान कोणतीही हानी झालेली नाही (vAhatukIchyA daramyAna koNatIhI hAnI jhAlelI nAhI).

    o ां ा बोल ाचा मा ावर कोणताही प रणाम झाला नाही (hyAMchyA bolaNyAchA mAjhyAvara koNatAhI pariNAma jhAlA nAhI).

    o शेतक यास कोण ाही वष पा ाची कमतरता भासणार नाही (shetakaryAsa koNatyAhI var-ShI pANyAchI kamataratA bhAsaNAra nA-hI).

    Here, in the 1st example, the word कोणता (koNa-tA, which one) undoubtedly is DMQ. In rest of the examples कोणतीही (koNatAhI, whichever, whomever), कोणताही (koNatAhI, whichever,

    whomever), कोण ाही (koNatyAhI, whichever, whomever) are DM adjective (DMD).

    8 Conclusion and Future Work Marathi POS tagging is an important activity for NLP tasks. While tagging, several challenges and issues were encoun-tered. In this paper, Marathi BIS tag-set has been discussed. Lexical and functional tagging approaches were discussed with examples. Further, various challenges, experiences, and special cases have been presented. The issues discussed here will be helpful for annotators, researchers, language learners, etc. of Marathi and other languages.

    In future, more issues such as tagging for words having multiple senses; words having multiple functional tags will be discussed. Also, tagset comparison of close languages will be done. Further, the evaluation of lexical and func-tional tagging using statistical analysis will be done.

    References Akshar Bharati, Dipti Misra Sharma, Lakshmi Bai, Rajeev

    Sangal. (2006). AnnCorra : Annotating Corpora Guide-lines For POS And Chunk Annotation For Indian Lan-guages. Language Technologies Research Centre, IIIT, Hyderabad.

    Chitra V. Chaudhari, Ashwini V. Khaire, Rashmi R. Mur-tadak, Komal S. Sirsulla. (2017). Sentiment Analysis in Marathi using Marathi WordNet. Imperial Journal of Interdisciplinary Research (IJIR) Vol-3, Issue-4, 2017 ISSN: 2454-1362.

    Damle, Moro Keshav. (1965). Shastriya Marathi Vyakran. A scientific grammar of Marathi, 3rd edition. Pune, In-dia: RD Yande.

    Daniel Jurafsky & James H. Martin. (2016). Speech and Language Processing.

    Edna Vaz, Shantaram V. Walawalikar, Dr. Jyoti Pawar, Dr. Madhavi Sardesai. (2012). BIS Annotation Standards With Reference to Konkani Language. 24th Interna-tional Conference on Computational Linguistics (COL-ING 2012), Mumbai.

    Lata Popale and Pushpak Bhattacharyya. (2017). Creating Marathi WordNet. The WordNet in Indian Languages. Springer, Singapore, 2017. 147-166.

    Pushpak Bhattacharyya, (2015). Machine Translation, Book published by CRC Press, Taylor and Francis Group, USA.

    Yashwant Ramkrishna Date, Chintman Ganesh Karve, Aba Chandorkar, Chintaman Shankar Datar. (1932). Maharashtra Shabdakosh. Published by H. A. Bhave, Varada Books, Senapati Bapat Marg, Pune.

  • 6

    Appendix A

    Marathi Parts of Speech Tag-Set with Examples

    SI. No

    Category Label Annotation Convention

    ** Examples

    Top level & Subtype 1 Noun (नाम) N N

    1.1 Common (जातीवाचक नाम) NN N_NN गाय\N_NN गो ात\N_NN राहते.

    1.2 Proper ( ीवाचक नाम) NNP N_NNP रामाने\N_NNP रावणाला\N_NNP मारले.

    1.3 Nloc ( थल-काल) NST N_NST 1. तो येथे\N_NST काम करत होता. 2. ाने ही व ू खाली\N_NST ठेवली आहे.

    2 Pronoun (सवनाम) PR PR 2.1 Personal (पु ष वाचक) PRP PR_PRP मी\PR_PRP येतो. 2.2 Reflexive (आ वाचक) PRF PR_PRF मी तः\PR_PRF आलो.

    2.3 Relative (संबंधी) PRL PR_PRL ाने\PR_PRL हे सांिगतले ाने हे काम केले पािहजे.

    2.4 Reciprocal (पार रक) PRC PR_PRC पर र 2.5 Wh-word ( ाथक) PRQ PR_PRQ कोण\PR_PRQ येत आहे?

    2.6 Indefinite (अिनि त) PRI PR_PRI कोणी\PR_PRI कोणास\PR_PRI हासू नये. ा पेटीत काय\PR_PRI आहे ते सांगा.

    3 Demonstrative (दशक) DM DM हे पु क माझे आहे.

    3.1 Deictic DMD DM_DMD

    तो\DM_DMD मुलगा शार आहे. हा\DM_DMD मुलगा शार आहे. ही\DM_DMD मुलगी संुदरआहे. जेथे\DM_DMD राम होता तेथे\DM_DMD तो होता.

    3.2 Relative DMR DM_DMR हे\DM_DMR लाल रंगाचे असते. 3.3 Wh-word DMQ DM_DMQ कोणता\DM_DMQ मुलगा शार आहे? 4 Verb (ि यापद) V V

    4.1 Main (मु ि यापद) VM V_VM तो घरी गेला\V_VM.

    4.2 Auxiliary (सहा क ि यापद VAUX

    V_VAUX राम घरी जात आहे\V_VAUX.

    5 Adjective (िवशेषण) JJ

    संुदर\JJ मुलगी

    6 Adverb (ि यािवशेषण) RB हळूहळू\RB चाल.

    7 Conjunction (उभया यी अ य) CC CC 7.1 Coordinator CCD CC_CCD तो आिण\ CC_CCD मी.

    7.2 Subordinator CCS CC_CCS जर\CC_CCS ाने सांिगतले असते तर\CC_CCS हे काम मी केले असते.

    7.2.1 Quotative UT CC_CCS_U

    T असे\CC_CCS_UT णून\C_CCS_UT तो पुढे गेला.

    8 Particles RP RP 8.1 Default RPD RP_RPD मी तर\RP_RPD खूप दमले. 8.2 Interjection (उ ार वाचक) INJ RP_INJ अरेरे\RP_INJ ! सिचनची िवकेट ढापली.

    8.3 Intensifier (ती वाचक) INTF RP_INTF राम खूप\RP_INTF चांगला मुलगा आहे. 8.4 Negation (नकारा क) NEG RP_NEG नको, न 9 Quantifiers QT QT

    9.1 General QTF QT_QTF थोडी\QT_QTF साखर ा. 9.2 Cardinals QTC QT_QTC मला एक\QT_QTC गोळी दे. 9.3 Ordinals QTO QT_QTO माझा पिहला\QT_QTO मांक आला. 10 Residuals (उव रत) RD RD

    10.1 Foreign word RDF RD_RDF

    10.2 Symbol SYM RD_SYE $, &, *, (, ),

    10.3 Punctuation PUN

    C RD_PUNC

    . (period), ,(comma), ;(semi-colon), !(exclama-tion),? (question), : (colon), etc.

    10.4 Unknown UNK RD_UNK Not able to identify the Tag. 10.5 Echo-words ECH RD_ECH जेवण िबवण, डोके िबके