An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for...

27
An Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre for Applied Linguistics and Translation Studies, University of Hyderabad. University of Hyderabad. [email protected] March 22, 2010 CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 0

Transcript of An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for...

Page 1: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

An Improvized Morphological Analyzer cum Generator for Tamil :Generator for Tamil :

A case of implementing the open source platform APERTIUM

Parameswari K,Centre for Applied Linguistics and Translation Studies,

University of Hyderabad.University of [email protected]

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 0

Page 2: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

Overview

• Morphological Analyzer and Generator - twocrucialtoolsin MT involving NLP.crucialtoolsin MT involving NLP.

• Deals with the improvized database implementedon Apertium for morphological analysis andgeneration.

• Discusses the evaluation of tools with largecorpora to estimatethe efficacy , coverageandcorpora to estimatethe efficacy , coverageandspeed.

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 2

Page 3: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

• A Morphological analyzer is a computational tool toanalyze word forms into their roots, categories along withtheir constituent functional elements and the generator isan inverse of it.

•The attempt involves a practical adoption oflttoolbox forthe Modern Standard Written Tamil in order todevelop an improvized open source MorphologicalAnalyzer and generator.

• The tool uses the computational algorithmFinite StateTransducers for one-passanalysisand generation,andTransducers for one-passanalysisand generation,andthe database is developed in the morphological modelcalledWord and Paradigm .

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 3

Page 4: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

Need for improvized Morphological tools

� Open Source - Easily Accessible

� Handling Derivation Morphology

� Speed

� User friendly

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 4

Page 5: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

APERTIUM -‘lttoolbox’

• Developed by the Tran`sducens research group at theUniversitat d’Alacant in Spain.Universitat d’Alacant in Spain.

• One of the open source machine translation systemshas originated within the project“Open-SourceMachine Translation for the languages of Spain”.

• A componant called ‘lttoolbox’ for performing lexicalprocessing tasks of language like Morphologicalprocessing tasks of language like MorphologicalAnalyzer, Generator and POS Tagger.

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 5

Page 6: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

Downloading…

• Thecurrentversionsof theApertiumtoolboxas• Thecurrentversionsof theApertiumtoolboxaswell as language data are available from theSourceforge pagesf.net/projects/apertium.

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 6

Page 7: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

Data Organization In ApertiumAperium 'lttoolbox' uses,

� Word and Paradigm model (Hockett,1958) forlinguisticdatabaselinguisticdatabase

� Finite-State Transducers (Jurafsky, 2003 and MikelL. Forcada and et.al, 2008) as a computationalalgorithm for processing the data.

Lexical Resources required,

�Paradigms�Paradigms

�Root word Dictionary

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 7

Page 8: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

Data Organization in Apertium

� Primarily thedatais adoptedfrom CALTS� Primarily thedatais adoptedfrom CALTSMorph.

� The data is improved.

1. Paradigm Improvization

2. Dictionary Improvization

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 8

Page 9: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

Tamil Morphology• Tamil, a Dravidian Language is known for its agglutinativemorphology.

• Computing the analysis of Tamil Morphology demands acomprehensivebut exhaustive analysis of its inflectionalcomprehensivebut exhaustive analysis of its inflectionalcategories according to their functional properties.

• The present attempt classifies the morphological categoriesof Tamil based on their role in inflection. There are twoclasses,

Class A: The forms which anchor with suffixes ormorphosyntacticelements.morphosyntacticelements.

Class B: The forms which are incapable of receiving suchinflection.

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 9

Page 10: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 10

Page 11: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

Distinct Category� The present study considers pronouns as a distinct minor

class because of its characteristic formation of oblique andidiosyncratic plural forms.

ufkalYE- < nI (2p. pl)+ obl + Accusative>'us+Accusative'ufkalYE- < nI (2p. pl)+ obl + Accusative>'us+Accusative'

� The numerals have inflection with special particles such asquantitative particles (kAl, arE, mukkAl etc), attributiveparticles (per) and temporal particles like time (maNi) whichinvolve peculiar inflection when compared to nouns.

walAvAyiram <walA-attributive particle + Ayiram ' a thousandperhead‘>perhead‘>

� The locative nouns are the indicators of time and space that areascertained as a minor category because it exhibits an irregularinflection with morphosyntactic properties of noun. For instance,

arukil aruku + Locative 'near + Locative'March 22, 2010

CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 11

Page 12: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

HANDLING INFLECTION

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 12

Page 13: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

HANDLING DERIVATION

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 13

Page 14: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

Dictionary Entry

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 14

Page 15: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

DatabaseParadigmatic Database Dictionary Size (lemma)

Number of Words

Category Number of Inflectional

Number of Inflections per

Categorywise

TotalInflectional Classes

Inflections per class

wise

Noun 20 743 57,322

68,060

Verb 29 934 10,114

Adjective 2 372 209

Pronoun 11 654 18

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 15

Numeral 14 370 129

NST 7 67 62

Avy - - 206

Page 16: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

COMPILING AND PROCESSING

The data is compiled and processed by using theapplications used in the lexical processing modules and tools(lttoolbox).(lttoolbox).

The ‘lt-comp’ is the application responsible for compilingdictionaries used by Apertium into a compact and efficientrepresentation.

Synopsis : lt-comp [ lr | rl ] dictionary_file output_file

The dictionary which is compiled is processed by theapplication 'lt-proc' that is responsible for functioning the data.

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 16

application 'lt-proc' that is responsible for functioning the data.

Synopsis : lt-proc [-c] [-a|-g] fst_file [input_file [output_file]]

Page 17: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 17

Page 18: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 18

Page 19: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 19

Page 20: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 20

Page 21: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

Evaluation

• The Morphological analyzer tool was testedusing the corpus (CALTS corpus of 4.4million words and EMILLI CIIL corpus of 4.8million words and EMILLI CIIL corpus of 4.8million words) in order to find out its thecoverage of the corpus.

Corpus Total words Reg. words coverage Speed

CALTS Corpus 4,45,130 3,75,891 84.44% 0m0.289sCALTS Corpus 4,45,130 3,75,891 84.44% 0m0.289s

EMILLI CIIL Corpus 4,85,543 4,05,898 83.59% 0m0.297s

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 21

Page 22: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

SpeedThe morph analyzer is tested for its speed along with the

other available Tamil Morphological analyzers which aredeveloped in CALTS, University of Hyderabad and AU-KBCdeveloped in CALTS, University of Hyderabad and AU-KBCresearch Centre, Anna Unicersity. The speed of eachmodules for 1,00,000 words as follows.

Morph Processing Speed

CALTS- MORPH 0m14.194s

AU-KBC MORPH 0m14.703s

CALTS-APERTIUM 0m0.198s

The above speed shows Calts-Apertium consumes less speed toanalyze large number of data.

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 22

Page 23: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

Error AnalysisType Word From Frequency in the Copus

Orthographic Variation

koyil 'temple'kovil 'temple'

885 occurrences204 occurrences

Inflectional Variation

eVlYYuwwu-kkalY‘letters’eVlYYuwwu-kalY ‘letters’

57 occurrences171 occurrencesVariation eVlYYuwwu-kalY ‘letters’ 171 occurrences

Dialectal Variation

vanwAy‘You came’ (standard) vanweV ‘You came’ (dialect)

765occurrences6 occurrences

Naturalized polIS 'police' 20070 occurrencesNaturalized English words

polIS 'police' 20070 occurrences

Proper nouns kaNNanY'male name'wamilYYnAtu'Tamil Nadu'

211 occurances364 occurrences

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 23

Page 24: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

Salient features of Apertium

• Easily Accessible.

• Speed.

• Two in One tool (Both Generator and Analyzer ).

• GNU License – allow us to modify.

• Uses XML format which is easy to modify.

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 24

Page 25: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

Conclusion• The Apertium tool for Tamil is efficient in terms of

time for processing a large number of words.

• The combinationof Finite StateTransducers(letter• The combinationof Finite StateTransducers(lettertransducer) and the paradigm approach works moreefficient and speedy parsing.

• The other advantage of the Apertium is that the currentmorphological database can be used to create a parallelmorphological generator for Tamil.

• In future, furtherwork canbecarriedout in this area• In future, furtherwork canbecarriedout in this areato meet the other morphological lapses for attainingmaximum coverage and precision.

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 25

Page 26: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

ReferenceArden, A.H. 1976.A progressive Grammar of the Tamil language. Madras : The ChristianLiterature Society.

Fercada, Mikel et.al. 2008.Documentation of the Open-Source Shallow-Transfer MachineTranslationplatformApertium. Publishedin website: http://www.gnu.org/copyleft/fd1.html.TranslationplatformApertium. Publishedin website: http://www.gnu.org/copyleft/fd1.html.

Ramaswamy, Vaishnavi. 2003.A morphological Analyzer for Tamil.Unpublished Ph.D.Thesis. Hyderabad: University of Hyderabad.

Uma Maheshwar Rao, G. 1999.Morphological Analyzer for Telugu.(electronic form).Hyderabad: University of Hyderabad.

Uma Maheshwar Rao, G. 2002. A Computational Grammar of Telugu. (Momeo)Hyderabad: University of Hyderabad.

Vaidhya, Ashwini and Dipti Misra Sharma. 2009. Using Paradigms for CertainVaidhya, Ashwini and Dipti Misra Sharma. 2009. Using Paradigms for CertainMorphological phenomena in Marathi.Puplished in 7th International Conference onNLP-2009: India

Viswanathan. S et.al. 2003.A Tamil Morphological Analyserin Recent Advances in NLP.pp 31-39.

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 26

Page 27: An Improvized Morphological Analyzer cum … Improvized Morphological Analyzer cum Generator for Tamil : A case of implementing the open source platform APERTIUM Parameswari K, Centre

THANK YOU

March 22, 2010CALTS, University of Hyderabad Knowledge Sharing Event 1 - CIIL 27