[IEEE 2013 International Conference on Asian Language Processing (IALP) - Urumqi, China...



Page 1: [IEEE 2013 International Conference on Asian Language Processing (IALP) - Urumqi, China (2013.08.17-2013.08.19)] 2013 International Conference on Asian Language Processing - Transfer

Transfer Grammar in Tamil-Hindi MT System

Sobha Lalitha Devi, Sindhuja Gopalan and Vijay Sundar Ram

AU-KBC Research Centre, MIT Campus of Anna University, Chennai, India

[email protected]

Abstract: In this paper, we present our work on transfer grammar, one of the most challenging issues in MT, in the bidirectional Tamil-Hindi translation system Sampark. Transfer grammar between these languages can be categorized into two levels: (1) structure transfer and (2) lexical level transfer. Tamil and Hindi differ extensively at the clausal construction level and at the verb formation level, since Tamil is an agglutinative language and Hindi is not. The transfer grammar described here uses a hybrid approach that combines CRF, a machine learning algorithm, with linguistic rules for structure transfer, and a rule-based approach for word level transfer. We tested the approach in the Sampark system using web data and the results are encouraging.

Keywords- transfer grammar, syntactic structure transfer, word level transfer, Tamil-Hindi MT

I. INTRODUCTION

Transfer grammar is the bridging component between the source language and the target language in an automatic machine translation system. The divergences between the two languages are analysed and handled in this module by transferring the characteristics of the source language to their counterparts in the target language. The transfer grammar consists of lexical level and structural level transfer. It plays a vital role in translating a source language sentence into a natural sentence in the target language.

There are different approaches to handling structural transfer and divergence between source and target language, such as interlingua, transfer grammar and direct transfer [4]. Lavie has presented Stat-XFER, a general search-based, syntax-driven framework for developing MT systems [3]. Traditional SMT systems use an aligned parallel corpus to learn the correct choice of lexical and structure transfer. Recent SMT approaches incorporate syntactic information into the translation process to obtain better translations [1,2]. In the present work we use a transfer grammar based approach, in which machine learning and rule-based approaches are applied to the various transfers.

Sampark is a platform for Indian language to Indian language translation (http://tdil-dc.in/index.php?option=com_vertical&parentid=74). The Sampark system is based on the analyze-transfer-generate paradigm. First, the source language is analysed, then vocabulary and structure are transferred to the target language, and finally the target language is generated. Each phase consists of multiple "modules", with 13 major ones such as Morphological Analyzer, POS Tagger, NP and VP chunker, NER, WSD, Transfer Grammar and Word Generator. An advantage of this approach is that a particular language analyzer can be developed once, independent of other languages, and then paired with generators in other languages. The 13 major modules together form a hybrid system that combines rule-based approaches with statistical methods, in which the software in essence discovers rules through "training" on text tagged by human language experts. In the Transfer Grammar module, the syntactic structure of the source language is mapped to the syntactic structure of the target language.

The languages we have considered here are Tamil (Ta), a Dravidian language, and Hindi (Hi), an Indo-Aryan language. The two languages share certain similarities: both are verb-final, have free word order and morphologically rich inflections, and, owing to the influence of Sanskrit on both, they are similar at the lexical level. Structurally, however, they are very different. Tamil is a nominative-accusative language and Hindi is an ergative-accusative language. In Tamil, nouns are inflected with case markers by suffixation, whereas in Hindi the case markers are not attached to nouns but are suffixed to pronouns. In Tamil, copula drop is a common phenomenon. These differences show the vital role of transfer grammar in bridging the two languages and producing a translated sentence that is natural.

The rest of the paper is arranged as follows. Section II describes the transfers included in the transfer grammar module: syntactic structural transfer, case transfer, handling of negative verbs, word level transfer, handling of other suffixes and postpositions, and copula generation. Section III discusses the results of the various transfers. Finally, the paper ends with the conclusion.

II. TRANSFER GRAMMAR

This section describes the various types of transfer included in the transfer grammar module, which is a part of the Tamil-Hindi automatic translation system. The following transfers are included in the module:

1. Syntactic Structural Transfer
2. Case Transfer
3. Handling of negative verbs
4. Handling of oblique case
5. Handling of other suffixes and Postpositions
6. Copula generation

A. Syntactic Structural Transfer

The goal of syntactic structure transfer is to improve the translation grammatically and to make the target language structures natural. We learn structures from clause-identified parallel data, and the correct target language rule for a given source language rule is selected using the semantic classification of the postposition (PSP) and the classification of the case marker (CM) that a noun takes. Transformation-Based Learning (TBL) is used to learn the structures from the clause-tagged Tamil and Hindi sentences [5]. This paper concentrates on the lexical level transfers.

2013 International Conference on Asian Language Processing

978-0-7695-5063-3/13 $26.00 © 2013 IEEE

DOI 10.1109/IALP.2013.24

B. Case Transfer

Case suffixes and postpositions are used to express syntactic and semantic functions. The Tamil case system is analyzed in native and missionary grammars as consisting of a finite number of cases, to some of which postpositional suffixes may be added. In Hindi, case markers are written as separate words after nouns, but are suffixed to pronouns. In Tamil, the case markers are agglutinated with both nouns and pronouns. Case transfer from Tamil to Hindi is more complex than the transfer of other nominal suffixes, as there is no one-to-one correspondence. Table I presents the case mapping between Tamil and Hindi.

TABLE I. CASE MAPPING

Case                   Tamil              Hindi
Nominative/Ergative    NULL               NULL, ne
Accusative             ai                 ko, NULL, se
Dative                 ku                 ko, ke liye, NULL
Instrumental           aal, kontu         se, NULL
Locative               il, itam           meN/para, NULL
Ablative               iliruntu           se
Benefactive            ukkaaka            ke liye
Sociative              ootu, utan         ke saatha
Genitive               utaiya, in, atu    ka, ke, ki

The steps involved in transferring the case from Tamil to Hindi are as follows.

1. Identify the subject argument in the sentence using selectional restriction rules.
2. Identify the type of the verb.
   a) Is the finite verb a copula or existential?
   b) Is the finite verb a cognitive verb?
   c) Is the verb transitive or intransitive?
3. For each noun phrase in the sentence do the following.
   a) If the case name is nominative (null case) then do the following.
      1. If the noun is the subject of the sentence then
         1. Check whether the noun is in oblique form.
         2. If the verb is transitive then
            a) The nominative case is transferred to ergative case ‘ne’.
   b) If the case name is accusative then do the following.
      1. If the verb is finite then
         1. The case marker ‘ai’ is transferred to ‘ko’.
   c) If the case name is dative then do the following.
      1. If the verb is cognitive then
         1. The dative case is transferred to ‘ko’.
         Else
         2. The dative case is transferred to ‘ke_liye’.
   d) If the case name is benefactive then the case marker ‘ukkaka’ is transferred to ‘ke_liye’.
   e) If the case name is instrumental then the case marker ‘aal’ is transferred to ‘se’.
   f) If the case name is sociative then the case marker ‘utan|ootu’ is transferred to ‘ke_sath’.
   g) If the case name is locative then do the following.
      1. If the case marker is ‘itam’ then it is transferred to ‘se’.
      2. If the case marker is ‘il’ then it is transferred to ‘me’.
      3. If the case marker is ‘mel’ then it is transferred to ‘par’.
   h) If the case name is ablative then the case marker ‘ilirunthu’ is transferred to ‘se’.
   i) If the case name is genitive then do the following.
      1. Check the ‘number’ (singular, plural) and ‘gender’ of the following word.
      2. If the current word has a genitive marker do the following.
         a) If the following word is singular and masculine then the genitive case is transferred to ‘ka’.
         b) If the following word is plural and masculine then the genitive case is transferred to ‘ke’.
         c) If the following word is singular or plural and feminine then the genitive case is transferred to ‘ki’.
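The case transfer steps above can be read as a small decision procedure. The following is a minimal sketch of that reading, not the Sampark implementation: the `Noun` and `Verb` records, their field names and the function `transfer_case` are hypothetical, and the sketch assumes that subject identification and verb typing (steps 1 and 2) have already been done upstream. Where the steps do not say what happens (for example an accusative noun with a non-finite verb), returning `None` for a NULL marker is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verb:
    # Hypothetical record for the verb type found in step 2.
    transitive: bool = False
    cognitive: bool = False
    finite: bool = True


@dataclass
class Noun:
    # Hypothetical record for one noun phrase (step 3).
    case: str                           # case name, e.g. "dative"
    marker: str = ""                    # Tamil case marker, e.g. "il"
    is_subject: bool = False            # result of step 1
    next_gender: Optional[str] = None   # "m" or "f", used for genitive
    next_number: Optional[str] = None   # "sg" or "pl", used for genitive


# Cases whose marker does not depend on any context in the listed steps.
ONE_TO_ONE = {
    "benefactive": "ke_liye",
    "instrumental": "se",
    "sociative": "ke_sath",
    "ablative": "se",
}

# Locative transfer depends on the Tamil marker (step 3g).
LOCATIVE = {"itam": "se", "il": "me", "mel": "par"}


def transfer_case(noun: Noun, verb: Verb) -> Optional[str]:
    """Return the Hindi case marker, or None for a NULL marker."""
    if noun.case == "nominative":
        # Step 3a: subject of a transitive verb takes ergative 'ne'.
        return "ne" if noun.is_subject and verb.transitive else None
    if noun.case == "accusative":
        # Step 3b: 'ai' becomes 'ko' when the verb is finite.
        return "ko" if verb.finite else None
    if noun.case == "dative":
        # Step 3c: cognitive verbs select 'ko', others 'ke_liye'.
        return "ko" if verb.cognitive else "ke_liye"
    if noun.case == "locative":
        return LOCATIVE.get(noun.marker)
    if noun.case == "genitive":
        # Step 3i: agree with the number and gender of the following word.
        if noun.next_gender == "f":
            return "ki"
        if noun.next_gender == "m" and noun.next_number == "sg":
            return "ka"
        if noun.next_gender == "m" and noun.next_number == "pl":
            return "ke"
        return None
    return ONE_TO_ONE.get(noun.case)
```

For instance, `transfer_case(Noun("dative", "ku"), Verb(cognitive=True))` yields `ko`, while the same noun with a non-cognitive verb yields `ke_liye`.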

C. Handling of negative verbs

Negative verbs are verbs which indicate that an action did not happen. Tamil has various forms of negative verbs, some of which remain agglutinated with the infinitive verb that precedes them. In Hindi, the negative words include ‘nahi’, ‘na’ and ‘math’. In certain cases, when Tamil sentences with negative verbs are translated to Hindi, the words are reordered. This reordering occurs within verb chunks [7].

1. (Ta) en puththakam avanitam illai.
        PN N PN -soc neg
   (Hi) meri pustak uske sath nahi hE.
        PN N PN -soc neg V
   (My book is not with him.)

In the above example, the Tamil sentence (1. (Ta)) has the negative verb ‘illai’, which is translated to ‘nahi hE’ in Hindi.

The steps involved in handling negative verbs in the transfer grammar module are as follows.

1. Identify the negative verb in the sentence.
2. Identify the type of negative verb.
   a. If the negative verb is ‘illai’ then it is translated as ‘nahi hE’.
   b. If the negative verb is in agglutinated form with an infinitive verb then
      1. Split the infinitive verb and the finite negative verb.
      2. Reorder the verbs.
      3. The negative verbs ‘illai’ and ‘aathu’ are translated as ‘nahi’ and ‘math’ respectively.
      4. If the negative verb is ‘kuutaathu’ then it is translated as ‘math’.
   c. If the negative verb is ‘mutiyathu’ then it is translated as ‘nahi sakte’.
   d. If the negative verb is ‘veentam’ then do the following:
      1. If a noun precedes the negative verb ‘veentam’ then it is translated as ‘nahi cahiye’.
      2. If an infinitive verb precedes the negative verb ‘veentam’ then it is translated as ‘nahi’ followed by the infinitive verb and ‘cahiye’.
   e. If the finite negative verb is ‘maataan|maataaL|maataRkaL’ and is preceded by an infinitive verb:
      1. The verbs are reordered.
      2. The negative verb is translated as ‘nahi’.
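The negative verb steps above can be sketched as a lookup on the negative form and the tag of the preceding word. This is an illustrative sketch only: the function name `transfer_negative`, the tags `N` and `VINF`, and the `agglutinated` flag are hypothetical, the preceding word is assumed to be already transferred to Hindi, and the split of an agglutinated form is assumed to come from the morphological analyzer. The steps say that the verbs are reordered without giving the target order, so placing the negative word before the verb is an assumption.

```python
from typing import List

# Negative forms that were agglutinated with an infinitive (step 2b).
AGGLUTINATED = {"illai": "nahi", "aathu": "math", "kuutaathu": "math"}

# Person-marked finite negative verbs (step 2e).
PERSON_NEGATIVES = {"maataan", "maataaL", "maataRkaL"}


def transfer_negative(neg: str, prev: str = "", prev_tag: str = "",
                      agglutinated: bool = False) -> List[str]:
    """Return the Hindi tokens for a Tamil negative verb and its preceding word."""
    if agglutinated and neg in AGGLUTINATED:
        # Split has been done upstream; negative word goes first (assumption).
        return [AGGLUTINATED[neg], prev]
    if neg == "illai":
        return ["nahi", "hE"]
    if neg == "mutiyathu":
        return ["nahi", "sakte"]
    if neg == "veentam" and prev_tag == "N":
        return [prev, "nahi", "cahiye"]
    if neg == "veentam" and prev_tag == "VINF":
        # 'nahi' followed by the infinitive verb and 'cahiye' (step 2d.2).
        return ["nahi", prev, "cahiye"]
    if neg in PERSON_NEGATIVES and prev_tag == "VINF":
        return ["nahi", prev]
    raise ValueError("no rule for negative verb: " + neg)
```

Raising an error for an unlisted form keeps the sketch honest about its coverage; a real module would more likely fall back to a default negative word.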


D. Handling of Oblique and Direct case

The Number attribute is transferred based on the case form (direct or oblique) of the noun. The transfer is not a one-to-one mapping but a one-to-many mapping. Case suffixes added to the oblique forms of nouns agree in number and gender.

2. (Ta) kuzhanthaikaL viLaiyaatukiRaarkaL.
        baby(N)+pl play(V)+present+3pl
   (Hi) bache khelthe hEM.
        N+pl V copula
   (The children are playing.)

3. (Ta) naan kuzhanthaikaLaip paarththeen.
        I(PN) baby(N)+pl+acc see(V)+past+1s
   (Hi) mein bacchoN ko dekhaa.
        PN N+pl -acc V
   (I saw the children.)

In the above examples, the Hindi plural noun takes the direct form ‘bache’ in (2) but the oblique form ‘bacchoN’ in (3), where it is followed by the accusative case marker. In Tamil, both sentences have the same plural suffix ‘-kaL’.
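The one-to-many mapping illustrated by examples 2 and 3 can be sketched as a choice between two stored plural forms. This is a toy illustration under stated assumptions: the table `PLURAL_FORMS`, its key (the Tamil lemma) and the function `hindi_plural` are hypothetical, and a real system would generate the forms from a Hindi word generator rather than a lookup.

```python
from typing import Optional

# Direct and oblique plural forms, taken from examples 2 and 3.
PLURAL_FORMS = {
    "kuzhanthai": {"direct": "bache", "oblique": "bacchoN"},
}


def hindi_plural(tamil_lemma: str, hindi_case_marker: Optional[str]) -> str:
    """Pick the oblique plural when a Hindi case marker follows, else the direct one."""
    form = "oblique" if hindi_case_marker else "direct"
    return PLURAL_FORMS[tamil_lemma][form]
```

With no case marker the sketch returns `bache`; with the accusative marker `ko` it returns `bacchoN`.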

E. Handling of other suffixes and Postpositions

In Tamil, ‘um’ exists both as a coordination marker and as an emphatic marker.

4. (Ta) raajuvum raamuvum kataikkuc cenRaarkaL.
        raju(N)+um ramu(N)+um shop(N)+dat go(V)+past+3pl
   (Hi) raaju Ora raamu baazaar gaye.
        N -conj N N V
   (Raju and Ramu went to shop.)

In the above example ‘um’ is a coordination marker. When this sentence is translated to Hindi, the conjunction ‘Ora’ is added.

5. (Ta) raaju puththakamum vaanginaan.
        raju(N) book(N)+emp(um) buy(V)+past+3sm
   (Hi) raju pusthak bhii kharIdaa thaa.
        N N -emp V copula
   (Raju also bought books.)

In the above example ‘um’ is an emphatic marker. Hence it is translated to Hindi as ‘bhii’.

A Tamil postposition combined with the genitive marker of the preceding word corresponds to a compound postposition in Hindi. For example, ‘ke alaavaa’, ‘ke anusaar’ and ‘ke aage’ are compound postpositions in Hindi.

6. (Ta) avan viittin munnaal maram irukkiRathu.
        he(PN) house(N)+gen front(PSP) tree(N) is(V)+present+3sn
   (Hi) uske ghar ke saamne peed hE.
        PN N cmpd PSP N copula
   (There is a tree in front of his house.)

In the above example, the inflected noun ‘viittin’ in the Tamil sentence has the genitive marker ‘in’ and is followed by the postposition ‘munnaal’. When this sentence is translated to Hindi, the genitive marker ‘in’ combines with the postposition ‘munnaal’ to form the compound postposition ‘ke saamne’.
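The two readings of ‘um’ and the formation of compound postpositions can be sketched as follows. The text does not state how the two readings of ‘um’ are told apart, so the criterion used here (two or more ‘um’-marked nouns in the same sequence means coordination, a single one means emphasis) is an assumption made for illustration. The function names and the table `COMPOUND_PSP`, which holds only the pair from example 6, are hypothetical.

```python
from typing import List, Optional, Tuple

# (Tamil genitive marker, Tamil postposition) -> Hindi compound postposition.
COMPOUND_PSP = {("in", "munnaal"): "ke saamne"}


def transfer_um(nouns: List[Tuple[str, bool]]) -> List[str]:
    """nouns: (Hindi word, carried 'um' in Tamil) pairs for one noun sequence."""
    marked = sum(1 for _, has_um in nouns if has_um)
    out: List[str] = []
    seen = 0
    for word, has_um in nouns:
        if has_um and marked >= 2:
            # Coordination: put 'Ora' between the coordinated nouns.
            if seen:
                out.append("Ora")
            seen += 1
            out.append(word)
        elif has_um:
            # Emphatic: 'bhii' follows the marked noun.
            out.extend([word, "bhii"])
        else:
            out.append(word)
    return out


def compound_postposition(gen_marker: str, psp: str) -> Optional[str]:
    """Return the Hindi compound postposition, or None if the pair is unknown."""
    return COMPOUND_PSP.get((gen_marker, psp))
```

Applied to examples 4 and 5, the sketch gives `raaju Ora raamu` and `raaju pusthak bhii` respectively.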

F. Copula Transfer

A copula is often a verb or a verb-like word. Tamil does not have a copula; the word is included in translations only to convey the meaning more easily. In Hindi, copula verbs express possession, location and essential constructs, with forms such as hE, hEN, thaa, the and ho. A copula must be added when a Tamil sentence with zero copula is translated to Hindi. Hence this is an important task in transfer grammar.

7. (Ta) avaL ennutaiya cakoothari.
        she(PN) my(PN)+gen sister(N)
   (Hi) vah merii bahan hE.
        PN PN-gen N copula
   (She is my sister.)

In the above example, the Tamil sentence has a zero copula, and the copula is introduced in the Hindi sentence. In machine translation it is important to identify the Tamil sentences with zero copula and to generate the copula in the corresponding Hindi sentences.
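Copula generation as described above amounts to detecting a Tamil sentence without a finite verb and appending a copula on the Hindi side. The sketch below assumes present tense and chooses only between the forms ‘hE’ and ‘hEN’ by number; the function name and its parameters are hypothetical, and the detection of the missing finite verb is assumed to come from the preprocessing modules.

```python
from typing import List


def add_copula(hindi_tokens: List[str], tamil_has_finite_verb: bool,
               plural_subject: bool = False) -> List[str]:
    """Append a present-tense copula when the Tamil source had zero copula."""
    if tamil_has_finite_verb:
        # Nothing to generate: the verb is transferred by other rules.
        return list(hindi_tokens)
    copula = "hEN" if plural_subject else "hE"
    return list(hindi_tokens) + [copula]
```

For example 7, `add_copula(["vah", "merii", "bahan"], False)` returns the tokens of ‘vah merii bahan hE’.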

III. RESULTS AND DISCUSSION

A. Nominal Transfer

We evaluated the Sampark system (Tamil-Hindi) with 250 sentences randomly picked from Tamil web sources. These sentences were run through the system, and the results for nominal transfer are shown in Table II. Errors introduced by the preprocessing modules were corrected before the sentences were given as input.

TABLE II. PERFORMANCE OF CASE TRANSFER AND NUMBER

Type of Transfer    Total number of occurrences    Correctly transferred    Accuracy %
Case                1850                           1440                     77.8
Number              1883                           1774                     94.2
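The accuracy column in Table II is the share of correctly transferred occurrences. A small sketch of that arithmetic, with a hypothetical helper name and rounding to one decimal place assumed from the table:

```python
def accuracy(correct: int, total: int) -> float:
    """Percentage of correctly transferred occurrences, rounded to one decimal."""
    return round(100.0 * correct / total, 1)


case_accuracy = accuracy(1440, 1850)    # Case row of Table II
number_accuracy = accuracy(1774, 1883)  # Number row of Table II
```

Both values reproduce the figures reported in the table (77.8 and 94.2).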

The results show that the system is able to transfer all kinds of nominals, such as nouns and pronouns. Case divergence is the major challenge in improving the system. For example, the accusative case marker ‘-ai’ in sentences such as Raju avanai adithaan ‘Raju beat him’ selects the ‘se’ case marker in Hindi, but the current system transfers it to ‘ko’. So case divergence has to be analyzed separately. Regarding number and gender agreement, the gender of a Hindi noun cannot always be determined from its phonetic form. For example, many nouns ending in ‘-aa’ are masculine, but words like mitrataa ‘friendship’ belong to the feminine class. So the ending vowel is not the only parameter that decides the inflections selected by nouns in Hindi.

B. Handling of Oblique Case, Copula and Postpositions

The transfer of oblique case, copula and postpositions was evaluated with 250 sentences taken from Tamil web sources, and the performance is presented in Table III.

TABLE III. PERFORMANCE OF COPULA TRANSFER, POSTPOSITION AND OBLIQUE TRANSFER

Type of Transfer    Total number of occurrences    Correctly transferred    Accuracy %
Oblique Transfer    52                             52                       100
Copula Transfer     148                            141                      95.3
Postposition        226                            202                      89.82


The transfer of postpositions is somewhat challenging because of the genitive drop characteristic of Tamil: the genitive case marker can be dropped in certain sentence constructions. This is illustrated by the sentences below.

Tamil sentence with genitive marker:

8. (Ta) mejiin medhu narkaliai vei.
        table(N)+gen above(ADV) chair(N) keep(V)
   (Hi) mej ke upar kursii ko rakho.
        N cmpd PSP N -acc V
   (Keep the chair above the table.)

Tamil sentence with genitive drop:

9. (Ta) meji medhu narkaliai vei.
        table(N) above(ADV) chair(N) keep(V)
   (Hi) mej me kursi ko rakho.
        N -loc N -acc V
   (Keep the chair above the table.)

This can be handled with a set of linguistic rules once the sentence constructions in which genitive drop occurs have been identified. These rules will depend heavily on the chunk information.

In oblique transfer, although the transfer itself happens properly, errors occur in clausal sentences. In a relative clause sentence, structure transfer moves the oblique plural noun phrase that is shared between the clauses to the subject position of the relative participle clause, where the noun has to be in direct case and not oblique. This shows that oblique transfer should be handled separately for clausal sentences. Here is an example.

10. (Ta) chennaikku vandha viirarkaluku nalla varaverpu kidaithathu.
         chennai(N)+dat come(V)+past+rp player(N)+pl+dat good(Adj) welcome(N) get+past+3pn

The correct translation is

    (Hi) jo khiladiyan chennai ko aaya unko accI svagath mila.
         RC N+pl N -dat V PN+dat Adj N V

Treating the noun as oblique case, the sentence is transferred as:

11. (Hi) jo khiladiyon Chennai ko aaya unko accI svagath mila.
         RC N+pl N -dat V RC+dat Adj N V

In the above example, the Tamil sentence has the word “viirarkaluku” (“players”), a plural noun with a dative marker. Since the noun becomes nominative after structure transfer, “viirarkaluku” should be translated as “khiladiyan”, as in sentence (10). But after oblique transfer it is translated as “khiladiyon”, as in sentence (11).

The copula transfer works with high accuracy. Because it relies on the correctness of the preprocessing modules, there is a small decrease in accuracy. Similarly, the handling of negative verbs and of coordinate suffixes works with high accuracy when correctly preprocessed input is provided.

Table IV shows the module wise and total performance of our Tamil-Hindi translation system. We have used measures like precision and recall to evaluate our system.

TABLE IV. PERFORMANCE OF TAMIL-HINDI MACHINE TRANSLATION SYSTEM

Module                    Precision (%)    Recall (%)
Morph Analyser            95.51            95.06
Parts of Speech tagger    96.35            96.35
Chunker                   88.81            90.01
Simple Parser             71.43            78.95
Transfer Grammar          93.13            93.71
Word Transfer             82.41            82.99
Total                     87.94            89.5

IV. CONCLUSION

In this paper we have discussed the transfer grammar module used in the Tamil-Hindi translation system. We have explained the various transfers in the module, such as syntactic structure transfer, clause transfer, copula generation, handling of negative verbs and handling of oblique and direct forms. We have evaluated the module with sentences taken from Tamil web data. The analysis shows that case divergence has to be handled with more rules. In syntactic structure transfer, choosing the relative correlatives in Hindi using the semantic classification of postpositions was found to be effective in obtaining natural translations. In future work we will try to incorporate various other lexical transfers, and thus improve the measures of the transfer grammar module and the total performance of the system.

REFERENCES

[1] M. Collins, P. Koehn, and I. Kučerová. “Clause restructuring for statistical machine translation”. In 43rd Annual Meeting of ACL, 2005. pp. 531-540.

[2] P. Koehn, F. J. Och, and D. Marcu. “Statistical Phrase-Based Translation”. In HLT/NAACL’03. 2003. pp. 127-133.

[3] A. Lavie. “Stat-XFER: A general search-based syntax-driven framework for machine translation”. In Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science, Springer. 2008. pp. 362-375.

[4] J. Slocum. “Machine Translation: its history, current status, and future prospects”. In 10th International Conference on Computational Linguistics, Stanford, California. 1984. pp. 546-561.

[5] Sobha Lalitha Devi, R. Vijay Sundar Ram, Pravin Pralayankar and T. Bakiyavathi. “Syntactic Structure Transfer in a Tamil to Hindi MT System - A Hybrid Approach”. In A. Gelbukh (ed), Computational Linguistics and Intelligent Text Processing, Springer LNCS Vol. 6008. 2010a. pp. 438-450.

[6] Sobha Lalitha Devi, V. Kavitha, Pravin Pralayankar, S. Menaka, T. Bakiyavathi and R. Vijay Sundar Ram. “Nominal Transfer from Tamil to Hindi”. In International Conference on Asian Language Processing (IALP), Harbin, China. 2010b. pp. 270-273.

[7] Sobha Lalitha Devi, Pravin Pralayankar, S. Menaka, T. Bakiyavathi, R. Vijay Sundar Ram and V. Kavitha. “Verb Transfer in a Tamil to Hindi Machine Translation System”. In International Conference on Asian Language Processing (IALP), Harbin, China. 2010c. pp. 261-264.
