Identification of bilingual named entities from Wikipedia using a pair Hidden Markov Model

Peter Nabende

Alfa Informatica, Center for Language and Cognition Groningen, University of Groningen

The 19th Meeting of Computational Linguistics in the Netherlands (CLIN 2009)



Introduction

• The growing availability of comparable text on Wikipedia has led to the development of techniques for automatically extracting new bilingual terms, including named entities, from Wikipedia (Erdmann et al., 2008)

• Extracting matching bilingual named entities is important for enriching bilingual dictionaries and for providing lexicons for NLP applications such as machine translation (MT), cross-language information retrieval (CLIR), and cross-language information extraction (CLIE)

• Apart from extracting named entity pairs from the Wikipedia link structure (Adafre and de Rijke, 2006; Bouma et al., 2006; Erdmann et al., 2008), unlinked article texts in Wikipedia are also a major source for extracting bilingual named entities

Introduction

• A pair-HMM is used to simplify similarity estimation between candidate named entities for languages that use different alphabets, which in turn simplifies the extraction of matching transliterations

• Additional applications of this work include:

  – Identification of transliteration pairs for linking in Wikipedia. A third of the pairs from the article pair on “Yoweri Museveni” were not linked

  – Identification of confusable drug names (Kondrak and Dorr, 2004)

Proposed Identification Approach

Source Language (SL) Wikipedia article ↔ (inter-language link) ↔ corresponding Target Language (TL) article

→ SL candidate named entities and TL candidate named entities

→ Similarity estimation

→ Extract the first n or n% highest-ranking pairs

→ Bilingual named entities

Fig. 1: Approach for identification of bilingual named entities from Wikipedia
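The flow in Fig. 1 can be sketched roughly as follows. This is only an illustration: `top_pairs`, `toy_similarity`, and the small romanization table are hypothetical names introduced here, and the toy scorer (romanize, then compare with `difflib`) is a stand-in for the pair-HMM similarity estimate, not the method used in the talk.

```python
from difflib import SequenceMatcher

# Hypothetical romanization table covering only the letters in the toy data.
ROMANIZE = str.maketrans("аркнзслуи", "arknzslui")

def toy_similarity(sl_name, tl_name):
    """Stand-in scorer: romanize the Russian name and compare the strings."""
    return SequenceMatcher(None, sl_name, tl_name.translate(ROMANIZE)).ratio()

def top_pairs(sl_candidates, tl_candidates, similarity, n):
    """Score every SL/TL candidate pair and keep the n highest-ranking pairs."""
    scored = [(similarity(s, t), s, t)
              for s in sl_candidates
              for t in tl_candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(s, t) for _, s, t in scored[:n]]

english = ["arkansas", "louisiana"]
russian = ["луизиана", "арканзас"]
best = top_pairs(english, russian, toy_similarity, n=2)
print(best)
```

Keeping the scorer as a parameter means the same ranking step works unchanged once a trained pair-HMM score is plugged in; selecting the top n% instead of the top n only changes how `n` is computed from the number of scored pairs.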

Similarity Estimation

• Figure 2 shows the pair-HMM used for similarity estimation

[Figure 2: state diagram with a match state M (emits a pair xi, yj), gap states X (emits xi) and Y (emits yj), and an END state. Transitions: M→X δx, M→Y δy, M→END τm, M→M 1−δx−δy−τm; X→X εx, X→Y λx, X→END τx, X→M 1−εx−τx−λx; Y→Y εy, Y→X λy, Y→END τy, Y→M 1−εy−τy−λy]

Fig. 2: pair-HMM for similarity estimation
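As a rough sketch of how the transition labels in Fig. 2 fit together, the snippet below assembles them into a table and checks that the outgoing probabilities of each state sum to one. The function name `transition_table` and all numeric values are assumptions chosen for illustration; the reading of λx as X→Y and λy as Y→X follows the labels in the figure.

```python
import math

def transition_table(delta_x, delta_y, tau_m,
                     eps_x, lam_x, tau_x,
                     eps_y, lam_y, tau_y):
    """Return {state: {next_state: probability}} for the states M, X and Y."""
    table = {
        "M": {"X": delta_x, "Y": delta_y, "END": tau_m,
              "M": 1 - delta_x - delta_y - tau_m},
        "X": {"X": eps_x, "Y": lam_x, "END": tau_x,
              "M": 1 - eps_x - tau_x - lam_x},
        "Y": {"Y": eps_y, "X": lam_y, "END": tau_y,
              "M": 1 - eps_y - tau_y - lam_y},
    }
    # The "return to M" entry is whatever mass is left, so it must not be negative.
    for state, row in table.items():
        if min(row.values()) < 0:
            raise ValueError(f"transition probabilities of state {state} exceed 1")
    return table

# Hypothetical parameter values, not the trained ones.
table = transition_table(0.1, 0.1, 0.1, 0.2, 0.05, 0.1, 0.2, 0.05, 0.1)
for state, row in table.items():
    print(state, row, "sum =", sum(row.values()))
```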

Similarity Estimation

• Example of an alignment from which similarity is estimated using the pair-HMM, between the English name “jefferson” and its Russian transliteration “джефферсона”

State:    M  Y  M  M  M  M  M  M  M  M  Y

English:  j  -  e  f  f  e  r  s  o  n  -

Russian:  д  ж  е  ф  ф  е  р  с  о  н  а

Fig. 3: Illustration of the alignment for the English name “jefferson” and its Russian transliteration following the pair-HMM concept
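The alignment in Fig. 3 can be replayed from its state sequence. The sketch below assumes the usual reading of the states (M consumes one character from each name, X only from the English name, Y only from the Russian name); the helper name `align` is hypothetical.

```python
def align(x, y, states):
    """Turn a pair-HMM state sequence into a list of aligned character pairs."""
    pairs, i, j = [], 0, 0
    for state in states:
        if state == "M":        # match: one character from each string
            pairs.append((x[i], y[j]))
            i, j = i + 1, j + 1
        elif state == "X":      # gap in y: character from x only
            pairs.append((x[i], "-"))
            i += 1
        elif state == "Y":      # gap in x: character from y only
            pairs.append(("-", y[j]))
            j += 1
        else:
            raise ValueError(f"unknown state {state!r}")
    if i != len(x) or j != len(y):
        raise ValueError("state sequence does not consume both strings")
    return pairs

# State sequence read off Fig. 3: M Y, then eight M, then Y.
states = "MY" + "M" * 8 + "Y"
pairs = align("jefferson", "джефферсона", states)
for en_char, ru_char in pairs:
    print(en_char, ru_char)
```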

Similarity Estimation Parameters

• Two sets of parameters are estimated for the pair-HMM:

  – Transition parameters

  – Emission parameters

  – Starting parameters are derived from the parameters associated with moving from the match state

• The Baum-Welch algorithm is used for estimating the parameters of the pair-HMM

Experimental Setup

• Training data

  – Language pair: English-Russian

  – Source of the data: English Wikipedia database dump downloaded on 8 December 2008

  – Training data was extracted using the Wikipedia inter-language links. The search was restricted to person names found in the English Wikipedia infoboxes

  – In total, 6596 pairs of person names with their corresponding Russian representations were extracted

  – The alphabets used by the pair-HMM were obtained from the extracted collection of person names for each language: 95 characters for English and 72 characters for Russian
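Deriving the two alphabets from the extracted name pairs could look like the following minimal sketch. The function name `build_alphabets` and the two sample pairs are illustrative only; the actual extraction code is not shown in the talk.

```python
def build_alphabets(name_pairs):
    """Collect the distinct characters seen on each side of the name pairs."""
    source_chars, target_chars = set(), set()
    for source_name, target_name in name_pairs:
        source_chars.update(source_name)
        target_chars.update(target_name)
    return sorted(source_chars), sorted(target_chars)

# Two illustrative pairs; the training set in the talk has 6596 pairs.
sample_pairs = [("maryland", "мэриленд"), ("louisiana", "луизиана")]
en_alphabet, ru_alphabet = build_alphabets(sample_pairs)
print(len(en_alphabet), en_alphabet)
print(len(ru_alphabet), ru_alphabet)
```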

Experimental Setup

• Test data

  – Five online English Wikipedia articles, titled “Abraham Lincoln”, “Dennis Bergkamp”, “Guus Hiddink”, “Marat Safin”, and “Yoweri Museveni”, and their corresponding Russian Wikipedia articles (via inter-language links) were identified

  – Transliteration candidates were then extracted from the article pairs and prepared as input to the pair-HMM. In total, 2101 English candidates and 1310 Russian candidates were used

  – Matching transliterations from each article pair were hand-picked to form a gold standard set of 200 transliteration pairs, a third of which were variations

Experiment

• Testing the pair-HMM

  – Two methods were investigated for computing the similarity scores: the forward algorithm and the forward log-odds algorithm

  – The forward log-odds algorithm normalizes the score computed by the forward algorithm by comparing it to a random model

• Results

  – The measure used for evaluating the pair-HMM is precision, p:

$$p = \frac{\text{number of correct matching names returned for names in the test set}}{\text{total number of names } (N) \text{ in the test set}}$$
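The two scoring methods named on this slide can be sketched as below. This is a textbook-style pair-HMM forward recursion under stated assumptions, not the implementation behind the reported results: the transition and emission values are toy numbers over a two-letter alphabet, the start is treated like the match state, and the random model (each name generated independently, stopping with probability `eta`) is one common choice that the slides do not specify.

```python
import math

# Toy transition probabilities (hypothetical values; each row sums to 1).
TRANS = {
    "M": {"M": 0.7, "X": 0.1, "Y": 0.1, "END": 0.1},
    "X": {"M": 0.6, "X": 0.2, "Y": 0.1, "END": 0.1},
    "Y": {"M": 0.6, "X": 0.1, "Y": 0.2, "END": 0.1},
}
# Toy emissions: Latin a/n paired with Cyrillic а/н (hypothetical values).
MATCHES = set(zip("an", "ан"))

def p_match(a, b):
    """Pair emission in state M; the four possible pairs sum to 1."""
    return 0.4 if (a, b) in MATCHES else 0.1

def q_gap(c):
    """Single-character emission in states X and Y and in the random model."""
    return 0.5

def forward(x, y):
    """Probability of the pair (x, y) summed over all alignments."""
    n, m = len(x), len(y)
    f = {s: [[0.0] * (m + 1) for _ in range(n + 1)] for s in "MXY"}
    f["M"][0][0] = 1.0  # the start behaves like the match state
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                f["M"][i][j] = p_match(x[i - 1], y[j - 1]) * sum(
                    TRANS[s]["M"] * f[s][i - 1][j - 1] for s in "MXY")
            if i > 0:
                f["X"][i][j] = q_gap(x[i - 1]) * sum(
                    TRANS[s]["X"] * f[s][i - 1][j] for s in "MXY")
            if j > 0:
                f["Y"][i][j] = q_gap(y[j - 1]) * sum(
                    TRANS[s]["Y"] * f[s][i][j - 1] for s in "MXY")
    return sum(TRANS[s]["END"] * f[s][n][m] for s in "MXY")

def random_model(x, y, eta=0.1):
    """Null model: both strings are generated independently of each other."""
    px = eta * math.prod((1 - eta) * q_gap(c) for c in x)
    py = eta * math.prod((1 - eta) * q_gap(c) for c in y)
    return px * py

def forward_log_odds(x, y):
    """Forward score normalized against the random model."""
    return math.log(forward(x, y) / random_model(x, y))

for ru in ("анна", "нана"):
    print(ru, forward("anna", ru), forward_log_odds("anna", ru))
```

With these toy values the correct transliteration of "anna" scores higher than the scrambled one under both methods. The log-odds variant mainly matters when candidates differ in length, because the raw forward probability shrinks with every extra character while the ratio to the random model does not.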

Results

• Table 1 illustrates sample results for precision at the 1st rank

English            Russian matches        Match at 1st rank

arkansas           арканзас               1

jefferson davis    джефферсона дэвиса     0 (2)

william seaward    уильям сьюард          1

lincoln memorial   мемориал линкольна     0 (>100)

maryland           мэриленд               1

louisiana          луизиана               1

Table 1: Sample results for precision at 1st rank
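As a small worked example of the precision measure, the sketch below computes precision at a rank cutoff for the six sample rows of Table 1. It assumes that the number in parentheses is the rank at which the correct match was found, and that precision at the 2nd rank counts a name as correct when its match appears in the top two; neither reading is stated explicitly on the slides. The function name `precision_at_rank` is hypothetical.

```python
def precision_at_rank(ranks, k):
    """Fraction of test names whose correct match is ranked k or better."""
    hits = sum(1 for rank in ranks if rank is not None and rank <= k)
    return hits / len(ranks)

# Rank of the correct match for each Table 1 row, in table order.
# None marks "lincoln memorial", whose match fell outside the top 100.
sample_ranks = [1, 2, 1, None, 1, 1]

print(precision_at_rank(sample_ranks, 1))
print(precision_at_rank(sample_ranks, 2))
```

On this sample, four of the six names match at the 1st rank and five of the six within the first two ranks.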

Results

Algorithm          p (1st rank)   p (2nd rank)

Forward            0.755          0.875

Forward log-odds   0.760          0.900

Table 2: Precision values for the forward and forward log-odds algorithms, N = 200

• The forward log-odds algorithm performs slightly better than the forward algorithm (0.760 vs. 0.755 at the 1st rank, 0.900 vs. 0.875 at the 2nd rank)

• The forward log-odds algorithm did well even for differently ordered named entities at higher ranks

Conclusion

• The precision values show a promising application of the pair-HMM to extracting bilingual named entities, most importantly for languages that use different alphabets

• Other structures of the pair-HMM need to be evaluated, for example by increasing or reducing the number of nodes in the model, to determine whether there are improvements

• For named entities that are ordered differently in the two languages, a more sophisticated model or an added component for handling the reordering will be required

Thanks!

Questions?

References

• Maike Erdmann, Kotaro Nakayama, Takahiro Hara, and Shojiro Nishio. 2008. An Approach for Extracting Bilingual Terminology from Wikipedia. In J.R. Haritsa, R. Kotagiri, and V. Pudi (Eds.): DASFAA 2008, LNCS 4947, pp. 380-392.

• S.F. Adafre and M. de Rijke. 2006. Finding Similar Sentences across Multiple Languages in Wikipedia. In Proceedings of the EACL Workshop on NEW TEXT: Wikis and blogs and other dynamic text sources, pp. 62-69.

• G. Bouma, I. Fahmi, J. Mur, G. van Noord, L. van der Plas, and J. Tiedemann. 2006. The University of Groningen at QA@CLEF 2006: Using Syntactic Knowledge for QA. In Working Notes for the Cross Language Evaluation Forum Workshop.

• T. Declerck, A.G. Perez, O. Vela, Z. Ganter, and D. Manzano-Macho. Multilingual Lexical Semantic Resources for Ontology Translation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), pp. 1492-1495.

• Jörg Tiedemann. 1999. Automatic Construction of Weighted String Similarity Measures. In EMNLP-VLC, pp. 213-219.

• Grzegorz Kondrak and Bonnie J. Dorr. 2004. Identification of Confusable Drug Names: A New Approach and Evaluation Methodology. In Proceedings of COLING 2004, pp. 952-958.