Identification of bilingual named entities from Wikipedia using a pair Hidden Markov Model


Peter Nabende

Alfa Informatica, Center for Language and Cognition Groningen,

University of Groningen

email: [email protected]

The 19th Meeting of Computational Linguistics in the Netherlands (CLIN 2009)

Introduction

• The growing availability of comparable text on Wikipedia has led to the development of techniques for automatically extracting new bilingual terms, including named entities, from Wikipedia (Erdmann et al., 2008)

• Extracting matching bilingual named entities is important for enriching bilingual dictionaries and for providing lexicons to NLP applications such as machine translation (MT), cross-language information retrieval (CLIR), and cross-language information extraction (CLIE)

• Besides extracting named entity pairs from the Wikipedia link structure (Adafre and de Rijke, 2006; Bouma et al., 2006; Erdmann et al., 2008), unlinked article text in Wikipedia is also a major source of bilingual named entities


Introduction

• A pair-HMM is used to simplify similarity estimation between candidate named entities in languages that use different alphabets, which in turn simplifies the extraction of matching transliterations

• Additional applications of this work include:
– Identification of transliteration pairs for linking in Wikipedia. A third of the pairs from the article pairs on “Yoweri Museveni” were not linked
– Identification of confusable drug names (Kondrak and Dorr, 2004)


Proposed Identification Approach

[Figure: flow diagram. A Source Language (SL) Wikipedia article and its corresponding Target Language (TL) article, connected by an inter-language link → SL candidate named entities and TL candidate named entities → similarity estimation → extract the first n or n% highest-ranking pairs → bilingual named entities]

Fig. 1: Approach for identification of bilingual named entities from Wikipedia
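The ranking and cut-off step of Fig. 1 can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function names `rank_candidate_pairs` and `extract_top` are hypothetical, every SL candidate is assumed to be scored against every TL candidate, and `similarity` stands in for whatever scoring function is used (here, the pair-HMM).

```python
from itertools import product


def rank_candidate_pairs(sl_candidates, tl_candidates, similarity):
    """Score every (SL, TL) candidate pair and sort by descending similarity.

    `similarity` is a placeholder for a scoring function such as a
    pair-HMM score; it takes two strings and returns a number.
    """
    scored = [(similarity(s, t), s, t)
              for s, t in product(sl_candidates, tl_candidates)]
    # sort() is stable, so ties keep their original relative order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


def extract_top(scored, n=None, percent=None):
    """Keep the first n pairs, or the first `percent`% of all pairs."""
    if (n is None) == (percent is None):
        raise ValueError("give exactly one of n or percent")
    if n is None:
        n = int(len(scored) * percent / 100)
    return scored[:n]
```

Scoring all pairs is quadratic in the number of candidates, which is manageable at the scale of a single article pair but may need pruning for larger collections.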

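Similarity Estimation

• Figure 2 shows the pair-HMM used for similarity estimation

[Figure: state diagram with a match state M (emitting a symbol pair xi, yj), gap states X (emitting xi) and Y (emitting yj), and an END state. Transitions: M→X δx; M→Y δy; M→END τm; M→M 1−δx−δy−τm. X→X εx; X→Y λx; X→END τx; X→M 1−εx−τx−λx. Y→Y εy; Y→X λy; Y→END τy; Y→M 1−εy−τy−λy]

Fig. 2: Pair-HMM for similarity estimation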


Similarity Estimation

• Example of an alignment from which the pair-HMM estimates similarity between the English name “jefferson” and its Russian transliteration “джефферсона”

State:    M  Y  M  M  M  M  M  M  M  M  Y
English:  j  -  e  f  f  e  r  s  o  n  -
Russian:  д  ж  е  ф  ф  е  р  с  о  н  а

Fig. 3: Illustration of the alignment of the English name “jefferson” and its Russian transliteration following the pair-HMM concept
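To make the alignment example concrete, the sketch below expands a state sequence into aligned character pairs. It assumes the usual pair-HMM convention that M consumes one character from each string, X consumes one from the source only, and Y consumes one from the target only; the function name `alignment_pairs` is hypothetical.

```python
def alignment_pairs(source, target, states):
    """Expand a pair-HMM state path into (source_char, target_char) pairs.

    None marks a gap: X emits only a source character, Y only a target one.
    """
    i = j = 0
    pairs = []
    for state in states:
        if state == "M":
            pairs.append((source[i], target[j]))
            i += 1
            j += 1
        elif state == "X":
            pairs.append((source[i], None))
            i += 1
        elif state == "Y":
            pairs.append((None, target[j]))
            j += 1
        else:
            raise ValueError(f"unknown state: {state!r}")
    if i != len(source) or j != len(target):
        raise ValueError("state path does not consume both strings")
    return pairs


# The path for "jefferson" / "джефферсона": ж and the final а are emitted
# by Y, i.e. they have no English counterpart.
pairs = alignment_pairs("jefferson", "джефферсона", "MYMMMMMMMMY")
```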

Similarity Estimation Parameters

• Two sets of parameters are estimated for the pair-HMM:
– Transition parameters
– Emission parameters
– Starting parameters are derived from the parameters associated with moving from the match state

• The Baum-Welch algorithm is used for estimating the parameters of the pair-HMM


Experimental Setup

• Training Data
– Language pair: English-Russian
– Source of the data: English Wikipedia database dump downloaded on 8 December 2008
– Training data was extracted using the Wikipedia inter-language links. The search was restricted to person names found in the English Wikipedia infoboxes
– In total, 6596 pairs of English person names and their corresponding Russian representations were extracted
– The alphabets used by the pair-HMM were obtained from the extracted collection of person names for each language: 95 characters for English and 72 characters for Russian
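Deriving the alphabets from the training pairs can be sketched as below. This assumes the alphabet is simply every distinct character occurring in the names (including spaces and punctuation, which may be why the English alphabet has 95 characters); the function name `build_alphabets` is hypothetical.

```python
def build_alphabets(name_pairs):
    """Collect the distinct characters on each side of the training pairs."""
    source_alphabet, target_alphabet = set(), set()
    for source_name, target_name in name_pairs:
        source_alphabet.update(source_name)
        target_alphabet.update(target_name)
    return sorted(source_alphabet), sorted(target_alphabet)


# Two illustrative pairs only; the actual training set has 6596 pairs.
english, russian = build_alphabets([
    ("arkansas", "арканзас"),
    ("maryland", "мэриленд"),
])
```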


Experimental Setup

• Test Data
– Five online English Wikipedia articles, titled “Abraham Lincoln”, “Dennis Bergkamp”, “Guus Hiddink”, “Marat Safin”, and “Yoweri Museveni”, were identified together with their corresponding (inter-language linked) Russian Wikipedia articles
– Transliteration candidates were then extracted from the article pairs and prepared as input to the pair-HMM. In total, 2101 English candidates and 1310 Russian candidates were used
– Matching transliterations from each article pair were hand-picked to form a gold standard set of 200 transliteration pairs, a third of which were variations


Experiment

• Testing the pair-HMM
– Two methods were investigated for computing the similarity scores: the forward algorithm and the forward log-odds algorithm
– The forward log-odds algorithm normalizes the estimate made by the forward algorithm by comparing it to a random model
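The two scoring methods can be sketched as follows for the three-state model of Fig. 2. This is an illustration under stated assumptions, not the code used in the experiments: the `PairHMM` container and its dict-based emission tables are hypothetical, the start is treated like the match state, and the random model (independent character emissions with a stop probability `eta`) is a common textbook choice that the slides do not specify. Plain probabilities are used for readability; long strings would call for log-space or scaling.

```python
import math
from dataclasses import dataclass


@dataclass
class PairHMM:
    delta_x: float  # M -> X
    delta_y: float  # M -> Y
    tau_m: float    # M -> END
    eps_x: float    # X -> X
    lam_x: float    # X -> Y
    tau_x: float    # X -> END
    eps_y: float    # Y -> Y
    lam_y: float    # Y -> X
    tau_y: float    # Y -> END
    emit_m: dict    # (source_char, target_char) -> probability
    emit_x: dict    # source_char -> probability
    emit_y: dict    # target_char -> probability


def forward(hmm, x, y):
    """Sum the probabilities of all alignments of x and y that reach END."""
    n, m = len(x), len(y)
    f_m = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_x = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_y = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_m[0][0] = 1.0  # assumption: the start behaves like the match state
    m_to_m = 1.0 - hmm.delta_x - hmm.delta_y - hmm.tau_m
    x_to_m = 1.0 - hmm.eps_x - hmm.lam_x - hmm.tau_x
    y_to_m = 1.0 - hmm.eps_y - hmm.lam_y - hmm.tau_y
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # M emits the pair (x[i-1], y[j-1])
                f_m[i][j] = hmm.emit_m.get((x[i - 1], y[j - 1]), 0.0) * (
                    m_to_m * f_m[i - 1][j - 1]
                    + x_to_m * f_x[i - 1][j - 1]
                    + y_to_m * f_y[i - 1][j - 1])
            if i > 0:  # X emits x[i-1] only
                f_x[i][j] = hmm.emit_x.get(x[i - 1], 0.0) * (
                    hmm.delta_x * f_m[i - 1][j]
                    + hmm.eps_x * f_x[i - 1][j]
                    + hmm.lam_y * f_y[i - 1][j])
            if j > 0:  # Y emits y[j-1] only
                f_y[i][j] = hmm.emit_y.get(y[j - 1], 0.0) * (
                    hmm.delta_y * f_m[i][j - 1]
                    + hmm.eps_y * f_y[i][j - 1]
                    + hmm.lam_x * f_x[i][j - 1])
    return (hmm.tau_m * f_m[n][m] + hmm.tau_x * f_x[n][m]
            + hmm.tau_y * f_y[n][m])


def random_model(q_x, q_y, eta, x, y):
    """Probability of x and y being generated independently (assumed form)."""
    p = eta * eta
    for a in x:
        p *= (1.0 - eta) * q_x[a]  # KeyError if a is outside the alphabet
    for b in y:
        p *= (1.0 - eta) * q_y[b]
    return p


def forward_log_odds(hmm, q_x, q_y, eta, x, y):
    """Log ratio of the forward probability to the random-model probability."""
    p_model = forward(hmm, x, y)
    if p_model == 0.0:
        return float("-inf")
    return math.log(p_model / random_model(q_x, q_y, eta, x, y))
```

The log-odds score is positive when the pair-HMM explains the pair better than chance, which makes scores more comparable across names of different lengths than the raw forward probability.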

• Results
– The measure used for evaluating the pair-HMM is precision, p:

$$p = \frac{\text{number of correct matching names returned for names in the test set}}{\text{total number of names } (N) \text{ in the test set}}$$
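The precision measure can be sketched in code as below. It assumes that "precision at rank k" counts a name as correct when a gold-standard match appears among its top k returned candidates; the function name `precision_at_rank` and the dictionary layout are hypothetical.

```python
def precision_at_rank(ranked_matches, gold, k=1):
    """Fraction of test names with a correct match in the top k candidates.

    ranked_matches: name -> list of returned candidates, best first
    gold:           name -> set of correct matches (the N test names)
    """
    if not gold:
        raise ValueError("gold standard is empty")
    correct = sum(
        1 for name, answers in gold.items()
        if any(c in answers for c in ranked_matches.get(name, [])[:k])
    )
    return correct / len(gold)


# Toy data: the first candidate for "jefferson davis" is a made-up distractor.
gold = {"arkansas": {"арканзас"}, "jefferson davis": {"джефферсона дэвиса"}}
ranked = {
    "arkansas": ["арканзас"],
    "jefferson davis": ["<distractor>", "джефферсона дэвиса"],
}
p1 = precision_at_rank(ranked, gold, k=1)  # 0.5
p2 = precision_at_rank(ranked, gold, k=2)  # 1.0
```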

Results

• Table 1 illustrates sample results for precision at the 1st rank

English            Russian matches       Match at 1st rank
arkansas           арканзас              1
jefferson davis    джефферсона дэвиса    0(2)
william seaward    уильям сьюард         1
lincoln memorial   мемориал линкольна    0(>100)
maryland           мэриленд              1
louisiana          луизиана              1

Table 1: Sample results for precision at 1st rank

Results

Algorithm          p (1st rank)   p (2nd rank)
Forward            0.755          0.875
Forward log-odds   0.760          0.900

Table 2: Precision values for the forward and forward log-odds algorithms, N = 200

• The forward log-odds algorithm performs slightly better than the forward algorithm at both ranks (0.760 vs 0.755 at the 1st rank, 0.900 vs 0.875 at the 2nd rank)

• The forward log-odds algorithm did well even for differently ordered named entities at higher ranks
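As a back-calculation from the precision definition, and assuming each value in Table 2 was computed over all N = 200 gold-standard pairs, the precision values correspond to the following counts of correctly matched names:

```latex
\begin{align*}
\text{Forward, 1st rank:}          &\quad 0.755 \times 200 = 151 \\
\text{Forward log-odds, 1st rank:} &\quad 0.760 \times 200 = 152 \\
\text{Forward, 2nd rank:}          &\quad 0.875 \times 200 = 175 \\
\text{Forward log-odds, 2nd rank:} &\quad 0.900 \times 200 = 180
\end{align*}
```

Under that assumption, the log-odds normalization gains one name at the 1st rank and five names at the 2nd rank.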

Conclusion

• The precision values show that the pair-HMM is promising for extracting bilingual named entities, most importantly for languages that use different alphabets

• Other structures of the pair-HMM need to be evaluated, for example by increasing or reducing the number of nodes in the model, to determine whether they bring improvements

• Named entities that are ordered differently in the two languages will require a more sophisticated model or an added component that handles the reordering


THANKS !

Questions ?


References

• Maike Erdmann, Kotaro Nakayama, Takahiro Hara, and Shojiro Nishio. 2008. An Approach for Extracting Bilingual Terminology from Wikipedia. In J.R. Haritsa, R. Kotagiri, and V. Pudi (Eds.): DASFAA 2008, LNCS 4947, pp. 380-392.

• S.F. Adafre and M. de Rijke. 2006. Finding Similar Sentences across Multiple Languages in Wikipedia. In Proceedings of the EACL Workshop on NEW TEXT: Wikis and blogs and other dynamic text sources, pp. 62-69.

• G. Bouma, I. Fahmi, J. Mur, G. van Noord, L. van der Plas, and J. Tiedemann. 2006. The University of Groningen at QA@CLEF 2006: Using Syntactic Knowledge for QA. In Working Notes for the Cross Language Evaluation Forum Workshop.

• T. Declerck, A.G. Perez, O. Vela, Z. Ganter, and D. Manzano-Macho. Multilingual Lexical Semantic Resources for Ontology Translation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), pp. 1492-1495.

• Jörg Tiedemann. 1999. Automatic Construction of Weighted String Similarity Measures. In EMNLP-VLC, pp. 213-219.

• Grzegorz Kondrak and Bonnie J. Dorr. 2004. Identification of Confusable Drug Names: A New Approach and Evaluation Methodology. In Proceedings of COLING 2004, pp. 952-958.

