Eurovoc does not yet exist for your language? The Hungarian experience.
Tamás Váradi
JRC Ispra, 17/09/2004, EUROVOC Indexing Workshop
Overview of the project
• Objectives
• Partners
• Resources
• Methods
• Results
• Conclusions
Project objectives
• Hungarian EUROVOC version
– only a draft version planned at first
– an authoritative full-scale system
• Automatic indexing of documents
– using the technology developed at JRC
– prototype system for one domain
Partners
• Project consortium:
– HAS RIL (coordinator)
– MorphoLogic Kft. (partner)
• Collaborators:
– JRC, Ispra
– Hungarian Parliament
– Ministry of Justice
Resources
• NLP toolset (RIL)
• Digital dictionaries, software technology (MorphoLogic)
• Indexing technology (JRC Ispra)
• Terminology database, translation, supervision expertise (Justice Ministry)
• Coordination funding of Hungarian EUROVOC (Hungarian Parliament)
EUROVOC translation
• Done by the Translation Coordination Unit of the Ministry of Justice
• The team coordinates the massive effort of preparing the Hungarian translation of the Acquis Communautaire
• It also maintains an online Terminological Database
Terminological Database
Translation process
• English, French, German and Spanish EUROVOC versions in XML files
• Automatic lookup in the Terminological Database (ca. 20% coverage)
• Notepad2, an XML-aware editor, used for the manual work
• Microthesauri translated first, corresponding descriptors second
• pool of experts consulted when needed
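The automatic lookup step can be pictured roughly as follows. This is only a sketch: the `LIBELLE_EN` element name follows the thesaurus records shown on later slides, but the wrapper elements, the placeholder terms and the in-memory `TERM_DB` dictionary are assumptions made for illustration, not the project's actual files or database.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample file: only the LIBELLE_EN element name comes from the
# slides; the <thesaurus>/<descriptor> wrappers and placeholder terms are assumed.
SAMPLE_XML = """
<thesaurus>
  <descriptor><LIBELLE_EN>suspension of payments</LIBELLE_EN></descriptor>
  <descriptor><LIBELLE_EN>sales promotion</LIBELLE_EN></descriptor>
  <descriptor><LIBELLE_EN>toxic substance</LIBELLE_EN></descriptor>
  <descriptor><LIBELLE_EN>placeholder term 1</LIBELLE_EN></descriptor>
  <descriptor><LIBELLE_EN>placeholder term 2</LIBELLE_EN></descriptor>
</thesaurus>
"""

# Hypothetical stand-in for the online Terminological Database.
TERM_DB = {"toxic substance": "mérgező anyagok"}


def pretranslate(xml_text, term_db):
    """Return (translated, pending): terms found in the database and
    terms that still need a human translator."""
    root = ET.fromstring(xml_text)
    translated, pending = {}, []
    for node in root.iter("LIBELLE_EN"):
        term = (node.text or "").strip()
        hit = term_db.get(term.lower())
        if hit is not None:
            translated[term] = hit
        else:
            pending.append(term)
    return translated, pending


def coverage(translated, pending):
    """Share of descriptors that the automatic lookup could fill in."""
    total = len(translated) + len(pending)
    return len(translated) / total if total else 0.0


translated, pending = pretranslate(SAMPLE_XML, TERM_DB)
print(f"coverage: {coverage(translated, pending):.0%}")
```

The `pending` list is what would go to the translators and, where needed, to the pool of experts.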
Indexing strategies
• Corpus: Hungarian translation of the Acquis Communautaire
• Two approaches
1. To translate English associate terms (possible short-cut?)
2. To reconstruct the generation of associate terms by running the JRC technology on the Hungarian data
Translation of associate terms
• Hypothesis:
– the relation between an English associate term and its EUROVOC descriptor is language independent
– hence the Hungarian equivalent of the English term will also serve as an appropriate associate term in Hungarian texts
Online dictionary lookup
• MorphoLogic Online English-Hungarian dictionaries applied
• 24.7% direct match
<LIBELLE_EN>suspension of payments</LIBELLE_EN>
<LIBELLE_DE>Zahlungseinstellung</LIBELLE_DE>
<LIBELLE_FR>cessation de paiement</LIBELLE_FR>
<LIBELLE_ES>suspensión de pagos</LIBELLE_ES>
<LIBELLE_HU>kifizetések felfüggesztése</LIBELLE_HU>
Manual check of automatic assignments
• Equivalence cannot be judged on its own merits: the Hungarian equivalent must be the one occurring in the texts
– so the Hungarian terms must be looked up in the translation corpus as well
– so a parallel corpus aligned at least at the document level must be compiled
Manual check
<LIBELLE_EN>sales promotion</LIBELLE_EN>
<LIBELLE_DE>Absatzförderung</LIBELLE_DE>
<LIBELLE_FR>promotion commerciale</LIBELLE_FR>
<LIBELLE_ES>promoción comercial</LIBELLE_ES>
<LIBELLE_HU>eladásösztönzés</LIBELLE_HU>
• Even frequency lists are useful:
Reklám 149
Promóció 60
Eladásösztönzés 1
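A frequency check of this kind could be sketched as below. The function name, the toy sentences and the prefix-matching trick are assumptions for illustration. Matching on the word start is only a crude stand-in for proper lemmatisation: Hungarian suffixes follow the stem, so a prefix match catches most inflected forms, but it also over-matches derived words.

```python
import re
from collections import Counter


def candidate_frequencies(texts, candidates):
    """Count how often each candidate translation occurs in the corpus texts.

    Matching is case-insensitive and anchored at the start of a word, so
    suffixed forms of a candidate are counted too (a rough approximation
    of lemma-based counting).
    """
    counts = Counter({cand: 0 for cand in candidates})
    for text in texts:
        lowered = text.lower()
        for cand in candidates:
            pattern = r"\b" + re.escape(cand.lower())
            counts[cand] += len(re.findall(pattern, lowered))
    return counts


# Toy corpus (invented sentences, not taken from the Acquis).
texts = [
    "A reklám és a promóció szabályai.",
    "Reklámok a televízióban.",
]
candidates = ["reklám", "promóció", "eladásösztönzés"]

for term, freq in candidate_frequencies(texts, candidates).most_common():
    print(term, freq)
```

A dictionary equivalent that hardly ever occurs in the corpus, as with eladásösztönzés in the list above, is a poor associate term even if it is a correct translation.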
Manual check
<LIBELLE_EN>toxic substance</LIBELLE_EN>
<LIBELLE_DE>Giftstoff</LIBELLE_DE>
<LIBELLE_FR>substance toxique</LIBELLE_FR>
<LIBELLE_ES>sustancia tóxica</LIBELLE_ES>
<LIBELLE_HU>toxikus anyagok</LIBELLE_HU>
<LIBELLE_HU>mérgező anyagok</LIBELLE_HU>
• Even frequency lists are useful: here the two Hungarian variants are equally frequent
Generation of Hungarian associate-lists
• Tasks
1. Compile corpus of Hungarian translation of the Acquis Communautaire
2. Tag and lemmatize words
3. Compile list of stop words
4. Run automatic indexing tools (JRC)
Hungarian Acquis Communautaire corpus
• 8308 files
<!ELEMENT document (title+,text,lemmatised, descriptors,description) >
• HUN tokens: 21,899,924
• EN tokens: 20,394,088
English stop-word list
• English stop-word list: 1720 items
– function words
– "EUspeak": objective, arrangements, committee
– some strange multiword strings:
necessary_to_comply_with_this_directive
forward_this_resolution_to_the_commission
Hungarian stop-word list
1. translated the English items
2. checked their occurrence in HU CELEX
3. generated unigram, bigram and trigram frequency lists from the HU CELEX corpus
4. checked the first 3000 items on each list and added them to the stop-word list if needed
5. double-checked infrequent items on the English translation list and replaced the translations with synonyms
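Steps 3 and 4 can be sketched as follows. The function names and the toy token lists are assumptions; joining multiword items with underscores simply mirrors the multiword strings in the English stop-word list. The sketch only produces the candidate lists. Deciding which candidates become stop words remains a manual step, as in the procedure above.

```python
from collections import Counter


def ngram_frequencies(tokenised_docs, n):
    """Frequency list of n-grams; multiword items are joined with '_'."""
    counts = Counter()
    for tokens in tokenised_docs:
        for i in range(len(tokens) - n + 1):
            counts["_".join(tokens[i:i + n])] += 1
    return counts


def stopword_candidates(tokenised_docs, top_k=3000):
    """Top-k unigrams, bigrams and trigrams, for manual inspection."""
    return {
        n: [gram for gram, _ in
            ngram_frequencies(tokenised_docs, n).most_common(top_k)]
        for n in (1, 2, 3)
    }


# Toy input: each document is a list of (lemmatised) tokens.
docs = [["a", "b", "a", "b"], ["a", "b"]]
print(stopword_candidates(docs, top_k=2))
```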
Hungarian stop-word list
single word entries 1265
multi-word entries 752
Total 2017
Automatic indexing run 1
7971 texts (total length 65,702,474 characters) divided into 3 sets:
1. 202 texts for optimisation (evaluation set)
2. 179 texts for final evaluation (test set)
3. 7590 texts for training (training set)
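The slides do not say how the three sets were drawn. One common way to get a split of these sizes is a seeded random shuffle, sketched below; the function name, the seed and the use of integer document IDs are assumptions.

```python
import random


def split_corpus(doc_ids, n_dev=202, n_test=179, seed=0):
    """Split document IDs into optimisation, test and training sets.

    Assumption: a seeded random shuffle; the actual selection procedure
    used in the project is not described in the slides.
    """
    ids = list(doc_ids)
    if len(ids) < n_dev + n_test:
        raise ValueError("not enough documents for the requested split")
    random.Random(seed).shuffle(ids)
    dev = ids[:n_dev]
    test = ids[n_dev:n_dev + n_test]
    train = ids[n_dev + n_test:]
    return dev, test, train


dev, test, train = split_corpus(range(7971))
print(len(dev), len(test), len(train))
```

Fixing the seed keeps the held-out sets stable between runs, so that later tuning is always measured against the same documents.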
Precision/recall in terms of number of Eurovoc descriptors
Rank  Precision  Recall  Prec RT  Rec RT  F1-measure
1     80.000     16.286  82.857   17.238  27.063
2     67.143     25.143  77.143   28.571  36.586
3     63.810     32.857  75.714   39.238  43.378
4     59.048     38.286  70.476   43.714  46.453
5     57.762     44.095  70.048   50.190  50.012
6     55.571     47.524  68.333   53.143  51.233
7     52.170     48.476  65.408   54.095  50.255
8     49.976     49.905  62.857   55.524  49.940
9     48.587     51.810  62.143   57.905  50.147
10    46.619     52.286  60.143   58.381  49.290
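For a single document, precision and recall at a given rank can be computed as in the sketch below. The function names and the descriptor IDs are hypothetical, and how the figures were averaged over the evaluation documents is not stated in the slides. The F1 column, however, agrees with the harmonic mean of the Precision and Recall columns, which the last lines check for rank 1.

```python
def precision_recall_at_rank(ranked, gold, k):
    """Precision and recall of the top-k proposed descriptors for one document.

    ranked: descriptors proposed by the indexer, best first
    gold:   set of descriptors assigned by human indexers
    """
    top = ranked[:k]
    hits = sum(1 for descriptor in top if descriptor in gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical descriptor IDs for one document.
ranked = ["d1", "d2", "d3", "d4"]
gold = {"d1", "d3", "d5"}
print(precision_recall_at_rank(ranked, gold, k=2))

# Rank 1 of the table: P = 80.000, R = 16.286
print(round(f1(80.000, 16.286), 3))
```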
Evaluation in terms of rank
Precision/Recall graph
Conclusions
• The first run already yields results comparable to those for other languages
• There is scope for fine-tuning and filtering
• It will be interesting to compare the results gained from the two approaches