Bilingual terminology mining


Material of the 4th Intensive Summer School and Collaborative Workshop on Natural Language Processing (NAIST Franco-Thai Workshop 2010), Bangkok, Thailand.


1

Bilingual Terminology Mining

Estelle Delpech, 30th November 2010

4th intensive summer school on Natural Language Processing

2

About me

● Estelle Delpech

● Research engineer at Lingua et Machina, France

● CAT tools provider
● ed(at)lingua-et-machina(dot)com
● www.lingua-et-machina.com

● Ph.D. candidate at LINA, France
● TALN team: specialises in NLP
● estelle.delpech(at)univ-nantes(dot)fr

3

Presentation outline

● About terms, terminology, terminology mining

● Term Extraction
● Term Alignment

4

Presentation outline

● About terms, terminology, terminology mining

● Term Extraction
● Term Alignment

5

What is a term ?

● Classical definition :
● “unequivocal expression of a concept within a technical domain”
● Traces back to 1930 : Eugen Wüster's « General Theory of Terminology »
● Specialized language is / should be unambiguous

[Figure: Ogden's semiotic triangle linking concept, term and referent]

6

What is a term ?

● Classical terminology challenged in the 1990s by :
● sociolinguistics
● corpus-based linguistics
● computational terminology

● Observe terms in texts :
● there is variation, polysemy
● concepts evolve over time
● no clear-cut border between specialized and general languages

7

What is a term ?

● Definition of « term » depends on the application / audience of the terminology

● Domain expert :
● unit of knowledge

● Information retrieval :
● descriptors for indexation

● Translation :
● word or phrase that :
● is not part of general language
● translates differently in a particular domain
● can be :
● noun, adjective, verb
● noun phrase, verb phrase, etc.

8

What is a terminology ?

● Set of terms + terminological records
● Terminological record :
● part of speech
● frequency
● variants
● contexts

● Relations between terms / concepts :
● hypernymy : cat is a sort of animal
● meronymy : head is part of body

● Bilingual terminology :
● translation relations

9

http://www.termiumplus.gc.ca/

10

Where do you find terms ?

● In specialized texts :
● research papers on breast cancer
● plane crash reports

● Corpora building :
● important to gather texts from a well-defined domain / theme

11

Bilingual terminology mining (1)

[Diagram: specialized texts → term extraction (data mining) → terms → term alignment → bilingual terminology → terminology management software]

12

Bilingual terminology mining (2)

[Diagram: specialized texts → synchronized term extraction and alignment → bilingual terminology → terminology management software]

13

Presentation outline

● About terms, terminology, terminology mining

● Term Extraction
● Term Alignment

Term extraction : semi-supervised process

● The notion of term is « slippery »
● The same lexical unit may or may not be considered as a term depending on :
● audience
● domain
● application

● Term extractors extract candidate terms :
● frequent in texts of a given domain : HER2 gene
● look like terms (well-formed phrases) : human cell lines
● groups of words that frequently occur together : to compile a program

L'Homme, 2004

Term extraction : semi-supervised, lexico-semantic process

[Diagram: specialized texts → term extractor → candidate terms → manual selection → terms → terminology; the selected terms also feed automatic indexing, linking texts, terms and concepts]

Termhood clues (1) : Frequency

● Term occurs frequently in specialized texts
● the higher, the better ?

● Comparison with general language :
● does the term occur more frequently than expected in general language ?

● Compute significance tests :
● ex : χ² (chi-square)

L'Homme, 2004
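A minimal sketch of such a significance test, with invented counts (the candidate term, the corpus sizes and the use of SciPy are assumptions, not from the slides):

```python
from scipy.stats import chi2_contingency

# Invented counts: occurrences of a candidate term vs. all other
# tokens, in a specialized corpus and in a general-language corpus.
table = [[120, 499_880],   # specialized corpus: term, other tokens
         [  3, 899_997]]   # general corpus:     term, other tokens

chi2, p_value, dof, expected = chi2_contingency(table)
# A large chi2 (small p-value) suggests the term occurs far more often
# in the specialized corpus than general-language usage would predict.
print(chi2, p_value)
```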

17

Termhood clues (2) : form

● A term is a well-formed phrase
● ...HER2/neu oncogenes are members of...

● Match morpho-syntactic patterns
● ex : NOUN + NOUN

● Many patterns :
● NOUN PREP DET NOUN : alternation of the gene
● NOUN PREP NOUN COORD ADJ NOUN : susceptibility to breast and ovarian cancer
● NOUN NOUN NOUN NOUN NOUN : human breast cancer cell lines

Termhood clues (2) : form

● Preprocessing :
● tokenization
● lemmatisation
● POS tagging

… HER-2/neu oncogenes are members of …

Token:  HER-2/neu   oncogenes   are    members   of
POS:    NOUN        NOUN        VERB   NOUN      PREP
Lemma:  HER-2/neu   oncogene    be     member    of

Identification of Syntactic Patterns

● Patterns expressed as regular expressions / finite-state automata

[Automaton: START → NOUN, with an optional loop through PREP back to NOUN]

NOUN (PREP? NOUN)?

● NOUN : gene
● NOUN NOUN : HER2 gene
● NOUN PREP NOUN : member of family
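A minimal sketch of this pattern matching, assuming an already tagged and lemmatised sentence (the one-letter tag encoding is an implementation convenience, not from the slides):

```python
import re

# Toy tagged sentence: (lemma, POS) pairs from the slide's example.
tagged = [("HER-2/neu", "NOUN"), ("oncogene", "NOUN"), ("be", "VERB"),
          ("member", "NOUN"), ("of", "PREP"), ("family", "NOUN")]

# Encode each POS tag as one letter so patterns become plain regexes.
letter = {"NOUN": "N", "VERB": "V", "PREP": "P"}
tags = "".join(letter.get(pos, "X") for _, pos in tagged)  # "NNVNPN"

# The slide's pattern NOUN (PREP? NOUN)? over the encoded tag string.
pattern = re.compile(r"N(?:P?N)?")

for m in pattern.finditer(tags):
    print(" ".join(lemma for lemma, _ in tagged[m.start():m.end()]))
# -> HER-2/neu oncogene
# -> member of family
```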

Termhood clues (3) : word association

● Significant cooccurrences are good clues for termhood :
● … breast cancer …
● … breast remains …
● … alternative cancer …

● Must take into account :
● number of times the two words cooccur
● number of times word A occurs
● number of times word B occurs

Measure for cooccurrence significance

● Mutual Information :

MI(a, b) = log2( P(a, b) / (P(a) · P(b)) )

P(a, b) = nbocc(a, b) / N
P(a) = nbocc(a) / N
N = total number of words in the corpus

● Remarkable attraction between invasive and carcinoma despite the relatively low number of cooccurrences

Church and Hanks, 1990
L'Homme, 2004

                     cooc.(A,B)   occ. A   occ. B    MI
invasive carcinoma       20          30       20     9.7
cancer means             50         800      800     1.69
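The measure translates directly to code; a sketch with the slide's counts (the corpus size N is invented, since the slides do not give it, so the second value will not reproduce the slide's MI exactly):

```python
from math import log2

def mutual_information(n_ab, n_a, n_b, n):
    """Pointwise mutual information of words a and b (Church and Hanks, 1990)."""
    return log2((n_ab / n) / ((n_a / n) * (n_b / n)))

N = 25_000  # invented corpus size
print(mutual_information(20, 30, 20, N))    # invasive / carcinoma: high (~9.7)
print(mutual_information(50, 800, 800, N))  # cancer / means: low
```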

22

Presentation outline

● About terms, terminology, terminology mining

● Term Extraction
● Term Alignment

23

Presentation outline

● About terms, terminology, terminology mining

● Term Extraction
● Term Alignment
● in parallel corpora
● in comparable corpora

24

Parallel and comparable corpora

● Parallel corpora
● source and target texts are translations
● reduce the search space little by little :
● first sentences
● then terms

● Comparable corpora
● not translations, but very similar in topic
● good proportion of term translations
● search space : all terms of the target corpus

25

Sentence alignment (1)

● Gale and Church (1993)'s hypothesis :
● translated sentences have roughly the same length
● the probability P(S,T) that sentence S translates into T is based on the length difference

● Improvements : use a seed lexicon
● P(S,T) is based on the number of words in common

Gale and Church, 1993

26

Sentence alignment (2)

● Compute probabilities for all pairs (S,T)
● Build a matrix where M(i,j) contains the probability that sentence i translates to sentence j

Gale and Church, 1993

        0      1      2     ...    n
0      0.89   0.56   0.2    ...   ...
1      0.45   0.9    0.1    ...   ...
2      ...    0.23   0.9    0.3   ...
...    ...    ...    0.44   0.76  ...
m      ...    ...    ...    ...   0.88

27

Sentence alignment (2)

● Use dynamic programming to find the best “path”, i.e. the best alignments

Gale and Church, 1993

        0      1      2     ...    n
0      0.89   0.56   0.2    ...   ...
1      0.45   0.9    0.1    ...   ...
2      ...    0.23   0.9    0.3   ...
...    ...    ...    0.44   0.76  ...
m      ...    ...    ...    ...   0.88
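A toy dynamic-programming pass in the spirit of Gale and Church (1993): it scores 1-1 pairings by length difference only (the real model uses a probabilistic length ratio and also allows 1-2 and 2-1 merges), then follows back-pointers to recover the best path:

```python
def align(src_lens, tgt_lens):
    """Align sentences given their lengths; returns (i, j) index pairs."""
    n, m = len(src_lens), len(tgt_lens)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            moves = []
            if i < n and j < m:   # 1-1 alignment: pay the length difference
                moves.append((i + 1, j + 1, abs(src_lens[i] - tgt_lens[j])))
            if i < n:             # leave a source sentence unaligned
                moves.append((i + 1, j, src_lens[i]))
            if j < m:             # leave a target sentence unaligned
                moves.append((i, j + 1, tgt_lens[j]))
            for ni, nj, c in moves:
                if cost[i][j] + c < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + c
                    back[ni][nj] = (i, j)
    path, cell = [], (n, m)
    while cell != (0, 0):          # walk back-pointers from the end
        prev = back[cell[0]][cell[1]]
        if prev == (cell[0] - 1, cell[1] - 1):   # keep only 1-1 pairings
            path.append((cell[0] - 1, cell[1] - 1))
        cell = prev
    return list(reversed(path))

# Sentence lengths (in words) of a toy source and target text.
print(align([20, 35, 12], [22, 40, 11]))  # -> [(0, 0), (1, 1), (2, 2)]
```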

28

Sub-sentential alignment : AnyMalign (Lardilleux, 2010)

Lardilleux et al., 2010

● AnyMalign is a sub-sentential aligner
● aligns words and groups of words for MT translation tables
● aligned groups of words :
● are more or less like statistical collocations
● possible to find term patterns in these groups of words

29

AnyMalign (Lardilleux, 2010)

Lardilleux et al., 2010

● The algorithm is based on « perfect alignments » :
● words or groups of words that occur in exactly the same aligned sentences

a d ↔ A D
a e ↔ A DD
b ↔ B
b ↔ C

a ↔ A is a perfect alignment : a and A occur in exactly the same sentence pairs

30

AnyMalign (Lardilleux, 2010)

Lardilleux et al., 2010

● How to get more « perfect alignments » ? With smaller corpora
● How to get smaller corpora ? Randomly select subcorpora from your corpora

a d ↔ A D
a e ↔ A DD
b ↔ B
b ↔ C

[Figure: two random subcorpora drawn from the sentence pairs above]
Subcorpus 1 yields : b ↔ B
Subcorpus 2 yields : a ↔ A

31

AnyMalign (Lardilleux, 2010)

Lardilleux et al., 2010

a d ↔ A D
a e ↔ A DD
b ↔ B
b ↔ C

● Complementaries of perfect alignments are likely to be good alignments too :
● perfect alignment : a ↔ A
● complementaries : d ↔ D, e ↔ DD

32

AnyMalign (Lardilleux, 2010)

Lardilleux et al., 2010

● Process : iteratively extract random samples of random size from your corpora

● Extract « perfect alignments » and their complementaries

● The same alignment can occur several times

● Count, for each alignment, the number of times it occurs

33

AnyMalign (Lardilleux, 2010)

Lardilleux et al., 2010

● Output :
● alignments sorted by descending number of occurrences
● alignment probability :

P(S | T) = C(S, T) / C(T)

S = source group of words
T = target group of words
C(S, T) = number of times S was aligned with T
C(T) = number of times T appears in an alignment
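A toy sketch of the sampling loop (single words only, invented data; the real AnyMalign also handles word groups and complementaries):

```python
import random
from collections import Counter

def perfect_alignments(bitext):
    """Source/target words occurring in exactly the same sentence pairs."""
    src_occ, tgt_occ = {}, {}
    for k, (src, tgt) in enumerate(bitext):
        for w in set(src.split()):
            src_occ.setdefault(w, set()).add(k)
        for w in set(tgt.split()):
            tgt_occ.setdefault(w, set()).add(k)
    return [(s, t) for s, ps in src_occ.items()
                   for t, pt in tgt_occ.items() if ps == pt]

def anymalign_lite(bitext, iterations=1000):
    counts = Counter()
    for _ in range(iterations):
        # draw a random subcorpus of random size
        sample = random.sample(bitext, random.randint(1, len(bitext)))
        counts.update(perfect_alignments(sample))
    return counts

bitext = [("a d", "A D"), ("a e", "A DD"), ("b", "B"), ("b", "C")]
counts = anymalign_lite(bitext)
for (s, t), c in counts.most_common(5):
    print(s, "<->", t, c)
# P(S|T) can then be estimated as counts[(S, T)] / sum of counts[(S', T)] over S'
```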

34

AnyMalign (Lardilleux, 2010)

Lardilleux et al., 2010

Advantages :
● can perform alignment with more than 2 languages at the same time
● 1 language → statistical collocations

● extracts and aligns non-contiguous sequences of words :
to give something up
to let someone down

● no a priori expectations on terms :
● sometimes a term in the source language is not translated by a term
● terms = what you can align

35

AnyMalign (Lardilleux, 2010)

Lardilleux et al., 2010

● Word groups are not grammatical phrases :

that sample sentences and
exchange format fitted for the

but not

sample sentences
exchange format

● Solutions :
● find term patterns
● use heuristics
● trim stop words

36

Presentation outline

● About terms, terminology, terminology mining

● Term Extraction
● Term Alignment
● in parallel corpora
● in comparable corpora

37

Advantages of comparable corpora

● More available :
● new languages
● new language pairs
● new topics / domains

● Less expensive to build

● More natural :
● data was produced spontaneously
● no influence from a source text

38

Contextual approach

● Based on distributional linguistics (Z. Harris)

● Words with similar meaning appear in similar contexts

● If source and target words have similar contexts, they might be translations

● Compute contexts for each source and target word

● Compare contexts
● Find the most similar contexts

39

Contextual approach

● Representation of the context of a given word with a vector :
● head word + collocates

[Vector for head word « drink » : collocates water, beer, mouth, glass, …]

● The vector associates the « head » word with its most frequent collocates
● plus some indication of the force of association between the head word and its collocates

40

Building context vector for « drink »

● Collocates : words occurring at a distance of up to n words from the head

is variety of reasons to drink plenty of water each day
simple as a glass of drinking water be the key to the
popular in Japan today to drink water from glass after waking

● (drink, water) = 3
● (drink, glass) = 2
● (drink, Japan) = 1
● (drink, reason) = 1
● (drink, plenty) = 1
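A minimal sketch of this counting, assuming tokenized and lemmatised input (the slide's snippets are lightly lemmatised below; without lemmatisation, « drinking » in the second line would not match the head word):

```python
from collections import Counter

def context_vector(lines, head, window=3):
    """Count collocates within `window` words of each occurrence of `head`."""
    vec = Counter()
    for line in lines:
        toks = line.split()
        for i, tok in enumerate(toks):
            if tok == head:
                # neighbours on both sides of the head word
                vec.update(toks[max(0, i - window):i] + toks[i + 1:i + 1 + window])
    return vec

lines = [
    "variety of reason to drink plenty of water each day",
    "simple as a glass of drink water be the key to the",
    "popular in Japan today to drink water from glass after wake",
]
print(context_vector(lines, "drink").most_common(5))
# water appears 3 times, glass 2 times, matching the slide's counts
```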

41

Normalized cooccurrence frequency

● Ex : log-likelihood ratio
● 1000 cooc. in the corpus
● (drink, x) = 75 cooc.
● (water, y) = 75 cooc.
● (drink, water) = 50 cooc.

            water   ¬ water
drink         50       25      75
¬ drink       25      900     925
              75      925    1000

● Normalization : use a measure like MI or the log-likelihood ratio to counteract the influence of high-frequency words

Dunning, 1993

42

Log-likelihood ratio

● log-likelihood ratio(drink, water) = 45.05

● Contingency table :

            water   ¬ water
drink         a        b       e
¬ drink       c        d       h
              f        g       N

log-likelihood ratio(water, drink) =
  a·log a + b·log b + c·log c + d·log d + N·log N
  − e·log e − f·log f − g·log g − h·log h

Dunning, 1993
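The formula translates directly to code; a sketch using the slide's contingency counts (x·log x is taken as 0 when x = 0, and the absolute value depends on the log base, natural log here):

```python
from math import log

def llr(a, b, c, d):
    """Dunning's (1993) log-likelihood ratio from a 2x2 contingency table."""
    e, h = a + b, c + d          # row totals
    f, g = a + c, b + d          # column totals
    n = a + b + c + d
    def xlx(x):
        return x * log(x) if x > 0 else 0.0
    return (xlx(a) + xlx(b) + xlx(c) + xlx(d) + xlx(n)
            - xlx(e) - xlx(f) - xlx(g) - xlx(h))

# drink/water table from the previous slide: a=50, b=25, c=25, d=900.
print(llr(50, 25, 25, 900))
```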

43

Context vector comparison

● Compute context vectors for words in the source and target corpus

[English vector for head word « drink » : water, beer, mouth, glass, …]
[Thai vector for head word « ดื่ม » (drink) : น้ำ (water), เบียร์ (beer), ปาก (mouth), แก้ว (glass), …]

● How to compare word contexts across different languages ?

Rapp 1995 ; Fung 1997

44

Context vector comparison

● Use a seed lexicon to map collocates

[Figure: the collocates of the English « drink » vector and of the Thai « ดื่ม » vector are mapped to each other through a Thai-English seed lexicon]

Rapp 1995 ; Fung 1997

45

Context vector comparison

● Measuring the context similarity of words a and b
● = measuring the cosine of the angle between the vector of a and the vector of b

cosine(a, b) = Σ_{c ∈ a ∪ b} w(c, a) · w(c, b) / ( √(Σ_{c ∈ a} w(c, a)²) · √(Σ_{c ∈ b} w(c, b)²) )

c ∈ x = collocate in the vector of x
w(c, x) = weight of association of collocate c with head x

● Select the top 1, 10 or 20 closest words as candidate translations

Rapp 1995 ; Fung 1997
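Putting vector translation and comparison together, a sketch with toy weights (the lexicon entries and all numbers are invented; the Thai words follow the earlier slide's example):

```python
from math import sqrt

def translate(vec, lexicon):
    """Map a source vector's collocates into the target language."""
    return {lexicon[c]: w for c, w in vec.items() if c in lexicon}

def cosine(u, v):
    num = sum(w * v.get(c, 0.0) for c, w in u.items())
    den = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0

lexicon = {"water": "น้ำ", "beer": "เบียร์", "glass": "แก้ว"}
drink_vec = {"water": 3.0, "glass": 2.0, "beer": 1.0}

# Candidate target words and their (toy) context vectors.
candidates = {
    "ดื่ม": {"น้ำ": 2.5, "แก้ว": 1.5, "ปาก": 1.0},
    "กิน":  {"น้ำ": 0.5, "ข้าว": 3.0},
}

translated = translate(drink_vec, lexicon)
ranking = sorted(candidates, key=lambda t: cosine(translated, candidates[t]),
                 reverse=True)
for word in ranking[:20]:    # keep the Top-20 candidate translations
    print(word, cosine(translated, candidates[word]))
```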

46

Contextual approach : improvements

● Using syntactic collocates
● Improving the dictionary with cognates, transliterations, other dictionaries
● Giving more weight to « anchor words » :
● cognates, transliterations
● frequent, monosemous words

● Filtering with part-of-speech
● Favoring reciprocal translations

[Diagram: source words a, b, c, d and target words a', b', c', d'; a ↔ a' is kept when each is the closest candidate of the other]

Chiao and Zweigenbaum, 2002 ; Sadat et al., 2003 ; Gamallo and Campos, 2005 ; Koehn and Knight, 2002 ; Prochasson, 2010

47

Variant to direct translation of vector

● « Interlingual » translation
● translate the n closest words instead of the context vector
● seed lexicon : some mappings between source and target words

[Diagram: SOURCE and TARGET vector spaces linked by the seed lexicon]

Déjean and Gaussier, 2002

48

Variant to direct translation of vector

● To translate term T :
● find the n closest words
● these closest words are in the seed lexicon

[Diagram: T's closest source words are located in the seed lexicon]

Déjean and Gaussier, 2002

49

Variant to direct translation of vector

● Find the target term which is the closest to the n closest words

[Diagram: the lexicon entries point into the target space, where the closest target term is selected]

Déjean and Gaussier, 2002

50

Variant to direct translation of vector

● « Interlingual » approach
● translate the closest words instead of the direct context

[Diagram: SOURCE and TARGET vector spaces]

Déjean and Gaussier, 2002

51

Adaptation to multi-word terms

● Context vector of a multi-word term :
● union of the vectors of each word of the term

[Figure: the vectors of « energy » (strong, beer, …, glass) and « drink » (beer, mouth, glass, …) are merged into one vector for « energy drink » (strong, beer, mouth, glass, …)]

Morin et al., 2004
Morin and Daille, 2009
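A minimal sketch of the union, assuming summed weights when the component vectors share a collocate (the slides do not specify how shared collocates are combined):

```python
from collections import Counter

# Toy component vectors for the words of the term « energy drink ».
energy = Counter({"strong": 4.0, "glass": 1.0})
drink = Counter({"beer": 1.0, "mouth": 1.0, "glass": 2.0, "water": 3.0})

# Union of the two vectors; Counter addition sums shared collocates.
energy_drink = energy + drink
print(energy_drink)
```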

52

Evaluation

Single-word units    big, general-language corpus    80%
Multi-word units     small, specialized corpus       60%
Multi-word terms     small, specialized corpus       42%

● big = hundreds of millions of words
● small = 100 thousand to one million words

● Precision on Top-N candidates :
● ex : 50% on Top 20 means the correct translation is in the Top 20 best candidates for 50% of the source terms

Morin and Daille, 2010

53

Why is it so difficult ?

● the translation might not be present in the target corpus
● the target term has not been extracted
● polysemous words : undiscriminating, fuzzy vectors
● low-frequency words : non-significant vectors
● the translation has a different usage in the target language
● big search space : all the words of the target corpus

→ cannot be fully automatic
→ semi-supervised term alignment

54

Thank you

ed(a)lingua-et-machina.com

Franco-Thai Workshop 2010
4th intensive summer school on Natural Language Processing