Transcript of SPED 2007, Iaşi

Page 1

Parallel Corpora, Alignment Technologies and Further Prospects in Multilingual

Resources and Technology Infrastructure

Dan TUFIŞ, Radu ION

Research Institute for Artificial Intelligence, Romanian Academy

Page 2

Parallel corpora

• More and more data available (Hansard, EuroParl, JRC-Acquis), in the range of tens of millions of tokens per language
• Contain a lot of implicit multilingual knowledge on lexicons, word senses, grammars, collocations, idioms, phraseology, etc.
• This knowledge, once revealed, is fundamental in supporting cross-lingual and cross-cultural studies, communication and cooperation.
• Computer applications: cross-lingual comprehension aids, machine (aided) translation, evaluation of machine translation, language learning aids, cross-language information retrieval, cross-language question answering, etc.

Page 3

Corpora alignment

• Fully exploiting such linguistic information sources requires parallel corpora alignment (frequently, high-accuracy alignment requires basic language pre-processing, e.g. sentence splitting, tokenization, POS-tagging and lemmatization, chunking, dependency linking/parsing, and word sense disambiguation).
  – Sentence alignment
  – Phrase alignment
  – Word alignment
• Immediate outcomes: translation lexicons, translation memories, translation models, annotation transfer facilities, cross-lingual induction facilities, support for evidence-based cross-linguistic studies, etc.

Page 4

Reified Alignments (1)

• A bitext alignment is a set of lexical token pairs (links), each of them characterized by a feature structure.
• By merging two or more comparable alignments of the same bitext and using a trained link classifier, one can obtain a better alignment. COWAL is a wrapper/merger of the alignments produced by the independent word aligners YAWA and MEBA. The classifier's decisions are based entirely on the links' feature structures, and the improbable links (competing or not) are removed from the union of the initial alignments.

Page 5

Reified Alignments (2)

• Features characterizing a link <Token1, Token2>
  – The feature values are real numbers in the [0,1] interval.
  – Context-independent features (CIF) refer to the tokens of the current link: cognates, translation equivalents (TE), POS affinity, "obliqueness", TE entropy.
  – Context-dependent features (CDF) refer to the properties of the current link with respect to the rest of the links in a bitext: strong and/or weak locality, number of links crossed, collocations.
  – Based on the values of a link's features, we compute for each possible link a global reliability score, which is used to license or not a link in the final result.

LinkScore = \sum_{i=1}^{n} Coef_i \cdot ScoreFeat_i
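The weighted score above can be sketched in a few lines. The feature names and coefficient values below are illustrative, not the authors' trained weights:

```python
# Sketch of the global link reliability score:
# LinkScore = sum_i Coef_i * ScoreFeat_i, with all feature values in [0, 1].
def link_score(feature_scores, coefficients):
    """Weighted sum of a link's feature scores."""
    assert set(feature_scores) == set(coefficients)
    return sum(coefficients[f] * feature_scores[f] for f in feature_scores)

# Hypothetical feature values and weights for one candidate link:
features = {"TE": 0.9, "OBL": 0.8, "PA": 1.0, "COGN": 0.0}
weights  = {"TE": 0.4, "OBL": 0.2, "PA": 0.2, "COGN": 0.2}
score = link_score(features, weights)  # 0.36 + 0.16 + 0.2 + 0.0 = 0.72
```

A link is then licensed only if its score clears a tuned threshold.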

Page 6

• Translation equivalents (TE)
  – YAWA uses an external bilingual lexicon (TREQ + the RO & EN wordnets).
  – MEBA uses GIZA++-generated candidates filtered with a log-likelihood threshold (11).
  – For a pair of languages, translation equivalents are computed in both directions. The value of the TE feature of a candidate link <TOKEN1, TOKEN2> is 1/2 (P_TR(TOKEN1, TOKEN2) + P_TR(TOKEN2, TOKEN1)).

• Translation Entropy Score (ES)
  – The entropy of a word's translation equivalents distribution proved to be an important hint for identifying highly reliable links (anchoring links).
  – Skewed distributions are favored over uniform ones.
  – For a link <A, B>, the link feature value is 0.5 (ES(A) + ES(B)).

ES(W) = 1 + \frac{1}{\log N} \sum_{i=1}^{N} p(W, TR_i) \log p(W, TR_i)
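The entropy score is 1 minus the normalized entropy of the translation distribution, so a skewed distribution scores higher than a uniform one. A minimal sketch (the probability values are illustrative):

```python
import math

# Sketch of ES(W) = 1 + (1/log N) * sum_i p_i * log p_i, where p_i is the
# probability of W's i-th translation equivalent.
def entropy_score(probs):
    n = len(probs)
    if n <= 1:
        return 1.0  # a single translation is maximally skewed
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - h / math.log(n)

# A skewed distribution is favored over a uniform one:
skewed  = entropy_score([0.85, 0.05, 0.05, 0.05])  # close to 0.58
uniform = entropy_score([0.25, 0.25, 0.25, 0.25])  # exactly 0.0
```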

Page 7

• Cognates (COGN)

T_S = \sigma_1 \sigma_2 \ldots \sigma_k ;  T_T = \tau_1 \tau_2 \ldots \tau_m

If \sigma_i and \tau_i are the matching characters, \delta(\sigma_i) is the distance (in characters of T_S) from the previous matching character, and \delta(\tau_i) is the distance (in characters of T_T) from the previous matching character, then

SYM(T_S, T_T) = \frac{2}{k+m} \sum_{i=1}^{q} \frac{1}{1 + |\delta(\sigma_i) - \delta(\tau_i)|} \quad \text{if } q \geq 2, \qquad 0 \quad \text{if } q < 2

COGN(T_S, T_T) = 1 \quad \text{if } SYM(T_S, T_T) \geq Threshold, \qquad 0 \quad \text{otherwise}
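The cognate similarity can be sketched as follows. How the matching characters are found is not specified on the slide, so this sketch delegates matching to Python's difflib, which is an assumption; the threshold value is also illustrative:

```python
from difflib import SequenceMatcher

# Sketch of the XXDICE-style cognate similarity: SYM = (2/(k+m)) *
# sum over the q matched characters of 1/(1 + |delta_s - delta_t|),
# where delta is the gap since the previous match in each string.
def sym(ts, tt):
    matches = []  # (pos_in_ts, pos_in_tt) for every matched character
    for block in SequenceMatcher(None, ts, tt).get_matching_blocks():
        for off in range(block.size):
            matches.append((block.a + off, block.b + off))
    if len(matches) < 2:  # q < 2
        return 0.0
    total, prev = 0.0, (-1, -1)
    for i, j in matches:
        total += 1.0 / (1 + abs((i - prev[0]) - (j - prev[1])))
        prev = (i, j)
    return 2.0 * total / (len(ts) + len(tt))

def cogn(ts, tt, threshold=0.42):  # threshold chosen for illustration only
    return 1 if sym(ts, tt) >= threshold else 0
```

For example, Romanian "parlament" and English "parliament" score high, while unrelated strings score 0.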

• Part-of-speech affinity (PA)

Translated words tend to keep their part of speech, and when they have different POSes, this is not arbitrary. The information was computed from a gold standard (GS2003), in both directions (source-target and target-source). For a link <A, B>: PA = 0.5 (P(cat(A)|cat(B)) + P(cat(B)|cat(A))).
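The conditional probabilities can be estimated by counting POS pairs over gold-standard links. A sketch, with toy counts that are purely illustrative (not from GS2003):

```python
from collections import Counter

# Sketch of the POS-affinity feature:
# PA(<A,B>) = 0.5 * (P(cat(A)|cat(B)) + P(cat(B)|cat(A))),
# estimated from a list of (source POS, target POS) gold links.
def pos_affinity(gold_links):
    pair = Counter(gold_links)
    src = Counter(s for s, _ in gold_links)
    tgt = Counter(t for _, t in gold_links)
    def pa(cat_a, cat_b):
        p_a_given_b = pair[(cat_a, cat_b)] / tgt[cat_b] if tgt[cat_b] else 0.0
        p_b_given_a = pair[(cat_a, cat_b)] / src[cat_a] if src[cat_a] else 0.0
        return 0.5 * (p_a_given_b + p_b_given_a)
    return pa

# Toy gold standard: nouns usually translate as nouns.
gold = [("NOUN", "NOUN")] * 8 + [("NOUN", "VERB")] * 2 + [("VERB", "VERB")] * 5
pa = pos_affinity(gold)
# pa("NOUN", "NOUN") is much higher than pa("NOUN", "VERB")
```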

Page 8

• Collocation
  – Bi-gram lists (content words only) were built from each monolingual part of the training corpus, using the log-likelihood score (threshold of 10) and a minimal occurrence frequency (3) for candidate filtering. Collocation probabilities are estimated for each surviving bi-gram.
  – If neither token of a candidate link has a relevant collocation score with the tokens in its neighborhood, the link's value for this feature is 0; otherwise the value is 1. Competing links (starting or finishing in the same token) for YAWA are licensed if and only if at least one of them has a non-null collocation score.

• Obliqueness
  – Each token on both sides of a bitext is characterized by a position index, computed as the ratio between its relative position in the sentence and the length of the sentence. The absolute value of the difference between the tokens' position indexes, subtracted from 1, gives the link's "obliqueness" OBL(<SW_i, TW_j>).

OBL(SW_i, TW_j) = 1 - \left| \frac{i}{length(Sent_S)} - \frac{j}{length(Sent_T)} \right|
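The obliqueness formula is a one-liner; tokens in similar relative positions in their sentences give a value close to 1:

```python
# Sketch of the obliqueness feature: each token's position index is its
# position divided by the sentence length, and the feature is 1 minus
# the absolute difference of the two indexes.
def obliqueness(i, src_len, j, tgt_len):
    return 1.0 - abs(i / src_len - j / tgt_len)

obliqueness(3, 10, 3, 11)   # ~0.97: similar relative positions
obliqueness(1, 10, 9, 11)   # ~0.28: an unlikely, very "oblique" link
```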

Page 9

• Locality
  – When the dependency chunking module is available and the chunks are aligned via the linking of their constituents, the new candidate links starting in one chunk should finish in the aligned chunk (strong locality).

• Strong Locality: EM and CLAM combined linkers
  – We have modified the EM algorithm of IBM-1 to work on a 'bitext' that contains the source sentence and a replica of it as the target:
    » disregard the NULL alignment links
    » disregard words that are in the same position
  – LAM was introduced by Yuret, D. (1998): Discovery of linguistic relations using lexical attraction. PhD thesis, Dept. of Computer Science and Electrical Engineering, MIT (subject to a planarity restriction).
  – Constrained LAM: a link is rejected if it does not pass any of the linking rules of a language, for instance number agreement.
  – When dependency chunking is not available, locality is judged in a variable-length window depending on the length of the currently aligned sentences (weak locality).

Page 10

• Weak Locality

When chunking/dependency link information is not available, link locality is judged against a window containing m links, centered on the candidate link; the value of m depends on the length of the aligned sentences.

[Figure: the window's source tokens s_1 ... s_m linked to the target tokens t_1 ... t_m]

LOC = \frac{1}{m} \sum_{k=1}^{m} \frac{\min(|s_k|, |t_k|)}{\max(|s_k|, |t_k|)}
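Under our reading of the slide, |s_k| and |t_k| are the source- and target-side offsets of the k-th neighboring link from the candidate link; neighbors whose offsets are roughly parallel support the candidate. A sketch under that assumption:

```python
# Sketch of the weak-locality score
# LOC = (1/m) * sum_k min(|s_k|, |t_k|) / max(|s_k|, |t_k|),
# where s_k and t_k are the offsets of the k-th neighbouring link's
# tokens from the candidate link's tokens (our interpretation).
def weak_locality(candidate, neighbours):
    ci, cj = candidate
    ratios = []
    for ni, nj in neighbours:
        s, t = abs(ni - ci), abs(nj - cj)
        if max(s, t) > 0:
            ratios.append(min(s, t) / max(s, t))
    return sum(ratios) / len(ratios) if ratios else 0.0

# Links are (source index, target index) pairs.
weak_locality((5, 5), [(3, 3), (4, 4), (6, 6), (7, 7)])   # 1.0: monotone context
weak_locality((5, 5), [(3, 9), (4, 1), (6, 2), (7, 12)])  # much lower: scattered context
```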

• Combining classifiers

If multiple classifiers are comparable, and if they do not make similar errors, combining their classifications is generally better than the individual classifications.

Page 11

COWAL

• An integrated platform that takes two parallel raw texts and produces their alignment.
  – Basic modules: collocation detector, tokenizers, lemmatizers, POS-taggers, two or more comparable word aligners (YAWA, MEBA), GIZA++ translation model builder, alignment combiner.
  – Optional modules: sentence aligner, dependency linkers, chunkers and bilingual dictionaries (Ro-En aligned wordnets).
  – The platform also includes an XML generator (XCES schema compliant), an alignment viewer & editor, and a WSD module based on word alignment and the aligned wordnets.

Page 12

Combining the Alignments

• COWAL filters the union of the alignments. The filtering is achieved by an SVM classifier (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) trained on our version of GS2005 (for positive examples) and on the differences between the basic alignments (YAWA, MEBA) and GS2005 (for negative examples).
• The SVM classifier (LIBSVM; Fan et al., 2005) uses the default parameters: C-SVC classification (soft-margin classifier) and an RBF (Radial Basis Function) kernel.
• Features used for training (10-fold validation; about 7000 good examples and 7000 bad examples): TE(S,T), TE(T,S), OBL(S,T), LOC(S,T), PA(S,T), PA(T,S). The links labeled as incorrect were removed from the merged alignments.

K(x, y) = e^{-\gamma \|x - y\|^2}
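The combination step can be sketched as: take the union of the two alignments, then keep only the links the classifier accepts. The RBF kernel is shown directly; the "classifier" below is a stand-in threshold on a feature average, not the trained LIBSVM model, and all link/feature values are toy data:

```python
import math

# The RBF kernel used by the SVM: K(x, y) = exp(-gamma * ||x - y||^2).
def rbf_kernel(x, y, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

# Sketch of COWAL's merge-then-filter step over two word aligners' outputs.
def combine(yawa_links, meba_links, feature_of, accept):
    merged = set(yawa_links) | set(meba_links)
    return {link for link in merged if accept(feature_of(link))}

# Toy usage: links are (src_idx, tgt_idx); features are (TE, OBL) vectors.
feats = {(1, 1): (0.9, 0.95), (2, 5): (0.1, 0.3), (3, 3): (0.8, 0.9)}
kept = combine({(1, 1), (2, 5)}, {(1, 1), (3, 3)},
               feats.__getitem__, lambda f: sum(f) / len(f) > 0.5)
# kept == {(1, 1), (3, 3)}: the improbable link (2, 5) is filtered out
```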

Page 13

Heuristics for improving the alignment (1)

• The words left unaligned in the previous step may get links via their aligned dependents (HLP: Head Linking Projection heuristic): if b is aligned to c and b is linked to a, link a to c, unless there exists d in the same chunk as c, linked or not to it, such that the POS category of d has a significant affinity with the category of a.

[Figure: the dependency a-b on the source side projected onto b's alignment c, with a competing word d in c's chunk]

• Alignment of sequences of words surrounded by the aligned chunks
• Filtering out improbable links (e.g. links that cross many other links)
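The HLP rule can be sketched as below. The chunk lookup and affinity function are simplified stand-ins, and the affinity threshold is illustrative:

```python
# Sketch of the Head Linking Projection heuristic: an unaligned word `a`
# inherits a link from a word `b` it is dependency-linked to, unless a
# word `d` in the target chunk has significant POS affinity with `a`.
def hlp(unaligned, alignment, dependencies, chunk_of, affinity, min_aff=0.3):
    new_links = {}
    for a in unaligned:
        for b in dependencies.get(a, []):       # words b linked to a
            for (x, c) in alignment:            # is b aligned to some c?
                if x != b:
                    continue
                blockers = [d for d in chunk_of(c)
                            if d != c and affinity(d, a) >= min_aff]
                if not blockers:                # no competing d: project the link
                    new_links[a] = c
    return new_links

# Toy usage: b is aligned to c, b depends on a, and d shares c's chunk.
deps = {"a": ["b"]}
links = {("b", "c")}
chunk = lambda w: ["c", "d"]
hlp(["a"], links, deps, chunk, lambda d, a: 0.1)  # {'a': 'c'}: projection fires
hlp(["a"], links, deps, chunk, lambda d, a: 0.9)  # {}: d blocks the projection
```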

Page 14

Heuristics for improving the alignment (2)

• Unaligned chunks surrounded by aligned chunks get a probable phrase alignment:

SL                       TL
Ws_i           ↔  Wt_j
Ws_k           ↔  Wt_m
Ws_p Ws_{p+1} …  ↔  Wt_q Wt_{q+1} …

Page 15

Dependency chunks & Translation Model

• Regular expressions defined over the POS tags and dependency links
• Non-recursive chunks
• Chunk alignment based on their aligned constituents (one or more)

Page 16

<tu id="Ozz.1"><seg lang="en">... </seg> <seg lang="ro">...</seg>...</tu>
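Such a translation unit is plain XML and can be read with the standard library. A sketch, keeping the slide's "..." placeholders in place of the real sentences:

```python
import xml.etree.ElementTree as ET

# Sketch of reading one translation unit in the XCES-style format above.
xml = ('<tu id="Ozz.1">'
       '<seg lang="en">...</seg>'
       '<seg lang="ro">...</seg>'
       '</tu>')
tu = ET.fromstring(xml)
pair = {seg.get("lang"): seg.text for seg in tu.findall("seg")}
# pair["en"] and pair["ro"] now hold the two sides of the aligned unit
```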

Page 17

Page 18

Exploiting the alignments (1)

Applying the same methodology and the same assumption (two aligned words MUST have at least one cross-lingually equivalent meaning, i.e. the same ILI code):

• Aligned wordnets validation ("1984")
  – Identifying the wrong ILI sense mappings
  – Identifying synsets missing from the commonly agreed set of synsets (BCS1, BCS2, BCS3, …)
• Extending the wordnets ("Ro-En-SemCor")
  – Identifying missing literals in the existing aligned synsets
  – Automatically adding new synsets (monosemous literals, instances)
• WSD (arbitrary Ro-En bitexts)

Page 19

Exploiting the alignments (2)

• Annotation transfer as a cross-lingual collaboration task
  – "1984" parallel corpus; word aligned; the English part dependency parsed (Wolverhampton), validated and corrected (Univ. "Al. I. Cuza", Iaşi); the Romanian part imported the parsing.

No  Rel    RO   Lost  EN   Ac
1   qn     10   0     12   83.3%
2   neg    10   0     13   76.9%
3   oc     3    0     4    75.0%
4   dat    3    0     4    75.0%
5   cnt    8    0     11   72.7%
6   ad     25   0     35   71.4%
7   pcomp  218  9     316  71.0%
8   det    126  173   355  69.2%
9   comp   70   1     112  63.0%
10  attr   151  4     245  62.7%
11  cc     94   2     155  59.4%
12  pm     44   1     75   58.5%
13  obj    79   2     137  58.5%
14  mod    114  1     74   55.4%
15  cla    8    0     15   53.3%
16  tmp    23   9     46   50.0%
17  man    16   0     32   50.0%
18  subj   121  72    319  48.9%

Page 20

Exploiting the alignments (3)

• Collocation analysis in a parallel corpus
  – Large parallel corpus (Acq-Com)
  – University Marc Bloch of Strasbourg, IMS Stuttgart University and RACAI independently extracted the collocations in Fr, Ge, Ro and En (hub).
  – We identified the equivalent collocations in the four languages.

SURE-COLLOC_X = COLLOC_X ∩ TR_X-COLLOC_Y   (EQ1)
  member states, European Communities, international treaty, etc.

INT-COLLOC_Z = COLLOC_Z \ SURE-COLLOC_Z   (EQ2)
  • adversely affect ↔ a aduce atingere [1]
  • legal remedy ↔ cale de atac [2]
  • to make good the damage ↔ a compensa daunele [3], etc.

[1] A word-for-word translation would be "to bring a touch".
[2] A word-for-word translation would be "way to attack".
[3] A word-for-word translation would be "to compensate the damages".
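EQ1 and EQ2 are plain set operations and can be sketched directly. The tiny collocation lists and the translation function below are toy stand-ins for the extracted data:

```python
# Sketch of EQ1/EQ2: SURE collocations are those whose word-for-word
# translation is itself a collocation in the other language; the
# INTeresting ones (idiomatic, non-compositional) are the rest.
def sure_and_interesting(colloc_x, colloc_y, translate_xy):
    sure = {c for c in colloc_x if translate_xy(c) in colloc_y}  # EQ1
    interesting = colloc_x - sure                                # EQ2
    return sure, interesting

# Toy En/Ro collocation sets and a toy word-for-word translator:
en = {"member states", "adversely affect"}
ro = {"state membre", "a aduce atingere"}
tr = {"member states": "state membre",
      "adversely affect": "a afecta negativ"}.get
sure, interesting = sure_and_interesting(en, ro, tr)
# sure == {"member states"}; interesting == {"adversely affect"}
```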

Page 21

Language Web Services

• This has just started; it was fostered by the need to cooperate more closely with our partners at UAIC, the University of Texas, the University of Strasbourg and the University of Stuttgart in various projects (ROTEL, CLEF, LT4L, AUF, etc.).

• Currently we have added basic text processing for Romanian and English: tokenisation, tiered tagging, lemmatization (SOAP/WSDL/UDDI). Others, for parallel corpora (sentence aligner, word aligner, dependency linker, RoWordNet, etc.), will be available soon.

Page 22

Initiatives Towards Language Infrastructures

• Global Wordnet Association

• Language Grid

• CLARIN (including DAMLR)

Major goal: the construction and operation of a shared distributed infrastructure that aims at making language resources and technology available to anybody. Such an infrastructure has to offer persistent services that allow users to operate on language resources and technologies with high availability and proper security. The automatic processing of language material is of a complexity that cannot be tackled with the current fragmented approaches. What is needed is primarily to turn existing, fragmented technology and resources into accessible and stable services, so that users can use them the way they want.

Page 23