Automated Compounding as a means for Maximizing Lexical Coverage
Vincent Vandeghinste
Centrum voor Computerlinguïstiek
K.U. Leuven
Maximizing Lexical Coverage
• Target: reduction of the number of OOV words
• Means:
  – accurate content and organization of the recognizer lexicon
  – taking care of a number of productive word-formation processes
• Evaluation:
  – implementation of a test tool
  – test results
• Conclusions
Lexicon: Content & Organization
• Starting point: CGN lexicon (570,000 entries)
• Reduction to one entry per word form per POS (300,000 entries)
• Removal of compounds (160,000 entries)
• Selection of the most frequent entries (40,000) => Basic Word List (BWL)
• Quasi-Word List (QWL): compounding word parts which do not appear in the BWL
Lexicon Accuracy
• Careful selection of the words in the BWL:
  – no compounds
  – frequent words
• Organization of the lexicon: maximal applicability of the compounding rules through a lexicon split into BWL and QWL
Word Formation Processes
• Input: a number of word parts that may or may not form a compound
• Hybrid approach: rule-based + statistical filters
• Output:
  – compound + morpho-syntactic info + confidence measure, or
  – no compounding possible with the given word parts
Word Formation Processes: Input
• From BWL: full words that can be part of a compound or stand alone as words
• From QWL: 'words' that can only be part of a compound
• 2 up to 5 word parts
Word Formation Processes: Rules
• Making use of rules for word formation, e.g.: modifier (N) + head (N) => compound (N)
• Input from QWL: the word part is N and can only be a modifier
• Input from BWL: the word is looked up in the CGN; its morpho-syntactic info is used in the rules
• Rules use 2 word parts
• When the input has more than 2 word parts: recursion in the rules
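The pairwise rules plus recursion can be sketched as follows. This is a minimal illustration, not the actual rule set: the single N + N rule and the POS-tagged inputs are placeholder assumptions.

```python
# Hypothetical rule: modifier (N) + head (N) => compound (N).
def apply_rule(mod_pos, head_pos):
    if mod_pos == "N" and head_pos == "N":
        return "N"  # the compound inherits the head's POS
    return None

def compound_pos(parts):
    """parts: list of (word_part, POS) pairs.
    Rules combine 2 parts at a time; inputs with more than
    2 parts are handled by recursing on the leftmost pair."""
    if len(parts) == 1:
        return parts[0][1]
    (w1, p1), (w2, p2) = parts[0], parts[1]
    combined = apply_rule(p1, p2)
    if combined is None:
        return None  # no compounding possible with the given word parts
    return compound_pos([(w1 + w2, combined)] + parts[2:])

print(compound_pos([("frequentie", "N"), ("tabel", "N")]))  # N
print(compound_pos([("bij", "PREP"), ("tabel", "N")]))      # None
```

With 3 or more parts, the leftmost pair is combined first and the result re-enters the same rule, matching the "recursion in the rules" point above.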
Word Formation Processes: Statistics
• Relative Frequency Threshold Parameter
• Confidence Measure of the Compound Probability
Relative Frequency Threshold
• Makes use of the relative frequency of a POS for a word form
• Makes use of a threshold value (0.05%)
• If RF > threshold: the POS is used for this word form
• If RF < threshold: the POS is rejected for this word form
• Example: RF(bij(PREP)) = 0.999 > T, RF(bij(N)) = 0.0004 < T, so only bij(PREP) is used
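The filter above can be sketched in a few lines. The counts for "bij" are illustrative assumptions chosen to reproduce the relative frequencies on the slide, not actual CGN figures.

```python
THRESHOLD = 0.0005  # 0.05%

def filter_pos(pos_counts):
    """pos_counts: {POS: count} for one word form.
    Keep a POS only if its relative frequency exceeds the threshold."""
    total = sum(pos_counts.values())
    return {pos for pos, n in pos_counts.items() if n / total > THRESHOLD}

# 'bij' is overwhelmingly a preposition, very rarely a noun ('bee'):
# RF(PREP) ~ 0.9996 > T, RF(N) ~ 0.0004 < T.
print(filter_pos({"PREP": 9990, "N": 4}))  # {'PREP'}
```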
Confidence Measure of Compounding Probability
• Estimation of: P(comp(w1=mod, w2=head)) / P(comp(w1=*, w2=head))
  where:
  – P(comp(w1=mod, w2=head)) is the probability that two consecutive word parts form a compound rather than being 2 separate words
  – P(comp(w1=*, w2=head)) is the probability of w2 being a head, with any modifier
Confidence Measure of Compounding Probability (2)
• If the compound is found in the frequency list, the ratio is estimated as:
  [Fr(comp(w1=mod, w2=head)) / Fr(comp(w1=*, w2=head))] x (1 - Dhead)
  where:
  – Fr(comp(w1=mod, w2=head)) is the frequency of the compound consisting of w1 + w2
  – Fr(comp(w1=*, w2=head)) is the frequency of the 2nd word part as a head, with any modifier
  – Dhead is the discount parameter: the amount of probability mass reserved for words not in the frequency list
Confidence Measure of Compounding Probability (3)
• The discount parameter is estimated as:
  Dhead = #diff(mod | head) / Fr(comp(w1=*, w2=head))
  where:
  – #diff(mod | head) is the number of different modifiers occurring with the given head
  – Fr(comp(w1=*, w2=head)) is the frequency of the 2nd word part as a head, with any modifier
• (1 - Dhead) is the amount of probability mass reserved for words that can be found in the frequency list
Confidence Measure of Compounding Probability (4)
• If the compound is not found in the frequency list, the ratio is estimated as:
  Dhead x [Fr(comp(w1=mod, w2=*)) / Fr(*)]
  where:
  – Fr(comp(w1=mod, w2=*)) is the frequency of the 1st word part as a modifier of any head
  – Fr(*) is the total frequency of all words in the frequency list (= 79,862,581)
Confidence Measures: Examples
• binnen + kijken
  – binnenkijken occurs in the frequency list
  – Fr(w1=binnen, w2=kijken) = 10
  – Fr(w1=*, w2=kijken) = 2188
  – #diff(mod | head=kijken) = 21
  – (10 / 2188) x (1 - 21/2188) = 0.0045
• frequentie + tabel
  – frequentietabel does not occur in the frequency list
  – Fr(w1=*, w2=tabel) = 141
  – #diff(mod | head=tabel) = 17
  – Fr(w1=frequentie, w2=*) = 15
  – (17 / 141) x (15 / 79,862,581) = 2.26e-8
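The two estimates can be written out directly from the formulas, using the frequencies quoted in the examples above. The function names are my own; only the arithmetic is from the slides.

```python
TOTAL_FREQ = 79_862_581  # Fr(*): total frequency of all words in the list

def confidence_seen(fr_compound, fr_head, n_diff_mods):
    """Compound occurs in the frequency list:
    [Fr(mod, head) / Fr(*, head)] x (1 - Dhead),
    with Dhead = #diff(mod | head) / Fr(*, head)."""
    d_head = n_diff_mods / fr_head
    return (fr_compound / fr_head) * (1 - d_head)

def confidence_unseen(fr_head, n_diff_mods, fr_modifier):
    """Compound does not occur in the frequency list:
    Dhead x [Fr(mod, *) / Fr(*)]."""
    d_head = n_diff_mods / fr_head
    return d_head * (fr_modifier / TOTAL_FREQ)

# binnen + kijken: binnenkijken is in the frequency list
print(round(confidence_seen(10, 2188, 21), 4))  # 0.0045
# frequentie + tabel: frequentietabel is unseen
print(confidence_unseen(141, 17, 15))           # ~2.26e-8
```

Note how the discount parameter Dhead moves probability mass between the two cases: seen compounds are scaled down by (1 - Dhead), and unseen ones receive the reserved Dhead share.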
Evaluation
• Test System
• Test Results
The Test System
• Takes a regular text as input
• Converts punctuation marks into #
• For the test system, a BWL of 35,000 entries was used
• Every word is checked against the BWL:
  – if the word is not present in the BWL, it is split up into a modifier (QWL or BWL) and a head (BWL)
  – no compounding rules are used for the split-up procedure
  – if no possible 2-part split is found, a split into 3 parts is tried
• If a word cannot be found in the BWL and cannot be split up, it is classified as an OOV word
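The 2-part split-up step can be sketched as an exhaustive scan over split points, with the modifier checked against QWL or BWL and the head against the BWL. The toy word lists below are assumptions for illustration only.

```python
BWL = {"frequentie", "tabel", "kijken"}  # full words (toy sample)
QWL = {"binnen"}  # parts that only occur inside compounds (toy sample)

def split_word(word):
    """Return (modifier, head) for the first valid 2-part split,
    or None if no split is possible (the test system would then
    try a 3-part split before marking the word as OOV)."""
    for i in range(1, len(word)):
        mod, head = word[:i], word[i:]
        if (mod in BWL or mod in QWL) and head in BWL:
            return (mod, head)
    return None

print(split_word("frequentietabel"))  # ('frequentie', 'tabel')
print(split_word("binnenkijken"))     # ('binnen', 'kijken')
print(split_word("xyzzy"))            # None
```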
The Test System (2)
• For every 2 consecutive word parts, it was tested whether they can be compounded or not
• Results are compared with the original text
• False compounding and false identification of noncompounds can be counted this way
• The same was done for every 3 consecutive word parts
• A threshold was set on the confidence measure: if the confidence measure < threshold, the compound is rejected
Test Results
• 3 test texts were used:
  – Thuis (dialogue of a soap series): 3415 words, 3.08% OOV, 1.47% compounds
  – Aspe (chapter of a novel): 4589 words, 3.77% OOV, 6.08% compounds
  – Interview (transcript of spontaneous speech): 4645 words, 0.84% OOV, 2.95% compounds
• Most of the OOVs are proper nouns or non-standard Dutch
Test Results (2)
• Correct identification of noncompounds and compounds:
  – dependent on the test text
  – dependent on the parameter thresholds
• There is a nearly perfect negative correlation (-0.98) between the optimal confidence threshold and the amount of compounds in the test text
Test Results (3)
Text        Rel. Freq. Threshold   Confidence Threshold   % Correct Identification
Aspe        0.05                   0.003                  94.53%
Thuis       0.05                   0.003                  96.28%
Interview   0.05                   0.003                  98.47%
Conclusions
• Identifying compoundability can be done with an accuracy of 94.5-98.5%
• Lexical coverage can be assured with OOV rates between 0.8 and 3.8% and a lexicon with a total size of 36,000 entries (BWL + QWL)
Conclusions (2)
• Capturing already existing compounds by automated compounding proves successful
• Capturing newly formed compounds proves much harder: the accuracy is considerably lower
• Automated compounding proves to be a useful means for maximizing lexical coverage