Automated Compounding as a means for Maximizing Lexical Coverage
Vincent Vandeghinste
Centrum voor Computerlinguïstiek
K.U. Leuven
Maximizing Lexical Coverage
• Target: reduction of the number of OOV words
• Means:
  – accurate content and organization of the recognizer lexicon
  – taking care of a number of productive word-formation processes
• Evaluation:
  – implementation of a test tool
  – test results
• Conclusions
Lexicon: Content & Organization
• Starting point: CGN lexicon (570,000 entries)
• Reduction to one entry per word form per POS (300,000 entries)
• Removal of compounds (160,000 entries)
• Selection of the most frequent entries (40,000) => Basic Word List (BWL)
• Quasi-Word List (QWL): compounding word parts which do not appear in the BWL
Lexicon Accuracy
• Careful selection of the words in the BWL:
  – no compounds
  – frequent words
• Organization of the lexicon: maximal applicability of the compounding rules through a lexicon split into BWL and QWL
Word Formation Processes
• Input: a number of word parts that may or may not form a compound
• Hybrid approach: rule-based + statistical filters
• Output:
  – compound + morpho-syntactic info + confidence measure, or
  – no compounding possible with the given word parts
Word Formation Processes: Input
• From BWL: full words that can be part of a compound or stand alone as words
• From QWL: 'words' that can only be part of a compound
• 2 up to 5 word parts
Word Formation Processes: Rules
• Making use of rules for word formation, e.g.: modifier (N) + head (N) => compound (N)
• Input from QWL: the word part is N and can only be a modifier
• Input from BWL: the word is looked up in the CGN; its morpho-syntactic info is used in the rules
• Rules use 2 word parts
• When the input has more than 2 word parts: recursion in the rules
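The pairwise rules plus recursion can be sketched as follows. This is a minimal illustration, not the actual rule set: the single N + N rule and the POS-tagged inputs are placeholder assumptions.

```python
# Hypothetical rule: modifier (N) + head (N) => compound (N).
def apply_rule(mod_pos, head_pos):
    if mod_pos == "N" and head_pos == "N":
        return "N"  # the compound inherits the head's POS
    return None

def compound_pos(parts):
    """parts: list of (word_part, POS) pairs.
    Rules combine 2 parts at a time; inputs with more than
    2 parts are handled by recursing on the leftmost pair."""
    if len(parts) == 1:
        return parts[0][1]
    (w1, p1), (w2, p2) = parts[0], parts[1]
    combined = apply_rule(p1, p2)
    if combined is None:
        return None  # no compounding possible with the given word parts
    return compound_pos([(w1 + w2, combined)] + parts[2:])

print(compound_pos([("frequentie", "N"), ("tabel", "N")]))  # N
print(compound_pos([("bij", "PREP"), ("tabel", "N")]))      # None
```

With 3 or more parts, the leftmost pair is combined first and the result re-enters the same rule, matching the "recursion in the rules" point above.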
Word Formation Processes: Statistics
• Relative Frequency Threshold Parameter
• Confidence Measure of the Compound Probability
Relative Frequency Threshold
• Makes use of the relative frequency of a POS for a word form
• Makes use of a threshold value (0.05%)
• If RF > threshold: the POS is used for this word form
• If RF < threshold: the POS is rejected for this word form
• Example: RF(bij(PREP)) = 0.999 > T, RF(bij(N)) = 0.0004 < T, so only bij(PREP) is used
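The filter above can be sketched in a few lines. The counts for "bij" are illustrative assumptions chosen to reproduce the relative frequencies on the slide, not actual CGN figures.

```python
THRESHOLD = 0.0005  # 0.05%

def filter_pos(pos_counts):
    """pos_counts: {POS: count} for one word form.
    Keep a POS only if its relative frequency exceeds the threshold."""
    total = sum(pos_counts.values())
    return {pos for pos, n in pos_counts.items() if n / total > THRESHOLD}

# 'bij' is overwhelmingly a preposition, very rarely a noun ('bee'):
# RF(PREP) ~ 0.9996 > T, RF(N) ~ 0.0004 < T.
print(filter_pos({"PREP": 9990, "N": 4}))  # {'PREP'}
```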
Confidence Measure of Compounding Probability
• Estimation of: P(comp(w1=mod, w2=head)) / P(comp(w1=*, w2=head))
  where:
  – P(comp(w1=mod, w2=head)) is the probability that two consecutive word parts form a compound rather than being 2 separate words
  – P(comp(w1=*, w2=head)) is the probability of w2 being a head, with any modifier
Confidence Measure of Compounding Probability (2)
• If the compound is found in the frequency list, the ratio is estimated as:
  [Fr(comp(w1=mod, w2=head)) / Fr(comp(w1=*, w2=head))] x (1 - Dhead)
  where:
  – Fr(comp(w1=mod, w2=head)) is the frequency of the compound consisting of w1 + w2
  – Fr(comp(w1=*, w2=head)) is the frequency of the 2nd word part as a head, with any modifier
  – Dhead is the discount parameter: the amount of probability mass reserved for words not in the frequency list
Confidence Measure of Compounding Probability (3)
• The discount parameter is estimated as:
  Dhead = #diff(mod | head) / Fr(comp(w1=*, w2=head))
  where:
  – #diff(mod | head) is the number of different modifiers occurring with the given head
  – Fr(comp(w1=*, w2=head)) is the frequency of the 2nd word part as a head, with any modifier
• (1 - Dhead) is the amount of probability mass reserved for words that can be found in the frequency list
Confidence Measure of Compounding Probability (4)
• If the compound is not found in the frequency list, the ratio is estimated as:
  Dhead x [Fr(comp(w1=mod, w2=*)) / Fr(*)]
  where:
  – Fr(comp(w1=mod, w2=*)) is the frequency of the 1st word part as a modifier of any head
  – Fr(*) is the total frequency of all words in the frequency list (= 79,862,581)
Confidence Measures: Examples
• binnen + kijken
  – binnenkijken occurs in the frequency list
  – Fr(w1=binnen, w2=kijken) = 10
  – Fr(w1=*, w2=kijken) = 2188
  – #diff(mod | head=kijken) = 21
  – (10 / 2188) x (1 - 21/2188) = 0.0045
• frequentie + tabel
  – frequentietabel does not occur in the frequency list
  – Fr(w1=*, w2=tabel) = 141
  – #diff(mod | head=tabel) = 17
  – Fr(w1=frequentie, w2=*) = 15
  – (17 / 141) x (15 / 79,862,581) = 2.26e-8
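The two estimates can be written out directly from the formulas, using the frequencies quoted in the examples above. The function names are my own; only the arithmetic is from the slides.

```python
TOTAL_FREQ = 79_862_581  # Fr(*): total frequency of all words in the list

def confidence_seen(fr_compound, fr_head, n_diff_mods):
    """Compound occurs in the frequency list:
    [Fr(mod, head) / Fr(*, head)] x (1 - Dhead),
    with Dhead = #diff(mod | head) / Fr(*, head)."""
    d_head = n_diff_mods / fr_head
    return (fr_compound / fr_head) * (1 - d_head)

def confidence_unseen(fr_head, n_diff_mods, fr_modifier):
    """Compound does not occur in the frequency list:
    Dhead x [Fr(mod, *) / Fr(*)]."""
    d_head = n_diff_mods / fr_head
    return d_head * (fr_modifier / TOTAL_FREQ)

# binnen + kijken: binnenkijken is in the frequency list
print(round(confidence_seen(10, 2188, 21), 4))  # 0.0045
# frequentie + tabel: frequentietabel is unseen
print(confidence_unseen(141, 17, 15))           # ~2.26e-8
```

Note how the discount parameter Dhead moves probability mass between the two cases: seen compounds are scaled down by (1 - Dhead), and unseen ones receive the reserved Dhead share.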
Evaluation
• Test System
• Test Results
The Test System
• Takes a regular text as input
• Converts punctuation marks into #
• For the test system, a BWL of 35,000 entries was used
• Every word is checked against the BWL:
  – if the word is not present in the BWL, it is split up into a modifier (QWL or BWL) and a head (BWL)
  – no compounding rules are used for the split-up procedure
  – if no possible 2-part split is found, a split into 3 parts is tried
• If a word cannot be found in the BWL and cannot be split up, it is classified as an OOV word
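The 2-part split-up step can be sketched as an exhaustive scan over split points, with the modifier checked against QWL or BWL and the head against the BWL. The toy word lists below are assumptions for illustration only.

```python
BWL = {"frequentie", "tabel", "kijken"}  # full words (toy sample)
QWL = {"binnen"}  # parts that only occur inside compounds (toy sample)

def split_word(word):
    """Return (modifier, head) for the first valid 2-part split,
    or None if no split is possible (the test system would then
    try a 3-part split before marking the word as OOV)."""
    for i in range(1, len(word)):
        mod, head = word[:i], word[i:]
        if (mod in BWL or mod in QWL) and head in BWL:
            return (mod, head)
    return None

print(split_word("frequentietabel"))  # ('frequentie', 'tabel')
print(split_word("binnenkijken"))     # ('binnen', 'kijken')
print(split_word("xyzzy"))            # None
```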
The Test System (2)
• For every 2 consecutive word parts, it was tested whether they can be compounded or not
• Results are compared with the original text
• False compounding and false identification of noncompounds can be counted this way
• The same was done for every 3 consecutive word parts
• A threshold was set on the confidence measure: if the confidence measure < threshold, the compound is rejected
Test Results
• 3 test texts were used:
  – Thuis (dialogue of a soap series): 3415 words, 3.08% OOV, 1.47% compounds
  – Aspe (chapter of a novel): 4589 words, 3.77% OOV, 6.08% compounds
  – Interview (transcript of spontaneous speech): 4645 words, 0.84% OOV, 2.95% compounds
• Most of the OOVs are proper nouns or non-standard Dutch
Test Results (2)
• Correct identification of noncompounds and compounds:
  – dependent on the test text
  – dependent on the parameter thresholds
• There is a nearly perfect negative correlation (-0.98) between the optimal confidence threshold and the amount of compounds in the test text
Test Results (3)
Text        Rel. Freq. Threshold   Confidence Threshold   % Correct Identification
Aspe        0.05                   0.003                  94.53%
Thuis       0.05                   0.003                  96.28%
Interview   0.05                   0.003                  98.47%
Conclusions
• Identifying compoundability can be done with an accuracy of 94.5-98.5%
• Lexical coverage can be assured with OOV rates between 0.8 and 3.8% and a lexicon with a total size of 36,000 entries (BWL + QWL)
Conclusions (2)
• Capturing already existing compounds by automated compounding proves successful
• Capturing newly formed compounds proves much harder: the accuracy is considerably lower
• Automated compounding proves to be a useful means for maximizing lexical coverage