Automated Compounding as a Means for Maximizing Lexical Coverage

Vincent Vandeghinste

Centrum voor Computerlinguïstiek

K.U. Leuven

Maximizing Lexical Coverage

• Target: Reduction of the number of OOV-words

• Means:

– accurate content and organization of the recognizer lexicon

– taking care of a number of productive word formation processes

• Evaluation:

– implementation of test tool

– test results

• Conclusions

Lexicon: Content & Organization

• Starting point: CGN-lexicon (570,000 entries)

• Reduction to one entry per wordform per POS (300,000 entries)

• Removal of compounds (160,000 entries)

• Selection of most frequent entries (40,000) => Basic Word List (BWL)

• Quasi-Word List (QWL): Compounding word parts which don’t appear in BWL

Lexicon Accuracy

• Careful selection of the words in BWL:

– no compounds

– frequent words

• Organization of the lexicon: maximal applicability of compounding rules through lexicon split into BWL and QWL

Word Formation Processes

• Input: number of word parts that can or cannot be compounded

• Hybrid approach: Rule-based + Statistical Filters

• Output:

– compound + morpho-syntactic info + confidence measure

– no compounding possible with given word parts

Word Formation Processes: Input

• From BWL: full words, that can be part of a compound or can be words by themselves

• From QWL: ‘words’ that can only be part of a compound

• 2 up to 5 word parts

Word Formation Processes: Rules

• Making use of rules for word formation, e.g.: modifier (N) + head (N) => compound (N)

• Input from QWL: word part is N and can only be modifier

• Input from BWL: word is looked up in CGN: morpho-syntactic info is used in rules

• Rules use 2 word parts

• When input > 2 word parts: rules are applied recursively
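
A minimal sketch of how two-part rules could be applied recursively to longer inputs. Only the N + N => N rule comes from the slides. The second rule in the `RULES` table, the left-to-right bracketing, and the name `compound_pos` are assumptions made for illustration.

```python
from typing import Optional

# Hypothetical rule table: (modifier POS, head POS) -> compound POS.
# Only N + N => N is taken from the slides; the ADJ rule is a placeholder.
RULES = {
    ("N", "N"): "N",
    ("ADJ", "N"): "N",
}

def compound_pos(parts: list[tuple[str, str]]) -> Optional[tuple[str, str]]:
    """Combine (wordform, POS) parts into one compound, or return None.

    Every rule takes exactly two parts. Longer inputs are handled by first
    compounding all parts except the last, then attaching the last as head.
    """
    if len(parts) < 2:
        return None
    if len(parts) == 2:
        (mod, mod_pos), (head, head_pos) = parts
        result_pos = RULES.get((mod_pos, head_pos))
        return (mod + head, result_pos) if result_pos else None
    # Recursive step (assumed left-branching): build the modifier first.
    left = compound_pos(parts[:-1])
    if left is None:
        return None
    return compound_pos([left, parts[-1]])
```

Calling `compound_pos([("frequentie", "N"), ("tabel", "N")])` yields `("frequentietabel", "N")`, while a pair with no matching rule yields `None`, corresponding to the "no compounding possible" output.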

Word Formation Processes: Statistics

• Relative Frequency Threshold Parameter

• Confidence Measure of the Compound Probability

Relative Frequency Threshold

• Makes use of relative frequency of POS for a word form

• Makes use of a threshold value (0.05%)

• If RF > Threshold: POS is used for this wordform

• If RF < Threshold: POS is rejected for this wordform

• Example: RF(bij(PREP)) = 0.999 > T, RF(bij(N)) = 0.0004 < T, only bij(PREP) is used
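
The filter above can be sketched as follows. The function name `accepted_pos` and the counts in the usage note are made up for illustration. The slides do not say what happens when RF equals the threshold, so this sketch assumes the POS is rejected in that case.

```python
def accepted_pos(pos_counts: dict[str, int], threshold: float = 0.0005) -> list[str]:
    """Return the POS tags of a wordform whose relative frequency exceeds the threshold.

    pos_counts maps each POS tag to how often the wordform occurs with that tag.
    The default threshold is 0.05%, written as a fraction.
    """
    total = sum(pos_counts.values())
    if total == 0:
        return []
    # Assumption: RF exactly equal to the threshold is rejected.
    return [pos for pos, count in pos_counts.items() if count / total > threshold]
```

With invented counts `{"PREP": 9996, "N": 4}` for "bij", the noun reading has RF = 0.0004 and is dropped, so only `"PREP"` is returned.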

Confidence Measure of Compounding Probability

• Estimation of:

P(comp(w1=mod, w2=head)) / P(comp(w1=*, w2=head))

where:

– P(comp(w1=mod, w2=head)) is the probability that two consecutive word parts form a compound rather than being 2 separate words

– P(comp(w1=*, w2=head)) is the probability of w2 being a head, with any modifier

Confidence Measure of Compounding Probability (2)

• If the compound is found in the frequency list, the ratio is estimated like this:

[Fr(comp(w1=mod, w2=head)) / Fr(comp(w1=*, w2=head))] x (1 - Dhead)

where:

– Fr(comp(w1=mod, w2=head)) is the frequency of the compound that consists of w1 + w2

– Fr(comp(w1=*, w2=head)) is the frequency of the 2nd word part as a head, with any modifier

– Dhead is the discount parameter: amount of probability reserved for words not in frequency list

Confidence Measure of Compounding Probability (3)

• Discount parameter is estimated:

Dhead = #diff(mod | head) / Fr(comp(w1=*, w2=head))

where:

– #diff(mod | head) is the number of different modifiers occurring with the given head

– Fr(comp(w1=*, w2=head)) is the frequency of the 2nd word part as a head, with any modifier

• (1-Dhead) is the amount of probability reserved for words that can be found in the frequency list

Confidence Measure of Compounding Probability (4)

• If the compound is not found in the frequency list, the ratio is estimated like this:

Dhead x [Fr(comp(w1=mod, w2=*)) / Fr(*)]

where:

– Fr(comp(w1=mod, w2=*)) is the frequency of the 1st word part as a modifier of any head

– Fr(*) is the total frequency of all words in the frequency list (= 79,862,581)

Confidence Measures: Examples

• binnen + kijken

– binnenkijken occurs in the frequency list

– Fr(w1=binnen, w2=kijken) = 10

– Fr(w1=*, w2=kijken) = 2188

– #diff(mod | head=kijken) = 21

– (10 / 2188) x (1 - 21/2188) = 0.0045

• frequentie + tabel

– frequentietabel does not occur in the frequency list

– Fr(w1=*, w2=tabel) = 141

– #diff(mod | head=tabel) = 17

– Fr(w1=frequentie, w2=*) = 15

– (17 / 141) x (15 / 79,862,581) = 2.26e-8
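
The two estimates and the discount parameter can be written out as a short sketch that reproduces the examples above. The function names are hypothetical, and the sketch assumes the head frequency is greater than zero.

```python
TOTAL_FREQ = 79_862_581  # Fr(*): total frequency of all words in the frequency list

def discount(n_diff_mods: int, fr_head: int) -> float:
    """Dhead = #diff(mod | head) / Fr(comp(w1=*, w2=head))."""
    return n_diff_mods / fr_head

def confidence_seen(fr_compound: int, fr_head: int, n_diff_mods: int) -> float:
    """Estimate for a compound that occurs in the frequency list."""
    return (fr_compound / fr_head) * (1 - discount(n_diff_mods, fr_head))

def confidence_unseen(fr_mod: int, fr_head: int, n_diff_mods: int,
                      total: int = TOTAL_FREQ) -> float:
    """Estimate for a compound that does not occur in the frequency list."""
    return discount(n_diff_mods, fr_head) * (fr_mod / total)

# binnen + kijken: compound found in the frequency list
print(confidence_seen(fr_compound=10, fr_head=2188, n_diff_mods=21))
# frequentie + tabel: compound not found in the frequency list
print(confidence_unseen(fr_mod=15, fr_head=141, n_diff_mods=17))
```

The first call gives about 0.0045 and the second about 2.26e-8, matching the figures on the slide.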

Evaluation

• Test System

• Test Results

The Test System

• Takes a regular text as input

• Converts punctuation marks into #

• For the test system, a BWL of 35,000 entries was used

• Every word is checked in BWL:

– if word is not present in BWL: word gets split up in a modifier (QWL or BWL) and a head (BWL)

– no compounding rules are used for split up procedure

– if no possible split up is found, split up in 3 parts is tried

• If a word can’t be found in BWL, and can’t be split up, it is classified as an OOV-word
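
A minimal sketch of the split-up procedure described above. The slides do not say which split is preferred when several are possible, so this sketch returns the first one found, scanning from the left. It also assumes that in a three-part split the first two parts are modifiers and the last is a BWL head. The name `analyse` and the toy lexicons in the usage note are hypothetical.

```python
from typing import Optional

def analyse(word: str, bwl: set[str], qwl: set[str]) -> Optional[list[str]]:
    """Return the word parts for a word, or None if the word is OOV."""
    if word in bwl:
        return [word]
    modifiers = bwl | qwl  # a modifier may come from either list
    # Try two parts: modifier (QWL or BWL) + head (BWL).
    for i in range(1, len(word)):
        if word[:i] in modifiers and word[i:] in bwl:
            return [word[:i], word[i:]]
    # Try three parts: modifier + modifier + head (assumed structure).
    for i in range(1, len(word) - 1):
        if word[:i] not in modifiers:
            continue
        for j in range(i + 1, len(word)):
            if word[i:j] in modifiers and word[j:] in bwl:
                return [word[:i], word[i:j], word[j:]]
    return None  # not in BWL and no split found: OOV
```

With `bwl = {"tabel", "kijken", "binnen"}` and `qwl = {"frequentie"}`, `analyse("frequentietabel", bwl, qwl)` returns `["frequentie", "tabel"]`, and a word with no possible split returns `None`.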

The Test System (2)

• For every 2 consecutive word parts, it was tested whether they can be compounded or not

• Results are compared with original text

• False compounding and false identification of noncompounds can be counted this way

• Same was done for every 3 consecutive word parts

• A threshold was set on the Confidence Measure: If Confidence Measure < Threshold, compound is rejected
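
The counting step can be sketched as below. Each test item is assumed to be a pair of the confidence measure (or `None` when the rules allow no compound) and whether the word parts form a compound in the original text. The function name `evaluate` and the data in the usage note are invented for illustration.

```python
from typing import Iterable, Optional

def evaluate(pairs: Iterable[tuple[Optional[float], bool]],
             threshold: float) -> dict[str, float]:
    """Count correct and false decisions for a given confidence threshold."""
    counts = {"correct": 0, "false_compound": 0, "missed_compound": 0}
    for confidence, is_compound in pairs:
        # A compound is rejected when its confidence is below the threshold.
        accepted = confidence is not None and confidence >= threshold
        if accepted == is_compound:
            counts["correct"] += 1
        elif accepted:
            counts["false_compound"] += 1   # compounded, but separate in the text
        else:
            counts["missed_compound"] += 1  # kept separate, but a compound in the text
    total = sum(counts.values())
    counts["accuracy"] = counts["correct"] / total if total else 0.0
    return counts
```

For example, `evaluate([(0.0045, True), (2.26e-8, True), (0.01, False), (None, False)], 0.003)` counts two correct decisions, one false compound and one missed compound.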

Test Results

• 3 test texts were used:

– Thuis (dialogue of soap series): 3415 words, 3.08% OOV, 1.47% compounds

– Aspe (chapter of a novel): 4589 words, 3.77% OOV, 6.08% compounds

– Interview (transcript of spontaneous speech): 4645 words, 0.84% OOV, 2.95% compounds

• Most of the OOVs are proper nouns or non-standard Dutch

Test Results (2)

• Correct identification of noncompounds and compounds:

– dependent on test text

– dependent on parameter thresholds

• There is a nearly perfect negative correlation (-0.98) between the optimal confidence threshold and the proportion of compounds in the test text

Test Results (3)

Text        Rel. Freq. Threshold   Confidence Threshold   % Correct Identification
Aspe        0.05                   0.003                  94.53%
Thuis       0.05                   0.003                  96.28%
Interview   0.05                   0.003                  98.47%

Conclusions

• Identifying compoundability can be done with an accuracy of 94.5 to 98.5%

• Lexical coverage can be assured with OOV rates between 0.8% and 3.8%, using a lexicon with a total size of 36,000 entries (BWL+QWL)

Conclusions (2)

• Capturing already existing compounds by automated compounding proves to be successful

• Capturing newly formed compounds proves to be a lot harder: the accuracy is a lot lower

• Automated compounding proves to be a useful means for maximizing lexical coverage
