


Automated Compounding as a means for Maximizing Lexical Coverage

Vincent Vandeghinste

Centrum voor Computerlinguïstiek

K.U. Leuven


Maximizing Lexical Coverage

• Target: Reduction of the number of OOV-words

• Means:

– accurate content and organization of the recognizer lexicon

– taking care of a number of productive word formation processes

• Evaluation:

– implementation of test tool

– test results

• Conclusions


Lexicon: Content & Organization

• Starting point: CGN-lexicon (570,000 entries)

• Reduction to one entry per wordform per POS (300,000 entries)

• Removal of compounds (160,000 entries)

• Selection of most frequent entries (40,000) => Basic Word List (BWL)

• Quasi-Word List (QWL): Compounding word parts which don't appear in BWL
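The reduction steps above can be sketched roughly as follows. This is only an illustration: the function name `build_bwl`, the `(wordform, pos, frequency)` input shape, the `is_compound` predicate, and the choice to sum the frequencies of duplicate entries are all assumptions, not details given on the slide.

```python
from collections import Counter

def build_bwl(entries, is_compound, size):
    """Build a Basic Word List from (wordform, pos, frequency) tuples.

    `is_compound` is a hypothetical predicate telling whether a wordform
    is a compound; how this was decided is not stated on the slide.
    """
    # Step 1: one entry per (wordform, POS); duplicate frequencies are summed
    # (an assumption made for this sketch).
    merged = Counter()
    for wordform, pos, freq in entries:
        merged[(wordform, pos)] += freq
    # Step 2: remove compounds.
    simple = {key: f for key, f in merged.items() if not is_compound(key[0])}
    # Step 3: keep the `size` most frequent entries (ties broken alphabetically).
    ranked = sorted(simple.items(), key=lambda kv: (-kv[1], kv[0]))
    return [key for key, _ in ranked[:size]]
```

With toy data, `build_bwl([("huis", "N", 50), ("huis", "N", 5), ("huisdeur", "N", 2)], lambda w: w == "huisdeur", 1)` keeps only `("huis", "N")`.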


Lexicon Accuracy

• Careful selection of the words in BWL:

– no compounds

– frequent words

• Organization of the lexicon: maximal applicability of compounding rules through lexicon split into BWL and QWL


Word Formation Processes

• Input: a number of word parts that may or may not be compoundable

• Hybrid approach: Rule-based + Statistical Filters

• Output:

– compound + morpho-syntactic info + confidence measure

– no compounding possible with given word parts


Word Formation Processes: Input

• From BWL: full words that can be part of a compound or can stand on their own

• From QWL: ‘words’ that can only be part of a compound

• 2 to 5 word parts


Word Formation Processes: Rules

• Making use of rules for word formation, e.g.: modifier (N) + head (N) => compound (N)

• Input from QWL: word part is N and can only be modifier

• Input from BWL: word is looked up in CGN: morpho-syntactic info is used in rules

• Rules use 2 word parts

• When input > 2 word parts: rules are applied recursively
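One way to picture binary rules applied recursively is the sketch below. It is a simplification under stated assumptions: each word part carries a single POS, no linking elements are handled, and the rule table `RULES` contains the N + N rule from the slide plus one illustrative entry (ADV + V) that is not taken from the source. The names `RULES` and `compound` are hypothetical.

```python
# (modifier POS, head POS) -> compound POS.
# ("N", "N") comes from the slide; ("ADV", "V") is an illustrative assumption.
RULES = {("N", "N"): "N", ("ADV", "V"): "V"}

def compound(parts):
    """Try to compound a list of (form, pos, modifier_only) word parts.

    Returns a (form, pos, modifier_only) triple, or None when no rule applies.
    QWL parts are marked modifier_only=True and may never be the head.
    """
    if len(parts) == 1:
        return parts[0]
    # Try every split point. Each side is first reduced to a single unit,
    # so every rule application still combines exactly two parts.
    for i in range(1, len(parts)):
        left = compound(parts[:i])
        right = compound(parts[i:])
        if left is None or right is None:
            continue
        if right[2]:  # a modifier-only part cannot serve as head
            continue
        pos = RULES.get((left[1], right[1]))
        if pos is not None:
            return (left[0] + right[0], pos, False)
    return None
```

For instance, `compound([("frequentie", "N", False), ("tabel", "N", False)])` yields `("frequentietabel", "N", False)`, while swapping in a modifier-only part as the last element yields `None`.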


Word Formation Processes: Statistics

• Relative Frequency Threshold Parameter

• Confidence Measure of the Compound Probability


Relative Frequency Threshold

• Makes use of relative frequency of POS for a word form

• Makes use of a threshold value (0.05%)

• If RF > Threshold: POS is used for this wordform

• If RF < Threshold: POS is rejected for this wordform

• Example: RF(bij(PREP)) = 0.999 > T and RF(bij(N)) = 0.0004 < T, so only bij(PREP) is used
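The filter can be sketched in a few lines. The sketch reads the 0.05% threshold as the fraction 0.0005, which agrees with the example (0.0004 < T); the function name `filter_pos` and the dictionary input are assumptions. The slide does not say what happens when RF equals the threshold; here such a POS is rejected.

```python
THRESHOLD = 0.0005  # 0.05% expressed as a fraction (assumed reading)

def filter_pos(rel_freqs, threshold=THRESHOLD):
    """Keep only the POS tags whose relative frequency exceeds the threshold.

    `rel_freqs` maps a POS tag to its relative frequency for one wordform.
    """
    return {pos for pos, rf in rel_freqs.items() if rf > threshold}

# The example from the slide: only the preposition reading of "bij" survives.
bij = {"PREP": 0.999, "N": 0.0004}
print(filter_pos(bij))  # {'PREP'}
```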


Confidence Measure of Compounding Probability

• Estimation of:

P(comp(w1=mod, w2=head)) / P(comp(w1=*, w2=head))

where:

– P(comp(w1=mod, w2=head)) is the probability that two consecutive word parts form a compound rather than being 2 separate words

– P(comp(w1=*, w2=head)) is the probability of w2 being a head, with any modifier


Confidence Measure of Compounding Probability (2)

• If the compound is found in the frequency list, the ratio is estimated like this:

[Fr(comp(w1=mod, w2=head)) / Fr(comp(w1=*, w2=head))] × (1 - Dhead)

where:

– Fr(comp(w1=mod, w2=head)) is the frequency of the compound that consists of w1 + w2

– Fr(comp(w1=*, w2=head)) is the frequency of the 2nd word part as a head, with any modifier

– Dhead is the discount parameter: amount of probability reserved for words not in frequency list


Confidence Measure of Compounding Probability (3)

• Discount parameter is estimated:

Dhead= #diff(mod | head) / Fr(comp(w1=*, w2=head))

where:

– #diff(mod | head) is the number of different modifiers occurring with the given head

– Fr(comp(w1=*, w2=head)) is the frequency of the 2nd word part as a head, with any modifier

• (1-Dhead) is the amount of probability reserved for words that can be found in the frequency list


Confidence Measure of Compounding Probability (4)

• If the compound is not found in the frequency list, the ratio is estimated like this:

Dhead × [Fr(comp(w1=mod, w2=*)) / Fr(*)]

where:

– Fr(comp(w1=mod, w2=*)) is the frequency of the 1st word part as a modifier of any head

– Fr(*) is the total frequency of all words in the frequency list (= 79,862,581)
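Putting the two cases and the discount parameter side by side may help. The block below only restates the formulas from these slides in LaTeX; the shorthand CM for the confidence measure is introduced here for convenience and does not appear in the source.

```latex
\[
\mathrm{CM}(w_1, w_2) =
\begin{cases}
\dfrac{Fr(\mathrm{comp}(w_1=\mathrm{mod},\, w_2=\mathrm{head}))}
      {Fr(\mathrm{comp}(w_1=*,\, w_2=\mathrm{head}))}
\times (1 - D_{head})
  & \text{if the compound is in the frequency list} \\[2ex]
D_{head} \times
\dfrac{Fr(\mathrm{comp}(w_1=\mathrm{mod},\, w_2=*))}{Fr(*)}
  & \text{otherwise}
\end{cases}
\]
\[
D_{head} = \frac{\#\mathrm{diff}(\mathrm{mod} \mid \mathrm{head})}
                {Fr(\mathrm{comp}(w_1=*,\, w_2=\mathrm{head}))}
\]
```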


Confidence Measures: Examples

• binnen + kijken

– binnenkijken occurs in the frequency list

– Fr(w1=binnen, w2=kijken) = 10

– Fr(w1=*, w2=kijken) = 2188

– #diff(mod | head=kijken) = 21

– (10 / 2188) × (1 - 21/2188) = 0.0045

• frequentie + tabel

– frequentietabel does not occur in the frequency list

– Fr(w1=*, w2=tabel) = 141

– #diff(mod | head=tabel) = 17

– Fr(w1=frequentie, w2=*) = 15

– (17 / 141) × (15 / 79,862,581) = 2.26e-8
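The two worked examples can be reproduced with a short function. This is a minimal sketch of the estimation as described on the slides; the function and parameter names are hypothetical, and it assumes the head has a non-zero frequency (the slides do not say what happens for an unseen head).

```python
FR_TOTAL = 79_862_581  # total frequency of all words in the frequency list

def confidence(fr_compound, fr_head, n_modifiers, fr_modifier=0, fr_total=FR_TOTAL):
    """Confidence measure for compounding a modifier with a head.

    fr_compound : frequency of the compound itself (0 if not in the list)
    fr_head     : frequency of the head with any modifier (must be > 0)
    n_modifiers : number of different modifiers seen with this head
    fr_modifier : frequency of the first part as modifier of any head
                  (only needed when the compound is not in the list)
    """
    d_head = n_modifiers / fr_head  # discount parameter Dhead
    if fr_compound > 0:
        # Compound found in the frequency list.
        return (fr_compound / fr_head) * (1 - d_head)
    # Compound not found: fall back on the reserved probability mass.
    return d_head * (fr_modifier / fr_total)

print(round(confidence(10, 2188, 21), 4))  # binnen + kijken: 0.0045
print(f"{confidence(0, 141, 17, 15):.2e}")  # frequentie + tabel: 2.26e-08
```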


Evaluation

• Test System

• Test Results


The Test System

• Takes a regular text as input

• Converts punctuation marks into #

• For the test system, a BWL of 35,000 entries was used

• Every word is checked in BWL:

– if word is not present in BWL: word gets split up in a modifier (QWL or BWL) and a head (BWL)

– no compounding rules are used for split up procedure

– if no possible split up is found, split up in 3 parts is tried

• If a word can’t be found in BWL, and can’t be split up, it is classified as an OOV-word
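The split-up procedure might look roughly like the sketch below. Several points are assumptions because the slide does not specify them: the middle part of a 3-way split is allowed to come from either list, the first split found is returned, and the names `split_word` and `_split` are hypothetical. No compounding rules are consulted, as on the slide.

```python
def split_word(word, bwl, qwl, max_parts=3):
    """Return the word's parts, or None if the word is classified as OOV.

    `bwl` and `qwl` are sets of strings. Modifiers may come from either
    list, the final part (head) must be in the BWL.
    """
    if word in bwl:
        return [word]
    modifiers = bwl | qwl
    # Try a 2-part split first, then a 3-part split.
    for n_parts in range(2, max_parts + 1):
        parts = _split(word, n_parts, modifiers, bwl)
        if parts:
            return parts
    return None  # out-of-vocabulary

def _split(word, n_parts, modifiers, heads):
    """Split `word` into exactly `n_parts` pieces, the last one being a head."""
    if n_parts == 1:
        return [word] if word in heads else None
    for i in range(1, len(word)):
        if word[:i] in modifiers:
            rest = _split(word[i:], n_parts - 1, modifiers, heads)
            if rest:
                return [word[:i]] + rest
    return None
```

With a toy lexicon `bwl = {"tabel"}` and `qwl = {"frequentie"}`, `split_word("frequentietabel", bwl, qwl)` gives `["frequentie", "tabel"]`.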


The Test System (2)

• For every 2 consecutive word parts, it was tested whether they can be compounded or not

• Results are compared with original text

• False compounding and false identification of noncompounds can be counted this way

• Same was done for every 3 consecutive word parts

• A threshold was set on the Confidence Measure: If Confidence Measure < Threshold, compound is rejected
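The comparison against the original text can be sketched as a simple accuracy count. The input shape is an assumption: each tested pair is given as its confidence measure (or `None` when the rules allow no compound) together with a flag saying whether the original text wrote the parts as one compound. The function name is hypothetical and the sample values are made up for illustration only.

```python
def identification_accuracy(pairs, threshold):
    """Share of word-part pairs whose compound/noncompound status is correct.

    pairs: list of (confidence or None, is_compound_in_original_text)
    """
    correct = 0
    for confidence, gold in pairs:
        # A compound is accepted only if the rules produced one and its
        # confidence is not below the threshold.
        predicted = confidence is not None and confidence >= threshold
        correct += predicted == gold
    return correct / len(pairs)

# Toy data, not taken from the test texts.
toy = [(0.0045, True), (0.00001, True), (None, False), (0.01, False)]
print(identification_accuracy(toy, threshold=0.003))  # 0.5
```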


Test Results

• 3 test texts were used:

– Thuis (dialogue of soap series): 3415 words, 3.08% OOV, 1.47% compounds

– Aspe (chapter of a novel): 4589 words, 3.77% OOV, 6.08% compounds

– Interview (transcript of spontaneous speech): 4645 words, 0.84% OOV, 2.95% compounds

• Most of the OOV words are proper nouns or non-standard Dutch


Test Results (2)

• Correct identification of noncompounds and compounds:

– dependent on test text

– dependent on parameter thresholds

• There is a nearly perfect negative correlation (-0.98) between the optimal confidence threshold and the proportion of compounds in the test text


Test Results (3)

Text        Rel. Freq. Threshold   Confidence Threshold   % Correct Identification
Aspe        0.05                   0.003                  94.53%
Thuis       0.05                   0.003                  96.28%
Interview   0.05                   0.003                  98.47%


Conclusions

• Identifying compoundability can be done with an accuracy between 94.5% and 98.5%

• Lexical coverage can be assured with OOV rates between 0.8% and 3.8%, using a lexicon with a total size of 36,000 entries (BWL+QWL)


Conclusions (2)

• Capturing already existing compounds by automated compounding proves to be successful

• Capturing newly formed compounds proves to be a lot harder: the accuracy is a lot lower

• Automated compounding proves to be a useful means for maximizing lexical coverage