Supervised Classification of Feature-based Instances

Slide 1: Supervised Classification of Feature-based Instances

Slide 2: Simple Examples for Statistics-based Classification

• Based on class-feature counts

• Contingency table:

          C    ~C
   f      a    b
   ~f     c    d

• We will see several examples of simple models based on these statistics

Slide 3: Prepositional-Phrase Attachment

• Simplified version of Hindle & Rooth (1993) [MS 8.3]

• Setting: V NP-chunk PP
– Moscow sent soldiers into Afghanistan
– ABC breached an agreement with XYZ

• Motivation for the classification task:
– Attachment is often a problem for (full) parsers
– Augment shallow/chunk parsers

Slide 4: Relevant Probabilities

• P(prep|n) vs. P(prep|v)

– The probability of having the preposition prep attached to an occurrence of the noun n (respectively, the verb v).

– Notice: a single feature for each class

• Example: P(into|send) vs. P(into|soldier)

• Decision measured by the likelihood ratio:

λ(v, n, prep) = log₂ ( P(prep|v) / P(prep|n) )

• Positive λ indicates verb attachment; negative λ indicates noun attachment
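A minimal sketch of this decision rule in Python (the function and variable names are illustrative, not from Hindle & Rooth):

```python
import math

def attachment_score(p_prep_given_v, p_prep_given_n):
    """Log-2 likelihood ratio lambda(v, n, prep); positive favors verb attachment."""
    return math.log2(p_prep_given_v / p_prep_given_n)

# Moscow sent soldiers into Afghanistan (estimates from the example slide below)
lam = attachment_score(0.049, 0.0007)        # ~6.1
attachment = "verb" if lam > 0 else "noun"   # positive lambda -> verb attachment
```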

Slide 5: Estimating Probabilities

• Based on attachment counts from a training corpus

• Maximum likelihood estimates:

P(prep|v) = freq_attach(prep, v) / freq(v)

P(prep|n) = freq_attach(prep, n) / freq(n)

• How to count from an unlabeled ambiguous corpus? (Circularity problem)

• Some cases are unambiguous:
– The road to London is long

– Moscow sent him to Afghanistan

Slide 6: Heuristic Bootstrapping and Ambiguous Counting

1. Produce initial estimates (model) by counting all unambiguous cases

2. Apply the initial model to all ambiguous cases; count each case under the resulting attachment if |λ| is greater than a threshold

• E.g. |λ|>2, meaning one attachment is at least 4 times more likely than the other

3. Consider each remaining ambiguous case as a 0.5 count for each attachment.

• Likely n-p and v-p pairs would “pop up” in the ambiguous counts, while incorrect attachments are likely to accumulate low counts
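A rough Python sketch of this counting procedure, assuming simplified data formats (tuples of preposition, head word, and attachment label) that are not specified in the original:

```python
import math
from collections import Counter

def bootstrap_counts(unambiguous, ambiguous, threshold=2.0):
    """Heuristic bootstrapping for PP-attachment counts (a sketch).

    unambiguous: iterable of (prep, head, attachment) with a known attachment
    ambiguous:   iterable of (verb, noun, prep) cases
    """
    att = Counter()   # att[(attachment, prep, head)] = attachment count
    freq = Counter()  # freq[head] = head-word frequency

    # Step 1: initial estimates from the unambiguous cases only
    for prep, head, attachment in unambiguous:
        att[(attachment, prep, head)] += 1
        freq[head] += 1

    def lam(v, n, prep, eps=1e-9):
        p_v = att[("verb", prep, v)] / max(freq[v], 1)
        p_n = att[("noun", prep, n)] / max(freq[n], 1)
        return math.log2((p_v + eps) / (p_n + eps))

    # Step 2: count confident ambiguous cases under the chosen attachment;
    # Step 3: split the remaining cases as 0.5 counts for each attachment
    for v, n, prep in ambiguous:
        score = lam(v, n, prep)
        if abs(score) > threshold:
            attachment, head = ("verb", v) if score > 0 else ("noun", n)
            att[(attachment, prep, head)] += 1
        else:
            att[("verb", prep, v)] += 0.5
            att[("noun", prep, n)] += 0.5
    return att, freq
```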

Slide 7: Example Decision

• Moscow sent soldiers into Afghanistan

P(into|send) = freq_attach(into, send) / freq(send) = 86 / 1742.5 ≈ 0.049

P(into|soldier) = freq_attach(into, soldier) / freq(soldier) = 1 / 1478 ≈ 0.0007

λ(send, soldier, into) = log₂ (0.049 / 0.0007) = log₂ 70 ≈ 6.1

• Verb attachment is 70 times more likely

Slide 8: Hindle & Rooth Evaluation

• H&R results for a somewhat richer model:
– 80% correct if we always make a choice
– 91.7% precision for 55.2% recall, when requiring |λ|>3 for classification

• Notice that the probability ratio doesn’t distinguish between decisions made based on high vs. low frequencies.

Slide 9: Possible Extensions

• Consider a-priori structural preference for “low” attachment (to noun)

• Consider lexical head of the PP:
– I saw the bird with the telescope
– I met the man with the telescope

• Such additional factors can be incorporated easily, assuming their independence

• Addressing more complex types of attachments, such as chains of several PP’s

• Similar attachment ambiguities within noun compounds: [N [N N]] vs. [[N N] N]

Slide 10: Classify by Best Single Feature: Decision List

• Training: for each feature, measure its “entailment score” for each class, and register the class with the highest score
– Sort all features by decreasing score

• Classification: for a given example, identify the highest entailment score among all “active” features, and select the appropriate class
– Test all features in decreasing score order until the first success; output the registered class
– Default decision: the majority class

• For multiple classes per example: a threshold may be applied on the feature-class entailment score

• Suitable when relatively few strong features indicate the class (compare to manually written rules); see the sketch below
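A compact sketch of both phases in Python; `score(f, c)` stands for the feature-class entailment score defined later (slide 14), and all names are illustrative:

```python
def train_decision_list(features, classes, score):
    """For each feature, register its best class, then sort by decreasing score."""
    rules = []
    for f in features:
        best = max(classes, key=lambda c: score(f, c))
        rules.append((score(f, best), f, best))
    rules.sort(key=lambda r: r[0], reverse=True)
    return rules

def classify(rules, active_features, majority_class):
    """Return the class of the highest-scoring active feature."""
    for _, f, c in rules:
        if f in active_features:   # first success wins
            return c
    return majority_class          # default decision
```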

Slide 11: Example: Accent Restoration

• (David Yarowsky, 1994): for French and Spanish

• Classes: alternative accent restorations for words in text without accent marking

• Example: côte (coast) vs. côté (side)

• A variant of the general word sense disambiguation problem - “one sense per collocation” motivates using decision lists

• Similar tasks:
– Capitalization restoration in ALL-CAPS text
– Homograph disambiguation in speech synthesis (wind as noun and verb)

Slide 12: Accent Restoration - Features

• Word form collocation features:
– Single words in window: ±1, ±k (k = 20-50)
– Word pairs at <-1,+1>, <-2,-1>, <+1,+2> (complex features)
– Easy to implement
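A possible Python rendering of these feature templates (the window size and template names are assumptions for illustration):

```python
def collocation_features(tokens, i, k=20):
    """Word-form features for the ambiguous token at position i (a sketch)."""
    feats = set()
    n = len(tokens)
    # single words at +/-1 and anywhere within the +/-k window
    if i > 0:
        feats.add(("w-1", tokens[i-1]))
    if i + 1 < n:
        feats.add(("w+1", tokens[i+1]))
    for j in range(max(0, i-k), min(n, i+k+1)):
        if j != i:
            feats.add(("w+-k", tokens[j]))
    # word pairs at <-1,+1>, <-2,-1>, <+1,+2>
    if 0 < i < n - 1:
        feats.add(("pair-1+1", tokens[i-1], tokens[i+1]))
    if i > 1:
        feats.add(("pair-2-1", tokens[i-2], tokens[i-1]))
    if i + 2 < n:
        feats.add(("pair+1+2", tokens[i+1], tokens[i+2]))
    return feats
```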

Slide 13: Accent Restoration - Features (cont.)

• Local syntax-based features (for Spanish):
– Use a morphological analyzer
– Lemmatized features - generalizing over inflections
– POS of adjacent words as features
– Some word classes (primarily time terms, to help with tense ambiguity for unaccented words in Spanish)

Slide 14: Accent Restoration – Decision Score

• Probabilities estimated from training statistics, taken from a corpus with accents

• Smoothing - add a small constant to all counts

• Pruning:
– Remove redundancies for efficiency: remove specific features that score lower than their generalization (domingo - WEEKDAY, w1w2 - w1)
– Cross-validation: remove features that cause more errors than correct classifications on held-out data

score(f, c) = log [ P(c|f) / P(~c|f) ]

f: feature, c: class
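For the binary (class vs. complement) case, the score with add-constant smoothing might look like this sketch (the constant `alpha` and the function name are assumptions):

```python
import math

def entailment_score(count_fc, count_f, alpha=0.1):
    """log P(c|f) / P(~c|f), with a small constant added to all counts."""
    p_c = (count_fc + alpha) / (count_f + 2 * alpha)   # two outcomes: c and ~c
    return math.log(p_c / (1 - p_c))
```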

Slide 15: “Add-1/Add-Constant” Smoothing

p_MLE(x) = c(x) / N

c(x): the count for event x (e.g. word occurrence)
N: the total count for all x ∈ X (e.g. corpus length)
p_MLE(x) = 0 for many low-probability events (sparseness)

Smoothing - discounting and redistribution:

p_S(x) = ( c(x) + λ ) / ( N + λ·|X| )

λ = 1: Laplace, assuming a uniform prior.
In natural language events: usually λ < 1.
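A one-function sketch of add-constant (Lidstone) smoothing in Python; lam=1 gives Laplace:

```python
from collections import Counter

def add_lambda_probs(counts: Counter, vocab_size: int, lam: float = 0.5):
    """Return p_S(x) = (c(x) + lam) / (N + lam * |X|)."""
    total = sum(counts.values())
    denom = total + lam * vocab_size
    return lambda x: (counts[x] + lam) / denom
```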

Slide 16: Accent Restoration – Results

• Agreement with the accented test corpus for ambiguous words: 98%
– Vs. 93% for the baseline of choosing the most frequent form
– The accented test corpus itself also includes errors

• Worked well for most of the highly ambiguous cases (see the random sample in the next slide)

• Results slightly better than Naive Bayes (which weighs multiple features)
– Consistent with a related study on binary homograph disambiguation, where combining multiple features almost always agrees with using the single best feature
– Incorporating many low-confidence features may introduce noise that would override the strong features

Slide 17: Accent Restoration – Tough Examples

Slide 18: Related Application: Anaphora Resolution

• Example: The terrorist pulled the grenade from his pocket and threw it at the policeman - it = ?

• Traditional AI-style approach: manually encoded semantic preferences/constraints on <object – verb> pairs, e.g. a concept hierarchy placing grenade under Weapon/Bombs and throw, drop under Actions/Cause_movement

(Dagan, Justeson, Lappin, Leass, Ribak 1995)

Slide 19: Statistical Approach

• Corpus (text collection):
<verb–object: throw-grenade> 20 times
<verb–object: throw-pocket> 1 time
→ a “semantic” judgment

• Statistics can be acquired from unambiguous (non-anaphoric) occurrences in a raw (English) corpus (cf. PP attachment)

• Semantic confidence combined with syntactic preferences: it → grenade

• “Language modeling” for disambiguation

Slide 20: Word Sense Disambiguation for Machine Translation

• I bought soap bars vs. I bought window bars - which sense of bar applies in each: sense1 (‘chafisa’) or sense2 (‘sorag’)?

• From a corpus (text collection):
Sense1: <noun-noun: soap-bar> 20 times, <noun-noun: chocolate-bar> 15 times
Sense2: <noun-noun: window-bar> 17 times, <noun-noun: iron-bar> 22 times

• Features: co-occurrence within distinguished syntactic relations

• “Hidden” senses – manual labeling required(?)

Slide 21: Solution: Mapping to Target Language

• English(-English)-Hebrew Dictionary:
bar1 → ‘chafisa’, bar2 → ‘sorag’; soap → ‘sabon’; window → ‘chalon’

• Map ambiguous “relations” to the second language (all possibilities) and count in a Hebrew corpus:
<noun-noun: soap-bar> → 1. <noun-noun: ‘chafisat-sabon’> 20 times; 2. <noun-noun: ‘sorag-sabon’> 0 times
<noun-noun: window-bar> → 1. <noun-noun: ‘chafisat-chalon’> 0 times; 2. <noun-noun: ‘sorag-chalon’> 15 times

• Exploiting the differences in ambiguity between the two languages

• Principle – intersecting redundancies (Dagan and Itai 1994)
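A sketch of this mapping step in Python; the data structures (dictionary and count table) are assumed formats, not from the paper:

```python
def translation_counts(relation, dictionary, target_counts):
    """Expand a source relation into all target alternatives and look up
    their corpus counts.

    relation:      e.g. ("noun-noun", "soap", "bar")
    dictionary:    word -> list of translations, e.g. {"bar": ["chafisa", "sorag"]}
    target_counts: target relation tuple -> count in the target-language corpus
    """
    rel, w1, w2 = relation
    alternatives = {}
    for t1 in dictionary.get(w1, [w1]):
        for t2 in dictionary.get(w2, [w2]):
            target = (rel, t1, t2)
            alternatives[target] = target_counts.get(target, 0)
    return alternatives  # e.g. {("noun-noun", "sabon", "chafisa"): 20, ...}
```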

Slide 22: The Selection Model

• Constructed to choose (classify) the right translation for a complete relation, rather than for each individual word at a time
– since both words in a relation might be ambiguous, having their translations dependent upon each other

• Assuming a multinomial model, under certain linguistic assumptions
– The multinomial variable: a source relation
– Each alternative translation of the relation is a possible outcome of the variable

Slide 23: An Example Sentence

• A Hebrew sentence with 3 ambiguous words:

• The alternative translations to English:

Slide 24: Example - Relational Representation

Slide 25: Selection Model

• We would like to use as a classification score the log of the odds ratio between the most probable relation i and all other alternatives (in particular, the second most probable one j):

ln( p_i / p_j )

• Estimation is based on smoothed counts

• A potential problem: the odds ratio for probabilities doesn’t reflect the absolute counts from which the probabilities were estimated.
– E.g., a count of 3 vs. a (smoothed) 0

• Solution: use a one-sided confidence interval (lower bound) for the odds ratio

Slide 26: Confidence Interval (for a proportion)

• Given an estimate, what is the confidence that the estimate is “correct”, or at least close enough to the true value?

p: the true parameter value (proportion)
p̂: the sampled proportion (considered as a variable)
n: the sample size

E(p̂) = p,   σ(p̂) = √( p(1−p) / n )

Slide 27: Confidence Interval (cont.)

• Approximating by normal distribution: the distribution of the sampled proportion (across samples) approaches a normal distribution for large n.

z_α: the number of standard deviations such that the probability of obtaining p̂ > p + z_α·σ(p̂) is α

Popular values: z_.05 = 1.645,  z_.025 = 1.96

Slide 28: Confidence Interval (cont.)

Estimation of a two-sided confidence interval with confidence 1−α (using z_{α/2} for estimating p):

p̂ − z_{α/2}·√( p̂(1−p̂)/n ) ≤ p ≤ p̂ + z_{α/2}·√( p̂(1−p̂)/n )

Estimation of a one-sided confidence interval with confidence 1−α (upper/lower bound), e.g. the lower bound:

p ≥ p̂ − z_α·√( p̂(1−p̂)/n )
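In Python, the one-sided lower bound for a proportion could be sketched as:

```python
import math

def proportion_lower_bound(successes, n, z=1.645):
    """One-sided lower confidence bound under the normal approximation;
    z = 1.645 corresponds to 95% one-sided confidence (z_.05 above)."""
    p_hat = successes / n
    return p_hat - z * math.sqrt(p_hat * (1 - p_hat) / n)
```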

Slide 29: Selection Model (cont.)

• The distribution of the log of the odds ratio (across samples) converges to a normal distribution

• Selection “confidence” score for a single relation - the lower bound for the odds ratio:

Conf(i) = ln( n_i / n_j ) − Z_{1−α} · √( 1/n_i + 1/n_j )

(with p_i / p_j estimated from the counts n_i and n_j of the two leading alternatives)

• The most probable translation i for the relation is selected if Conf(i), the lower bound for the log odds ratio, exceeds θ.

• Notice the roles of θ vs. α, and the impact of n_i, n_j
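As a sketch (assuming the reconstructed formula above, with counts n_i, n_j of the two leading alternatives):

```python
import math

def conf(n_i, n_j, z=1.645):
    """Lower confidence bound for ln(p_i/p_j), estimated from counts;
    assumes both counts are positive (smoothed if necessary)."""
    return math.log(n_i / n_j) - z * math.sqrt(1.0 / n_i + 1.0 / n_j)

# e.g. conf(20, 1) ~ 1.31: select alternative i if theta < 1.31
```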

Slide 30: Handling Multiple Relations in a Sentence: Constraint Propagation

1. Compute Conf(i) for each ambiguous source relation.

2. Pick the source relation with highest Conf(i). If Conf(i)< θ, or if no source relations left, then stop; Otherwise, select word translations according to target relation i and remove the source relation from the list.

3. Propagate the translation constraints: remove any target relation that contradicts the selections made; remove source relations that now become unambiguous.

4. Go to step 2.

• Notice similarity to the decision list algorithm
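A Python sketch of this loop; the relation objects and their methods are hypothetical stand-ins for whatever representation the parser provides:

```python
def constraint_propagation(relations, conf, theta):
    """Greedy selection loop over ambiguous source relations (a sketch).
    conf(r) scores r's best target alternative (step 1 is implicit in conf)."""
    selections = []
    pending = [r for r in relations if r.is_ambiguous()]
    while pending:
        best = max(pending, key=conf)          # step 2: highest Conf(i)
        if conf(best) < theta:
            break                              # no confident choice left: stop
        choice = best.best_target()
        selections.append(choice)
        pending.remove(best)
        for r in pending:                      # step 3: propagate constraints
            r.remove_targets_contradicting(choice)
        pending = [r for r in pending if r.is_ambiguous()]
    return selections                          # step 4 is the while loop
```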

Slide 31: Selection Algorithm Example

Slide 32: Evaluation Results

• Results - Hebrew→English translation:
– Coverage: ~70%
– Precision within coverage: ~90%
– ~20% improvement over choosing the most frequent translation (95% statistical confidence for an improvement relative to this common baseline)

Slide 33: Analysis

• Correct selections capture:
– Clear semantic preferences: sign/seal treaty
– Lexical collocation usage: peace treaty/contract

• No selection:
– Mostly: no statistics for any alternative (data sparseness)
• investigator/researcher of corruption
– Also: similar statistics for several alternatives
– Solutions:
• Consult more features in the remote (vs. syntactic) context: prime minister … take position/job
• Class/similarity-based generalizations (corruption-crime)

Slide 34: Analysis (cont.)

• Confusing multiple sources (senses) for the same target relation:
– ‘sikkuy’ (chance/prospect) ‘kattan’ (small/young)
– Valid (frequent) target relations:
• small chance - correct
• young prospect - incorrect, because “young prospect” is the translation of another Hebrew expression: ‘tikva’ (hope) ‘zeira’ (young)

• The “soundness” assumption of the multinomial model is violated:
– Counting the generated target relations is assumed to correspond to sampling the source relation, i.e. a known 1:n mapping is assumed (completeness is another source of errors)
– Potential solutions: bilingual corpus, “reverse” translation

Slide 35: Sense Translation Model: Summary

• Classification instance: a relation with multiple words, rather than a single word at a time, to capture immediate (“circular”) dependencies

• Make local decisions, based on a single feature

• Take into account the statistical confidence of decisions

• Constraint propagation for multiple dependent classifications (remote dependencies)

• Decision-list style rationale – classifying by a single piece of high-confidence evidence is simpler, and may work better, than considering all weaker evidence simultaneously
– Computing statistical confidence for a combination of multiple events is difficult; it is easier to perform for one event at a time

• Statistical classification scenario (model) constructed for the linguistic setting
– Important to identify explicitly the underlying model assumptions, and to analyze the resulting errors

Slide 36: Word Sense Disambiguation

• Many words have multiple meanings
– E.g., river bank, financial bank

• Problem: Assign proper sense to each ambiguous word in text

• Applications:
– Machine translation
– Information retrieval (mixed evidence)
– Semantic interpretation of text

Slide 37: Compare to POS Tagging?

• Idea: Treat sense disambiguation like POS tagging, just with “semantic tags”

• The problems differ:
– POS tags depend on specific structural cues - mostly neighboring, and thus dependent, tags
– Senses depend on semantic context - less structured, longer-distance dependencies → many relatively independent/unstructured features

Slide 38: Approaches

• Supervised learning: learn from a pre-tagged corpus

• Dictionary-based learning: learn to distinguish senses from dictionary entries

• Unsupervised learning: automatically cluster word occurrences into different senses

Slide 39: Using an Aligned Bilingual Corpus

• Goal: get sense tagging cheaply

• Use correlations between phrases in two languages to disambiguate
– E.g., interest = ‘legal share’ (acquire an interest) vs. ‘attention’ (show interest)
– In German: Beteiligung erwerben vs. Interesse zeigen

• For each occurrence of an ambiguous word, determine which sense applies according to the aligned translation

• Limited to senses that are discriminated by the other language; suitable for disambiguation in translation

• Gale, Church and Yarowsky (1992)

Slide 40: Evaluation

• Train and test on pre-tagged (or bilingual) texts
– Difficult to come by

• Artificial data - cheap to train and test: ‘merge’ two words to form an ‘ambiguous’ word with two ‘senses’
– E.g., replace all occurrences of door and of window with doorwindow and see if the system figures out which is which
– Useful for developing sense disambiguation methods
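A sketch of generating such pseudoword data in Python (the window size and tuple format are arbitrary choices):

```python
def make_pseudoword_corpus(tokens, w1="door", w2="window", merged="doorwindow"):
    """Build artificial WSD data by merging two words into one pseudoword;
    the original word serves as the gold 'sense' label."""
    data = []
    for i, tok in enumerate(tokens):
        if tok in (w1, w2):
            context = tokens[max(0, i-5):i] + [merged] + tokens[i+1:i+6]
            data.append((context, tok))    # (ambiguous context, gold sense)
    return data
```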

Slide 41: Performance Bounds

• How good is (say) 83.2%?

• Evaluate performance relative to lower and upper bounds:
– Baseline performance: how well does the simplest “reasonable” algorithm do? E.g., compare to selecting the most frequent sense
– Human performance: what percentage of the time do people agree on the classification?

• The nature of the senses used impacts accuracy levels