Ch. 8 Lexical Acquisition
Natural Language Processing, 10-05-2009
Introduction
In this chapter, we will look at the acquisition of more complex syntactic and semantic properties of words.
The main areas covered in this chapter are:
• Verb subcategorization
• Attachment ambiguity
• Selectional preference
• Semantic categorization
The General Goal of Lexical Acquisition
Develop algorithms and statistical techniques for filling the holes in existing machine-readable dictionaries by looking at the occurrence patterns of words in large text corpora.
There are many lexical acquisition problems besides collocations:
• Selectional preference
• Subcategorization
• Semantic categorization
Lexicon
• The part of the grammar of a language which includes the lexical entries for all the words and/or morphemes in the language, and which may also include various other information, depending on the particular theory of grammar.
• (8.1) (a) The children ate the cake with their hands.
       (b) The children ate the cake with blue icing.
Precision & Recall
Evaluation in IR makes frequent use of the notions of precision and recall, and their use has crossed over into work on evaluating Statistical NLP models.
[Figure: the selected set and the target set, with regions tp (true positives), fp, fn and tn]

                    Actual
System           Target     Not target
Selected           tp           fp
Not selected       fn           tn

• Precision = tp / (tp + fp)
• Recall = tp / (tp + fn)
F measure
Combine precision and recall in a single measure of overall performance:

    F = 1 / ( α (1/P) + (1 − α) (1/R) )

P : precision
R : recall
α : a factor which determines the weighting of P and R
α = 0.5 is often chosen for equal weighting, which gives F = 2PR / (P + R).
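As a small illustration, these definitions translate directly into code (the counts tp, fp, fn below are made-up numbers, not from the chapter):

    def precision(tp, fp):
        return tp / (tp + fp)

    def recall(tp, fn):
        return tp / (tp + fn)

    def f_measure(p, r, alpha=0.5):
        # F = 1 / (alpha/P + (1 - alpha)/R); alpha = 0.5 gives the harmonic mean 2PR/(P + R)
        return 1.0 / (alpha / p + (1.0 - alpha) / r)

    # Hypothetical evaluation counts: 60 true positives, 20 false positives, 40 false negatives
    p, r = precision(60, 20), recall(60, 40)
    print(p, r, f_measure(p, r))   # 0.75  0.6  0.666...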
Verb Subcategorization
Verbs subcategorize for different syntactic categories.
Verb subcategorization frame: a particular set of syntactic categories that a verb can appear with is called a subcategorization frame.
Each category has several subcategories that express their semantic arguments using different syntactic means.
The class of verbs with semantic arguments “theme” and “recipient” has a subcategory that expresses these arguments with an object and a prepositional phrase, and another subcategory that in addition permits a double-object construction.
• donate : object (theme) + prepositional phrase (recipient)
  He donated a large sum of money to a church.
• give : also permits the double-object construction
  He gave the church a large sum of money.
Knowing the possible subcategorization frames for verbs is important for parsing.
a. She told [the man] [where Peter grew up].
b. She found [the place [where Peter grew up]].
This information is not stored in dictionaries; 50% of parse failures can be due to missing subcategorization frames.
A simple and effective algorithm, Lerner, was proposed by Brent (1993).
Lerner (Brent, 1993)
There are two steps in this algorithm.
① Cues: define a regular pattern of words and syntactic categories which indicates the presence of the frame with high certainty.
   • Certainty is formalized as a probability of error: for a cue c_j we define an error rate ε_j.
② Hypothesis testing, done by contradiction:
   • We assume that the frame is not appropriate for the verb and call this H0 (the null hypothesis).
   • We reject the hypothesis if c_j indicates with high probability that H0 is wrong.
Cue for frame “NP NP” (transitive verb):
(OBJ | SUBJ_OBJ | CAP) (PUNC | CC)
i.e., a pronoun or capitalized word followed by punctuation or a conjunction.
Examples:
1. […] greet-V Peter-CAP ,-PUNC […]
2. I came Thursday, before the storm started.
In example 1 the cue correctly indicates frame “NP NP” for greet; if such cues occur often enough for a verb, we reject H0 (example 2 shows how a cue can fire incorrectly: came is not transitive here).
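Purely as an illustration of how such a cue can be matched (the token/tag format and the regular expression below are assumptions for this sketch, not Brent's actual implementation):

    import re

    # Hypothetical tagged tokens in word/TAG format, with tags such as V, OBJ, SUBJ_OBJ, CAP, PUNC, CC
    tagged = "you/SUBJ_OBJ greet/V Peter/CAP ,/PUNC before/CC leaving/V"

    # Cue for frame "NP NP": a verb followed by (OBJ | SUBJ_OBJ | CAP) and then (PUNC | CC)
    cue = re.compile(r"(\S+)/V\s+\S+/(?:OBJ|SUBJ_OBJ|CAP)\s+\S+/(?:PUNC|CC)")

    m = cue.search(tagged)
    if m:
        print("cue for frame 'NP NP' found after verb:", m.group(1))   # -> greet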
Hypothesis testing for H0

    p_E = P( v_i(f_j) = 0 | C(v_i, c_j) ≥ m ) = Σ_{r=m..n} (n choose r) ε_j^r (1 − ε_j)^(n−r)

n : number of times v_i occurs in the corpus
m : number of times v_i occurs with cue c_j (the cue for frame f_j)
v_i(f_j) = 0 : verb v_i does not permit frame f_j
C(v_i, c_j) : number of times that v_i occurs with cue c_j
ε_j : error rate for cue c_j

If p_E < α, then we reject H0.
Precision : close to 100% (when α = 0.02)
Recall : 47% ~ 100%
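A minimal sketch of this test in Python; the verb counts and the cue error rate below are hypothetical, and α = 0.02 follows the setting quoted above:

    from math import comb

    def p_error(n, m, eps):
        # p_E = P(verb does not permit the frame | cue seen at least m times in n occurrences)
        #     = sum_{r=m..n} (n choose r) * eps**r * (1 - eps)**(n - r)
        return sum(comb(n, r) * eps**r * (1 - eps)**(n - r) for r in range(m, n + 1))

    # Hypothetical data: the verb occurs n = 200 times, m = 20 of them with the cue,
    # and the cue has an assumed error rate eps = 0.05
    pE = p_error(200, 20, 0.05)
    if pE < 0.02:                  # significance level alpha = 0.02
        print("reject H0: assign the frame to the verb (p_E = %.4f)" % pE)
    else:
        print("insufficient evidence (p_E = %.4f)" % pE)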
Manning’s addition
• Use a tagger and run the cue detection on the output of the tagger.
• Allowing low-reliability cues and additional cues based on tagger output increases the number of cues significantly.
• The cues are more error-prone, but much more abundant.
Examples: She compared the results with earlier findings. He relies on relatives.
Table 8.3 Learned subcategorization frames
Verb      Correct   Incorrect   OALD
bridge       1          1         1
burden       2          -         2
depict       2          -         3
emanate      1          -         1
leak         1          -         5
occupy       1          -         3
remark       1          1         4
retire       2          1         5
shed         1          -         2
troop        0          -         3
Two of the errors are prepositional phrases (PPs): to bridge between and to retire in (the frame “NP in-PP” is not included in the OALD).
The incorrect frame for remark is due to sentences like:
“And here we are 10 years later with the same problems,” Mr. Smith remarked.
Attachment Ambiguity
(8.14) The children ate the cake with a spoon.
       I saw the man with a telescope.
Syntactic ambiguity: a (log) likelihood ratio is a common and good way of comparing two exclusive alternatives:

    λ(v, n, p) = log2 ( P(p | v) / P(p | n) )

Problem: this ignores the preference for attaching phrases “low” in the parse tree.
Example: Chrysler confirmed that it would end its troubled venture with Maserati.
(Here with Maserati attaches to the noun venture, the “low” attachment point, rather than to the verb.)
Event space: all V NP PP sequences. How likely is the preposition to attach to the verb or to the noun?
VA_p : is there a PP headed by p which attaches to v?
NA_p : is there a PP headed by p which attaches to n?
Both can be 1:
  He put the book on World War II on the table.
  She sent him into the nursery to gather up his toys.

    λ(v, n, p) = log2 ( P(Attach(p) = v | v, n) / P(Attach(p) = n | v, n) )
               = log2 ( P(VA_p = 1 | v) P(NA_p = 0 | n) / P(NA_p = 1 | n) )

Maximum-likelihood estimates:

    P(VA_p = 1 | v) = C(v, p) / C(v)
    P(NA_p = 1 | n) = C(n, p) / C(n)
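A sketch of the resulting attachment decision; the counts below are invented for illustration:

    from math import log2

    # Hypothetical counts from a corpus of V NP PP sequences
    C_v, C_v_p = 5000, 750    # occurrences of the verb, and of the verb with an attached p-PP
    C_n, C_n_p = 2000, 80     # occurrences of the noun, and of the noun with an attached p-PP

    P_VA = C_v_p / C_v        # P(VA_p = 1 | v)
    P_NA = C_n_p / C_n        # P(NA_p = 1 | n)

    # lambda(v, n, p) = log2( P(VA_p = 1 | v) * P(NA_p = 0 | n) / P(NA_p = 1 | n) )
    lam = log2(P_VA * (1 - P_NA) / P_NA)
    print("attach to verb" if lam > 0 else "attach to noun", round(lam, 2))   # -> attach to verb 1.85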
Model’s limitations
• Only considers the identity of the preposition, the noun and the verb.
• Considers only the most basic case of a PP immediately after an NP object which modifies either the immediately preceding n or v.
  The board approved [its acquisition] [by Royal Trustco Ltd.] [of Toronto] [for $27 a share] [at its monthly meeting]
Other attachment issues
• Attachment ambiguity in noun compounds:
  (a) [[Door bell] manufacturer] : left-branching
  (b) [Woman [aid worker]] : right-branching
Selectional Preference (or Selectional Restriction)
Most verbs prefer arguments of a particular type.
These are preferences rather than rules: exceptions occur, e.g. eat with a non-food argument, as in eating one’s words.
Acquisition of selectional preferences is important in Statistical NLP for a number of reasons:
• If durian is missing from the dictionary, we can infer part of its meaning from selectional restrictions.
• Another important use is ranking the parses of a sentence: give high scores to parses where the verb has natural arguments.
Resnik’s Model (1993, 1996)
1. Selectional Preference Strength: how strongly the verb constrains its direct object.
Two assumptions:
① Take only the head noun.
② Use classes of nouns.

    S(v) = D( P(C|v) || P(C) ) = Σ_c P(c|v) log ( P(c|v) / P(c) )

• P(C) : overall probability distribution of noun classes
• P(C|v) : probability distribution of noun classes in the direct object position of v
Table 8.5 Selectional Preference Strength
Noun class c    P(c)    P(c | eat)   P(c | see)   P(c | find)
people          0.25      0.01         0.25         0.33
furniture       0.25      0.01         0.25         0.33
food            0.25      0.97         0.25         0.33
action          0.25      0.01         0.25         0.01
SPS S(v)                  1.76         0.00         0.35
Resnik’s Model (cont’d)
2. Selectional Association between a verb v and a class c:

    A(v, c) = P(c|v) log ( P(c|v) / P(c) ) / S(v)

• A rule for assigning association strength to nouns:

    A(v, n) = max_{c ∈ classes(n)} A(v, c)

Ex) (8.31) Susan interrupted the chair.

    A(interrupt, chair) = max_{c ∈ classes(chair)} A(interrupt, c) = A(interrupt, people)
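A minimal sketch in Python that reproduces the S(v) values in Table 8.5 and then uses them for selectional association; base-2 logarithms are assumed, and the distribution for interrupt is invented for illustration since it is not in the table:

    from math import log2

    # Prior over noun classes and P(C|v) for the verbs in Table 8.5
    P_C = {"people": 0.25, "furniture": 0.25, "food": 0.25, "action": 0.25}
    P_C_given_v = {
        "eat":  {"people": 0.01, "furniture": 0.01, "food": 0.97, "action": 0.01},
        "see":  {"people": 0.25, "furniture": 0.25, "food": 0.25, "action": 0.25},
        "find": {"people": 0.33, "furniture": 0.33, "food": 0.33, "action": 0.01},
        # Hypothetical distribution for 'interrupt', invented for the example below
        "interrupt": {"people": 0.80, "furniture": 0.05, "food": 0.05, "action": 0.10},
    }

    def sps(v):
        # S(v) = sum_c P(c|v) log( P(c|v) / P(c) )   (selectional preference strength)
        return sum(p * log2(p / P_C[c]) for c, p in P_C_given_v[v].items())

    def assoc(v, c):
        # A(v, c) = P(c|v) log( P(c|v) / P(c) ) / S(v)   (selectional association)
        p = P_C_given_v[v][c]
        return p * log2(p / P_C[c]) / sps(v)

    for v in ("eat", "see", "find"):
        print(v, round(sps(v), 2))          # eat 1.76, see 0.0, find 0.35

    # A(v, n) = max over the classes containing n; chair is in both 'people' and 'furniture'
    best = max(["people", "furniture"], key=lambda c: assoc("interrupt", c))
    print("A(interrupt, chair) = A(interrupt, %s)" % best)   # -> people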
• Estimating the probabilities: P(c|v) = P(v, c) / P(v)

    P(v) = C(v) / Σ_{v'} C(v')

    P(v, c) = (1/N) C(v, c) = (1/N) Σ_{n ∈ words(c)} C(v, n) / |classes(n)|

N : total number of verb-object pairs in the corpus
words(c) : set of all nouns in class c
|classes(n)| : number of noun classes that contain n as a member
C(v, n) : number of verb-object pairs with v as the verb and n as the head of the object NP
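A sketch of these estimates from a toy list of verb-object pairs; the pairs and the class inventory below are invented for illustration:

    from collections import Counter

    # Hypothetical (verb, head noun of object) pairs extracted from a parsed corpus
    pairs = [("eat", "cake"), ("eat", "apple"), ("eat", "cake"), ("see", "movie"), ("see", "chair")]

    # Hypothetical noun classes; 'chair' belongs to two classes
    classes = {"cake": ["food"], "apple": ["food"], "movie": ["action"], "chair": ["people", "furniture"]}

    N = len(pairs)                               # total number of verb-object pairs
    C_v = Counter(v for v, _ in pairs)           # C(v)
    C_vn = Counter(pairs)                        # C(v, n)

    def P_v(v):
        return C_v[v] / N                        # P(v) = C(v) / sum_v' C(v')

    def P_vc(v, c):
        # P(v, c) = (1/N) * sum_{n in words(c)} C(v, n) / |classes(n)|
        return sum(C_vn[(v, n)] / len(cs) for n, cs in classes.items() if c in cs) / N

    def P_c_given_v(c, v):
        return P_vc(v, c) / P_v(v)               # P(c|v) = P(v, c) / P(v)

    print(P_c_given_v("food", "eat"))            # -> 1.0 (all of eat's objects fall in the food class)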
Resnik’s experiments on the Brown corpus (1996) : Table 8.6
• Left half : typical objects
• Right half : atypical objects
• For most verbs, association strength predicts which object is typical.
• Most errors the model makes are due to the fact that it performs a form of disambiguation, by choosing the highest A(v, c) for A(v, n).

Implicit object alternation
a. Mike ate the cake.
b. Mike ate.
◦ The more constraints a verb puts on its object, the more likely it is to permit the implicit-object construction.
◦ Selectional Preference Strength (SPS) is seen as the more basic phenomenon, which explains the occurrence of implicit objects as well as association strength.
The Acquisition of Meaning: Semantic Similarity
• Automatically acquiring a relative measure of how similar a new word is to known words is much easier than determining what the meaning actually is.
• Most often used for generalization, under the assumption that semantically similar words behave similarly.
  ex) Susan had never eaten a fresh durian before.
• Similarity-based generalization vs. class-based generalization:
  – Similarity-based generalization : consider the closest neighbors.
  – Class-based generalization : consider the whole class.
• Usage of semantic similarity:
  – Query expansion : astronaut → cosmonaut
  – k nearest neighbors classification (see the sketch below)
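As one concrete use, k-nearest-neighbour classification with a word-similarity function might look like the following sketch; the similarity function, contexts and labels below are placeholders, not data from the chapter:

    from collections import Counter

    def knn_classify(word, labelled, sim, k=3):
        # Label a new word by the majority label among its k most similar known words.
        neighbours = sorted(labelled, key=lambda w: sim(word, w), reverse=True)[:k]
        return Counter(labelled[w] for w in neighbours).most_common(1)[0][0]

    # Hypothetical context sets and a toy similarity based on shared context words
    contexts = {"durian": {"fruit", "eat", "fresh"}, "mango": {"fruit", "eat", "sweet"},
                "apple": {"fruit", "eat"}, "hammer": {"tool", "hit"}, "saw": {"tool", "cut"}}
    labelled = {"mango": "food", "apple": "food", "hammer": "artifact", "saw": "artifact"}
    sim = lambda a, b: len(contexts[a] & contexts[b])

    print(knn_classify("durian", labelled, sim))   # -> food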
A Notion of Semantic Similarity
• An extension of synonymy, referring to cases of near-synonymy like the pair dwelling/abode.
• Two words are from the same domain or topic.
  ex) doctor, nurse, fever, intravenous
• Judgements of semantic similarity can be explained by the degree of contextual interchangeability (Miller and Charles, 1991).
• Ambiguity presents a problem for all notions of semantic similarity: when applied to ambiguous words, semantically similar usually means ‘similar to the appropriate sense’.
  ex) litigation ≒ suit (≠ clothes)
Similarity Measures
• Vector space measures
• Probabilistic measures
Similarity measures for binary vectors (Table 8.7)
• Matching coefficient: simply counts the number of dimensions on which both vectors are non-zero.

    |X ∩ Y|

• Dice coefficient: normalizes for the length of the vectors and the total number of non-zero entries.

    2 |X ∩ Y| / ( |X| + |Y| )

• Jaccard (or Tanimoto) coefficient: penalizes a small number of shared entries more than the Dice coefficient does.

    |X ∩ Y| / |X ∪ Y|
Similarity measures for binary vectors (cont’d)
• Overlap coefficient: has a value of 1.0 if every dimension with a non-zero value for the first vector is also non-zero for the second vector.

    |X ∩ Y| / min( |X|, |Y| )

• Cosine: penalizes less in cases where the number of non-zero entries is very different.

    |X ∩ Y| / sqrt( |X| × |Y| )

Real-valued vector space
• A more powerful representation for linguistic objects.
• The length of a vector:

    |x| = sqrt( Σ_i x_i² )
Real-valued vector space (cont’d)
• The dot product between two vectors:

    x · y = Σ_i x_i y_i

• The cosine measure:

    cos(x, y) = (x · y) / ( |x| |y| ) = Σ_i x_i y_i / sqrt( Σ_i x_i² Σ_i y_i² )

• The Euclidean distance:

    |x − y| = sqrt( Σ_i (x_i − y_i)² )

Advantages of vector spaces as a representational medium:
• Simplicity.
• Computational efficiency.
Disadvantages of vector spaces:
• The measures operate on binary data, except for the cosine.
• The cosine has its own problem: it assumes a Euclidean space, and a Euclidean space is not a well-motivated choice if the vectors we are dealing with are vectors of probabilities or counts.
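A sketch of the Table 8.7 measures (binary vectors represented as sets of non-zero dimensions) together with the real-valued cosine and Euclidean distance; the example vectors below are made up:

    from math import sqrt

    # Binary vectors as sets of dimensions with non-zero values
    def matching(X, Y):   return len(X & Y)                           # |X ∩ Y|
    def dice(X, Y):       return 2 * len(X & Y) / (len(X) + len(Y))   # 2|X ∩ Y| / (|X| + |Y|)
    def jaccard(X, Y):    return len(X & Y) / len(X | Y)              # |X ∩ Y| / |X ∪ Y|
    def overlap(X, Y):    return len(X & Y) / min(len(X), len(Y))     # |X ∩ Y| / min(|X|, |Y|)
    def bin_cosine(X, Y): return len(X & Y) / sqrt(len(X) * len(Y))   # |X ∩ Y| / sqrt(|X| |Y|)

    # Real-valued vectors
    def dot(x, y):       return sum(a * b for a, b in zip(x, y))
    def cosine(x, y):    return dot(x, y) / sqrt(dot(x, x) * dot(y, y))
    def euclidean(x, y): return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    X, Y = {"Soviet", "spacewalking"}, {"American", "spacewalking"}
    print(matching(X, Y), dice(X, Y), jaccard(X, Y), overlap(X, Y), bin_cosine(X, Y))
    # -> 1  0.5  0.333...  0.5  0.5

    x, y = [2.0, 0.0, 3.0], [0.0, 1.0, 3.0]
    print(round(cosine(x, y), 3), round(euclidean(x, y), 3))   # -> 0.789  2.236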
Transform semantic similarity into the similarity of two probability distributions.
Transform the matrices of counts in Figures 8.3, 8.4 and 8.5 into matrices of conditional probabilities.
  Ex) (American, astronaut): P(American | astronaut) = 1/2 = 0.5
Measures of (dis-)similarity between probability distributions (Table 8.9)
• Three measures of dissimilarity between probability distributions were investigated by Dagan et al. (1997).
1. KL divergence

    D(p || q) = Σ_i p_i log ( p_i / q_i )

  – Measures how much information is lost if we assume distribution q when the true distribution is p.
  – Two problems for practical applications:
      It is infinite when q_i = 0 and p_i ≠ 0.
      It is asymmetric: D(p || q) ≠ D(q || p).
Measures of similarity between probability distributions (cont’d)
2. Information radius (IRad)

    IRad(p, q) = D( p || (p + q)/2 ) + D( q || (p + q)/2 )

  – Symmetric, and no problem with infinite values.
  – Measures how much information is lost if we describe the two words that correspond to p and q with their average distribution.
3. L1 norm

    (1/2) Σ_i | p_i − q_i |

  – A measure of the expected proportion of events that are going to be different between the distributions p and q.
• L1 norm example (from Figure 8.5):
  p1 = P(Soviet | cosmonaut) = 0.5, p2 = 0, p3 = P(spacewalking | cosmonaut) = 0.5
  q1 = 0, q2 = P(American | astronaut) = 0.5, q3 = P(spacewalking | astronaut) = 0.5

    (1/2) Σ_i | p_i − q_i | = (1/2) (0.5 + 0.5 + 0) = 0.5
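A minimal sketch that reproduces the L1 value above and also evaluates the other two measures on the same pair of distributions; zero probabilities are left unsmoothed, so the KL divergence comes out infinite here, illustrating the practical problem noted earlier:

    from math import log2, inf

    p = [0.5, 0.0, 0.5]   # P(. | cosmonaut) over (Soviet, American, spacewalking)
    q = [0.0, 0.5, 0.5]   # P(. | astronaut)

    def kl(p, q):
        # D(p || q) = sum_i p_i log(p_i / q_i); infinite if some q_i = 0 where p_i > 0
        total = 0.0
        for pi, qi in zip(p, q):
            if pi == 0:
                continue
            if qi == 0:
                return inf
            total += pi * log2(pi / qi)
        return total

    def irad(p, q):
        # IRad(p, q) = D(p || (p+q)/2) + D(q || (p+q)/2)
        m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
        return kl(p, m) + kl(q, m)

    def l1(p, q):
        # (1/2) * sum_i |p_i - q_i|
        return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

    print(kl(p, q), irad(p, q), l1(p, q))   # -> inf  1.0  0.5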
Lexical acquisition plays a key role in Statistical NLP because available lexical resources are always lacking in some way.
• The cost of building lexical resources manually is high.
• The quantitative part of lexical acquisition almost always has to be done automatically.
• Many lexical resources were designed for human consumption.
The best solution: the augmentation of a manual resource by automatic means.
• The main reason: the inherent productivity of language.
Look harder for sources of prior knowledge that can constrain the process of lexical acquisition.
Much of the hard work of lexical acquisition will be in building interfaces that admit easy specification of prior knowledge and easy correction of mistakes made in automatic learning.
Linguistic theory, an important source of prior knowledge, has been surprisingly underutilized in Statistical NLP.
Dictionaries are only one source of information that can be important in lexical acquisition in addition to text corpora.
(Other sources: encyclopedias, thesauri, gazetteers, collections of technical vocabulary, etc.)
If we succeed in emulating human acquisition of language by tapping into this rich source of information, then a breakthrough in the effectiveness of lexical acquisition can be expected.