Ch. 8 Lexical Acquisition
Natural Language Processing, 10-05-2009
Introduction
In this chapter, we will look at the acquisition of more complex syntactic and semantic properties of words.
The main areas covered in this chapter are:
• Verb subcategorization
• Attachment ambiguity
• Selectional preference
• Semantic categorization
The General Goal of Lexical Acquisition
Develop algorithms and statistical techniques for filling the holes in existing machine-readable dictionaries by looking at the occurrence patterns of words in large text corpora.
There are many lexical acquisition problems besides collocations:
• Selectional preference
• Subcategorization
• Semantic categorization
Lexicon
• The part of the grammar of a language which includes the lexical entries for all the words and/or morphemes in the language, and which may also include various other information, depending on the particular theory of grammar.
• (8.1) (a) The children ate the cake with their hands.
       (b) The children ate the cake with blue icing.
Precision & Recall
Evaluation in IR makes frequent use of the notions of precision and recall, and their use has crossed over into work on evaluating Statistical NLP models.
[Figure: the selected set and the target set, with regions tp (true positives), fp, fn and tn]

                    Actual
System           Target     Not target
Selected           tp           fp
Not selected       fn           tn

• Precision = tp / (tp + fp)
• Recall = tp / (tp + fn)
F measure
Combine precision and recall in a single measure of overall performance:

    F = 1 / ( α (1/P) + (1 − α) (1/R) )

P : precision
R : recall
α : a factor which determines the weighting of P and R
α = 0.5 is often chosen for equal weighting, which gives F = 2PR / (P + R).
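As a small illustration, these definitions translate directly into code (the counts tp, fp, fn below are made-up numbers, not from the chapter):

    def precision(tp, fp):
        return tp / (tp + fp)

    def recall(tp, fn):
        return tp / (tp + fn)

    def f_measure(p, r, alpha=0.5):
        # F = 1 / (alpha/P + (1 - alpha)/R); alpha = 0.5 gives the harmonic mean 2PR/(P + R)
        return 1.0 / (alpha / p + (1.0 - alpha) / r)

    # Hypothetical evaluation counts: 60 true positives, 20 false positives, 40 false negatives
    p, r = precision(60, 20), recall(60, 40)
    print(p, r, f_measure(p, r))   # 0.75  0.6  0.666...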
Verb Subcategorization
Verbs subcategorize for different syntactic categories.
Verb subcategorization frame: a particular set of syntactic categories that a verb can appear with is called a subcategorization frame.
Each category has several subcategories that express their semantic arguments using different syntactic means.
The class of verbs with semantic arguments “theme” and “recipient” has a subcategory that expresses these arguments with an object and a prepositional phrase, and another subcategory that in addition permits a double-object construction.
• donate : object (theme) + prepositional phrase (recipient)
  He donated a large sum of money to a church.
• give : also permits the double-object construction
  He gave the church a large sum of money.
Knowing the possible subcategorization frames for verbs is important for parsing.
a. She told [the man] [where Peter grew up].
b. She found [the place [where Peter grew up]].
This information is not stored in dictionaries; 50% of parse failures can be due to missing subcategorization frames.
A simple and effective algorithm, Lerner, was proposed by Brent (1993).
Lerner (Brent, 1993)
There are two steps in this algorithm.
① Cues: define a regular pattern of words and syntactic categories which indicates the presence of the frame with high certainty.
   • Certainty is formalized as a probability of error: for a cue c_j we define an error rate ε_j.
② Hypothesis testing, done by contradiction:
   • We assume that the frame is not appropriate for the verb and call this H0 (the null hypothesis).
   • We reject the hypothesis if c_j indicates with high probability that H0 is wrong.
Cue for frame “NP NP” (transitive verb):
(OBJ | SUBJ_OBJ | CAP) (PUNC | CC)
i.e., a pronoun or capitalized word followed by punctuation or a conjunction.
Examples:
1. […] greet-V Peter-CAP ,-PUNC […]
2. I came Thursday, before the storm started.
In example 1 the cue correctly indicates frame “NP NP” for greet; if such cues occur often enough for a verb, we reject H0 (example 2 shows how a cue can fire incorrectly: came is not transitive here).
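Purely as an illustration of how such a cue can be matched (the token/tag format and the regular expression below are assumptions for this sketch, not Brent's actual implementation):

    import re

    # Hypothetical tagged tokens in word/TAG format, with tags such as V, OBJ, SUBJ_OBJ, CAP, PUNC, CC
    tagged = "you/SUBJ_OBJ greet/V Peter/CAP ,/PUNC before/CC leaving/V"

    # Cue for frame "NP NP": a verb followed by (OBJ | SUBJ_OBJ | CAP) and then (PUNC | CC)
    cue = re.compile(r"(\S+)/V\s+\S+/(?:OBJ|SUBJ_OBJ|CAP)\s+\S+/(?:PUNC|CC)")

    m = cue.search(tagged)
    if m:
        print("cue for frame 'NP NP' found after verb:", m.group(1))   # -> greet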
Hypothesis testing for H0

    p_E = P( v_i(f_j) = 0 | C(v_i, c_j) ≥ m ) = Σ_{r=m..n} (n choose r) ε_j^r (1 − ε_j)^(n−r)

n : number of times v_i occurs in the corpus
m : number of times v_i occurs with cue c_j (the cue for frame f_j)
v_i(f_j) = 0 : verb v_i does not permit frame f_j
C(v_i, c_j) : number of times that v_i occurs with cue c_j
ε_j : error rate for cue c_j

If p_E < α, then we reject H0.
Precision : close to 100% (when α = 0.02)
Recall : 47% ~ 100%
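A minimal sketch of this test in Python; the verb counts and the cue error rate below are hypothetical, and α = 0.02 follows the setting quoted above:

    from math import comb

    def p_error(n, m, eps):
        # p_E = P(verb does not permit the frame | cue seen at least m times in n occurrences)
        #     = sum_{r=m..n} (n choose r) * eps**r * (1 - eps)**(n - r)
        return sum(comb(n, r) * eps**r * (1 - eps)**(n - r) for r in range(m, n + 1))

    # Hypothetical data: the verb occurs n = 200 times, m = 20 of them with the cue,
    # and the cue has an assumed error rate eps = 0.05
    pE = p_error(200, 20, 0.05)
    if pE < 0.02:                  # significance level alpha = 0.02
        print("reject H0: assign the frame to the verb (p_E = %.4f)" % pE)
    else:
        print("insufficient evidence (p_E = %.4f)" % pE)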
Manning’s addition
• Use a tagger and run the cue detection on the output of the tagger.
• Allowing low-reliability cues and additional cues based on tagger output increases the number of cues significantly.
• The cues are more error-prone, but much more abundant.
Examples: She compared the results with earlier findings. He relies on relatives.
Table 8.3 Learned subcategorization frames
Verb      Correct   Incorrect   OALD
bridge       1          1         1
burden       2          -         2
depict       2          -         3
emanate      1          -         1
leak         1          -         5
occupy       1          -         3
remark       1          1         4
retire       2          1         5
shed         1          -         2
troop        0          -         3
Two of the errors are prepositional phrases (PPs): to bridge between and to retire in (the frame “NP in-PP” is not included in the OALD).
The incorrect frame for remark is due to sentences like:
“And here we are 10 years later with the same problems,” Mr. Smith remarked.
Attachment Ambiguity
(8.14) The children ate the cake with a spoon.
       I saw the man with a telescope.
Syntactic ambiguity: a (log) likelihood ratio is a common and good way of comparing two exclusive alternatives:

    λ(v, n, p) = log2 ( P(p | v) / P(p | n) )

Problem: this ignores the preference for attaching phrases “low” in the parse tree.
Example: Chrysler confirmed that it would end its troubled venture with Maserati.
(Here with Maserati attaches to the noun venture, the “low” attachment point, rather than to the verb.)
Event space: all V NP PP sequences. How likely is the preposition to attach to the verb or to the noun?
VA_p : is there a PP headed by p which attaches to v?
NA_p : is there a PP headed by p which attaches to n?
Both can be 1:
  He put the book on World War II on the table.
  She sent him into the nursery to gather up his toys.

    λ(v, n, p) = log2 ( P(Attach(p) = v | v, n) / P(Attach(p) = n | v, n) )
               = log2 ( P(VA_p = 1 | v) P(NA_p = 0 | n) / P(NA_p = 1 | n) )

Maximum-likelihood estimates:

    P(VA_p = 1 | v) = C(v, p) / C(v)
    P(NA_p = 1 | n) = C(n, p) / C(n)
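A sketch of the resulting attachment decision; the counts below are invented for illustration:

    from math import log2

    # Hypothetical counts from a corpus of V NP PP sequences
    C_v, C_v_p = 5000, 750    # occurrences of the verb, and of the verb with an attached p-PP
    C_n, C_n_p = 2000, 80     # occurrences of the noun, and of the noun with an attached p-PP

    P_VA = C_v_p / C_v        # P(VA_p = 1 | v)
    P_NA = C_n_p / C_n        # P(NA_p = 1 | n)

    # lambda(v, n, p) = log2( P(VA_p = 1 | v) * P(NA_p = 0 | n) / P(NA_p = 1 | n) )
    lam = log2(P_VA * (1 - P_NA) / P_NA)
    print("attach to verb" if lam > 0 else "attach to noun", round(lam, 2))   # -> attach to verb 1.85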
Model’s limitations
• Only considers the identity of the preposition, the noun and the verb.
• Considers only the most basic case of a PP immediately after an NP object which modifies either the immediately preceding n or v.
  The board approved [its acquisition] [by Royal Trustco Ltd.] [of Toronto] [for $27 a share] [at its monthly meeting]
Other attachment issues
• Attachment ambiguity in noun compounds:
  (a) [[Door bell] manufacturer] : left-branching
  (b) [Woman [aid worker]] : right-branching
Selectional Preference (or Selectional Restriction)
Most verbs prefer arguments of a particular type.
These are preferences rather than rules: exceptions occur, e.g. eat with a non-food argument, as in eating one’s words.
Acquisition of selectional preferences is important in Statistical NLP for a number of reasons:
• If durian is missing from the dictionary, we can infer part of its meaning from selectional restrictions.
• Another important use is ranking the parses of a sentence: give high scores to parses where the verb has natural arguments.
Resnik’s Model (1993, 1996)
1. Selectional Preference Strength: how strongly the verb constrains its direct object.
Two assumptions:
① Take only the head noun.
② Use classes of nouns.

    S(v) = D( P(C|v) || P(C) ) = Σ_c P(c|v) log ( P(c|v) / P(c) )

• P(C) : overall probability distribution of noun classes
• P(C|v) : probability distribution of noun classes in the direct object position of v
Table 8.5 Selectional Preference Strength
Noun class c    P(c)    P(c | eat)   P(c | see)   P(c | find)
people          0.25      0.01         0.25         0.33
furniture       0.25      0.01         0.25         0.33
food            0.25      0.97         0.25         0.33
action          0.25      0.01         0.25         0.01
SPS S(v)                  1.76         0.00         0.35
Resnik’s Model (cont’d)
2. Selectional Association between a verb v and a class c:

    A(v, c) = P(c|v) log ( P(c|v) / P(c) ) / S(v)

• A rule for assigning association strength to nouns:

    A(v, n) = max_{c ∈ classes(n)} A(v, c)

Ex) (8.31) Susan interrupted the chair.

    A(interrupt, chair) = max_{c ∈ classes(chair)} A(interrupt, c) = A(interrupt, people)
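A minimal sketch in Python that reproduces the S(v) values in Table 8.5 and then uses them for selectional association; base-2 logarithms are assumed, and the distribution for interrupt is invented for illustration since it is not in the table:

    from math import log2

    # Prior over noun classes and P(C|v) for the verbs in Table 8.5
    P_C = {"people": 0.25, "furniture": 0.25, "food": 0.25, "action": 0.25}
    P_C_given_v = {
        "eat":  {"people": 0.01, "furniture": 0.01, "food": 0.97, "action": 0.01},
        "see":  {"people": 0.25, "furniture": 0.25, "food": 0.25, "action": 0.25},
        "find": {"people": 0.33, "furniture": 0.33, "food": 0.33, "action": 0.01},
        # Hypothetical distribution for 'interrupt', invented for the example below
        "interrupt": {"people": 0.80, "furniture": 0.05, "food": 0.05, "action": 0.10},
    }

    def sps(v):
        # S(v) = sum_c P(c|v) log( P(c|v) / P(c) )   (selectional preference strength)
        return sum(p * log2(p / P_C[c]) for c, p in P_C_given_v[v].items())

    def assoc(v, c):
        # A(v, c) = P(c|v) log( P(c|v) / P(c) ) / S(v)   (selectional association)
        p = P_C_given_v[v][c]
        return p * log2(p / P_C[c]) / sps(v)

    for v in ("eat", "see", "find"):
        print(v, round(sps(v), 2))          # eat 1.76, see 0.0, find 0.35

    # A(v, n) = max over the classes containing n; chair is in both 'people' and 'furniture'
    best = max(["people", "furniture"], key=lambda c: assoc("interrupt", c))
    print("A(interrupt, chair) = A(interrupt, %s)" % best)   # -> people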
• Estimating the probabilities: P(c|v) = P(v, c) / P(v)

    P(v) = C(v) / Σ_{v'} C(v')

    P(v, c) = (1/N) C(v, c) = (1/N) Σ_{n ∈ words(c)} C(v, n) / |classes(n)|

N : total number of verb-object pairs in the corpus
words(c) : set of all nouns in class c
|classes(n)| : number of noun classes that contain n as a member
C(v, n) : number of verb-object pairs with v as the verb and n as the head of the object NP
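A sketch of these estimates from a toy list of verb-object pairs; the pairs and the class inventory below are invented for illustration:

    from collections import Counter

    # Hypothetical (verb, head noun of object) pairs extracted from a parsed corpus
    pairs = [("eat", "cake"), ("eat", "apple"), ("eat", "cake"), ("see", "movie"), ("see", "chair")]

    # Hypothetical noun classes; 'chair' belongs to two classes
    classes = {"cake": ["food"], "apple": ["food"], "movie": ["action"], "chair": ["people", "furniture"]}

    N = len(pairs)                               # total number of verb-object pairs
    C_v = Counter(v for v, _ in pairs)           # C(v)
    C_vn = Counter(pairs)                        # C(v, n)

    def P_v(v):
        return C_v[v] / N                        # P(v) = C(v) / sum_v' C(v')

    def P_vc(v, c):
        # P(v, c) = (1/N) * sum_{n in words(c)} C(v, n) / |classes(n)|
        return sum(C_vn[(v, n)] / len(cs) for n, cs in classes.items() if c in cs) / N

    def P_c_given_v(c, v):
        return P_vc(v, c) / P_v(v)               # P(c|v) = P(v, c) / P(v)

    print(P_c_given_v("food", "eat"))            # -> 1.0 (all of eat's objects fall in the food class)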
Resnik’s experiments on the Brown corpus (1996) : Table 8.6
• Left half : typical objects
• Right half : atypical objects
• For most verbs, association strength predicts which object is typical.
• Most errors the model makes are due to the fact that it performs a form of disambiguation, by choosing the highest A(v, c) for A(v, n).

Implicit object alternation
a. Mike ate the cake.
b. Mike ate.
◦ The more constraints a verb puts on its object, the more likely it is to permit the implicit-object construction.
◦ Selectional Preference Strength (SPS) is seen as the more basic phenomenon, which explains the occurrence of implicit objects as well as association strength.
The Acquisition of Meaning: Semantic Similarity
• Automatically acquiring a relative measure of how similar a new word is to known words is much easier than determining what the meaning actually is.
• Most often used for generalization, under the assumption that semantically similar words behave similarly.
  ex) Susan had never eaten a fresh durian before.
• Similarity-based generalization vs. class-based generalization:
  – Similarity-based generalization : consider the closest neighbors.
  – Class-based generalization : consider the whole class.
• Usage of semantic similarity:
  – Query expansion : astronaut → cosmonaut
  – k nearest neighbors classification (see the sketch below)
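As one concrete use, k-nearest-neighbour classification with a word-similarity function might look like the following sketch; the similarity function, contexts and labels below are placeholders, not data from the chapter:

    from collections import Counter

    def knn_classify(word, labelled, sim, k=3):
        # Label a new word by the majority label among its k most similar known words.
        neighbours = sorted(labelled, key=lambda w: sim(word, w), reverse=True)[:k]
        return Counter(labelled[w] for w in neighbours).most_common(1)[0][0]

    # Hypothetical context sets and a toy similarity based on shared context words
    contexts = {"durian": {"fruit", "eat", "fresh"}, "mango": {"fruit", "eat", "sweet"},
                "apple": {"fruit", "eat"}, "hammer": {"tool", "hit"}, "saw": {"tool", "cut"}}
    labelled = {"mango": "food", "apple": "food", "hammer": "artifact", "saw": "artifact"}
    sim = lambda a, b: len(contexts[a] & contexts[b])

    print(knn_classify("durian", labelled, sim))   # -> food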
A Notion of Semantic Similarity
• An extension of synonymy, referring to cases of near-synonymy like the pair dwelling/abode.
• Two words are from the same domain or topic.
  ex) doctor, nurse, fever, intravenous
• Judgements of semantic similarity can be explained by the degree of contextual interchangeability (Miller and Charles, 1991).
• Ambiguity presents a problem for all notions of semantic similarity: when applied to ambiguous words, semantically similar usually means ‘similar to the appropriate sense’.
  ex) litigation ≒ suit (≠ clothes)
Similarity Measures
• Vector space measures
• Probabilistic measures
Similarity measures for binary vectors (Table 8.7)
• Matching coefficient: simply counts the number of dimensions on which both vectors are non-zero.

    |X ∩ Y|

• Dice coefficient: normalizes for the length of the vectors and the total number of non-zero entries.

    2 |X ∩ Y| / ( |X| + |Y| )

• Jaccard (or Tanimoto) coefficient: penalizes a small number of shared entries more than the Dice coefficient does.

    |X ∩ Y| / |X ∪ Y|
Similarity measures for binary vectors (cont’d)
• Overlap coefficient: has a value of 1.0 if every dimension with a non-zero value for the first vector is also non-zero for the second vector.

    |X ∩ Y| / min( |X|, |Y| )

• Cosine: penalizes less in cases where the number of non-zero entries is very different.

    |X ∩ Y| / sqrt( |X| × |Y| )

Real-valued vector space
• A more powerful representation for linguistic objects.
• The length of a vector:

    |x| = sqrt( Σ_i x_i² )
Real-valued vector space (cont’d)
• The dot product between two vectors:

    x · y = Σ_i x_i y_i

• The cosine measure:

    cos(x, y) = (x · y) / ( |x| |y| ) = Σ_i x_i y_i / sqrt( Σ_i x_i² Σ_i y_i² )

• The Euclidean distance:

    |x − y| = sqrt( Σ_i (x_i − y_i)² )

Advantages of vector spaces as a representational medium:
• Simplicity.
• Computational efficiency.
Disadvantages of vector spaces:
• The measures operate on binary data, except for the cosine.
• The cosine has its own problem: it assumes a Euclidean space, and a Euclidean space is not a well-motivated choice if the vectors we are dealing with are vectors of probabilities or counts.
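A sketch of the Table 8.7 measures (binary vectors represented as sets of non-zero dimensions) together with the real-valued cosine and Euclidean distance; the example vectors below are made up:

    from math import sqrt

    # Binary vectors as sets of dimensions with non-zero values
    def matching(X, Y):   return len(X & Y)                           # |X ∩ Y|
    def dice(X, Y):       return 2 * len(X & Y) / (len(X) + len(Y))   # 2|X ∩ Y| / (|X| + |Y|)
    def jaccard(X, Y):    return len(X & Y) / len(X | Y)              # |X ∩ Y| / |X ∪ Y|
    def overlap(X, Y):    return len(X & Y) / min(len(X), len(Y))     # |X ∩ Y| / min(|X|, |Y|)
    def bin_cosine(X, Y): return len(X & Y) / sqrt(len(X) * len(Y))   # |X ∩ Y| / sqrt(|X| |Y|)

    # Real-valued vectors
    def dot(x, y):       return sum(a * b for a, b in zip(x, y))
    def cosine(x, y):    return dot(x, y) / sqrt(dot(x, x) * dot(y, y))
    def euclidean(x, y): return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    X, Y = {"Soviet", "spacewalking"}, {"American", "spacewalking"}
    print(matching(X, Y), dice(X, Y), jaccard(X, Y), overlap(X, Y), bin_cosine(X, Y))
    # -> 1  0.5  0.333...  0.5  0.5

    x, y = [2.0, 0.0, 3.0], [0.0, 1.0, 3.0]
    print(round(cosine(x, y), 3), round(euclidean(x, y), 3))   # -> 0.789  2.236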
Transform semantic similarity into the similarity of two probability distributions.
Transform the matrices of counts in Figures 8.3, 8.4 and 8.5 into matrices of conditional probabilities.
  Ex) (American, astronaut): P(American | astronaut) = 1/2 = 0.5
Measures of (dis-)similarity between probability distributions (Table 8.9)
• Three measures of dissimilarity between probability distributions were investigated by Dagan et al. (1997).
1. KL divergence

    D(p || q) = Σ_i p_i log ( p_i / q_i )

  – Measures how much information is lost if we assume distribution q when the true distribution is p.
  – Two problems for practical applications:
      It is infinite when q_i = 0 and p_i ≠ 0.
      It is asymmetric: D(p || q) ≠ D(q || p).
Measures of similarity between probability distributions (cont’d)
2. Information radius (IRad)

    IRad(p, q) = D( p || (p + q)/2 ) + D( q || (p + q)/2 )

  – Symmetric, and no problem with infinite values.
  – Measures how much information is lost if we describe the two words that correspond to p and q with their average distribution.
3. L1 norm

    (1/2) Σ_i | p_i − q_i |

  – A measure of the expected proportion of events that are going to be different between the distributions p and q.
• L1 norm example (from Figure 8.5):
  p1 = P(Soviet | cosmonaut) = 0.5, p2 = 0, p3 = P(spacewalking | cosmonaut) = 0.5
  q1 = 0, q2 = P(American | astronaut) = 0.5, q3 = P(spacewalking | astronaut) = 0.5

    (1/2) Σ_i | p_i − q_i | = (1/2) (0.5 + 0.5 + 0) = 0.5
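A minimal sketch that reproduces the L1 value above and also evaluates the other two measures on the same pair of distributions; zero probabilities are left unsmoothed, so the KL divergence comes out infinite here, illustrating the practical problem noted earlier:

    from math import log2, inf

    p = [0.5, 0.0, 0.5]   # P(. | cosmonaut) over (Soviet, American, spacewalking)
    q = [0.0, 0.5, 0.5]   # P(. | astronaut)

    def kl(p, q):
        # D(p || q) = sum_i p_i log(p_i / q_i); infinite if some q_i = 0 where p_i > 0
        total = 0.0
        for pi, qi in zip(p, q):
            if pi == 0:
                continue
            if qi == 0:
                return inf
            total += pi * log2(pi / qi)
        return total

    def irad(p, q):
        # IRad(p, q) = D(p || (p+q)/2) + D(q || (p+q)/2)
        m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
        return kl(p, m) + kl(q, m)

    def l1(p, q):
        # (1/2) * sum_i |p_i - q_i|
        return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

    print(kl(p, q), irad(p, q), l1(p, q))   # -> inf  1.0  0.5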
Lexical acquisition plays a key role in Statistical NLP because available lexical resources are always lacking in some way.
• The cost of building lexical resources manually is high.
• The quantitative part of lexical acquisition almost always has to be done automatically.
• Many lexical resources were designed for human consumption.
The best solution: the augmentation of a manual resource by automatic means.
• The main reason: the inherent productivity of language.
Look harder for sources of prior knowledge that can constrain the process of lexical acquisition.
Much of the hard work of lexical acquisition will be in building interfaces that admit easy specification of prior knowledge and easy correction of mistakes made in automatic learning.
Linguistic theory, an important source of prior knowledge, has been surprisingly underutilized in Statistical NLP.
Dictionaries are only one source of information that can be important in lexical acquisition in addition to text corpora.
(Other sources: encyclopedias, thesauri, gazetteers, collections of technical vocabulary, etc.)
If we succeed in emulating human acquisition of language by tapping into this rich source of information, then a breakthrough in the effectiveness of lexical acquisition can be expected.