Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing
Preslav Nakov and Marti Hearst, Computer Science Division and SIMS, University of California, Berkeley. Supported by NSF DBI-0317510 and a gift from Genentech.
![Page 1: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/1.jpg)
Search Engine Statistics Beyond the n-gram:
Application to Noun Compound Bracketing
Preslav Nakov and Marti Hearst
Computer Science Division and SIMS
University of California, Berkeley
Supported by NSF DBI-0317510 and a gift from Genentech
![Page 2: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/2.jpg)
Overview
• Unsupervised algorithm
  • Applied here to noun compound bracketing, but promising for structural ambiguity generally
• Features
  • n-grams, χ², MI
  • Beyond the n-gram: surface features, paraphrases
• State-of-the-art accuracy
![Page 3: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/3.jpg)
Noun Compound Bracketing
(a) [ [ liver cell ] antibody ] (left bracketing)
(b) [ liver [cell line] ] (right bracketing)
In (a), the antibody targets the liver cell. In (b), the cell line is derived from the liver.
![Page 4: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/4.jpg)
Related Work
• Marcus (1980), Pustejovsky & al. (1993), Resnik (1993)
  • adjacency model: Pr(w1|w2) vs. Pr(w2|w3)
• Lauer (1995)
  • dependency model: Pr(w1|w2) vs. Pr(w1|w3)
• Keller & Lapata (2004)
  • use the Web
  • unigrams and bigrams
• Girju & al. (2005)
  • supervised model
  • bracketing in context
  • requires WordNet senses to be given
• Here Pr(w1|w2) is the probability that w1 precedes w2
• This work: χ², Web, n-grams, paraphrases, surface features
![Page 5: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/5.jpg)
Adjacency & Dependency (1)
• right bracketing: [w1 [w2 w3]]
  • w2 w3 is a compound (modified by w1), e.g. home health care
  • w1 and w2 independently modify w3, e.g. adult male rat
• left bracketing: [[w1 w2] w3]
  • only 1 modificational choice possible, e.g. law enforcement officer
![Page 6: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/6.jpg)
Adjacency & Dependency (2)
• right bracketing: [w1 [w2 w3]]
  • w2 w3 is a compound (modified by w1)
  • w1 and w2 independently modify w3
• adjacency model
  • Is w2 w3 a compound? (vs. w1 w2 being a compound)
• dependency model
  • Does w1 modify w3? (vs. w1 modifying w2)
![Page 7: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/7.jpg)
Frequencies
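• Adjacency model
  • Compare #(w1,w2) to #(w2,w3): left if the first is larger, right if the second is larger
• Dependency model
  • Compare #(w1,w2) to #(w1,w3): left if the first is larger, right if the second is larger
• #(w1,w2) is the frequency of w1 w2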
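The two frequency comparisons can be sketched in a few lines. This is only an illustration, assuming the bigram counts have already been collected into a dictionary; the function name `bracket_by_frequency` and the toy counts are hypothetical and do not come from the slides.

```python
from typing import Dict, Tuple

Counts = Dict[Tuple[str, str], int]


def bracket_by_frequency(w1: str, w2: str, w3: str,
                         counts: Counts, model: str = "dependency") -> str:
    """Return 'left', 'right' or 'undecided' for the compound w1 w2 w3."""
    # Both models use #(w1,w2) as the evidence for left bracketing.
    left_evidence = counts.get((w1, w2), 0)
    if model == "adjacency":
        right_evidence = counts.get((w2, w3), 0)
    elif model == "dependency":
        right_evidence = counts.get((w1, w3), 0)
    else:
        raise ValueError(f"unknown model: {model}")
    if left_evidence > right_evidence:
        return "left"
    if right_evidence > left_evidence:
        return "right"
    return "undecided"


# Made-up counts, for illustration only.
toy_counts = {("adult", "male"): 500, ("male", "rat"): 2000, ("adult", "rat"): 1500}
print(bracket_by_frequency("adult", "male", "rat", toy_counts, "adjacency"))
print(bracket_by_frequency("adult", "male", "rat", toy_counts, "dependency"))
```

With real counts the two models need not agree, which is why the slides treat them as separate features.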
![Page 8: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/8.jpg)
Probabilities
• Adjacency model
  • Compare Pr(w1w2|w2) to Pr(w2w3|w3)
• Dependency model
  • Compare Pr(w1w2|w2) to Pr(w1w3|w3)
• In both models the first probability supports left and the second supports right
• Pr(w1w2|w2) is the probability that w1 modifies w2
![Page 9: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/9.jpg)
Probabilities: Dependency
• Dependency model
  • Pr(left) = Pr(w1w2|w2) Pr(w2w3|w3)
  • Pr(right) = Pr(w1w3|w3) Pr(w2w3|w3)
  • The common factor Pr(w2w3|w3) cancels, so we compare Pr(w1w2|w2) to Pr(w1w3|w3)
• BUT! No cancellation in Lauer’s model:
![Page 10: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/10.jpg)
Probabilities: Estimation
• Using page hits as a proxy for n-gram counts
  • Pr(w1w2|w2) = #(w1,w2) / #(w2)
  • #(w2): word frequency; query for “w2”
  • #(w1,w2): bigram frequency; query for “w1 w2”
  • smoothed by 0.5
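A minimal sketch of this estimate, assuming that "smoothed by 0.5" means adding 0.5 to each raw count (the slide does not spell this out) and that page hits are already available in a dictionary keyed by query string. The names `smoothed_prob`, `dependency_by_probability` and the hit counts are hypothetical.

```python
from typing import Dict


def smoothed_prob(pair_hits: float, word_hits: float, smoothing: float = 0.5) -> float:
    """Estimate Pr(w1w2|w2) as #(w1,w2) / #(w2), adding `smoothing` to both counts."""
    return (pair_hits + smoothing) / (word_hits + smoothing)


def dependency_by_probability(w1: str, w2: str, w3: str, hits: Dict[str, int]) -> str:
    """Compare Pr(w1w2|w2) (left) with Pr(w1w3|w3) (right)."""
    p_left = smoothed_prob(hits.get(f"{w1} {w2}", 0), hits.get(w2, 0))
    p_right = smoothed_prob(hits.get(f"{w1} {w3}", 0), hits.get(w3, 0))
    if p_left > p_right:
        return "left"
    if p_right > p_left:
        return "right"
    return "undecided"


# Made-up page hits, for illustration only.
toy_hits = {"law enforcement": 9000, "enforcement": 50000,
            "law officer": 300, "officer": 80000}
print(dependency_by_probability("law", "enforcement", "officer", toy_hits))
```

The smoothing keeps the ratio defined when a query returns zero hits.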
![Page 11: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/11.jpg)
Probabilities: Why? (1)
• Why should we use (a) Pr(w1w2|w2), rather than (b) Pr(w2w1|w1)?
• Keller & Lapata (2004) calculate:
  • AltaVista queries: (a) 70.49%, (b) 68.85%
  • British National Corpus: (a) 63.11%, (b) 65.57%
![Page 12: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/12.jpg)
Probabilities: Why? (2)
• Why should we use (a) Pr(w1w2|w2), rather than (b) Pr(w2w1|w1)?
• Maybe to introduce a bracketing prior, just like Lauer (1995) did.
• But otherwise, no reason to prefer either one.
  • Do we need probabilities? (association is OK)
  • Do we need a directed model? (symmetry is OK)
![Page 13: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/13.jpg)
Association Models: χ² (Chi Squared)
• A = #(wi,wj)
• B = #(wi) – #(wi,wj)
• C = #(wj) – #(wi,wj)
• D = N – (A+B+C)
• N = 8 trillion (= A+B+C+D)
  • 8 billion Web pages × 1,000 words
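The slide's own formula did not survive in this transcript. The sketch below assumes the standard χ² statistic for a 2×2 contingency table, N(AD − BC)² / ((A+B)(C+D)(A+C)(B+D)), applied to the cells A, B, C, D defined above. The function name and argument names are hypothetical.

```python
def chi_squared(hits_i: float, hits_j: float, hits_ij: float, n: float = 8e12) -> float:
    """Chi-squared association between wi and wj from page-hit counts.

    Assumes the standard 2x2 formula; n defaults to the slide's 8 trillion estimate.
    """
    a = hits_ij                 # pages with both wi and wj as a bigram
    b = hits_i - hits_ij        # wi without wj
    c = hits_j - hits_ij        # wj without wi
    d = n - (a + b + c)         # neither
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0              # degenerate table: no evidence of association
    return n * (a * d - b * c) ** 2 / denominator


# Small made-up table, for illustration only.
print(chi_squared(hits_i=20, hits_j=30, hits_ij=10, n=100))
```

Because the statistic is symmetric in wi and wj, it sidesteps the question of which conditional direction to use.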
![Page 14: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/14.jpg)
Web-derived Surface Features
• Authors often disambiguate noun compounds using surface markers, e.g.:
  • amino-acid sequence → left
  • brain stem’s cell → left
  • brain’s stem cell → right
• The enormous size of the Web makes them frequent enough to be useful.
![Page 15: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/15.jpg)
Web-derived Surface Features: Dash (hyphen)
• Left dash
  • cell-cycle analysis → left
• Right dash
  • donor T-cell → right
  • fiber optics-system → should be left...
• Double dash
  • T-cell-depletion → unusable...
![Page 16: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/16.jpg)
Web-derived Surface Features: Possessive Marker
• Attached to the first word
  • brain’s stem cell → right
• Attached to the second word
  • brain stem’s cell → left
• Combined features
  • brain’s stem-cell → right
![Page 17: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/17.jpg)
Web-derived Surface Features: Capitalization
• don’t-care – lowercase – uppercase
  • Plasmodium vivax Malaria → left
  • plasmodium vivax Malaria → left
• lowercase – uppercase – don’t-care
  • brain Stem cell → right
  • brain Stem Cell → right
• Disabled on:
  • Roman numerals
  • Single-letter words, e.g. vitamin D deficiency
![Page 18: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/18.jpg)
Web-derived Surface Features: Embedded Slash
• Left embedded slash
  • leukemia/lymphoma cell → right
![Page 19: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/19.jpg)
Web-derived Surface Features: Parentheses
• Single-word
  • growth factor (beta) → left
  • (brain) stem cell → right
• Two-word
  • (growth factor) beta → left
  • brain (stem cell) → right
![Page 20: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/20.jpg)
Web-derived Surface Features: Colon, dot, semicolon
• Following the first word
  • home. health care → right
  • adult, male rat → right
• Following the second word
  • health care, provider → left
  • lung cancer: patients → left
![Page 21: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/21.jpg)
Web-derived Surface Features: Dash to External Word
• External word to the left
  • mouse-brain stem cell → right
• External word to the right
  • tumor necrosis factor-alpha → left
![Page 22: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/22.jpg)
Web-derived Surface Features: Problems & Solutions
• Problem: search engines ignore punctuation
  • “brain-stem cell” does not work
• Solution:
  • query for “brain stem cell”
  • obtain 1,000 document summaries
  • look for the features in these summaries
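The "look for the features in these summaries" step could be sketched with regular expressions. The example below covers only the dash and possessive features from the earlier slides, assumes the summaries are already available as plain strings, and uses hypothetical names throughout. It is not the authors' implementation.

```python
import re
from collections import Counter
from typing import Iterable


def count_surface_features(w1: str, w2: str, w3: str,
                           snippets: Iterable[str]) -> Counter:
    """Count left/right votes from dash and possessive markers in snippets."""
    a, b, c = (re.escape(w) for w in (w1, w2, w3))
    patterns = {
        "left": [
            rf"\b{a}-{b}\s+{c}\b",          # left dash: w1-w2 w3
            rf"\b{a}\s+{b}['’]s\s+{c}\b",   # possessive on the second word
        ],
        "right": [
            rf"\b{a}\s+{b}-{c}\b",          # right dash: w1 w2-w3
            rf"\b{a}['’]s\s+{b}\s+{c}\b",   # possessive on the first word
        ],
    }
    votes: Counter = Counter()
    for text in snippets:
        for label, regexes in patterns.items():
            for regex in regexes:
                votes[label] += len(re.findall(regex, text, flags=re.IGNORECASE))
    return votes


# Made-up snippets, for illustration only.
toy_snippets = [
    "The brain-stem cell was isolated.",
    "Each brain's stem cell population differs.",
    "Several brain stem-cell lines were compared.",
    "A brain-stem-cell assay.",  # double dash: matches neither pattern
]
votes = count_surface_features("brain", "stem", "cell", toy_snippets)
print(votes["left"], votes["right"])
```

Double-dash forms fall through both patterns, which matches the slide's note that they are unusable.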
![Page 23: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/23.jpg)
Other Web-derived Features: Abbreviation
• After the third word
  • tumor necrosis factor (NF) → right
• After the second word
  • tumor necrosis (TN) factor → left
• We query for e.g. “tumor necrosis tn factor”
• Problems:
  • Roman numerals: IV, VI
  • States: CA
  • Short words: me
![Page 24: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/24.jpg)
Other Web-derived Features: Concatenation
• Consider health care reform
  • healthcare: 79,500,000
  • carereform: 269
  • healthreform: 812
• Adjacency model
  • healthcare vs. carereform
• Dependency model
  • healthcare vs. healthreform
• Triples
  • “healthcare reform” vs. “health carereform”
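As a rough sketch, the adjacency and dependency comparisons on concatenated forms can be reproduced with the three counts quoted on the slide. The function name `concatenation_vote` is hypothetical, and the counts are simply passed in as a dictionary rather than fetched from a search engine.

```python
from typing import Dict


def concatenation_vote(w1: str, w2: str, w3: str,
                       hits: Dict[str, int], model: str = "adjacency") -> str:
    """Vote left/right by comparing hit counts of concatenated word pairs."""
    left = hits.get(w1 + w2, 0)          # e.g. healthcare
    if model == "adjacency":
        right = hits.get(w2 + w3, 0)     # e.g. carereform
    elif model == "dependency":
        right = hits.get(w1 + w3, 0)     # e.g. healthreform
    else:
        raise ValueError(f"unknown model: {model}")
    if left > right:
        return "left"
    if right > left:
        return "right"
    return "undecided"


# Counts as quoted on the slide.
slide_hits = {"healthcare": 79_500_000, "carereform": 269, "healthreform": 812}
print(concatenation_vote("health", "care", "reform", slide_hits, "adjacency"))
print(concatenation_vote("health", "care", "reform", slide_hits, "dependency"))
```

With these counts both models vote for left bracketing of health care reform.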
![Page 25: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/25.jpg)
Other Web-derived Features: Using Google’s *
• Each * acts as a one-word wildcard
• Single star
  • “health care * reform” → left
  • “health * care reform” → right
• More stars and/or reverse order
  • “care reform * * health” → right
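Generating the wildcard queries is mechanical, so a short sketch may help. It builds only the three query shapes shown on the slide; the function name `star_queries` and the `max_stars` parameter are assumptions, and the resulting strings would still have to be sent to a search engine that supports the * operator.

```python
from typing import Dict, List


def star_queries(w1: str, w2: str, w3: str, max_stars: int = 2) -> Dict[str, List[str]]:
    """Build exact-phrase wildcard queries that vote for left or right bracketing."""
    queries: Dict[str, List[str]] = {"left": [], "right": []}
    for k in range(1, max_stars + 1):
        stars = " ".join("*" * k)  # "*", "* *", ...
        # Gap between w2 and w3: w1 w2 stay together, evidence for left.
        queries["left"].append(f'"{w1} {w2} {stars} {w3}"')
        # Gap between w1 and w2: w2 w3 stay together, evidence for right.
        queries["right"].append(f'"{w1} {stars} {w2} {w3}"')
        # Reverse order with w2 w3 kept together: evidence for right.
        queries["right"].append(f'"{w2} {w3} {stars} {w1}"')
    return queries


generated = star_queries("health", "care", "reform")
for label, items in generated.items():
    print(label, items)
```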
Adjacency model
![Page 26: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/26.jpg)
Other Web-derived Features: Reorder
• Reorders for “health care reform”
  • “care reform health” → right
  • “reform health care” → left
![Page 27: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/27.jpg)
Other Web-derived Features: Internal Inflection Variability
• First word
  • ???
• Second word
  • tyrosine kinase activation
  • tyrosine kinases activation
![Page 28: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/28.jpg)
Other Web-derived Features: Switch The First Two Words
• Predict right if we can reorder adult male rat as male adult rat
![Page 29: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/29.jpg)
Paraphrases (1)
• The semantics of a noun compound is often made overt by a paraphrase (Warren, 1978)
• Prepositional
  • stem cells in the brain → right
  • cells from the brain stem → left
• Verbal
  • virus causing human immunodeficiency → left
  • pain associated with arthritis migraine → left
• Copula
  • office building that is a skyscraper → right
![Page 30: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/30.jpg)
Paraphrases (2)
• Lauer (1995), Keller & Lapata (2003), Girju & al. (2005) predict NC semantics by choosing the most likely preposition:
  • of, for, in, at, on, from, with, about, (like)
• This could be problematic when more than one preposition is possible
• In contrast:
  • we try to predict syntax, not semantics
  • we do not disambiguate, just add up all counts
    • cells in (the) bone marrow → left
    • cells from (the) bone marrow → left
![Page 31: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/31.jpg)
Paraphrases (3)
• prepositional paraphrases: we use ~150 prepositions
• verbal paraphrases: we use associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for
• copula paraphrases: we use is/was and that/which/who
• optional elements:
  • articles: a, an, the
  • quantifiers: some, every, etc.
  • pronouns: this, these, etc.
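The prepositional patterns can be sketched as a small query generator. It assumes the shapes suggested by the examples in these slides: "w3 PREP (article) w1 w2" counts towards left and "w2 w3 PREP (article) w1" towards right. The four prepositions are a tiny hypothetical subset of the ~150 mentioned, inflection is ignored, and the function name is an assumption.

```python
from itertools import product
from typing import Dict, List, Sequence


def prepositional_paraphrases(w1: str, w2: str, w3: str,
                              prepositions: Sequence[str] = ("in", "from", "of", "for"),
                              articles: Sequence[str] = ("", "the")) -> Dict[str, List[str]]:
    """Build exact-phrase paraphrase queries for the compound w1 w2 w3."""
    queries: Dict[str, List[str]] = {"left": [], "right": []}
    for prep, article in product(prepositions, articles):
        link = f"{prep} {article}".strip()  # "in" or "in the"
        # w1 w2 kept together after the preposition: left bracketing.
        queries["left"].append(f'"{w3} {link} {w1} {w2}"')
        # w2 w3 kept together before the preposition: right bracketing.
        queries["right"].append(f'"{w2} {w3} {link} {w1}"')
    return queries


paraphrases = prepositional_paraphrases("bone", "marrow", "cells")
print(paraphrases["left"][:4])
```

Since the goal is syntax rather than semantics, the hit counts of all queries under each label would simply be summed and compared.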
![Page 32: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/32.jpg)
Evaluation: Datasets
• Lauer Set
  • 244 noun compounds (NCs)
  • from Grolier’s encyclopedia
  • inter-annotator agreement: 81.5%
• Biomedical Set
  • 430 NCs
  • from MEDLINE
  • inter-annotator agreement: 88% (κ = .606)
![Page 33: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/33.jpg)
Evaluation: Experiments
• Exact phrase queries
• Limited to English
• Inflections:
  • Lauer Set: Carroll’s morphological tools
  • Biomedical Set: UMLS Specialist Lexicon
![Page 34: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/34.jpg)
Results: Lauer (1)
[Chart; legend: correct, N/A, wrong]
![Page 35: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/35.jpg)
Results: Lauer (2)
[Chart; legend: correct, N/A, wrong]
![Page 36: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/36.jpg)
Results: Lauer (3)
![Page 37: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/37.jpg)
Results: Bio (1)
[Chart; legend: correct, N/A, wrong]
![Page 38: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/38.jpg)
Results: Bio (2)
[Chart; legend: correct, N/A, wrong]
![Page 39: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/39.jpg)
Individual Surface Features Performance: Bio
![Page 40: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/40.jpg)
Paraphrase and Surface Features Performance
Lauer Set
Biomedical Set
![Page 41: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/41.jpg)
Discussion
Lauer Bio
• Adjacency vs. Dependency
• χ² vs. frequencies vs. probabilities
![Page 42: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/42.jpg)
Conclusion
• Introduced search engine statistics that go beyond the n-gram (applicable to other tasks)
  • surface features
  • paraphrases
• Obtained new state-of-the-art results on NC bracketing
  • more robust than Lauer (1995)
  • more accurate than Keller & Lapata (2004)
![Page 43: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/43.jpg)
Future Work
• Recognize ambiguous cases
• Bracket more than 3 nouns
• Not just bracketing but dependences, e.g. growth factor alpha
• Bracket NPs in general (other POS)
  • augment Penn Treebank with NP-internal dependences
• Application to other structural ambiguity problems:
  • Prepositional phrase attachment
  • Noun phrase coordination
![Page 44: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/44.jpg)
The End
Thank you!
![Page 45: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing](https://reader030.vdocuments.net/reader030/viewer/2022032804/56812a68550346895d8decb6/html5/thumbnails/45.jpg)
Web Counts: Problems
• Page hits are inaccurate
  • This may be OK (Keller & Lapata, 2003)
• The Web lacks linguistic annotation
  • Pr(health|care) = #(“health care”) / #(care)
    • health: noun
    • care: both verb and noun
    • can be adjacent by chance
    • can come from different sentences
• Cannot find:
  • stem cells VERB PREPOSITION brain
  • protein synthesis’ inhibition