Post on 22-Dec-2015
UCB Digital Library Project
An Experiment in Using Lexical Disambiguation to Enhance Information Access
Robert Wilensky, Isaac Cheng, Timotius Tjahjadi, and Heyning Cheng
Goal
Enhance information access by:
– fully automated text categorization
– adding searching by word sense
Applied to the World Wide Web
Manual vs. Automatically Created Directories
Manual classification of documents is:
– Expensive
– Not scalable: hard to keep up with the rapid growth and change of information sources such as the Web
Would like fully automatic classification:
– no training set
– no rules
– appeal instead to “intrinsic semantics”
Lexical Disambiguation
Problem: Determine the intended sense of an ambiguous word
Approach: based on Yarowsky et al.
– Thesaurus categories as proxies for senses (we used Roget’s 5th)
– Training: count nearby word–category co-occurrences
– Deployment: add up the word–category evidence
Counting Co-occurrences of Terms with Categories
…while storks and cranes make their nests in the bank…
[Tools, Animals]
The result is a category co-occurrence vector for each term.
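The training and deployment steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `CATEGORIES` table is a hypothetical two-category stand-in for Roget's thesaurus, and the window size is arbitrary.

```python
from collections import Counter

# Toy stand-in for a thesaurus: each category lists a few member words.
# (Hypothetical miniature of Roget's categories, for illustration only.)
CATEGORIES = {
    "Animals": {"storks", "herons", "nests", "birds"},
    "Tools": {"hoist", "lift", "steel", "machinery"},
}

def train(corpus, half_window=5):
    """Training: count co-occurrences of each term with the thesaurus
    categories of nearby words, giving a category co-occurrence vector
    per term."""
    vectors = {}
    for sentence in corpus:
        words = sentence.lower().split()
        for i, term in enumerate(words):
            lo = max(0, i - half_window)
            context = words[lo:i] + words[i + 1:i + 1 + half_window]
            vec = vectors.setdefault(term, Counter())
            for w in context:
                for cat, members in CATEGORIES.items():
                    if w in members:
                        vec[cat] += 1
    return vectors

def disambiguate(context, vectors):
    """Deployment: add up the word-category evidence contributed by the
    context words and pick the best-supported category."""
    evidence = Counter()
    for w in context:
        evidence.update(vectors.get(w, Counter()))
    return evidence.most_common(1)[0][0] if evidence else None
```

With the slide's example sentence as a one-line corpus, context words such as "storks" and "nests" accumulate evidence for the Animals category, so an occurrence of "crane" in that context would resolve to its bird sense rather than its machinery sense.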
Automatic Topic Assignment Based on Word Sense
Hearst:
– Topic word-category association vectors
Fisher and Wilensky:
– Contrasted different algorithms
– Concluded that exploiting word senses may improve topic assignment
We use the prior probability distribution of word senses (and, more recently, disambiguation per se).
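Classification by sense priors, as used here, can be sketched roughly as follows. The `PRIORS` table is hypothetical (the project estimated such priors from random Web text); scoring by summed priors is a plausible minimal reading of the approach, not the project's exact algorithm.

```python
from collections import defaultdict

# Hypothetical prior sense distributions P(category | word); the project
# estimated priors of this kind from ~10 million words of random Web text.
PRIORS = {
    "bank": {"Finance": 0.8, "Geography": 0.2},
    "interest": {"Finance": 0.7, "Attention": 0.3},
    "river": {"Geography": 1.0},
}

def assign_topic(words, priors=PRIORS):
    """Score each category by summing every word's prior probability of
    belonging to it; assign the highest-scoring category."""
    scores = defaultdict(float)
    for w in words:
        for cat, p in priors.get(w, {}).items():
            scores[cat] += p
    return max(scores, key=scores.get) if scores else None
```

A document containing "bank" and "interest" leans toward Finance, while "bank" next to "river" leans toward Geography, even though no disambiguation step is run.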
IAGO 0.1 vs. 1.0
IAGO 0.1:
– Eliminated short (< 100 content words) pages
– Trained on newswire text
IAGO 1.0:
– Trained on the Encarta encyclopedia
– Estimated word sense priors on the Web (used 10 million words of random Web documents)
– Ignored proper nouns
– Augmented the stop list to deal with various problems
Tested categorization by mapping Yahoo categories to ours; tested disambiguation on newswire, then sampled the Web.
IAGO! Overview
[Architecture diagram — components: thin client; Directory front-end; Searching front-end; Classification by Priors; Lexical Disambiguation filter; the Web; Preprocessing; Preprocessing Database; Internet Directory; Search by Word Senses]
Classification Results
Then (version 0.1):

Category Name       Precision   Recall
-------------       ---------   ------
ComputerScience        87.5%     19.4%
FinanceInvestment     100.0%     13.4%
FitnessExercise       100.0%      1.8%
MotionPictures        100.0%     54.8%
Music                  98.2%     42.4%
Nutrition              97.9%     29.9%
Occupation             97.8%     30.3%
TheEnvironment          n/a       0.0%
Travel                 75.0%     15.4%

Overall precision = 97%; overall recall = 21%
(92.3% and 20.4% if no adjustment by hand)

Now (version 1.0):

Category Name       Precision   Recall
-------------       ---------   ------
ComputerScience        31.6%     17.1%
FinanceInvestment      94.4%     22.0%
FitnessExercise       100.0%      4.3%
MotionPictures        100.0%     57.1%
Music                  97.5%     58.3%
Nutrition              80.3%     35.6%
Occupation            100.0%     13.1%
TheEnvironment          n/a       0.0%
Travel                 50.0%      5.7%

Overall precision = 88%; overall recall = 23%
IAGO! 1.0 Internet Directory
Used the engine to classify a few tens of thousands of Web documents into Roget’s categories.
Disambiguation Results
[Bar chart: disambiguation accuracy (%) for the words "interest", "issue", "sentence", and "star", comparing Baseline, Version 0.1, and Version 1.0; y-axis 0–100%, with values shown including 58, 78, 93, and 100]
Application to Text Searching
Present the user with the set of known word senses from which to select
– e.g., keyword = “rock”: = stone, = kind of music
Retrieve by word, filter by word sense
Rank by number of matching word senses
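The retrieve–filter–rank pipeline just described can be sketched as below. The index and per-document sense annotations are hypothetical data structures for illustration; the slide does not specify IAGO!'s actual representations.

```python
def sense_search(keyword, selected_senses, index, doc_senses):
    """Retrieve documents containing the keyword, keep only those whose
    occurrences were disambiguated to a user-selected sense, and rank by
    the number of matching word senses.

    index:      keyword -> list of document ids (hypothetical)
    doc_senses: (doc id, keyword) -> set of senses the occurrences were
                disambiguated to (hypothetical)
    """
    hits = []
    for doc in index.get(keyword, []):
        matched = doc_senses.get((doc, keyword), set()) & set(selected_senses)
        if matched:
            hits.append((doc, len(matched)))
    hits.sort(key=lambda h: h[1], reverse=True)  # more matching senses first
    return [doc for doc, _ in hits]
```

For the slide's "rock" example, selecting the stone sense filters out pages where "rock" was disambiguated as a kind of music, and a page matching both selected senses would rank above one matching only one.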
Is it Useful?
Results in the literature generally suggest that disambiguation is not useful for long queries, and that its utility is highly sensitive to disambiguation accuracy.
However, 40% of search queries on the web are reported to be single words.
So, does disambiguation work well enough to aid with single word queries?
Usefulness
Let r be the frequency of the most common of the (non-overlapping) senses.
Can show that, to be better than just using keyword retrieval, disambiguation accuracy needs to be at least 50% (with the required accuracy increasing as r increases), but it need not be highly accurate. (In fact, it can perform below the baseline.)
IAGO! 1.0 performs well above this level.
Usefulness
Keyword retrieval produces word-sense retrieval precision and recall of r and 1 for the common sense, and (1 − r) and 1 for the less common sense.
A disambiguation method that is correct a fraction p of the time has, for a word sense with frequency r, precision

    rp / (rp + (1 − r)(1 − p))

and recall p. Using E as the metric, one can show that p must exceed a break-even accuracy (a function of r and of E’s β parameter) for the disambiguation method to outperform keyword retrieval.
For small r, p must be greater than 50%. For large r, disambiguation compares favorably with keyword retrieval even at fairly low accuracy.
– E.g., with a 90/10 distribution of word senses, then, for the more common word sense, E, with a β of .5, is better for a disambiguation algorithm with an accuracy over 77% than for keyword retrieval. (For the less common word sense, a “disambiguation” algorithm that is completely random gives a superior result.)
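The break-even claim can be sanity-checked numerically under one plausible reading of the setup (a reconstruction; the authors' exact derivation may differ): sense filtering has precision rp / (rp + (1 − r)(1 − p)) and recall p, keyword retrieval has precision r and recall 1, and the two are compared with the F measure underlying van Rijsbergen's E (E = 1 − F).

```python
def f_measure(precision, recall, beta=1.0):
    """F measure underlying van Rijsbergen's E (E = 1 - F); higher is
    better. beta < 1 weights precision more heavily than recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def filter_precision(r, p):
    """Precision of retrieving a sense of frequency r with a
    disambiguator that is correct with probability p."""
    return r * p / (r * p + (1 - r) * (1 - p))

def break_even_accuracy(r, beta=1.0, step=1e-4):
    """Smallest accuracy p at which sense filtering beats plain keyword
    retrieval (precision r, recall 1), found by a simple scan."""
    keyword = f_measure(r, 1.0, beta)
    p = 0.5
    while p < 1.0:
        if f_measure(filter_precision(r, p), p, beta) > keyword:
            return p
        p += step
    return None
```

Under this reconstruction the scan gives a break-even of about 2/3 for a 50/50 sense split with balanced β, rising toward 77% for a 90/10 split with β = 0.5, consistent with the figures on this slide.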
More results
The latest implementation (by Heyning Cheng) reduces training time to about 1 hour (from about 24); classifying 1000 documents takes about 10 minutes.
It also improves the performance of disambiguation, which made it practical to use disambiguation in topic assignment:
– This produces slightly better results, appears less sensitive to changes in the stop list, and can be made to run quickly.
Disambiguation with a substantially smaller window size (even as small as 5) did not reduce accuracy; in some cases, a half-window size of 10 outperformed one of 50.
                 32-word threshold       100-word threshold
                 Precision   Recall      Precision   Recall
priors              92.3%     20.4%        88.3%      22.3%
disambiguation      94.1%     22.4%        93.4%      25.8%
More results (cont’d)
Weighted word sense priors by IDF of the term
IDF        Stop list             Precision   Recall
Not used   No computer terms       81.3%     20.7%
Not used   Computer terms          92.3%     20.4%
Used       No computer terms       86.0%     20.3%
Used       Computer terms          88.5%     20.2%
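IDF weighting of the sense priors can be sketched as a small variant of priors-based scoring. This is an illustrative reconstruction: the `priors` and `doc_freq` tables are hypothetical, and the exact weighting IAGO! used is not specified on the slide.

```python
import math

def idf(term, doc_freq, n_docs):
    """Inverse document frequency; rare terms get larger weights."""
    return math.log(n_docs / (1 + doc_freq.get(term, 0)))

def assign_topic_idf(words, priors, doc_freq, n_docs):
    """Priors-based topic scoring with each word's contribution weighted
    by its IDF, so very common terms contribute little."""
    scores = {}
    for w in words:
        weight = idf(w, doc_freq, n_docs)
        for cat, p in priors.get(w, {}).items():
            scores[cat] = scores.get(cat, 0.0) + weight * p
    return max(scores, key=scores.get) if scores else None
```

The effect is that a rare, topical word such as "itinerary" can outvote a frequent, ambiguous one such as "cheap" when the two pull toward different categories.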
More Results
Excluding low-utility or confusing Roget’s categories (down to about 200) improved recall to about 40% on the 1000 document test set.
The “purity” of a topic assignment (the % of all word senses disambiguated to the assigned topic) appears to correlate with accuracy at least as well as IAGO’s ranking algorithm does.
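The purity measure defined above is a simple ratio; a minimal sketch, with a hypothetical list of per-word sense assignments as input:

```python
def topic_purity(word_senses, assigned_topic):
    """Purity of a topic assignment: the fraction of all disambiguated
    word senses in the document that match the assigned topic."""
    if not word_senses:
        return 0.0
    return sum(1 for s in word_senses if s == assigned_topic) / len(word_senses)
```

A document whose disambiguated senses mostly agree with the assigned topic scores near 1, which is why purity can serve as a confidence signal for the assignment.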
Future Work
Get better word sense proxies!
Word-sense searching:
– Create word sense index
– Support word-sense searching within more general searches.
– Improve disambiguation by exploiting priors.
– Test against synonym expansion methods
Automatic topic-categorization:
– Handle multi-word phrases and proper names
Future Plans: Longer Term
Disambiguation:
– Handle non-nouns
– Better word sense source
– Automatic grouping of thesaural word senses
Topic-categorization:
– Multiple topic assignment
– Quality
Summarization via the same techniques
Other linguistic choices, e.g., thematic roles