
UCB Digital Library Project

An Experiment in Using Lexical Disambiguation to Enhance Information Access

Robert Wilensky, Isaac Cheng, Timotius Tjahjadi, and Heyning Cheng


Goal

Enhance information access by:
– fully automated text categorization
– adding searching by word sense

Applied to the World Wide Web


Manual vs. Automatically Created Directories

Manual classification of documents is:
– Expensive
– Not scalable: hard to keep up with the rapid growth and changes of information sources such as the Web

Would like fully automatic classification:
– no training set
– no rules
– appeal instead to “intrinsic semantics”


Lexical Disambiguation

Problem: Determine the intended sense of an ambiguous word

Approach: Based on Yarowsky et al.
– Thesaurus categories as proxies for senses (we used Roget’s 5th)
– Training: Count nearby word-category co-occurrences
– Deployment: Add up the word-category evidence


Counting Co-occurrences of Terms with Categories

…while storks and cranes make their nests in the bank…

“cranes”: [Tools, Animals]

The result is a category co-occurrence vector for each term.
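A minimal Python sketch of the training and deployment steps described above, run on the stork/crane sentence as a toy corpus. The tiny thesaurus, window size, and function names are illustrative assumptions, not IAGO!'s actual code (a real system would also lemmatize terms and weight the evidence):

    from collections import defaultdict

    # Toy thesaurus: each term maps to the Roget-style categories (sense proxies)
    # it can belong to. Entries match the example's surface forms for simplicity.
    THESAURUS = {
        "cranes": {"Tools", "Animals"},
        "storks": {"Animals"},
        "nests":  {"Animals", "Abode"},
        "bank":   {"Finance", "Land"},
    }

    def train(corpus, half_window=50):
        """Training: count how often each term co-occurs with each category."""
        cooc = defaultdict(lambda: defaultdict(int))   # term -> category -> count
        for doc in corpus:
            words = doc.lower().split()
            for i, w in enumerate(words):
                context = words[max(0, i - half_window):i] + words[i + 1:i + 1 + half_window]
                for c in context:
                    for cat in THESAURUS.get(c, ()):
                        cooc[w][cat] += 1
        return cooc                                    # category co-occurrence vector per term

    def disambiguate(word, context, cooc):
        """Deployment: add up the word-category evidence and pick the best sense."""
        scores = {cat: 0 for cat in THESAURUS.get(word, ())}
        for c in context:
            for cat, count in cooc.get(c, {}).items():
                if cat in scores:
                    scores[cat] += count
        return max(scores, key=scores.get) if scores else None

    cooc = train(["while storks and cranes make their nests in the bank"])
    # Context words such as "storks" and "nests" carry Animals evidence,
    # so "cranes" resolves to Animals rather than Tools.
    print(disambiguate("cranes", ["storks", "nests", "bank"], cooc))   # -> Animals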


Automatic Topic Assignment Based on Word Sense

Hearst:
– Topic word-category association vectors

Fisher and Wilensky:
– Contrasted different algorithms
– Concluded that exploiting word senses may improve topic assignment

We use the prior probability distribution of word senses (and, more recently, disambiguation per se).
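A minimal sketch of topic assignment from word-sense priors, assuming each word carries a prior probability distribution over thesaurus categories estimated from training text; the priors, thresholds, and names below are made up for illustration, not IAGO!'s actual values:

    from collections import defaultdict

    def assign_topics(words, sense_priors, top_k=3, min_share=0.2):
        """Score categories by summing each word's prior P(category | word)."""
        scores = defaultdict(float)
        for w in words:
            for category, prob in sense_priors.get(w, {}).items():
                scores[category] += prob
        total = sum(scores.values()) or 1.0
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        # Keep only the leading categories that hold a non-trivial share of the mass.
        return [(cat, score / total) for cat, score in ranked[:top_k] if score / total >= min_share]

    priors = {
        "rock":   {"Music": 0.6, "Land": 0.4},
        "guitar": {"Music": 1.0},
        "band":   {"Music": 0.7, "Circularity": 0.3},
    }
    print(assign_topics(["rock", "guitar", "band"], priors))   # -> [('Music', 0.766...)]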


IAGO 0.1 vs. 1.0

IAGO 0.1:
– Eliminated short (< 100 content words) pages
– Trained on newswire text

IAGO 1.0:
– Trained on the Encarta encyclopedia
– Estimated word sense priors on the Web (used 10 million words of random web documents)
– Ignored proper nouns
– Augmented the stop-list to deal with various problems

Tested categorization by mapping Yahoo categories to ours; tested disambiguation on newswire, then on a sample of the Web.


IAGO! Overview

[Architecture diagram. Components: thin client, Directory front-end, Classification by Priors, Searching front-end, Lexical Disambiguation, Filter, the Web, Preprocessing, Preprocessing Database, Internet Directory, Search by Word Senses.]


Classification Results

Now: (version 1.0)

Category Name       Precision   Recall
-----------------   ---------   ------
ComputerScience        87.5%     19.4%
FinanceInvestment     100.0%     13.4%
FitnessExercise       100.0%      1.8%
MotionPictures        100.0%     54.8%
Music                  98.2%     42.4%
Nutrition              97.9%     29.9%
Occupation             97.8%     30.3%
TheEnvironment          n/a       0.0%
Travel                 75.0%     15.4%

Overall precision = 97%, overall recall = 21%
(92.3% and 20.4% if no adjustment by hand)

Then: (version 0.1)

Category Name       Precision   Recall
-----------------   ---------   ------
ComputerScience        31.6%     17.1%
FinanceInvestment      94.4%     22.0%
FitnessExercise       100.0%      4.3%
MotionPictures        100.0%     57.1%
Music                  97.5%     58.3%
Nutrition              80.3%     35.6%
Occupation            100.0%     13.1%
TheEnvironment          n/a       0.0%
Travel                 50.0%      5.7%

Overall precision = 88%, overall recall = 23%


IAGO! 1.0 Internet Directory

Used the engine to classify a few tens of thousands of web documents into Roget’s categories.


Disambiguation Results

[Bar chart: disambiguation accuracy (%) on the words "interest", "issue", "sentence", and "star", comparing the Baseline, Version 0.1, and Version 1.0.]


Application to Text Searching

Present the user with the set of known word senses from which to select
– e.g., keyword = “rock”, with senses “stone” and “kind of music”

Retrieve by word, filter by word sense
Rank by number of matching word senses
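A minimal sketch of this retrieve/filter/rank scheme, assuming a keyword index plus a per-document record of disambiguated word senses; the data layout and names are assumptions, not IAGO!'s interface:

    def search_by_sense(query_terms, selected_senses, keyword_index, doc_senses):
        """Retrieve by word, filter by word sense, rank by number of matching senses.

        keyword_index:   term -> set of doc ids containing the term
        doc_senses:      doc id -> set of (term, sense) pairs from disambiguation
        selected_senses: senses the user picked, e.g. {("rock", "stone")}
        """
        # Retrieve: any document containing at least one query term.
        candidates = set()
        for term in query_terms:
            candidates |= keyword_index.get(term, set())
        # Filter out documents with no matching sense, rank the rest by match count.
        ranked = [(doc, len(selected_senses & doc_senses.get(doc, set()))) for doc in candidates]
        ranked = [(doc, n) for doc, n in ranked if n > 0]
        return sorted(ranked, key=lambda item: item[1], reverse=True)

    index  = {"rock": {"d1", "d2"}}
    senses = {"d1": {("rock", "stone")}, "d2": {("rock", "kind of music")}}
    print(search_by_sense(["rock"], {("rock", "stone")}, index, senses))   # -> [('d1', 1)]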


Is it Useful?

Results in the literature generally suggest that disambiguation is not useful for long queries, and that its utility is highly sensitive to disambiguation accuracy.

However, 40% of search queries on the Web are reported to be single words.

So, does disambiguation work well enough to aid with single-word queries?


Usefulness

Let r be the frequency of the most common of a word's (non-overlapping) senses.

One can show that, to be better than just using keyword retrieval, disambiguation accuracy needs to be at least 50%, with the required accuracy increasing as r increases; but it need not be highly accurate. (In fact, it can perform below the most-frequent-sense baseline.)

IAGO! 1.0 performs well above this level.


Usefulness

Keyword retrieval will produce word-sense retrieval precision and recall of r and 1 for the common sense, and of (1-r) and 1 for the less common sense.

A disambiguation method that is correct a fraction p of the time would have precision

    rp / (rp + (1-r)(1-p))

and recall p for a word sense with frequency r. Using E as the metric, one can show that p needs to be at least

    (1 - (1-β²)r) / (2 - (2-β²)r)

for a disambiguation method to outperform keyword retrieval.

For small r, p must be greater than 50%. For large r, this compares favorably with keyword retrieval even with fairly low disambiguation accuracy.
– E.g., with a 90/10 distribution of word senses, for the more common word sense, E with a β of 0.5 is better for a disambiguation algorithm with an accuracy over 77% than for keyword retrieval. (For the less common word sense, even a “disambiguation” algorithm that is completely random gives a superior result.)
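A worked check of the 90/10 example, assuming the standard van Rijsbergen E measure, E_β = 1 - F_β with F_β = (1+β²)PR/(β²P + R), so that lower E (higher F) is better; the exact definition of E is an assumption here, but the numbers come out consistent with the figures above:

    \begin{aligned}
    & r = 0.9, \qquad \beta = 0.5, \qquad p = 0.77\\
    &\text{Keyword retrieval (common sense):}\quad P = r = 0.9,\ R = 1,\quad
       F_{0.5} = \frac{1.25 \cdot 0.9 \cdot 1}{0.25 \cdot 0.9 + 1} \approx 0.918\\
    &\text{Disambiguation:}\quad P = \frac{rp}{rp + (1-r)(1-p)} = \frac{0.693}{0.693 + 0.023} \approx 0.968,\quad R = p = 0.77\\
    &\phantom{\text{Disambiguation:}\quad} F_{0.5} = \frac{1.25 \cdot 0.968 \cdot 0.77}{0.25 \cdot 0.968 + 0.77} \approx 0.921 > 0.918
    \end{aligned}

Equating the two F_{0.5} values and solving for p gives the break-even accuracy (1 - (1-β²)r)/(2 - (2-β²)r) ≈ 0.765 at r = 0.9, so any accuracy of roughly 77% or better beats keyword retrieval on the common sense, matching the claim above.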


More results

Latest implementation (by Heyning Cheng) reduces training time to about 1 hour (from about 24 hours); classifying 1000 documents takes about 10 minutes.

Also improved the performance of disambiguation. This made it practical to use disambiguation in topic assignment:
– I.e., it produces slightly better results, appears to be less sensitive to changes in the stoplist, and can be made to run quickly.

Disambiguation with a substantially smaller window size (even as small as 5) did not reduce accuracy; in some cases, a half-window size of 10 outperformed one of 50.

           32-word threshold      100-word threshold
           Precision   Recall     Precision   Recall
priors       92.3%      20.4%       88.3%      22.3%
disam        94.1%      22.4%       93.4%      25.8%


More results (con’t)More results (con’t)

Weighted word-sense priors by the IDF of the term; a sketch of this weighting follows the table below.

IDF        Stoplist             Precision   Recall
Not used   No computer terms      81.3%      20.7%
Not used   Computer terms         92.3%      20.4%
Used       No computer terms      86.0%      20.3%
Used       Computer terms         88.5%      20.2%
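A minimal sketch of the IDF weighting tried above: each term's prior-probability vote is scaled by its inverse document frequency, so that very common terms contribute less. The smoothing and names are illustrative assumptions:

    import math
    from collections import defaultdict

    def idf(term, doc_freq, num_docs):
        """Smoothed inverse document frequency of a term."""
        return math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1))

    def assign_topic_idf(words, sense_priors, doc_freq, num_docs):
        """Priors-based topic scoring, with each word's vote weighted by its IDF."""
        scores = defaultdict(float)
        for w in words:
            weight = idf(w, doc_freq, num_docs)
            for category, prob in sense_priors.get(w, {}).items():
                scores[category] += weight * prob
        return max(scores, key=scores.get) if scores else None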


More Results

Excluding low-utility or confusing Roget’s categories (down to about 200) improved recall to about 40% on the 1000-document test set.

The “purity” of topic assignment (the percentage of all word senses disambiguated to the assigned topic) seems to correlate with accuracy at least as well as IAGO’s ranking algorithm does (a sketch of the purity computation follows below).
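A minimal sketch of the purity measure as described above; the data layout is an assumption:

    def topic_purity(assigned_topic, disambiguated_senses):
        """Fraction of a page's disambiguated word senses that fall in the assigned topic.

        disambiguated_senses: one category label per disambiguated word occurrence.
        """
        if not disambiguated_senses:
            return 0.0
        in_topic = sum(1 for cat in disambiguated_senses if cat == assigned_topic)
        return in_topic / len(disambiguated_senses)

    # E.g., a page assigned to Music where 7 of 10 disambiguated senses are Music:
    print(topic_purity("Music", ["Music"] * 7 + ["Travel"] * 3))   # -> 0.7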


Future Work

Get better word sense proxies!

Word-sense searching
– Create word sense index
– Support word-sense searching within more general searches
– Improve disambiguation by exploiting priors
– Test against synonym expansion methods

Automatic topic-categorization
– Handle multi-word phrases; proper names


Future Plans: Longer Term

Disambiguation
– Handle non-nouns
– Better word sense source
– Automatic grouping of thesaural word senses

Topic-categorization
– Multiple topic assignment
– Quality

Summarization via the same techniques

Other linguistic choices, e.g., thematic roles