
DSC 2008 – 26-27 June 2008, Thessaloniki, Greece

Automatic Acquisition of Synonyms Using the Web as a Corpus

Svetlin Nakov, Sofia University "St. Kliment Ohridski"

[email protected]

3rd Annual South East European Doctoral Student Conference (DSC2008): Infusing Knowledge and Research in South East Europe


Introduction

We want to automatically extract all pairs of synonyms in a given text

Our goal:

Design an algorithm that can distinguish between synonyms and non-synonyms

Our approach:

Measure semantic similarity using the Web as a corpus

Synonyms are expected to have higher semantic similarity than non-synonyms


The Paper in One Slide

Measuring semantic similarity

Analyze the words' local contexts

Use the Web as a corpus

Similar contexts imply similar words

TF.IDF weighting & reverse context lookup

Evaluation

94 words (Russian fine arts terminology)

50 synonym pairs to be found

11pt average precision: 63.16%


Contextual Web Similarity

What is local context?

A few words before and after the target word

The words in the local context of a given word are semantically related to it

Stop words (prepositions, pronouns, conjunctions, etc.) must be excluded

Stop words appear in all contexts

A sufficiently large corpus is needed

Example:

Same day delivery of fresh flowers, roses, and unique gift baskets from our online boutique. Flower delivery online by local florists for birthday flowers.
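The idea of a local context can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `local_context` helper, the tiny stop-word set, and the regex tokenizer are hypothetical, the window of 3 words follows the context size mentioned for the SIM algorithm, and lemmatization is left out (so "flower" and "flowers" count as different words here).

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the slides do not list the words).
STOP_WORDS = {"of", "and", "from", "our", "by", "for", "the", "a", "in"}

def local_context(text, target, window=3):
    """Count non-stop words within `window` tokens of each occurrence of `target`."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, i - window)
        # Words before and after the target, excluding the target itself.
        neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
        counts.update(w for w in neighbours if w not in STOP_WORDS and w != target)
    return counts

snippet = ("Same day delivery of fresh flowers, roses, and unique gift baskets "
           "from our online boutique. Flower delivery online by local florists "
           "for birthday flowers.")
context = local_context(snippet, "flowers")
```

For the example snippet, the resulting counter holds words such as "fresh", "roses" and "birthday", while stop words like "of" and "for" are dropped.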


Contextual Web Similarity

Web as a corpus

The Web can be used as a corpus to extract the local context for a given word

The Web is the largest possible corpus

Contains large corpora in any language

Searching for a word in Google can return up to 1 000 text snippets

The target word is given along with its local context: a few words before and after it

The target language can be specified


Contextual Web Similarity

Web as a corpus

Example: Google query for "flower"

Flowers, Plants, Gift Baskets - 1-800-FLOWERS.COM - Your Florist ...

Flowers, balloons, plants, gift baskets, gourmet food, and teddy bears presented by 1-800-FLOWERS.COM, Your Florist of Choice for over 30 years.

Margarita Flowers - Delivers in Bulgaria for you! - gifts, flowers, roses ...

Wide selection of BOUQUETS, FLORAL ARRANGEMENTS, CHRISTMAS DECORATIONS, PLANTS, CAKES and GIFTS appropriate for various occasions. CREDIT cards acceptable.

Flowers, plants, roses, & gifts. Flowers delivery with fewer ...

Flowers, roses, plants and gift delivery. Order flowers from ProFlowers once, and you will never use flowers delivery from florists again.


Contextual Web Similarity

Measuring semantic similarity

For two given words, their local contexts are extracted from the Web

Each local context is a set of words and their frequencies

Semantic similarity is measured as the similarity between these local contexts

Local contexts are represented as frequency vectors over a given set of words

The cosine between the frequency vectors in Euclidean space is calculated
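As a minimal sketch of the cosine step, the frequency vectors can be stored as sparse dictionaries (word to count), so that only words actually seen in a context take up space. The `cosine` function name and the toy counts are illustrative assumptions, not taken from the slides.

```python
import math

def cosine(v1, v2):
    """Cosine of the angle between two sparse frequency vectors (dict: word -> count)."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(count * v2.get(word, 0) for word, count in v1.items())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty context is treated as similar to nothing
    return dot / (norm1 * norm2)

# Toy vectors (hypothetical counts).
a = {"rose": 3, "gift": 4}
b = {"rose": 6, "gift": 8}   # same direction as a
c = {"pc": 5}                # no words shared with a

same_direction = cosine(a, b)
no_overlap = cosine(a, c)
```

Vectors pointing in the same direction give a cosine of 1, and contexts that share no words give 0, so with non-negative counts the similarity always falls between 0 and 1.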


Contextual Web Similarity

Example of context word frequencies

word: flower

word        count
fresh       217
order       204
rose        183
delivery    165
gift        124
welcome     98
red         87
...         ...

word: computer

word        count
Internet    291
PC          286
technology  252
order       185
new         174
Web         159
site        146
...         ...


Contextual Web Similarity

Example of frequency vectors

v1: flower

#     word       freq.
0     alias      3
1     alligator  2
2     amateur    0
3     apple      5
...   ...        ...
4999  zap        0
5000  zoo        6

v2: computer

#     word       freq.
0     alias      7
1     alligator  0
2     amateur    8
3     apple      133
...   ...        ...
4999  zap        3
5000  zoo        0

Similarity = cosine(v1, v2)


TF.IDF Weighting

TF.IDF (term frequency times inverse document frequency)

Statistical measure used in information retrieval

Shows how important a certain word is for a given document in a set of documents

Increases proportionally to the number of the word's occurrences in the document

Decreases as the number of documents containing the word grows
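The slides do not give the exact TF.IDF formula, so the sketch below assumes one common variant, tf * log(N / df), and treats the local context of each target word as a "document". The `tf_idf` helper is hypothetical; the counts are borrowed from the example context tables for "flower" and "computer".

```python
import math

def tf_idf(contexts):
    """Re-weight raw context counts with tf * log(N / df), one common TF.IDF variant."""
    n_docs = len(contexts)
    # Document frequency: in how many contexts does each word occur?
    doc_freq = {}
    for counts in contexts.values():
        for word in counts:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {
        target: {w: tf * math.log(n_docs / doc_freq[w]) for w, tf in counts.items()}
        for target, counts in contexts.items()
    }

contexts = {
    "flower": {"rose": 183, "order": 204},
    "computer": {"pc": 286, "order": 185},
}
weighted = tf_idf(contexts)
```

In this toy case "order" occurs in both contexts, so its weight drops to zero, while "rose" and "pc" keep a positive weight. This is the intended effect: words that appear in every context carry no information about any particular target word.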


Reverse Context Lookup

Local context extracted from the Web can contain arbitrary parasite words like "online", "home", "search", "click", etc.

Internet terms appear in any Web page

Such words are not likely to be associated with the target word

Example (for the word flowers)

"send flowers online", "flowers here", "order flowers here"

Will the word "flowers" appear in the local context of "send", "online" and "here"?


Reverse Context Lookup

If two words are semantically related, then both of them should appear in the local contexts of each other

Let #(x, y) = number of occurrences of x in the local context of y

For any word w and a word wc from its local context, we define their strength of semantic association p(w, wc) as follows:

p(w, wc) = min{ #(w, wc), #(wc, w) }

We use p(w, wc) as vector coordinates

We introduce a minimal occurrence threshold (e.g. 5) to filter out words appearing just by chance
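The definition of p(w, wc) translates almost directly into code. The sketch below is an illustration only: the `reverse_context_weights` helper and the toy counts are hypothetical, and the threshold is applied to the minimum of the two counts, which is equivalent to requiring both counts to reach it.

```python
def reverse_context_weights(target, contexts, threshold=5):
    """Return p(target, wc) = min(#(target, wc), #(wc, target)) for each context word wc."""
    weights = {}
    for wc, forward in contexts.get(target, {}).items():
        # forward:  occurrences of wc in the local context of target, i.e. #(wc, target)
        # backward: occurrences of target in the local context of wc, i.e. #(target, wc)
        backward = contexts.get(wc, {}).get(target, 0)
        strength = min(forward, backward)
        if strength >= threshold:
            weights[wc] = strength
    return weights

# Toy context counts (hypothetical numbers).
contexts = {
    "flowers": {"send": 40, "online": 60, "roses": 25},
    "send": {"flowers": 2},
    "online": {},
    "roses": {"flowers": 30},
}
weights = reverse_context_weights("flowers", contexts)
```

Here "online" is dropped because "flowers" never shows up in its own context, "send" falls below the threshold, and only "roses" survives as a vector coordinate with weight 25.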


Data Set

We use a list of 94 Russian words:

Terms extracted from texts on the subject of fine arts

Limited to nouns only

The data set:

There are 50 synonym pairs among these words

We expect our algorithms to find them

абрис, адгезия, алмаз, алтарь, амулет, асфальт, беломорит, битум, бородки, ваятель, вермильон, ..., шлифовка, штихель, экспрессивность, экспрессия, эстетизм, эстетство


Experiments

We tested a few modifications of our contextual Web similarity algorithm:

Basic algorithm (without modifications)

TF.IDF weighting

Reverse context lookup with different frequency thresholds


Experiments

RAND – random ordering of all the pairs

SIM – the basic algorithm for extraction of semantic similarity from the Web

Context size of 3 words

Without analyzing the reverse context

With lemmatization

SIM+TFIDF – modification of the SIM algorithm with TF.IDF weighting

REV2, REV3, REV4, REV5, REV6, REV7 – the SIM algorithm + “reverse context lookup” with frequency thresholds of: 2, 3, 4, 5, 6 and 7


Resources Used

We used the following resources:

Google Web search engine: we extracted the first 1 000 results for 82 645 Russian words

Russian lemma dictionary: 1 500 000 wordforms and 100 000 lemmata

A list of 507 Russian stop words


Evaluation

Our algorithms arrange all pairs of words according to their semantic similarity

We expect the 50 synonym pairs to be at the top of the result list

We count how many synonyms are found in the top N results (e.g. top 5, top 10, etc.)

We measure precision and recall

We measure 11pt average precision to evaluate the results
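The evaluation measures can be sketched as follows. The slides do not define 11pt average precision, so this assumes the standard interpolated definition from information retrieval: the mean, over the recall levels 0.0, 0.1, ..., 1.0, of the best precision reached at that recall or higher. Both function names are hypothetical. The yes/no flags are the first nine rows of the SIM results table.

```python
def precision_recall(is_synonym, total_synonyms):
    """Precision and recall after each rank of a result list (True = synonym pair)."""
    points, found = [], 0
    for n, hit in enumerate(is_synonym, start=1):
        found += hit
        points.append((found / n, found / total_synonyms))
    return points

def eleven_point_average_precision(points):
    """Mean interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    total = 0.0
    for level in (i / 10 for i in range(11)):
        # Best precision at any rank whose recall reaches this level (0 if none does).
        reachable = [p for p, r in points if r >= level]
        total += max(reachable, default=0.0)
    return total / 11

# First nine rows of the SIM results: five synonyms, two non-synonyms, two synonyms.
flags = [True] * 5 + [False] * 2 + [True] * 2
points = precision_recall(flags, total_synonyms=50)

# A perfect ranking of two synonym pairs scores 1.0.
perfect = eleven_point_average_precision(precision_recall([True, True], 2))
```

For the nine rows, precision at rank 6 is 5/6 and recall at rank 9 is 7/50, matching the 83.33% and 14% reported for the SIM algorithm. A meaningful 11pt score needs the complete ranked list, since recall levels that are never reached contribute zero.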


SIM Algorithm – Results

n    Word 1          Word 2      Semantic similarity  Synonyms  Precision @ n  Recall @ n
1    выжигание       пирография  0.433805             yes       100.00%        2%
2    тонирование     тонировка   0.382357             yes       100.00%        4%
3    гематит         кровавик    0.325138             yes       100.00%        6%
4    подрамок        подрамник   0.271659             yes       100.00%        8%
5    оливин          перидот     0.252256             yes       100.00%        10%
6    полирование     шлифование  0.220559             no        83.33%         10%
7    полировка       шлифовка    0.216347             no        71.43%         10%
8    амулет          талисман    0.200595             yes       75.00%         12%
9    пластификаторы  мягчители   0.170770             yes       77.78%         14%
...  ...             ...         ...                  ...       ...            ...

Precision and recall obtained by the SIM algorithm


Comparison of the Algorithms

Comparison of the algorithms (number of synonyms in the top results)

Algorithm  1    5    10   20   30   40   50   100  200  Max
RAND       0    0.1  0.1  0.2  0.3  0.4  0.6  1.1  2.3  50
SIM        1    5    8    15   18   23   25   39   48   50
SIM+TFIDF  1    4    8    16   22   27   29   43   48   50
REV2       1    4    8    16   21   27   32   42   43   46
REV3       1    4    8    16   20   28   32   41   42   46
REV4       1    4    8    15   20   28   33   41   42   45
REV5       1    4    8    15   20   28   33   40   41   42
REV6       1    4    8    15   22   28   32   39   40   42
REV7       1    4    8    15   21   27   30   37   39   40
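To put the RAND row in perspective, the expected number of synonym pairs in the top n of a uniformly random ordering can be estimated as n * K / N, where K is the number of synonym pairs and N the number of ranked pairs. The sketch below assumes that every unordered pair of the 94 words is ranked; the slides do not state this, nor how the RAND figures were obtained, and the helper name is hypothetical.

```python
from math import comb

def expected_random_hits(n, total_pairs, synonym_pairs):
    """Expected synonym pairs among the top n of a uniformly random ordering."""
    return n * synonym_pairs / total_pairs

# Assumption: all unordered pairs of the 94 words are ranked.
total_pairs = comb(94, 2)
expected = {n: round(expected_random_hits(n, total_pairs, 50), 1)
            for n in (10, 50, 100, 200)}
```

Under this assumption there are 4 371 candidate pairs, and the expected counts at the cut-offs 10, 50, 100 and 200 are close to the values in the RAND row, which shows how far above chance the other algorithms rank the synonym pairs.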


Comparison of the Algorithms (11pt Average Precision)

Comparing RAND, SIM, SIM+TFIDF and REV2 … REV7

[Bar chart: 11pt average precision per algorithm. RAND: 1.15%, SIM: 58.98%, SIM+TFIDF: 63.16%, REV2 to REV7: n/a]


Results (Precision-Recall Graph)

Comparing the recall-precision graphs of evaluated algorithms


Discussion

Our approach is original because it:

Measures semantic similarity automatically

Uses the Web as a corpus

Does not rely on any preexisting corpora

Does not require semantic resources like WordNet and EuroWordNet

Works for any language

Tested for Bulgarian and Russian

Uses reverse-context lookup and TF.IDF

Significant improvement in quality


Discussion

Good accuracy, but far from 100%

Known problems of the proposed algorithms:

Semantically related words are not always synonyms

red – blue

wood – pine

apple – computer

Similar contexts do not always mean similar words (distributional hypothesis)

The Web as a corpus introduces noise

Google returns the first 1 000 results only


Discussion

Known problems of the proposed algorithms:

Google ranks news portals, travel agencies and retail sites higher than books, articles and forum messages

Local context always contains noise

Working with words, not capturing phrases


Conclusion and Future Work

Conclusion

Our algorithms can distinguish between synonyms and non-synonyms

Accuracy should be improved

Future Work

Additional techniques to distinguish between synonyms and semantically related words

Improve the semantic similarity measure algorithm




Questions?

Automatic Acquisition of Synonyms Using the Web as a Corpus