Discriminating Word Senses Using McQuitty’s Similarity Analysis

Amruta Purandare, University of Minnesota, Duluth
Advisor: Dr. Ted Pedersen
Research supported by National Science Foundation (NSF) Faculty Early Career Development Award (#0092784)


Page 1: Discriminating Word Senses Using  McQuitty’s Similarity Analysis

Discriminating Word Senses Using McQuitty’s Similarity Analysis

Amruta Purandare
University of Minnesota, Duluth

Advisor: Dr. Ted Pedersen

Research supported by National Science Foundation (NSF) Faculty Early Career Development Award (#0092784)

Page 2: Discriminating Word Senses Using  McQuitty’s Similarity Analysis

Discriminating “line”

They will begin line formation before ceremony
Connect modem to any jack on your line
Quit printing after the last line of each file
Your line will not get tied while you are connected to net
Stand balanced and comfortable during line up
Lines that do not fit a page are truncated
New line service provides reliable connections
Pages are separated by line feed characters
They stand far right when in line formation

Page 3: Discriminating Word Senses Using  McQuitty’s Similarity Analysis

They will begin line formation before ceremony
Stand balanced and comfortable during line up
They stand far right when in line formation

Your line will not get tied while you are connected to net
Connect modem to any jack on your line
New line service provides reliable connections

Quit printing after the last line of each file
Lines that do not fit a page are truncated
Pages are separated by line feed characters

Page 4: Discriminating Word Senses Using  McQuitty’s Similarity Analysis

Introduction
• What is Word Sense Discrimination?
• Unsupervised learning

[Diagram labels: Training, Test, Features, Feature Vectors, Clusters]

Page 5: Discriminating Word Senses Using  McQuitty’s Similarity Analysis

Representing context
• Features (from training)
  • Bigrams
  • Unigrams
  • Second Order Co-occurrences / SOCs (Schütze98)
  • Mixture
• Feature vectors (Binary)
• Measuring similarity
  • Cosine
  • Match
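The two similarity measures on binary feature vectors can be sketched as follows. This is a minimal illustration, not the toolkit's code: it assumes that Match counts the features two contexts share and that Cosine divides that count by the geometric mean of the two vectors' feature counts. The function names and the small feature list are hypothetical.

```python
import math

def binary_vector(context, features):
    """Mark which features (here: unigrams) occur in a context."""
    tokens = set(context.lower().split())
    return [1 if f in tokens else 0 for f in features]

def match(u, v):
    """Assumed Match measure: number of features present in both vectors."""
    return sum(a & b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine for binary vectors: shared features scaled by vector sizes."""
    norm = math.sqrt(sum(u) * sum(v))
    return match(u, v) / norm if norm else 0.0

# Hypothetical feature list drawn from the unigram examples for "line".
features = ["service", "connection", "modem", "jack", "reliable"]
vec = binary_vector("Connect modem to any jack on your line", features)

# Two hand-made vectors: Match ignores how many features each context has,
# while Cosine scales the overlap by it.
u = [1, 1, 0, 1, 0]
v = [1, 0, 0, 1, 1]
print(vec, match(u, v), round(cosine(u, v), 4))
```

Because Cosine normalizes by vector size, a context that simply contains many features does not automatically look similar to everything else, whereas its raw Match count would grow.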

Page 6: Discriminating Word Senses Using  McQuitty’s Similarity Analysis

Feature examples
<features> for line

Unigram   <blank> <text> <service> <connection> <modem>
          <paragraph> <jack> <reliable> <circuit> <file>

Bigram    <blank, line> <text, line> <text, paragraph> <blank, space>
          <line, service> <modem, jack> <phone, service> <connection, line>
          <reliable, connection>

SOCs      <space> <paragraph> <phone> <reliable>

Page 7: Discriminating Word Senses Using  McQuitty’s Similarity Analysis

McQuitty’s method
• Pedersen & Bruce, 1997
• Agglomerative
• UPGMA / Average Link
• Stopping rules
  – Number of clusters
  – Score cutoff
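The agglomerative procedure and its two stopping rules can be sketched as below. This is an illustration under stated assumptions, not the authors' PDL implementation: it uses the update rule usually attributed to McQuitty, where the similarity between a newly merged cluster and any other cluster is the plain average of the two merged clusters' similarities to it. The function name and parameters (`num_clusters`, `score_cutoff`) are hypothetical.

```python
def mcquitty_cluster(sim, num_clusters=1, score_cutoff=None):
    """Agglomerative clustering over a symmetric similarity matrix.

    Stops when `num_clusters` clusters remain, or when the best available
    merge scores below `score_cutoff` (if a cutoff is given).
    """
    clusters = {i: [i] for i in range(len(sim))}  # cluster id -> member indices
    scores = {frozenset((i, j)): sim[i][j]
              for i in clusters for j in clusters if i < j}
    next_id = len(sim)

    while len(clusters) > max(num_clusters, 1):
        pair = max(scores, key=scores.get)        # most similar pair of clusters
        if score_cutoff is not None and scores[pair] < score_cutoff:
            break                                 # stopping rule: score cutoff
        a, b = tuple(pair)
        merged = clusters.pop(a) + clusters.pop(b)
        del scores[pair]
        for k in clusters:
            # Assumed McQuitty update: average the two old similarities.
            s_ak = scores.pop(frozenset((a, k)))
            s_bk = scores.pop(frozenset((b, k)))
            scores[frozenset((next_id, k))] = (s_ak + s_bk) / 2
        clusters[next_id] = merged
        next_id += 1

    return sorted(sorted(members) for members in clusters.values())

# Toy similarity matrix for four contexts (made-up values).
sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.2, 0.1],
       [0.1, 0.2, 1.0, 0.8],
       [0.2, 0.1, 0.8, 1.0]]
print(mcquitty_cluster(sim, num_clusters=2))
print(mcquitty_cluster(sim, score_cutoff=0.85))
```

The method works directly on pairwise similarities, so it never needs the original feature vectors once the similarity matrix has been built.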

Page 8: Discriminating Word Senses Using  McQuitty’s Similarity Analysis

Evaluation

       S1   S2   S3   S4
C1     10    0    3    2
C2      1    1    7    1
C3      2    1    1    6
C4      2   15    1    2

Page 9: Discriminating Word Senses Using  McQuitty’s Similarity Analysis

Evaluation

        S1   S3   S4   S2   Total
C1      10    3    2    0      15
C2       1    7    1    1      10
C3       2    1    6    1      10
C4       2    1    2   15      20
Total   15   12   11   17      55
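The column reordering in the Evaluation tables can be read as finding the cluster-to-sense assignment that places the largest counts on the diagonal. The sketch below assumes that accuracy is then the diagonal sum divided by the total number of instances; the slides do not state the formula, and the function name is hypothetical. It brute-forces all permutations, which is only practical for a handful of senses.

```python
from itertools import permutations

def best_alignment(table):
    """Return (mapping, correct, total) for a square cluster-by-sense table.

    mapping[r] is the column (sense) assigned to row (cluster) r so that
    the sum of the selected cells is as large as possible.
    """
    n = len(table)
    mapping = max(permutations(range(n)),
                  key=lambda p: sum(table[r][p[r]] for r in range(n)))
    correct = sum(table[r][mapping[r]] for r in range(n))
    total = sum(sum(row) for row in table)
    return mapping, correct, total

# Counts from the Evaluation table, columns in the order S1, S2, S3, S4.
table = [[10, 0, 3, 2],
         [1, 1, 7, 1],
         [2, 1, 1, 6],
         [2, 15, 1, 2]]
mapping, correct, total = best_alignment(table)
print(mapping, correct, total, round(correct / total, 2))
```

For this table the best assignment pairs C1 with S1, C2 with S3, C3 with S4 and C4 with S2, which is the column order S1, S3, S4, S2 used in the second Evaluation table.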

Page 10: Discriminating Word Senses Using  McQuitty’s Similarity Analysis

Majority Sense Classifier
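A majority sense classifier is commonly defined as a baseline that assigns every instance to the most frequent sense, so its accuracy equals that sense's share of the data. The sketch below follows that common definition, which the slide does not spell out, and uses the sense totals from the Evaluation table; the function name is hypothetical.

```python
def majority_baseline(sense_counts):
    """Return the most frequent sense and the accuracy of always guessing it."""
    sense = max(sense_counts, key=sense_counts.get)
    return sense, sense_counts[sense] / sum(sense_counts.values())

# Column totals from the Evaluation table.
counts = {"S1": 15, "S2": 17, "S3": 12, "S4": 11}
sense, accuracy = majority_baseline(counts)
print(sense, round(accuracy, 3))
```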

Page 11: Discriminating Word Senses Using  McQuitty’s Similarity Analysis

Experimental Data

              Line          Senseval-2
#Senses       6             Variable (selected top 5)
#Instances    4146          120/word, 73 words
              (1200:600)    (100-150:50-100)

Page 12: Discriminating Word Senses Using  McQuitty’s Similarity Analysis

Scope of the experiments

• 584 experiments (73 * 4 * 2)
  – 73 Words: 72 Senseval-2, LINE
  – 4 Features: Bigrams, Unigrams, SOCs, Mix
  – 2 Similarity Measures: Match, Cosine
• Window = 5
  – for Bigrams and SOCs
• Frequency cutoff = 2
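The window and frequency-cutoff settings can be illustrated with a small stand-in for the feature extractor. This is not NSP itself. It assumes that a window of 5 means a bigram may pair a word with any of the next four words, and that a cutoff of 2 keeps only pairs seen at least twice in the training data. Both readings, and the function names, are assumptions.

```python
from collections import Counter

def windowed_bigrams(tokens, window=5):
    """Yield ordered pairs whose two words fall within `window` tokens."""
    for i, first in enumerate(tokens):
        for second in tokens[i + 1 : i + window]:
            yield (first, second)

def select_features(sentences, window=5, cutoff=2):
    """Count windowed bigrams and keep those seen at least `cutoff` times."""
    counts = Counter()
    for sentence in sentences:
        counts.update(windowed_bigrams(sentence.lower().split(), window))
    return {pair: n for pair, n in counts.items() if n >= cutoff}

# Two made-up training contexts for "line".
training = ["they stand in line formation",
            "begin line formation before ceremony"]
features = select_features(training)
print(features)
```

The cutoff discards pairs that occur only once, which is why only the pair shared by both contexts survives here.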

Page 13: Discriminating Word Senses Using  McQuitty’s Similarity Analysis

Senseval-2 Results POS wise

N        COS   MAT
SOC        6     7
BI         5     3
UNI        7     8

V        COS   MAT
SOC       11     6
BI         5     5
UNI       13     9

ADJ      COS   MAT
SOC        1     1
BI         0     0
UNI        1     0

Number of words of each POS for which the experiment obtained accuracy higher than the majority classifier

Page 14: Discriminating Word Senses Using  McQuitty’s Similarity Analysis

Senseval-2 Results Feature wise

SOC      COS   MAT
N          6     7
V         11     6
ADJ        1     1

BI       COS   MAT
N          5     3
V          5     5
ADJ        0     0

UNI      COS   MAT
N          7     8
V         13     9
ADJ        1     0

Page 15: Discriminating Word Senses Using  McQuitty’s Similarity Analysis

Senseval-2 Results Measure wise

COS      SOC    BI   UNI
N          6     5     7
V         11     5    13
ADJ        1     0     1

MAT      SOC    BI   UNI
N          7     3     8
V          6     5     9
ADJ        1     0     0

Page 16: Discriminating Word Senses Using  McQuitty’s Similarity Analysis

Line Results

         COS    MAT
SOC      0.25   0.23
BI       0.19   0.18
UNI      0.21   0.20

On uniform distribution of 6 senses

Page 17: Discriminating Word Senses Using  McQuitty’s Similarity Analysis

Sample Confusion Table (fine.soc.cos)

            S0      S1      S2      S3      S4    Total       %
             2       0       0       1       0        3    5.00
             1       0       4       2       0        7   11.67
             2       5      25       2       4       38   63.33
             1       0       0       9       0       10   16.67
             1       0       1       0       0        2    3.33
Total        7       5      30      14       4       60
%        11.67    8.33   50.00   23.33    6.67

S0 = elegant
S1 = small grained
S2 = superior
S3 = satisfactory
S4 = thin

Page 18: Discriminating Word Senses Using  McQuitty’s Similarity Analysis

Conclusions
• Small set of SOCs was powerful
  – Half the number of unigrams/bigrams
• Scaling done by Cosine helps!
• Need more training data!
• Need to improve feature…
  • selection (Tests of association)
  • extraction (Stemming)
  • matching (Fuzzy matching)
  …strategies for bigrams
• Explore new features
  • POS
  • Collocations

Page 19: Discriminating Word Senses Using  McQuitty’s Similarity Analysis

Recent work
• PDL implementation
• Cluto - Clustering Toolkit
  http://www-users.cs.umn.edu/~karypis/cluto
  • 6 clustering methods, 12 merging criteria
• Plans
  – Comparing clustering in similarity space vs. vector space (Schütze, 1998)
  – Stopping rules

Page 20: Discriminating Word Senses Using  McQuitty’s Similarity Analysis

Sense labeling

They will begin line formation before ceremony
Stand balanced and comfortable during line up
They stand far right when in line formation

Your line will not get tied while you are connected to net
Connect modem to any jack on your line
New line service provides reliable connections

Quit printing after the last line of each file
Lines that do not fit a page are truncated
Pages are separated by line feed characters

Page 21: Discriminating Word Senses Using  McQuitty’s Similarity Analysis

Software Packages
• SenseClusters (Our Discrimination Toolkit)
  http://www.d.umn.edu/~tpederse/senseclusters.html
• PDL (Used to implement clustering algorithms)
  http://pdl.perl.org/
• NSP (Used for extracting features)
  http://www.d.umn.edu/~tpederse/nsp.html
• SenseTools (Used for preprocessing, feature matching)
  http://www.d.umn.edu/~tpederse/sensetools.html
• Cluto (Clustering Toolkit)
  http://www-users.cs.umn.edu/~karypis/cluto