
Page 1: Graph-based WSD (continued)

Graph-based WSD (continued)

DMLA 2008-12-10

小町守

Page 2: Graph-based WSD (continued)

Word sense disambiguation task of Senseval-3 English Lexical Sample

Predict the sense of “bank”

… the financial benefits of the bank (finance)'s employee package (cheap mortgages and pensions, etc.), bring this up to …

In that same year I was posted to South Shields on the south bank (bank of the river) of the River Tyne and quickly became aware that I had an enormous burden

Possibly aligned to water: a sort of bank (???) by a rushing river.

Training instances are annotated with their senses

Predict the sense of the target word in the test set

Page 3: Graph-based WSD (continued)

WSD with an adjacency matrix

Assumption
- Similar examples tend to have the same label
- We can define a (dis-)similarity between examples (prior knowledge, kNN)

Idea
- Perform clustering on an adjacency matrix
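As a concrete illustration of this setup, here is a minimal sketch that builds such an adjacency matrix from feature vectors; the cosine similarity and the kNN sparsification are my choices, since the slide only requires some (dis-)similarity function.

```python
import numpy as np

def knn_adjacency(X, k=3):
    """Build a symmetric kNN adjacency matrix from row feature vectors X."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    S = Xn @ Xn.T                       # pairwise cosine similarity
    np.fill_diagonal(S, 0.0)            # no self-loops
    A = np.zeros_like(S)
    for i in range(len(S)):
        nn = np.argsort(S[i])[-k:]      # the k most similar other examples
        A[i, nn] = S[i, nn]
    return np.maximum(A, A.T)           # symmetrize the graph
```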

Page 4: Graph-based WSD (continued)

Intuition behind using a similarity graph
- Known labels can be propagated to unlabeled data without any overlap, as the sketch below illustrates

(Pictures taken from Zhu 2007)
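A minimal label-propagation sketch in the spirit of the Zhu (2007) pictures, reusing the adjacency matrix from the previous sketch; the iteration count and row normalization are assumptions, not details from the slide.

```python
import numpy as np

def propagate_labels(A, Y, labeled, n_iter=50):
    """A: adjacency matrix; Y: one-hot sense labels (n x c), zero rows for
    unlabeled instances; labeled: boolean mask of labeled instances."""
    P = A / (A.sum(axis=1, keepdims=True) + 1e-12)  # row-stochastic transitions
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = P @ F                # each instance absorbs its neighbors' label mass
        F[labeled] = Y[labeled]  # clamp the known labels at every step
    return F.argmax(axis=1)      # predicted sense index per instance
```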

Page 5: Graph-based WSD (continued)

Using unlabeled data via a similarity graph

Page 6: Graph-based WSD (continued)

Pros and cons
• Pros
  – Mathematically well-founded
  – Can achieve high performance if the graph is well constructed
• Cons
  – Hard to determine an appropriate graph structure (and its edge weights)
  – Relatively large computational complexity
  – Mostly transductive
    • Transductive learning: (unlabeled) test instances are given when building the classification model
    • Inductive learning: test instances are not known during training

Page 7: Graph-based WSD (continued)

Word sense disambiguation by kNN
- Seed instance = the instance whose sense we want to predict
- System output = the sense of the k nearest neighbors (k = 3)

[Figure: the seed instance and its nearest labeled neighbors in the similarity graph]
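A sketch of this prediction rule with k = 3 as on the slide; the cosine similarity and the majority vote over neighbors are assumed details.

```python
import numpy as np
from collections import Counter

def knn_wsd(X_train, senses, x_seed, k=3):
    """Predict the seed instance's sense by majority vote over its k nearest
    labeled neighbors (cosine similarity)."""
    sims = X_train @ x_seed / (
        np.linalg.norm(X_train, axis=1) * np.linalg.norm(x_seed) + 1e-12)
    top_k = np.argsort(sims)[-k:]       # indices of the k most similar instances
    return Counter(senses[i] for i in top_k).most_common(1)[0][0]
```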

Page 8: Graph-based WSD (continued)

Simplified Espresso is HITS
- Simplified Espresso = HITS in a bipartite graph whose adjacency matrix is A

Problem
- No matter which seed you start with, the same instance is always ranked topmost: semantic drift (also called topic drift in HITS)
- The ranking vector i tends to the principal eigenvector of A^T A as the iteration proceeds, regardless of the seed instances!
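The claim is easy to check numerically: iterating the HITS update amounts to power iteration on A^T A, which converges to the principal eigenvector from any seed not orthogonal to it. A small sketch with a random stand-in matrix (not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 8))           # stand-in bipartite adjacency matrix
M = A.T @ A                      # one full HITS round multiplies by A^T A

def hits_ranking(seed, n_iter=100):
    v = seed / np.linalg.norm(seed)
    for _ in range(n_iter):
        v = M @ v
        v /= np.linalg.norm(v)   # HITS normalizes after every update
    return v

r1 = hits_ranking(np.eye(8)[0])  # seeded with instance 0
r2 = hits_ranking(np.eye(8)[5])  # seeded with a different instance
print(np.allclose(r1, r2))       # True: the seed is forgotten (semantic drift)
```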

Page 9: Graph-based WSD (continued)

Convergence process of Espresso
- The heuristics in Espresso help reduce semantic drift (however, early stopping is required for optimal performance)
- In Simplified Espresso, semantic drift occurs: it outputs the most frequent sense regardless of the input

[Figure: accuracy over iterations for Original Espresso, Simplified Espresso, and the most-frequent-sense baseline]

Page 10: Graph-based WSD (continued)

Learning curve of Original Espresso: per-sense breakdown
- The number of most-frequent-sense predictions increases as iterations proceed
- Recall for infrequent senses worsens even with Original Espresso

[Figure: per-iteration recall for the most frequent sense vs. the other senses]

Page 11: Graph-based WSD (continued)

Q. What caused drift in Espresso?
A. Espresso's resemblance to HITS
- HITS is an importance computation method (it gives a single ranking list for any seeds)
- Why not use another type of link-analysis measure, one that takes seeds into account? A "relatedness" measure (it gives different rankings for different seeds)

Page 12: Graph-based WSD (continued)

The regularized Laplacian kernel
- A relatedness measure
- Takes higher-order relations into account
- Has only one parameter

Graph Laplacian:

L = D - A

Regularized Laplacian matrix:

R_\beta = \sum_{n=0}^{\infty} \beta^n (-L)^n = (I + \beta L)^{-1}

- A: adjacency matrix of the graph
- D: (diagonal) degree matrix
- β: parameter
- Each column of R_β gives the rankings relative to a node
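Since the closed form is a single matrix inverse, the kernel is cheap to compute for small graphs; a minimal sketch assuming a symmetric adjacency matrix A:

```python
import numpy as np

def regularized_laplacian(A, beta=1e-2):
    """R_beta = (I + beta * L)^(-1), with L = D - A the graph Laplacian."""
    D = np.diag(A.sum(axis=1))                 # diagonal degree matrix
    L = D - A
    return np.linalg.inv(np.eye(len(A)) + beta * L)

# Column j of R_beta ranks all nodes by relatedness to node j:
# scores = regularized_laplacian(A)[:, j]
```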

Page 13: Graph-based WSD (continued)

WSD on all nouns in Senseval-3

Algorithm                          F measure
Most frequent sense (baseline)     54.5
HyperLex                           64.6
PageRank                           64.6
Simplified Espresso                44.1
Espresso (after convergence)       46.9
Espresso (optimal stopping)        66.5
Regularized Laplacian (β = 10^-2)  67.1

- Outperforms the other graph-based methods
- Espresso needs optimal stopping to achieve equivalent performance

Page 14: Graph-based WSD (continued)

More experiments on WSD datasets
- Niu et al. "Word Sense Disambiguation using LP-based Semi-Supervised Learning" (ACL 2005)
- Pham et al. "Word Sense Disambiguation with Semi-Supervised Learning" (AAAI 2005)

Page 15: Graph-based WSD (continued)

Dataset
- Pedersen (2000) line and interest data
  - line: six senses (a cord, a product, …)
  - interest: four senses (monetary interest, concern, …)

Features
- Bag-of-words features
- Local collocation features
- Part-of-speech features
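A sketch of these three feature types for a single instance; the window size, the feature-name encoding, and the pre-tokenized input are hypothetical details not given on the slide.

```python
def extract_features(tokens, pos_tags, target_idx, window=2):
    """Bag-of-words, local collocation, and POS features for one instance."""
    feats = {}
    for tok in tokens:                              # bag-of-words over the context
        feats[f"bow={tok}"] = 1
    for off in range(-window, window + 1):          # positions around the target
        j = target_idx + off
        if off != 0 and 0 <= j < len(tokens):
            feats[f"colloc[{off}]={tokens[j]}"] = 1   # local collocation
            feats[f"pos[{off}]={pos_tags[j]}"] = 1    # part of speech
    return feats
```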

Page 16: Graph-based WSD (continued)

Result

Dataset      MFS    Niu et al.  Pham et al.  BB     proposed
interest     54.6%  79.8%       76.4%        75.5%  75.6%
line         53.5%  59.4%       68.0%        62.7%  61.3%
S3LS (1%)    54.5%  30.8%       –            –      42.1%
S3LS (10%)   54.5%  56.5%       –            –      56.0%
S3LS (25%)   54.5%  64.9%       –            –      63.2%
S3LS (50%)   54.5%  68.6%       –            –      66.3%
S3LS (75%)   54.5%  70.3%       –            –      68.8%
S3LS (100%)  54.5%  71.8%       –            –      69.8%

(–: not reported; the S3LS rows vary the fraction of labeled training data)

Page 17: Graph-based WSD (continued)

Discussion
- The proposed method (simple k-NN) achieved performance comparable to previous semi-supervised WSD systems
- Does additional data help?

Page 18: Graph-based WSD (continued)

"line" data with 90 labeled instances

Page 19: Graph-based WSD (continued)

"line" data with 150 labeled instances

Page 20: Graph-based WSD (continued)

"interest" data with 60 labeled instances

Page 21: Graph-based WSD (continued)

"interest" data with 300 labeled instances

Page 22: Graph-based WSD (continued)

Discussion (cont.)
- Additional data doesn't always help; sometimes it is worse than adding nothing!
- We haven't yet succeeded in using large-scale data on this task (BNC data could be used)
- All systems suffer from the data sparseness problem; robust feature selection (smoothing) is needed

Page 23: Graph-based WSD (continued)

Multiple clusters in similarity graphs

Generative model of co-occurrence:

P(i_i, p_j) = \sum_{z=1}^{N} P(z)\, P(i_i \mid z)\, P(p_j \mid z)
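In matrix form, the joint distribution is a sum of N rank-one topic components; a small sketch with randomly drawn, purely hypothetical pLSI parameters:

```python
import numpy as np

N, n, m = 4, 10, 6                              # topics, instances, patterns
rng = np.random.default_rng(1)
P_z = rng.dirichlet(np.ones(N))                 # P(z)
P_i_z = rng.dirichlet(np.ones(n), N).T          # (n, N): column z is P(i | z)
P_p_z = rng.dirichlet(np.ones(m), N).T          # (m, N): column z is P(p | z)

# P(i, p) = sum_z P(z) P(i|z) P(p|z): one rank-one component per topic
P_ip = P_i_z @ np.diag(P_z) @ P_p_z.T
assert np.isclose(P_ip.sum(), 1.0)              # a proper joint distribution
```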

Page 24: Graph-based WSD (continued)

Construction of the similarity matrix
- Let G_z be a hidden topic graph
  - The edge between instance i_i and pattern p_j has weight P(z | i_i, p_j)
- The adjacency matrix A_z = A(G_z) is a matrix whose (i,j)-th element holds P(z | i_i, p_j); all the other elements are set to 0
- A similarity matrix is computed as A_z^T A_z
  - Its (i,j)-th element holds the co-occurrence value between instances i_i and i_j with respect to topic z
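A sketch of this construction, assuming the topic posteriors P(z | i, p) arrive as a dense array (e.g., from a pLSI fit) whose zeros mark non-co-occurring pairs; with patterns as rows and instances as columns, A_z^T A_z is instance-by-instance, as the slide states.

```python
import numpy as np

def topic_similarities(post):
    """post: array of shape (N, patterns, instances) holding P(z | i, p),
    with zeros at non-co-occurring (i, p) pairs, so A_z = post[z].
    Returns the instance-instance similarity A_z^T A_z for each topic z."""
    return [A_z.T @ A_z for A_z in post]
```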

Page 25: Graph-based WSD (continued)

Combination of von Neumann kernels
- The von Neumann kernel matrix is defined as follows:

K_\beta = R\,(I - \beta R)^{-1} = R \sum_{n=0}^{\infty} \beta^n R^n, \quad R = A^T A

- The final kernel matrix is computed by summing the kernel matrices of all hidden topics:

M_\beta = \sum_{z=1}^{N} K_\beta^{(z)}
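Each per-topic kernel is again one linear solve; a sketch reusing the per-topic matrices A_z from the previous slide (note that the series form converges only when β is below the reciprocal of R's spectral radius, a condition the slide leaves implicit):

```python
import numpy as np

def von_neumann_kernel(A, beta):
    """K_beta = R (I - beta R)^(-1), where R = A^T A."""
    R = A.T @ A
    # The series sum_n beta^n R^n converges only for beta < 1 / rho(R)
    return R @ np.linalg.inv(np.eye(len(R)) - beta * R)

def combined_kernel(A_list, beta):
    """M_beta: the sum of the von Neumann kernels over all hidden topics."""
    return sum(von_neumann_kernel(A_z, beta) for A_z in A_list)
```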

Page 26: Graph-based WSD (continued)

Result

Dataset  MFS    Niu et al.  K-NN   pLSI
S3LS     54.5%  71.8%       69.8%  51.7%

Page 27: Graph-based WSD (continued)

Discussion
- Poor result with the proposed method
  - Likely caused by a mis-implementation or a bug
- The number of clusters (hidden variable z) does not seem to strongly affect performance (tested |z| = 5 and 20; got a 3-point improvement when increasing |z| to 20, but still below the most-frequent-sense baseline)