
1

A Web Search Engine-Based Approach to Measure Semantic Similarity between Words

Presenter: Guan-Yu Chen

IEEE Trans. on Knowledge & Data Engineering, 23(7), 2011.

Danushka Bollegala, Yutaka Matsuo, & Mitsuru Ishizuka

2

Outline

1. Introduction

2. Related Work

3. Method

4. Experiments

5. Conclusion

3

1. Introduction (1/5)

• Semantic Similarity

– Web mining: community extraction, relation detection, & entity disambiguation.

– Information retrieval: to retrieve a set of documents that is semantically related to a given user query.

– Natural language processing: word sense disambiguation, textual entailment, & automatic text summarization.

4

1. Introduction (2/5)

• Web search engines

– Page count: the number of pages that contain the query words.

– Snippets: a brief window of text extracted by a search engine around the query term in a document.

5

1. Introduction (3/5)

• Page count

– In Google, the page count for “apple” AND “computer” is 288,000,000; for “banana” AND “computer” it is 3,590,000.

– By page count, “apple” and “computer” are therefore judged far more similar than “banana” and “computer”.

6

1. Introduction (4/5)

• Snippets

– “Jaguar” AND “cat”

– Jaguar is the largest cat → X is the largest Y

7

1. Introduction (5/5)

• Web search engine (Google) + Page count + Snippets → Semantic Similarity

8

2. Related Work (1/2)

• Normalized Google Distance (NGD)

– Cilibrasi & Vitanyi, 2007.

$$
\mathrm{NGD}(P,Q) = \frac{\max\{\log H(P), \log H(Q)\} - \log H(P,Q)}{\log N - \min\{\log H(P), \log H(Q)\}}
$$

P and Q: the two words;

NGD(P,Q): the distance between P and Q;

H(P), H(Q): the page counts for the words P and Q;

H(P,Q): the page count for the query “P AND Q”;

N: the number of documents indexed by the search engine.
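
The NGD definition on this slide can be sketched in Python as follows. The function name `ngd` and the choice to return infinity when a page count is zero are assumptions made for illustration; the slides do not say how zero counts are handled.

```python
import math

def ngd(h_p: int, h_q: int, h_pq: int, n: int) -> float:
    """Normalized Google Distance from page counts H(P), H(Q), H(P,Q) and index size N."""
    if min(h_p, h_q, h_pq) <= 0:
        # Assumption: a zero page count is treated as "infinitely distant".
        return float("inf")
    log_p, log_q = math.log(h_p), math.log(h_q)
    numerator = max(log_p, log_q) - math.log(h_pq)
    denominator = math.log(n) - min(log_p, log_q)
    return numerator / denominator
```

Two words that always co-occur (H(P) = H(Q) = H(P,Q)) get a distance of 0, and the distance grows as the joint page count shrinks relative to the individual counts.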

9

2. Related Work (2/2)

• Co-occurrence Double-Checking (CODC)

– Chen et al., 2006.

$$
\mathrm{CODC}(P,Q) =
\begin{cases}
0, & \text{if } f(P@Q) = 0, \\
\exp\left(\log\left[\dfrac{f(P@Q)}{H(P)} \times \dfrac{f(Q@P)}{H(Q)}\right]^{\alpha}\right), & \text{otherwise.}
\end{cases}
$$

f(P@Q): the number of occurrences of P in the top-ranking snippets for the query Q in Google;

H(P): the page count for query P;

α: a constant in this model, experimentally set to 0.15.
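
A minimal Python sketch of CODC, under one loud assumption: the exponent α is read as applying to the bracketed product inside the logarithm, so that exp(log[x]^α) reduces to x^α. The slide's notation is ambiguous on this point, and the function name `codc` is hypothetical.

```python
def codc(f_p_at_q: int, f_q_at_p: int, h_p: int, h_q: int, alpha: float = 0.15) -> float:
    """CODC from snippet occurrence counts f(P@Q), f(Q@P) and page counts H(P), H(Q)."""
    if f_p_at_q == 0:
        return 0.0  # the zero case stated on the slide
    product = (f_p_at_q / h_p) * (f_q_at_p / h_q)
    # Assumed reading: exp(log(product ** alpha)) == product ** alpha.
    return product ** alpha
```

Under this reading the score lies in [0, 1] whenever each snippet count does not exceed the corresponding page count.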

10

3. Method

1. Outline

2. Page Count-Based Co-Occurrence Measures

3. Lexical Pattern Extraction

4. Lexical Pattern Clustering

5. Measuring Semantic Similarity

6. Training

11

3.1 Outline

12

3.2 Page Count-Based Co-Occurrence Measures (1/2)

• P∩Q denotes the conjunction query “P AND Q”.

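$$
\mathrm{WebJaccard}(P,Q) =
\begin{cases}
0, & \text{if } H(P \cap Q) \le c, \\
\dfrac{H(P \cap Q)}{H(P) + H(Q) - H(P \cap Q)}, & \text{otherwise.}
\end{cases}
$$

$$
\mathrm{WebOverlap}(P,Q) =
\begin{cases}
0, & \text{if } H(P \cap Q) \le c, \\
\dfrac{H(P \cap Q)}{\min\{H(P), H(Q)\}}, & \text{otherwise.}
\end{cases}
$$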
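
WebJaccard and WebOverlap translate directly into code. In this sketch the threshold c is a required argument because the slides do not state its value; the function names are illustrative.

```python
def web_jaccard(h_p: int, h_q: int, h_pq: int, c: int) -> float:
    """WebJaccard from page counts; h_pq is the count for the query "P AND Q"."""
    if h_pq <= c:
        return 0.0  # too few co-occurrences to be trusted
    return h_pq / (h_p + h_q - h_pq)

def web_overlap(h_p: int, h_q: int, h_pq: int, c: int) -> float:
    """WebOverlap from page counts, normalised by the rarer word."""
    if h_pq <= c:
        return 0.0
    return h_pq / min(h_p, h_q)
```

For example, with H(P) = 100, H(Q) = 50, H(P∩Q) = 25 and c = 5, WebJaccard is 25/125 and WebOverlap is 25/50.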

13

3.2 Page Count-Based Co-Occurrence Measures (2/2)

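$$
\mathrm{WebDice}(P,Q) =
\begin{cases}
0, & \text{if } H(P \cap Q) \le c, \\
\dfrac{2\,H(P \cap Q)}{H(P) + H(Q)}, & \text{otherwise.}
\end{cases}
$$

$$
\mathrm{WebPMI}(P,Q) =
\begin{cases}
0, & \text{if } H(P \cap Q) \le c, \\
\log_2\left(\dfrac{H(P \cap Q)/N}{\big(H(P)/N\big)\big(H(Q)/N\big)}\right), & \text{otherwise.}
\end{cases}
$$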

N: the number of documents indexed by the search engine.
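
A short Python sketch of WebDice and WebPMI. As before, c is a required argument because its value is not given on the slides, and the function names are illustrative.

```python
import math

def web_dice(h_p: int, h_q: int, h_pq: int, c: int) -> float:
    """WebDice from page counts; h_pq is the count for the query "P AND Q"."""
    if h_pq <= c:
        return 0.0
    return 2 * h_pq / (h_p + h_q)

def web_pmi(h_p: int, h_q: int, h_pq: int, n: int, c: int) -> float:
    """WebPMI: log2 of the joint probability over the product of the marginals."""
    if h_pq <= c:
        return 0.0
    joint = h_pq / n
    return math.log2(joint / ((h_p / n) * (h_q / n)))
```

WebPMI is positive when the two words co-occur more often than independence would predict and negative when they co-occur less often.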

14

3.3 Lexical Pattern Extraction (1/2)

Conditions:

1. A subsequence must contain exactly one occurrence of each of X and Y.

2. The maximum length of a subsequence is L words.

3. A subsequence is allowed to skip one or more words. However, we do not skip more than g words consecutively. Moreover, the total number of words skipped in a subsequence should not exceed G.

4. We expand all negation contractions in a context. For example, didn’t is expanded to did not. We do not skip the word not when generating subsequences. For example, this condition ensures that from the snippet X is not a Y, we do not produce the subsequence X is a Y.
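
The four conditions can be sketched as a brute-force enumeration over token subsequences. This is a minimal illustration, not the authors' algorithm: it assumes the snippet is already tokenised, the two query words are already replaced by `X` and `Y`, and contractions are already expanded. The defaults L = 5, g = 2, G = 4 are the values quoted on the training slide.

```python
from itertools import combinations

def extract_patterns(tokens, L=5, g=2, G=4):
    """Return the set of subsequences of `tokens` that satisfy conditions 1-4."""
    patterns = set()
    for size in range(2, L + 1):                      # condition 2: at most L words
        for idx in combinations(range(len(tokens)), size):
            chosen = [tokens[i] for i in idx]
            if chosen.count("X") != 1 or chosen.count("Y") != 1:
                continue                              # condition 1
            gaps = [b - a - 1 for a, b in zip(idx, idx[1:])]
            if max(gaps) > g or sum(gaps) > G:
                continue                              # condition 3
            skipped = set(range(idx[0], idx[-1] + 1)) - set(idx)
            if any(tokens[i] == "not" for i in skipped):
                continue                              # condition 4: never skip "not"
            patterns.add(" ".join(chosen))
    return patterns
```

For the tokens of "X is not a Y", the result contains "X is not a Y" and "X not Y" but never "X is a Y". Enumerating all index combinations is exponential in the snippet length, which is acceptable only because snippets are short.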

15

3.3 Lexical Pattern Extraction (2/2)

• X, a large Y

• X a flightless Y

• X, large Y lives

A snippet retrieved for the query “ostrich*******bird.”

16

3.4 Lexical Pattern Clustering (1/2)

Word-pair frequency: $f(P_i, Q_i, a_j)$

Total occurrence: $\mu(a_j) = \sum_i f(P_i, Q_i, a_j)$

$a_j$: a pattern in the pattern vector $\mathbf{a}$.

$(P_i, Q_i)$: a word pair.
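
The word-pair frequencies and the total occurrence of a pattern can be accumulated as below. The input format (one `(P, Q, pattern)` triple per extracted occurrence) and the function names are assumptions made for this sketch.

```python
from collections import Counter, defaultdict

def pattern_vectors(observations):
    """Map each pattern to a Counter of word-pair frequencies f(P, Q, pattern)."""
    vectors = defaultdict(Counter)
    for p, q, pattern in observations:
        vectors[pattern][(p, q)] += 1
    return vectors

def total_occurrence(vectors, pattern):
    """Total occurrence of a pattern: its frequency summed over all word pairs."""
    return sum(vectors[pattern].values())
```

Each Counter plays the role of the frequency vector of one pattern, indexed by word pair; patterns whose vectors are similar are the candidates for ending up in the same cluster.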

17

3.4 Lexical Pattern Clustering (2/2)

18

3.5 Measuring Semantic Similarity (1/5)

Weight of a pattern $a_i$ in a cluster $c_j$:

$$
w_{ij} = \frac{\mu(a_i)}{\sum_{t \in c_j} \mu(t)}
$$

The jth feature for a word pair (P, Q):

$$
f_j = \sum_{a_i \in c_j} w_{ij}\, f(P, Q, a_i)
$$
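
A cluster feature is a weighted sum of the word pair's pattern frequencies, where each pattern's weight is its share of the cluster's total occurrences. The sketch below assumes that the per-pattern totals and the word pair's pattern frequencies are available as plain dictionaries; all names are illustrative.

```python
def cluster_feature(cluster, totals, pair_freq):
    """Feature of one cluster for one word pair.

    cluster:   list of patterns belonging to the cluster
    totals:    pattern -> total occurrence over all word pairs
    pair_freq: pattern -> frequency with the word pair (P, Q); missing means 0
    """
    cluster_total = sum(totals[a] for a in cluster)
    return sum((totals[a] / cluster_total) * pair_freq.get(a, 0) for a in cluster)
```

With totals of 30 and 10 for two patterns in a cluster (weights 0.75 and 0.25) and pair frequencies of 4 and 8, the feature value is 0.75 × 4 + 0.25 × 8 = 5.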

19


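Feature vector for a word pair (P, Q):

$$
\mathbf{f}_{PQ} =
\begin{pmatrix}
f_1 \\ \vdots \\ f_N \\
\mathrm{WebJaccard}(P,Q) \\
\mathrm{WebOverlap}(P,Q) \\
\mathrm{WebDice}(P,Q) \\
\mathrm{WebPMI}(P,Q)
\end{pmatrix}
$$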

20


Train a two-class SVM (synonymous / nonsynonymous):

$$
S = \{(P_k, Q_k, y_k)\}, \quad y_k \in \{-1, 1\}
$$

Semantic similarity:

$$
\mathrm{sim}(P^*, Q^*) = p(y^* = 1 \mid \mathbf{f}^*)
$$

21


Distance:

$$
d(\mathbf{f}^*) = \sum_k y_k \alpha_k K(\mathbf{f}_k, \mathbf{f}^*) + b
$$

b: the bias term of the hyperplane;

α_k: the Lagrange multiplier;

f_k: a support vector;

K(f_k, f*): the value of the kernel function;

f*: the instance to classify.
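
The distance of an instance from the SVM hyperplane is a kernel-weighted sum over the support vectors plus the bias. The sketch below uses a linear kernel purely as a placeholder, since the slide does not name a kernel; the support vectors, labels, multipliers and bias would come from an already trained SVM.

```python
def linear_kernel(u, v):
    """Dot product; a placeholder kernel, not necessarily the one used in the paper."""
    return sum(a * b for a, b in zip(u, v))

def decision_value(f_star, support_vectors, labels, alphas, b, kernel=linear_kernel):
    """Sum over k of y_k * alpha_k * K(f_k, f_star), plus the bias b."""
    return sum(
        y * alpha * kernel(f_k, f_star)
        for f_k, y, alpha in zip(support_vectors, labels, alphas)
    ) + b
```

A positive value places the instance on the synonymous side of the hyperplane, a negative value on the nonsynonymous side.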

22


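The probability:

$$
p(y = 1 \mid d(\mathbf{f})) = \frac{1}{1 + \exp(\lambda\, d(\mathbf{f}) + \mu)}
$$

Log likelihood:

$$
L(\lambda, \mu) = \sum_{k=1}^{N} \log p(y_k \mid \mathbf{f}_k; \lambda, \mu)
= \sum_{k=1}^{N} \{\, t_k \log(p_k) + (1 - t_k) \log(1 - p_k) \,\}
$$

$$
t_k = (y_k + 1)/2
$$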
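
The sigmoid mapping from SVM distance to probability, and the log likelihood used to fit it, can be sketched as follows. The parameter names `lam` and `mu` stand for the two sigmoid parameters in p = 1 / (1 + exp(lam * d + mu)). The fitting procedure itself is not shown because the slides do not describe it.

```python
import math

def sigmoid_probability(d, lam, mu):
    """p(y = 1 | d) = 1 / (1 + exp(lam * d + mu))."""
    return 1.0 / (1.0 + math.exp(lam * d + mu))

def log_likelihood(distances, labels, lam, mu):
    """Sum of t*log(p) + (1 - t)*log(1 - p), with targets t = (y + 1) / 2."""
    total = 0.0
    for d, y in zip(distances, labels):
        t = (y + 1) / 2                      # maps y in {-1, 1} to t in {0, 1}
        p = sigmoid_probability(d, lam, mu)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return total
```

With a negative `lam`, larger positive distances give probabilities closer to 1. Maximising the log likelihood over the two parameters calibrates the sigmoid on the training pairs.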

23

3.6 Training (1/5)

Number of Patterns Extracted for Training Data

Synonymous: (A, B), (C, D)

Nonsynonymous: (A, D), (C, B)
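
The slide's example, where (A, B) and (C, D) become (A, D) and (C, B), suggests that nonsynonymous pairs are formed by re-pairing words across synonymous pairs. The sketch below does this with a fixed rotation so that the result is deterministic. Whether the original work re-pairs at random or by some other rule is not stated here, so treat the rotation as an assumption.

```python
def make_nonsynonymous(pairs):
    """Pair the first word of each pair with the second word of the next pair."""
    n = len(pairs)
    return [(pairs[i][0], pairs[(i + 1) % n][1]) for i in range(n)]
```

Applied to `[("A", "B"), ("C", "D")]` this yields `[("A", "D"), ("C", "B")]`. In practice a re-paired pair could still happen to be synonymous, so a check against the synonym source would be a sensible addition.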

24

3.6 Training (2/5)

• L = 5, g = 2, G = 4, & T = 5, for lexical pattern extraction conditions.

Distribution of patterns extracted from synonymous word pairs.

25

3.6 Training (3/5)

Average similarity versus clustering threshold θ.

26

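3.6 Training (4/5)

The centroid vector of all feature vectors:

$$
\mathbf{f}_W = \frac{1}{|W|} \sum_{(P,Q) \in W} \mathbf{f}_{PQ}
$$

The average Mahalanobis distance:

$$
D(\theta) = \frac{1}{|W|} \sum_{(P,Q) \in W} \mathrm{Mahala}(\mathbf{f}_W, \mathbf{f}_{PQ})
$$

$$
\mathrm{Mahala}(\mathbf{f}_W, \mathbf{f}_{PQ}) = (\mathbf{f}_W - \mathbf{f}_{PQ})^{T}\, C^{-1}\, (\mathbf{f}_W - \mathbf{f}_{PQ})
$$

|W|: the number of word pairs in W.

$C^{-1}$: the inverse of the intercluster correlation matrix.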
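
The centroid and the average Mahalanobis distance can be computed as in the sketch below. It assumes the inverse correlation matrix is supplied precomputed as a list of rows, and it follows the slide in using the quadratic form without a square root; the function names are illustrative.

```python
def centroid(vectors):
    """Component-wise mean of the feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def mahala(center, f, c_inv):
    """Quadratic form (center - f)^T C^{-1} (center - f)."""
    diff = [a - b for a, b in zip(center, f)]
    size = len(diff)
    return sum(diff[i] * c_inv[i][j] * diff[j] for i in range(size) for j in range(size))

def average_distance(vectors, c_inv):
    """Mean distance of the feature vectors from their centroid."""
    center = centroid(vectors)
    return sum(mahala(center, f, c_inv) for f in vectors) / len(vectors)
```

Evaluating `average_distance` once per candidate clustering threshold and keeping the threshold with the smallest value is one simple way to carry out the minimisation over θ.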

27

3.6 Training (5/5)

Distribution of patterns extracted from nonsynonymous word pairs.

$$
\hat{\theta} = \arg\min_{\theta \in [0,1]} D(\theta)
$$

28

4. Experiments

1. Benchmark Data Sets

2. Semantic Similarity

3. Community Mining

29

5. Conclusion (1/3)

1. A semantic similarity measure using both page counts and snippets retrieved from a web search engine for two words.

2. Four word co-occurrence measures were computed using page counts.

3. A lexical pattern extraction algorithm to extract numerous semantic relations that exist between two words.

30

5. Conclusion (2/3)

4. A sequential pattern clustering algorithm was proposed to identify different lexical patterns that describe the same semantic relation.

5. Both page counts-based co-occurrence measures and lexical pattern clusters were used to define features for a word pair.

6. A two-class SVM was trained using those features extracted for synonymous and nonsynonymous word pairs selected from WordNet synsets.

31

5. Conclusion (3/3)

• Experimental results on three benchmark data sets showed that the proposed method outperforms various baselines as well as previously proposed web-based semantic similarity measures, achieving a high correlation with human ratings.

• The proposed method improved the F-score in a community mining task, underlining its usefulness in real-world tasks that include named entities not adequately covered by manually created resources.

32

The End~

Thanks for your attention!!
