1
A Web Search Engine-Based Approach to Measure Semantic
Similarity between Words
Presenter: Guan-Yu Chen
IEEE Trans. on Knowledge & Data Engineering, 23(7), 2011.
Danushka Bollegala, Yutaka Matsuo, & Mitsuru Ishizuka
2
Outline
1. Introduction
2. Related Work
3. Method
4. Experiments
5. Conclusion
3
1. Introduction (1/5)
• Semantic Similarity
– Web mining: community extraction, relation detection, & entity disambiguation.
– Information retrieval: to retrieve a set of documents that is semantically related to a given user query.
– Natural language processing: word sense disambiguation, textual entailment, & automatic text summarization.
4
1. Introduction (2/5)
• Web search engines
– Page count: the number of pages that contain the query words.
– Snippets: a brief window of text extracted by a search engine around the query term in a document.
5
1. Introduction (3/5)
• Page count
– In Google, the page count for “apple” AND “computer” is 288,000,000; for “banana” AND “computer” it is 3,590,000.
– By page count, “apple” is therefore much more similar to “computer” than “banana” is.
6
1. Introduction (4/5)
• Snippets
– “Jaguar” AND “cat”
– Jaguar is the largest cat → X is the largest Y
7
1. Introduction (5/5)
• Web search engine (Google) + Page count + Snippets → Semantic Similarity
8
2. Related Work (1/2)
• Normalized Google Distance (NGD)
– Cilibrasi & Vitanyi, 2007.
P and Q: the two words;
NGD(P,Q): the distance between P and Q;
H(P), H(Q): the page counts for the words P and Q;
H(P,Q): the page count for the query “P AND Q”;
N: the number of pages indexed by the search engine.
$$
NGD(P,Q) = \frac{\max\{\log H(P), \log H(Q)\} - \log H(P,Q)}{\log N - \min\{\log H(P), \log H(Q)\}}
$$
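The NGD definition on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `ngd` and its argument names are assumptions, and the page counts would have to come from a search engine, which is not queried here.

```python
import math


def ngd(h_p: int, h_q: int, h_pq: int, n: int) -> float:
    """Normalized Google Distance from page counts.

    h_p, h_q : page counts H(P) and H(Q)
    h_pq     : page count H(P,Q) for the query "P AND Q"
    n        : number of pages indexed by the search engine
    All counts are assumed to be positive.
    """
    log_p, log_q, log_pq = math.log(h_p), math.log(h_q), math.log(h_pq)
    return (max(log_p, log_q) - log_pq) / (math.log(n) - min(log_p, log_q))
```

A distance of 0 means the two words always co-occur; larger values mean they co-occur less often than their individual counts would suggest.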
9
2. Related Work (2/2)
• Co-occurrence Double-Checking (CODC)
– Chen et al., 2006.
$$
CODC(P,Q) =
\begin{cases}
0, & \text{if } f(P@Q) = 0, \\
\exp\left( \log\left[ \dfrac{f(P@Q)}{H(P)} \times \dfrac{f(Q@P)}{H(Q)} \right]^{\alpha} \right), & \text{otherwise.}
\end{cases}
$$
f(P@Q): the number of occurrences of P in the top-ranking snippets for the query Q in Google;
H(P): the page count for query P;
α: a constant in this model, experimentally set to 0.15.
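A minimal sketch of CODC follows, assuming the exponent α applies to the bracketed ratio, so that exp(log(·)) reduces to the ratio raised to the power α. The function and argument names are hypothetical; the snippet counts and page counts are supplied by the caller.

```python
def codc(f_p_at_q: int, f_q_at_p: int, h_p: int, h_q: int,
         alpha: float = 0.15) -> float:
    """Co-occurrence Double-Checking score.

    f_p_at_q : occurrences of P in the top snippets for query Q
    f_q_at_p : occurrences of Q in the top snippets for query P
    h_p, h_q : page counts H(P) and H(Q), assumed positive
    alpha    : model constant (0.15 on the slide)
    """
    if f_p_at_q == 0:
        return 0.0
    ratio = (f_p_at_q / h_p) * (f_q_at_p / h_q)
    # exp(log(ratio ** alpha)) simplifies to ratio ** alpha (assumed reading).
    return ratio ** alpha
```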
10
3. Method
1. Outline
2. Page Count-Based Co-Occurrence Measures
3. Lexical Pattern Extraction
4. Lexical Pattern Clustering
5. Measuring Semantic Similarity
6. Training
11
3.1 Outline
12
3.2 Page Count-Based Co-Occurrence Measures (1/2)
• P∩Q denotes the conjunction query “P AND Q”.
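$$
WebJaccard(P,Q) =
\begin{cases}
0, & \text{if } H(P \cap Q) \le c, \\
\dfrac{H(P \cap Q)}{H(P) + H(Q) - H(P \cap Q)}, & \text{otherwise.}
\end{cases}
$$
$$
WebOverlap(P,Q) =
\begin{cases}
0, & \text{if } H(P \cap Q) \le c, \\
\dfrac{H(P \cap Q)}{\min\{H(P), H(Q)\}}, & \text{otherwise.}
\end{cases}
$$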
13
3.2 Page Count-Based Co-Occurrence Measures (2/2)
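$$
WebDice(P,Q) =
\begin{cases}
0, & \text{if } H(P \cap Q) \le c, \\
\dfrac{2 H(P \cap Q)}{H(P) + H(Q)}, & \text{otherwise.}
\end{cases}
$$
$$
WebPMI(P,Q) =
\begin{cases}
0, & \text{if } H(P \cap Q) \le c, \\
\log_2 \left( \dfrac{H(P \cap Q) / N}{\left( H(P) / N \right) \left( H(Q) / N \right)} \right), & \text{otherwise.}
\end{cases}
$$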
N: the number of documents indexed by the search engine.
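The four page count-based measures can be sketched as plain functions of the page counts. The function names are illustrative, and the default threshold `C = 5` is an assumption: the slides name the threshold c but do not give its value.

```python
import math

C = 5  # assumed placeholder; the slides do not state the value of c


def web_jaccard(h_p, h_q, h_pq, c=C):
    """H(P∩Q) / (H(P) + H(Q) - H(P∩Q)), or 0 when H(P∩Q) <= c."""
    if h_pq <= c:
        return 0.0
    return h_pq / (h_p + h_q - h_pq)


def web_overlap(h_p, h_q, h_pq, c=C):
    """H(P∩Q) / min(H(P), H(Q)), or 0 when H(P∩Q) <= c."""
    if h_pq <= c:
        return 0.0
    return h_pq / min(h_p, h_q)


def web_dice(h_p, h_q, h_pq, c=C):
    """2 H(P∩Q) / (H(P) + H(Q)), or 0 when H(P∩Q) <= c."""
    if h_pq <= c:
        return 0.0
    return 2 * h_pq / (h_p + h_q)


def web_pmi(h_p, h_q, h_pq, n, c=C):
    """log2 of the joint probability over the product of marginals."""
    if h_pq <= c:
        return 0.0
    return math.log2((h_pq / n) / ((h_p / n) * (h_q / n)))
```

The threshold guards against unreliable scores when two words co-occur on only a handful of pages.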
14
3.3 Lexical Pattern Extraction (1/2)
Conditions:
1. A subsequence must contain exactly one occurrence of each of X and Y.
2. The maximum length of a subsequence is L words.
3. A subsequence is allowed to skip one or more words. However, we do not skip more than g words consecutively. Moreover, the total number of words skipped in a subsequence should not exceed G.
4. We expand all negation contractions in a context. For example, didn’t is expanded to did not. We do not skip the word not when generating subsequences. For example, this condition ensures that from the snippet X is not a Y, we do not produce the subsequence X is a Y.
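The conditions above can be illustrated with a brute-force sketch. It assumes the snippet is already tokenized, the two query words are already replaced by the tokens X and Y, and contractions are already expanded; the function name and the reading of "skipped words" as gaps between consecutive chosen tokens are assumptions, not the authors' implementation.

```python
from itertools import combinations


def extract_patterns(tokens, max_len=5, max_gap=2, max_total_gap=4):
    """Enumerate subsequences of `tokens` that satisfy the four conditions.

    max_len       : L, maximum subsequence length in words
    max_gap       : g, maximum number of consecutively skipped words
    max_total_gap : G, maximum total number of skipped words
    """
    patterns = set()
    for size in range(2, max_len + 1):
        for idx in combinations(range(len(tokens)), size):
            chosen = [tokens[i] for i in idx]
            # Condition 1: exactly one X and exactly one Y.
            if chosen.count("X") != 1 or chosen.count("Y") != 1:
                continue
            # Condition 3: limits on consecutive and total skipped words.
            gaps = [b - a - 1 for a, b in zip(idx, idx[1:])]
            if max(gaps) > max_gap or sum(gaps) > max_total_gap:
                continue
            # Condition 4: the word "not" must never be skipped.
            skipped = [tokens[i] for i in range(idx[0], idx[-1])
                       if i not in idx]
            if "not" in skipped:
                continue
            patterns.add(" ".join(chosen))
    return patterns
```

For example, `extract_patterns("X is not a Y".split())` keeps "X is not a Y" but never yields "X is a Y". Enumerating all index combinations is exponential in snippet length, so this is suitable only for short snippets.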
15
3.3 Lexical Pattern Extraction (2/2)
A snippet retrieved for the query “ostrich*******bird.”
Patterns extracted from the snippet:
• X, a large Y
• X a flightless Y
• X, large Y lives
16
3.4 Lexical Pattern Clustering (1/2)
Word-pair frequency: $f(P_i, Q_i, a_j)$
Total occurrence:
$$
\mu(a_j) = \sum_i f(P_i, Q_i, a_j)
$$
a_j: a pattern in pattern vector a.
(P_i, Q_i): a word pair.
17
3.4 Lexical Pattern Clustering (2/2)
18
3.5 Measuring Semantic Similarity(1/5)
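Weight of a pattern a_i in a cluster c_j, where μ(a) is the total occurrence of pattern a:
$$
w_{ij} = \frac{\mu(a_i)}{\sum_{t \in c_j} \mu(t)}
$$
The jth feature for a word pair (P, Q):
$$
f_j = \sum_{a_i \in c_j} w_{ij} \, f(P, Q, a_i)
$$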
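A small sketch of how a cluster feature could be computed from pattern frequencies. The dictionary layout `freq[(P, Q, pattern)]` and the function names are assumptions made for illustration; clustering itself is taken as given.

```python
from collections import defaultdict


def total_occurrence(freq):
    """Sum f(P_i, Q_i, a_j) over all word pairs for each pattern a_j."""
    mu = defaultdict(int)
    for (_p, _q, pattern), count in freq.items():
        mu[pattern] += count
    return mu


def cluster_feature(pair, cluster, freq, mu):
    """Weighted sum of the pair's pattern frequencies within one cluster.

    Each pattern's weight is its total occurrence divided by the total
    occurrence of all patterns in the cluster.
    """
    denom = sum(mu[t] for t in cluster)
    if denom == 0:
        return 0.0
    p, q = pair
    return sum(mu[a] / denom * freq.get((p, q, a), 0) for a in cluster)
```

With two patterns of equal total occurrence in a cluster, each gets weight 0.5, so a pair seen 3 and 4 times with them receives the feature value 3.5.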
19
3.5 Measuring Semantic Similarity(2/5)
Feature vector for a word pair (P, Q):
$$
f_{PQ} = \left( f_1, \ldots, f_N, \mathrm{WebJaccard}, \mathrm{WebOverlap}, \mathrm{WebDice}, \mathrm{WebPMI} \right)^{T}
$$
20
3.5 Measuring Semantic Similarity(3/5)
Train a two-class SVM (synonymous / nonsynonymous) on the training set:
$$
S = \{(P_k, Q_k, y_k)\}, \quad y_k \in \{1, -1\}
$$
Semantic similarity:
$$
\mathrm{sim}(P^*, Q^*) = p(y^* = 1 \mid f^*)
$$
21
3.5 Measuring Semantic Similarity(4/5)
Distance:
$$
d(f^*) = \sum_k y_k \alpha_k K(f_k, f^*) + b
$$
b: the bias term of the hyperplane.
α_k: the Lagrange multiplier.
f_k: support vector.
K(f_k, f^*): the value of the kernel function.
f^*: the instance to classify.
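The distance on this slide is the usual SVM decision value. A minimal sketch, assuming the support vectors, labels, multipliers and bias have already been obtained from training; the linear kernel is only a stand-in, since the slides do not say which kernel is used.

```python
def linear_kernel(u, v):
    """Dot product; a stand-in for whatever kernel K the model uses."""
    return sum(a * b for a, b in zip(u, v))


def decision_distance(f_star, support_vectors, labels, alphas, bias,
                      kernel=linear_kernel):
    """Sum of y_k * alpha_k * K(f_k, f_star) over support vectors, plus b."""
    return sum(
        y * a * kernel(f_k, f_star)
        for f_k, y, a in zip(support_vectors, labels, alphas)
    ) + bias
```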
22
3.5 Measuring Semantic Similarity(5/5)
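The probability:
$$
p(y = 1 \mid d(f)) = \frac{1}{1 + \exp(\lambda d(f) + \mu)}
$$
Log likelihood:
$$
L(\lambda, \mu) = \sum_{k=1}^{N} \log p(y_k \mid f_k; \lambda, \mu)
= \sum_{k=1}^{N} \left\{ t_k \log(p_k) + (1 - t_k) \log(1 - p_k) \right\}
$$
$$
t_k = y_k (y_k + 1) / 2
$$
λ, μ: the parameters of the sigmoid.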
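The sigmoid mapping and its log likelihood can be sketched as below. The names `lam` and `mu` stand for the two sigmoid parameters and are assumptions of this sketch, as is the convention that a fitted `lam` is negative so that larger distances give higher probabilities. Fitting the parameters (maximizing the log likelihood) is not shown.

```python
import math


def platt_probability(d, lam, mu):
    """Map an SVM distance d to p(y = 1 | d) with a sigmoid."""
    return 1.0 / (1.0 + math.exp(lam * d + mu))


def log_likelihood(distances, labels, lam, mu):
    """Log likelihood of labels in {1, -1} under the sigmoid model."""
    total = 0.0
    for d, y in zip(distances, labels):
        t = y * (y + 1) / 2  # target: 1 for y = 1, 0 for y = -1
        p = platt_probability(d, lam, mu)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return total
```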
23
3.6 Training (1/5)
Number of Patterns Extracted for Training Data
Synonymous: (A, B), (C, D)
Nonsynonymous: (A, D), (C, B)
24
3.6 Training (2/5)
• L = 5, g = 2, G = 4, & T = 5, for lexical pattern extraction conditions.
Distribution of patterns extracted from synonymous word pairs.
25
3.6 Training (3/5)
Average similarity versus clustering threshold θ.
26
3.6 Training (4/5)
The centroid vector of all feature vectors:
$$
f_W = \frac{1}{|W|} \sum_{(P,Q) \in W} f_{PQ}
$$
The average Mahalanobis distance:
$$
D(\theta) = \frac{1}{|W|} \sum_{(P,Q) \in W} \mathrm{Mahala}(f_W, f_{PQ})
$$
$$
\mathrm{Mahala}(f_W, f_{PQ}) = (f_W - f_{PQ})^{T} C^{-1} (f_W - f_{PQ})
$$
|W|: the number of word pairs in W.
C^{-1}: the inverse of the intercluster correlation matrix.
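A pure-Python sketch of the centroid and the average distance on this slide. It assumes the inverse matrix is supplied by the caller as a list of rows, and it follows the slide in using the quadratic form without a square root; the function names are illustrative.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mahalanobis(f_w, f_pq, c_inv):
    """Quadratic form (f_w - f_pq)^T C^{-1} (f_w - f_pq)."""
    diff = [a - b for a, b in zip(f_w, f_pq)]
    size = len(diff)
    return sum(
        diff[i] * c_inv[i][j] * diff[j]
        for i in range(size)
        for j in range(size)
    )


def average_distance(vectors, c_inv):
    """Mean distance between the centroid and each feature vector."""
    f_w = centroid(vectors)
    return sum(mahalanobis(f_w, f, c_inv) for f in vectors) / len(vectors)
```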
27
3.6 Training (5/5)
Distribution of patterns extracted from nonsynonymous word pairs.
$$
\hat{\theta} = \arg\min_{\theta \in [0,1]} D(\theta)
$$
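Choosing the clustering threshold can be sketched as a grid search over [0, 1]. The callable `distance_for_theta` is hypothetical: it stands for clustering the patterns with a given θ, building the feature vectors, and returning the average distance D(θ). The grid resolution is also an assumption.

```python
def select_threshold(distance_for_theta, steps=20):
    """Return the grid point in [0, 1] that minimizes D(theta)."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=distance_for_theta)
```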
28
4. Experiments
1. Benchmark Data Sets
2. Semantic Similarity
3. Community Mining
29
5. Conclusion (1/3)
1. A semantic similarity measure was proposed that uses both page counts and snippets retrieved from a web search engine for two words.
2. Four word co-occurrence measures were computed using page counts.
3. A lexical pattern extraction algorithm was proposed to extract numerous semantic relations that exist between two words.
30
5. Conclusion (2/3)
4. A sequential pattern clustering algorithm was proposed to identify different lexical patterns that describe the same semantic relation.
5. Both page counts-based co-occurrence measures and lexical pattern clusters were used to define features for a word pair.
6. A two-class SVM was trained using those features extracted for synonymous and nonsynonymous word pairs selected from WordNet synsets.
31
5. Conclusion (3/3)
• Experimental results on three benchmark data sets showed that the proposed method outperforms various baselines as well as previously proposed web-based semantic similarity measures, achieving a high correlation with human ratings.
• The proposed method improved the F-score in a community mining task, thereby underlining its usefulness in real-world tasks that include named entities not adequately covered by manually created resources.
32
The End~
Thanks for your attention!!