semantic, hierarchical, online clustering of web search results yisheng dong

17
Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong

Upload: maximilian-stephens

Post on 02-Jan-2016

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong

Semantic, Hierarchical, Online Clustering of Web Search Results

Yisheng Dong

Page 2: Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong

2

Overview

Introduction Previous Related Works SHOC Approach Prototype System Conclusion

Page 3: Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong

3

Motivation The Web is the biggest data source. Search engine is the most commonly used tool

for Web information retrieval. Its current status is far from the satisfaction.

Solution Clustering of Web search results would help a

lot. SHOC can generate both reasonable and

readable cluster.

Introduction

Page 4: Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong

4

Basic requirements (clustering approach for web search result)

Semantic Each cluster should correspond to a concept. Avoid confining each Web page to only on

cluster. A label can describe the topic of cluster well.

Hierarchical Eye-browsing tree structure. Taking advantage of the relationship between

them. Online

Provide fresh clustering result “just-in-time”.

Page 5: Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong

5

Previous Related Work

Scatter/Gather system traditional heuristic clustering algorithm. It has some limitations.

Based on hyperlink It needs to download and parse original Web

page. Cannot cluster immediately.

STC It is not appropriate for Oriental language. Extract many meaningless partial phrases. Synonymy and polysemy are not considered.

Page 6: Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong

6

SOHC step

1. Data acquisition2. Data cleaning3. Feature extraction4. Identifying base clusters5. Combining base clusters

Page 7: Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong

7

Data acquision

The data acquisition task here is actually meta-search.

Use 2-level parallelization mechanism

1. Call several engines simultaneously.2. Fetch all of its search result

simultaneously.

Page 8: Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong

8

Data cleaning

Sentence boundaries are identified via the following. punctuation marks (e.g. ‘.’, ‘,’, ‘;’, ‘?’, etc.) HTML tags (e.g.<p>, <br>, <li>, <td> etc.)

Non-word tokens are stripped.(e.g. punctuation marks and HTML tags)

Redundant spaces are compressed. Stemming algorithm may be applied.

(for English text)

Page 9: Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong

9

Feature extraction (Overview)

Words Most clustering algorithm treat a document as

“bag of words”. Ignoring word order and proximity.

Key phrases Advantage

Improve the quality of the clusters. Useful in constructing labels.

Data structures (key phrase discovery) Suffix tree

Related to the alphabet size of language. Suffix array

Scalable over alphabet size.

Page 10: Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong

10

Feature extraction(key phrase discovery)

Completeness Left-completeness Right-completeness

Stability (Mutual Information) S =“c1c2∙∙∙cp”, SL =“c1∙∙∙cp-1”, SR =“c2∙∙∙cp”

Significance se(S) = freq(S) * g(|S|) g(x) 0 (x=1)

log2x (2≤x≤8) 3 (x>8)

freq(S) - )freq(S )freq(S

freq(S))(

RL SMI

Page 11: Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong

11

Feature extraction (Suffix array)

Suffix array An array of all N suffixes, sorted alphabetically

LCP (Longest Common Prefix) Use to accelerate searching in text

<Suffix array and lcp of the “to_be_or_not_to_be”>

Page 12: Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong

12

Feature extraction (Discover rcs)

void discover_rcs(){ typedef structure{ int ID; int frequency; } RCSTYPE; RSCTYPE rcs_stack[N]; // N is the document's

length Initialize rcs_stack; int sp = -1; // the stack pointer int i = 1; while(i < N+1) { if(sp < 0){ // the stack is empty if(lcp[i] > 0){ sp++; rcs_stack[sp].ID = i; rcs_stack[sp].frequency = 2; } i++; } else{ . . . } }}

int r = rcs_stack[sp].ID;if(lcp[r] < lcp[i]) { sp++; rcs_stack[sp].ID = i; rcs_stack[sp].frequency = 2; i++;}else if(lcp[r] == lcp[i]) { rcs_stack[sp].frequecny++; i++;}else{ Output rcs_stack[sp]; // ID & frequency int f = rcs_stack[sp].frequency; sp--; if(sp >= 0){ rcs_stack[sp].frequency = rcs_stack[sp].frequency + f -1; }}

Page 13: Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong

13

Feature extraction (Intersect lcs_rcs)

void intersect_lcs_rcs(sorted lcs array, sorted rcs array)

{ int i =0, j=0; while(i<L && j < R) { string str_l = lcs[i].ID denoted LCS; string str_r = rcs[j].ID denoted RCS; if(str_l == str_r) { Output lcs[i]; i++; j++; } if(str_l < str_r){ i++; } if(str_l > str_r){ j++; } }}

rcs array

ID frequency RCS

1 2 _be

2 5 _

6 2 be

8 2 e

11 2 o_be

12 4 o

16 3 to_be

17 2 t

cs array

ID frequency CS

2 5 _

12 4 o

16 3 t

17 2 to_be

Page 14: Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong

14

Identifying base clustersTerms

(key phrases) Documents

The association between terms and documents

Page 15: Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong

15

Combining base clusters

Combine base cluster X and Y

if ( |X ∩ Y| / |X ∪ Y| > t1 ) { X and Y are merged into one cluster;}else { if ( |X| > |Y| ) { if ( |X ∩ Y| / |Y| > t2 ) { let Y become X’s child; } } else { if ( |X ∩ Y| / |X| > t2 ) { let X become Y’s child; } }}

Merging Label

if ( label x is a substring of label y ) { label_xy = label_y;}else if ( label_y is a substring of label_x ){ label_xy = label_x;}else{ label_xy = “ label_x + label_y ”;}

Page 16: Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong

16

Prototype system

Crate a prototype system named WICE (Web Information Clustering Engine)

Doing well for dealing with the special problems related to Chinese

Output for query “object oriented” object oriented programming object oriented analysis, etc.

Page 17: Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong

17

Conclusion

Main contribution The benefit of using key phrase. Method based on suffix array for key phrase. The concept of orthogonal clustering. The WICE system is designed and implemented.

Further works Detailed analysis. Further experimenting. Interpretation of experiment results. Comparing with other clustering algorithms.