IEEE IRI 16 - Clustering Web Pages Based on Structure and Style Similarity
TRANSCRIPT
July 28-30, 2016; IEEE IRI, Pittsburgh, PA, USA
Thamme Gowda - @thammegowda
Dr. Chris Mattmann - @chrismattmann
1
CLUSTERING WEB PAGES BASED ON STRUCTURE AND STYLE SIMILARITY
Information Retrieval and Data Science
2
OUTLINE
• Problem Statement
• Method Overview
• Steps
  • Tree Edit Distance
  • Style Similarity
  • Shared Near Neighbor Clustering
• Evaluation
• Challenges
3
PROBLEM STATEMENT
• Scraping data from online marketplaces
• Start at the homepage → categories → listings → the actual content (detail pages)
SAMPLE WEB PAGES
Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov
4
[Figure: eight sample web pages, numbered 1 to 8; two of them are labeled USELESS]
6
SAMPLE WEB PAGES
[Figure: eight sample web pages, numbered 1 to 8; three are labeled "CRAWLER: YES, ANALYSIS: NO" and two are labeled USELESS]
7
SAMPLE WEB PAGES
[Figure: eight sample web pages, numbered 1 to 8; three are labeled "CRAWLER: YES, ANALYSIS: NO" and three are labeled USEFUL]
8
METHOD OVERVIEW
CLUSTERING
• “task of grouping a set of objects in such a way that objects in the same group are more similar (in some sense or the other) to each other than to those in the other groups”
– Wikipedia
• There are many ways to achieve this.
9
CLUSTERING
10
HOW DO WE CLUSTER
• Based on similarity between pages
• Semantic similarity
  • Meaning of the web pages (keywords, topics, …)
• Syntactic similarity
  • Web page structure, CSS styles
• This presentation focuses on the syntactic aspect
  • HTML ✓
  • CSS ✓
  • JavaScript ×
11
SIMILARITY CHECK
12
METHOD : INPUT
WEB PAGES FROM CRAWLER LIKE APACHE NUTCH
13
METHOD : STEP #1
WEB PAGES FROM CRAWLER LIKE APACHE NUTCH
STRUCTURAL SIMILARITY
STRUCTURAL SIMILARITY
14
STRUCTURAL SIMILARITY
• Web pages are built with HTML
• HTML document → DOM tree
  • A labeled ordered tree
• Structural similarity using tree edit distance (TED)
[Figure: example DOM tree, drawn in three levels: HTML / HEAD, BODY / TITLE, DIV, P]
15
MINIMUM TREE EDIT DISTANCE
• An edit distance measure like the one used for strings, but defined on hierarchical data instead of sequences
• The number of editing operations required to transform one tree into another
• Three basic editing operations: INSERT, REMOVE and REPLACE
• A useful measure to quantify how similar (or dissimilar) two trees are
● Edit operations
● Normalized distance
* Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.
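To make the three editing operations concrete, here is a minimal Python sketch of a unit-cost edit distance on labeled ordered trees. It is not the Zhang-Shasha algorithm cited on the slide: it uses the plain recursive forest decomposition with memoization, which yields the same distance but is only practical for small trees. The (label, children) tuple representation, the function names, and the normalization by the total node count are assumptions made for illustration; the slides do not state how the distance was normalized.

```python
from functools import lru_cache

# Assumption: a tree is a (label, children) pair, children being a tuple of trees.


def size(tree):
    """Number of nodes in a tree."""
    _, children = tree
    return 1 + sum(size(child) for child in children)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f1 and not f2:
        return 0
    if not f1:
        return sum(size(t) for t in f2)  # INSERT every remaining node
    if not f2:
        return sum(size(t) for t in f1)  # REMOVE every remaining node

    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    rest1, rest2 = f1[:-1], f2[:-1]

    # REMOVE the rightmost root of f1; its children move up into the forest.
    remove = forest_distance(rest1 + kids1, f2) + 1
    # INSERT the rightmost root of f2.
    insert = forest_distance(f1, rest2 + kids2) + 1
    # Match the two rightmost roots; REPLACE costs 1 only if the labels differ.
    replace = (forest_distance(kids1, kids2)
               + forest_distance(rest1, rest2)
               + (0 if label1 == label2 else 1))
    return min(remove, insert, replace)


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


def normalized_distance(t1, t2):
    """Assumed normalization: divide by the total node count, giving a value in [0, 1]."""
    return tree_edit_distance(t1, t2) / (size(t1) + size(t2))


# Two small DOM-like trees that differ by a single P element.
page_a = ("html", (("head", (("title", ()),)), ("body", (("div", ()), ("p", ())))))
page_b = ("html", (("head", (("title", ()),)), ("body", (("div", ()),))))
print(tree_edit_distance(page_a, page_b))  # 1
```

A structural similarity score can then be taken as 1 minus the normalized distance, which is again an assumption rather than something the slides spell out.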
16
MINIMUM TREE EDIT DISTANCE*
[Figure: four numbered panels, 1 to 4]
17
METHOD : STEP #2
WEB PAGES FROM CRAWLER LIKE APACHE NUTCH
STYLE SIMILARITY
STYLE SIMILARITY
• Similar web pages have similar CSS styles
• XPath: "//*[@class]/@class"
• Simple measure:
  • Jaccard similarity on CSS class names
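As a rough illustration of the style measure, the sketch below collects the class names of two pages and computes their Jaccard similarity. It uses Python's standard html.parser instead of an XPath engine, so it only approximates what the XPath expression on the slide selects. The class and function names, and the choice to score two pages without any classes as 1.0, are assumptions.

```python
from html.parser import HTMLParser


class ClassNameCollector(HTMLParser):
    """Collects every CSS class name used in an HTML document."""

    def __init__(self):
        super().__init__()
        self.class_names = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.class_names.update(value.split())


def css_classes(html):
    collector = ClassNameCollector()
    collector.feed(html)
    return collector.class_names


def style_similarity(html_a, html_b):
    """Jaccard similarity of the two pages' CSS class name sets."""
    a, b = css_classes(html_a), css_classes(html_b)
    if not a and not b:
        return 1.0  # assumption: pages without classes count as identical in style
    return len(a & b) / len(a | b)


page_a = '<div class="ad title"><p class="price">1</p></div>'
page_b = '<div class="ad title"><span class="seller">x</span></div>'
print(style_similarity(page_a, page_b))  # 0.5 (2 shared names out of 4)
```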
18
STYLE SIMILARITY
19
METHOD : STEP #3
AGGREGATED = k × STRUCTURAL + (1 - k) × STYLE
STRUCTURAL
STYLE
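The aggregation step is a weighted mean of the two scores. A small sketch, assuming both inputs are similarities in [0, 1] and that k is the weight given to structure; the function name and the range check are illustrative:

```python
def aggregate_similarity(structural, style, k=0.5):
    """Weighted mean of structural and style similarity; k weights the structure."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be between 0 and 1")
    return k * structural + (1 - k) * style


# Equal weights: a pair with structural similarity 0.8 and style similarity 0.6.
combined = aggregate_similarity(0.8, 0.6, k=0.5)
```

With k = 1 only the structure counts, and with k = 0 only the CSS styles count.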
20
METHOD : STEP #4
SIMILARITY MATRIX → CLUSTERING (SHARED NEAR NEIGHBOR) → CLUSTERS
“If two data points share a threshold number of neighbors, then they must belong to the same cluster” *
21
SHARED NEAR NEIGHBOR (SNN) ALGORITHM
* Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, 100(11), 1025-1034.
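The shared-near-neighbor idea can be sketched in a few lines of Python. This is one possible reading, not the talk's implementation: a page's neighbors are assumed to be all pages at least sim_threshold similar to it, two neighboring pages are linked when the overlap of their neighbor sets (measured here as intersection over union) reaches share_threshold, and clusters are the connected groups of linked pages. The function name, both parameter names, and the overlap measure are assumptions.

```python
def snn_clusters(similarity, sim_threshold=0.9, share_threshold=0.9):
    """Cluster items from a symmetric similarity matrix (list of lists)."""
    n = len(similarity)
    # Neighbors of i: every item (i included) at least sim_threshold similar to i.
    neighbors = [
        {j for j in range(n) if similarity[i][j] >= sim_threshold}
        for i in range(n)
    ]

    parent = list(range(n))  # union-find forest over the items

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if j not in neighbors[i]:
                continue  # only neighboring pages can be linked
            shared = len(neighbors[i] & neighbors[j])
            total = len(neighbors[i] | neighbors[j])
            if shared / total >= share_threshold:
                parent[find(i)] = find(j)  # link: merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Pages 0-2 resemble each other, as do pages 3-4; the two groups do not mix.
sim = [[1.0 if i == j else (0.95 if (i < 3) == (j < 3) else 0.2)
        for j in range(5)] for i in range(5)]
print(snn_clusters(sim))  # [[0, 1, 2], [3, 4]]
```

Note that the number of clusters is never passed in; it falls out of the two thresholds.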
Web Pages
• Guessing k in k-means is hard
  • A more meaningful request: “Make clusters of 90% similarity” instead of “Make 10 clusters”
• Mean / average of the documents in a cluster?
  • Average of DOM trees?
  • Average of CSS styles?
• Circular / spherical / globular cluster shapes?
22
WHAT’S GOOD ABOUT SNN ALGORITHM
23
METHOD : LAST STEP*
CLUSTERS → LABELING → CATEGORIES / USABLE CLUSTERS
* HUMAN INTERVENTION - THIS STEP REQUIRES DOMAIN EXPERTISE
25
SOME APPLICATIONS?
• Separate the interesting web pages?
• Drop uninteresting/noisy web pages
• Categorical treatment of clusters
  • Extract structured data using XPath
  • Automated extraction using alignment
26
WORKFLOW: PART #1
27
WORKFLOW: PART #2
DATASET: 1310 web pages from http://armslist.com
• 987 ad detail pages
• 311 ad listing pages
• 12 others: index, contact, FAQs, etc.
PARAMETERS:
• 50% weight for CSS style, 50% weight for HTML structure
• Series of experiments at various thresholds: 85%, 90%, 95%
EVALUATION
28
EVALUATION
29
PARAMETERS: SIMILARITY = 90%, SHARED NEIGHBORS = 90%
EVALUATION
30
PARAMETERS: SIMILARITY = 95%, SHARED NEIGHBORS = 95%
EVALUATION
31
PARAMETERS: SIMILARITY = 85%, SHARED NEIGHBORS = 85%
• TED is very expensive
• Zhang-Shasha’s TED
  • O(|T1| × |T2| × min{depth(T1), leaves(T1)} × min{depth(T2), leaves(T2)})
  • That’s O(n^4) in the worst case
• Approx. 1000 HTML tags
  • That’s on the order of 10^12 operations for one comparison
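The arithmetic behind these figures can be checked in a few lines of Python. The sketch takes the O(n^4) bound literally as a step count and assumes, for illustration only, that every pair among the 1310 evaluation pages is compared; the slides do not say how many comparisons were actually run. The function names are hypothetical.

```python
from math import comb


def worst_case_ted_steps(n_tags):
    """Worst-case steps for one comparison, reading the O(n^4) bound literally."""
    return n_tags ** 4


def pairwise_comparisons(n_pages):
    """Number of unordered page pairs in a full similarity matrix."""
    return comb(n_pages, 2)


steps_per_pair = worst_case_ted_steps(1000)  # 10**12
pairs = pairwise_comparisons(1310)           # 857395
```

Even a modest crawl therefore implies a large number of expensive comparisons if every pair is evaluated with the worst-case bound.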
CHALLENGES
32
[Figure: plot of time complexity (y-axis) against number of HTML tags (x-axis)]
ACKNOWLEDGMENTS
DARPA MEMEX
33
* Photo Credits : http://memex.jpl.nasa.gov/
• Source code: https://github.com/USCDataScience/autoextractor
• Tutorial: https://git.io/vwS69
• Follow up
  • Thamme Gowda - @thammegowda
  • Chris Mattmann - @chrismattmann
34
THANK YOU