IEEE IRI 16 - Clustering Web Pages Based on Structure and Style Similarity

July 28-30, 2016; IEEE IRI, Pittsburgh, PA, USA

Thamme Gowda (@thammegowda)

Dr. Chris Mattmann (@chrismattmann)

CLUSTERING WEB PAGES BASED ON STRUCTURE AND STYLE SIMILARITY

Information Retrieval and Data Science

https://twitter.com/thammegowda


https://twitter.com/chrismattmann


OUTLINE

• Problem Statement
• Method Overview
• Steps
  • Tree Edit Distance
  • Style Similarity
  • Shared Near Neighbor Clustering
• Evaluation
• Challenges



PROBLEM STATEMENT


• Scraping data from online marketplaces

• Start with the homepage → categories → listings → the actual content (detail pages)

SAMPLE WEB PAGES
Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov

[Figure: eight sample web pages, numbered 1 to 8. Overlay labels across the slide build: USELESS (two pages), CRAWLER: YES / ANALYSIS: NO, USEFUL (three pages).]

METHOD OVERVIEW


CLUSTERING

• “task of grouping a set of objects in such a way that objects in the same group are more similar (in some sense or the other) to each other than to those in the other groups”

– Wikipedia

• There are many ways to achieve this.


HOW DO WE CLUSTER


• Based on similarity between pages
• Semantic similarity
  • meaning of the web pages (keywords, topics, …)
• Syntactic similarity
  • Web page structure, CSS styles
• This presentation focuses on the syntactic aspect
  • HTML ✓
  • CSS ✓
  • JavaScript ×


SIMILARITY CHECK


METHOD : INPUT


WEB PAGES FROM A CRAWLER SUCH AS APACHE NUTCH

METHOD : STEP #1



STRUCTURAL SIMILARITY





• Web pages are built with HTML
• HTML Doc → DOM tree
  • a labeled ordered tree
• Structural similarity using tree edit distance (TED)

HTML
├── HEAD
│   └── TITLE
└── BODY
    ├── DIV
    └── P
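The HTML-to-DOM-tree step can be sketched with Python's standard `html.parser` module. This is a minimal illustration, not the authors' implementation: the names `DomTreeBuilder`, `parse_dom` and `to_tuple` are hypothetical, text and attributes are ignored, and only element tags are kept as node labels.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are always leaves.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomTreeBuilder(HTMLParser):
    """Collects element tags into a labeled ordered tree (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []  # currently open elements, outermost first

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        if self.stack:
            self.stack[-1]["children"].append(node)
        elif self.root is None:
            self.root = node
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break


def parse_dom(html):
    builder = DomTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def to_tuple(node):
    """Convert the tree into nested (label, children) tuples."""
    return (node["tag"], tuple(to_tuple(child) for child in node["children"]))


page = "<html><head><title>t</title></head><body><div></div><p>x</p></body></html>"
dom = to_tuple(parse_dom(page))
```

Children keep their document order, which matters because tree edit distance is defined on ordered trees.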


MINIMUM TREE EDIT DISTANCE


• Edit distance measure similar to strings, but on hierarchical data instead of sequences

• Number of editing operations required to transform one tree into another.

• Three basic editing operations: INSERT, REMOVE and REPLACE.

• A useful measure to quantify how similar (or dissimilar) two trees are.

● Edit operations
● Normalized distance

* Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.
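To make the three edit operations concrete, here is a small sketch of ordered tree edit distance with unit costs. It is a plain memoized recursion over forests, not the Zhang-Shasha algorithm cited above, and it is only practical for small trees. The helper names and the normalization (dividing by the total node count of both trees) are assumptions for illustration; the slides do not give the normalization formula.

```python
from functools import lru_cache


def node(label, *children):
    """Build an immutable (label, children) tree so results can be cached."""
    return (label, tuple(children))


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not f:
        _, kids = g[-1]
        return forest_distance(f, g[:-1] + kids) + 1      # INSERT rightmost root of g
    if not g:
        _, kids = f[-1]
        return forest_distance(f[:-1] + kids, g) + 1      # REMOVE rightmost root of f
    (f_label, f_kids), (g_label, g_kids) = f[-1], g[-1]
    remove = forest_distance(f[:-1] + f_kids, g) + 1
    insert = forest_distance(f, g[:-1] + g_kids) + 1
    replace = (forest_distance(f_kids, g_kids)             # match the two subtrees
               + forest_distance(f[:-1], g[:-1])           # match everything to their left
               + (0 if f_label == g_label else 1))         # REPLACE label if it differs
    return min(remove, insert, replace)


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


def size(tree):
    return 1 + sum(size(child) for child in tree[1])


def structural_similarity(t1, t2):
    # Assumption: normalize by the total node count, an upper bound on the distance.
    return 1.0 - tree_edit_distance(t1, t2) / (size(t1) + size(t2))


a = node("html", node("head", node("title")), node("body", node("div"), node("p")))
b = node("html", node("head", node("title")), node("body", node("div")))
c = node("html", node("head", node("title")), node("body", node("div"), node("span")))
```

Here `b` differs from `a` by one removed node and `c` by one replaced label, so both are a single edit away from `a`.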


MINIMUM TREE EDIT DISTANCE*


[Figure: four numbered panels (1 to 4) illustrating tree edit distance]

METHOD : STEP #2



STYLE SIMILARITY

• Similar web pages have similar CSS styles
• XPath: "//*[@class]/@class"
• Simple measure:
  • Jaccard similarity on CSS class names
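The style measure can be sketched as follows. The slide selects class attributes with an XPath expression; this sketch instead collects the same attribute values with Python's standard `html.parser`, which is a substitution made here to avoid third-party dependencies. `ClassCollector`, `css_classes` and `jaccard` are hypothetical names, and treating two pages without any classes as fully similar is an assumption.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Gathers every CSS class name used in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def css_classes(html):
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0  # assumption: two pages without classes count as identical
    return len(a & b) / len(a | b)


page_a = '<div class="ad title"><p class="price">x</p></div>'
page_b = '<div class="ad"><span class="price big">y</span></div>'
style_similarity = jaccard(css_classes(page_a), css_classes(page_b))
```

The two sample pages share `ad` and `price` out of four distinct class names, so their style similarity is 0.5.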



METHOD : STEP #3


AGGREGATED = k × STRUCTURAL + (1 − k) × STYLE
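The weighted combination can be sketched on two pairwise similarity matrices. The function name `aggregate` and the list-of-lists matrix layout are assumptions for illustration; the default k = 0.5 mirrors the 50/50 weighting reported in the evaluation.

```python
def aggregate(structural, style, k=0.5):
    """Combine two similarity matrices as k * structural + (1 - k) * style."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie between 0 and 1")
    return [
        [k * s + (1 - k) * t for s, t in zip(row_s, row_t)]
        for row_s, row_t in zip(structural, style)
    ]


# Hypothetical similarities for two pages.
structural = [[1.0, 0.8], [0.8, 1.0]]
style = [[1.0, 0.4], [0.4, 1.0]]
combined = aggregate(structural, style)
```

With equal weights, the off-diagonal entries 0.8 and 0.4 average to roughly 0.6.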


METHOD : STEP #4


SIMILARITY MATRIX → CLUSTERING (SHARED NEAR NEIGHBOR) → CLUSTERS

“If two data points share a threshold number of neighbors, then they must belong to the same cluster” *


SHARED NEAR NEIGHBOR (SNN) ALGORITHM

* Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, C-22(11), 1025-1034.
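A minimal sketch of the shared near neighbor idea on a similarity matrix follows. It is not the authors' implementation: the function name `snn_clusters`, the use of a similarity threshold to define neighborhoods, and measuring "shared" as the Jaccard overlap of two neighbor sets are all assumptions. The default thresholds mirror the 90% values used in the evaluation.

```python
def snn_clusters(sim, sim_threshold=0.9, shared_threshold=0.9):
    """Group items that are neighbors and share enough of their neighbors."""
    n = len(sim)
    # Neighbors of i: every item at least sim_threshold similar (includes i itself).
    neighbors = [
        {j for j in range(n) if sim[i][j] >= sim_threshold} for i in range(n)
    ]

    parent = list(range(n))  # union-find forest over the items

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if j not in neighbors[i]:
                continue
            shared = len(neighbors[i] & neighbors[j]) / len(neighbors[i] | neighbors[j])
            if shared >= shared_threshold:
                parent[find(i)] = find(j)  # same cluster

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical matrix: pages 0-2 look alike, pages 3-4 look alike.
sim = [
    [1.0 if i == j else (0.95 if (i < 3) == (j < 3) else 0.2) for j in range(5)]
    for i in range(5)
]
clusters = snn_clusters(sim)
```

The number of clusters is never specified; it falls out of the two thresholds.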

WHAT’S GOOD ABOUT THE SNN ALGORITHM

• Guessing k in k-means is hard
  • Meaningful question: “Make clusters of 90% similarity” instead of “Make 10 clusters”
• Mean / average of documents in a cluster?
  • Average of DOM trees?
  • Average of CSS styles?
• Circular / spherical / globular shapes?

METHOD : LAST STEP*


CLUSTERS → LABELING → CATEGORIES / USABLE CLUSTERS


* HUMAN INTERVENTION - THIS STEP REQUIRES DOMAIN EXPERTISE


SOME APPLICATIONS?


• Separate the interesting web pages
• Drop uninteresting/noisy web pages
• Categorical treatment of clusters
  • Extract structured data using XPath
  • Automated extraction using alignment


WORKFLOW: PART #1


WORKFLOW: PART #2

DATASET: 1310 web pages from http://armslist.com

• 987 ad detail pages
• 311 ad listing pages
• 12 others: index, contact, FAQs, etc.

PARAMETERS:

• 50% weight for CSS style, 50% weight for HTML structure
• Series of experiments at various thresholds: 85%, 90%, 95%


EVALUATION


http://armslist.com/



PARAMETERS: SIMILARITY = 90%, SHARED NEIGHBORS = 90%




• TED is very expensive
• Zhang-Shasha’s TED
  • O(|T1| × |T2| × min{depth(T1), leaves(T1)} × min{depth(T2), leaves(T2)})
  • That’s O(n^4)
  • Approx. 1000 HTML tags
  • That’s O(10^12)
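The arithmetic behind these figures can be checked with a short sketch. The function names are hypothetical, the depth and leaf counts in the second example are made up for illustration, and the pair count assumes every page in the 1310-page dataset is compared with every other page to fill the similarity matrix.

```python
def zhang_shasha_bound(size1, size2, depth1, leaves1, depth2, leaves2):
    """Operation-count bound |T1| * |T2| * min(depth, leaves) for each tree."""
    return size1 * size2 * min(depth1, leaves1) * min(depth2, leaves2)


def worst_case_operations(n):
    """Each min{depth, leaves} term is at most n, giving n^4."""
    return n ** 4


tags_per_page = 1000
worst_case = worst_case_operations(tags_per_page)  # 10^12 for one pair of pages

# Hypothetical shallow, wide pages: depth 10, 500 leaves each.
shallow_case = zhang_shasha_bound(1000, 1000, 10, 500, 10, 500)

# Assumption: a full similarity matrix needs one comparison per unordered pair.
pages = 1310
pairs = pages * (pages - 1) // 2
```

Real pages tend to be much wider than they are deep, so the min{depth, leaves} terms usually sit far below n, but the cost is still multiplied by the number of page pairs.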


CHALLENGES


[Figure: time complexity (y-axis) versus number of HTML tags (x-axis)]


ACKNOWLEDGMENTS

DARPA MEMEX

* Photo Credits : http://memex.jpl.nasa.gov/

• Source code: https://github.com/USCDataScience/autoextractor
• Tutorial: https://git.io/vwS69
• Follow up
  • Thamme Gowda - @thammegowda
  • Chris Mattmann - @chrismattmann


THANK YOU
