Finding Replicated Web Collections
1
Finding Replicated Web Collections
Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina
2
Replication is common!
3
Statistics (Preview)
Collection                           # of copies   # of pages
TUCOWS                                       360         1052
LDP (Linux Documentation Project)            143         1359
Apache manual                                125          115
Java manual                                   59          552
Mars Pathfinder                               49          498
More than 48% of pages have copies!
4
Reasons for replication
- Actual replication: simple copying or mirroring
- Apparent replication: aliases (multiple site names), symbolic links, multiple mount points
5
Challenges
- Subgraph isomorphism: NP-complete
- Hundreds of millions of pages
- Slight differences between copies
6
Outline
- Definitions: web graph, collection, identical collection, similar collection
- Algorithm
- Applications
- Results
7
Web graph
- Node: web page
- Edge: link between pages
- Node label: page content (excluding links)
8
Identical web collection
- Collection: induced subgraph
- Identical collection: one-to-one (equi-size)
9
Collection similarity
- Coincides with intuitively similar collections
- Computable similarity measure
10
Collection similarity
Page content
11
Page content similarity
- Fingerprint-based approach (chunking)
  - Shingle [Broder et al., 1997]
  - Sentence [Brin et al., 1995]
  - Word [Shivakumar et al., 1995]
- Many interesting issues: threshold value, iceberg query
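The chunking idea above can be sketched in a few lines. This is a minimal illustration in the spirit of shingle-based fingerprinting [Broder et al., 1997]; the chunk size, hash function, and Jaccard measure are illustrative choices, not the exact parameters used in the paper.

```python
import hashlib

def shingles(text, k=4):
    """Chunk a page's text into overlapping k-word shingles."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def fingerprints(text, k=4):
    """Hash each shingle down to a short fingerprint."""
    return {hashlib.md5(s.encode()).hexdigest()[:8] for s in shingles(text, k)}

def page_similarity(a, b, k=4):
    """Fraction of shared fingerprints (Jaccard resemblance)."""
    fa, fb = fingerprints(a, k), fingerprints(b, k)
    return len(fa & fb) / len(fa | fb)
```

Two identical pages score 1.0, disjoint pages 0.0; a threshold on this score is one of the "interesting issues" the slide mentions.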
12
Collection similarity
Link structure
13
Collection similarity
Size
14
Collection similarity
Size vs. Cardinality
15
Growth strategy
16
Essential property
[Diagram: identical collections Ra (pages a) and Rb (pages b), with links from pages in Ra landing on pages in Rb]

|Ra| = Ls = Ld = |Rb|

Ls: # of pages linked from
Ld: # of pages linked to
17
Essential property
[Diagram: similar collections Ra and Rb where some pages or links lack counterparts, so equality breaks]

|Ra| ≥ Ls = Ld ≤ |Rb|

Ls: # of pages linked from
Ld: # of pages linked to
18
Algorithm
- Based on the property we identified
- Input: set of pages collected from the web
- Output: set of similar collections
- Complexity: O(n log n)
19
Algorithm
Step 1: similar page identification (iceberg query)
- 25 million pages
- Fingerprint computation: 44 hours
- Replicated page computation: 10 hours

[Diagram: web pages feed into Step 1, which emits a (Rid, Pid) table assigning each replicated page Pid to a replica group Rid]
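A minimal sketch of the Step 1 grouping, using the iceberg idea: keep only fingerprints shared by more than one page. The input and output shapes here are assumptions for illustration; the real system computes fingerprints over 25 million pages out of core.

```python
from collections import defaultdict

def similar_page_groups(page_fingerprints):
    """Group page ids that share a fingerprint; keep only groups with
    more than one member -- the 'iceberg' tip of pages that have copies.

    page_fingerprints: dict Pid -> fingerprint (hypothetical input shape).
    Returns (Rid, [Pid, ...]) pairs like the Rid/Pid table on the slide.
    """
    by_fp = defaultdict(list)
    for pid, fp in page_fingerprints.items():
        by_fp[fp].append(pid)
    groups = [pids for pids in by_fp.values() if len(pids) > 1]
    return [(rid, sorted(pids)) for rid, pids in enumerate(groups, start=1)]
```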
20
Algorithm
Step 2: link structure check

[Diagram: the link table joined with two copies (R1 and R2, a copy of R1) of Step 1's (Rid, Pid) table]

Group by (R1.Rid, R2.Rid):
|Ra| = |R1|, Ls = Count(R1.Rid), Ld = Count(R2.Rid), |Rb| = |R2|
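The group-by above can be sketched over small in-memory tables. The real system runs this as a relational query; the function name, input shapes, and distinct-page counting below are assumptions made for a self-contained illustration.

```python
from collections import defaultdict

def link_structure_stats(rid_of, links, group_size):
    """Step 2 sketch: for each pair of replica groups (Ra, Rb), count
    Ls = distinct source pages in Ra linking into Rb, and
    Ld = distinct destination pages in Rb linked from Ra,
    mirroring the group-by on (R1.Rid, R2.Rid) over the link table.

    rid_of: dict Pid -> Rid (Step 1 output);
    links: iterable of (source Pid, destination Pid);
    group_size: dict Rid -> number of pages in the group.
    """
    srcs, dsts = defaultdict(set), defaultdict(set)
    for s, d in links:
        # only links whose endpoints both belong to some replica group count
        if s in rid_of and d in rid_of:
            pair = (rid_of[s], rid_of[d])
            srcs[pair].add(s)
            dsts[pair].add(d)
    return {pair: (group_size[pair[0]], len(srcs[pair]),
                   len(dsts[pair]), group_size[pair[1]])
            for pair in srcs}
```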
21
Algorithm
Step 3:
    S = {}
    for every (|Ra|, Ls, Ld, |Rb|) from Step 2:
        if |Ra| = Ls = Ld = |Rb|:
            S = S ∪ {<Ra, Rb>}
    Union-Find(S)

Steps 2-3: 10 hours
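A runnable version of the Step 3 pseudocode, assuming the Step 2 output arrives as a dict keyed by collection pairs (a hypothetical shape, not the paper's actual representation); the filtering and union-find merging follow the slide.

```python
def find(parent, x):
    """Path-compressing find for union-find over collection ids."""
    while parent.setdefault(x, x) != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def merge_equal_collections(stats):
    """Step 3 sketch: keep only pairs with |Ra| = Ls = Ld = |Rb| and
    merge them via union-find into maximal similar collections.

    stats: dict (Ra, Rb) -> (|Ra|, Ls, Ld, |Rb|), as Step 2 would produce.
    Returns the merged collection groups as sorted lists of ids.
    """
    parent = {}
    for (ra, rb), (na, ls, ld, nb) in stats.items():
        if na == ls == ld == nb:
            parent[find(parent, ra)] = find(parent, rb)
    groups = {}
    for x in list(parent):
        groups.setdefault(find(parent, x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())
```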
22
Experiment
- 25 widely replicated collections (cardinality: 5-10 copies, size: 50-1000 pages)
- Total number of pages: 35,000 + 15,000 random pages
- Result: 180 collections (149 "good" collections, 31 "problem" collections)
23
Results
24
Applications
- Web crawling & archiving
  - Save network bandwidth
  - Save disk storage
25
Application (web crawling)
- Before experiment: 48% replicated pages
- With our technique: 13%

[Diagram: initial crawl → offline copy detection → replication info → second crawl → crawled pages]
26
Applications (web search)
27
Related work
- Collection similarity: AltaVista [Bharat et al., 1999]
- Page similarity:
  - COPS [Brin et al., 1995]: sentence
  - SCAM [Shivakumar et al., 1995]: word
  - AltaVista [Broder et al., 1997]: shingle
28
Summary
- Computable similarity measure
- Efficient replication-detection algorithm
- Application to real-world problems