Finding Replicated Web Collections

Page 1: Finding Replicated Web Collections

Finding Replicated Web Collections

Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina

Page 2: Finding Replicated Web Collections

Replication is common!

Page 3: Finding Replicated Web Collections

Statistics (Preview)

Collection                          # of copies   # of pages
TUCOWS                              360           1052
LDP (Linux Documentation Project)   143           1359
Apache manual                       125           115
Java manual                         59            552
Mars Pathfinder                     49            498

More than 48% of pages have copies!

Page 4: Finding Replicated Web Collections

Reasons for replication

Actual replication
  Simple copying or mirroring

Apparent replication
  Aliases (multiple site names)
  Symbolic links
  Multiple mount points

Page 5: Finding Replicated Web Collections

Challenges

Subgraph isomorphism: NP-complete
Hundreds of millions of pages
Slight differences between copies

Page 6: Finding Replicated Web Collections

Outline

Definitions
  Web graph, collection
  Identical collection
  Similar collection
Algorithm
Applications
Results

Page 7: Finding Replicated Web Collections

Web graph

Node: web page
Edge: link between pages
Node label: page content (excluding links)

Page 8: Finding Replicated Web Collections

Identical web collection

Collection: induced subgraph
Identical collection: one-to-one (equi-size)
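The web graph and identical-collection definitions can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' code: the dictionary-based graph and the helper names `induced_subgraph` and `is_identical` are made up for the example. Two collections are treated as identical here when a given one-to-one mapping pairs pages with equal content and preserves every link inside the collections.

```python
# Hypothetical sketch of the slide definitions.
# content: page -> text (node label, links excluded)
# links:   page -> set of pages it links to (edges)

def induced_subgraph(pages, links):
    """Keep only the edges whose two endpoints both lie in `pages`."""
    pages = set(pages)
    return {p: {q for q in links.get(p, set()) if q in pages} for p in pages}


def is_identical(coll_a, coll_b, mapping, content, links):
    """Check that `mapping` (page in coll_a -> page in coll_b) is one-to-one,
    pairs pages with equal content, and preserves links inside the collections."""
    coll_a, coll_b = set(coll_a), set(coll_b)
    if len(coll_a) != len(coll_b):                      # equi-size
        return False
    if set(mapping) != coll_a or set(mapping.values()) != coll_b:
        return False                                    # not one-to-one
    if any(content[p] != content[mapping[p]] for p in coll_a):
        return False                                    # page content differs
    sub_a = induced_subgraph(coll_a, links)
    sub_b = induced_subgraph(coll_b, links)
    # Every link inside coll_a must map onto a link inside coll_b, and vice versa.
    return all({mapping[q] for q in sub_a[p]} == sub_b[mapping[p]] for p in coll_a)
```

Links that leave a collection are ignored by `induced_subgraph`, so two mirrors can still match when one of them also points at unrelated pages.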

Page 9: Finding Replicated Web Collections

Collection similarity

Coincides with intuitively similar collections

Computable similarity measure

Page 10: Finding Replicated Web Collections

Collection similarity

Page content

Page 11: Finding Replicated Web Collections

Page content similarity

Fingerprint-based approach (chunking)
  Shingles [Broder et al., 1997]
  Sentence [Brin et al., 1995]
  Word [Shivakumar et al., 1995]

Many interesting issues
  Threshold value
  Iceberg query
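As a rough illustration of the chunking idea, the sketch below fingerprints word-level shingles and compares two pages by the fraction of fingerprints they share. The shingle width of 4 words, the use of truncated MD5 as the hash, and the function names are assumptions chosen for the example; the slides do not specify them.

```python
import hashlib


def fingerprints(text, width=4):
    """Hash every run of `width` consecutive words (a shingle) to a 64-bit value."""
    words = text.lower().split()
    if len(words) < width:
        chunks = [" ".join(words)] if words else []
    else:
        chunks = [" ".join(words[i:i + width]) for i in range(len(words) - width + 1)]
    return {
        int.from_bytes(hashlib.md5(c.encode("utf-8")).digest()[:8], "big")
        for c in chunks
    }


def resemblance(text_a, text_b, width=4):
    """Fraction of fingerprints the two texts share (Jaccard coefficient)."""
    fa, fb = fingerprints(text_a, width), fingerprints(text_b, width)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)
```

Two pages would then be called similar when `resemblance` exceeds a chosen threshold, which is the "threshold value" issue listed on the slide.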

Page 12: Finding Replicated Web Collections

Collection similarity

Link structure

Page 13: Finding Replicated Web Collections

Collection similarity

Size

Page 14: Finding Replicated Web Collections

Collection similarity

Size vs. Cardinality

Page 15: Finding Replicated Web Collections

Growth strategy

Page 16: Finding Replicated Web Collections

Essential property

[Figure: cluster Ra of pages a and cluster Rb of pages b]

|Ra| = Ls = Ld = |Rb|

Ls: # of pages in Ra that links come from
Ld: # of pages in Rb that links point to

Page 17: Finding Replicated Web Collections

Essential property

[Figure: cluster Ra of pages a and cluster Rb of pages b]

|Ra| ≠ Ls = Ld ≠ |Rb|

Ls: # of pages in Ra that links come from
Ld: # of pages in Rb that links point to
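The property can be checked for one pair of clusters with a few lines of Python. This is a sketch under assumptions: Ls is read as the number of pages in Ra that link into Rb, Ld as the number of pages in Rb that receive such a link, and the names `link_counts` and `has_essential_property` are hypothetical.

```python
def link_counts(ra, rb, links):
    """Return (|Ra|, Ls, Ld, |Rb|) for clusters ra and rb.

    links is a set of (source_page, destination_page) pairs.
    """
    ra, rb = set(ra), set(rb)
    between = [(s, d) for s, d in links if s in ra and d in rb]
    ls = len({s for s, _ in between})   # assumed meaning: pages in Ra linking into Rb
    ld = len({d for _, d in between})   # assumed meaning: pages in Rb linked from Ra
    return len(ra), ls, ld, len(rb)


def has_essential_property(ra, rb, links):
    """True when |Ra| = Ls = Ld = |Rb|."""
    size_a, ls, ld, size_b = link_counts(ra, rb, links)
    return size_a == ls == ld == size_b
```

With three copies of a and three copies of b, the property holds when every copy of a links to its own copy of b, and fails as soon as one of those links is missing.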

Page 18: Finding Replicated Web Collections

Algorithm

Based on the property we identified
Input: set of pages collected from web
Output: set of similar collections
Complexity: O(n log n)

Page 19: Finding Replicated Web Collections

Algorithm

Step 1: Similar page identification (iceberg query)

25 million pages
Fingerprint computation: 44 hours
Replicated page computation: 10 hours

[Figure: Step 1 takes the web pages and produces a table with columns Rid and Pid]
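Step 1 can be pictured as a count-and-threshold query over shared fingerprints. The sketch below is a naive in-memory version and only illustrates the shape of the query: the input format, the name `similar_pairs`, and the idea of counting shared fingerprints per page pair are assumptions, and a real iceberg query would avoid materializing every count at the 25-million-page scale.

```python
from collections import Counter, defaultdict
from itertools import combinations


def similar_pairs(page_fingerprints, threshold):
    """page_fingerprints: {pid: set of fingerprints}.

    Returns {(pid_a, pid_b): shared_count} for page pairs that share
    at least `threshold` fingerprints.
    """
    pages_with = defaultdict(list)            # inverted index: fingerprint -> pages
    for pid, fps in page_fingerprints.items():
        for fp in fps:
            pages_with[fp].append(pid)

    shared = Counter()                        # aggregate: shared fingerprints per pair
    for pids in pages_with.values():
        for pair in combinations(sorted(pids), 2):
            shared[pair] += 1

    # Keep only the groups whose count reaches the threshold.
    return {pair: n for pair, n in shared.items() if n >= threshold}
```

Pages that end up paired this way would then be grouped under a common cluster id before the link structure check.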

Page 20: Finding Replicated Web Collections

Algorithm

Step 2: link structure check

[Figure: tables R1 (Rid, Pid), Link (Pid, Pid), and R2 (Rid, Pid), where R2 is a copy of R1]

Group by (R1.Rid, R2.Rid)

Ra = |R1|, Ls = Count(R1.Rid), Ld = Count(R2.Rid), Rb = |R2|
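The join and group-by of step 2 could look roughly like this in Python. It is a sketch, not the authors' implementation: it assumes each page belongs to exactly one cluster, treats Rid as the cluster id and Pid as the page id, and counts distinct source and destination pages per group, which is one reading of the Count expressions on the slide. The function name `link_structure_check` is hypothetical.

```python
from collections import defaultdict


def link_structure_check(cluster_rows, link_rows):
    """cluster_rows: iterable of (rid, pid); link_rows: iterable of (src_pid, dst_pid).

    Returns {(rid_a, rid_b): (|Ra|, Ls, Ld, |Rb|)} for every linked pair of clusters.
    """
    rid_of = {}
    size = defaultdict(int)
    for rid, pid in cluster_rows:
        rid_of[pid] = rid
        size[rid] += 1

    sources = defaultdict(set)
    dests = defaultdict(set)
    for src, dst in link_rows:
        if src in rid_of and dst in rid_of:        # join R1 with Link with R2
            key = (rid_of[src], rid_of[dst])       # group by (R1.Rid, R2.Rid)
            sources[key].add(src)
            dests[key].add(dst)

    return {
        key: (size[key[0]], len(sources[key]), len(dests[key]), size[key[1]])
        for key in sources
    }
```

Links whose source or destination is not in any cluster drop out of the join, so only cluster-to-cluster links contribute to the counts.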

Page 21: Finding Replicated Web Collections

Algorithm

Step 3:

S = {}
For every (|Ra|, Ls, Ld, |Rb|) in step 2
    If (|Ra| = Ls = Ld = |Rb|)
        S = S U {<Ra, Rb>}
Union-Find(S)

Step 2-3: 10 hours
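The step 3 pseudocode can be turned into a small runnable sketch. The input format (a dictionary keyed by cluster-id pairs, holding the four counts from step 2) and the names `find` and `merge_clusters` are assumptions for illustration; the union-find here uses path compression only and makes no claim about matching the stated complexity.

```python
def find(parent, x):
    """Return the representative of x's set, compressing the path on the way."""
    parent.setdefault(x, x)
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def merge_clusters(stats):
    """stats: {(ra, rb): (size_a, ls, ld, size_b)} with cluster ids ra and rb.

    Returns the groups of cluster ids that end up in the same collection.
    """
    parent = {}
    for (ra, rb), (size_a, ls, ld, size_b) in stats.items():
        if size_a == ls == ld == size_b:                  # |Ra| = Ls = Ld = |Rb|
            parent[find(parent, ra)] = find(parent, rb)   # union the two clusters
    groups = {}
    for rid in list(parent):
        groups.setdefault(find(parent, rid), set()).add(rid)
    return sorted(sorted(group) for group in groups.values())
```

Each returned group lists the clusters that are linked together by pairs satisfying the property; clusters that never satisfy it with any partner do not appear.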

Page 22: Finding Replicated Web Collections

Experiment

25 widely replicated collections (cardinality: 5-10 copies, size: 50-1000 pages)
=> Total number of pages: 35,000 + 15,000 random pages

Result: 180 collections
  149 “good” collections
  31 “problem” collections

Page 23: Finding Replicated Web Collections

Results

Page 24: Finding Replicated Web Collections

Applications

Web crawling & archiving
  Save network bandwidth
  Save disk storage

Page 25: Finding Replicated Web Collections

Application (web crawling)

Before experiment: 48%
With our technique: 13%

[Figure: initial crawl, crawled pages, offline copy detection, replication info, second crawl]
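One way the replication info could feed the second crawl is sketched below. The prefix-based filter, the function name `plan_second_crawl`, and the example.org URLs in the usage note are all assumptions for illustration; the slides do not say how the crawler consumes the replication info.

```python
def plan_second_crawl(urls, replica_prefixes):
    """Drop URLs under a prefix already known to mirror another collection.

    urls: candidate URLs for the second crawl.
    replica_prefixes: URL prefixes flagged as replicas by offline copy detection
    (a hypothetical representation of the replication info).
    """
    keep = []
    for url in urls:
        if any(url.startswith(prefix) for prefix in replica_prefixes):
            continue                      # known replica: skip to save bandwidth
        keep.append(url)
    return keep
```

For instance, with `replica_prefixes = ["http://mirror.example.org/ldp/"]`, a URL such as `http://mirror.example.org/ldp/howto.html` would be skipped while `http://www.example.org/ldp/howto.html` would still be crawled.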

Page 26: Finding Replicated Web Collections

Applications (web search)

Page 27: Finding Replicated Web Collections

Related work

Collection similarity
  Altavista [Bharat et al., 1999]

Page similarity
  COPS [Brin et al., 1995]: sentence
  SCAM [Shivakumar et al., 1995]: word
  Altavista [Broder et al., 1997]: shingle

Page 28: Finding Replicated Web Collections

Summary

Computable similarity measure
Efficient replication-detection algorithm
Application to real-world problems