Finding Replicated Web Collections
1
Finding Replicated Web Collections
Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina
2
Replication is common!
3
Statistics (Preview)
Collection                           # of copies   # of pages
TUCOWS                                       360         1052
LDP (Linux Documentation Project)            143         1359
Apache manual                                125          115
Java manual                                   59          552
Mars Pathfinder                               49          498
More than 48% of pages have copies!
4
Reasons for replication
- Actual replication: simple copying or mirroring
- Apparent replication: aliases (multiple site names), symbolic links, multiple mount points
5
Challenges
- Subgraph isomorphism: NP-complete
- Hundreds of millions of pages
- Slight differences between copies
6
Outline
- Definitions: web graph, collection, identical collection, similar collection
- Algorithm
- Applications
- Results
7
Web graph
- Node: web page
- Edge: link between pages
- Node label: page content (excluding links)
8
Identical web collection
- Collection: induced subgraph
- Identical collection: one-to-one (equi-size)
9
Collection similarity
- Coincides with intuitively similar collections
- Computable similarity measure
10
Collection similarity
Page content
11
Page content similarity
- Fingerprint-based approach (chunking)
  - Shingle [Broder et al., 1997]
  - Sentence [Brin et al., 1995]
  - Word [Shivakumar et al., 1995]
- Many interesting issues: threshold value, iceberg query
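The chunking idea above can be sketched in a few lines. This is a minimal illustration in the spirit of shingle-based fingerprinting [Broder et al., 1997]; the chunk size, hash function, and Jaccard measure are illustrative choices, not the exact parameters used in the paper.

```python
import hashlib

def shingles(text, k=4):
    """Chunk a page's text into overlapping k-word shingles."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def fingerprints(text, k=4):
    """Hash each shingle down to a short fingerprint."""
    return {hashlib.md5(s.encode()).hexdigest()[:8] for s in shingles(text, k)}

def page_similarity(a, b, k=4):
    """Fraction of shared fingerprints (Jaccard resemblance)."""
    fa, fb = fingerprints(a, k), fingerprints(b, k)
    return len(fa & fb) / len(fa | fb)
```

Two identical pages score 1.0, disjoint pages 0.0; a threshold on this score is one of the "interesting issues" the slide mentions.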
12
Collection similarity
Link structure
13
Collection similarity
Size
14
Collection similarity
Size vs. Cardinality
15
Growth strategy
16
Essential property
[Diagram: identical collections Ra (pages a) and Rb (pages b), with links from pages in Ra landing on pages in Rb]

|Ra| = Ls = Ld = |Rb|

Ls: # of pages linked from
Ld: # of pages linked to
17
Essential property
[Diagram: similar collections Ra and Rb where some pages or links lack counterparts, so equality breaks]

|Ra| ≥ Ls = Ld ≤ |Rb|

Ls: # of pages linked from
Ld: # of pages linked to
18
Algorithm
- Based on the property we identified
- Input: set of pages collected from the web
- Output: set of similar collections
- Complexity: O(n log n)
19
Algorithm
Step 1: similar page identification (iceberg query)
- 25 million pages
- Fingerprint computation: 44 hours
- Replicated page computation: 10 hours

[Diagram: web pages feed into Step 1, which emits a (Rid, Pid) table assigning each replicated page Pid to a replica group Rid]
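A minimal sketch of the Step 1 grouping, using the iceberg idea: keep only fingerprints shared by more than one page. The input and output shapes here are assumptions for illustration; the real system computes fingerprints over 25 million pages out of core.

```python
from collections import defaultdict

def similar_page_groups(page_fingerprints):
    """Group page ids that share a fingerprint; keep only groups with
    more than one member -- the 'iceberg' tip of pages that have copies.

    page_fingerprints: dict Pid -> fingerprint (hypothetical input shape).
    Returns (Rid, [Pid, ...]) pairs like the Rid/Pid table on the slide.
    """
    by_fp = defaultdict(list)
    for pid, fp in page_fingerprints.items():
        by_fp[fp].append(pid)
    groups = [pids for pids in by_fp.values() if len(pids) > 1]
    return [(rid, sorted(pids)) for rid, pids in enumerate(groups, start=1)]
```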
20
Algorithm
Step 2: link structure check

[Diagram: the link table joined with two copies (R1 and R2, a copy of R1) of Step 1's (Rid, Pid) table]

Group by (R1.Rid, R2.Rid):
|Ra| = |R1|, Ls = Count(R1.Rid), Ld = Count(R2.Rid), |Rb| = |R2|
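The group-by above can be sketched over small in-memory tables. The real system runs this as a relational query; the function name, input shapes, and distinct-page counting below are assumptions made for a self-contained illustration.

```python
from collections import defaultdict

def link_structure_stats(rid_of, links, group_size):
    """Step 2 sketch: for each pair of replica groups (Ra, Rb), count
    Ls = distinct source pages in Ra linking into Rb, and
    Ld = distinct destination pages in Rb linked from Ra,
    mirroring the group-by on (R1.Rid, R2.Rid) over the link table.

    rid_of: dict Pid -> Rid (Step 1 output);
    links: iterable of (source Pid, destination Pid);
    group_size: dict Rid -> number of pages in the group.
    """
    srcs, dsts = defaultdict(set), defaultdict(set)
    for s, d in links:
        # only links whose endpoints both belong to some replica group count
        if s in rid_of and d in rid_of:
            pair = (rid_of[s], rid_of[d])
            srcs[pair].add(s)
            dsts[pair].add(d)
    return {pair: (group_size[pair[0]], len(srcs[pair]),
                   len(dsts[pair]), group_size[pair[1]])
            for pair in srcs}
```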
21
Algorithm
Step 3:
    S = {}
    for every (|Ra|, Ls, Ld, |Rb|) from Step 2:
        if |Ra| = Ls = Ld = |Rb|:
            S = S ∪ {<Ra, Rb>}
    Union-Find(S)

Steps 2-3: 10 hours
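A runnable version of the Step 3 pseudocode, assuming the Step 2 output arrives as a dict keyed by collection pairs (a hypothetical shape, not the paper's actual representation); the filtering and union-find merging follow the slide.

```python
def find(parent, x):
    """Path-compressing find for union-find over collection ids."""
    while parent.setdefault(x, x) != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def merge_equal_collections(stats):
    """Step 3 sketch: keep only pairs with |Ra| = Ls = Ld = |Rb| and
    merge them via union-find into maximal similar collections.

    stats: dict (Ra, Rb) -> (|Ra|, Ls, Ld, |Rb|), as Step 2 would produce.
    Returns the merged collection groups as sorted lists of ids.
    """
    parent = {}
    for (ra, rb), (na, ls, ld, nb) in stats.items():
        if na == ls == ld == nb:
            parent[find(parent, ra)] = find(parent, rb)
    groups = {}
    for x in list(parent):
        groups.setdefault(find(parent, x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())
```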
22
Experiment
- 25 widely replicated collections (cardinality: 5-10 copies, size: 50-1000 pages)
- Total number of pages: 35,000 + 15,000 random pages
- Result: 180 collections (149 "good" collections, 31 "problem" collections)
23
Results
24
Applications
- Web crawling & archiving
  - Save network bandwidth
  - Save disk storage
25
Application (web crawling)
- Before experiment: 48% replicated pages
- With our technique: 13%

[Diagram: initial crawl → offline copy detection → replication info → second crawl → crawled pages]
26
Applications (web search)
27
Related work
- Collection similarity: AltaVista [Bharat et al., 1999]
- Page similarity:
  - COPS [Brin et al., 1995]: sentence
  - SCAM [Shivakumar et al., 1995]: word
  - AltaVista [Broder et al., 1997]: shingle
28
Summary
- Computable similarity measure
- Efficient replication-detection algorithm
- Application to real-world problems