Massive Data Sets: Theory & Practice (Ziv Bar-Yossef, IBM Almaden Research Center)
Post on 21-Dec-2015
Massive Data Sets: Theory & Practice
Ziv Bar-Yossef
IBM Almaden Research Center
What are Massive Data Sets?
Technology
• The World-Wide Web
• IP packet flows
• Phone call logs
Science
• Genomic data
• Astronomical sky surveys
• Weather data
Business
• Credit card transactions
• Billing records
• Supermarket sales
Scale: Gigabytes, Terabytes, Petabytes
Characteristics:
• Huge
• Distributed
• Dynamic
• Heterogeneous
• Noisy
• Unstructured / semi-structured
Nontraditional Challenges
Traditionally: cope with the complexity of the problem
Massive data sets: cope with the complexity of the data
New challenges
• How to efficiently compute on massive data sets?
– Restricted access to the data
– Not enough time to read the whole data
– Tiny fraction of the data can be held in main memory
• How to find desired information in the data?
• How to summarize the data?
• How to clean the data?
Computational Models for Massive Data Sets
• Sampling: query a small number of data elements
• Data streams: stream through the data; limited main memory storage
• Sketching: compress data chunks into small “sketches”; compute over the sketches
[Figure: for each model, an algorithm accessing the data set]
Outline of the Talk
• Web statistics
• Sampling lower bounds
• Hamming distance sketching
• Template detection
Labels on the slide: “Theory”, “Practice”
Web Statistics (with A. Berg, S. Chien, J. Fakcharoenphol, D. Weitz, VLDB 2000)
The “BowTie” Structure of the Web
[Broder et al, 2000]
crawlable web
• What fraction of the web is covered by Google?
• Which is the largest country domain on the web?
• What is the percentage of French language pages?
• How large is the web?
Our Approach
• Straightforward solution:
– Crawl the crawlable web
– Generate statistics based on the crawl
• Drawbacks:
– Expensive
– Complicated implementation
– Slow
– Inaccurate
• Our approach: uniform sampling by random walks
– Random walk on an undirected & regular version of the crawlable web
• Advantages:
– Provably uniform samples from the crawlable web
– Runs on a desktop PC in a couple of days
Undirected Regular Random Walk
Fact:
A random walk on a connected (non-bipartite) undirected regular graph converges to a uniform limit distribution.
• Follow a random out-link or a random in-link at each step
• Use weighted self loops to even out page degrees: w(v) = deg_max − deg(v)
[Figure: example graph with numeric node labels]
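A minimal sketch of the walk described on this slide, assuming the undirected graph is already available as an adjacency list (on the real web, in-links and degrees are not readily available). The toy graph and the names `regular_walk`, `adj`, and `counts` are hypothetical. The self-loop weight deg_max − deg(v) is simulated by staying put with probability 1 − deg(v)/deg_max.

```python
import random
from collections import Counter

def regular_walk(adj, start, steps, rng):
    """Random walk on an undirected graph made regular by weighted self-loops."""
    deg_max = max(len(nbrs) for nbrs in adj.values())
    v = start
    for _ in range(steps):
        # Move to a uniformly random neighbour with probability deg(v)/deg_max;
        # otherwise follow a self-loop (total weight deg_max - deg(v)).
        if rng.random() < len(adj[v]) / deg_max:
            v = rng.choice(adj[v])
    return v

# Hypothetical toy graph: undirected, connected, non-bipartite (triangle a-b-c).
adj = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a"],
}

rng = random.Random(0)
# Run many independent walks and record where each one ends.
counts = Counter(regular_walk(adj, "a", 50, rng) for _ in range(4000))
```

On this toy graph the end points should be spread roughly evenly over the four nodes, even though their degrees differ.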
Convergence Rate (“Mixing Time”)
Theorem: mixing time is roughly log(N) / (spectral gap), where N = graph size and the spectral gap is that of the transition matrix.
Experiment (based on a crawl)
• For the web, spectral gap ≈ 10⁻⁵
• Mixing time: 3.3 million steps
• Self-loop steps are free
• 29,999 out of 30,000 steps are self-loop steps
Actual mixing time is only 110 steps (3.3 million / 30,000 = 110 steps that are not self loops).
Realization of the Random Walk
Problems
• The in-links of pages are not readily available
• The degree of pages is not available
Available sources of in-links:
• Previously visited nodes
• Reverse link services of search engines
Experiments indicate samples are still nearly uniform.
Top 20 Internet Domains (summer 2003)
[Bar chart: y-axis 0% to 60%; domains: .com, .org, .net, .edu, .de, .uk, .au, .us, .es, .jp, .ca, .nl, .it, .ch, .pl, .il, .nz, .gov, .info, .mx; labeled values: 51.15%, 10.36%, 9.19%, 5.57%, 4.15%, 3.01%, 0.61%]
Search Engine Coverage (summer 2000)
[Bar chart: y-axis 0% to 80%; values in chart order: Google 68%, AltaVista 54%, Fast 50%, Lycos 50%, HotBot 48%, Go 38%]
Subsequent Extensions
• Focused Sampling (with T. Kanungo and R. Krauthgamer, 2003)
– “Focused statistics” about web communities:
  • Statistics about the .uk domain
  • Statistics about pages on bicycling
  • Statistics about Arabic language pages
– Based on a sophisticated extension of the above random walk.
• Study of the web’s decay (with A. Broder, R. Kumar, and A. Tomkins, 2003)
– A measure for how well-maintained web pages are.
– Based on a random walk idea.
Sampling Lower Bounds (STOC 2003)
Q1. How many samples are needed to estimate:
– The fraction of pages covered by Google?
– The number of distinct web-sites?
– The distribution of languages on the web?
A1. A “recipe” for obtaining sampling lower bounds for symmetric functions.
Q2. Can we save samples by sampling non-uniformly?
A2. For “symmetric” functions, uniform sampling is the best possible. (“symmetric”: invariant under permutations of data elements)
Optimality of Uniform Sampling (with R. Kumar and D. Sivakumar, STOC 2001)
Theorem
When estimating symmetric functions, uniform sampling is the best possible.
Proof idea
[Figure: the original algorithm vs. its simulation on an input X1,…,X8, with sampled elements X2, X7, X5]
Preliminaries
f: Aⁿ → B: symmetric function
ε: approximation parameter
[Figure: the values f(a), f(b), f(c) in B for pairwise “disjoint inputs” a, b, c]
“Sample distribution” μ_x of an input x, e.g. for x = (1, 1, 1, 2, 2, 3): μ_x(1) = 1/2, μ_x(2) = 1/3, μ_x(3) = 1/6
The Lower Bound Recipe
x1,…,xm: “pairwise disjoint” inputs
μ1,…,μm: “sample distributions” on x1,…,xm
Theorem: Any algorithm approximating f requires q samples, where q = Ω(log m / JS(μ1,…,μm))
(0 ≤ JS(μ1,…,μm) ≤ log m)
Proof steps:
• Reduction from statistical classification
• Lower bound for statistical classification
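As a rough illustration of the JS term in the recipe, the sketch below assumes JS denotes the generalized Jensen-Shannon divergence with uniform weights: the entropy of the average distribution minus the average of the individual entropies, measured in bits. That reading is an assumption, not something the slide states, and the names `entropy` and `js_divergence` are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a distribution given as {outcome: probability}."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def js_divergence(dists):
    """Generalized Jensen-Shannon divergence with uniform weights (assumed definition)."""
    m = len(dists)
    mixture = {}
    for d in dists:
        for outcome, v in d.items():
            mixture[outcome] = mixture.get(outcome, 0.0) + v / m
    return entropy(mixture) - sum(entropy(d) for d in dists) / m

# Identical distributions cannot be told apart: divergence 0.
same = js_divergence([{1: 0.5, 2: 0.5}, {1: 0.5, 2: 0.5}])
# Distributions with disjoint supports are trivially distinguishable: divergence log m.
apart = js_divergence([{1: 1.0}, {2: 1.0}, {3: 1.0}, {4: 1.0}])
```

Under this definition the two extreme cases match the range from 0 to log m noted on the slide: the harder the sample distributions are to tell apart, the smaller JS is.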
Reduction from Statistical Classification
f: Aⁿ → B: symmetric function
[Figure: the values f(a), f(b), f(c) in B for pairwise “disjoint inputs” a, b, c]
Statistical classification: Given uniform samples from x ∈ { a, b, c }, decide whether x = a or x = b or x = c.
Can be solved by any sampling algorithm approximating f
The “Election Problem”
• Input: a sequence x of n votes to k parties
Vote distribution μ_x: 7/18 4/18 3/18 2/18 1/18 1/18 (n = 18, k = 6)
• Want to get μ̃ s.t. ||μ̃ − μ_x|| < ε.
Theorem: A poll of size Ω(k/ε²) is required for estimating the election problem.
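To make the election problem concrete, here is a small simulation under stated assumptions: the poll samples votes uniformly with replacement, and the distance between distributions is taken to be the L1 norm (the slide does not say which norm is meant). The function names and the poll size are hypothetical.

```python
import random
from collections import Counter
from fractions import Fraction

def vote_distribution(votes):
    """Fraction of the votes received by each party."""
    n = len(votes)
    return {party: Fraction(c, n) for party, c in Counter(votes).items()}

def poll(votes, q, rng):
    """Estimate the vote distribution from q uniform samples (with replacement)."""
    sample = [rng.choice(votes) for _ in range(q)]
    return vote_distribution(sample)

def l1_distance(p, r):
    """L1 distance between two distributions (assumed choice of norm)."""
    return sum(abs(p.get(k, 0) - r.get(k, 0)) for k in set(p) | set(r))

# The slide's example: n = 18 votes, k = 6 parties.
votes = [1] * 7 + [2] * 4 + [3] * 3 + [4] * 2 + [5] + [6]
mu_x = vote_distribution(votes)

rng = random.Random(0)
estimate = poll(votes, 20000, rng)
error = l1_distance(mu_x, estimate)
```

The simulation only shows the upper-bound side (a large enough poll gives a small error); the theorem on the slide is the matching statement that polls cannot be much smaller.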
Combinatorial Designs
A family of subsets B1,…,Bm of a universe U s.t.
1. Each of them constitutes half of U.
2. The intersection of each two of them is relatively small.
[Figure: sets B1, B2, B3 inside U]
Fact: There exist designs of size exponential in |U|.
Proof of the Lower Bound for the Election Problem
Step 1: Identification of a set S of pairwise disjoint inputs:
B1,…,Bm ⊆ {1,…,k}: a design of size m = 2^Ω(k).
S = { x1,…,xm }, where in xi:
• ½ + ε of the votes are split among parties in Bi.
• ½ − ε of the votes are split among parties in Bi^c (the complement of Bi).
Step 2: JS(μ1,…,μm) = O(ε²).
By our theorem, # of queries is at least Ω(k/ε²).
Hamming Distance Sketching (with T.S. Jayram and R. Kumar, 2003)
[Figure: Alice holds x and Bob holds y; each sends a sketch, σ(x) and σ(y), to a Referee, who decides whether Ham(x,y) ≤ k or Ham(x,y) > k]
Hamming Distance Sketching
Applications
• Maintenance of large crawls
• Comparison of large files over the network
Previous schemes:
• Sketches of size O(k²) [Kushilevitz, Ostrovsky, Rabani, 98], [Yao 03]
• Lower bound: Ω(k)
Our scheme:
• Sketches of size O(k log k)
Preliminaries
Balls and Bins:
• When throwing n balls into n/log n bins, with high probability the fullest bin has O(log n) balls.
• When throwing n balls into n² bins, with high probability no two balls fall into the same bin.
• Using the KOR scheme, can assume Ham(x,y) ≤ 2k.
First Level Hashing
x = 1 0 0 1 1 1 0 0 0 1 0 0 1 1 0 0
y = 1 1 0 1 0 1 0 1 1 1 0 0 1 0 0 1
[Figure: the positions of x and y are hashed into k/log k bins, giving pieces x1, x2, x3, … and y1, y2, y3, …]
Ham(x,y) = Σ_i Ham(xi,yi)
∀i, Ham(xi,yi) ≤ O(log k)
Second Level Hashing
[Figure: the positions of x3 and y3 are hashed into log² k bins, giving bit vectors α3 = (α3,1, …, α3,6) and β3 = (β3,1, …, β3,6)]
α3,j = β3,j iff # of “pink positions” (positions where x3 and y3 differ) in the j-th bin is even.
• If no collisions, Ham(α3,β3) = Ham(x3,y3)
• If collisions, Ham(α3,β3) ≤ Ham(x3,y3)
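The second-level step can be sketched as follows, assuming each bin stores the parity (XOR) of the bits hashed into it and that Alice and Bob share the same random assignment of positions to bins. The example vectors, the bin count, and the names `parity_sketch`, `bin_of`, `alpha3`, and `beta3` are hypothetical.

```python
import random

def parity_sketch(bits, bin_of, num_bins):
    """XOR together the bits that the shared hash sends to each bin."""
    sketch = [0] * num_bins
    for pos, bit in enumerate(bits):
        sketch[bin_of[pos]] ^= bit
    return sketch

def hamming(u, v):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(u, v))

x3 = [1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0]
y3 = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0]  # differs from x3 at positions 1, 4, 7

num_bins = 9
rng = random.Random(7)
# Shared randomness: both parties use the same position-to-bin assignment.
bin_of = [rng.randrange(num_bins) for _ in x3]

alpha3 = parity_sketch(x3, bin_of, num_bins)
beta3 = parity_sketch(y3, bin_of, num_bins)
sketch_distance = hamming(alpha3, beta3)
```

A bin's two parities differ exactly when it holds an odd number of differing positions, so `sketch_distance` never exceeds Ham(x3,y3) and equals it when no two differing positions share a bin.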
The Sketch
• Probability of collision: a small constant
• For each i = 1,…,k/log k, repeat second level hashing t = O(log k) times, obtaining (α_i^1, β_i^1),…,(α_i^t, β_i^t).
• With probability at least 1 – 1/k, Ham(xi,yi) = max_j Ham(α_i^j, β_i^j)
• σ(x) = { α_i^j | i = 1,…,k/log k, j = 1,…,t }
• σ(y) = { β_i^j | i = 1,…,k/log k, j = 1,…,t }
• Referee decides Ham(x,y) ≤ k if and only if Σ_i max_j Ham(α_i^j, β_i^j) ≤ k
Other Sketching Results
• A sketching scheme for the edit distance
– Leads to the first almost-linear time approximation algorithm for the edit distance.
• Sketch lower bounds for (compressed) pattern matching.
Template Detection (with S. Rajagopalan, WWW 2002)
Template – Master HTML shell page used for composing new pages.
Our contributions:
• Efficient algorithm for template detection
• Application to improvement of search engine precision
Templates are Bad for Web IR
• Pose a significant source of “noise” in web pages
– Their content is not related to the topics of pages in which they reside
– Create spurious linkage to unimportant pages
• Extremely common
– Became standard in website design
Pagelets [Chakrabarti 01]
Pagelet – a region in a page that:
• has a single theme
• is not nested within a bigger region with the same theme
[Figure: example page with a navigational bar pagelet, a search pagelet, a directory pagelet, and a news headlines pagelet]
Template Definition
Template = a collection of pagelets that:
1. Belong to the same website.
2. Are nearly-identical.
Template Detection
Template Detection Problem: Given a set of pages S, find all the templates in S.
Template Detection Algorithm
• Group the pages in S according to website.
• For each website w:
– For each page p ∈ w:
  • Partition p into pagelets p1,…,pk
  • Compute a “shingle” sketch for each pagelet [Broder et al. 1997]
– Group the resulting pagelets by their sketches.
– Output all the pagelet groups of size > 1.
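The algorithm above can be sketched in a few lines, under several assumptions: pages arrive already partitioned into pagelet texts, and the “shingle” sketch is approximated by the few smallest MD5 hashes of 3-word shingles, which is one common min-hash style variant and not necessarily the sketch used in the talk. The sample pages and the names `shingle_sketch` and `detect_templates` are hypothetical.

```python
import hashlib
from collections import defaultdict

def shingle_sketch(text, w=3, keep=4):
    """Keep the `keep` smallest hashes of the w-word shingles of a pagelet."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + w]) for i in range(max(1, len(words) - w + 1))}
    hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16) for s in shingles)
    return tuple(hashes[:keep])

def detect_templates(pages):
    """pages: iterable of (site, url, pagelet_texts); returns groups of size > 1."""
    groups = defaultdict(list)
    for site, url, pagelets in pages:
        for pagelet in pagelets:
            # Pagelets are grouped per website, keyed by their sketch.
            groups[(site, shingle_sketch(pagelet))].append((url, pagelet))
    return [group for group in groups.values() if len(group) > 1]

pages = [
    ("a.com", "a.com/1", ["Home News Sports Weather Contact",
                          "Story about gardening in spring time"]),
    ("a.com", "a.com/2", ["Home News Sports Weather Contact",
                          "Review of a new classical guitar"]),
    ("b.org", "b.org/1", ["Home News Sports Weather Contact"]),
]
templates = detect_templates(pages)
```

Only the navigation bar shared by the two `a.com` pages is reported; the same text on `b.org` is not grouped with it because templates are defined per website.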
HITS & Clever [Kleinberg 1997, Chakrabarti et al. 1998]
[Figure: hubs and authorities]
h(p) = Σ_{q ∈ out(p)} a(q)
a(p) = Σ_{q ∈ in(p)} h(q)
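The two update rules can be iterated as in the sketch below. The fixed iteration count and the Euclidean normalisation after each round are assumptions added to keep the scores bounded; the slide gives only the update equations. The toy graph and the name `hits` are hypothetical.

```python
def hits(out_links, iterations=50):
    """Iterate a(p) = sum of h(q) over in-links, h(p) = sum of a(q) over out-links."""
    nodes = set(out_links) | {q for qs in out_links.values() for q in qs}
    hub = {p: 1.0 for p in nodes}
    auth = {p: 1.0 for p in nodes}
    for _ in range(iterations):
        # Authority update: add the hub score of every page linking to q.
        auth = {p: 0.0 for p in nodes}
        for p, qs in out_links.items():
            for q in qs:
                auth[q] += hub[p]
        # Hub update: add the authority score of every page p links to.
        hub = {p: sum(auth[q] for q in out_links.get(p, ())) for p in nodes}
        # Normalise both score vectors (assumed step, keeps values bounded).
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

out_links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(out_links)
```

In this toy graph `a1` is cited by both hubs and so ends up with the highest authority score, while `h1`, which cites both authorities, ends up as the stronger hub.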
“Template” Clever
• Hubs – all the non-templatized constituent pagelets of pages in the base set.
• Authorities – all pages in the base set.
[Figure: hubs and authorities; legend: page, pagelet, templatized pagelet]
Classical Clever vs. Template Clever
[Chart: “Average Precision @ 50 for broad queries”; y-axis: Precision (0 to 120); x-axis: 10, 20, 30, 40, 50; series: Classical Clever, Template Clever]
Template Proliferation
[Bar chart: “Template Frequency for ARC Set Queries”; y-axis: 0 to 0.7; queries: recycling_cans, gardening, mutual_funds, java, Zener, San_Francisco, field_hockey, Penelope_Fitzgerald, HIV, bicycling, affirmative_action, amusement_parks, Thailand_tourism, cruises, volcano, stamp_collecting, architecture, Shakespeare, Gulf_war, zen_buddhism, lyme_disease, Death_Valley, citrus_groves, cheese, table_tennis, blues, classical_guitar, telecommuting, parallel_architecture]
Summary
• Web data mining via random walks on the web graph:
– Web statistics
– Focused statistics
– Web decay
• Sampling lower bounds
– Optimality of uniform sampling for symmetric functions
– A “recipe” for lower bounds
• Sketching of string distance measures
– Hamming distance
– Edit distance
• Template detection
Some of My Other Work
• Database
– Semi-structured data and XML
• Computational Complexity
– Communication complexity
– Pseudo-randomness and de-randomization
– Space-bounded computations
– Parallel computation complexity
• Algorithm Design
– Data stream algorithms
– Internet auctions
Web Statistics (with A. Berg, S. Chien, J. Fakcharoenphol, D. Weitz, VLDB 2000)
The “BowTie” Structure of the Web
[Broder et al, 2000]
crawlable web
[Figure labels: SCC, OUT, IN]
• What fraction of the web is covered by Google?
• Which is the largest country domain on the web?
• What is the percentage of porn pages?
• How large is the web?
Straightforward Random Walk
Follow a random out-link at each step
• Gets stuck in sinks and in dense web communities
• Biased towards popular pages
• Converges slowly, if at all
[Figure: example web graph with nodes numbered 1 to 9, including yahoo.com, amazon.com, and www.almaden.ibm.com/cs/people/ziv]
Undirected Regular Random Walk
Fact: A random walk on a connected (non-bipartite) undirected regular graph converges to a uniform limit distribution.
• Follow a random out-link or a random in-link at each step
• Use weighted self loops to even out page degrees: w(v) = deg_max − deg(v)
[Figure: example graph with numeric node labels, including yahoo.com, amazon.com, and www.almaden.ibm.com/cs/people/ziv]
Evaluation: Bias towards High Degree Nodes
[Bar chart: percent of nodes from walk, by deciles of nodes ordered by degree (high degree to low degree)]
Evaluation: Bias towards the Search Engines
[Chart: estimate of search engine size vs. search engine size (30%, 50%)]
Link-Based Web IR Applications
• Search and ranking
– HITS and Clever [Kleinberg 1997, Chakrabarti et al. 1998]
– PageRank [Brin and Page 1998]
– SALSA [Lempel and Moran 2000]
• Similarity search
– Co-Citation [Dean and Henzinger 1999]
• Categorization
– Hyperclass [Chakrabarti, Dom, Indyk 1998]
• Focused crawling
– FOCUS [Chakrabarti, van den Berg, Dom 1999]
• …
Hypertext IR Principles
Underlying principles of link analysis:
• Relevant Linkage Principle [Kleinberg 1997]
– p links to q ⇒ q is relevant to p
• Topical Unity Principle [Kessler 1963, Small 1973]
– q1 and q2 are co-cited in p ⇒ q1 and q2 are related to each other
• Lexical Affinity Principle [Maarek et al. 1991]
– The closer the links to q1 and q2 are, the stronger the relation between them.
[Figures: p linking to q; p citing q1 and q2; p linking to q1, q2, q3]
Example: HITS & Clever [Kleinberg 1997, Chakrabarti et al. 1998]
• Relevant Linkage Principle
– All links propagate score from hubs to authorities and vice versa.
• Topical Unity Principle
– Co-cited authorities propagate score to each other.
• Lexical Affinity Principle (Clever)
– Text around links is used to weight relevance of the links.
[Figure: hubs and authorities]
h(p) = Σ_{q ∈ out(p)} a(q)
a(p) = Σ_{q ∈ in(p)} h(q)