TRANSCRIPT
User-Centric Web Crawling
Sandeep Pandey & Christopher Olston
Carnegie Mellon University
Web Crawling
One important application (our focus): search
– Topic-specific search engines + general-purpose ones
[Diagram: the crawler fetches pages from the WWW into a repository; an index built over the repository serves users' search queries]
Out-of-date Repository
The Web is always changing [Arasu et al., TOIT'01]:
– 23% of Web pages change daily
– 40% of commercial Web pages change daily

Many problems arise from an out-of-date repository:
– Hurts both precision and recall
Web Crawling Optimization Problem
Not enough resources to (re)download every web document every day/hour
– Must pick and choose ⇒ an optimization problem

Others: objective function = avg. freshness or age
Our goal: focus directly on impact on users
Web Search User Interface
1. User enters keywords
2. Search engine returns ranked list of results
3. User visits subset of results
[Illustration: ranked list of result documents 1, 2, 3, 4, …]
Objective: Maximize Repository Quality (as perceived by users)
Suppose a user issues search query q:
Quality_q = Σ_{documents D} (likelihood of viewing D) × (relevance of D to q)
Given a workload W of user queries:
Average quality = (1/K) × Σ_{queries q ∈ W} (freq_q × Quality_q)
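To make the metric concrete, here is a minimal Python sketch with toy numbers; treating K as the total query frequency, and all of the values shown, are assumptions for illustration.

```python
# Minimal sketch of the quality metric above, with made-up numbers.
# K is taken to be the total query frequency (an assumption).

def quality(view_prob, relevance):
    """Quality_q = sum over documents D of P(view D) x relevance(D, q)."""
    return sum(v * r for v, r in zip(view_prob, relevance))

# Toy workload: query -> (frequency, per-result view probs, relevances).
workload = {
    "cancer symptoms": (120, [0.5, 0.2, 0.1], [0.9, 0.4, 0.7]),
    "seminar":         (30,  [0.5, 0.2],      [0.6, 0.1]),
}

K = sum(freq for freq, _, _ in workload.values())
avg_quality = sum(
    freq * quality(vp, rel) for freq, vp, rel in workload.values()
) / K
print(round(avg_quality, 3))  # 0.544
```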
Viewing Likelihood
[Plot: probability of viewing vs. rank (0-150); view probability falls steeply with rank]
• Depends primarily on rank in list [Joachims KDD’02]
• From AltaVista data [Lempel et al. WWW’03]:
ViewProbability(r) ∝ r^(-1.5)
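A small sketch of this power-law model; only the r^(-1.5) shape comes from the data above, while normalizing over the top 100 ranks is an assumption.

```python
# Sketch of the rank-based view-probability model reported above.
# Only the power-law shape comes from the slide; the normalization
# over the top 100 ranks is assumed.

def view_probability(rank, num_results=100):
    """Probability of viewing the result at a given rank (1-based)."""
    norm = sum(r ** -1.5 for r in range(1, num_results + 1))
    return (rank ** -1.5) / norm

print(view_probability(1), view_probability(10))  # steep drop-off
```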
Relevance Scoring Function

Search engines' internal notion of how well a document matches a query
Each (document, query) pair → numerical score in [0,1]
Combination of many factors, including:
– Vector-space similarity (e.g., TF.IDF cosine metric)
– Link-based factors (e.g., PageRank)
– Anchortext of referring pages
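A hedged sketch of a scoring function of this general shape; the 0.7/0.3 weights and helper functions are illustrative assumptions, not any engine's actual formula.

```python
import math

# A weighted mix of TF.IDF cosine similarity and a link-based score,
# as a stand-in for the multi-factor scoring function described above.

def tfidf_cosine(query_vec, doc_vec):
    """Cosine similarity between sparse TF.IDF vectors (term -> weight)."""
    dot = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
    qn = math.sqrt(sum(w * w for w in query_vec.values()))
    dn = math.sqrt(sum(w * w for w in doc_vec.values()))
    return dot / (qn * dn) if qn and dn else 0.0

def relevance_score(query_vec, doc_vec, pagerank):
    """Combined score in [0,1], assuming pagerank is normalized to [0,1]."""
    return 0.7 * tfidf_cosine(query_vec, doc_vec) + 0.3 * pagerank
```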
(Caveat)
Using the scoring function for absolute relevance:
– Normally only used for relative ranking
– Need to craft the scoring function carefully
Measuring Quality
Avg. Quality = Σ_q freq_q × Σ_D (likelihood of viewing D) × (relevance of D to q)

– freq_q: from query logs
– likelihood of viewing D = ViewProb(Rank(D, q)): ViewProb is estimated from usage logs; Rank(D, q) is computed by the scoring function over the (possibly stale) repository
– relevance of D to q: computed by the scoring function over the "live" copy of D
Lessons from Quality Metric
ViewProb(r) monotonically nonincreasing ⇒ quality is maximized when the ranking function orders documents in descending order of relevance

Out-of-date repository: scrambles the ranking ⇒ lowers quality

Avg. Quality = Σ_q freq_q × Σ_D ViewProb(Rank(D, q)) × (relevance of D to q)

Let ΔQ_D = loss in quality due to inaccurate information about D; equivalently, the improvement in quality if we (re)download D
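A simplified stand-in for ΔQ_D (the paper derives a more careful approximation): sum, over queries, the change in D's expected contribution to quality when its rank and relevance are brought up to date.

```python
# Hedged sketch of ΔQ_D, using the unnormalized r^-1.5 view model.
# All numbers and the per-query tuple layout are illustrative.

def delta_q(per_query):
    """per_query: (freq_q, stale_rank, live_rank, stale_rel, live_rel)."""
    view = lambda r: r ** -1.5
    total = 0.0
    for freq, stale_rank, live_rank, stale_rel, live_rel in per_query:
        total += freq * (view(live_rank) * live_rel -
                         view(stale_rank) * stale_rel)
    return abs(total)

# Doc moves from rank 40 to rank 3 on one common query:
print(delta_q([(120, 40, 3, 0.2, 0.9)]))  # ~20.7
```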
ΔQ_D: Improvement in Quality

[Diagram: REDOWNLOAD replaces the stale repository copy of D with the fresh Web copy; Repository Quality += ΔQ_D]
Download Prioritization
Idea: given ΔQ_D for each document, prioritize (re)downloading accordingly

Q: How to measure ΔQ_D? Two difficulties:
1. Live copy unavailable
2. Given both the "live" and repository copies of D, measuring ΔQ_D may require computing ranks of all documents for all queries

Approach: (1) estimate ΔQ_D for past versions, (2) forecast current ΔQ_D
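A minimal sketch of the prioritization step, assuming a hypothetical forecast_dq map from document to forecasted ΔQ_D:

```python
import heapq

# Spend the crawl budget on the k documents whose redownload is
# forecast to improve quality most. forecast_dq and its entries
# are assumptions for illustration.

def pick_downloads(forecast_dq, k):
    return heapq.nlargest(k, forecast_dq, key=forecast_dq.get)

forecast_dq = {"site-a/page1": 0.8, "site-b/page2": 0.5, "site-c/page3": 0.1}
print(pick_downloads(forecast_dq, 2))  # ['site-a/page1', 'site-b/page2']
```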
Overhead of Estimating ΔQ_D

Estimate while updating the inverted index
Forecast Future ΔQ_D

[Scatter plot: avg. weekly ΔQ_D in the first 24 weeks vs. the second 24 weeks, with top 50%, 80%, and 90% regions marked]
Data: 48 weekly snapshots of 15 web sites sampled from OpenDirectory topics
Queries: AltaVista query log
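A hedged sketch of the forecasting step; the slide only establishes that past weekly ΔQ_D is informative about future ΔQ_D, so the exponentially weighted moving average below is an illustrative choice rather than the paper's model.

```python
# Forecast next-week ΔQ_D from past weekly estimates for one document.
# The smoothing factor alpha is an assumption.

def forecast_delta_q(history, alpha=0.3):
    """history: past weekly ΔQ_D estimates, oldest first (non-empty)."""
    forecast = history[0]
    for dq in history[1:]:
        forecast = alpha * dq + (1 - alpha) * forecast
    return forecast

print(forecast_delta_q([0.20, 0.25, 0.40, 0.35]))  # ~0.29
```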
Summary
Estimate ΔQ_D at index time
Forecast future ΔQ_D
Prioritize downloading according to forecasted ΔQ_D
Overall Effectiveness
Staleness = fraction of out-of-date documents* [Cho et al. 2000]
Embarrassment = probability that a user visits an irrelevant result* [Wolf et al. 2002]
* Used "shingling" to filter out "trivial" changes
Scoring function: PageRank (similar results for TF.IDF)

[Plot: quality (fraction of ideal) vs. resource requirement, comparing Min. Staleness, Min. Embarrassment, and User-Centric]
Reasons for Improvement

Does not rely on the size of a text change to estimate importance
– Example (boston.com): tagged as important by the shingling measure, although it did not match many queries in the workload
Reasons for Improvement
Accounts for "false negatives": does not always ignore frequently-updated pages
– Example (washingtonpost.com): user-centric crawling repeatedly re-downloads this page
Related Work (1/2)
General-purpose Web crawling:
– Min. Staleness [Cho, Garcia-Molina, SIGMOD'00]: maximize average freshness or age for a fixed set of docs.
– Min. Embarrassment [Wolf et al., WWW'02]: maximize weighted avg. freshness for a fixed set of docs.; document weights determined by prob. of "embarrassment"
– [Edwards et al., WWW'01]: maximize average freshness for a growing set of docs.; how to balance new downloads vs. redownloading old docs.
Related Work (2/2)
Focused/topic-specific crawling:
– [Chakrabarti, many others]
– Select a subset of pages that match user interests
– Our work: given a set of pages, decide when to (re)download each based on predicted content shifts + user interests
Summary
Crawling: an optimization problem
Objective: maximize quality as perceived by users
Approach:
– Measure ΔQ_D using query workload and usage logs
– Prioritize downloading based on forecasted ΔQ_D

Various reasons for improvement:
– Accounts for false positives and negatives
– Does not rely on size of text change to estimate importance
– Does not always ignore frequently updated pages
THE END
Paper available at: www.cs.cmu.edu/~olston
Most Closely Related Work
[Wolf et al., WWW'02]:
– Maximize weighted avg. freshness for a fixed set of docs.
– Document weights determined by prob. of "embarrassment"

User-Centric Crawling:
– Which queries are affected by a change, and by how much?
  Change A: significantly alters relevance to several common queries
  Change B: only affects relevance to infrequent queries, and not by much
– Metric penalizes false negatives, e.g., a doc. ranked #1000 for a popular query that should be ranked #2: small embarrassment, but a big loss in quality
Inverted Index
Word       Posting list: DocID (freq)
Cancer     Doc7 (2), Doc9 (1), Doc1 (1)
Seminar    Doc5 (1), Doc1 (1), Doc6 (1)
Symptoms   Doc1 (1), Doc8 (2), Doc4 (3)

Doc1: "Seminar: Cancer Symptoms"
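A minimal Python sketch of such an index; the simple word tokenizer is an assumption.

```python
import re
from collections import Counter, defaultdict

# Each word maps to a posting list of (DocID, frequency) pairs,
# as in the table above.

def build_index(docs):
    """docs: {doc_id: text} -> {word: [(doc_id, freq), ...]}."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for word, freq in Counter(re.findall(r"\w+", text.lower())).items():
            index[word].append((doc_id, freq))
    return index

index = build_index({"Doc1": "Seminar: Cancer Symptoms"})
print(index["cancer"])  # [('Doc1', 1)]
```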
Updating Inverted Index
Stale Doc1: "Seminar: Cancer Symptoms"
Live Doc1: "Cancer management: how to detect breast cancer"

Cancer posting list: Doc7 (2), Doc9 (1), Doc1 (1) → Doc1 (2)
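Continuing the previous sketch, a hedged sketch of the update: drop the stale document's postings, then insert postings for the live text (reuses the index, re, and Counter from the block above).

```python
def update_document(index, doc_id, new_text):
    # Remove the stale postings for this document.
    for word in list(index):
        index[word] = [(d, f) for d, f in index[word] if d != doc_id]
        if not index[word]:
            del index[word]
    # Insert postings for the live text.
    for word, freq in Counter(re.findall(r"\w+", new_text.lower())).items():
        index.setdefault(word, []).append((doc_id, freq))

update_document(index, "Doc1",
                "Cancer management: how to detect breast cancer")
print(index["cancer"])  # [('Doc1', 2)]
```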
Measure ΔQ_D While Updating Index

– Compute previous and new scores of the downloaded document while updating postings
– Maintain an approximate mapping between score and rank for each query term (20 bytes per mapping in our experiments)
– Compute previous and new ranks (approximately) using the computed scores and the score-to-rank mapping
– Measure ΔQ_D using the previous and new ranks (by applying an approximate function derived in the paper)
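A hedged sketch of the score-to-rank idea: per query term, keep a few (score, rank) sample points and snap a score to a rank. The paper's compact ~20-byte encoding differs; this shows only the idea.

```python
import bisect

def approx_rank(samples, score):
    """samples: (score, rank) pairs with scores strictly decreasing."""
    keys = [-s for s, _ in samples]                # ascending for bisect
    i = min(bisect.bisect_left(keys, -score), len(samples) - 1)
    return samples[i][1]

samples = [(0.9, 1), (0.6, 10), (0.3, 100), (0.1, 1000)]
print(approx_rank(samples, 0.95), approx_rank(samples, 0.65))  # 1 10
```

With the previous and new ranks in hand, ΔQ_D can then be measured with the view-probability model, as in the earlier ΔQ_D sketch.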
Out-of-date Repository
[Diagram: fresh Web copy of D vs. stale repository copy of D]