TRANSCRIPT
User-Centric Web Crawling
Sandeep Pandey & Christopher Olston
Carnegie Mellon University
Web Crawling
One important application (our focus): search
– Topic-specific search engines + general-purpose ones
[Diagram: the crawler fetches pages from the WWW into a repository; an index built over the repository serves users' search queries]
Out-of-date Repository
The Web is always changing [Arasu et al., TOIT'01]:
– 23% of Web pages change daily
– 40% of commercial Web pages change daily

Many problems arise from an out-of-date repository:
– Hurts both precision and recall
Web Crawling Optimization Problem
Not enough resources to (re)download every web document every day/hour
– Must pick and choose ⇒ an optimization problem

Others: objective function = avg. freshness or age
Our goal: focus directly on impact on users
Web Search User Interface
1. User enters keywords
2. Search engine returns ranked list of results
3. User visits subset of results
[Illustration: ranked list of result documents 1, 2, 3, 4, …]
Objective: Maximize Repository Quality (as perceived by users)
Suppose a user issues search query q:
Quality_q = Σ_{documents D} (likelihood of viewing D) × (relevance of D to q)
Given a workload W of user queries:
Average quality = (1/K) × Σ_{queries q ∈ W} (freq_q × Quality_q)
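To make the metric concrete, here is a minimal Python sketch with toy numbers; treating K as the total query frequency, and all of the values shown, are assumptions for illustration.

```python
# Minimal sketch of the quality metric above, with made-up numbers.
# K is taken to be the total query frequency (an assumption).

def quality(view_prob, relevance):
    """Quality_q = sum over documents D of P(view D) x relevance(D, q)."""
    return sum(v * r for v, r in zip(view_prob, relevance))

# Toy workload: query -> (frequency, per-result view probs, relevances).
workload = {
    "cancer symptoms": (120, [0.5, 0.2, 0.1], [0.9, 0.4, 0.7]),
    "seminar":         (30,  [0.5, 0.2],      [0.6, 0.1]),
}

K = sum(freq for freq, _, _ in workload.values())
avg_quality = sum(
    freq * quality(vp, rel) for freq, vp, rel in workload.values()
) / K
print(round(avg_quality, 3))  # 0.544
```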
Viewing Likelihood
[Plot: probability of viewing vs. rank (0-150); view probability falls steeply with rank]
• Depends primarily on rank in list [Joachims KDD’02]
• From AltaVista data [Lempel et al. WWW’03]:
ViewProbability(r) ∝ r^(-1.5)
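A small sketch of this power-law model; only the r^(-1.5) shape comes from the data above, while normalizing over the top 100 ranks is an assumption.

```python
# Sketch of the rank-based view-probability model reported above.
# Only the power-law shape comes from the slide; the normalization
# over the top 100 ranks is assumed.

def view_probability(rank, num_results=100):
    """Probability of viewing the result at a given rank (1-based)."""
    norm = sum(r ** -1.5 for r in range(1, num_results + 1))
    return (rank ** -1.5) / norm

print(view_probability(1), view_probability(10))  # steep drop-off
```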
Relevance Scoring Function

Search engines' internal notion of how well a document matches a query
Each (document, query) pair → numerical score in [0,1]
Combination of many factors, including:
– Vector-space similarity (e.g., TF.IDF cosine metric)
– Link-based factors (e.g., PageRank)
– Anchortext of referring pages
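A hedged sketch of a scoring function of this general shape; the 0.7/0.3 weights and helper functions are illustrative assumptions, not any engine's actual formula.

```python
import math

# A weighted mix of TF.IDF cosine similarity and a link-based score,
# as a stand-in for the multi-factor scoring function described above.

def tfidf_cosine(query_vec, doc_vec):
    """Cosine similarity between sparse TF.IDF vectors (term -> weight)."""
    dot = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
    qn = math.sqrt(sum(w * w for w in query_vec.values()))
    dn = math.sqrt(sum(w * w for w in doc_vec.values()))
    return dot / (qn * dn) if qn and dn else 0.0

def relevance_score(query_vec, doc_vec, pagerank):
    """Combined score in [0,1], assuming pagerank is normalized to [0,1]."""
    return 0.7 * tfidf_cosine(query_vec, doc_vec) + 0.3 * pagerank
```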
(Caveat)
Using the scoring function for absolute relevance:
– Normally only used for relative ranking
– Need to craft the scoring function carefully
Measuring Quality
Avg. Quality = Σ_q freq_q × Σ_D (likelihood of viewing D) × (relevance of D to q)

– freq_q: from query logs
– likelihood of viewing D = ViewProb(Rank(D, q)): ViewProb is estimated from usage logs; Rank(D, q) is computed by the scoring function over the (possibly stale) repository
– relevance of D to q: computed by the scoring function over the "live" copy of D
Lessons from Quality Metric
ViewProb(r) monotonically nonincreasing ⇒ quality is maximized when the ranking function orders documents in descending order of relevance

Out-of-date repository: scrambles the ranking ⇒ lowers quality

Avg. Quality = Σ_q freq_q × Σ_D ViewProb(Rank(D, q)) × (relevance of D to q)

Let ΔQ_D = loss in quality due to inaccurate information about D; equivalently, the improvement in quality if we (re)download D
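A simplified stand-in for ΔQ_D (the paper derives a more careful approximation): sum, over queries, the change in D's expected contribution to quality when its rank and relevance are brought up to date.

```python
# Hedged sketch of ΔQ_D, using the unnormalized r^-1.5 view model.
# All numbers and the per-query tuple layout are illustrative.

def delta_q(per_query):
    """per_query: (freq_q, stale_rank, live_rank, stale_rel, live_rel)."""
    view = lambda r: r ** -1.5
    total = 0.0
    for freq, stale_rank, live_rank, stale_rel, live_rel in per_query:
        total += freq * (view(live_rank) * live_rel -
                         view(stale_rank) * stale_rel)
    return abs(total)

# Doc moves from rank 40 to rank 3 on one common query:
print(delta_q([(120, 40, 3, 0.2, 0.9)]))  # ~20.7
```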
ΔQ_D: Improvement in Quality

[Diagram: REDOWNLOAD replaces the stale repository copy of D with the fresh Web copy; Repository Quality += ΔQ_D]
Download Prioritization
Idea: given ΔQ_D for each document, prioritize (re)downloading accordingly

Q: How to measure ΔQ_D? Two difficulties:
1. Live copy unavailable
2. Given both the "live" and repository copies of D, measuring ΔQ_D may require computing ranks of all documents for all queries

Approach: (1) estimate ΔQ_D for past versions, (2) forecast current ΔQ_D
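A minimal sketch of the prioritization step, assuming a hypothetical forecast_dq map from document to forecasted ΔQ_D:

```python
import heapq

# Spend the crawl budget on the k documents whose redownload is
# forecast to improve quality most. forecast_dq and its entries
# are assumptions for illustration.

def pick_downloads(forecast_dq, k):
    return heapq.nlargest(k, forecast_dq, key=forecast_dq.get)

forecast_dq = {"site-a/page1": 0.8, "site-b/page2": 0.5, "site-c/page3": 0.1}
print(pick_downloads(forecast_dq, 2))  # ['site-a/page1', 'site-b/page2']
```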
Overhead of Estimating ΔQ_D

Estimate while updating the inverted index
Forecast Future ΔQ_D

[Scatter plot: avg. weekly ΔQ_D in the first 24 weeks vs. the second 24 weeks, with top 50%, 80%, and 90% regions marked]
Data: 48 weekly snapshots of 15 web sites sampled from OpenDirectory topics
Queries: AltaVista query log
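A hedged sketch of the forecasting step; the slide only establishes that past weekly ΔQ_D is informative about future ΔQ_D, so the exponentially weighted moving average below is an illustrative choice rather than the paper's model.

```python
# Forecast next-week ΔQ_D from past weekly estimates for one document.
# The smoothing factor alpha is an assumption.

def forecast_delta_q(history, alpha=0.3):
    """history: past weekly ΔQ_D estimates, oldest first (non-empty)."""
    forecast = history[0]
    for dq in history[1:]:
        forecast = alpha * dq + (1 - alpha) * forecast
    return forecast

print(forecast_delta_q([0.20, 0.25, 0.40, 0.35]))  # ~0.29
```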
Summary
Estimate ΔQ_D at index time
Forecast future ΔQ_D
Prioritize downloading according to forecasted ΔQ_D
Overall Effectiveness
Staleness = fraction of out-of-date documents* [Cho et al. 2000]
Embarrassment = probability that a user visits an irrelevant result* [Wolf et al. 2002]
* Used "shingling" to filter out "trivial" changes
Scoring function: PageRank (similar results for TF.IDF)

[Plot: quality (fraction of ideal) vs. resource requirement, comparing Min. Staleness, Min. Embarrassment, and User-Centric]
Reasons for Improvement

Does not rely on the size of a text change to estimate importance
– Example (boston.com): tagged as important by the shingling measure, although it did not match many queries in the workload
Reasons for Improvement
Accounts for "false negatives": does not always ignore frequently-updated pages
– Example (washingtonpost.com): user-centric crawling repeatedly re-downloads this page
Related Work (1/2)
General-purpose Web crawling:
– Min. Staleness [Cho, Garcia-Molina, SIGMOD'00]: maximize average freshness or age for a fixed set of docs.
– Min. Embarrassment [Wolf et al., WWW'02]: maximize weighted avg. freshness for a fixed set of docs.; document weights determined by prob. of "embarrassment"
– [Edwards et al., WWW'01]: maximize average freshness for a growing set of docs.; how to balance new downloads vs. redownloading old docs.
Related Work (2/2)
Focused/topic-specific crawling:
– [Chakrabarti, many others]
– Select a subset of pages that match user interests
– Our work: given a set of pages, decide when to (re)download each based on predicted content shifts + user interests
Summary
Crawling: an optimization problem
Objective: maximize quality as perceived by users
Approach:
– Measure ΔQ_D using query workload and usage logs
– Prioritize downloading based on forecasted ΔQ_D

Various reasons for improvement:
– Accounts for false positives and negatives
– Does not rely on size of text change to estimate importance
– Does not always ignore frequently updated pages
THE END
Paper available at: www.cs.cmu.edu/~olston
Most Closely Related Work
[Wolf et al., WWW'02]:
– Maximize weighted avg. freshness for a fixed set of docs.
– Document weights determined by prob. of "embarrassment"

User-Centric Crawling:
– Which queries are affected by a change, and by how much?
  Change A: significantly alters relevance to several common queries
  Change B: only affects relevance to infrequent queries, and not by much
– Metric penalizes false negatives, e.g., a doc. ranked #1000 for a popular query that should be ranked #2: small embarrassment, but a big loss in quality
Inverted Index
Word       Posting list: DocID (freq)
Cancer     Doc7 (2), Doc9 (1), Doc1 (1)
Seminar    Doc5 (1), Doc1 (1), Doc6 (1)
Symptoms   Doc1 (1), Doc8 (2), Doc4 (3)

Doc1: "Seminar: Cancer Symptoms"
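A minimal Python sketch of such an index; the simple word tokenizer is an assumption.

```python
import re
from collections import Counter, defaultdict

# Each word maps to a posting list of (DocID, frequency) pairs,
# as in the table above.

def build_index(docs):
    """docs: {doc_id: text} -> {word: [(doc_id, freq), ...]}."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for word, freq in Counter(re.findall(r"\w+", text.lower())).items():
            index[word].append((doc_id, freq))
    return index

index = build_index({"Doc1": "Seminar: Cancer Symptoms"})
print(index["cancer"])  # [('Doc1', 1)]
```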
Updating Inverted Index
Stale Doc1: "Seminar: Cancer Symptoms"
Live Doc1: "Cancer management: how to detect breast cancer"

Cancer posting list: Doc7 (2), Doc9 (1), Doc1 (1) → Doc1 (2)
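Continuing the previous sketch, a hedged sketch of the update: drop the stale document's postings, then insert postings for the live text (reuses the index, re, and Counter from the block above).

```python
def update_document(index, doc_id, new_text):
    # Remove the stale postings for this document.
    for word in list(index):
        index[word] = [(d, f) for d, f in index[word] if d != doc_id]
        if not index[word]:
            del index[word]
    # Insert postings for the live text.
    for word, freq in Counter(re.findall(r"\w+", new_text.lower())).items():
        index.setdefault(word, []).append((doc_id, freq))

update_document(index, "Doc1",
                "Cancer management: how to detect breast cancer")
print(index["cancer"])  # [('Doc1', 2)]
```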
Measure ΔQ_D While Updating Index

– Compute previous and new scores of the downloaded document while updating postings
– Maintain an approximate mapping between score and rank for each query term (20 bytes per mapping in our experiments)
– Compute previous and new ranks (approximately) using the computed scores and the score-to-rank mapping
– Measure ΔQ_D using the previous and new ranks (by applying an approximate function derived in the paper)
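A hedged sketch of the score-to-rank idea: per query term, keep a few (score, rank) sample points and snap a score to a rank. The paper's compact ~20-byte encoding differs; this shows only the idea.

```python
import bisect

def approx_rank(samples, score):
    """samples: (score, rank) pairs with scores strictly decreasing."""
    keys = [-s for s, _ in samples]                # ascending for bisect
    i = min(bisect.bisect_left(keys, -score), len(samples) - 1)
    return samples[i][1]

samples = [(0.9, 1), (0.6, 10), (0.3, 100), (0.1, 1000)]
print(approx_rank(samples, 0.95), approx_rank(samples, 0.65))  # 1 10
```

With the previous and new ranks in hand, ΔQ_D can then be measured with the view-probability model, as in the earlier ΔQ_D sketch.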
Out-of-date Repository
[Diagram: fresh Web copy of D vs. stale repository copy of D]