User-Centric Web Crawling
Sandeep Pandey & Christopher Olston
Carnegie Mellon University


Page 1: User-Centric Web Crawling

Sandeep Pandey & Christopher Olston

Carnegie Mellon University

Page 2: Web Crawling

One important application (our focus): search
– Topic-specific search engines + general-purpose ones

[Diagram: a crawler downloads pages from the WWW into a repository; an index built over the repository answers users' search queries]

Page 3: Out-of-date Repository

Web is always changing [Arasu et al., TOIT'01]
– 23% of Web pages change daily
– 40% of commercial Web pages change daily

Many problems may arise due to an out-of-date repository
– Hurts both precision and recall

Page 4: Web Crawling Optimization Problem

Not enough resources to (re)download every web document every day/hour
– Must pick and choose: an optimization problem

Others: objective function = avg. freshness or age
Our goal: focus directly on impact on users

[Same diagram as Page 2: crawler, WWW, repository, index, search queries, user]

Page 5: Web Search User Interface

1. User enters keywords

2. Search engine returns ranked list of results

3. User visits subset of results

[Illustration: ranked list of result documents (1, 2, 3, 4, …)]

Page 6: Objective: Maximize Repository Quality (as perceived by users)

Suppose a user issues search query q:

Quality_q = Σ_{documents D} (likelihood of viewing D) × (relevance of D to q)

Given a workload W of user queries:

Average quality = (1/K) × Σ_{queries q ∈ W} (freq_q × Quality_q)
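Read as code, the metric might look like the following minimal Python sketch. The view_prob, relevance, and rank_fn helpers are hypothetical, and K is assumed here to be the total query frequency; this is an illustration of the formula, not the authors' implementation.

```python
# Minimal sketch of the quality metric above. view_prob, relevance, and
# rank_fn are hypothetical helpers; K is assumed to be total query frequency.

def quality(q, ranked_docs, view_prob, relevance):
    # Quality_q = sum over docs of P(view D) * relevance(D, q),
    # where P(view) depends on D's rank in the result list.
    return sum(view_prob(rank) * relevance(doc, q)
               for rank, doc in enumerate(ranked_docs, start=1))

def average_quality(workload, rank_fn, view_prob, relevance):
    # workload: list of (query, freq) pairs; rank_fn(q) returns ranked docs.
    k = sum(freq for _, freq in workload)  # normalizer (assumption)
    return sum(freq * quality(q, rank_fn(q), view_prob, relevance)
               for q, freq in workload) / k
```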

Page 7: Viewing Likelihood

[Plot: probability of viewing vs. rank (ranks 1–150); view probability falls steeply with rank]

• Depends primarily on rank in list [Joachims KDD'02]

• From AltaVista data [Lempel et al. WWW'03]:
ViewProbability(r) ∝ r^(-1.5)
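As a quick illustration of that power law, here is a small Python sketch that normalizes r^(-1.5) weights over the top n ranks; the fixed cutoff n is an assumption for illustration, not from the slide.

```python
# Sketch: view probability following the AltaVista power law,
# ViewProbability(r) ∝ r^(-1.5), normalized over the top n ranks
# (the fixed cutoff n is an assumption for illustration).

def view_probabilities(n, exponent=-1.5):
    weights = [r ** exponent for r in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

probs = view_probabilities(100)
print(probs[0] / probs[9])  # rank 1 is viewed ~30x more often than rank 10
```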

Page 8: Relevance Scoring Function

Search engines' internal notion of how well a document matches a query

Each (document, query) pair → numerical score in [0, 1]

Combination of many factors, including:
– Vector-space similarity (e.g., TF.IDF cosine metric)
– Link-based factors (e.g., PageRank)
– Anchortext of referring pages
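One of the listed factors, TF.IDF cosine similarity, can be sketched in a few lines of Python. This is illustrative only; production scoring functions combine many more signals, and the idf table here is assumed to be given.

```python
# Sketch: TF.IDF cosine similarity, one relevance factor from the list
# above. The idf table is assumed to be precomputed from the corpus.
import math
from collections import Counter

def tfidf_vector(terms, idf):
    tf = Counter(terms)
    return {t: count * idf.get(t, 0.0) for t, count in tf.items()}

def cosine_similarity(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0
```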

Page 9: (Caveat)

Using the scoring function for absolute relevance
– Normally only used for relative ranking
– Need to craft the scoring function carefully

Page 10: Measuring Quality

Avg. Quality = Σ_q (freq_q × Σ_D (ViewProb( Rank(D, q) ) × (relevance of D to q)))

where the pieces come from:
– freq_q: query logs
– ViewProb( Rank(D, q) ): viewing likelihood from usage logs; Rank(D, q) computed by the scoring function over the (possibly stale) repository
– relevance of D to q: scoring function over the “live” copy of D

Page 11: Lessons from Quality Metric

ViewProb(r) is monotonically nonincreasing ⇒ quality is maximized when the ranking function orders documents in descending order of relevance

An out-of-date repository scrambles the ranking ⇒ lowers quality

Avg. Quality = Σ_q (freq_q × Σ_D (ViewProb( Rank(D, q) ) × (relevance of D to q)))

Let ΔQD = loss in quality due to inaccurate information about D; alternatively, the improvement in quality if we (re)download D
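In code terms, ΔQD is the difference in average quality between ranking with the stale copy of D and ranking with the fresh one. A hypothetical sketch, reusing the average_quality helper from the earlier example:

```python
# Sketch: ΔQ_D as the quality gained by (re)downloading D. rank_stale and
# rank_fresh are hypothetical ranking functions differing only in which
# copy of D (stale vs. fresh) they index; relevance is judged on live copies.
# Depends on average_quality from the earlier sketch.

def delta_q(workload, rank_stale, rank_fresh, view_prob, relevance):
    return (average_quality(workload, rank_fresh, view_prob, relevance)
            - average_quality(workload, rank_stale, view_prob, relevance))
```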

Page 12: ΔQD: Improvement in Quality

[Diagram: redownloading replaces the stale repository copy of D with the fresh Web copy of D; Repository Quality += ΔQD]

Page 13: Download Prioritization

Idea: given ΔQD for each doc., prioritize (re)downloading accordingly

Q: How to measure ΔQD?

Two difficulties:
1. Live copy unavailable
2. Given both the “live” and repository copies of D, measuring ΔQD may require computing ranks of all documents for all queries

Approach: (1) estimate ΔQD for past versions, (2) forecast current ΔQD

Page 14: Overhead of Estimating ΔQD

Estimate ΔQD while updating the inverted index

Page 15: Forecast Future ΔQD

[Scatter plot: avg. weekly ΔQD over the first 24 weeks vs. the second 24 weeks, with regions marking the top 50%, 80%, and 90% of documents]

Data: 48 weekly snapshots of 15 web sites sampled from OpenDirectory topics

Queries: AltaVista query log
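The paper derives its own forecasting model; purely as a stand-in, a simple exponentially weighted moving average over past weekly ΔQD estimates would look like this:

```python
# Stand-in sketch: forecast next-week ΔQ_D by exponentially weighting past
# weekly estimates, then prioritize downloads by the forecast. This EWMA
# is an assumption for illustration, not the paper's model.

def forecast_dq(history, alpha=0.5):
    # history: past weekly ΔQ_D values, oldest first.
    estimate = history[0]
    for dq in history[1:]:
        estimate = alpha * dq + (1 - alpha) * estimate
    return estimate

def prioritize(histories):
    # histories: dict of doc -> list of past ΔQ_D values.
    return sorted(histories, key=lambda d: forecast_dq(histories[d]),
                  reverse=True)
```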

Page 16: Summary

Estimate ΔQD at index time

Forecast future ΔQD

Prioritize downloading according to forecasted ΔQD

Page 17: Overall Effectiveness

Staleness = fraction of out-of-date documents* [Cho et al. 2000]

Embarrassment = probability that a user visits an irrelevant result* [Wolf et al. 2002]

* Used “shingling” to filter out “trivial” changes

Scoring function: PageRank (similar results for TF.IDF)

[Plot: quality (fraction of ideal) vs. resource requirement for the Min. Staleness, Min. Embarrassment, and User-Centric policies]

Page 18: Reasons for Improvement

Example: boston.com
– Does not rely on size of text change to estimate importance
– This page was tagged as important by the shingling measure, although it did not match many queries in the workload

Page 19: Reasons for Improvement

Example: washingtonpost.com
– Accounts for “false negatives”
– Does not always ignore frequently-updated pages
– User-centric crawling repeatedly re-downloads this page

Page 20: Related Work (1/2)

General-purpose Web crawling:
– Min. Staleness [Cho, Garcia-Molina, SIGMOD'00]: maximize average freshness or age for a fixed set of docs.
– Min. Embarrassment [Wolf et al., WWW'02]: maximize weighted avg. freshness for a fixed set of docs.; document weights determined by probability of “embarrassment”
– [Edwards et al., WWW'01]: maximize average freshness for a growing set of docs.; how to balance new downloads vs. redownloading old docs.

Page 21: Related Work (2/2)

Focused/topic-specific crawling [Chakrabarti, many others]
– Select a subset of pages that match user interests
– Our work: given a set of pages, decide when to (re)download each based on predicted content shifts + user interests

Page 22: Summary

Crawling: an optimization problem

Objective: maximize quality as perceived by users

Approach:
– Measure ΔQD using query workload and usage logs
– Prioritize downloading based on forecasted ΔQD

Various reasons for improvement:
– Accounts for false positives and negatives
– Does not rely on size of text change to estimate importance
– Does not always ignore frequently-updated pages

Page 23: THE END

Paper available at: www.cs.cmu.edu/~olston

Page 24: Most Closely Related Work

[Wolf et al., WWW'02]:
– Maximize weighted avg. freshness for a fixed set of docs.
– Document weights determined by probability of “embarrassment”

User-Centric Crawling:
– Which queries are affected by a change, and by how much?
  Change A: significantly alters relevance to several common queries
  Change B: only affects relevance to infrequent queries, and not by much
– Metric penalizes false negatives: a doc. ranked #1000 for a popular query that should be ranked #2 causes small embarrassment but a big loss in quality

Page 25: Inverted Index

Word       Posting list: DocID (freq)
Cancer     Doc7 (2), Doc9 (1), Doc1 (1)
Seminar    Doc5 (1), Doc1 (1), Doc6 (1)
Symptoms   Doc1 (1), Doc8 (2), Doc4 (3)

Doc1: “Seminar: Cancer Symptoms”
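A toy in-memory version of this structure in Python (tokenization and layout are simplified assumptions):

```python
# Toy inverted index like the one illustrated above: each word maps to a
# posting list of (doc_id, frequency) pairs. Tokenization is simplified.
from collections import Counter, defaultdict

def build_index(docs):
    # docs: dict of doc_id -> text.
    index = defaultdict(list)
    for doc_id, text in docs.items():
        words = (w.strip(".,:;") for w in text.lower().split())
        for word, freq in Counter(words).items():
            index[word].append((doc_id, freq))
    return index

index = build_index({"Doc1": "Seminar: Cancer Symptoms"})
print(index["cancer"])  # [('Doc1', 1)]
```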

Page 26: Updating Inverted Index

Stale Doc1: “Seminar: Cancer Symptoms”
Live Doc1: “Cancer management: how to detect breast cancer”

Posting list update for “Cancer”: Doc7 (2), Doc9 (1), Doc1 (1) → Doc1 (2)
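A toy continuation of the build_index sketch above, replacing Doc1's postings when the fresh copy arrives (real index maintenance is batched and far more involved):

```python
# Toy posting-list update: drop doc_id's old postings, insert postings
# from the fresh copy. Continues the build_index sketch above.
from collections import Counter

def update_document(index, doc_id, new_text):
    for word in list(index):  # remove stale postings
        index[word] = [(d, f) for d, f in index[word] if d != doc_id]
        if not index[word]:
            del index[word]
    words = (w.strip(".,:;") for w in new_text.lower().split())
    for word, freq in Counter(words).items():  # add fresh postings
        index[word].append((doc_id, freq))

update_document(index, "Doc1", "Cancer management: how to detect breast cancer")
print(index["cancer"])  # [('Doc1', 2)]
```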

Page 27: Measure ΔQD While Updating Index

1. Compute previous and new scores of the downloaded document while updating postings

2. Maintain an approximate mapping between score and rank for each query term (20 bytes per mapping in our exps.)

3. Compute previous and new ranks (approximately) using the computed scores and the score-to-rank mapping

4. Measure ΔQD using the previous and new ranks (by applying an approximate function derived in the paper)
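A much-simplified sketch of steps 2–4; the paper's compact 20-byte mapping and its ΔQD approximation differ in detail, so this only shows the shape of the computation:

```python
# Simplified sketch of steps 2-4: an approximate score -> rank mapping per
# query term, and a per-query quality delta from the old and new ranks.
# The paper's compact mapping and exact ΔQ_D formula differ in detail.
import bisect

class ScoreToRank:
    def __init__(self, scores):
        # All document scores for one term; stored negated and ascending
        # so bisect can count how many documents score strictly higher.
        self._neg = sorted(-s for s in scores)

    def rank(self, score):
        return bisect.bisect_left(self._neg, -score) + 1

def quality_delta(old_score, new_score, s2r, view_prob, relevance):
    # Change in expected viewing value for this (doc, query) pair.
    return (view_prob(s2r.rank(new_score))
            - view_prob(s2r.rank(old_score))) * relevance
```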

Page 28: Out-of-date Repository

[Diagram: Web copy of D (fresh) vs. repository copy of D (stale)]