User-Centric Web Crawling*
Christopher Olston, CMU & Yahoo! Research**
* Joint work with Sandeep Pandey
** Work done at Carnegie Mellon



TRANSCRIPT

Page 1:

User-Centric Web Crawling*

Christopher Olston, CMU & Yahoo! Research**

* Joint work with Sandeep Pandey
** Work done at Carnegie Mellon

Page 2:

Distributed Sources of Dynamic Information

[Diagram: sources A, B, and C feed a central monitoring node, subject to resource constraints]

Sources (e.g.):
• Sensors
• Web sites

Central monitoring node:
• Support integrated querying
• Maintain historical archive

Page 3:

Workload-driven Approach

Goal: meet usage needs, while adhering to resource constraints

Tactic: pay attention to workload
• workload = usage + data dynamics

Current focus (this talk): autonomous sources
– Data archival from Web sources [VLDB’04]
– Supporting Web search [WWW’05]

Thesis work: cooperative sources [VLDB’00, SIGMOD’01, SIGMOD’02, SIGMOD’03a, SIGMOD’03b]

Page 4:

Outline

• Introduction: monitoring distributed sources
• User-centric web crawling
  – Model + approach
  – Empirical results
  – Related & future work

Page 5:

Web Crawling to Support Search

[Diagram: a crawler copies pages from web sites A, B, and C, under a resource constraint, into the search engine’s repository; the repository is indexed to answer users’ search queries]

Q: Given a full repository, when to refresh each page?

Page 6:

Approach

We face an optimization problem.

Others:
– Maximize freshness, age, or similar
– Boolean model of document change

Our approach:
– User-centric optimization objective
– Rich notion of document change, attuned to the user-centric objective

Page 7:

Web Search User Interface

1. User enters keywords
2. Search engine returns ranked list of results
3. User visits subset of results

[Diagram: ranked result list (1, 2, 3, 4, …) with links to the underlying documents]

Page 8:

Objective: Maximize Repository Quality, from Search Perspective

Suppose a user issues search query q:

Quality_q = Σ_documents d ((likelihood of viewing d) × (relevance of d to q))

Given a workload W of user queries:

Average quality = 1/K × Σ_queries q∈W (freq_q × Quality_q)
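To make the metric concrete, here is a minimal Python sketch (not from the talk; all function and variable names are illustrative) that computes Quality_q and the workload-weighted average, given a ranking per query, per-document relevance scores, and a view-probability curve:

```python
def quality(ranked_docs, relevance, view_prob):
    """Quality_q: sum over documents of P(view) x relevance to the query."""
    return sum(view_prob(rank) * relevance[d]
               for rank, d in enumerate(ranked_docs, start=1))

def average_quality(workload, rankings, relevances, view_prob):
    """Workload-weighted average quality.

    workload:   dict query -> frequency (freq_q)
    rankings:   dict query -> list of doc ids in ranked order
    relevances: dict query -> dict doc id -> relevance in [0, 1]
    Here the normalizer K is taken to be the total query frequency (an assumption).
    """
    total_freq = sum(workload.values())
    return sum(freq * quality(rankings[q], relevances[q], view_prob)
               for q, freq in workload.items()) / total_freq
```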

Page 9:

Viewing Likelihood

[Plot: probability of viewing a result vs. its rank in the result list; the probability drops off steeply with rank]

• Depends primarily on rank in list [Joachims KDD’02]
• From AltaVista data [Lempel et al. WWW’03]:

ViewProbability(r) ∝ r^(–1.5)
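For use in sketches like the one above, a power-law view-probability curve of this shape can be written as follows (the normalization and rank cutoff are illustrative assumptions; only the r^(–1.5) shape comes from the slide):

```python
def make_view_prob(max_rank=1000, exponent=-1.5):
    """Return ViewProbability(r) proportional to r**exponent over ranks 1..max_rank."""
    weights = [r ** exponent for r in range(1, max_rank + 1)]
    z = sum(weights)  # normalize so the probabilities sum to 1 (an assumption)
    def view_prob(rank):
        return weights[rank - 1] / z if 1 <= rank <= max_rank else 0.0
    return view_prob
```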

Page 10:

Relevance Scoring Function

Search engines’ internal notion of how well a document matches a query

• Each document/query pair is assigned a numerical score in [0,1]
• Combination of many factors, e.g.:
  – Vector-space similarity (e.g., TF.IDF cosine metric)
  – Link-based factors (e.g., PageRank)
  – Anchortext of referring pages
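As one concrete example of such a factor, here is a minimal TF.IDF cosine-similarity sketch (purely illustrative; a textbook formulation, not the scoring function used in the experiments):

```python
import math
from collections import Counter

def tfidf_cosine(query_terms, doc_terms, doc_freq, num_docs):
    """Cosine similarity between TF.IDF vectors of a query and a document.

    doc_freq: dict term -> number of documents containing the term
    num_docs: total number of documents in the collection
    """
    def vec(terms):
        tf = Counter(terms)
        return {t: (1 + math.log(c)) * math.log(num_docs / (1 + doc_freq.get(t, 0)))
                for t, c in tf.items()}
    q, d = vec(query_terms), vec(doc_terms)
    dot = sum(q[t] * d.get(t, 0.0) for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0
```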

Page 11:

(Caveat)

We use the scoring function for absolute relevance (normally it is only used for relative ranking)
– Need to ensure the scoring function has meaning on an absolute scale
  • Probabilistic IR models, PageRank: okay
  • Unclear whether TF.IDF does (still debated, I believe)

Bottom line: a stricter interpretability requirement than “good relative ordering”

Page 12:

Measuring Quality

Avg. Quality = Σ_q (freq_q × Σ_d ((likelihood of viewing d) × (relevance of d to q)))

where:
• freq_q is taken from query logs
• likelihood of viewing d = ViewProb( Rank(d, q) ), with Rank(d, q) computed by the scoring function over the (possibly stale) repository and ViewProb estimated from usage logs
• relevance of d to q is computed by the scoring function over the “live” copy of d

Page 13:

Lessons from Quality Metric

Avg. Quality = Σ_q (freq_q × Σ_d (ViewProb( Rank(d, q) ) × Relevance(d, q)))

• ViewProb(r) is monotonically nonincreasing, so quality is maximized when the ranking function orders documents in descending order of true relevance
• An out-of-date repository scrambles the ranking, which lowers quality

Let ΔQ_D = loss in quality due to inaccurate information about D
(equivalently, the improvement in quality if we (re)download D)

Page 14:

ΔQ_D: Improvement in Quality

[Diagram: re-downloading replaces the stale repository copy of D with the fresh Web copy of D; repository quality increases by ΔQ_D]

Page 15:

Formula for Quality Gain (ΔQ_D)

Re-download document D at time t.

Quality beforehand:
Q(t–) = Σ_q (freq_q × Σ_d (ViewProb( Rank_t–(d, q) ) × Relevance(d, q)))

Quality after re-download:
Q(t) = Σ_q (freq_q × Σ_d (ViewProb( Rank_t(d, q) ) × Relevance(d, q)))

Quality gain:
ΔQ_D(t) = Q(t) – Q(t–) = Σ_q (freq_q × Σ_d (ΔVP(d, q) × Relevance(d, q)))
where ΔVP(d, q) = ViewProb( Rank_t(d, q) ) – ViewProb( Rank_t–(d, q) )
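In terms of the earlier quality sketch, this gain can be computed directly by differencing quality before and after the re-download (illustrative; the talk avoids this brute-force computation in practice, as the next slides discuss):

```python
def quality_gain(workload, rankings_before, rankings_after, relevances, view_prob):
    """ΔQ_D(t) = Q(t) - Q(t-): quality change from re-downloading one document.

    rankings_before / rankings_after: dict query -> ranked doc ids, computed with
    the stale vs. refreshed copy of D. Reuses average_quality() from the earlier sketch.
    """
    return (average_quality(workload, rankings_after, relevances, view_prob)
            - average_quality(workload, rankings_before, relevances, view_prob))
```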

Page 16:

Download Prioritization

Idea: given ΔQ_D for each document, prioritize (re)downloading accordingly

Three difficulties:
1. ΔQ_D depends on the order of downloading
2. Given both the “live” and repository copies of D, measuring ΔQ_D is computationally expensive
3. The live copy is usually unavailable
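The prioritization step itself is simple; a minimal sketch (illustrative scheduling only, not the crawler’s actual implementation):

```python
import heapq

def schedule_downloads(forecast_delta_q, budget):
    """Pick the `budget` documents with the largest forecast quality gain ΔQ_D.

    forecast_delta_q: dict doc id -> forecast ΔQ_D
    """
    return heapq.nlargest(budget, forecast_delta_q, key=forecast_delta_q.get)
```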

Page 17:

Difficulty 1: Order of Downloading Matters

ΔQ_D depends on the relative rank positions of D; hence, ΔQ_D depends on the order of downloading.

To reduce implementation complexity, avoid tracking inter-document ordering dependencies:
• Assume ΔQ_D is independent of the downloading of other documents

ΔQ_D(t) = Σ_q (freq_q × Σ_d (ΔVP(d, q) × Relevance(d, q)))
where ΔVP(d, q) = ViewProb( Rank_t(d, q) ) – ViewProb( Rank_t–(d, q) )

Page 18:

Difficulty 3: Live Copy Unavailable

• Take measurements upon re-downloading D (the live copy is available at that time)
• Use forecasting techniques to project forward

[Diagram: timeline of past re-downloads with measured ΔQ_D(t1), ΔQ_D(t2), …, used to forecast ΔQ_D(t_now)]
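The slide does not name a specific forecasting technique; purely as an illustration, an exponentially weighted moving average over past measurements could be used:

```python
def forecast_delta_q(past_measurements, alpha=0.3):
    """Forecast the next ΔQ_D from past measurements (oldest first).

    The exponentially weighted moving average, and the value of alpha, are
    assumptions for illustration; the talk only says "forecasting techniques".
    """
    if not past_measurements:
        return 0.0
    estimate = past_measurements[0]
    for value in past_measurements[1:]:
        estimate = alpha * value + (1 - alpha) * estimate
    return estimate
```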

Page 19:

Ability to Forecast ΔQ_D

[Scatter plot: average weekly ΔQ_D (log scale) over the first 24 weeks vs. the second 24 weeks, with the top 50%, 80%, and 90% of documents marked]

• Data: 15 web sites sampled from OpenDirectory topics
• Queries: AltaVista query log
• Docs downloaded once per week, in random order

Page 20:

Strategy So Far

• Measure the shift in quality (ΔQ_D) each time document D is re-downloaded
• Forecast future ΔQ_D
  – Treat each document D independently
• Prioritize re-downloading by ΔQ_D

Remaining difficulty:
2. Given both the “live” and repository copies of D, measuring ΔQ_D is computationally expensive

Page 21:

Difficulty 2: Metric Expensive to Compute

Example: the “live” copy of D becomes less relevant to query q than before
• Now D is ranked too high
• Some users visit D in lieu of Y, which is more relevant
• Result: less-than-ideal quality

Results for q:
  Actual    Ideal
  1. X      1. X
  2. D      2. Y
  3. Y      3. Z
  4. Z      4. D

One problem: upon re-downloading D, measuring the quality gain requires knowing the relevance of Y and Z, i.e., measurements of other documents are required.

Solution: estimate! Use approximate relevance→rank mapping functions, fit in advance for each query.

Page 22:

Estimation Procedure (DETAIL)

Focus on a single query q (later we’ll see how to sum across all affected queries).

Let F_q(rel) be the relevance→rank mapping for q
– We use a piecewise linear function in log-log space
– Let r1 = D’s old rank (r1 = F_q(Rel(D_old, q))) and r2 = D’s new rank
– Use an integral approximation of the summation

ΔQ_{D,q} = Σ_d (ΔViewProb(d, q) × Relevance(d, q))
         = ΔVP(D, q) × Rel(D, q) + Σ_{d≠D} (ΔVP(d, q) × Rel(d, q))

where the second term is approximated by
Σ_{d≠D} (ΔVP(d, q) × Rel(d, q)) ≈ Σ_{r=r1+1…r2} ((VP(r–1) – VP(r)) × F_q^(–1)(r))
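A hedged sketch of this per-query estimate, assuming fitted mappings F_q and F_q^(–1) and a view-probability curve are available (the names, the rounding, and the handling of both movement directions are illustrative assumptions):

```python
def estimate_delta_q_for_query(rel_new, rel_old, F_q, F_q_inv, view_prob):
    """Estimate ΔQ_{D,q} when D's relevance to q changes from rel_old to rel_new."""
    r1 = round(F_q(rel_old))   # D's old rank
    r2 = round(F_q(rel_new))   # D's new rank
    # Term for D itself: its view probability shifts from rank r1 to rank r2.
    delta = (view_prob(r2) - view_prob(r1)) * rel_new
    # Documents between the two ranks each shift by one position; their relevance
    # is approximated from the rank via the inverse mapping F_q^(-1).
    lo, hi = min(r1, r2), max(r1, r2)
    sign = 1 if r2 > r1 else -1   # direction handling is an illustrative assumption
    for r in range(lo + 1, hi + 1):
        delta += sign * (view_prob(r - 1) - view_prob(r)) * F_q_inv(r)
    return delta
```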

Page 23:

Where we stand … (DETAIL)

ΔQ_{D,q} = ΔVP(D, q) × Rel(D, q) + Σ_{d≠D} (ΔVP(d, q) × Rel(d, q))

• First term: ΔVP(D, q) ≈ VP( F_q(Rel(D, q)) ) – VP( F_q(Rel(D_old, q)) )
• Second term: ≈ f(Rel(D, q), Rel(D_old, q)), via the approximation on the previous slide

Putting the terms together: ΔQ_{D,q} ≈ g(Rel(D, q), Rel(D_old, q))

Context: ΔQ_D = Σ_q (freq_q × ΔQ_{D,q})

Page 24:

Difficulty 2, continued

Additional problem: must measure the effect of the shift in rank across all queries.

Solution: couple measurements with index-updating operations.

Sketch:
– Basic index unit: posting. Conceptually, a posting is a (term ID, document ID, scoring factors) triple
– Each time a posting is inserted/deleted/updated, compute the old & new relevance contributions from the term/document pair*
– Transform using the estimation procedure, and accumulate across the postings touched to get ΔQ_D

* assumes the scoring function treats term/document pairs independently

Page 25:

Background: Text Indexes (DETAIL)

Dictionary → Postings:

  Term   # docs   Total freq   Postings (Doc #, Freq)
  aid    1        1            (58, 1)
  all    2        2            (37, 1), (62, 1)
  cold   1        1            (15, 1)
  duck   1        2            (41, 2)

Basic index unit: posting
– One posting for each term/document pair
– Contains information needed for the scoring function (number of occurrences, font size, etc.)
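A minimal sketch of such an index structure (field names are illustrative; real postings carry whatever scoring factors the engine needs):

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Posting:
    doc_id: int
    freq: int   # occurrences of the term in the document; other scoring factors go here

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(list)   # term -> list of Postings

    def add(self, term, doc_id, freq):
        # One posting per term/document pair.
        self.postings[term].append(Posting(doc_id, freq))

index = InvertedIndex()
index.add("duck", 41, 2)
```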

Page 26:

Pre-Processing: Approximate the Workload (DETAIL)

Break multi-term queries into a set of single-term queries
– Now, term = query
– Index has one posting for each query/document pair

[Diagram: the same dictionary/postings structure as on the previous slide, with each dictionary term now treated as a query]
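A minimal sketch of this pre-processing step (illustrative only):

```python
from collections import Counter

def approximate_workload(query_log):
    """Split each multi-term query into single-term queries and tally freq_q.

    query_log: iterable of query strings, e.g. ["cold duck", "duck"].
    """
    freq = Counter()
    for query in query_log:
        for term in query.lower().split():
            freq[term] += 1
    return freq   # term -> frequency, used as the per-query weight freq_q
```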

Page 27:

Taking Measurements During Index Maintenance (DETAIL)

While updating the index:
– Initialize a bank of ΔQ_D accumulators, one per document (actually, materialized on demand using a hash table)
– Each time a posting is inserted/deleted/updated:
  • Compute the new & old relevance contributions for the query/document pair: Rel(D, q), Rel(D_old, q)
  • Compute ΔQ_{D,q} using the estimation procedure and add it to the accumulator: ΔQ_D += freq_q × g(Rel(D, q), Rel(D_old, q))
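A minimal sketch of the accumulator bank (illustrative; g stands for the per-query quality-gain estimate from the estimation procedure above):

```python
from collections import defaultdict

class QualityGainAccumulator:
    """Accumulates ΔQ_D per document as postings are inserted/deleted/updated."""

    def __init__(self, workload_freq, g):
        self.freq = workload_freq          # query (term) -> freq_q
        self.g = g                         # g(rel_new, rel_old) -> ΔQ_{D,q}
        self.delta_q = defaultdict(float)  # doc id -> accumulated ΔQ_D (materialized on demand)

    def on_posting_update(self, doc_id, term, rel_new, rel_old):
        # Called once per posting touched during index maintenance.
        self.delta_q[doc_id] += self.freq.get(term, 0) * self.g(rel_new, rel_old)
```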

Page 28:

Measurement Overhead

Caveat: does not handle factors that do not depend on a single term/document pair, e.g., term proximity and anchortext inclusion

Implemented in Lucene

Page 29:

Summary of Approach

• User-centric metric of search repository quality
• (Re)downloading a document improves quality
• Prioritize downloading by expected quality gain
• Metric adaptations to enable a feasible and efficient implementation

Page 30:

Next: Empirical Results

• Introduction: monitoring distributed sources
• User-centric web crawling
  – Model + approach
  – Empirical results
  – Related & future work

Page 31:

Overall Effectiveness

• Staleness = fraction of out-of-date documents* [Cho et al. 2000]
• Embarrassment = probability that a user visits an irrelevant result* [Wolf et al. 2002]
  * Used “shingling” to filter out “trivial” changes
• Scoring function: PageRank (similar results for TF.IDF)

[Plot: quality (fraction of ideal) vs. resource requirement, comparing Min. Staleness, Min. Embarrassment, and User-Centric crawling]

Page 32:

Reasons for Improvement (boston.com example)

• Tagged as important by staleness- and embarrassment-based techniques, although it did not match many queries in the workload
• User-centric crawling does not rely on the size of the text change to estimate importance

Page 33:

Reasons for Improvement (washingtonpost.com example)

• Accounts for “false negatives”: does not always ignore frequently-updated pages
• User-centric crawling repeatedly re-downloads this page

Page 34:

Related Work (1/2)

General-purpose web crawling
– [Cho, Garcia-Molina, SIGMOD’00], [Edwards et al., WWW’01]
– Maximize average freshness or age
– Balance new downloads vs. re-downloading old documents

Focused/topic-specific crawling
– [Chakrabarti, many others]
– Select a subset of documents that match user interests
– Our work: given a set of documents, decide when to (re)download each

Page 35:

Most Closely Related Work

[Wolf et al., WWW’02]:
– Maximize weighted average freshness
– Document weight = probability of “embarrassment” if not fresh

User-Centric Crawling:
– Measures the interplay between update and query workloads: when document X is updated, which queries are affected by the update, and by how much?
– Metric penalizes false negatives: a document ranked #1000 for a popular query that should be ranked #2 causes small embarrassment but a big loss in quality

Page 36:

Future Work: Detecting Change-Rate Changes

• Current techniques schedule monitoring to exploit existing change-rate estimates (e.g., ΔQ_D)
• There is no provision to explicitly explore change rates
• Explore/exploit tradeoff
  – Ongoing work on a Bandit Problem formulation
• Bad case: estimated change rate = 0, so the page is never monitored
  – We won’t notice a future increase in its change rate
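The bandit formulation itself is ongoing work in the talk; purely as an illustration of the explore/exploit idea (not the authors’ method), an ε-greedy scheduler would occasionally revisit pages with low change-rate estimates:

```python
import random

def pick_page_to_refresh(forecast_delta_q_by_page, epsilon=0.1):
    """Usually exploit the page with the best forecast ΔQ_D; with probability
    epsilon, pick a random page so a page estimated never to change still
    gets monitored occasionally. Illustrative only; not the talk's formulation."""
    pages = list(forecast_delta_q_by_page)
    if random.random() < epsilon:
        return random.choice(pages)                        # explore
    return max(pages, key=forecast_delta_q_by_page.get)    # exploit
```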

Page 37:

Summary

Approach:
– User-centric metric of search engine quality
– Schedule downloading to maximize quality

Empirical results:
– High quality with few downloads
– Good at picking the “right” documents to re-download