Monitoring the dynamic Web to respond to Continuous Queries Sandeep Pandey Krithi Ramamritham Soumen Chakrabarti IIT Bombay www.cse.iitb.ac.in/laiir/

Post on 21-Dec-2015


Monitoring the dynamic Web to respond to Continuous Queries

Sandeep Pandey, Krithi Ramamritham, Soumen Chakrabarti

IIT Bombay
www.cse.iitb.ac.in/laiir/

2

Motivation

Web pages change rapidly (Sethuraman et al.)
• 40% of commercial pages change per day
• 23% of all pages change per day
Current search engine users
• Need to repeat queries (how often?) and
• Diff results with recent versions
• Or poll frequently updated collections (e.g., Google news)

3

Continuous Queries (CQ)

Users register long-lived queries of interest
Pages of interest may be added, modified, and deleted
System continually updates responses
Example applications
• Commuter updates: traffic and weather conditions
• Alerts on cricket scores, stock portfolios

4

Discrete vs. continuous queries

Discrete queries
• Query lives for an “instant”, one-shot answer
• Optimize corpus freshness at all times
• Objective penalizes delay from update to refresh
• Usually handled by bulk crawls with diverse periods

Continuous queries
• Queries have positive lifetime, many updates over time
• Updates must track changes closely
• Objective penalizes number or importance of missed updates
• Dynamic monitoring with more restrictive network resources

5

Talk outline

Introduction and motivation
Previous approaches
Our contributions
• Continuous Adaptive Monitoring (CAM)
• How to allocate limited polling resources among pages
• How to schedule poll instants
Experiments
Conclusion

6

Related work

CONQUER and WebCQ (Liu, Pu and Tang)
• Query language and architecture for CQ
• Do not address monitoring for freshness optimization
NIAGARA (DeWitt and Naughton)
• Query evaluation and optimization techniques
• Database query optimization setting
ChangeDetector (Boyapati et al.)
• Fixed-priority polling for given set of pages
Freshness for discrete queries
• Poisson updates (Cho and Garcia-Molina)
• Quasi-deterministic and other distributions (Sethuraman, Wolf, Squillante, Yu)

7

Our contributions

New statistical recency objective for CQs
New monitoring framework to fit statistical models of page change behavior
Recency optimization problem constrained by network resources
Two-phase solution to optimization tailored to CQ search systems
• Resource allocation (knapsack)
• Poll scheduling (flow-shop)

8

Continuous Adaptive Monitoring

Planning horizon or “epoch”
Time proceeds in discrete steps {j} over epoch
Each time step j, each page i has probability $\rho_{i,j}$ of an update
• Can capture predictable bursts, periodicity
$\sum_j \rho_{i,j} = \lambda_i$, the expected #updates to page i (“change rate”)
Decision variables $y_{ij}$
• Is page i polled at time step j?
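As a small illustration of the model on this slide, the sketch below stores the per-step update probabilities and the poll decisions as nested lists and derives each page's change rate. The numbers and the names `rho`, `y`, and `change_rate` are assumptions made up for the example, not values from the talk.

```python
# Hypothetical example: 2 pages, an epoch of 4 time steps.
# rho[i][j] is the assumed probability that page i is updated at step j.
rho = [
    [0.1, 0.1, 0.1, 0.1],  # page 0: steady, low update probability
    [0.0, 0.9, 0.0, 0.5],  # page 1: bursty, likely to change at steps 1 and 3
]


def change_rate(rho_i):
    """Expected number of updates to one page over the epoch (sum over j)."""
    return sum(rho_i)


# y[i][j] = 1 if page i is polled at step j, else 0 (the decision variables).
y = [
    [0, 0, 0, 1],
    [0, 1, 0, 1],
]

rates = [change_rate(r) for r in rho]
polls_used = sum(sum(row) for row in y)
```

Here page 1 has the higher change rate, and the illustrative poll plan spends two of its three polls on that page's bursts.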

9

Profit, relevance and importance

Each registered query q has a profit $\pi_q$
Relevance $r_{iq}$ of page i w.r.t. query q
• We use cosine in TFIDF space as in IR
• Other measures (e.g. PageRank) may be integrated
Page i has “importance” $W_i$, a function of
• Currently resident queries and their “profits”
• Relevance of page i to each resident query
Importance

$$W_i = \sum_q \pi_q \, r_{iq}$$
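The importance computation can be sketched as follows, assuming pages and queries are already available as sparse TF-IDF vectors (plain dicts here). The helper names `cosine` and `importance`, the term weights, and the profits are hypothetical and only serve to show the profit-weighted sum of relevances.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def importance(page_vec, queries):
    """Sum over resident queries of profit times relevance of the page."""
    return sum(profit * cosine(page_vec, q_vec) for profit, q_vec in queries)


# Hypothetical resident queries as (profit, TF-IDF vector) pairs.
queries = [
    (2.0, {"cricket": 1.0, "score": 1.0}),
    (1.0, {"traffic": 1.0}),
]
page = {"cricket": 1.0, "score": 1.0}
w = importance(page, queries)
```

The page matches the first query exactly and shares no terms with the second, so its importance is essentially the first query's profit.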

10

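Returned Information Ratio

Update information reported for page i is $R_i = \sum_j \rho_{i,j} \, y_{ij}$
Goal is to maximize importance-weighted updates reported, $\sum_i W_i R_i$, subject to polling resource constraint $\sum_{i,j} y_{ij} \le C$
Returned info ratio (RIR) is

$$\mathrm{RIR} = \frac{\sum_i W_i \sum_j \rho_{i,j} \, y_{ij}}{\sum_i W_i \lambda_i}$$

• Numerator: importance-weighted updates captured by system
• Denominator: total importance-weighted expected updates, with $\lambda_i = \sum_j \rho_{i,j}$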
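A minimal sketch of the RIR computation, assuming importances, update probabilities, and poll decisions are held in plain lists. The function name `returned_information_ratio` and the sample values are illustrative assumptions.

```python
def returned_information_ratio(W, rho, y):
    """Importance-weighted updates captured, divided by those expected."""
    captured = sum(
        W[i] * rho[i][j] * y[i][j]
        for i in range(len(W))
        for j in range(len(rho[i]))
    )
    expected = sum(W[i] * sum(rho[i]) for i in range(len(W)))
    return captured / expected if expected > 0 else 0.0


# Hypothetical inputs: 2 pages, 2 time steps, one poll per page.
W = [1.0, 2.0]
rho = [[0.5, 0.5], [0.25, 0.75]]
y = [[1, 0], [0, 1]]
rir = returned_information_ratio(W, rho, y)
```

With these numbers the plan captures 2.0 of the 3.0 importance-weighted expected updates, so the ratio is two thirds.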

11

CAM system overview

Time proceeds in epochs
At the end of every epoch we re-evaluate
• Relevance
• Update probabilities
For the next epoch
• We select instants at which to poll each page (resource allocation)
• Schedule these instants subject to resource constraint

[Figure: CAM system diagram, labelled “Monitoring”, with boxes: Determining relevant pages; Parameter tracking; Resource allocation; Scheduling]

13

Resource allocation

Existing policies
• Uniform: Resources (#polls) distributed uniformly among all pages irrespective of their change frequency
• Proportional: #polls allocated to a page is proportional to the frequency with which it changes
For discrete queries, uniform is better than proportional for any inter-update distribution
CAM: solve a knapsack problem
• Better than uniform and proportional
• Proportional better than uniform
• Evidence that the CQ objective differs from the discrete objective
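The slides do not spell out the knapsack formulation, so the following is only a sketch under two assumptions: every poll costs one unit, and the value of polling page i at step j is its importance times its update probability at that step. Under those assumptions the knapsack reduces to picking the C most valuable (page, step) pairs. The function name `allocate_polls` and the inputs are hypothetical.

```python
import heapq


def allocate_polls(W, rho, C):
    """Select the C (page, step) pairs with the largest W[i] * rho[i][j].

    Returns y with y[i][j] = 1 for selected pairs and 0 otherwise.
    Assumes unit cost per poll, which is not stated in the slides.
    """
    candidates = [
        (W[i] * rho[i][j], i, j)
        for i in range(len(W))
        for j in range(len(rho[i]))
    ]
    chosen = heapq.nlargest(C, candidates)
    y = [[0] * len(row) for row in rho]
    for _, i, j in chosen:
        y[i][j] = 1
    return y


# Hypothetical inputs: 2 pages, 3 time steps, budget of 3 polls.
W = [1.0, 2.0]
rho = [[0.9, 0.1, 0.1], [0.1, 0.8, 0.3]]
y = allocate_polls(W, rho, C=3)
```

The resulting plan is tentative: it respects the total budget but not the per-step concurrency limit, which is left to the scheduling phase.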

14

Scheduling

Suppose our crawler can fetch M pages concurrently, and an epoch is T time steps long
Then we can fetch a total of C = MT pages during an epoch
• Ensured by resource allocation phase
But at each instant we cannot schedule more than M fetches
• Want small planned-to-actual poll delays
• May fail to schedule all poll jobs in an epoch

[Figure: CAM system diagram (Determining relevant pages; Parameter tracking; Resource allocation; Scheduling; Monitoring), annotated “Tentative $y_{ij}$s”]

15

A flow-shop problem

M “machines” available at any time
Each $y_{ij}$ which is equal to 1 is a “job”
Job k is “released” at time step $r_k$ (= j)
“Processing time” = crawl time = $t_k$
“Completion time” of job k is $C_k$
Want to minimize “total flow” $\sum_k (C_k - r_k)$
NP-hard problem
• We use earliest deadline heuristic

[Figure: schedule of jobs over time; axes: Time, Job]
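The slide names an earliest-deadline heuristic without defining the deadlines, so the sketch below makes two simplifying assumptions: every crawl takes exactly one time step, and a job's deadline is its release step, so pending jobs are served earliest release first. The function name `schedule` and the sample jobs are hypothetical.

```python
def schedule(release_steps, M, T):
    """Greedy schedule: at each step run up to M pending jobs, oldest first.

    Returns (start, total_flow, dropped). start maps job index to the step
    at which it runs, total_flow sums (completion - release) over scheduled
    jobs, and dropped lists the jobs that did not fit in the epoch.
    """
    order = sorted(range(len(release_steps)), key=lambda k: release_steps[k])
    start = {}
    pending = []  # released but not yet run, in release order
    nxt = 0
    for step in range(T):
        # Admit every job whose release step has been reached.
        while nxt < len(order) and release_steps[order[nxt]] <= step:
            pending.append(order[nxt])
            nxt += 1
        # Run at most M jobs in this step; the rest wait.
        for k in pending[:M]:
            start[k] = step
        pending = pending[M:]
    dropped = sorted(set(range(len(release_steps))) - set(start))
    # With unit crawl time, a job started at step s completes at s + 1.
    total_flow = sum(start[k] + 1 - release_steps[k] for k in start)
    return start, total_flow, dropped


# Hypothetical tentative plan: three polls wanted at step 0, one each at 1 and 3.
start, total_flow, dropped = schedule([0, 0, 0, 1, 3], M=2, T=4)
```

In this example the third job released at step 0 is delayed by one step because only two fetches can run concurrently, and no job is dropped.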

16

Experiments

Synthetic data
• Change frequency distribution: a few pages change very often (Zipfian)
• Update probability distribution: a few $\rho_{i,j}$’s are large, most are small (Zipfian again)
• Page importance distribution: also Zipfian (Wolman, 1999)
Real data
• Eight cricket score sites
• High update rate

[Figure: number of pages (0 to 350) plotted against change frequency (1 to 13)]
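The slides do not say how the synthetic data were generated. One plausible way to obtain Zipfian change rates is sketched below: rank-based weights proportional to 1/rank^s are normalized, shuffled, and scaled to a chosen total. The names `zipf_weights` and `synthetic_change_rates`, the seed, and the parameter values are assumptions for illustration.

```python
import random


def zipf_weights(n, s=1.0):
    """Normalized weights proportional to 1 / rank**s for ranks 1..n."""
    raw = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(raw)
    return [x / total for x in raw]


def synthetic_change_rates(n_pages, total_changes, s=1.0, seed=0):
    """Spread total_changes expected updates over pages with Zipfian skew."""
    rng = random.Random(seed)
    weights = zipf_weights(n_pages, s)
    rng.shuffle(weights)  # so that page index does not encode the rank
    return [total_changes * w for w in weights]


weights = zipf_weights(4, s=1.0)
flat = zipf_weights(3, s=0.0)
rates = synthetic_change_rates(100, total_changes=500.0, s=1.0)
```

Setting `s` to 0 gives a uniform distribution, and larger values concentrate more of the changes on a few pages.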

17

CAM > Proportional > Uniform

Uniform update and importance distributions
Plot RIR against ratio of resources to expected changes
RIR for CAM is more than 3 times better
Proportional is better than uniform in the CQ setting
• Intuition from “minimum total stale duration” does not apply to CQ

[Figure: RIR (0 to 0.2) against monitor/change ratio (2 to 8) for Uniform, Proportional, and CAM]

18

Resource allocation

[Figure: RIR (0 to 0.4) per page bin (1 to 10) for Uniform, Proportional, CAM, and Total info]

Sort pages by increasing change rate
Place in ten equally populated bins (10 = fastest)
Uniform spends same resource for each bin
Proportional wastes fewer resources on slow-changing bins, but is not aggressive enough
CAM invests more aggressively in fast-changing bins, achieving the greatest RIR
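The binning step described above can be sketched as follows. The function name `bin_pages` and the sample rates are hypothetical, and bins are numbered from 0 here, so the slide's bin 10 (fastest) corresponds to index 9.

```python
def bin_pages(change_rates, n_bins=10):
    """Split page indices into n_bins equally populated bins by change rate.

    bins[0] holds the slowest-changing pages and bins[-1] the fastest.
    """
    order = sorted(range(len(change_rates)), key=lambda i: change_rates[i])
    n = len(order)
    return [order[b * n // n_bins:(b + 1) * n // n_bins] for b in range(n_bins)]


# Hypothetical change rates for 20 pages: page 0 is fastest, page 19 slowest.
rates = [20 - i for i in range(20)]
bins = bin_pages(rates)
```

Per-bin RIR can then be computed by restricting the importance-weighted sums to the pages in each bin.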

19

Skew-handling and adaptation

Fixed monitoring/change ratio
Vary skew in update probability distribution
CAM’s gains increase with skew
CAM improves over initial epochs
Change distribution estimates stabilize within a few epochs

[Figure: RIR (0 to 0.6) against Zipf parameter (0 to 1.5) for CAM, Proportional, and Uniform]

20

Experiments on real pages

Eight sites with dynamic cricket match information
• In fact, Zipfian updates
Adversarial setup: monitor/change < 1
• CAM close to best possible
For a monitor/change ratio of 2, CAM reports updates on 80% of the information changed

[Figure: number of changes (0 to 500) per page index (1 to 8)]
[Figure: RIR (0 to 1) against monitoring-change ratio (0.3, 1, 10) for Uniform, Proportional, and CAM]

21

Conclusion

Continual queries are inherently different from discrete queries
Approach used in CAM
• Identify relevant pages
• Track the pages as they change
• Characterize page change behavior
• Decide when to monitor the pages in future
CAM approach performs better than the naïve approaches

22

References

J. Cho, H. Garcia-Molina. Synchronizing a database to improve freshness. ACM SIGMOD, 2000.
J. Cho, H. Garcia-Molina. Estimating frequency of change. Technical Report, 2000.
J. Sethuraman, J. L. Wolf, M. S. Squillante, P. S. Yu. Optimal crawling strategies for Web search engines. World Wide Web, 2002.