
Post on 21-Dec-2015

Page 1:

WIC: A General-Purpose Algorithm for Monitoring Web Information Sources

Sandeep Pandey (speaker)
Kedar Dhamdhere
Christopher Olston
Carnegie Mellon University

Page 2:

Databases @Carnegie Mellon

Dynamic Information on the Web

Bulletin boards
Online auctions
News
Weather
Roadway conditions, sports scores, etc.

Page 3:

Online Shopping, Auctions

Page 4:

Stock Market

Page 5:

Continuous Query Systems

Process information from dynamic Web sources automatically

e.g., CONQUER [Liu et al. WWW 1999]
Niagara [Naughton et al. SIGMOD 2000]
WebCQ [Liu et al. CIKM 2000]

Page 6:

Past Research on CQ Systems

Focus on language design, query processing
Assume “push” model of information access
- Information shows up at doorstep
Web sources are “pull” oriented
- Must explicitly download Web pages, check for changes, submit changes to CQ engine

Page 7:

Converting Pull -> Push

[Diagram: Web sources (auction sites, sports sites) are pulled by WIC, which pushes changes to the CQ engine; the pull side is marked "?"]

Page 8:

Converting Pull -> Push

Topic has received little attention
So far only heuristics with no formal guarantees
Periodic polling of sources
- Not scalable
CAM [Pandey et al. WWW’03], Gal et al. [JACM 2001]:
- Take into account predicted change behavior
- Create monitoring schedule in advance

Page 9:

A good first step, but …

No formal guarantees
Suits narrow range of applications

Page 10:

Example Application Scenarios

Timeliness not critical
- Append-only: maintaining a searchable resume database
- Complete overwrite: collecting “front-page” news stories for archival
Timeliness is critical
- Append-only: capturing new Internet security bulletins for automatic dissemination within an organization
- Complete overwrite: reacting in real-time to stock market fluctuations, online auction bids

Page 11:

Outline

Introduction
Problem statement
WIC: Web Information Collector
Formal results:
- WIC is a 2-approximation
Experimental results:
- Timeliness-completeness tradeoff

Page 12:

Model of Pull-Oriented Sources

Proposed by Wolf et al. [WWW 2002]
Set of Web pages of interest P_1 … P_n
Importance weight associated with each page
Time is divided into discrete time instants
Change: An update posted on a Web page
Known probability π_ij that page P_i will change at time T_j
- We do not address the problem of estimating change probabilities
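The model above can be written down as a small data structure. The following is only a sketch: the class name `Page`, its field names, and the sample probabilities are illustrative assumptions, not taken from the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    """One monitored Web page P_i (class and field names are illustrative)."""
    name: str
    weight: float             # importance weight of the page
    change_prob: List[float]  # change_prob[j] = pi_ij, probability of a change at T_j


def expected_changes(page: Page) -> float:
    """Expected number of changes posted over the horizon (sum of the pi_ij)."""
    return sum(page.change_prob)


# Made-up sample data: two pages over three discrete time instants.
pages = [
    Page("P1", weight=1.0, change_prob=[0.4, 1.0, 0.3]),
    Page("P2", weight=2.0, change_prob=[0.1, 0.0, 0.9]),
]
```

The change probabilities are treated as given inputs, matching the statement that estimating them is out of scope.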

Page 13:

Our Model

[Figure: one timeline per page (P1, P2, P3), each showing the change probability at every time instant; horizontal axis: Time]

Page 14:

Modeling the Change Characteristics

Timeliness not critical
- Append-only: resume database
- Complete overwrite: news stories archival
Timeliness is critical
- Append-only: security bulletins
- Complete overwrite: online auction bids

Page 15:

Modeling the Change Characteristics

life_i(j, k): the probability of a change to page P_i at time T_j to remain available at time T_k

Case 1: changes overwrite old info.
  life_i(j, k) = \prod_{q=j+1}^{k} (1 - \pi_{iq})

Case 2: append-only
  life_i(j, k) = 1

Also: sliding window, others …
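A minimal sketch of the two life functions, assuming the overwrite case is the product of (1 - π_iq) over q = j+1 … k, that is, a change survives only if no later change replaces it. The function names and the 0-based time indices are assumptions made for illustration.

```python
from typing import Sequence


def life_overwrite(pi: Sequence[float], j: int, k: int) -> float:
    """Case 1: a change at T_j is still visible at T_k only if no change
    occurred at T_{j+1} .. T_k. pi[q] is the change probability at T_q."""
    prob = 1.0
    for q in range(j + 1, k + 1):
        prob *= 1.0 - pi[q]
    return prob


def life_append_only(pi: Sequence[float], j: int, k: int) -> float:
    """Case 2: appended information is never erased."""
    return 1.0


pi = [0.5, 0.2, 0.5, 1.0]          # made-up probabilities for one page
same_instant = life_overwrite(pi, 0, 0)   # empty product
two_later = life_overwrite(pi, 0, 2)      # (1 - 0.2) * (1 - 0.5)
after_certain = life_overwrite(pi, 0, 3)  # change at T_3 is certain
```

A sliding-window source would need a third function with the same signature; it is left out because the slide gives no definition for it.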

Page 16:

Web Monitoring Requirements

Timeliness not critical
- Append-only: resume database
- Complete overwrite: news stories archival
Timeliness is critical
- Append-only: security bulletins
- Complete overwrite: online auction bids

Page 17:

Conflicting Requirements

Completeness: maximize number of changes captured
Timeliness: minimize delay in capturing changes
Limited resources
- Up to C pages can be monitored per time instant
When resources are not plentiful, the two objectives can be at odds with each other

Page 18:

Timeliness-Completeness tradeoff

Resource constraint: C=1

[Figure: change-probability timelines for P1 (append-only) and P2 (overwrite)]

Page 19:

Only Timeliness

Objective: Changes must be captured with zero delay

[Figure: change-probability timelines for P1 (append-only) and P2 (overwrite)]

Page 20:

Only Completeness

Objective: Maximize the number of changes captured

[Figure: change-probability timelines for P1 (append-only) and P2 (overwrite)]

Page 21:

Controlling the Tradeoff

Urgency: Importance of information captured as a function of delay in capturing

[Figure: example urgency functions]
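To make the idea of an urgency function concrete, here is a hedged sketch of three candidates. The exact functional forms are assumptions: the talk only says that the experiments use an exponentially decaying urgency function parameterized by r, so `exp(-r * delay)` is one plausible reading, and all function names are hypothetical.

```python
import math


def urgency_exponential(delay: int, r: float) -> float:
    """Assumed form: importance decays as exp(-r * delay); larger r is steeper."""
    return math.exp(-r * delay)


def urgency_zero_delay_only(delay: int) -> float:
    """Steepest possible curve: a change only counts if captured with zero delay."""
    return 1.0 if delay == 0 else 0.0


def urgency_flat(delay: int) -> float:
    """Most gradual curve: delay is irrelevant, so only completeness matters."""
    return 1.0
```

A steep curve pushes the monitor toward timeliness, a gradual one toward completeness.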

Page 22:

Web Monitoring Requirements

Timeliness not critical (gradual urgency curve)
- Append-only: resume database
- Complete overwrite: news stories archival
Timeliness is critical (steep urgency curve)
- Append-only: security bulletins
- Complete overwrite: online auction bids

Page 23:

Web Monitoring Objective

Maximize Utility
- Utility = Expected number of changes captured, weighted by delay according to urgency function
Each monitoring action takes unit amount of resource
Resource constraint: amount of resource per time unit constrained
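The objective can be sketched for a single page as follows. This is an illustration under stated assumptions, not the paper's formal definition: a change at T_j is assumed to be captured by the first download at or after T_j, its contribution is the change probability times the survival probability times the urgency of the delay, and the names `schedule_utility` and `check_budget` are hypothetical.

```python
from typing import Callable, Sequence


def schedule_utility(
    pi: Sequence[float],
    monitor_times: Sequence[int],
    life: Callable[[int, int], float],
    urgency: Callable[[int], float],
    weight: float = 1.0,
) -> float:
    """Expected urgency-weighted number of changes captured for one page."""
    times = sorted(monitor_times)
    total = 0.0
    for j, p_change in enumerate(pi):
        # A change at T_j is picked up by the first download at or after T_j.
        k = next((t for t in times if t >= j), None)
        if k is None:
            continue  # the page is never downloaded again, so the change is missed
        total += weight * p_change * life(j, k) * urgency(k - j)
    return total


def check_budget(schedule: Sequence[Sequence[int]], capacity: int) -> bool:
    """schedule[t] lists the pages downloaded at T_t; each download costs one unit."""
    return all(len(downloads) <= capacity for downloads in schedule)


pi = [0.5, 0.5, 0.0]                  # made-up change probabilities
append_only = lambda j, k: 1.0        # nothing is ever erased
flat = lambda d: 1.0                  # delay does not matter
zero_delay = lambda d: 1.0 if d == 0 else 0.0
u_complete = schedule_utility(pi, [2], append_only, flat)
u_timely = schedule_utility(pi, [2], append_only, zero_delay)
```

With a single download at T_2, the flat urgency credits both earlier changes, while the zero-delay urgency credits none of them.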

Page 24:

Our Solution

Web Information Collector (WIC)
2-approximation for all scenarios
- Total utility accrued at least half that accrued by optimal monitoring schedule
Finds optimal solution in the following special case:
- Timeliness is critical, changes overwrite

Page 25:

Web Information Collector (WIC)

Online, greedy strategy
At each time instant, download page(s) with highest utility
Utility combines:
- Probability that a change has occurred
- Probability that change has not been erased
- Delay in capturing change (weighted according to urgency function)
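The greedy strategy described above might look like the sketch below. It is a naive illustration, not the authors' implementation: the per-page gain is recomputed from scratch at every time instant, the way the three factors are multiplied together is an assumption, and the function and parameter names are hypothetical.

```python
import heapq
from typing import Callable, List, Sequence


def greedy_monitor(
    pi: Sequence[Sequence[float]],
    weights: Sequence[float],
    life: Callable[[int, int, int], float],
    urgency: Callable[[int], float],
    capacity: int,
) -> List[List[int]]:
    """Return, for each time instant, the indices of the pages downloaded.

    pi[i][j] is the probability that page i changes at T_j, and
    life(i, j, k) is the probability that such a change is still visible at T_k.
    """
    n_pages, horizon = len(pi), len(pi[0])
    last = [-1] * n_pages  # time of the most recent download of each page
    schedule: List[List[int]] = []
    for k in range(horizon):
        def gain(i: int) -> float:
            # Changes since the last download that may still be visible at T_k.
            return weights[i] * sum(
                pi[i][j] * life(i, j, k) * urgency(k - j)
                for j in range(last[i] + 1, k + 1)
            )
        chosen = heapq.nlargest(capacity, range(n_pages), key=gain)
        for i in chosen:
            last[i] = k
        schedule.append(sorted(chosen))
    return schedule


# Made-up example: two append-only pages, flat urgency, one download per instant.
example = greedy_monitor(
    pi=[[0.1, 0.1, 0.1], [0.0, 0.9, 0.0]],
    weights=[1.0, 1.0],
    life=lambda i, j, k: 1.0,
    urgency=lambda d: 1.0,
    capacity=1,
)
```

In this example the second page is downloaded at the instant where its likely change occurs, and the first page at the other two instants.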

Page 26:

WIC continued

Running time:
- O(# pages) per time instant under most settings of life and urgency
WIC is an online algorithm
- Forecasting can be done at last minute

Page 27:

Proof of 2-Approximation

See our paper

Page 28:

Experiments

[Figure: scenario grid, timeliness (not critical / critical) by change behavior (append-only / complete overwrite)]

Data: 7550 auction pages
Exponentially decaying urgency function parameterized by r

Page 29:

Experimental Results in Paper

Sensitivity to error in prediction
- Not unduly sensitive
Comparison against prior approach (CAM)
- Up to 80% improvement
- Handles more applications
Timeliness-Completeness tradeoff

Page 30:

Timeliness-Completeness tradeoff

[Figure: tradeoff plot annotated "favor completeness" and "favor timeliness"]

Page 31:

Summary

Pull->push
Can’t have it all
- Choose a combination of timeliness and completeness
Our solution: WIC
- Handles many applications
- Formal guarantee: 2-approximation
- Online algorithm

Page 32:

Urgency Parameter Controls Timeliness-Completeness Tradeoff

Best curve to use depends on application
Application 1: Agent to monitor and bid in online auctions on behalf of many customers
- Use steep curve (timeliness is critical)
Application 2: Program to maintain database of large number of online resumes
- Use gradual curve (timeliness less critical)

Page 33:

Experiments

Determine exact change occurrence times
Add noise to simulate prediction inaccuracy:
- False positives
- False negatives
- Gaussian spreading
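One way to simulate the three kinds of prediction noise is sketched below. How the noise was actually injected in the experiments is not stated on the slide, so the procedure, the function name `noisy_predictions`, and its parameters are all assumptions made for illustration.

```python
import math
import random
from typing import List


def noisy_predictions(
    changes: List[int],
    false_pos: float,
    false_neg: float,
    sigma: float,
    seed: int = 0,
) -> List[float]:
    """Turn an exact 0/1 change trace into noisy change probabilities.

    false_neg: chance that a real change is dropped from the prediction.
    false_pos: chance that a spurious change is predicted at a quiet instant.
    sigma: width (in time instants) of the Gaussian spreading; 0 disables it.
    """
    rng = random.Random(seed)
    n = len(changes)
    spikes = []
    for c in changes:
        if c:
            spikes.append(0.0 if rng.random() < false_neg else 1.0)
        else:
            spikes.append(1.0 if rng.random() < false_pos else 0.0)
    if sigma <= 0:
        return spikes
    # Spread each predicted spike over neighbouring instants with a Gaussian kernel.
    probs = [0.0] * n
    for j, s in enumerate(spikes):
        if s == 0.0:
            continue
        kernel = [math.exp(-((t - j) ** 2) / (2 * sigma ** 2)) for t in range(n)]
        norm = sum(kernel)
        for t in range(n):
            probs[t] += s * kernel[t] / norm
    return [min(1.0, p) for p in probs]


trace = [0, 1, 0, 0, 1]  # made-up exact change occurrence times
exact = noisy_predictions(trace, false_pos=0.0, false_neg=0.0, sigma=0.0)
all_missed = noisy_predictions(trace, false_pos=0.0, false_neg=1.0, sigma=0.0)
spread = noisy_predictions(trace, false_pos=0.0, false_neg=0.0, sigma=1.0)
```

The noisy probabilities can then be fed to a monitoring scheduler in place of the exact trace to see how much utility is lost.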