Post on 21-Dec-2015
WIC: A General-Purpose Algorithm for Monitoring Web Information Sources
Sandeep Pandey (speaker)
Kedar Dhamdhere
Christopher Olston
Carnegie Mellon University
Databases @Carnegie Mellon

Dynamic Information on the Web
- Bulletin boards
- Online auctions
- News
- Weather
- Roadway conditions, Sports scores, etc.

Online Shopping, Auctions

Stock Market

Continuous Query Systems
- Process information from dynamic Web sources automatically
- e.g., CONQUER [Liu et al. WWW 1999]
- Niagara [Naughton et al. SIGMOD 2000]
- WebCQ [Liu et al. CIKM 2000]

Past Research on CQ Systems
- Focus on language design, query processing
- Assume “push” model of information access
  - Information shows up at doorstep
- Web sources are “pull” oriented
  - Must explicitly download Web pages, check for changes, submit changes to CQ engine
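
The pull steps listed on this slide (download, check for changes, submit changes) can be sketched as a single polling pass. This is only an illustration under stated assumptions: the name `poll_once`, the `fetch` callback, the in-memory page store, and the use of a SHA-256 content digest to detect changes are all hypothetical and do not come from the talk.

```python
import hashlib
from typing import Callable, Dict, Iterable, List, Tuple


def poll_once(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    last_digest: Dict[str, str],
) -> List[Tuple[str, bytes]]:
    """Pull every page once and return the (url, body) pairs that changed.

    A page counts as changed when its content digest differs from the one
    recorded in `last_digest`; a page never seen before counts as changed.
    """
    changed = []
    for url in urls:
        body = fetch(url)                          # explicit download ("pull")
        digest = hashlib.sha256(body).hexdigest()  # assumed change check
        if last_digest.get(url) != digest:
            last_digest[url] = digest
            changed.append((url, body))            # would be handed to the CQ engine
    return changed


# Usage with an in-memory stand-in for the Web (hypothetical page names).
pages = {"auction/1": b"bid: 10", "scores/1": b"0-0"}
seen: Dict[str, str] = {}
first = poll_once(pages, pages.__getitem__, seen)    # both pages are new
pages["auction/1"] = b"bid: 12"
second = poll_once(pages, pages.__getitem__, seen)   # only the auction page changed
```

Choosing which pages to pass in at each time instant is the scheduling question the rest of the talk addresses.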

Converting Pull -> Push
[Diagram: auction sites, sports sites --pull--> WIC (??) --push--> CQ engine]

Converting Pull -> Push
- Topic has received little attention
- So far only heuristics with no formal guarantees
  - Periodical polling of sources
    - Not scalable
  - CAM [Pandey et al. WWW’03]
  - Gal et al. [JACM 2001]:
    - Take into account predicted change behavior
    - Create monitoring schedule in advance

A good first step, but …
- No formal guarantees
- Suits narrow range of applications

Example Application Scenarios
| | Append-only | Complete overwrite |
|---|---|---|
| Timeliness not critical | maintaining a searchable resume database | collecting “front-page” news stories for archival |
| Timeliness is critical | capturing new Internet security bulletins for automatic dissemination within an organization | reacting in real-time to stock market fluctuations, online auction bids |

Outline
- Introduction
- Problem statement
- WIC: Web Information Collector
- Formal results:
  - WIC is a 2-approximation
- Experimental results:
  - Timeliness-completeness tradeoff

Model of Pull-Oriented Sources
- Proposed by Wolf et al. [WWW 2002]
- Set of Web pages of interest $P_1 \dots P_n$
- Importance weight associated with each page
- Time is divided into discrete time instants
- Change: An update posted on a Web page
- Known probability $\pi_{ij}$ that page $P_i$ will change at time $T_j$
  - We do not address the problem of estimating change probabilities

Our Model
[Figure: pages P1, P2, P3 along a time axis, each time instant annotated with that page's change probability.]

Modeling the Change Characteristics
| | Append-only | Complete overwrite |
|---|---|---|
| Timeliness not critical | resume database | news stories archival |
| Timeliness is critical | security bulletins | online auction bids |

Modeling the Change Characteristics
- $\mathrm{life}_i(j,k)$: the probability that a change to page $P_i$ at time $T_j$ remains available at time $T_k$
- Case 1: changes overwrite old info.
  $$\mathrm{life}_i(j,k) = \prod_{q=j+1}^{k} (1 - \pi_{iq})$$
- Case 2: append-only
  $$\mathrm{life}_i(j,k) = 1$$
- Also: sliding window, others …
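
A minimal sketch of the two life functions named on this slide, assuming a page's change probabilities are stored in a 0-indexed list `pi_i` (so `pi_i[q]` plays the role of $\pi_{iq}$). The function names and the sample probabilities are made up for illustration.

```python
import math
from typing import Sequence


def life_overwrite(pi_i: Sequence[float], j: int, k: int) -> float:
    """Probability that a change at instant j is still visible at instant k >= j
    when any later change overwrites it: product of (1 - pi_i[q]) for q = j+1..k."""
    return math.prod((1.0 - pi_i[q] for q in range(j + 1, k + 1)), start=1.0)


def life_append_only(pi_i: Sequence[float], j: int, k: int) -> float:
    """Append-only pages never erase a change, so it always survives."""
    return 1.0


pi_1 = [0.0, 0.5, 0.5, 1.0]   # made-up change probabilities for one page

same_instant = life_overwrite(pi_1, 0, 0)   # empty product: nothing can overwrite yet
two_later = life_overwrite(pi_1, 0, 2)      # survives two instants of 0.5 change risk
certain_loss = life_overwrite(pi_1, 0, 3)   # a certain change at instant 3 erases it
```

Other policies, such as the sliding window mentioned on the slide, would be further functions with the same `(pi_i, j, k)` signature.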

Web Monitoring Requirements
| | Append-only | Complete overwrite |
|---|---|---|
| Timeliness not critical | resume database | news stories archival |
| Timeliness is critical | security bulletins | online auction bids |

Conflicting Requirements
- Completeness: maximize number of changes captured
- Timeliness: minimize delay in capturing changes
- Limited resources
  - Up to C pages can be monitored per time instant
- When resources are not plentiful, the two objectives can be at odds with each other

Timeliness-Completeness tradeoff
- Resource constraint: C=1
[Figure: change-probability timelines for P1 (append-only) and P2 (overwrite).]

Only Timeliness
- Objective: Changes must be captured with zero delay
[Figure: change-probability timelines for P1 (append-only) and P2 (overwrite).]

Only Completeness
- Objective: Maximize the number of changes captured
[Figure: change-probability timelines for P1 (append-only) and P2 (overwrite).]

Controlling the Tradeoff
- Urgency: Importance of information captured as a function of delay in capturing
- Example urgency functions
[Figure: a steep urgency curve and a gradual urgency curve.]
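
To make the steep and gradual curves concrete, here is a small sketch of an urgency function. The exponential form `exp(-r * delay)` is an assumption: the talk only says that its experiments use an exponentially decaying urgency function parameterized by r, and the specific values of `r` below are invented.

```python
import math


def urgency(delay: int, r: float) -> float:
    """Importance of a captured change as a function of capture delay.

    Assumed form: exp(-r * delay). With r = 0 delay is ignored entirely;
    a larger r gives a steeper curve.
    """
    if delay < 0:
        raise ValueError("delay must be non-negative")
    return math.exp(-r * delay)


steep = [urgency(d, r=2.0) for d in range(4)]     # timeliness is critical
gradual = [urgency(d, r=0.1) for d in range(4)]   # timeliness matters less
flat = [urgency(d, r=0.0) for d in range(4)]      # only completeness matters
```

Under this assumed form, a change captured with no delay always has urgency 1, and the choice of `r` decides how quickly late captures lose value.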

Web Monitoring Requirements
| | Append-only | Complete overwrite |
|---|---|---|
| Timeliness not critical | resume database | news stories archival |
| Timeliness is critical | security bulletins | online auction bids |

Web Monitoring Objective
- Maximize Utility
  - Utility = Expected number of changes captured, weighted by delay according to urgency function
- Each monitoring action takes unit amount of resource
- Resource constraint: amount of resource per time unit constrained
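
One way to write this objective in symbols is sketched below. It is an illustrative formalization, not necessarily the paper's exact one. Assumed notation: $w_i$ is the importance weight of page $P_i$, $\mathrm{urgency}(d)$ is the urgency of a change captured with delay $d$, $\mathrm{life}_i(j,k)$ is the life function from the change model, $S_k$ is the set of pages downloaded at $T_k$, and $k'_i$ is the index of the previous download of $P_i$ (taken as 0 if there was none).

```latex
% Expected utility gained by downloading page P_i at time T_k,
% given that its previous download was at time T_{k'}.
U_i(k', k) \;=\; w_i \sum_{j = k'+1}^{k} \pi_{ij}\,\mathrm{life}_i(j,k)\,\mathrm{urgency}(k - j)

% Objective: pick at most C pages per time instant to maximize total utility.
\max_{S_1, S_2, \dots} \; \sum_{k} \sum_{i \in S_k} U_i(k'_i, k)
\quad \text{subject to} \quad |S_k| \le C \ \text{ for every } k
```

Each term of the inner sum multiplies the probability that a change occurred at $T_j$, the probability that it is still available at $T_k$, and the urgency weight for a delay of $k - j$.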

Our Solution
- Web Information Collector (WIC)
- 2-approximation for all scenarios
  - Total utility accrued at least half that accrued by optimal monitoring schedule
- Finds optimal solution in the following special case:
  - Timeliness is critical, changes overwrite

Web Information Collector (WIC)
- Online, greedy strategy
- At each time instant, download page(s) with highest utility
- Utility combines:
  - Probability that a change has occurred
  - Probability that change has not been erased
  - Delay in capturing change (weighted according to urgency function)
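
The greedy strategy described on this slide can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function name `greedy_schedule`, the callback signatures `life(i, j, k)` and `urgency(delay)`, and the exact way the three ingredients are multiplied together are assumptions. The sketch also recomputes each page's utility from scratch, so it does not achieve the per-instant running time the talk reports.

```python
from typing import Callable, List, Sequence


def greedy_schedule(
    pi: Sequence[Sequence[float]],            # pi[i][j]: P(page i changes at instant j)
    weights: Sequence[float],                 # importance weight of each page
    life: Callable[[int, int, int], float],   # life(i, j, k): change at j survives to k
    urgency: Callable[[int], float],          # urgency(delay)
    capacity: int,                            # C: pages that can be downloaded per instant
) -> List[List[int]]:
    """Return, for each time instant, the indices of the pages to download."""
    n, horizon = len(pi), len(pi[0])
    last = [-1] * n          # instant of each page's most recent download
    schedule = []

    for k in range(horizon):
        def utility(i: int) -> float:
            # Expected, urgency-weighted changes not yet captured for page i.
            return weights[i] * sum(
                pi[i][j] * life(i, j, k) * urgency(k - j)
                for j in range(last[i] + 1, k + 1)
            )

        # Greedy step: take the `capacity` pages with the highest utility now.
        chosen = sorted(range(n), key=utility, reverse=True)[:capacity]
        for i in chosen:
            last[i] = k
        schedule.append(chosen)
    return schedule


# Two append-only pages, delay ignored, one download per instant (made-up numbers).
pi = [[0.9, 0.0, 0.0],
      [0.1, 0.1, 0.8]]
plan = greedy_schedule(pi, [1.0, 1.0],
                       life=lambda i, j, k: 1.0,
                       urgency=lambda d: 1.0,
                       capacity=1)
```

In this toy run page 0 is downloaded first (utility 0.9 against 0.1), after which page 1 accumulates more expected uncaptured changes and is chosen at the remaining two instants.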

WIC continued
- Running time: O(# pages) per time instant under most settings of life and urgency
- WIC is an online algorithm
  - Forecasting can be done at last minute

Proof of 2-Approximation
- See our paper

Experiments
[Table: scenarios by timeliness (not critical, critical) and change type (append-only, complete overwrite).]
- Data: 7550 auction pages
- Exponentially decaying urgency function parameterized by r

Experimental Results in Paper
- Sensitivity to error in prediction
  - Not unduly sensitive
- Comparison against prior approach (CAM)
  - Up to 80% improvement
  - Handles more applications
- Timeliness-Completeness tradeoff

Timeliness-Completeness tradeoff
[Figure: tradeoff plot with one end labeled "favor completeness" and the other "favor timeliness".]

Summary
- Pull->push
- Can’t have it all
  - Choose a combination of timeliness and completeness
- Our solution: WIC
  - Handles many applications
  - Formal guarantee: 2-approximation
  - Online algorithm

Urgency Parameter Controls Timeliness-Completeness Tradeoff
- Best curve to use depends on application
- Application 1: Agent to monitor and bid in online auctions on behalf of many customers
  - Use steep curve (timeliness is critical)
- Application 2: Program to maintain database of large number of online resumes
  - Use gradual curve (timeliness less critical)

Experiments
- Determine exact change occurrence times
- Add noise to simulate prediction inaccuracy:
  - False positives
  - False negatives
  - Gaussian spreading
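
The three kinds of noise listed on this slide could be simulated along the following lines. The talk does not give the procedure, so everything here is an assumption made for illustration: the function names, the independent per-instant flips with rates `fp` and `fn`, the normalized Gaussian kernel with width `sigma` (which must be positive), and the cap at 1.

```python
import math
import random
from typing import List


def flip_predictions(changes: List[int], fp: float, fn: float,
                     rng: random.Random) -> List[int]:
    """Turn exact 0/1 change indicators into noisy predictions.

    Each true change is dropped with probability fn (false negative);
    each non-change is reported with probability fp (false positive).
    """
    noisy = []
    for c in changes:
        u = rng.random()
        if c:
            noisy.append(0 if u < fn else 1)
        else:
            noisy.append(1 if u < fp else 0)
    return noisy


def gaussian_spread(predictions: List[int], sigma: float) -> List[float]:
    """Smear each predicted change over nearby instants with a Gaussian kernel.

    Each prediction contributes total mass 1 inside the horizon; overlapping
    contributions are capped at 1 so every entry stays a probability.
    """
    horizon = len(predictions)
    probs = [0.0] * horizon
    for t, p in enumerate(predictions):
        if not p:
            continue
        kernel = [math.exp(-((j - t) ** 2) / (2.0 * sigma ** 2))
                  for j in range(horizon)]
        total = sum(kernel)
        for j in range(horizon):
            probs[j] = min(1.0, probs[j] + kernel[j] / total)
    return probs


exact = [0, 0, 1, 0, 0]       # one change, at instant 2 (made-up data)
rng = random.Random(0)
dropped = flip_predictions(exact, fp=0.0, fn=1.0, rng=rng)    # every change missed
spurious = flip_predictions(exact, fp=1.0, fn=0.0, rng=rng)   # every instant flagged
spread = gaussian_spread(exact, sigma=1.0)                    # mass centered on instant 2
```

The resulting probabilities could then be fed to a monitoring scheduler in place of the exact change times to measure how much utility is lost to prediction error.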