The Stanford Data Streams Research Project
Profs. Rajeev Motwani & Jennifer Widom
And a cast of full- and part-time students: Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Gurmeet Manku, Liadan O’Callaghan, Justin Rosentein, Qi Sun, Rohit Varma
stanfordstreamdatamanager

Data Streams
• Traditional DBMS -- data stored in finite, persistent data sets
• New applications -- data as multiple, continuous, rapid, time-varying data streams
– Network monitoring and traffic engineering
– Security applications
– Telecom call records
– Financial applications
– Web logs and click-streams
– Sensor networks
– Manufacturing processes

Challenges
• Multiple, continuous, rapid, time-varying streams of data
• Queries may be continuous (not just one-time)
– Evaluated continuously as stream data arrives
– Answer updated over time
• Queries may be complex
– Beyond element-at-a-time processing
– Beyond stream-at-a-time processing

Using Traditional Database
[Diagram labels: User/Application; Loader; Query; Result]

New Approach for Data Streams
[Diagram labels: User/Application; Register Query; Stream Query Processor; Result; Scratch Space (Memory and/or Disk); Data Stream Management System (DSMS)]

DBMS versus DSMS
DBMS:
• Persistent relations
• One-time queries
• Random access
• Access plan determined by query processor and physical DB design
• “Unbounded” disk store
DSMS:
• Transient streams (and persistent relations)
• Continuous queries
• Sequential access
• Unpredictable data arrival and characteristics
• Bounded main memory

Sample Applications
• Network management and traffic engineering (e.g., Sprint)
– Streams of measurements and packet traces
– Queries: detect anomalies, adjust routing
• Telecom call data (e.g., AT&T)
– Streams of call records
– Queries: fraud detection, customer call patterns, billing

Sample Applications (cont’d)
• Network security (e.g., iPolicy, NetForensics/Cisco, Netscreen)
– Network packet streams, user session information
– Queries: URL filtering, detecting intrusions, DoS attacks, and viruses
• Financial applications (e.g., Traderbot)
– Streams of trading data, stock tickers, news feeds
– Queries: arbitrage opportunities, analytics, patterns

Sample Applications (cont’d)
• Web tracking and personalization (e.g., Yahoo, Google, Akamai)
– Clickstreams, user query streams, log records
– Queries: monitoring, analysis, personalization
• Truly massive databases (e.g., Astronomy Archives)
– Stream the data by once (or over and over)
– Queries do the best they can

Making Things Concrete
• Database = two streams of mobile call records
– Outgoing(connectionID, caller, start, end)
– Incoming(connectionID, callee, start, end)
• Query language = SQL
– FROM clauses can refer to streams and/or relations

Query Example 1
• Find all outgoing calls longer than 2 minutes (relational selection)
    SELECT O.connectionID, O.caller
    FROM Outgoing O
    WHERE O.end - O.start > 2
• Result requires unbounded storage
• Can provide result as data stream
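
As a rough illustration of why a selection can be delivered as a stream, the sketch below evaluates the filter one tuple at a time and keeps no state between tuples. The `OutgoingCall` type, the `long_calls` function, and the assumption that `start` and `end` are measured in minutes are all hypothetical; they are not part of the STREAM system.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class OutgoingCall(NamedTuple):
    """Hypothetical tuple layout mirroring Outgoing(connectionID, caller, start, end)."""
    connection_id: int
    caller: str
    start: float  # assumed to be in minutes
    end: float    # assumed to be in minutes


def long_calls(outgoing: Iterable[OutgoingCall],
               min_minutes: float = 2.0) -> Iterator[Tuple[int, str]]:
    """Yield (connectionID, caller) for each call longer than min_minutes.

    Each input tuple is examined once and then forgotten, so the operator
    itself needs constant memory; only a stored copy of the full result
    would grow without bound.
    """
    for call in outgoing:
        if call.end - call.start > min_minutes:
            yield (call.connection_id, call.caller)


calls = [
    OutgoingCall(1, "alice", 0.0, 5.0),
    OutgoingCall(2, "bob", 1.0, 2.5),
    OutgoingCall(3, "carol", 3.0, 9.0),
]
result = list(long_calls(calls))
# result == [(1, "alice"), (3, "carol")]
```

Because `long_calls` is a generator, a consumer can read results as they are produced rather than waiting for the input to end.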

Query Example 2
• Pair up callers and callees (relational join)
    SELECT O.caller, I.callee
    FROM Outgoing O, Incoming I
    WHERE O.connectionID = I.connectionID
• Can still provide result as data stream
• Requires unbounded temporary storage (without additional assumptions)
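
One way to see where the unbounded temporary storage comes from is a symmetric hash join, sketched below. This is a generic textbook technique chosen for illustration, not necessarily how STREAM evaluates joins; the `symmetric_hash_join` name and the tagged-event input format are assumptions.

```python
from collections import defaultdict


def symmetric_hash_join(events):
    """Join Outgoing and Incoming tuples on connectionID as they arrive.

    `events` is an interleaved sequence of
    ("O", connectionID, caller) or ("I", connectionID, callee).
    Yields (caller, callee) pairs as soon as both sides have been seen.
    """
    seen_out = defaultdict(list)  # connectionID -> callers seen so far
    seen_in = defaultdict(list)   # connectionID -> callees seen so far
    for side, conn_id, party in events:
        if side == "O":
            seen_out[conn_id].append(party)
            for callee in seen_in.get(conn_id, ()):
                yield (party, callee)
        else:
            seen_in[conn_id].append(party)
            for caller in seen_out.get(conn_id, ()):
                yield (caller, party)
    # Nothing is ever evicted from seen_out or seen_in, so both tables
    # grow for as long as the streams keep arriving.


events = [
    ("O", 1, "alice"),
    ("I", 2, "dave"),
    ("I", 1, "bob"),
    ("O", 2, "carol"),
]
pairs = list(symmetric_hash_join(events))
# pairs == [("alice", "bob"), ("carol", "dave")]
```

An additional assumption, such as each connectionID appearing at most once per stream, might allow matched entries to be dropped, although tuples that never find a partner would still accumulate.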

Query Example 3
• Find total connection time for each caller (relational grouping and aggregation)
    SELECT O.caller, sum(O.end - O.start)
    FROM Outgoing O
    GROUP BY O.caller
• Cannot provide result in (append-only) stream
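
The difficulty is that a caller's total is never final while the stream is still open. The sketch below shows one possible alternative, assumed here for illustration only: emit an updated total each time a call arrives, so that later rows supersede earlier rows for the same caller. The `running_totals` name and the `(caller, start, end)` tuple layout are hypothetical.

```python
from collections import defaultdict


def running_totals(outgoing):
    """For each arriving (caller, start, end), yield (caller, total so far).

    The output is a stream of updates, not an append-only result:
    a later row for a caller replaces that caller's earlier row.
    """
    totals = defaultdict(float)  # one entry per distinct caller
    for caller, start, end in outgoing:
        totals[caller] += end - start
        yield (caller, totals[caller])


calls = [
    ("alice", 0.0, 5.0),
    ("bob", 1.0, 2.5),
    ("alice", 6.0, 7.0),
]
updates = list(running_totals(calls))
# updates == [("alice", 5.0), ("bob", 1.5), ("alice", 6.0)]
```

Note that the first row for alice becomes stale once her second call arrives, which is exactly what an append-only stream cannot express.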

Project Goal
Reconsider all aspects of data management and processing in presence of data streams

Remainder of Talk
• Data stream model
• Queries over data streams
– Language, semantics, evaluation & optimization
• DSMS query processing architecture and system internals
• Results to date
• Ongoing work
• Related work

Data Model
• Database: relations + data streams
• Stream characteristics
– Type of data (schema)
– Data distribution
– Flow rate
– Stability of distribution and flow
– Ordering and other constraints
– Synchronization of multiple streams
– Distributed streams

Data Stream Queries -- Basic Issues
• Answer availability
– One-time
– Multiple-time
– Continuous (“standing”), stored or streamed
• Registration time
– Predefined
– Ad hoc
• Stream access
– Arbitrary
– Sliding window (special case: size = 1)
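
To make sliding-window access concrete, here is a minimal sketch of a count-based window that maintains a running sum over the most recent `size` elements. The count-based definition and the `sliding_window_sums` name are assumptions; the slides do not say whether windows are defined by element count or by time.

```python
from collections import deque


def sliding_window_sums(stream, size):
    """Yield the sum of the most recent `size` elements after each arrival."""
    window = deque()
    total = 0
    for value in stream:
        window.append(value)
        total += value
        if len(window) > size:
            total -= window.popleft()  # oldest element falls out of the window
        yield total


sums = list(sliding_window_sums([1, 2, 3, 4, 5], size=3))
# sums == [1, 3, 6, 9, 12]
```

With `size=1` the window holds only the current element, which corresponds to the special case noted above: the query sees each tuple once and nothing else.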

Query Language & Semantics
• Specifying queries over streams
– SQL-like versus dataflow network of operators
– Sliding windows as first-class query construct
• Semantic issues
– Blocking operators, e.g., aggregation, order-by
– Streams as sets versus lists
– Timestamping

Query Evaluation -- Approximation
• Why approximate?
– Streams are coming too fast
– Exact answer requires unbounded storage or significant computational resources
– Ad hoc queries reference history
• Issues in approximation
– Sliding windows, sampling, synopses, …
– How is approximation controlled?
– How is it understood by user?
• Accuracy-efficiency-storage tradeoff
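
Sampling is one of the approximation techniques listed above. As an illustration, the sketch below uses reservoir sampling (the classic "Algorithm R"), which keeps a uniform random sample of fixed size `k` no matter how long the stream runs. It is a standard technique chosen for illustration and is not claimed to be what STREAM implements; `reservoir_sample` and its parameters are hypothetical names.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from a stream.

    Memory use is O(k) regardless of stream length, which is the
    storage side of the accuracy-efficiency-storage tradeoff.
    """
    if rng is None:
        rng = random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(n)      # uniform integer in [0, n)
            if j < k:                 # happens with probability k/n
                sample[j] = item      # replace a random slot
    return sample


sample = reservoir_sample(range(1000), k=10, rng=random.Random(0))
```

A larger `k` gives more accurate estimates from the sample at the cost of more memory, so `k` is one knob through which the approximation could be controlled.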

Query Evaluation -- Adaptivity
• Why adaptivity?
– Queries are long-running
– Fluctuating stream arrival & data characteristics
– Evolving query loads
• Issues in adaptivity
– Adaptive resource allocation (memory, computation)
– Adaptive query execution plans

Query Evaluation -- Multiple Queries
• Possibly large number of continuous queries
• Long-running
• Shared resources
• Multi-query optimization

Query Evaluation -- Distributed Streams
1. Many physical streams but one logical stream
– E.g., maintain top 100 visited pages at Yahoo
2. Correlate streams at distributed servers
– E.g., network monitoring
3. Many streams controlled by a few servers
– E.g., sensor networks
• Issues
– Move processing to streams, not streams to processor
– Approximation-bandwidth tradeoff
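
The "move processing to streams" idea and the approximation-bandwidth tradeoff can be sketched with the top-pages example: each server summarizes its own visits and ships only its `m` most-visited pages to a coordinator, which merges them. This is a simplified illustration under assumed names (`local_summary`, `merge_summaries`), not the project's algorithm, and it gives no accuracy guarantee.

```python
from collections import Counter


def local_summary(page_visits, m):
    """At one server: count visits locally, keep only the m most-visited pages."""
    return Counter(page_visits).most_common(m)


def merge_summaries(summaries, k):
    """At the coordinator: add up the shipped counts and return the top k."""
    merged = Counter()
    for summary in summaries:
        for page, count in summary:
            merged[page] += count
    return merged.most_common(k)


server_a = ["home"] * 3 + ["news"] * 2 + ["mail"]
server_b = ["news"] * 4 + ["sports"] * 2 + ["home"]

# Shipping 2 entries per server: cheap, but "home" is undercounted
# because server_b's single "home" visit was not shipped.
approx = merge_summaries([local_summary(server_a, 2), local_summary(server_b, 2)], k=2)
# approx == [("news", 6), ("home", 3)]

# Shipping 3 entries per server: more bandwidth, exact counts here.
exact = merge_summaries([local_summary(server_a, 3), local_summary(server_b, 3)], k=2)
# exact == [("news", 6), ("home", 4)]
```

Raising `m` sends more data to the coordinator and reduces the error, which is the tradeoff in miniature.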

Query Processing Architecture
[Diagram labels: Input Data Streams; Output Stream; Query Plans; Synopses; Waiting Op; Ready Op; Running Op. Users issue continuous and ad hoc queries. Applications register continuous queries. Administrator can monitor query execution and adjust run-time parameters.]

DSMS Internals
• Query plans: operators, synopses, queues
• Memory management
– Dynamic allocation to buffers, queues, synopses
– Accuracy vs. memory tradeoff
– Operators adapt gracefully to memory reallocation
• Scheduler
– Handles variable-rate input streams
– Handles varying operator and query requirements

Some Results to Date
• Algorithms on data streams
– Online clustering [FOCS 2000, ICDE 2002]
– Online quantiles [SIGMOD 98, SIGMOD 99]
– Statistics over sliding windows [SODA 2002]
– Online frequency counting
• Theory of stream query processing
– Memory requirements of stream queries [PODS 2002]
• System design
– STREAM: Stanford Stream Data Manager
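
To give a flavor of online frequency counting in bounded memory, the sketch below uses the classic Misra-Gries summary. It is substituted here purely as an illustration; the slides do not say which algorithm the project's result refers to, and the `misra_gries` name is an assumption.

```python
def misra_gries(stream, k):
    """Summarize item frequencies using at most k - 1 counters.

    Any item occurring more than n/k times in a stream of length n is
    guaranteed to remain in the summary, and each reported count is
    lower than the true count by at most n/k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter, dropping zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = list("aabacadaea")   # "a" occurs 6 times out of 10
summary = misra_gries(stream, k=3)
# summary == {"a": 4}
```

The reported count for "a" is 4, which understates the true count of 6 by 2, within the bound of 10/3.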

STREAM System Implementation
• Comprehensive DSMS query processor
• Broad suite of operators and synopses
• Sophisticated “developer’s workbench” interface
– Submit queries in extended SQL or algebra
– Submit or edit query plans in XML or GUI
– Query plan execution visualizer
– On-the-fly modification of memory allocation, scheduling policies, etc.

Ongoing Work
• Algebra for streams
• Synopses and algorithmic issues
• Memory management issues
• Exploiting constraints on streams
• Approximation in query processing
• Distributed stream processing
• System development

Ongoing Work -- Constraints
• Exploiting constraints on streams in query processing
– Foreign-key joins, referential integrity, clustering, ordering
– Need not be exact (e.g., k-clustered)
– Reduce memory requirements
– Unblock blocking operators
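
As an illustration of how a constraint can unblock a blocking operator, suppose the Outgoing stream is clustered by caller, meaning all tuples for a given caller arrive contiguously. Under that assumption (which is hypothetical here, as is the `clustered_group_totals` name), a group's total is final as soon as a different caller appears, so the grouping query can emit results incrementally with constant memory.

```python
def clustered_group_totals(outgoing):
    """Yield a final (caller, total) per group from a caller-clustered stream.

    Assumes every caller's (caller, start, end) tuples are contiguous.
    Only the current group's state is kept.
    """
    current = None
    total = 0.0
    for caller, start, end in outgoing:
        if current is not None and caller != current:
            yield (current, total)  # previous group can no longer change
            total = 0.0
        current = caller
        total += end - start
    if current is not None:
        yield (current, total)      # last group: only known if the input ends


clustered = [
    ("alice", 0.0, 5.0),
    ("alice", 6.0, 7.0),
    ("bob", 1.0, 2.5),
]
totals = list(clustered_group_totals(clustered))
# totals == [("alice", 6.0), ("bob", 1.5)]
```

If the constraint held only approximately, as with the k-clustered case mentioned above, the operator would presumably need to hold several groups open at once instead of one.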

Ongoing Work -- Approximation in Query Processing
• Understanding behavior of approximate operators when composed
• Memory allocation to operators in a plan, given per-operator memory-accuracy curve
• Best query plan, assuming best memory allocation
• Multiple (weighted) queries sharing resources

Related Work
• Triggers, alerters, materialized views, continuous queries on conventional DBs, pub/sub, sequence & temporal databases, …
• Telegraph project at UC Berkeley
• Niagara project at Wisconsin/OGI
• Amazon project at Cornell
• Aurora project at Brown/MIT
• And others

For Papers and General Info.
http://www-db.stanford.edu/stream