The Stanford Data Streams Research Project

Profs. Rajeev Motwani & Jennifer Widom

And a cast of full- and part-time students: Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Gurmeet Manku, Liadan O’Callaghan, Justin Rosenstein, Qi Sun, Rohit Varma

stanfordstreamdatamanager

Data Streams

• Traditional DBMS -- data stored in finite, persistent data sets
• New applications -- data as multiple, continuous, rapid, time-varying data streams
  – Network monitoring and traffic engineering
  – Security applications
  – Telecom call records
  – Financial applications
  – Web logs and click-streams
  – Sensor networks
  – Manufacturing processes

Challenges

• Multiple, continuous, rapid, time-varying streams of data
• Queries may be continuous (not just one-time)
  – Evaluated continuously as stream data arrives
  – Answer updated over time
• Queries may be complex
  – Beyond element-at-a-time processing
  – Beyond stream-at-a-time processing

Using Traditional Database

[Diagram: User/Application, Loader, Query, Result]

New Approach for Data Streams

[Diagram: User/Application, Register Query, Stream Query Processor, Result, Scratch Space (Memory and/or Disk); together labeled Data Stream Management System (DSMS)]
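
A rough Python sketch of the register-then-stream flow in this diagram. The class StreamQueryProcessor, its register and push methods, and the latency records are all hypothetical names chosen for illustration; they are not STREAM's interface. The point is only that a query is registered once and then evaluated against every arriving record.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]


class StreamQueryProcessor:
    """Toy stand-in for a stream query processor (hypothetical API)."""

    def __init__(self) -> None:
        self._queries: Dict[str, Callable[[Record], bool]] = {}
        self._subscribers: Dict[str, Callable[[Record], None]] = {}

    def register(self, name: str,
                 predicate: Callable[[Record], bool],
                 on_result: Callable[[Record], None]) -> None:
        # A continuous query is registered once and stays active.
        self._queries[name] = predicate
        self._subscribers[name] = on_result

    def push(self, record: Record) -> None:
        # Every arriving record is checked against every registered query.
        for name, predicate in self._queries.items():
            if predicate(record):
                self._subscribers[name](record)


slow_links: List[Record] = []
processor = StreamQueryProcessor()
processor.register("slow_links", lambda r: r["latency_ms"] > 100, slow_links.append)

for record in [{"link": 1, "latency_ms": 20},
               {"link": 2, "latency_ms": 150},
               {"link": 1, "latency_ms": 300}]:
    processor.push(record)
```

Here the list slow_links plays the role of the result; a real system would also need the scratch space shown in the diagram for queries that keep state.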

DBMS versus DSMS

DBMS                                       DSMS
• Persistent relations                     • Transient streams (and persistent relations)
• One-time queries                         • Continuous queries
• Random access                            • Sequential access
• Access plan determined by query          • Unpredictable data arrival and
  processor and physical DB design           characteristics
• “Unbounded” disk store                   • Bounded main memory

Sample Applications

• Network management and traffic engineering (e.g., Sprint)
  – Streams of measurements and packet traces
  – Queries: detect anomalies, adjust routing
• Telecom call data (e.g., AT&T)
  – Streams of call records
  – Queries: fraud detection, customer call patterns, billing

Sample Applications (cont’d)

• Network security (e.g., iPolicy, NetForensics/Cisco, Netscreen)
  – Network packet streams, user session information
  – Queries: URL filtering, detecting intrusions & DOS attacks & viruses
• Financial applications (e.g., Traderbot)
  – Streams of trading data, stock tickers, news feeds
  – Queries: arbitrage opportunities, analytics, patterns

Sample Applications (cont’d)

• Web tracking and personalization (e.g., Yahoo, Google, Akamai)
  – Clickstreams, user query streams, log records
  – Queries: monitoring, analysis, personalization
• Truly massive databases (e.g., Astronomy Archives)
  – Stream the data by once (or over and over)
  – Queries do the best they can

Making Things Concrete

• Database = two streams of mobile call records
  – Outgoing(connectionID, caller, start, end)
  – Incoming(connectionID, callee, start, end)
• Query language = SQL
  FROM clauses can refer to streams and/or relations

Query Example 1

• Find all outgoing calls longer than 2 minutes (relational selection)

    SELECT O.connectionID, O.caller
    FROM Outgoing O
    WHERE O.end - O.start > 2

• Result requires unbounded storage
• Can provide result as data stream
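
A minimal Python sketch of the selection in Query Example 1, assuming call records arrive as dictionaries with start and end given in minutes. The record layout and the function name long_outgoing_calls are illustrative assumptions, not part of the project. A generator emits each qualifying tuple as it arrives, so the result is itself a stream and the operator holds no result storage.

```python
from typing import Dict, Iterable, Iterator, Tuple

Call = Dict[str, object]


def long_outgoing_calls(outgoing: Iterable[Call],
                        threshold: float = 2) -> Iterator[Tuple[object, object]]:
    """Stream out (connectionID, caller) for calls longer than `threshold` minutes."""
    for call in outgoing:
        if call["end"] - call["start"] > threshold:
            # Emit immediately; nothing about past tuples is remembered.
            yield call["connectionID"], call["caller"]


# Hypothetical sample records, in arrival order.
outgoing = [
    {"connectionID": 1, "caller": "alice", "start": 0, "end": 5},
    {"connectionID": 2, "caller": "bob", "start": 3, "end": 4},
    {"connectionID": 3, "caller": "carol", "start": 6, "end": 9},
]
result = list(long_outgoing_calls(outgoing))
```

Collecting the output with list() is only for the demonstration; over an unbounded stream a consumer would iterate instead, which is why storing the full result would need unbounded space.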

Query Example 2

• Pair up callers and callees (relational join)

    SELECT O.caller, I.callee
    FROM Outgoing O, Incoming I
    WHERE O.connectionID = I.connectionID

• Can still provide result as data stream
• Requires unbounded temporary storage (without additional assumptions)
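
One way to see why the join needs temporary storage is a symmetric hash join, sketched below in Python. This is a standard technique swapped in for illustration, not necessarily the operator used in STREAM, and the class and method names are assumptions. Each arriving tuple is stored and probed against everything seen so far on the other stream, so state grows with the input.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class SymmetricHashJoin:
    """Join Outgoing and Incoming on connectionID, one tuple at a time."""

    def __init__(self) -> None:
        self.callers_by_id: Dict[int, List[str]] = defaultdict(list)
        self.callees_by_id: Dict[int, List[str]] = defaultdict(list)

    def on_outgoing(self, connection_id: int, caller: str) -> List[Tuple[str, str]]:
        # Remember the tuple, then probe the other side's table.
        self.callers_by_id[connection_id].append(caller)
        return [(caller, callee)
                for callee in self.callees_by_id.get(connection_id, [])]

    def on_incoming(self, connection_id: int, callee: str) -> List[Tuple[str, str]]:
        self.callees_by_id[connection_id].append(callee)
        return [(caller, callee)
                for caller in self.callers_by_id.get(connection_id, [])]

    def state_size(self) -> int:
        """Number of tuples currently retained across both hash tables."""
        return (sum(len(v) for v in self.callers_by_id.values())
                + sum(len(v) for v in self.callees_by_id.values()))


join = SymmetricHashJoin()
pairs: List[Tuple[str, str]] = []
pairs += join.on_outgoing(1, "alice")
pairs += join.on_incoming(2, "dave")
pairs += join.on_incoming(1, "bob")
pairs += join.on_outgoing(2, "carol")
```

Nothing is ever evicted here, which is the unbounded case. Under an added assumption, for example that each connectionID occurs exactly once per stream, matched entries could be dropped after the first match.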

Query Example 3

• Find total connection time for each caller (relational grouping and aggregation)

    SELECT O.caller, sum(O.end - O.start)
    FROM Outgoing O
    GROUP BY O.caller

• Cannot provide result in (append-only) stream
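
A short Python sketch of why the aggregate does not fit an append-only stream, assuming the same dictionary records as in the earlier examples (an illustrative layout, not the project's). The best a streaming operator can do is emit a revised total per caller on every arrival, and each new total supersedes an earlier one.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Call = Dict[str, object]


def running_totals(outgoing: Iterable[Call]) -> Iterator[Tuple[object, int]]:
    """Emit (caller, total connection time so far) after every call record."""
    totals: Dict[object, int] = defaultdict(int)
    for call in outgoing:
        totals[call["caller"]] += call["end"] - call["start"]
        # This is an update to a previously emitted value, not a new final row.
        yield call["caller"], totals[call["caller"]]


updates = list(running_totals([
    {"connectionID": 1, "caller": "alice", "start": 0, "end": 5},
    {"connectionID": 2, "caller": "bob", "start": 3, "end": 4},
    {"connectionID": 3, "caller": "alice", "start": 6, "end": 9},
]))
```

The first and third entries of updates both concern alice, and the later one replaces the earlier one. An append-only consumer has no way to retract the stale total.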

Project Goal

Reconsider all aspects of data management and processing in presence of data streams

Remainder of Talk

• Data stream model
• Queries over data streams
  – Language, semantics, evaluation & optimization
• DSMS query processing architecture and system internals
• Results to date
• Ongoing work
• Related work

Data Model

• Database: relations + data streams
• Stream characteristics
  – Type of data (schema)
  – Data distribution
  – Flow rate
  – Stability of distribution and flow
  – Ordering and other constraints
  – Synchronization of multiple streams
  – Distributed streams

Data Stream Queries -- Basic Issues

• Answer availability
  – One-time
  – Multiple-time
  – Continuous (“standing”), stored or streamed
• Registration time
  – Predefined
  – Ad hoc
• Stream access
  – Arbitrary
  – Sliding window (special case: size = 1)
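
A small Python sketch of sliding-window stream access, assuming a count-based window (the last N tuples) and a running sum as the query. The class name SlidingWindowSum is made up for the example. With size = 1 the operator sees only the current tuple, which is the special case noted on the slide.

```python
from collections import deque
from typing import Deque


class SlidingWindowSum:
    """Sum over the most recent `size` values of a stream."""

    def __init__(self, size: int) -> None:
        if size < 1:
            raise ValueError("window size must be at least 1")
        self.window: Deque[float] = deque(maxlen=size)
        self.total: float = 0

    def push(self, value: float) -> float:
        # If the window is full, the oldest value is about to be evicted.
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]
        self.window.append(value)
        self.total += value
        return self.total


window3 = SlidingWindowSum(3)
sums = [window3.push(v) for v in [1, 2, 3, 4, 5]]

window1 = SlidingWindowSum(1)
latest = [window1.push(v) for v in [7, 9]]
```

Memory use is bounded by the window size regardless of how long the stream runs.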

Query Language & Semantics

• Specifying queries over streams
  – SQL-like versus dataflow network of operators
  – Sliding windows as first-class query construct
• Semantic issues
  – Blocking operators, e.g., aggregation, order-by
  – Streams as sets versus lists
  – Timestamping

Query Evaluation -- Approximation

• Why approximate?
  – Streams are coming too fast
  – Exact answer requires unbounded storage or significant computational resources
  – Ad hoc queries reference history
• Issues in approximation
  – Sliding windows, sampling, synopses, …
  – How is approximation controlled?
  – How is it understood by user?
• Accuracy-efficiency-storage tradeoff
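
As one concrete instance of sampling as a synopsis, the Python sketch below uses reservoir sampling (the classic Algorithm R). It is named here as an illustrative choice, not as the sampling method the project uses. It keeps a uniform random sample of fixed size k over a stream of unknown length, so k is the knob that trades storage for accuracy.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Return a uniform random sample of up to k items using O(k) memory."""
    reservoir: List[T] = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k items fill the reservoir.
            reservoir.append(item)
        else:
            # Item n replaces a random slot with probability k / n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(10_000), 100, random.Random(0))
short = reservoir_sample(range(5), 10, random.Random(0))
```

An aggregate such as an average can then be estimated from sample instead of from the full history, with error that shrinks as k grows.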

Query Evaluation -- Adaptivity

• Why adaptivity?
  – Queries are long-running
  – Fluctuating stream arrival & data characteristics
  – Evolving query loads
• Issues in adaptivity
  – Adaptive resource allocation (memory, computation)
  – Adaptive query execution plans

Query Evaluation -- Multiple Queries

• Possibly large number of continuous queries
• Long-running
• Shared resources
• Multi-query optimization

Query Evaluation -- Distributed Streams

1. Many physical streams but one logical stream
  – E.g., maintain top 100 visited pages at Yahoo
2. Correlate streams at distributed servers
  – E.g., network monitoring
3. Many streams controlled by a few servers
  – E.g., sensor networks
• Issues
  – Move processing to streams, not streams to processor
  – Approximation-bandwidth tradeoff
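
A toy Python sketch of the approximation-bandwidth tradeoff for the top-pages example, under the assumption that each server counts its own page views and ships only its m most frequent pages to a coordinator. The function names and the sample data are hypothetical. Shipping less saves bandwidth but can undercount pages that fall outside a server's local top m.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def local_summary(page_views: Iterable[str], m: int) -> Dict[str, int]:
    """Processing moved to the stream: count locally, ship only the top m."""
    return dict(Counter(page_views).most_common(m))


def merged_top_k(summaries: Iterable[Dict[str, int]], k: int) -> List[Tuple[str, int]]:
    """Coordinator: add up the shipped counts and report the top k."""
    merged: Counter = Counter()
    for summary in summaries:
        merged.update(summary)
    return merged.most_common(k)


server_a = ["home"] * 5 + ["news"] * 3 + ["mail"] * 1
server_b = ["news"] * 4 + ["mail"] * 2 + ["home"] * 1

approx = merged_top_k([local_summary(server_a, 2), local_summary(server_b, 2)], 2)
exact = Counter(server_a + server_b).most_common(2)
```

In this data the approximate answer ranks the pages correctly but reports 5 views for home where the exact count is 6, because server_b did not ship its single home view.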

Query Processing Architecture

[Diagram: Input Data Streams feed Query Plans made up of waiting, ready, and running operators, together with Synopses, and produce an Output Stream. Users issue continuous and ad hoc queries; applications register continuous queries; an administrator can monitor query execution and adjust run-time parameters.]

DSMS Internals

• Query plans: operators, synopses, queues
• Memory management
  – Dynamic allocation to buffers, queues, synopses
  – Accuracy vs. memory tradeoff
  – Operators adapt gracefully to memory reallocation
• Scheduler
  – Handles variable-rate input streams
  – Handles varying operator and query requirements
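
A deliberately simple Python sketch of operators connected by queues and driven by a scheduler. The policy used here, always running the operator with the longest input queue, is an assumption picked for illustration and is not claimed to be STREAM's scheduling policy. The Operator class and schedule function are likewise hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List, Optional


class Operator:
    """A plan operator with its own input queue."""

    def __init__(self, name: str, fn: Callable[[int], List[int]],
                 downstream: Optional["Operator"] = None) -> None:
        self.name = name
        self.fn = fn  # maps one input tuple to zero or more output tuples
        self.queue: Deque[int] = deque()
        self.downstream = downstream

    def run_once(self, sink: List[int]) -> None:
        tup = self.queue.popleft()
        for out in self.fn(tup):
            if self.downstream is None:
                sink.append(out)  # last operator writes the output stream
            else:
                self.downstream.queue.append(out)


def schedule(operators: List[Operator], sink: List[int]) -> None:
    """Run the operator with the longest queue until every queue is empty."""
    while True:
        ready = [op for op in operators if op.queue]
        if not ready:
            return
        max(ready, key=lambda op: len(op.queue)).run_once(sink)


double = Operator("double", lambda x: [x * 2])
evens = Operator("evens", lambda x: [x] if x % 2 == 0 else [], downstream=double)
evens.queue.extend(range(6))

output: List[int] = []
schedule([evens, double], output)
```

Because each queue is first-in first-out, the output order does not depend on which operator the scheduler favors; the policy only affects how large the queues grow in between.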

Some Results to Date

• Algorithms on data streams
  – Online clustering [FOCS 2000, ICDE 2002]
  – Online quantiles [SIGMOD 98, SIGMOD 99]
  – Statistics over sliding windows [SODA 2002]
  – Online frequency counting
• Theory of stream query processing
  – Memory requirements of stream queries [PODS 2002]
• System design
  – STREAM: stanfordstreamdatamanager
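
To give a flavor of online frequency counting in bounded memory, here is a Python sketch of the Misra-Gries summary. This is a well-known algorithm used as a stand-in; the slide does not say which algorithm the project developed, and this sketch should not be read as that result. With at most k - 1 counters, every item occurring more than n/k times in a stream of n items is guaranteed to remain in the summary, though its stored count may be an underestimate.

```python
from typing import Dict, Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """One-pass heavy-hitter summary using at most k - 1 counters."""
    if k < 2:
        raise ValueError("k must be at least 2")
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# n = 12 items, k = 3, so any item occurring more than 4 times must survive.
summary = misra_gries(["a"] * 6 + ["b"] * 3 + ["c", "d", "e"], k=3)
```

Only "a" exceeds the n/k threshold here, and it is the only key left in summary.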

STREAM System Implementation

• Comprehensive DSMS query processor
• Broad suite of operators and synopses
• Sophisticated “developer’s workbench” interface
  – Submit queries in extended SQL or algebra
  – Submit or edit query plans in XML or GUI
  – Query plan execution visualizer
  – On-the-fly modification of memory allocation, scheduling policies, etc.

Ongoing Work

• Algebra for streams
• Synopses and algorithmic issues
• Memory management issues
• Exploiting constraints on streams
• Approximation in query processing
• Distributed stream processing
• System development

Ongoing Work -- Constraints

• Exploiting constraints on streams in query processing
  – Foreign-key joins, referential integrity, clustering, ordering
  – Need not be exact (e.g., k-clustered)
  – Reduce memory requirements
  – Unblock blocking operators
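
A Python sketch of how a clustering constraint can unblock grouping and cut memory, assuming the stream is exactly clustered on the grouping key, meaning all tuples for a key arrive contiguously. The relaxed k-clustered case mentioned on the slide is not covered here, and the function name is illustrative. Under the exact assumption, a group's total is final as soon as the key changes, so only one group is held in memory.

```python
from typing import Hashable, Iterable, Iterator, Tuple


def grouped_sums_clustered(
        stream: Iterable[Tuple[Hashable, int]]) -> Iterator[Tuple[Hashable, int]]:
    """Sum values per key for a stream clustered on key, using O(1) state."""
    started = False
    current_key: Hashable = None
    total = 0
    for key, value in stream:
        if started and key != current_key:
            # Key changed: the previous group is complete and can be emitted.
            yield current_key, total
            total = 0
        current_key, started = key, True
        total += value
    if started:
        # Flush the last group; on an unbounded stream this waits for a new key.
        yield current_key, total


totals = list(grouped_sums_clustered([
    ("alice", 5), ("alice", 3), ("bob", 1), ("carol", 2), ("carol", 2),
]))
```

Each emitted row is final, so the output fits an append-only stream, unlike the unconstrained aggregation in Query Example 3.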

Ongoing Work -- Approximation in Query Processing

• Understanding behavior of approximate operators when composed
• Memory allocation to operators in a plan, given per-operator memory-accuracy curve
• Best query plan, assuming best memory allocation
• Multiple (weighted) queries sharing resources
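
A hedged Python sketch of the memory allocation problem, under strong simplifying assumptions that the slide does not make: each operator's memory-accuracy curve is given as a list indexed by memory units, the curves have diminishing returns, and plan quality is just the sum of operator accuracies. How accuracy actually composes across operators is the open question listed above, so this is only a toy. With those assumptions, greedily giving each unit to the operator with the largest marginal gain is a natural strategy.

```python
import heapq
from typing import Dict, List, Tuple


def allocate_memory(curves: Dict[str, List[int]], budget: int) -> Dict[str, int]:
    """curves[name][m] is the accuracy (in percent) of `name` with m memory units."""
    allocation = {name: 0 for name in curves}
    heap: List[Tuple[int, str]] = []
    for name, curve in curves.items():
        if len(curve) > 1:
            # Max-heap via negated marginal gain of the first unit.
            heapq.heappush(heap, (-(curve[1] - curve[0]), name))
    for _ in range(budget):
        if not heap:
            break  # every operator is already at the end of its curve
        _, name = heapq.heappop(heap)
        allocation[name] += 1
        m = allocation[name]
        curve = curves[name]
        if m + 1 < len(curve):
            heapq.heappush(heap, (-(curve[m + 1] - curve[m]), name))
    return allocation


# Hypothetical curves for two operators in one plan.
curves = {"join": [0, 50, 75, 85], "agg": [0, 40, 60, 70]}
allocation = allocate_memory(curves, budget=3)
```

With a budget of three units, the join receives two and the aggregate one, because the aggregate's first unit is worth more than the join's second.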

Related Work

• Triggers, alerters, materialized views, continuous queries on conventional DBs, pub/sub, sequence & temporal databases, …
• Telegraph project at UC Berkeley
• Niagara project at Wisconsin/OGI
• Amazon project at Cornell
• Aurora project at Brown/MIT
• And others

For Papers and General Info.

http://www-db.stanford.edu/stream
