continuously adaptive processing of data and query streams michael franklin uc berkeley april 2002...

38
Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group.

Post on 19-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

Continuously Adaptive Processing of Data and Query Streams

Michael FranklinUC Berkeley

April 2002

Joint work w/Joe Hellerstein and the Berkeley DB Group.

Page 2: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

2

Overview

Autonomic Computing is a technique for making computer systems self-managing.

But, real issue is supporting autonomic organizations.

They need information in a timely and tailored manner. workflows, business flows, news, business intelligence, sensor input,

The information infrastructure is the “nervous system” of the organization.

Adaptive, responsive data management systems are essential.

Page 3: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

3

Opportunities for Autonomic Data Mgmt.

The functions currently expected of the DBA must be automated. (see next talk)– Data placement, index selection, caching, pre-

fetching, “freshening”, …– We are pursuing an approach based on User

Profiles. The query processing engine itself must be made

“autonomic”.– This is the focus of today’s talk.– But, issues are related: profiles are a type of standing

query.

Page 4: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

4

Example 1: “Real-Time Business”

Event-driven processing B2B and Enterprise apps

– Supply-Chain, CRM– Trade Reconciliation, Order Processing etc.

(Quasi) real-time flow of events and data Must manage these flows to drive business

processes. Mine flows to create and adjust business rules. Can also “tap into” flows for on-line analysis.

Page 5: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

5

Example 2: Information Dissemination

User Profiles

Users

Filtered Data

Data Sources

•Doc creation or crawler initiates flow of data towards users.•Users + system initiate flow of profiles back towards data.

Page 6: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

6

Example 3: Sensor Nets

Tiny devices measure the environment.

– Berkeley “motes”, Smart Dust, Smart Tags, …

Major research thrust at Berkeley.– Apps: Transportation, Seismic, Energy,…

Form dynamic ad hoc networks, aggregate and communicate streams of values.

Sensors are the stimulus receptors of the autonomic organization.

Page 7: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

7

Common Features Centrality of dataflow

– Architecture is focused on data movement– Moving streams of data through code in a network

Requires intelligent, low-overhead routing

Volatility of the environment– Dynamic resources & topology, partial failures– Long-running (never-ending?) tasks– Potential for user interaction during the flow– Large Scale: users, data, resources, … Requires adaptivity

Page 8: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

8

Review: Distributed Query Processing

SELECT eid,ename,title FROM Emp, Proj, Assign

WHERE Emp.eid = Assign.eid

AND Proj.pid = Assign.pidAND Emp.loc <> Proj.loc

System handles query plan generation & optimization; ensures correct execution.

Originally conceived for corporate networks.

©1998 Ozsu and Valduriez

Page 9: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

9

– Sources may be unreachable or slow to respond.

– Data statistics/cost estimates may be unavailable or unreliable.

– No view into internal processing and structure.

Traditional, static query processing approaches cannot cope with such problems at run-time.

Wide-area + Wrapped sources Wide-area + Wrapped sources UnpredictabilityUnpredictability

Page 10: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

10

Batch processing is inappropriate for many apps.– Must provide feedback to the

user as quickly as possible.

Data access becomes a cooperative, iterative process:– User may correct, refine,

redirect, or change the query.

User-in-the-loop User-in-the-loop UnpredictabilityUnpredictability

Page 11: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

11

– MobilityLocation-centric queriesMoving endpoints change data

staging needs

– Data Streams/SensorsVarying data arrival ratesAdapting resolutionsPush vs. PullQueries can be long lived and streams

may be infinite so a static plan is doomed to failure.

Mobility & Data Streams Mobility & Data Streams UnpredictabilityUnpredictability

Page 12: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

12

Adaptive Query Processing

Existing systems are adaptive but at a slow rate.– Collect Stats– Compile and Optimize Query– Eventually collect stats again or change schema– Re-compile and optimize if necessary.

[Survey: Hellerstein, Franklin, et al., DE Bulletin 2000]

staticplans

currentDBMS

Page 13: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

13

Adaptive Query Processing

Goal: allow adjustments for runtime conditions. Basic idea: leave “options” in compiled plan. At runtime, bind options based on observed state:

– available memory, load, cache contents, etc. Once bound, plan is followed for entire query. [HP88,GW89,IN+92,GC94,AC+96, AZ96,LP97]

Dynamic,Parametric,

Competitive,…

staticplans

latebinding

currentDBMS

Page 14: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

14

Adaptive Query Processing

Start with a compiled plan. Observe performance between blocking (or

blocked) operators. Re-optimize remainder of plan if divergence from

expected data delivery rate [AF+96,UFA98] or data statistics [KD98].

Dynamic,Parametric,

Competitive,…

staticplans

latebinding

inter-operator

currentDBMS

Query Scrambling,MidQuery

Re-opt

Page 15: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

15

Adaptive Query Processing

Join operators themselves can be made adaptive:– to user needs (Ripple Joins [HH99])– to memory limitations (DPHJ [IF+99])– to memory limiations and delays (XJoin [UF00])

Plan Re-optimization can also be done in mid-operation (Convergent QP [IHW02])

Dynamic,Parametric,

Competitive,…

staticplans

latebinding

inter-operator

currentDBMS

Query Scrambling,MidQuery

Re-opt

Ripple Join,XJoin, DPHJ,Convergent

QP

intra-operator

Page 16: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

16

Adaptive Query Processing

This is the region that we are exploring in the Telegraph project at Berkeley.

Dynamic,Parametric,

Competitive,…

staticplans ???

latebinding

inter-operator

per tupl

ecurrentDBMS

Query Scrambling,MidQuery

Re-opt

Eddies,CACQ

XJoin, DPHJConvergent

QP

???

PSoup

intra-operator

Page 17: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

17

Telegraph Overview

An adaptive system for large-scale dataflow processing. A convergence of Data Management and Networking approaches.

Based on an extensible set of operators:1) Ingress (data access) operators

Screen Scraper, Napster/Gnutella readers, File readers, Sensor Proxies

2) Non-Blocking Data processing operators Selections (filters), XJoins, …

3) Adaptive Routing Operators Eddies, STeMs, FLuX, etc.

Page 18: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

18

Routing Operators: Eddies

• How to order and reorder operators over time?

– Traditionally, use performance, economic/admin feedback

– won’t work for never-ending queries over volatile streams

• Instead, use adaptive record routing.

Reoptimization = change in routing policy

[Avnur & Hellerstein, SIGMOD 00]

staticdataflow

A B

C

D

eddy

A B C D

Page 19: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

19

Eddy – Per Tuple Adaptivity

Adjusts flow adaptively– Tuples routed through ops in diff. orders– Must visit each operator once before output

State is maintained on a per-tuple basis Two complementary routing mechanisms

– Back pressure: each operator has a queue, don’t route to ops with full queue – avoids expensive operators.

– Lottery Scheduling: Give preference to operators that are more selective.

Initial Results showed eddy could “learn” good plans and adjust to changes in stream contents over time.

Currently in the process of exploring the inherent tradeoffs of such fine-grained adaptivity.

Page 20: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

20

Traditional Hash Joins block when one input stalls.

Non-Blocking Operators – Join

BuildProbe

Source A Source B

Hash Table A

Page 21: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

21

Non-Blocking Operators – Join

Hash Table A

Hash Table B

•Symm Hash Join [WA91] blocks only if both stall. •Processes tuples as they arrive from sources.•Produces all answer tuples and no spurious duplicates.

BuildProbe

Source A Source B

Page 22: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

22

SteMs:“State Modules” [Raman 01]

A generalization of the Symmetric Join Split join into modules that contain current state

of the operators.– Basically a store with one or more indexes.– Allows sharing of state across multiple queries.

Supports:

XJoin, Continuous Queries, Competitive processing, …

Page 23: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

23

Using SteMs for Joins

HashAHashB HashC

HashD

A B C D

•SteMs maintain intermediate state for multiple joins.•Use Eddy to route tuples through the necessary modules.

•Be sure to enforce “build then probe” processing.•Note, can also maintain results of individual joins in SteMs.

Page 24: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

24

CACQ [Madden et al, SIGMOD 02]

Stocks.

symbol = ‘MSFT’

Stocks.

symbol = ‘APPL’

Query 2

Query 1

Stock Quotes

‘MSFT’

‘APPL’

Stock Quotes

Expect hundreds to thousands of queries over same sources

Continuously Adaptive Continuous Queries– combine operators from many queries to

improve efficiency (i.e. share work).

Page 25: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

25

Combining Queries in CACQ

Q1 = Select * From S whereS.a = s1 and S.b = s4;

Q2 = Select * From S whereS.a = s2 and S.b = s5;

Q3 = Select * From S whereS.a = s3 and S.b = s6;

•Can also use SteMs to store query specifications.•Need a good predicate index if many queries.

•Eddy now is routing tuples for multiple queries simultaneously•CACQ leverages Eddies for adaptive CQ processing

•Results show advantages over static approaches.

Page 26: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

26

In The Beginning

Data

Query

Index

Result

Page 27: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

27

Pub Sub/CQ/Filtering

Queries

Dat

a

Ind

ex

Result

These systems canbe looked at as databasesystems turned upsidedown.

Page 28: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

28

PSoup [Chandrasekaran 02]

Logical (?) conclusion: – Treat queries and data as duals.

CACQ etc are “filters”:– No historical data.– Answers continuously returned as new data streams into

the system. Such continuous answers are often:

– Infeasible – intermittent connectivity and “Data Recharging” profiles.

– Inefficient – often user does not want to be continuously interrupted.

Page 29: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

29

PSoup (cont.)

Query processing is treated as an n-way symmetric join between data tuples and query specifications.

PSoup model:

repeated invocation of standing queries over windows of streaming data.

Page 30: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

30

PSoup: Query & Data Duality

Queries

Index

Result

DataData

Index

Page 31: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

31

PSoup: Query & Data Duality

Queries

Index

Result

Data

Index

Query

Page 32: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

32

PSoup (continued)

Queries specify a “window” (e.g. last 10 min) Queries are registered with the system

– These are entered into a “Query SteM” Data tuples stream into the system

– These are entered into a “Data SteM” Matching is done on the fly (in background).

– Effectively keeping shared materialized views. When a query is invoked:

– The query window is applied to the materialized answer and results are returned to the user.

Page 33: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

33

Data StoreID R.bR.a

34

00

83

37

48

51

50

48

52

49

PSoup

Query StoreID Predicate

0<R.a<=5

R.a=4 AND R.b=3

0>R.b>4

R.a>4 AND R.b=3

23

22

20

21

PSoup: New Select Query Arrival

SELECT *FROM RWHERE R.a <= 4 AND R.b >=3BEGIN (NOW – 600)END (NOW)

NEW QUERY

BUILD

R.a<=4 AND R.b>=324

Page 34: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

34

PROBE

Data StoreID R.bR.a

34

00

83

37

48

51

50

48

52

49R.a<=4 AND R.b>=324

New Selection Spec (continued)

Queries

2120

51

50

48

52

49

2322

T

F

24

F

T

F

Da

ta

RESULTS

Page 35: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

35

Data StoreID R.bR.a

34

00

83

37

48

51

50

48

52

49

Query StoreID Predicate

0<R.a<=5

R.a=4 AND R.b=3

0>R.b>4

R.a>4 AND R.b=3

23

22

20

21

24 R.a<=4 AND R.b>=3

Selection – new data

6353NEW DATA

BUILD 6353

PSoup

Page 36: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

36

PROBE 6353

Query StoreID Predicate

0<R.a<=5

R.a=4 AND R.b=3

0>R.b>4

R.a>4 AND R.b=3

23

22

20

21

24 R.a<=4 AND R.b>=3

Selection – new data (cont.)

Queries

2120

51

50

48

52

49

2322 24

Da

ta

RESULTS

53 FT FF T

Page 37: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

37

PSoup – Query Invocation

PSoup continuously maintains materialized views over streaming data and queries.

Data is not returned to user until the query is invoked.– Invocation requires applying “windows” to precomputed

results. Adaptive approach allows system to continuously

absorb new data and new queries without recompilation.

Lots of issues to study: – Query indexing, Spilling to disk, bulk processing

Page 38: Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group

38

Conclusions

The goal is to support autonomic organizations.– The information infrastructure is the “nervous system” of such

organizations. Existing QP technology is simply too brittle. The Telegraph project at Berkeley is “pushing the envelope” on

adaptive query processing.

staticplans ???

latebinding

inter-operator

per tupl

eEddies,CACQ

???

PSoup

intra-operator