Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson


Page 1

Streaming Queries over Streaming Data

Sirish Chandrasekaran (UC Berkeley)

Michael J. Franklin (UC Berkeley)

Presented by Andy Williamson

Page 2

About Me

• 3rd-year ISYE major
• Minor in Computer Science
• From Austin, TX
• Have visited every state but Alaska
• Intern at Deloitte Consulting focusing on SAP implementation

Page 3

Agenda

• Background/Motivation
• PSoup
  • Introduction
  • System Overview
  • Query Processing Techniques
  • Implementation
  • Performance
  • Aggregation Queries
  • Conclusions
• Critique

Page 4

Background/Motivation

• Continuous Query (CQ) systems
  • Treat queries as fixed entities and stream data over them
• Previous systems only allowed streaming of either data or queries
• Continuously deliver results as they are computed (infeasible/inefficient)
  • Data Recharging
  • Monitoring

Page 5

PSoup: Introduction

Query processor based on Telegraph query processing framework

Allows both data and queries to be streamed

Partially stores results to support disconnected operation and improve data throughput and response time

Page 6

PSoup: System Overview

• User initially registers query specification with system
• System returns handle which can be used to invoke results of query later
• Example query:

    SELECT *
    FROM Data_Stream D_s
    WHERE (D_s.a < x ^ D_s.b > y)
    BEGIN(NOW - 10)
    END(NOW);

• BEGIN-END clause allows:
  • Snapshot (constant beginning and ending time)
  • Landmark (constant beginning and variable ending time)
  • Sliding window (variable beginning and ending time)
• Limited by size of memory
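
The three window types differ only in which end of the BEGIN-END clause is constant and which moves with NOW. A minimal sketch of that classification, assuming each bound is either a constant timestamp or an offset from NOW (the names `Bound`, `window_kind`, and `resolve_window` are hypothetical and not taken from PSoup):

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Bound:
    """One end of a BEGIN-END clause (hypothetical representation).

    If `constant` is set, the bound is a fixed timestamp.
    Otherwise the bound is NOW + offset, so Bound() means NOW.
    """
    constant: Optional[int] = None
    offset: int = 0

    def resolve(self, now: int) -> int:
        return self.constant if self.constant is not None else now + self.offset


def window_kind(begin: Bound, end: Bound) -> str:
    """Classify a window using the three cases listed on the slide."""
    begin_fixed = begin.constant is not None
    end_fixed = end.constant is not None
    if begin_fixed and end_fixed:
        return "snapshot"   # constant beginning and ending time
    if begin_fixed:
        return "landmark"   # constant beginning, variable ending time
    if not end_fixed:
        return "sliding"    # variable beginning and ending time
    raise ValueError("variable begin with constant end is not one of the three types")


def resolve_window(begin: Bound, end: Bound, now: int) -> Tuple[int, int]:
    """Turn a BEGIN-END clause into concrete timestamps at invocation time."""
    return begin.resolve(now), end.resolve(now)
```

Under these assumptions, the example query's `BEGIN(NOW - 10) END(NOW)` corresponds to `Bound(offset=-10)` and `Bound()`, which classifies as a sliding window.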

Page 7

PSoup: System Overview

• PSoup treats execution of query streams as a join of query and data streams
• Maintains State Modules (SteMs) for queries and data
• One Query SteM for all queries in the system, and one Data SteM for each data stream

Page 8

PSoup: Query Processing Techniques

Overview

PSoup assigns unique queryID that it returns to the user

Client can disconnect, reconnect and execute query to obtain updated results

PSoup continuously matches data to query predicates in background and stores the results in its Results Structure

When a query is invoked, PSoup applies the appropriate input window to the Results Structure to return the current results

Page 9

PSoup: Query Processing Techniques

Entry of new query specs

• New queries split into two parts:
  • Standing Query Clause (SQC): consists of the SELECT-FROM-WHERE clauses
  • BEGIN-END clause, stored in separate WindowsTable structure
• SQC inserted into Query SteM
• Used to probe Data SteMs corresponding to tables in FROM clause
• Resulting tuples stored in Results Structure

Page 10

PSoup: Query Processing Techniques

Entry of new data

• New tuples assigned globally unique tupleID and physical timestamp (physicalID) based on system clock
• Inserted into appropriate Data SteM
• Then used to probe Query SteM to determine which SQCs it satisfies
• TupleIDs and physicalIDs stored in Results Structure
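
The symmetric treatment of new queries and new data can be sketched as follows. This is an illustrative simplification, not the PSoup implementation: it assumes a single data stream, selection predicates only, plain dictionaries in place of indexed SteMs, and physicalIDs that arrive in nondecreasing order. The class and method names (`PSoupSketch`, `register_query`, `insert_tuple`) are hypothetical.

```python
import itertools
import time


class PSoupSketch:
    """Toy model: a new query probes stored data, a new tuple probes stored queries."""

    def __init__(self):
        self.data_stem = {}      # tupleID -> (physicalID, row)
        self.query_stem = {}     # queryID -> predicate over a row (the SQC's WHERE)
        self.windows_table = {}  # queryID -> BEGIN-END clause, kept separately
        self.results = {}        # queryID -> list of (physicalID, tupleID)
        self._tuple_ids = itertools.count(1)
        self._query_ids = itertools.count(1)

    def register_query(self, predicate, window):
        """Insert the SQC into the Query SteM, then probe the Data SteM with it."""
        query_id = next(self._query_ids)
        self.query_stem[query_id] = predicate
        self.windows_table[query_id] = window
        matches = [(pid, tid)
                   for tid, (pid, row) in self.data_stem.items()
                   if predicate(row)]
        self.results[query_id] = sorted(matches)
        return query_id

    def insert_tuple(self, row, physical_id=None):
        """Insert the tuple into the Data SteM, then probe the Query SteM with it."""
        tuple_id = next(self._tuple_ids)
        if physical_id is None:
            physical_id = time.time()  # physical timestamp from the system clock
        self.data_stem[tuple_id] = (physical_id, row)
        for query_id, predicate in self.query_stem.items():
            if predicate(row):
                self.results[query_id].append((physical_id, tuple_id))
        return tuple_id
```

In this sketch a query registered after some data has arrived still sees the earlier tuples, because registration probes the stored data, and it keeps accumulating matches as later tuples arrive.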

Page 11

PSoup: Query Processing Techniques

Selection Queries over a single stream

Page 12

PSoup: Query Processing Techniques

Join Queries Over Multiple Streams

Page 13

PSoup: Query Processing Techniques

Query Invocation and Result Construction

Results Structure maintains info about which tuples in Data SteM(s) satisfy which SQCs in Query SteM

For each result tuple of each query, it stores tupleID and physicalID of all constituent base tuples of result tuple

Results of a query can be accessed by its queryID

Ordered by timestamp (physicalID)
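
Because each query's results are kept in timestamp order, applying a window at invocation time can be done with two binary searches rather than a scan. A minimal sketch, assuming results are stored as a sorted list of `(physicalID, tupleID)` pairs with numeric tupleIDs and an inclusive window (the function name `invoke` is hypothetical):

```python
from bisect import bisect_left, bisect_right


def invoke(results, begin, end):
    """Return the entries whose physicalID lies in the inclusive window [begin, end].

    `results` must already be sorted by physicalID.
    """
    # (begin,) sorts before every (begin, tupleID), so this finds the first
    # entry with physicalID >= begin.
    lo = bisect_left(results, (begin,))
    # (end, inf) sorts after every (end, tupleID), so this finds the position
    # just past the last entry with physicalID <= end.
    hi = bisect_right(results, (end, float("inf")))
    return results[lo:hi]
```

For example, with stored results at timestamps 85, 92, 97, 100, and 104, invoking a window of 90 to 100 returns the three middle entries.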

Page 14

PSoup: Implementation

• Eddy
  • Each tuple has a predicate attribute and an Interest List dictating where it is to be routed
  • Provides Stream Prefix Consistency by storing new and temporary tuples separately in the New Tuple Pool (NTP) and Temporary Tuple Pool (TTP)
  • Begins by selecting a tuple from the NTP and then processing everything in the TTP before picking another tuple from the NTP
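
The NTP/TTP scheduling rule can be sketched as a pair of queues. This is only an illustration of the ordering described above, assuming a hypothetical `process` callback that handles one tuple and returns whatever temporary tuples that step produces; routing via Interest Lists is not modelled.

```python
from collections import deque


def run_eddy(new_tuples, process):
    """Process tuples so that each new tuple's temporary tuples are fully
    drained before the next new tuple is picked.

    Returns the order in which tuples were handled.
    """
    ntp = deque(new_tuples)   # New Tuple Pool
    ttp = deque()             # Temporary Tuple Pool
    trace = []
    while ntp:
        current = ntp.popleft()
        trace.append(current)
        ttp.extend(process(current))
        while ttp:            # empty the TTP before touching the NTP again
            temp = ttp.popleft()
            trace.append(temp)
            ttp.extend(process(temp))
    return trace
```

With a `process` that emits one derived tuple per new tuple, the trace interleaves each new tuple with its derived tuple, so no later arrival is handled before earlier work is complete.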

Page 15

PSoup: Implementation

• Data SteM
  • Uses tree-based index for data to provide efficient access to probing queries
  • One red-black tree for every attribute
  • Maintains hash-based index over tupleIDs for fast access
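
The two access paths of the Data SteM (ordered lookup per attribute, hash lookup by tupleID) can be sketched as below. Python's standard library has no red-black tree, so this sketch substitutes a sorted list maintained with `bisect`; lookups are still logarithmic, but insertion is linear, unlike a balanced tree. The class and method names are hypothetical, and tupleIDs are assumed to be numeric.

```python
from bisect import bisect_left, bisect_right, insort


class DataSteM:
    """Toy Data SteM: one ordered index per attribute plus a hash index on tupleID."""

    def __init__(self, attributes):
        self.by_id = {}                                # tupleID -> row
        self.by_attr = {a: [] for a in attributes}     # attr -> sorted [(value, tupleID)]

    def insert(self, tuple_id, row):
        self.by_id[tuple_id] = row
        for attr, index in self.by_attr.items():
            insort(index, (row[attr], tuple_id))       # keep each index sorted

    def range_probe(self, attr, low, high):
        """Return tupleIDs with low < row[attr] < high, in attribute order."""
        index = self.by_attr[attr]
        lo = bisect_right(index, (low, float("inf")))  # skip every value <= low
        hi = bisect_left(index, (high,))               # stop before any value >= high
        return [tid for _, tid in index[lo:hi]]
```

A probing query with a predicate such as `a > 3 AND a < 9` would call `range_probe("a", 3, 9)` and then fetch the matching rows through `by_id`.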

Page 16

PSoup: Implementation

• Query SteM
  • Allows sharing of work between queries that have overlapping FROM clauses
  • Uses red-black trees to index single-attribute single-relation boolean factors of a query

Page 17

PSoup: Implementation

• Query SteM
  • For queries involving joins of multiple attributes, tree structure doesn’t work
  • Instead, a linked list called the predicateList is used
  • Query SteM contains an array in which each cell represents a query
  • At beginning of probe by a data tuple, each cell is set to the number of boolean factors in corresponding query
  • Every time tuple satisfies a boolean factor, cell value is decremented
  • At end of probe, if cell = 0, that means the data tuple satisfies the given query
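
The counting scheme described above can be sketched in a few lines. This is a simplified illustration: it assumes every boolean factor is a callable over a row, uses a Python list in place of the linked predicateList, and uses a dictionary in place of the per-query array. The function name `probe_queries` is hypothetical.

```python
def probe_queries(queries, row):
    """Return the IDs of queries whose boolean factors are all satisfied by `row`.

    `queries` maps queryID -> list of boolean factors (callables row -> bool).
    """
    # At the start of the probe, each cell holds the query's number of factors.
    remaining = {qid: len(factors) for qid, factors in queries.items()}

    # Flat list of (queryID, factor) pairs, standing in for the predicateList.
    predicate_list = [(qid, factor)
                      for qid, factors in queries.items()
                      for factor in factors]

    # Each satisfied factor decrements its query's cell.
    for qid, factor in predicate_list:
        if factor(row):
            remaining[qid] -= 1

    # A cell that reached zero means every factor of that query was satisfied.
    return [qid for qid, count in remaining.items() if count == 0]
```

The counter avoids re-evaluating a whole conjunction per query: each factor is checked once, and the final pass only inspects the cells.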

Page 18

PSoup: Implementation

• Results Structure
  • Stores metadata indicating which tuples satisfy which SQCs
  • Can be implemented either with the previously mentioned bitmap or by associating a linked list of satisfying data tuples with each query
  • Ordering by timestamp is simple for single-table queries
  • For join queries, typically use oldest timestamp
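
The two representations carry the same information and differ only in layout. A small sketch of both, assuming the bitmap is laid out with one row per tuple and one column per query (that layout is an assumption, as are the function names), plus the oldest-timestamp rule for join results:

```python
def to_bitmap(matches, tuple_ids, query_ids):
    """Dense form: bitmap[i][j] is True when tuple_ids[i] satisfies query_ids[j].

    `matches` is a set of (tupleID, queryID) pairs.
    """
    return [[(t, q) in matches for q in query_ids] for t in tuple_ids]


def to_lists(matches, query_ids):
    """Sparse form: for each query, the list of tupleIDs that satisfy it."""
    lists = {q: [] for q in query_ids}
    for t, q in sorted(matches):
        lists[q].append(t)
    return lists


def composite_timestamp(physical_ids):
    """Timestamp used to order a join result: the oldest constituent timestamp."""
    return min(physical_ids)
```

The dense form costs space proportional to tuples times queries regardless of selectivity, while the per-query lists cost space proportional to the number of matches.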

Page 19

PSoup: Performance

Implemented in Java with customized versions of Eddy and SteMs

• Examined performance of two versions:
  • PSoup-Partial (PSoup-P): Maintains results corresponding to SQCs in Results Structure, and applies BEGIN-END clauses to retrieve current results on query invocation
  • PSoup-Complete (PSoup-C): Continuously maintains results corresponding to current input window for each query in linked lists

NoMat: Measurements of a system that doesn’t materialize results

Page 20

PSoup: Performance

• Storage Requirements
  • NoMat: Storage cost = space taken to store base data streams within maximum window over which queries are supported, plus size of structures
  • PSoup-P: Storage cost = storage cost of NoMat + size of Results Structure (either bitarray or linked-list)
  • PSoup-C: Storage cost >> storage cost of PSoup-P, since PSoup-C always stores current results of standing queries at a given time

Page 21

PSoup: Performance

• Experimental Setup
  • Varied window sizes (2^7 to 2^16) and number (1-8) and type of boolean factors
  • Measured response time and maximum supportable data arrival rate
  • Examined both PSoup-P and PSoup-C with and without predicate indexes
  • Tested scheme to remove redundancies arising from joins
  • Used synthetically generated query streams (2^7 to 2^12 queries) and data streams

Page 22

PSoup: Performance

Response Time vs. Window Size

Page 23

PSoup: Performance

Response Time vs. # Interval Predicates

Page 24

PSoup: Performance

Data Arrival Rate vs. # SQCs

Page 25

PSoup: Performance

• Summary of Results
  • Materializing results of queries supports higher query invocation rates
  • Indexing queries and lazily applying windows improves maximum data throughput
  • PSoup-C requires more memory
  • PSoup-C optimizes query invocation rate
  • PSoup-P optimizes data arrival rate

Page 26

PSoup: Performance

• Removing Redundancy in Join Processing
  • Entry of a query specification or new data
  • Composite tuples in joins

Page 27

PSoup: Aggregation Queries

PSoup can support aggregate functions

Only possible to share data structures across queries with identical SELECT-PROJECT-JOIN clause

Page 28

PSoup: Conclusions

• Treats data and query streams analogously
• Can support queries that require access to data that arrived before and after the query
• Materializes results to cut down on response time and to support disconnected operation
• Enables data recharging and monitoring
• Future work:
  • Write data streams to disk and execute queries over them
  • Transfer queries between disk and memory, allowing query execution to be scheduled
  • Confront resource constraints when dealing with infinite streams
  • Query browser for temporal data

Page 29

Critique

• Strengths
  • Very well written, easy to follow
  • Clear examples, excellent explanation of performance results
  • Strong method that reduces processing time with increase in interval predicates
• Weaknesses
  • Lacking sufficient data on storage costs
  • Experimentation only tested one multiple-relation boolean factor for joins; unrealistic
  • Didn’t address whether same (or similar) query could be entered twice and accidentally given two IDs