Capturing Data Uncertainty in High-Volume Stream Processing

Yanlei Diao, Boduo Li, Anna Liu, Liping Peng, Charles Sutton, Thanh Tran, Michael Zink

University of Massachusetts, Amherst; University of California, Berkeley


Page 1:

Capturing Data Uncertainty in High-Volume Stream Processing

Yanlei Diao, Boduo Li, Anna Liu, Liping Peng, Charles Sutton†, Thanh Tran, Michael Zink

University of Massachusetts, Amherst

† University of California, Berkeley

Page 2:

Uncertain Data Streams

Uncertain data streams
• Environmental monitoring sensor networks
• Radio Frequency Identification (RFID) networks
• GPS systems
• Radar sensor networks

Data: incomplete, imprecise, misleading

Results: unknown quality

Page 3:

Scope of Our Problem

Data modeled as continuous random variables
• Many types of sensor data. More examples later…

High-volume data streams
• In contrast to probabilistic databases

An end-to-end solution
• Uncertainty of raw data
• Uncertainty of query processing results

Page 4:

Object Tracking and Monitoring

Mobile RFID readers
• Handheld, robot-mounted

Incomplete, noisy data
• Environmental factors
• Orientation of reading

Not directly queryable
• Raw data: <tag id, reader id, ts>
• Data needed for querying: e.g., precise object locations

Page 5:

Fire Monitoring Application

Display of solid merchandise shall not exceed 200 pounds per square foot of shelf area.

SELECT RSTREAM(area(R.(x,y,z)^p), sum(R.weight))
FROM R [PARTITION BY R.tag_id ROW 1]
GROUP BY area(R.(x,y,z)^p)
HAVING sum(R.weight) > 200

R: (time, tag_id, (x,y,z)^p)

What is the quality of the alert returned by this query?
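One way to put a number on the quality of such an alert is to estimate the probability that the HAVING condition really holds. The following is a minimal sketch, not the authors' method: it assumes a 1-D shelf coordinate in place of (x,y,z), Gaussian location uncertainty, and exactly known weights. The function name, the cell interval, and the sample data are all hypothetical.

```python
import random

def alert_probability(objects, cell, threshold=200.0, n_samples=10000, seed=0):
    """Estimate P(total weight located in `cell` > threshold) by sampling.

    objects: list of (mean_x, std_x, weight); x is a 1-D shelf coordinate (assumption).
    cell: (low, high) interval standing in for one square foot of shelf area.
    """
    rng = random.Random(seed)
    low, high = cell
    hits = 0
    for _ in range(n_samples):
        total = 0.0
        for mean_x, std_x, weight in objects:
            x = rng.gauss(mean_x, std_x)  # one possible true location
            if low <= x < high:
                total += weight
        if total > threshold:
            hits += 1
    return hits / n_samples

# Hypothetical data: the alert fires only if both items are inside the cell,
# and the second item sits close to the cell boundary at x = 1.0.
objects = [(0.5, 0.1, 120.0), (0.9, 0.2, 100.0)]
p = alert_probability(objects, cell=(0.0, 1.0))
print(f"P(alert) ~ {p:.2f}")
```

An alert reported together with this probability lets the consumer decide how much to trust it, instead of receiving a bare yes/no answer.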

Page 6:

Fire Monitoring Application

Alert when a flammable object is exposed to a high temperature.

SELECT RSTREAM(R.tag_id, R.(x,y,z)^p, T.temp^p)
FROM RFIDStream [RANGE 3 seconds] as R,
     TempStream [RANGE 3 seconds] as T
WHERE object_type(R.tag_id) = 'flammable' and
      T.temp^p > 60°C and
      location_equals(R.(x,y,z)^p, T.(x,y,z))

RFIDStream: (time, tag_id, (x,y,z)^p)

TempStream: (time, (x,y,z), temp^p)

What is the quality of the alert returned by this query?
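For the temperature predicate, the probability that an uncertain reading exceeds the threshold has a closed form if the reading is modeled as Gaussian. The sketch below rests on that assumption and on a further one, that the location match and the temperature are independent. Both function names and the input `p_location_match` are hypothetical.

```python
import math

def prob_greater(mean, std, threshold):
    """P(T > threshold) for T ~ Normal(mean, std**2)."""
    z = (threshold - mean) / (std * math.sqrt(2.0))
    return 0.5 * math.erfc(z)

def alert_confidence(temp_mean, temp_std, p_location_match, threshold=60.0):
    """Confidence of the joined alert, assuming the two predicates are independent."""
    return prob_greater(temp_mean, temp_std, threshold) * p_location_match

# Hypothetical reading: 62 °C with a 2 °C standard deviation,
# and an 80% chance that the object is at the sensor's location.
print(f"{alert_confidence(62.0, 2.0, 0.8):.3f}")
```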

Page 7:

Severe Weather Monitoring

[Figure: processing pipeline with stages Sensing, Transformation & Averaging, wireless transmission, Merging, Detection/Prediction, and Task Generation]

Page 8:

High-Volume, Uncertain Raw Data

High volume: 1.66 million data items, 205 Mb/sec per radar

Uncertainty:
• Environmental noise
• Device noise
• Transmit frequency
• System clock
• Positioner
• Antenna

[Figure: raw pulse data at the Sensing stage; axes: pulses 1 to 7 (time), gates (distance)]

Page 9:

Averaged Moment Data

[Figure: Sensing followed by Transformation & Averaging; axes: pulses 1 to 7 (time), gates (distance); output: moment data (velocity, reflectivity, …)]

Page 10:

Averaged Moment Data

[Figure: Sensing followed by Transformation & Averaging; axes: pulses 1 to 7 (time), gates (distance); output: moment data (velocity, reflectivity, …)]

Uncertainty: what is the effect of averaging over uncertain data?
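As a point of reference for this question, the standard variance identity for an average is sketched below. It assumes n values X_1, …, X_n with a common variance σ²; these symbols are introduced here for illustration and do not appear on the slide.

```latex
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
\operatorname{Var}(\bar{X}) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\operatorname{Cov}(X_i, X_j),
\qquad
\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n}\ \text{ if the } X_i \text{ are uncorrelated.}
```

Averaging therefore reduces noise variance, but each averaged value also summarizes a coarser region, so the uncertainty of the result depends on both the correlation structure and the averaging size.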

Page 11:

Merged Data

[Figure: pipeline with stages Sensing, Transformation & Averaging, wireless transmission, Merging, and Detection/Prediction]

Uncertainty: uneven distribution of data density

What is the quality of the detection result?

Page 12:

© KSWO TV

© Patrick Marsh May 8, 2007

Series of low-level circulations.

NWS tornado warnings: 7:16 pm, 7:39 pm, 8:29 pm

[Figure: panels labeled 7:21 pm, 8:15 pm, 9:54 pm, 11:00 pm]

Page 13:

Effect of Averaging of Uncertain Data

Averaging size   Moment data size (MB)   Detection running time (sec)   Reported tornados   False negatives
40               41.49                   27                             3.75                0
60               27.68                   23                             1.5                 2.25
80               20.79                   21                             0.5                 3.25
100              16.65                   21                             0.25                3.75
500              3.42                    20                             0                   3.75
1000             1.76                    20                             0                   3.75

Results of a 38-second trace at 8:10 pm on May 8, 2007. The averaging size 40 is used to represent detection results using fine-grained data.

Page 14:

Challenges

Raw data is inherently incomplete and noisy

Raw data is not directly queryable
• RFID: <ts, tag_id, reader_id> (raw); <ts, tag_id, (x,y,z)> (needed for querying)
• Radar: <ts, gate, (I,Q)h|v> (raw); <ts, gate, (reflectivity, velocity, …)> (needed for querying)

High-volume raw data streams
• RFID: hundreds of readings per second per reader
• Radar: 1.66 million data items per second per radar

Sophisticated query processing

Page 15:

System Overview

[Figure: system architecture with operators labeled T1 to T3, A1 to A4, and J1; annotations: tuples with lineage; archived tuples; confidence region; mean, variance, bounds]

Page 16:

Data Capture and Transformation

Transform raw streams into tuple streams with quantified uncertainty: compute p(X|O)
• Output: continuous random variables X, hidden
• Input: random variables O, observed

Existing work
• Statistical machine learning
• Sensor stream cleaning and processing

Our goal: choose appropriate statistical models, optimize for high-volume streams

Page 17:

RFID Streams: Modeling

A generative model characterizes how data is generated: p(X,O)
• X: true object location (x,y,z)
• O: boolean for RFID readings
• How the state of the world changes: object movement, reader motion
• How sensing generates data from the state of the world

Probabilistic Inference over RFID Streams in Mobile Environments. T. Tran, C. Sutton, R. Cocci, Y. Nie, Y. Diao, and P. Shenoy. ICDE 2009.
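A generative model with a state-change part and a sensing part is often written as a hidden Markov-style factorization. The form below is a sketch under that assumption and may differ from the model in the cited paper; the time index t and horizon T are introduced here for illustration.

```latex
p(X_{1:T}, O_{1:T}) = p(X_1)\,\prod_{t=2}^{T} p(X_t \mid X_{t-1})\,\prod_{t=1}^{T} p(O_t \mid X_t),
\qquad
p(X_{1:T} \mid O_{1:T}) = \frac{p(X_{1:T}, O_{1:T})}{\int p(x_{1:T}, O_{1:T})\, dx_{1:T}}
```

Here p(X_t | X_{t-1}) plays the role of "how the state of the world changes" and p(O_t | X_t) the role of "how sensing generates data from the state of the world".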

Page 18:

RFID Streams: Inference

Probabilistic inference over streams: p(X|O)
• Sampling-based inference
• Key to performance: using a small number of samples

              Standard sampling-based inference   Our optimizations
Accuracy      0.6 - 0.8 foot                      0.1 - 0.5 foot
Performance   0.1 reading/sec for 20 objects      > 1000 readings/sec for 20,000 objects

7 orders of magnitude improvement!
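To make "sampling-based inference" concrete, here is a minimal particle-filter sketch for one static tag on a 1-D aisle. It is a generic textbook-style filter, not the optimized algorithm reported on this slide. The sensor model `detect_prob`, the 0 to 10 ft aisle, the jitter, and all parameter values are hypothetical.

```python
import math
import random

def detect_prob(obj_x, reader_x, max_range=3.0):
    """Hypothetical sensor model: the read rate falls off linearly with distance.
    The 0.01 floor stands for spurious reads and keeps every weight positive."""
    d = abs(obj_x - reader_x)
    return max(0.01, 0.9 * (1.0 - d / max_range))

def particle_filter(readings, n_particles=200, seed=0):
    """readings: list of (reader_x, detected) pairs for one static tag."""
    rng = random.Random(seed)
    particles = [rng.uniform(0.0, 10.0) for _ in range(n_particles)]
    for reader_x, detected in readings:
        # Weight each particle by the likelihood of the observation, p(O | X).
        weights = []
        for x in particles:
            p = detect_prob(x, reader_x)
            weights.append(p if detected else 1.0 - p)
        # Resample in proportion to the weights, then jitter to keep diversity.
        particles = rng.choices(particles, weights=weights, k=n_particles)
        particles = [x + rng.gauss(0.0, 0.05) for x in particles]
    mean = sum(particles) / n_particles
    var = sum((x - mean) ** 2 for x in particles) / n_particles
    return mean, math.sqrt(var)

# Simulate a tag at x = 4.0 ft while a reader sweeps the aisle five times.
sim = random.Random(1)
true_x = 4.0
readings = []
for _ in range(5):
    for step in range(21):
        reader_x = 0.5 * step
        readings.append((reader_x, sim.random() < detect_prob(true_x, reader_x)))

est_mean, est_std = particle_filter(readings)
print(f"estimated location: {est_mean:.2f} ft (std {est_std:.2f})")
```

The cost per reading grows with the number of particles, which is why the slide singles out a small sample count as the key to performance.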

Page 19:

Radar Streams: Modeling

Again, a generative model p(X,O)?
• O: raw pulse data
• X: velocity, reflectivity, …
• Highly complex sensing process
• Extremely high volume, 1.66 million data items/sec

Uncertainty:
• Environmental noise
• Device noise
• Transmit frequency
• System clock
• Positioner
• Antenna
• …

[Figure: raw pulse data; axes: pulses 1 to 7 (time), gates (distance)]

Page 20:

Radar Streams: Model Fitting

Make output data X observable: p(X)
• Deterministic heuristic algorithm for O-X transformation

Fit a known model directly
• Moving Average (MA) model for p(X1, …, Xn)

Key to performance: model fitting at stream speed
• Identify sequences obeying MA at 1.66 million items/sec

[Figure: MA model diagram relating X1 to X7 and E1 to E7]
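For reference, the standard definition of a moving average model of order q is given below. The symbols μ, θ_i, and σ² are the usual textbook names rather than notation from the slides, and the Gaussian noise is an assumption.

```latex
X_t = \mu + E_t + \sum_{i=1}^{q} \theta_i\, E_{t-i},
\qquad
E_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2),
\qquad
\operatorname{Cov}(X_t, X_{t+k}) = 0 \ \text{ for } |k| > q
```

Each X_t depends only on the current and the previous q noise terms, so values more than q steps apart are uncorrelated.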

Page 21:

Initial Result of MA Fitting

[Figure: axes: distance from radar, MA seq. length; label: MA(5)]

Dynamically decide MA sequences for averaging

Efficiently compute distribution of averaging over MA sequences
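The variance of an average over an MA sequence follows in closed form from the MA autocovariances. The sketch below uses the textbook MA(q) formulas and is not necessarily the authors' algorithm; the function names and the convention θ_0 = 1 are assumptions. With Gaussian noise, the average is itself Gaussian, so its mean and this variance determine its full distribution.

```python
def ma_autocov(thetas, sigma2, lag):
    """Autocovariance of an MA(q) process at the given lag.
    thetas holds theta_1..theta_q; theta_0 = 1 is implied."""
    coeffs = [1.0] + list(thetas)
    lag = abs(lag)
    if lag >= len(coeffs):
        return 0.0  # values more than q steps apart are uncorrelated
    return sigma2 * sum(coeffs[j] * coeffs[j + lag]
                        for j in range(len(coeffs) - lag))

def mean_variance(thetas, sigma2, n):
    """Variance of the average of n consecutive values of the MA process."""
    total = n * ma_autocov(thetas, sigma2, 0)
    for k in range(1, n):
        total += 2 * (n - k) * ma_autocov(thetas, sigma2, k)
    return total / (n * n)

# White noise (q = 0) recovers sigma^2 / n; positive MA coefficients inflate it.
print(mean_variance([], 1.0, 40))
print(mean_variance([0.5, 0.3], 1.0, 40))
```

Only the first q autocovariances are non-zero, so the inner sum can stop at lag q in a streaming implementation.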

Page 22:

Relational Processing under Uncertainty

A relational paradigm for data processing after initial data capture and transformation
• Support […], […], Aggregation
• Compute a distribution for each result, modeled as a continuous random variable

Integral-based approach [Cheng et al., SIGMOD 2003]
• Exact, but too slow for stream processing

Sampling-based approach [Ge & Zdonik, ICDE 2008]
• Speed-accuracy tradeoff?
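The contrast between the two approaches can be illustrated on the simplest aggregate, the sum of two independent Uniform(0,1) variables. This is a toy sketch and does not reproduce either cited system; the function names, step count, and bin width are hypothetical.

```python
import random

def uniform_pdf(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def sum_pdf_integral(s, steps=2000):
    """pdf of X + Y at s via the convolution integral (midpoint rule over [0, 1])."""
    h = 1.0 / steps
    return h * sum(uniform_pdf((i + 0.5) * h) * uniform_pdf(s - (i + 0.5) * h)
                   for i in range(steps))

def sum_pdf_sampling(s, n_samples=100000, width=0.05, seed=0):
    """pdf of X + Y at s estimated from the share of samples in a small bin around s."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples)
               if abs(rng.random() + rng.random() - s) < width / 2)
    return hits / (n_samples * width)

# The true density of the sum is triangular, with value 0.5 at s = 0.5.
print(sum_pdf_integral(0.5), sum_pdf_sampling(0.5))
```

The integral version evaluates one convolution per query point and per pair of inputs, while the sampling version trades accuracy for speed through `n_samples`.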

Page 23:

Research Issues

Techniques for exact derivation that are natural for continuous random variables

Approximation
• Achieving speed-accuracy tradeoff more effectively

Correlated intermediate results
• When do they occur with […], […], Aggregation?
• Optimizations: avoid intermediate pdfs

• Complex function
• Lineage …

Page 24:

© KSWO TV

Much Work Lies Ahead…

Your comments are welcome.

Page 25:

RFID Streams: Speed vs. Accuracy

Page 26:

[Figure: axes: distance from radar, MA seq. length; labels: MA(5), MA(20)]

Dynamically decide MA sequences for averaging

Performance tradeoff

Page 27:

Aggregation: Speed vs. Accuracy

Algorithm     Throughput   Variance Distance [0,1]
Histogram     3382         0.083
CF (exact)    466          0
CF (approx)   10593        0.012
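The slides do not expand "CF"; the sketch below assumes it abbreviates characteristic function. It shows why characteristic functions suit aggregation: the CF of a sum of independent variables is the product of their CFs, so no convolution integral is needed. The Gaussian inputs and all function names are hypothetical, and this is not the authors' implementation.

```python
import cmath

def gaussian_cf(mean, var):
    """Characteristic function t -> E[exp(i t X)] of a Normal(mean, var) variable."""
    return lambda t: cmath.exp(1j * mean * t - 0.5 * var * t * t)

def sum_cf(cfs):
    """CF of a sum of independent variables: the product of their CFs."""
    def cf(t):
        result = 1.0 + 0j
        for f in cfs:
            result *= f(t)
        return result
    return cf

def moments_from_cf(cf, h=1e-3):
    """Mean and variance recovered from numerical derivatives of the CF at t = 0."""
    d1 = (cf(h) - cf(-h)) / (2 * h)               # approximates i * E[X]
    d2 = (cf(h) - 2 * cf(0.0) + cf(-h)) / h ** 2  # approximates -E[X^2]
    mean = d1.imag
    second_moment = -d2.real
    return mean, second_moment - mean * mean

# Sum of three hypothetical uncertain readings.
total = sum_cf([gaussian_cf(1.0, 0.5), gaussian_cf(2.0, 0.5), gaussian_cf(3.0, 1.0)])
agg_mean, agg_var = moments_from_cf(total)
print(f"mean {agg_mean:.3f}, variance {agg_var:.3f}")
```

For independent Gaussians the result can be checked by hand, since means and variances simply add.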