ics280 presentation by suraj nagasrinivasa

ICS280 Presentationby Suraj

Nagasrinivasa(1) Evaluating Probabilistic Queries over Imprecise Data (SIGMOD 2003) by R Cheng, D Kalashnikov, S Prabhakar

(2) Model-Driven Data Acquisition in Sensor Networks (VLDB 2004) by A Deshpande, C Guestrin, J Hellerstein, W Hong, S Madden

Acknowledgements: Dmitri Kalashnikov and Michal Kapalka

In typical sensor applications... Sensors monitor external environment

continuously

Sensor readings are sent back to the application

Decisions are often made based on these readings

However, we face uncertainty… Typically, DB/server collects sensor readings DB cannot store “true” sensor value at all

points in time Scarce battery power Limited network bandwidth

So, readings recorded at discrete time points Value of phenomenon continuously changing As a result, DB stored reading is mostly

obsolete

Scenario: Answering Minimum Query with discrete DB stored readings

x0 < y0: x is minimum

y1 < x1: y is minimum Wrong query result

x y

x0

x1

y0

y1

Recorded Temperature

Current Temperature

Scenario: Answering Minimum Query with error-bound readings I

x certainly gives the minimum temperature reading


Bound for Current Temperature

x y

x0

y0

Scenario: Answering Minimum Query with error-bound readings II

Both x and y have a chance of yielding the minimum value

Which one has a higher probability?


Bound for Current Temperature

x y

x0

y0

Probabilistic Queries

Based on variation characteristics of sensor value over time: Bounds can be estimated for possible values Probability distribution of values defined within bounds

Evaluate probability for query answers

Probabilistic queries give a correct answer, instead of a potentially incorrect answer

Rest of the paper…

Notation & Uncertainty Model Classification of Probabilistic Queries Evaluating Probabilistic Queries Quality of Probabilistic Queries Object Refreshment Policies Experimental Results

Notation

T : A set of DB objects (e.g. sensors)

a : Dynamic attribute (e.g. pressure)

Ti : ith object of T

Ti.a(t) : Value of ‘a’ in ‘Ti’ at time ‘t’

Uncertainty Model

[li(t) ui(t)]

Ti.a(t)

fi(x,t) – uncertainty pdf

Uncertainty IntervalUi(t)

Can be extended in ‘n’ dimensions

Classification of Probabilistic Queries Type of Result

Value-based: returns single value E.g. Minimum query ([l,u], pdf)

Entity-based: returns set of objects E.g. Range query ({(Ti, pi), pi>0})

Aggregation Non-Aggregate: query result for an object is independent of

other objects E.g. Range query

Aggregate: query result computed from set of objects E.g. Nearest Neighbor query

Classification of Probabilistic Queries

Value-based answer Entity-based answer

Non-aggregate

VSingleQWhat is the temperature of sensor x?

ERQ Which sensor has temperature between 10F and 30F?

Aggregate VAvgQ, VSumQ, VMinQ, VMaxQ What is the average temperature of the sensors?

ENNQ, EMinQ, EMaxQ Which sensor gives the highest temperature?

Query evaluation algorithms and quality metrics are developed for each class

ENNQ algorithm…Projection, Pruning, Bounding & Evaluation

ENNQ algorithm

Quality of Probabilistic Result Introduce a notion of “quality of answer” Proposed metrics for different classes of queries

5.0

|5.0| ipScore

"Is reading of sensor i in range [l,u] ?"

• regular range query"yes" or "no" with 100%

• probabilistic query ERQyes with pi = 95%: OKyes with pi = 5%: OK (95% it is not in [l, u])yes with pi = 50%: NOT OK (not certain!)

l u

a)b )c)

Ri

ip

RERQanofScore

5.0

|5.0|

||

1___

Quality for Entity-Aggregate Queries

"Which sensor, among n, has the minimum reading?"

Recall Result set R = {(Ti, pi)}

e.g. {(T1, 30%), (T2, 40%), (T3, 30%)} B is interval, bounding all possible values

e.g. minimum is somewhere in B = [10,20]

Our metrics for aggregate queries Min, Max, NN objects cannot be treated independently as in ERQ metric uniform distribution (in result set) is the worst case metrics are based on entropy

Quality for Entity-Aggregate Queries H(X) entropy of random variable X (X1 ,…,Xn

with p(X1) ,…, p(Xn))

entropy is smallest (i.e., 0) iff i : p(Xi) = 1 entropy is largest (i.e., log2(n)) iff all Xi's are

equally likely

)(

1log)()( 2

1 i

n

ii XpXpXH

Improving Answer Quality

Is important to pick right update policies that will help improve answer quality Global Choice

Glb_RR (pick random) Local Choice

Loc_RR (pick random) MaxUnc (heuristic chooses max. uncertainty interval ) MinExpEntropy (heuristic choose object with minimum

expected entropy)

Experiments: Simulation Set-up 1 server, 1000 sensors, limited network

bandwidth, “Min” queries tested

Queries arrival is a Poisson distribution

Each query over a random set of 100 sensors

Results

Conclusions

Probabilistic Querying for handling inherent uncertainty in sensor DBs

Classification, Algorithms and Quality of Answer metrics for various query types

Very general model of uncertainty which makes the algorithms not directly implement-able in any sensor network

Besides, in order to achieve any reasonable energy-efficiency in sensor networks, application and network requirements that dictate sensor nodes to be awake have to be tightly coordinated. Especially in the case of multi-hop routing

Outline for ‘Model Driven Data Acquisition for Sensor Networks’ Introduction

Motivation for Model-Based Queries Framework Concept Model Example – Multivariate Gaussian

Algorithm Resolving Model-Based Queries Incorporating Dynamicity Observation Plan / Cost model

Experiments BBQ System Results

Conclusions

Motivation for Model-Based Queries Declarative Queries adopted as key

programming paradigm for large sensor nets However, interpreting sensor nets as

databases results in two major problems: Misinterpretation of Data

Physically observable world is a set of continuous phenomenon in both time and space

Sensor readings are UNLIKELY to be random samples Inefficient approximate queries

If sensor readings are not “true” values, need for quantifying uncertainty to provide reliable answers

Motivation for Model-Based Queries Paper Contribution: To incorporate statistical

models of real-world processes into sensor net query processing architecture

Models help in: Accounting for biases in spatial sampling Identifying sensors providing faulty data Extrapolating values for missing sensors

Framework Concept

Goal: Given a query and model, to devise an efficient data acquisition plan to provide “best” possible answer

Major dependencies: Correlations between sensors captured by the

statistical model Correlation between attributes for given sensor Correlation between sensors for given attribute

Specific connectivity of the wireless network

Framework ConceptObservation Plan parameters

* Correlations in Value * Cost Differential

Framework Concept

Model Example – Multivariate Gaussian

Resolving Model-Based Queries (Range Queries)

Resolving Model-Based Queries(Value Queries) To compute value of Xi with maximum error

‘e’ and confidence ‘1-delta’: Compute mean of Xi (where o – observations)

As in range queries, find probability :

Range Queries for Gaussian

Projection for Gaussian is simple – just drop unnecessary values from mean and variance matrix

The integral

has to be computed.

Incorporating Dynamicity

Use historical measurements to improve confidence of answers

Given pdf in time ‘t’

Compute pdf at time ‘t+1’

Incorporating Dynamicity

Assumption: Markovian Model Dynamicity summarized by “transition model”

Observation Plan / Cost Model What is the cost of making ‘o’ observations? C(o) = acquisition cost + transmission cost Acquisition cost: constant for each attribute Transmission cost:

Network graph Edge weights (link quality) Paths taken could be sub-optimal

Observation Plan / Cost Model A set of attributes (‘theta’) to observe are

determined by computing expected benefit

And finding…

This, being similar to the traveling salesman’s problem, is best dealt with heuristic algorithms

BBQ System

BBQ: A Tiny-Model Query System

Uses Multivariate Gaussians

Has 24 transition models – for different hour of day

Results

Experiment: 11 sensors on a tree, 83000 measurements, 2/3 used for training and 1/3 for tests

Methodology BBQ builds a model based on training data One random query / hour taken – possible observations and

model is updated The answer is compared to the measured value

Compare with two other methods TinyDB: Each query broadcasted over sensor networks using an

overlay tree Approximate-Caching: Base station maintains a view of the

sensor readings

Results

Conclusion

Approximate queries can be well optimized, but model of physical phenomenon is needed

Defining an appropriate model is a challenge The framework works well for “fairly steady”

sensor data values Statistical model is largely static with

refinements to the model based on incoming queries and observations made as a result

ics280 presentation by suraj nagasrinivasa

Documents

minimum query

temperature of sensor

probabilistic query

highest temperature

average temperature

sensor readingsdb

true sensor value

minimum valuewhich