ics280 presentation by suraj nagasrinivasa
DESCRIPTION
ICS280 Presentation by Suraj Nagasrinivasa. (1) Evaluating Probabilistic Queries over Imprecise Data (SIGMOD 2003) by R Cheng, D Kalashnikov, S Prabhakar (2) Model-Driven Data Acquisition in Sensor Networks (VLDB 2004) by A Deshpande, C Guestrin, J Hellerstein, W Hong, S Madden - PowerPoint PPT PresentationTRANSCRIPT
ICS280 Presentationby Suraj
Nagasrinivasa(1) Evaluating Probabilistic Queries over Imprecise Data (SIGMOD 2003) by R Cheng, D Kalashnikov, S Prabhakar
(2) Model-Driven Data Acquisition in Sensor Networks (VLDB 2004) by A Deshpande, C Guestrin, J Hellerstein, W Hong, S Madden
Acknowledgements: Dmitri Kalashnikov and Michal Kapalka
In typical sensor applications... Sensors monitor external environment
continuously
Sensor readings are sent back to the application
Decisions are often made based on these readings
However, we face uncertainty… Typically, DB/server collects sensor readings DB cannot store “true” sensor value at all
points in time Scarce battery power Limited network bandwidth
So, readings recorded at discrete time points Value of phenomenon continuously changing As a result, DB stored reading is mostly
obsolete
Scenario: Answering Minimum Query with discrete DB stored readings
x0 < y0: x is minimum
y1 < x1: y is minimum Wrong query result
x y
x0
x1
y0
y1
Recorded Temperature
Current Temperature
Scenario: Answering Minimum Query with error-bound readings I
x certainly gives the minimum temperature reading
Recorded Temperature
Bound for Current Temperature
x y
x0
y0
Scenario: Answering Minimum Query with error-bound readings II
Both x and y have a chance of yielding the minimum value
Which one has a higher probability?
Recorded Temperature
Bound for Current Temperature
x y
x0
y0
Probabilistic Queries
Based on variation characteristics of sensor value over time: Bounds can be estimated for possible values Probability distribution of values defined within bounds
Evaluate probability for query answers
Probabilistic queries give a correct answer, instead of a potentially incorrect answer
Rest of the paper…
Notation & Uncertainty Model Classification of Probabilistic Queries Evaluating Probabilistic Queries Quality of Probabilistic Queries Object Refreshment Policies Experimental Results
Notation
T : A set of DB objects (e.g. sensors)
a : Dynamic attribute (e.g. pressure)
Ti : ith object of T
Ti.a(t) : Value of ‘a’ in ‘Ti’ at time ‘t’
Uncertainty Model
[li(t) ui(t)]
Ti.a(t)
fi(x,t) – uncertainty pdf
Uncertainty IntervalUi(t)
Can be extended in ‘n’ dimensions
Classification of Probabilistic Queries Type of Result
Value-based: returns single value E.g. Minimum query ([l,u], pdf)
Entity-based: returns set of objects E.g. Range query ({(Ti, pi), pi>0})
Aggregation Non-Aggregate: query result for an object is independent of
other objects E.g. Range query
Aggregate: query result computed from set of objects E.g. Nearest Neighbor query
Classification of Probabilistic Queries
Value-based answer Entity-based answer
Non-aggregate
VSingleQWhat is the temperature of sensor x?
ERQ Which sensor has temperature between 10F and 30F?
Aggregate VAvgQ, VSumQ, VMinQ, VMaxQ What is the average temperature of the sensors?
ENNQ, EMinQ, EMaxQ Which sensor gives the highest temperature?
Query evaluation algorithms and quality metrics are developed for each class
ENNQ algorithm…Projection, Pruning, Bounding & Evaluation
ENNQ algorithm
Quality of Probabilistic Result Introduce a notion of “quality of answer” Proposed metrics for different classes of queries
5.0
|5.0| ipScore
"Is reading of sensor i in range [l,u] ?"
• regular range query"yes" or "no" with 100%
• probabilistic query ERQyes with pi = 95%: OKyes with pi = 5%: OK (95% it is not in [l, u])yes with pi = 50%: NOT OK (not certain!)
l u
a)b )c)
Ri
ip
RERQanofScore
5.0
|5.0|
||
1___
Quality for Entity-Aggregate Queries
"Which sensor, among n, has the minimum reading?"
Recall Result set R = {(Ti, pi)}
e.g. {(T1, 30%), (T2, 40%), (T3, 30%)} B is interval, bounding all possible values
e.g. minimum is somewhere in B = [10,20]
Our metrics for aggregate queries Min, Max, NN objects cannot be treated independently as in ERQ metric uniform distribution (in result set) is the worst case metrics are based on entropy
Quality for Entity-Aggregate Queries H(X) entropy of random variable X (X1 ,…,Xn
with p(X1) ,…, p(Xn))
entropy is smallest (i.e., 0) iff i : p(Xi) = 1 entropy is largest (i.e., log2(n)) iff all Xi's are
equally likely
)(
1log)()( 2
1 i
n
ii XpXpXH
Improving Answer Quality
Is important to pick right update policies that will help improve answer quality Global Choice
Glb_RR (pick random) Local Choice
Loc_RR (pick random) MaxUnc (heuristic chooses max. uncertainty interval ) MinExpEntropy (heuristic choose object with minimum
expected entropy)
Experiments: Simulation Set-up 1 server, 1000 sensors, limited network
bandwidth, “Min” queries tested
Queries arrival is a Poisson distribution
Each query over a random set of 100 sensors
Results
Conclusions
Probabilistic Querying for handling inherent uncertainty in sensor DBs
Classification, Algorithms and Quality of Answer metrics for various query types
Very general model of uncertainty which makes the algorithms not directly implement-able in any sensor network
Besides, in order to achieve any reasonable energy-efficiency in sensor networks, application and network requirements that dictate sensor nodes to be awake have to be tightly coordinated. Especially in the case of multi-hop routing
Outline for ‘Model Driven Data Acquisition for Sensor Networks’ Introduction
Motivation for Model-Based Queries Framework Concept Model Example – Multivariate Gaussian
Algorithm Resolving Model-Based Queries Incorporating Dynamicity Observation Plan / Cost model
Experiments BBQ System Results
Conclusions
Motivation for Model-Based Queries Declarative Queries adopted as key
programming paradigm for large sensor nets However, interpreting sensor nets as
databases results in two major problems: Misinterpretation of Data
Physically observable world is a set of continuous phenomenon in both time and space
Sensor readings are UNLIKELY to be random samples Inefficient approximate queries
If sensor readings are not “true” values, need for quantifying uncertainty to provide reliable answers
Motivation for Model-Based Queries Paper Contribution: To incorporate statistical
models of real-world processes into sensor net query processing architecture
Models help in: Accounting for biases in spatial sampling Identifying sensors providing faulty data Extrapolating values for missing sensors
Framework Concept
Goal: Given a query and model, to devise an efficient data acquisition plan to provide “best” possible answer
Major dependencies: Correlations between sensors captured by the
statistical model Correlation between attributes for given sensor Correlation between sensors for given attribute
Specific connectivity of the wireless network
Framework ConceptObservation Plan parameters
* Correlations in Value * Cost Differential
Framework Concept
Model Example – Multivariate Gaussian
Resolving Model-Based Queries (Range Queries)
Resolving Model-Based Queries(Value Queries) To compute value of Xi with maximum error
‘e’ and confidence ‘1-delta’: Compute mean of Xi (where o – observations)
As in range queries, find probability :
Range Queries for Gaussian
Projection for Gaussian is simple – just drop unnecessary values from mean and variance matrix
The integral
has to be computed.
Incorporating Dynamicity
Use historical measurements to improve confidence of answers
Given pdf in time ‘t’
Compute pdf at time ‘t+1’
Incorporating Dynamicity
Assumption: Markovian Model Dynamicity summarized by “transition model”
Observation Plan / Cost Model What is the cost of making ‘o’ observations? C(o) = acquisition cost + transmission cost Acquisition cost: constant for each attribute Transmission cost:
Network graph Edge weights (link quality) Paths taken could be sub-optimal
Observation Plan / Cost Model A set of attributes (‘theta’) to observe are
determined by computing expected benefit
And finding…
This, being similar to the traveling salesman’s problem, is best dealt with heuristic algorithms
BBQ System
BBQ: A Tiny-Model Query System
Uses Multivariate Gaussians
Has 24 transition models – for different hour of day
Results
Experiment: 11 sensors on a tree, 83000 measurements, 2/3 used for training and 1/3 for tests
Methodology BBQ builds a model based on training data One random query / hour taken – possible observations and
model is updated The answer is compared to the measured value
Compare with two other methods TinyDB: Each query broadcasted over sensor networks using an
overlay tree Approximate-Caching: Base station maintains a view of the
sensor readings
Results
Results
Conclusion
Approximate queries can be well optimized, but model of physical phenomenon is needed
Defining an appropriate model is a challenge The framework works well for “fairly steady”
sensor data values Statistical model is largely static with
refinements to the model based on incoming queries and observations made as a result