Probabilistic Aggregation in Distributed Networks
Ling Huang, Ben Zhao, Anthony Joseph and John Kubiatowicz
{hling, ravenben, adj, kubitron}@eecs.berkeley.edu
June 2004
Outline
- Background
- Motivation
- Statistical properties of real-life data streams
- Problems of existing approaches
- Our approach
  - Reduce communication overhead
  - Recover from loss
- Evaluation
- Conclusion and future work
Background
- Aggregate functions: MIN, MAX, AVG, COUNT, etc.
- In-network hierarchical processing
  - Query propagation
  - Tree construction
  - Aggregates computed epoch by epoch
- Addressing fault tolerance
  - Multi-root
  - Multi-tree
  - Reliable transmission
[Figure: an example COUNT query over an aggregation tree of nodes A–E; each node forwards a partial count up toward the root.]
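The in-network pattern above can be sketched as a recursive COUNT over a tree. This is a toy illustration, not the talk's protocol; the adjacency-map layout and function name are mine:

```python
def count_aggregate(tree, node):
    """In-network COUNT: each node sums its children's partial counts,
    adds 1 for itself, and forwards the result toward the root."""
    return 1 + sum(count_aggregate(tree, child) for child in tree.get(node, []))
```

For instance, with `tree = {"A": ["B", "C"], "C": ["D", "E"]}`, `count_aggregate(tree, "A")` returns 5, the size of the whole tree.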
Motivation
- Data aggregation is an important function for many network infrastructures:
  - Sensor networks
  - P2P networks
  - Network monitoring and intrusion detection systems
- Exact results are not achievable in the face of loss and faults
- Adding fault tolerance is expensive
- Accurate approximation with low communication overhead is crucial, but difficult to achieve
Observation: Comparison of Data Streams
[Figure: three real-world data traces and a random trace.]
Statistical Properties of Data Streams
The relative increment of a data stream is defined as:

    RI_i = (x_i - x_{i-1}) / x_{i-1}

[Figure: density estimation of the relative increment for each trace.]

Real data streams exhibit temporal correlation, which we can leverage to maintain aggregate accuracy while reducing communication overhead and recovering from data loss.
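The relative increment RI_i = (x_i − x_{i−1}) / x_{i−1} can be computed in a couple of lines; this is a minimal sketch (the function name is illustrative, not from the talk):

```python
def relative_increments(xs):
    """Relative increment of a data stream: RI_i = (x_i - x_{i-1}) / x_{i-1}."""
    return [(xs[i] - xs[i - 1]) / xs[i - 1] for i in range(1, len(xs))]
```

For example, the stream [2.0, 3.0, 1.5] has relative increments [0.5, -0.5].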
Problems in Existing Approaches
- Few approaches exploit temporal properties or are designed to handle data loss
  - TAG uses a simple last-value algorithm for loss recovery
  - Multi-root/multi-tree schemes make things worse by consuming more resources, and are fragile for large process groups
  - All relevant nodes must participate
- Difficult to trade accuracy for communication overhead
  - Many applications need this tradeoff: they only need an approximation, but want to minimize resource consumption
  - Olston et al. proposed adaptive filtering, but it is a centralized solution
Our Approach
- Probabilistic data aggregation: a scalable and robust approach
  - Exploit statistical properties of data streams in the temporal domain
  - Apply statistical algorithms to data aggregation
  - Develop a protocol that handles loss and failures as an essential part of normal operation
- Nodes participate in aggregation and communication according to a statistical sampling algorithm
- In the absence of data, estimate values using time-series algorithms
- Differentiate between voluntary and involuntary loss
Reducing Communication Overhead
- Trade off accuracy against resource consumption
- Allow selective participation of nodes while maintaining aggregate accuracy
- A node participates in an operation with a certain probability, which is the design parameter of the algorithm
- Sampling strategies:
  - Uniform sampling: all nodes use the identical sampling rate
  - Subtree-size based sampling: a node's sampling rate is proportional to the size of its subtree
  - Variance-based sampling: a node only reports a new value if it differs from its last reported value by more than a threshold percentage
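The three sampling strategies can be sketched as follows. This is a hedged illustration under my own assumptions: the function names, the threshold default, and the exact form of the subtree rate (base rate scaled by subtree size, capped at 1) are mine, not from the talk:

```python
import random

def uniform_sample(rate):
    """Uniform sampling: every node reports with the same probability."""
    return random.random() < rate

def subtree_sample(base_rate, subtree_size):
    """Subtree-size based sampling: the reporting probability grows in
    proportion to the size of the node's subtree (capped at 1)."""
    return random.random() < min(1.0, base_rate * subtree_size)

def variance_sample(new_value, last_reported, threshold=0.05):
    """Variance-based sampling: report only if the new value deviates from
    the last reported value by more than `threshold` (a fraction)."""
    if last_reported is None:
        return True  # nothing reported yet, so report
    return abs(new_value - last_reported) > threshold * abs(last_reported)
```

With a 5% threshold, a node whose last report was 100.0 stays silent at 102.0 but reports at 110.0.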
Performance of Sampling Algorithms
[Figure: MAX and AVG operations.] As fewer nodes participate, overall accuracy decreases for all algorithms. Uniform sampling performs worst; variance-based sampling is most accurate.
Observation: Long-Term Pattern in Data
[Figure: daily patterns and a long-term trend in a weekly data stream.]
Data source: bandwidth measurements for the CUDI network interface on an Abilene router, with 5-minute averages.
Two-Level Representation of Data
The data stream can be decomposed into two layers: a long-term trend (pattern), which changes slowly, and a residual, which has high frequency but low amplitude.
[Figure: Monday data decomposed into long-term trend and residual.]
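A minimal sketch of such a two-level decomposition, using a centered moving average as a stand-in for the spline-based trend fit described later in the talk (the window size is an arbitrary choice of mine):

```python
def decompose(xs, window=12):
    """Split a series into a slowly varying trend (centered moving average)
    and a high-frequency residual, so that trend[i] + residual[i] == xs[i]."""
    n = len(xs)
    half = window // 2
    trend = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        trend.append(sum(xs[lo:hi]) / (hi - lo))  # average over the window
    residual = [x - t for x, t in zip(xs, trend)]
    return trend, residual
```

By construction the two layers sum back to the original stream, so the residual captures exactly the high-frequency part the trend misses.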
Recovering From Loss
- Traditional approaches
  - Last-seen data as an approximation for the current epoch
  - Linear prediction
- Two-level data representation and prediction
  - Long-term trend: B-spline estimation
  - High-frequency residual: ARMA modeling
  - ARMA (AutoRegressive Moving Average) is a standard time-series technique for modeling chaotic data streams
Two-Level Data Prediction
- B-spline modeling for the long-term trend
  - Piecewise-continuous, low-degree B-splines can represent complex shapes
  - Least-squares B-spline regression for the two-level decomposition
  - B-spline extension for future forecasting
- ARMA forecasting for transient oscillation
  - System identification to determine the order of the model
  - Parameter estimation by an optimization algorithm
  - Low-complexity recursive equation for future forecasting
- Statistical properties used to calibrate prediction results
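To illustrate the residual-forecasting step, here is a one-parameter AR(1) stand-in for the full ARMA model: fit r_t ≈ φ·r_{t−1} by least squares, then iterate the recursion forward. The real system identifies the model order and estimates parameters more carefully; this sketch only shows the recursive-forecast idea:

```python
def ar1_forecast(residuals, steps=1):
    """Fit r_t ~ phi * r_{t-1} by least squares, then forecast `steps` ahead
    with the low-complexity recursion r_{t+1} = phi * r_t."""
    num = sum(a * b for a, b in zip(residuals[1:], residuals[:-1]))
    den = sum(r * r for r in residuals[:-1])
    phi = num / den if den else 0.0  # least-squares AR(1) coefficient
    preds, last = [], residuals[-1]
    for _ in range(steps):
        last = phi * last
        preds.append(last)
    return preds
```

On a geometrically decaying residual such as [1.0, 0.5, 0.25, 0.125], the fit recovers φ = 0.5 and the forecast continues the decay.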
Performance of Prediction Algorithms
[Figure: performance of prediction algorithms for the MAX operation in a lossless environment.]
Performance of Prediction Algorithms
[Figure: performance of prediction algorithms in lossy environments. The average loss rate of the network is 20%; the ratio of loss rates between wide-area links and local links is 3:1.]
Summary of Results
- All prediction algorithms are effective in improving the accuracy of aggregation results
- The two-level prediction approach performs best in all situations
  - Achieves more than 90% accuracy even with per-node non-participation rates of up to 60%
  - Is effective even in high-loss environments
Conclusion and Future Work
- Applied statistical algorithms to a data aggregation system
  - Quantified the statistical properties of real-world measurement data
  - Proposed the concept of probabilistic participation of nodes
  - Proposed a multi-level prediction mechanism to recover from sampling and data loss
- Uniqueness: multi-level prediction enables high accuracy even under high loss and voluntary non-participation
- Future work
  - Develop online algorithms and explore the tradeoff between prediction accuracy and computation/storage cost
  - Build a real system for network health monitoring, traffic measurement, and router statistics aggregation
  - Real-system implementation and deployment
The Danger of Prediction
[Figure: prediction without statistical calibration vs. prediction with statistical calibration.]