Probabilistic Aggregation in Distributed Networks
Ling Huang, Ben Zhao, Anthony Joseph and John Kubiatowicz
{hling, ravenben, adj, kubitron}@eecs.berkeley.edu
June 2004
Outline
- Background
- Motivation
- Statistical properties of real-life data streams
- Problems of existing approaches
- Our approach
  - Reduce communication overhead
  - Recover from loss
- Evaluation
- Conclusion and future work
Background
- Aggregate functions: MIN, MAX, AVG, COUNT, etc.
- In-network hierarchical processing
  - Query propagation
  - Tree construction
  - Aggregates computed epoch by epoch
- Addressing fault tolerance
  - Multi-root
  - Multi-tree
  - Reliable transmission
[Figure: an example COUNT query over an aggregation tree of nodes A–E; each node forwards a partial count up toward the root.]
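The in-network pattern above can be sketched as a recursive COUNT over a tree. This is a toy illustration, not the talk's protocol; the adjacency-map layout and function name are mine:

```python
def count_aggregate(tree, node):
    """In-network COUNT: each node sums its children's partial counts,
    adds 1 for itself, and forwards the result toward the root."""
    return 1 + sum(count_aggregate(tree, child) for child in tree.get(node, []))
```

For instance, with `tree = {"A": ["B", "C"], "C": ["D", "E"]}`, `count_aggregate(tree, "A")` returns 5, the size of the whole tree.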
Motivation
- Data aggregation is an important function for many network infrastructures:
  - Sensor networks
  - P2P networks
  - Network monitoring and intrusion detection systems
- Exact results are not achievable in the face of loss and faults
- Adding fault tolerance is expensive
- Accurate approximation with low communication overhead is crucial, but difficult to achieve
Observation: Comparison of Data Streams
[Figure: three real-world data traces and a random trace.]
Statistical Properties of Data Streams
The relative increment of a data stream is defined as:

    RI_i = (x_i - x_{i-1}) / x_{i-1}

[Figure: density estimation of the relative increment for each trace.]

Real data streams exhibit temporal correlation, which we can leverage to maintain aggregate accuracy while reducing communication overhead and recovering from data loss.
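The relative increment RI_i = (x_i − x_{i−1}) / x_{i−1} can be computed in a couple of lines; this is a minimal sketch (the function name is illustrative, not from the talk):

```python
def relative_increments(xs):
    """Relative increment of a data stream: RI_i = (x_i - x_{i-1}) / x_{i-1}."""
    return [(xs[i] - xs[i - 1]) / xs[i - 1] for i in range(1, len(xs))]
```

For example, the stream [2.0, 3.0, 1.5] has relative increments [0.5, -0.5].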
Problems in Existing Approaches
- Few approaches exploit temporal properties or are designed to handle data loss
  - TAG uses a simple last-value algorithm for loss recovery
  - Multi-root/multi-tree schemes make things worse by consuming more resources, and are fragile for large process groups
  - All relevant nodes must participate
- Difficult to trade accuracy for communication overhead
  - Many applications need this tradeoff: they only need an approximation, but want to minimize resource consumption
  - Olston et al. proposed adaptive filtering, but it is a centralized solution
Our Approach
- Probabilistic data aggregation: a scalable and robust approach
  - Exploit statistical properties of data streams in the temporal domain
  - Apply statistical algorithms to data aggregation
  - Develop a protocol that handles loss and failures as an essential part of normal operation
- Nodes participate in aggregation and communication according to a statistical sampling algorithm
- In the absence of data, estimate values using time-series algorithms
- Differentiate between voluntary and involuntary loss
Reducing Communication Overhead
- Trade off accuracy against resource consumption
- Allow selective participation of nodes while maintaining aggregate accuracy
- A node participates in an operation with a certain probability, which is the design parameter of the algorithm
- Sampling strategies:
  - Uniform sampling: all nodes use the identical sampling rate
  - Subtree-size based sampling: a node's sampling rate is proportional to the size of its subtree
  - Variance-based sampling: a node only reports a new value if it differs from its last reported value by more than a threshold percentage
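The three sampling strategies can be sketched as follows. This is a hedged illustration under my own assumptions: the function names, the threshold default, and the exact form of the subtree rate (base rate scaled by subtree size, capped at 1) are mine, not from the talk:

```python
import random

def uniform_sample(rate):
    """Uniform sampling: every node reports with the same probability."""
    return random.random() < rate

def subtree_sample(base_rate, subtree_size):
    """Subtree-size based sampling: the reporting probability grows in
    proportion to the size of the node's subtree (capped at 1)."""
    return random.random() < min(1.0, base_rate * subtree_size)

def variance_sample(new_value, last_reported, threshold=0.05):
    """Variance-based sampling: report only if the new value deviates from
    the last reported value by more than `threshold` (a fraction)."""
    if last_reported is None:
        return True  # nothing reported yet, so report
    return abs(new_value - last_reported) > threshold * abs(last_reported)
```

With a 5% threshold, a node whose last report was 100.0 stays silent at 102.0 but reports at 110.0.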
Performance of Sampling Algorithms
[Figure: MAX and AVG operations.] As fewer nodes participate, overall accuracy decreases for all algorithms. Uniform sampling performs worst; variance-based sampling is most accurate.
Observation: Long-Term Pattern in Data
[Figure: daily patterns and a long-term trend in a weekly data stream.]
Data source: bandwidth measurements for the CUDI network interface on an Abilene router, with 5-minute averages.
Two-Level Representation of Data
The data stream can be decomposed into two layers: a long-term trend (pattern), which changes slowly, and a residual, which has high frequency but low amplitude.
[Figure: Monday data decomposed into long-term trend and residual.]
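A minimal sketch of such a two-level decomposition, using a centered moving average as a stand-in for the spline-based trend fit described later in the talk (the window size is an arbitrary choice of mine):

```python
def decompose(xs, window=12):
    """Split a series into a slowly varying trend (centered moving average)
    and a high-frequency residual, so that trend[i] + residual[i] == xs[i]."""
    n = len(xs)
    half = window // 2
    trend = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        trend.append(sum(xs[lo:hi]) / (hi - lo))  # average over the window
    residual = [x - t for x, t in zip(xs, trend)]
    return trend, residual
```

By construction the two layers sum back to the original stream, so the residual captures exactly the high-frequency part the trend misses.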
Recovering From Loss
- Traditional approaches
  - Last-seen data as an approximation for the current epoch
  - Linear prediction
- Two-level data representation and prediction
  - Long-term trend: B-spline estimation
  - High-frequency residual: ARMA modeling
  - ARMA (AutoRegressive Moving Average) is a standard time-series technique for modeling chaotic data streams
Two-Level Data Prediction
- B-spline modeling for the long-term trend
  - Piecewise-continuous, low-degree B-splines can represent complex shapes
  - Least-squares B-spline regression for the two-level decomposition
  - B-spline extension for future forecasting
- ARMA forecasting for transient oscillation
  - System identification to determine the order of the model
  - Parameter estimation by an optimization algorithm
  - Low-complexity recursive equation for future forecasting
- Statistical properties used to calibrate prediction results
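To illustrate the residual-forecasting step, here is a one-parameter AR(1) stand-in for the full ARMA model: fit r_t ≈ φ·r_{t−1} by least squares, then iterate the recursion forward. The real system identifies the model order and estimates parameters more carefully; this sketch only shows the recursive-forecast idea:

```python
def ar1_forecast(residuals, steps=1):
    """Fit r_t ~ phi * r_{t-1} by least squares, then forecast `steps` ahead
    with the low-complexity recursion r_{t+1} = phi * r_t."""
    num = sum(a * b for a, b in zip(residuals[1:], residuals[:-1]))
    den = sum(r * r for r in residuals[:-1])
    phi = num / den if den else 0.0  # least-squares AR(1) coefficient
    preds, last = [], residuals[-1]
    for _ in range(steps):
        last = phi * last
        preds.append(last)
    return preds
```

On a geometrically decaying residual such as [1.0, 0.5, 0.25, 0.125], the fit recovers φ = 0.5 and the forecast continues the decay.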
Performance of Prediction Algorithms
[Figure: performance of prediction algorithms for the MAX operation in a lossless environment.]
Performance of Prediction Algorithms
[Figure: performance of prediction algorithms in lossy environments. The average loss rate of the network is 20%; the ratio of loss rates between wide-area links and local links is 3:1.]
Summary of Results
- All prediction algorithms are effective in improving the accuracy of aggregation results
- The two-level prediction approach performs best in all situations
  - Achieves more than 90% accuracy even with per-node non-participation rates of up to 60%
  - Is effective even in high-loss environments
Conclusion and Future Work
- Applied statistical algorithms to a data aggregation system
  - Quantified the statistical properties of real-world measurement data
  - Proposed the concept of probabilistic participation of nodes
  - Proposed a multi-level prediction mechanism to recover from sampling and data loss
- Uniqueness: multi-level prediction enables high accuracy even under high loss and voluntary non-participation
- Future work
  - Develop online algorithms and explore the tradeoff between prediction accuracy and computation/storage cost
  - Build a real system for network health monitoring, traffic measurement, and router statistics aggregation
  - Real-system implementation and deployment
The Danger of Prediction
[Figure: prediction without statistical calibration vs. prediction with statistical calibration.]