Improving Sketch Reconstruction Accuracy Using Linear Least Squares Method
Gene Moo Lee, Huiya Liu, Young Yoon, Yin Zhang
University of Texas at Austin
gene@cs.utexas.edu
IMC 2005, Berkeley, CA, USA
IMC’05
Roadmap
● Introduction to Sketch
● Problem Definition
● Our Approach
● Evaluation – Accuracy, Tolerance
● Conclusion and Future Work
Sketch: a data structure
● A sketch is a “lossy” data structure used to summarize massive data streams
  ○ Avoids per-flow state maintenance
  ○ Uses constant memory
  ○ Requires only a small number of memory accesses
● Sketches can be used for
  ○ Heavy-hitter detection, usage-based pricing, bandwidth provisioning, DoS attack detection
Sketch: a data structure
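Say we get an update (key k, value u)
Update (k, u): Tj[hj(k)] += u (for all j = 1, …, H)
[Figure: the sketch drawn as H hash tables (rows 1, …, j, …, H), each with K buckets (columns 0, 1, …, K-1). Keys k and k’ map to buckets h1(k), …, hH(k) and h1(k’), …, hH(k’); in row j they collide, hj(k) = hj(k’).]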
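The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `Sketch`, the table size, and the three toy hash functions are assumptions chosen for the example.

```python
class Sketch:
    """H hash tables with K counters each; every update touches one counter per table."""

    def __init__(self, hash_funcs, K):
        self.hash_funcs = hash_funcs            # one function per table, key -> 0..K-1
        self.K = K
        self.T = [[0] * K for _ in hash_funcs]  # T[j][bucket] is the counter of table j

    def update(self, key, value):
        # Tj[hj(k)] += u for all j
        for j, h in enumerate(self.hash_funcs):
            self.T[j][h(key)] += value


# Hypothetical hash functions for illustration (any functions into 0..K-1 would do).
K = 8
hash_funcs = [lambda k, a=a: (a * k + 1) % K for a in (3, 5, 7)]

sketch = Sketch(hash_funcs, K)
sketch.update(key=42, value=10)
sketch.update(key=42, value=5)
sketch.update(key=7, value=1)
```

Memory stays at H × K counters no matter how many distinct keys appear, which is the constant-memory property listed earlier.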
Point Estimation
Point estimation: key → value of the key?
Nontrivial because of collisions!
[Figure: the sketch diagram from the previous slide, where keys k and k’ collide in row j, hj(k) = hj(k’).]
Point Estimation
Countmin [5]: key → min_j { Tj[hj(k)] }
Can we do better than this?
[Figure: the same sketch diagram; the estimate for key k takes the minimum over the H buckets h1(k), …, hH(k).]
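To make the Countmin rule concrete, here is a small Python sketch. The helper names `build_tables` and `countmin_estimate`, the two hash functions, and the update stream are all hypothetical toy choices.

```python
def build_tables(updates, hash_funcs, K):
    """Feed (key, value) updates into H tables of K counters each."""
    tables = [[0] * K for _ in hash_funcs]
    for key, value in updates:
        for table, h in zip(tables, hash_funcs):
            table[h(key)] += value
    return tables


def countmin_estimate(tables, hash_funcs, key):
    """Countmin point estimate: the minimum of Tj[hj(key)] over all tables j."""
    return min(table[h(key)] for table, h in zip(tables, hash_funcs))


# Hypothetical toy data: two tables with K = 4 buckets.
K = 4
hash_funcs = [lambda k: k % K, lambda k: (k // K) % K]
updates = [(1, 10), (5, 3), (2, 7)]

tables = build_tables(updates, hash_funcs, K)
estimate = countmin_estimate(tables, hash_funcs, 1)
print(tables)    # [[0, 13, 7, 0], [17, 3, 0, 0]]
print(estimate)  # 13, while the true total for key 1 is 10
```

With non-negative updates every counter is an upper bound on the true value, so taking the minimum picks the least polluted counter. It can still overestimate, as key 1 shows here.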
Our Approach: lsquare
● Say we have a sketch and a set of keys
  ○ We want to accurately estimate the accumulated values of those keys
● Construct a linear system Ax = b from the information the sketch provides
● Find the optimal solution using the least squares method [10, 13]
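For reference, the least squares problem can be written out as follows. The reading of the symbols is an assumption based on the example in the following slides: x stacks one unknown per key of interest plus the noise variable(s), each row of A marks which unknowns fall into one bucket, and b holds the corresponding counter values.

```latex
\hat{x} \;=\; \arg\min_{x}\ \lVert A x - b \rVert_2^2 ,
\qquad
A^{\top} A\,\hat{x} \;=\; A^{\top} b
\quad\Longrightarrow\quad
\hat{x} \;=\; (A^{\top} A)^{-1} A^{\top} b
\quad \text{(when $A$ has full column rank)}
```

The middle identity is the standard normal-equation form. The slides do not say which solver was used, so this is only one way to compute the solution.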
An example: constructing a sketch
● A sketch with H = 2, K = 3
  ○ h1(j) = j mod 3, h2(j) = (j XOR 3) mod 3
● Total update values for keys
  ○ U0 = 5, U1 = 4, U2 = 3, U3 = 9, U4 = 16
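The example sketch can be reproduced directly from these numbers. The variable names below are illustrative; the hash functions and totals are the ones given on the slide.

```python
H, K = 2, 3
hash_funcs = [lambda j: j % 3, lambda j: (j ^ 3) % 3]  # h1 and h2 from the slide
totals = {0: 5, 1: 4, 2: 3, 3: 9, 4: 16}               # U0 .. U4

tables = [[0] * K for _ in range(H)]
for key, value in totals.items():
    for table, h in zip(tables, hash_funcs):
        table[h(key)] += value

print(tables)  # [[14, 20, 3], [14, 19, 4]]
```

Keys 0 and 3 share bucket 0 in both tables, which is why neither table alone can separate U3 from U0.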
An example: building a linear system
● Now we want to reconstruct the values of keys 3 and 4
From h1: X3 + Y = 14, X4 + Y = 20, Y = 3
From h2: X3 + Y = 14, X4 + Y = 19, Y = 4
Here, Y is a variable that captures the noise effect (the contribution of the other keys)
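One way to generate these six equations mechanically is sketched below. The function name `build_system` and the use of a single shared noise column are assumptions taken from this example, not a description of the authors' code.

```python
def build_system(tables, hash_funcs, keys):
    """One equation per bucket: (unknowns of the keys hashing there) + Y = counter."""
    A, b = [], []
    for table, h in zip(tables, hash_funcs):
        for bucket, counter in enumerate(table):
            row = [1 if h(key) == bucket else 0 for key in keys]
            row.append(1)  # single shared noise variable Y (assumption from the example)
            A.append(row)
            b.append(counter)
    return A, b


tables = [[14, 20, 3], [14, 19, 4]]                    # counters of the example sketch
hash_funcs = [lambda j: j % 3, lambda j: (j ^ 3) % 3]
A, b = build_system(tables, hash_funcs, keys=[3, 4])

for row, rhs in zip(A, b):
    print(row, rhs)
```

The columns of A correspond to X3, X4 and Y, so the row `[1, 0, 1]` with right-hand side 14 is the equation X3 + Y = 14.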
An example: solving the linear system
lsquare:
X3 = 10.5
X4 = 16
countmin:
X3 = min{14, 14} = 14
X4 = min{20, 19} = 19
answer:
U3 = 9
U4 = 16
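The lsquare numbers above can be checked with a small pure-Python solver. This is a sketch under assumptions: it solves the normal equations with exact fractions, which is fine for a 3-unknown example but is not necessarily the solver the authors used. The helper name `least_squares` is hypothetical.

```python
from fractions import Fraction


def least_squares(A, b):
    """Solve min ||Ax - b||^2 through the normal equations (A^T A) x = A^T b."""
    n = len(A[0])
    # Augmented matrix [A^T A | A^T b] in exact rational arithmetic.
    M = [[Fraction(sum(r[i] * r[j] for r in A)) for j in range(n)]
         + [Fraction(sum(r[i] * bi for r, bi in zip(A, b)))]
         for i in range(n)]
    # Gauss-Jordan elimination; assumes A has full column rank.
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


# The six equations of the example; columns are X3, X4, Y.
A = [[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 0, 1], [0, 1, 1], [0, 0, 1]]
b = [14, 20, 3, 14, 19, 4]

x3, x4, y = least_squares(A, b)
print(float(x3), float(x4), float(y))  # 10.5 16.0 3.5
```

The noise variable settles at Y = 3.5, the average of the two noise-only buckets (3 and 4), and subtracting it from the shared buckets gives X3 = 10.5 and X4 = 16.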
Evaluation – Data sets
May 2002 [Bell02]
Feb 2004 [Tera04]
IP addresses with traffic amounts
Evaluation – lsquare vs countmin
[Figure: relative error (Y axis) for the top 50 hitters (X axis), lsquare vs countmin.]
Lsquare is more accurate than countmin
Evaluation – Accuracy with Light hitters
[Figure: traffic amounts (Y axis) for the top 200 hitters (X axis); actual values vs countmin vs lsquare.]
Lsquare has good accuracy even for “light” hitters
Evaluation – Multiple noise variables
[Figure: relative error (Y axis) for the top 20 hitters (X axis), with 1 vs 31 vs 181 noise variables.]
We can get better accuracy by using more noise variables
Evaluation – Tolerance of limited memory
[Figure: average relative error (Y axis) across sketch configurations (X axis), lsquare vs countmin.]
Lsquare tolerates sketches with limited memory
Conclusion
● We propose a new method for point estimation in the sketch data structure
  ○ More accurate!
  ○ Tolerant of small sketches
● Future direction
  ○ Applying statistical inference to data streaming
Q&A
Thank you for your attention!
Questions?
Contact Info: gene@cs.utexas.edu
Evaluation – Time complexity
● In the experiments, lsquare took just 1 to 5 seconds
  ○ Time is a function of the number of heavy hitters, which is relatively small
● Lots of room for further speedup
  ○ Exploiting sparsity
How to get the set of keys
● Countmin estimates the value of one key at a time, whereas lsquare estimates the values of a “set” of keys jointly
● The set of keys can be obtained by
  ○ maintaining a priority queue
  ○ using a reversible sketch
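The priority-queue option might look like the following sketch. The slides give no details, so everything here is an assumption: the function name `track_heavy_hitters`, ranking candidates by their Countmin estimate at their latest update, and the toy hash functions and stream.

```python
import heapq


def track_heavy_hitters(updates, hash_funcs, K, capacity):
    """Keep up to `capacity` candidate keys, ranked by their latest sketch estimate."""
    tables = [[0] * K for _ in hash_funcs]
    estimate = {}  # current candidates: key -> estimate at its latest update
    heap = []      # min-heap of (estimate, key); may contain stale entries
    for key, value in updates:
        for table, h in zip(tables, hash_funcs):
            table[h(key)] += value
        est = min(table[h(key)] for table, h in zip(tables, hash_funcs))
        estimate[key] = est
        heapq.heappush(heap, (est, key))
        while len(estimate) > capacity:
            old_est, old_key = heapq.heappop(heap)
            # Skip stale entries (key already evicted or re-estimated since).
            if estimate.get(old_key) == old_est:
                del estimate[old_key]
    return sorted(estimate, key=estimate.get, reverse=True)


# Hypothetical stream: keys 1 and 3 carry most of the traffic.
hash_funcs = [lambda k: k % 16, lambda k: (k * 7) % 16]
updates = [(1, 100), (2, 1), (3, 50), (4, 2), (1, 100), (5, 1)]
heavy = track_heavy_hitters(updates, hash_funcs, K=16, capacity=2)
print(heavy)  # [1, 3]
```

The returned keys are the set that would then be handed to lsquare for joint estimation.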
Evaluation – Error Metric
We use the relative error of each estimate and its average over all IPs
n: number of IPs
Uest: estimated value
U: real value
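The formula itself did not survive in this transcript. A plausible reading, stated here as an assumption, is |Uest − U| / U per key, averaged over the n keys. The function names are illustrative, and the inputs reuse the numbers from the worked example earlier in the talk.

```python
def relative_error(estimate, actual):
    """|Uest - U| / U for one key (assumed form of the metric)."""
    return abs(estimate - actual) / actual


def average_relative_error(estimates, actuals):
    """Mean of the per-key relative errors over n keys."""
    errors = [relative_error(e, u) for e, u in zip(estimates, actuals)]
    return sum(errors) / len(errors)


actual = [9, 16]                                            # U3, U4
countmin_err = average_relative_error([14, 19], actual)     # countmin estimates
lsquare_err = average_relative_error([10.5, 16], actual)    # lsquare estimates
print(round(countmin_err, 3), round(lsquare_err, 3))
```

On the two-key example this gives roughly 0.37 for countmin against 0.08 for lsquare.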