Improving Sketch Reconstruction Accuracy Using Linear Least Squares Method
Gene Moo Lee, Huiya Liu, Young Yoon, Yin Zhang
University of Texas at Austin
IMC 2005, Berkeley, CA, USA
IMC’05
Roadmap
● Introduction to Sketch
● Problem Definition
● Our Approach
● Evaluation – Accuracy, Tolerance
● Conclusion and Future Work
Sketch: a data structure
● Sketch is a “lossy” data structure used to summarize massive data streams
○ Avoids per-flow state maintenance
○ Uses constant memory
○ Needs only a small number of memory accesses
● We can use a sketch for
○ Heavy-hitter detection, usage-based pricing, bandwidth provisioning, DoS attack detection
Sketch: a data structure
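Say we’ve got an update of (key k, value u)
Update (key, value): Tj[hj(k)] += u (for all j)
[Figure: a sketch with H hash tables (rows 1, …, j, …, H), each with K buckets (0, 1, …, K-1). Keys k and k’ map to buckets h1(·), …, hH(·); in table j they collide: hj(k) = hj(k’).]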
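The update rule can be sketched in a few lines of Python. The class name `Sketch`, the `seed` parameter, and the hash family `((a*k + b) mod p) mod K` are assumptions made for illustration; the slides do not say which hash functions are used.

```python
import random


class Sketch:
    """H hash tables with K counters each (illustrative layout)."""

    PRIME = 2_147_483_647  # modulus of the assumed hash family

    def __init__(self, H, K, seed=0):
        rng = random.Random(seed)
        self.K = K
        # One (a, b) pair per table: h_j(k) = ((a*k + b) mod PRIME) mod K.
        self.params = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(H)]
        self.tables = [[0] * K for _ in range(H)]

    def bucket(self, j, key):
        a, b = self.params[j]
        return ((a * key + b) % self.PRIME) % self.K

    def update(self, key, value):
        # T_j[h_j(k)] += u for every table j: H memory accesses, no per-key state.
        for j, table in enumerate(self.tables):
            table[self.bucket(j, key)] += value


sketch = Sketch(H=4, K=8)
for key, value in [(10, 5), (11, 4), (10, 2)]:
    sketch.update(key, value)
```

Memory stays at H × K counters no matter how many distinct keys appear in the stream, which is what makes the structure “lossy”.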
Point Estimation
Point estimation: key → value of the key?
Nontrivial because of collisions!
[Figure: the same sketch, with the buckets of keys k and k’ marked; hj(k) = hj(k’) is a collision.]
Point Estimation
Countmin [5]: key → minj { Tj[hj(k)] }
Can we do better than this?
[Figure: the same sketch; countmin takes the min over the buckets h1(k), …, hH(k).]
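A minimal sketch of the countmin point estimate, assuming non-negative updates. The two hash functions, the table width `K = 4`, and the key/value pairs are made up for illustration and do not come from the slides.

```python
def update(tables, hashes, key, value):
    """Add value to the bucket of key in every table."""
    for table, h in zip(tables, hashes):
        table[h(key)] += value


def countmin_estimate(tables, hashes, key):
    """Point estimate: the minimum over the H buckets the key maps to."""
    return min(table[h(key)] for table, h in zip(tables, hashes))


K = 4
hashes = [lambda k: k % K, lambda k: (k // 2) % K]  # assumed hash functions
tables = [[0] * K for _ in hashes]

actual = {1: 10, 5: 7, 9: 4, 2: 2}
for key, value in actual.items():
    update(tables, hashes, key, value)

estimates = {key: countmin_estimate(tables, hashes, key) for key in actual}
print(estimates)  # {1: 14, 5: 7, 9: 14, 2: 2}
```

Keys 1 and 9 collide in both tables, so both are estimated as 14. With non-negative updates the minimum never underestimates, but it cannot separate keys that share every bucket.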
Our Approach: lsquare
● Say we have a sketch and a set of keys
○ We want to accurately estimate the accumulated values of those keys
● Construct a linear system Ax = b, based on the information the sketch provides
● Find the optimal solution using the least squares method [10, 13]
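One way to write the problem down, as a sketch rather than the authors' exact formulation: x holds the unknown values of the keys of interest plus any noise variables, each bucket contributes one row of A marking which unknowns fall into it, and b holds the bucket counters. The dimension names m and n are assumptions introduced here.

```latex
\hat{x} = \arg\min_{x \in \mathbb{R}^{n}} \lVert A x - b \rVert_2^2,
\qquad A \in \mathbb{R}^{m \times n},\; b \in \mathbb{R}^{m}.
% If A has full column rank, the minimizer is unique and solves the normal equations:
A^{\top} A \, \hat{x} = A^{\top} b .
```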
An example: constructing a sketch
● A sketch with H=2, K=3
○ h1(j) = j mod 3, h2(j) = (j XOR 3) mod 3
● Total update values for keys
○ U0 = 5, U1 = 4, U2 = 3, U3 = 9, U4 = 16
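The example sketch can be rebuilt directly from these numbers. The variable names below are illustrative; the two hash functions and the totals are the ones given on the slide.

```python
H, K = 2, 3
hashes = [lambda j: j % 3, lambda j: (j ^ 3) % 3]  # the two hash functions above
totals = {0: 5, 1: 4, 2: 3, 3: 9, 4: 16}           # U0 .. U4

tables = [[0] * K for _ in range(H)]
for key, value in totals.items():
    for table, h in zip(tables, hashes):
        table[h(key)] += value

print(tables)  # [[14, 20, 3], [14, 19, 4]]
```

In table 1, keys 0 and 3 share bucket 0 and keys 1 and 4 share bucket 1. In table 2, keys 0 and 3 again share bucket 0, while key 4 shares bucket 1 with key 2.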
An example: building a linear system
● Now, we want to reconstruct the values of keys 3 and 4
Table 1: X3 + y = 14, X4 + y = 20, y = 3
Table 2: X3 + y = 14, X4 + y = 19, y = 4
Here, y is a variable that captures the noise effect, i.e. the contribution of all other keys
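The six equations can be generated mechanically, one per bucket. The helper name `build_system` and the ordering of the unknowns as [X3, X4, y] are assumptions for illustration; a single shared noise variable is used, as in the example.

```python
def build_system(tables, hashes, keys):
    """One equation per bucket: tracked keys in the bucket + y = counter.

    Unknowns are ordered as [x_key for key in keys] + [y], where y is a
    single shared noise variable (an assumption matching the example).
    """
    A, b = [], []
    for table, h in zip(tables, hashes):
        for bucket, counter in enumerate(table):
            row = [1 if h(key) == bucket else 0 for key in keys]
            row.append(1)  # coefficient of the noise variable y
            A.append(row)
            b.append(counter)
    return A, b


hashes = [lambda j: j % 3, lambda j: (j ^ 3) % 3]
tables = [[14, 20, 3], [14, 19, 4]]  # bucket counters of the example sketch
A, b = build_system(tables, hashes, keys=[3, 4])

for row, rhs in zip(A, b):
    print(row, rhs)
```

The system has six equations and three unknowns, so it is overdetermined and generally has no exact solution; that is why a least squares fit is used.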
An example: solving the linear system
lsquare:
X3 = 10.5
X4 = 16
countmin:
X3 = min{14, 14} = 14
X4 = min{20, 19} = 19
answer:
U3 = 9
U4 = 16
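The lsquare numbers can be checked with a small solver. This is a minimal sketch that solves the normal equations with exact rational arithmetic; it is not necessarily how the authors computed the solution, and `least_squares` is a hypothetical helper. In practice a library routine such as NumPy's `numpy.linalg.lstsq` would be the more usual choice; plain Python is used here only to keep the example dependency-free.

```python
from fractions import Fraction


def least_squares(A, b):
    """Solve min ||Ax - b||^2 through the normal equations (A^T A) x = A^T b.

    Uses exact rational Gauss-Jordan elimination; assumes A has full column rank.
    """
    n = len(A[0])
    # Augmented matrix [A^T A | A^T b].
    M = [[Fraction(sum(row[i] * row[j] for row in A)) for j in range(n)]
         + [Fraction(sum(row[i] * bi for row, bi in zip(A, b)))]
         for i in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


A = [[1, 0, 1], [0, 1, 1], [0, 0, 1],   # table 1: X3 + y, X4 + y, y
     [1, 0, 1], [0, 1, 1], [0, 0, 1]]   # table 2: X3 + y, X4 + y, y
b = [14, 20, 3, 14, 19, 4]

x3, x4, y = least_squares(A, b)
print(float(x3), float(x4), float(y))  # 10.5 16.0 3.5
```

The fitted noise term y = 3.5 is the average of the two noise-only buckets (3 and 4), and subtracting it from the colliding buckets gives the lsquare estimates listed above.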
Evaluation - data sets
May 2002 [Bell02]
Feb 2004 [Tera04]
IP addresses with traffic amounts
Evaluation – lsquare vs countmin
X axis = Top 50 hitters
Y axis = Relative error
Lsquare vs Countmin
Lsquare is more accurate than countmin
Evaluation – Accuracy with Light hitters
X axis = Top 200 hitters
Y axis = Traffic amounts
Actual vs Countmin vs Lsquare
Lsquare has good accuracy even for “light” hitters
Evaluation – Multiple noise variables
X axis = Top 20 hitters
Y axis = Relative error
# of noise variables: 1 vs 31 vs 181
We can get better accuracy by using more noise variables
Evaluation – Tolerance to limited memory
X axis = sketch config
Y axis = avg relative error
Lsquare vs Countmin
Lsquare tolerates sketches with limited memory
Conclusion
● We propose a new method for point estimation in the sketch data structure
○ More accurate!
○ Tolerant of small sketches
● Future Direction
○ Applying statistical inference in data streaming
Evaluation - Time Complexity
● In the experiment, lsquare took just 1~5 seconds
○ Time is a function of the number of heavy hitters, which is relatively small
● Lots of room for further speedup
○ exploiting sparsity
How to get the set of keys
● Countmin computes the value of one key at a time, but we estimate the values of a “set” of keys together
● The set of keys can be obtained by
○ maintaining a priority queue
○ using a reversible sketch
Evaluation – Error Metric
We use the relative error of each estimate and its average over all keys
n: # of IPs
Uest = estimation
U = real value
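The formula itself did not survive in this transcript. Assuming the usual definition, relative error = |Uest − U| / U, averaged over the n keys, the metric could be computed as below. The function names are hypothetical, and the inputs are the values from the worked example (keys 3 and 4), not the evaluation data sets.

```python
def relative_error(estimate, actual):
    """|Uest - U| / U for one key (assumed definition)."""
    return abs(estimate - actual) / actual


def average_relative_error(estimates, actuals):
    """Mean of the per-key relative errors over the n keys."""
    errors = [relative_error(e, u) for e, u in zip(estimates, actuals)]
    return sum(errors) / len(errors)


actual = [9, 16]      # U3, U4 from the worked example
lsquare = [10.5, 16]  # lsquare estimates
countmin = [14, 19]   # countmin estimates

print(round(average_relative_error(lsquare, actual), 4))   # 0.0833
print(round(average_relative_error(countmin, actual), 4))  # 0.3715
```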