How to Summarize the Universe: Dynamic Maintenance of Quantiles
Gilbert, Kotidis, Muthukrishnan, Strauss
Presented by Itay Malinger
December 2003
Problem Definition
► The universe: U = {0, …, |U|-1}
► Number of records in the data set: ||A|| = N
► The data set can be thought of as an array: A[i] is the number of records with value i
► A_S is the number of records with values in S
► The Φ-quantiles of an ordered sequence of N data items are the values with rank kΦN, for k = 1, 2, …, 1/Φ
► Our goal is computing ε-approximate Φ-quantiles: find a j_k such that, for k = 1, 2, …, 1/Φ,
$$\sum_{i \le j_k} A[i] \ge (k\Phi - \varepsilon)N \quad\text{and}\quad \sum_{i < j_k} A[i] \le (k\Phi + \varepsilon)N$$
[Figure: histogram of A[i] (vertical axis, 0 to 12) over the values 1, 2, 3, 4, …, |U| of the universe U (horizontal axis)]
Transactions
► Insert(i): A[i] ← A[i] + 1
► Delete(i): A[i] ← A[i] – 1
► Let $N_t = \sum_i A_t[i]$
► ASSUME: The universe size |U| is known
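As a point of reference, the transactions above can be sketched with an exact array of counts. This is only a baseline under stated assumptions, not the RSS algorithm: it uses O(|U|) space, and the class name `ExactCounts` and the method `quantile` are hypothetical names chosen for illustration.

```python
class ExactCounts:
    """Exact baseline: stores A[i] for every value i in the universe."""

    def __init__(self, universe_size):
        self.A = [0] * universe_size  # A[i] = number of records with value i
        self.N = 0                    # N = ||A||, the total number of records

    def insert(self, i):
        self.A[i] += 1
        self.N += 1

    def delete(self, i):
        if self.A[i] == 0:
            raise ValueError("cannot delete a value that is not present")
        self.A[i] -= 1
        self.N -= 1

    def quantile(self, phi, k):
        """Smallest j whose prefix count sum_{i<=j} A[i] reaches rank k*phi*N."""
        target = k * phi * self.N
        running = 0
        for j, count in enumerate(self.A):
            running += count
            if running >= target:
                return j
        return len(self.A) - 1


data = ExactCounts(8)
for value in [1, 3, 3, 5, 7]:
    data.insert(value)
data.delete(7)
median = data.quantile(0.5, 1)  # the data set is now {1, 3, 3, 5}
```

The linear scan in `quantile` is what the later slides replace with a binary search over estimated prefix counts.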
The Main Algorithmic Result
► The RSS algorithm
► Space complexity: $O\left(\log^2(|U|)\,\log(\log(|U|)/\delta)\,/\,\varepsilon^2\right)$
► Update: on every transaction, in O(space) time
► Estimation: on demand, in O(space) time
► One pass over the data
Dyadic Intervals
► log(|U|)+1 resolution levels j
► 2|U|-1 dyadic intervals
$$I_{j,k} = \left[\,k \cdot 2^{\log|U|-j},\ (k+1) \cdot 2^{\log|U|-j} - 1\,\right], \qquad I_{0,0} = U, \qquad I_{\log|U|,\,i} = \{i\}$$
0 1 2 3 4 5 6 7
I(3,0) I(3,1) I(3,2) I(3,3) I(3,4) I(3,5) I(3,6) I(3,7)
I(2,0) I(2,1) I(2,2) I(2,3)
I(1,0) I(1,1)
I(0,0)
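A minimal sketch of the indexing, assuming |U| is a power of two and that level j splits the universe into 2^j intervals of equal width (level 0 is the whole universe, level log|U| is single points). The function names are hypothetical.

```python
def dyadic_interval(j, k, log_u):
    """Return the inclusive endpoints (lo, hi) of I(j, k)."""
    width = 1 << (log_u - j)  # each level-j interval covers 2^(log|U| - j) values
    return k * width, (k + 1) * width - 1


def containing_interval(i, j, log_u):
    """Return k such that value i falls in I(j, k): the j high-order bits of i."""
    return i >> (log_u - j)


LOG_U = 3  # |U| = 8, as in the diagram
total_intervals = sum(1 << j for j in range(LOG_U + 1))
```

For |U| = 8 this gives I(1,0) = [0, 3], I(2,2) = [4, 5], I(3,6) = {6}, and 15 = 2|U| - 1 intervals in total.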
Arbitrary intervals
► Any interval can be written as a disjoint union of at most 2 log(|U|) dyadic intervals
► For example A[0,6] = I(1,0)+I(2,2)+I(3,6)
► Intervals starting at 0 never use the same resolution twice, so they need at most log(|U|) dyadic intervals
[Figure: the dyadic intervals I(0,0) through I(3,7) over the values 0, …, 7]
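The decomposition of an interval that starts at 0 can be sketched greedily, from the coarsest level down, taking at most one interval per level. This is an illustrative sketch assuming |U| is a power of two; `decompose_prefix` is a hypothetical name.

```python
def decompose_prefix(r, log_u):
    """Return (j, k) pairs whose disjoint union is the interval [0, r]."""
    pieces = []
    start = 0      # first value not yet covered
    end = r + 1    # work with the half-open interval [0, end)
    for j in range(log_u + 1):
        width = 1 << (log_u - j)
        if start + width <= end:
            # start is always a multiple of width here, so this is I(j, start // width)
            pieces.append((j, start // width))
            start += width
    return pieces


example = decompose_prefix(6, 3)
```

With |U| = 8 the call above reproduces the example from the slide: [0, 6] = I(1,0) + I(2,2) + I(3,6).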
Computing quantiles
► Assuming we have the number of records in each dyadic interval, we can efficiently compute the number of records in any interval of A.
► To compute the Φ-quantile for any k, we need a j_k s.t.: A[0, j_k) < kΦN < A[0, j_k + 1)
► Use binary search to find it.
► Keeping counts for all intervals is costly (O(|U|) space)
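The binary search can be sketched as follows. For illustration the prefix count A[0, j) is computed exactly from an array; in RSS it would be an estimate. The search returns the largest j with A[0, j) below the target rank, which guarantees A[0, j + 1) is at or above it. The names `prefix_count` and `find_quantile` are hypothetical.

```python
def prefix_count(A, j):
    """Return A[0, j): the number of records with value strictly below j."""
    return sum(A[:j])


def find_quantile(A, target_rank):
    """Binary search for the largest j with A[0, j) < target_rank.

    Assumes target_rank > 0, so that j = 0 always satisfies the condition.
    """
    lo, hi = 0, len(A) - 1  # invariant: prefix_count(A, lo) < target_rank
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if prefix_count(A, mid) < target_rank:
            lo = mid
        else:
            hi = mid - 1
    return lo


A = [2, 0, 3, 1, 0, 4, 0, 0]   # N = 10 records over |U| = 8
N = sum(A)
j_median = find_quantile(A, 0.5 * N)
```

The search relies only on prefix counts being non-decreasing in j, and it makes about log|U| probes.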
Random Subset Sums
► In case j = log(|U|)
► Let S be a subset of U
► Each u ∈ U has p = ½ of being in S
► E(|S|) = ½|U|
► Define: $A_S = \sum_{i \in S} A[i]$
► E(A_S) = ½||A|| = ½N
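Estimating A[i]
$$E[A_S \mid i \in S] = A[i] + \tfrac{1}{2} A_{U \setminus \{i\}}$$
$$E(2A_S - \|A\| \mid i \in S) = 2E[A_S \mid i \in S] - \|A\|$$
$$= 2\left(A[i] + \tfrac{1}{2} A_{U \setminus \{i\}}\right) - \|A\|$$
$$= A[i] + A[i] + A_{U \setminus \{i\}} - \|A\|$$
$$= A[i]$$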
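The identity E(2A_S − ||A|| | i ∈ S) = A[i] can be checked exhaustively on a small universe. The sketch below assumes truly random subsets (every subset containing i equally likely) and averages the estimate over all of them with exact rational arithmetic; the function name and the sample array are made up for illustration.

```python
from fractions import Fraction
from itertools import combinations


def conditional_mean_estimate(A, i):
    """Average of 2*A_S - ||A|| over every subset S of the universe that contains i."""
    others = [u for u in range(len(A)) if u != i]
    N = sum(A)
    total = Fraction(0)
    count = 0
    for size in range(len(others) + 1):
        for rest in combinations(others, size):
            A_S = A[i] + sum(A[u] for u in rest)  # S = {i} together with rest
            total += 2 * A_S - N
            count += 1
    return total / count


A = [3, 0, 5, 1, 2, 0, 4, 1]
means = [conditional_mean_estimate(A, i) for i in range(len(A))]
```

Each entry of `means` equals the corresponding A[i], so the estimator is unbiased; any single subset can still be far off, which is why many copies are combined later.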
Improvement
► Instead of keeping random sets of point dyadic intervals only, keep random sets at all resolutions
► We need a method of keeping a random set of j-resolution dyadic intervals (keeping it explicitly takes O(|U|) space)
► Instead of keeping the sets, keep a small representation of them
Pseudorandom set generator
► We need to keep a small representation of a random set S (each i ∈ U is in S with p = ½)
► Given a seed of size log(|U|)+1
► Represent a set S of size O(|U|)
► Quickly test whether i ∈ S or not
► Use the Extended Hamming Code
Extended Hamming Code
► Given a seed, tells whether i ∈ S
► For example: |U| = 8, seed size: log|U|+1 = 4
► G(seed, i) = seed × i'th column mod 2
$$\begin{pmatrix} 1 & 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1&1&1&1&1&1&1&1 \\ 0&0&0&0&1&1&1&1 \\ 0&0&1&1&0&0&1&1 \\ 0&1&0&1&0&1&0&1 \end{pmatrix} \sim \{0, 2, 5, 7\}$$
► Efficient to compute
► 3-wise independent
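A small sketch of the generator, assuming column i of the matrix is a leading 1 followed by the binary digits of i (most significant first), and reading the example seed as (1, 1, 0, 1). The function names are hypothetical.

```python
def column(i, log_u):
    """Column i of the generator matrix: a 1, then the log_u bits of i, high bit first."""
    return [1] + [(i >> b) & 1 for b in range(log_u - 1, -1, -1)]


def in_set(seed, i):
    """G(seed, i): inner product of seed and column i modulo 2; 1 means i is in S."""
    log_u = len(seed) - 1
    return sum(s * c for s, c in zip(seed, column(i, log_u))) % 2 == 1


seed = [1, 1, 0, 1]  # log|U| + 1 = 4 bits for |U| = 8
S = [i for i in range(8) if in_set(seed, i)]
```

Under these assumptions the seed expands to S = {0, 2, 5, 7}, matching the slide's example, and membership of any i is tested from the seed alone without materializing S.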
The Data Structure
► For each resolution level j keep num_copies random subsets S of all dyadic intervals in that level (we only keep the representation seed)
$$\text{num\_copies} = 24 \log(\log(|U|)/\delta) \log(|U|) / \varepsilon^2$$
► Keep $A_S = \sum_{i \in S} A[i]$ for each subset S
► Maintain N = ||A||
► We get S_1, …, S_num_copies per level
Upon Transactions
► Insert(i) / Delete(i): for each resolution level j
  ► Locate the single I_{j,k} into which i falls (high-order binary bits)
  ► Determine all S_ℓ containing I_{j,k}
  ► For each S_ℓ, increase/decrease ||A_{S_ℓ}|| by 1
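The update path might look like the sketch below. It is an illustration under assumptions, not the authors' implementation: seeds are stored as integers, level j uses (j + 1)-bit seeds, set membership is the parity test of the extended Hamming code, and the class name `RSSSketch` and its attributes are hypothetical.

```python
import random


class RSSSketch:
    """Per-level random subset sums; only the seeds and the counters are stored."""

    def __init__(self, log_u, num_copies, seed=0):
        rng = random.Random(seed)
        self.log_u = log_u
        self.N = 0
        # seeds[j][l] is the (j + 1)-bit seed of the l-th random set of level-j intervals
        self.seeds = [[rng.getrandbits(j + 1) for _ in range(num_copies)]
                      for j in range(log_u + 1)]
        # counts[j][l] is A_S for that set
        self.counts = [[0] * num_copies for _ in range(log_u + 1)]

    @staticmethod
    def _contains(j, seed, k):
        """Parity of seed AND column k, where the column is a leading 1 then the j bits of k."""
        col = (1 << j) | k
        return bin(seed & col).count("1") % 2 == 1

    def _update(self, i, delta):
        self.N += delta
        for j in range(self.log_u + 1):
            k = i >> (self.log_u - j)  # the single level-j interval containing i
            for l, seed in enumerate(self.seeds[j]):
                if self._contains(j, seed, k):
                    self.counts[j][l] += delta

    def insert(self, i):
        self._update(i, +1)

    def delete(self, i):
        self._update(i, -1)


a = RSSSketch(log_u=3, num_copies=5, seed=1)
b = RSSSketch(log_u=3, num_copies=5, seed=1)
for v in [1, 6, 6, 3]:
    a.insert(v)
a.delete(6)
for v in [3, 6, 1]:
    b.insert(v)
```

Both sketches end up with identical counters, since an update is just a signed increment: the order of insertions and deletions does not matter, only the resulting multiset.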
Estimating Quantiles: Dyadic Intervals
► Given a dyadic interval I = I_{j,k}
► There are num_copies sets of resolution j:
$$\text{num\_copies} = \underbrace{3 \log(\log(|U|)/\delta)}_{G} \cdot \underbrace{8 \log(|U|)/\varepsilon^2}_{E}$$
► Quickly test each S_ℓ to check whether I ∈ S_ℓ, and if so estimate
$$\tilde{A}_{I,S_\ell} = 2A_{S_\ell} - \|A\|$$
► Group all estimates into G groups of E elements
► For each group g calculate the average of all estimates, A_{g,j,k}
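To get a feel for the sizes involved, the group parameters G = 3 log(log(|U|)/δ) and E = 8 log(|U|)/ε² can be evaluated numerically. The slides do not state the base of the logarithm; the sketch below assumes base 2 and rounds up to whole numbers, and the function name is hypothetical.

```python
import math


def rss_parameters(universe_size, eps, delta):
    """Return (G, E, num_copies) per level, assuming base-2 logarithms."""
    log_u = math.log2(universe_size)
    G = math.ceil(3 * math.log2(log_u / delta))  # number of groups (median step)
    E = math.ceil(8 * log_u / eps ** 2)          # estimates per group (averaging step)
    return G, E, G * E


G, E, num_copies = rss_parameters(universe_size=2 ** 16, eps=0.5, delta=0.25)
```

The 1/ε² dependence dominates: halving ε quadruples E, and therefore the number of counters per level.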
Estimating Quantiles: Arbitrary intervals
► Given an interval I (in the binary search, a prefix [0, j)), write it as a disjoint union of at most log(|U|) dyadic intervals I_{j,k}
► Form G groups and, for each group g, sum the averages A_{g,j,k} over all I_{j,k} comprising I.
► Take the median of all G group sums as the final estimate of A_I
► It's more convenient to treat the result as an overestimate: |A_I| ≤ |Ã_I| ≤ |A_I| + εN
[Figure: example with 3 dyadic intervals, G = 3 groups and E = 4 elements per group; the SUM, AVERAGE and MEDIAN steps produce the interval's estimate]
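The combination step (average within each group, sum over the dyadic intervals, median over the groups) can be sketched as a pure function. It assumes each dyadic interval already comes with G·E individual estimates; the function name and the numbers are invented for illustration.

```python
from statistics import mean, median


def combine_estimates(per_interval_estimates, G, E):
    """per_interval_estimates[d] holds G*E estimates for the d-th dyadic interval."""
    group_sums = []
    for g in range(G):
        total = 0.0
        for estimates in per_interval_estimates:
            # average of the E estimates that belong to group g
            total += mean(estimates[g * E:(g + 1) * E])
        group_sums.append(total)
    return median(group_sums)


# 3 dyadic intervals, G = 3 groups, E = 4 elements per group
estimates = [
    [10] * 4 + [12] * 4 + [100] * 4,          # third group is an outlier
    [1, 3, 1, 3] + [2, 2, 2, 2] + [0, 0, 0, 0],
    [5] * 12,
]
result = combine_estimates(estimates, G=3, E=4)
```

The group sums here are 17, 19 and 105, so the median returns 19 and the outlier group is ignored. Averaging reduces variance; the median guards against the occasional group that is far off.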
Analysis
► Lemma: The algorithm estimates each quantile to within εN with p > 1-δ
► Proof: For a fixed resolution level j, let
$$X_k = \begin{cases} 2A_{I_k}, & I_k \in S \\ 0, & \text{otherwise} \end{cases} \qquad X = \sum_k X_k$$
Then:
$$E[X \mid I_{k_0} \in S] = 2A_{I_{k_0}} + \sum_{k \ne k_0} E[X_k] = 2A_{I_{k_0}} + \sum_{k \ne k_0} A_{I_k} = A_{I_{k_0}} + \|A\|$$
$$\tilde{A}_{I_{k_0}} = X - \|A\| = 2A_S - \|A\|, \qquad E(2A_S - \|A\| \mid I_{j,k} \in S) = A_{I_{j,k}}$$
$$\mathrm{var}[X \mid I_{k_0} \in S] \le \|A\|^2$$
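Analysis (cont.)
► Write $I = I_1 \cup I_2 \cup \cdots \cup I_\gamma$ as a disjoint union of dyadic intervals, and let $Y = \sum_j X^{(j)}$, where $X^{(j)}$ is the variable X for $I_j$ at its resolution level. Then:
$$E[Y \mid \forall j:\ I_j \in S] = A_I + \gamma\|A\|, \qquad A_I = E[Y - \gamma N \mid \forall j:\ I_j \in S]$$
$$\mathrm{var}[Y \mid \forall j:\ I_j \in S] \le \log|U| \cdot \|A\|^2 = \log|U| \cdot N^2$$
► Let Z be the average of $E = 8\log(|U|)/\varepsilon^2$ such estimates. By Chebyshev's inequality:
$$P[|Z - A_I| > \varepsilon N] \le \frac{\mathrm{var}(Z)}{\varepsilon^2 N^2} = \frac{\mathrm{var}(Y)}{E\,\varepsilon^2 N^2} \le \frac{\log|U| \cdot N^2}{\varepsilon^2 N^2 \cdot 8\log|U|/\varepsilon^2} = \frac{1}{8}$$
$$P[|Z - A_I| \le \varepsilon N] \ge \frac{7}{8}$$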
Analysis (cont.)
► We take $G = 3\log(\log(|U|)/\delta)$ copies of Z and take the median, $Z_m$.
► By the Chernoff inequality,
$$P[|Z_m - A_I| \le \varepsilon N] \ge 1 - \delta/\log|U|$$
► The binary search looked for a j_k such that
$$\tilde{A}[0, j_k) < k\Phi N < \tilde{A}[0, j_k + 1)$$
and, since each estimate is an overestimate by at most εN,
$$A[0, j_k) \le \tilde{A}[0, j_k) < k\Phi N < \tilde{A}[0, j_k + 1) \le A[0, j_k + 1) + \varepsilon N$$
► We made log|U| checks in the binary search
► By the union bound, the probability that any of them failed is at most log|U| times the failure probability above, i.e. δ
RSS Properties
► The algorithm may return a quantile value which was not seen in the input
► Changing the order of insertions and deletions doesn't affect the results
► The RSSs are composable: U can be split into many disjoint ranges, given some pre-agreed common random subsets
Extension: U is unknown
► Predict a range [0, u-1] for U.
► Upon insertion of i > u-1, add another instance of RSS with range [u, u²-1], and so on…
► Because RSS is composable, we only have to join the results upon query
► Increased cost factor: log² log(|U|).
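The range-doubling scheme can be sketched as a lookup from a value to the instance that covers it, assuming instance 0 covers [0, u-1] and instance m ≥ 1 covers [u^(2^(m-1)), u^(2^m) - 1], which continues the pattern [u, u²-1] one squaring at a time. The function name is hypothetical and u is assumed to be at least 2.

```python
def instance_for(value, u):
    """Index of the RSS instance whose range contains value (requires u >= 2)."""
    m = 0
    upper = u  # exclusive upper end of the range covered by instances 0..m
    while value >= upper:
        upper *= upper  # the next instance extends coverage to upper**2 - 1
        m += 1
    return m


indices = [instance_for(v, 4) for v in (3, 4, 15, 16, 255, 256)]
```

Because the covered range squares at every step, only about log log|U| instances are ever created.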
Experiments
► What is the median length of all active AT&T calls?
► When a call
  Starts: add timestamp
  Ends: delete start timestamp
► 4 KB used for RSS
► Compared: RSS, GK, GK2
[Figure: Number of Active Phone Calls Over Time]
[Figure: Error in Computation of Median Over Time]
[Figure: Average Error for Last 50 Snapshots, For Deciles]
The End