basic algorithmic techniques for data streams

205
Basic algorithmic techniques for data streams Mariano Zelke Goethe-Universit¨ at Frankfurt am Main Institut f¨ ur Informatik Arbeitsgruppe Theorie Komplexer Systeme DEIS’10, Dagstuhl, Nov. 2010

Upload: others

Post on 19-Dec-2021

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Basic algorithmic techniques for data streams

Basic algorithmic techniques for data streams

Mariano Zelke

Goethe-Universitat Frankfurt am MainInstitut fur Informatik

Arbeitsgruppe Theorie Komplexer Systeme

DEIS’10, Dagstuhl, Nov. 2010

Page 2: Basic algorithmic techniques for data streams

Contents

Basic Definitions

SamplingGeneral IdeaReservoir SamplingSampling Applications

Advanced SamplingAMS SamplingSliding Window Sampling

Count-Min Sketch

Mariano Zelke Basic algorithmic techniques for data streams 2/12

Page 3: Basic algorithmic techniques for data streams

Data Stream Model

I Stream: m elements of some universe of size n

a1, a2, a3, . . . , am ai ∈ 1, . . . , nI Goal: gain information about stream

statistical information (median, frequency moments, ...),longest increasing subsequence,...

I But: algorithms are restricted to

I sequential access to items in streamI limited memory, sublinear in m and n

Mariano Zelke Basic algorithmic techniques for data streams 3/12

Page 4: Basic algorithmic techniques for data streams

Data Stream Model

I Stream: m elements of some universe of size n

a1, a2, a3, . . . , am ai ∈ 1, . . . , nI Goal: gain information about stream

statistical information (median, frequency moments, ...),longest increasing subsequence,...

I But: algorithms are restricted to

I sequential access to items in streamI limited memory, sublinear in m and n

Mariano Zelke Basic algorithmic techniques for data streams 3/12

Page 5: Basic algorithmic techniques for data streams

Data Stream Model

I Stream: m elements of some universe of size n

a1, a2, a3, . . . , am ai ∈ 1, . . . , n

I Goal: gain information about stream

statistical information (median, frequency moments, ...),longest increasing subsequence,...

I But: algorithms are restricted to

I sequential access to items in streamI limited memory, sublinear in m and n

Mariano Zelke Basic algorithmic techniques for data streams 3/12

Page 6: Basic algorithmic techniques for data streams

Data Stream Model

I Stream: m elements of some universe of size n

a1, a2, a3, . . . , am ai ∈ 1, . . . , nI Goal: gain information about stream

statistical information (median, frequency moments, ...),longest increasing subsequence,...

I But: algorithms are restricted to

I sequential access to items in streamI limited memory, sublinear in m and n

Mariano Zelke Basic algorithmic techniques for data streams 3/12

Page 7: Basic algorithmic techniques for data streams

Data Stream Model

I Stream: m elements of some universe of size n

a1, a2, a3, . . . , am ai ∈ 1, . . . , nI Goal: gain information about stream

statistical information (median, frequency moments, ...),longest increasing subsequence,...

I But: algorithms are restricted to

I sequential access to items in streamI limited memory, sublinear in m and n

Mariano Zelke Basic algorithmic techniques for data streams 3/12

Page 8: Basic algorithmic techniques for data streams

Data Stream Model

I Stream: m elements of some universe of size n

a1, a2, a3, . . . , am ai ∈ 1, . . . , nI Goal: gain information about stream

statistical information (median, frequency moments, ...),longest increasing subsequence,...

I But: algorithms are restricted to

I sequential access to items in streamI limited memory, sublinear in m and n

Mariano Zelke Basic algorithmic techniques for data streams 3/12

Page 9: Basic algorithmic techniques for data streams

Data Stream Model

I Stream: m elements of some universe of size n

a1, a2, a3, . . . , am ai ∈ 1, . . . , nI Goal: gain information about stream

statistical information (median, frequency moments, ...),longest increasing subsequence,...

I But: algorithms are restricted toI sequential access to items in stream

I limited memory, sublinear in m and n

Mariano Zelke Basic algorithmic techniques for data streams 3/12

Page 10: Basic algorithmic techniques for data streams

Data Stream Model

I Stream: m elements of some universe of size n

a1, a2, a3, . . . , am ai ∈ 1, . . . , nI Goal: gain information about stream

statistical information (median, frequency moments, ...),longest increasing subsequence,...

I But: algorithms are restricted toI sequential access to items in streamI limited memory, sublinear in m and n

Mariano Zelke Basic algorithmic techniques for data streams 3/12

Page 11: Basic algorithmic techniques for data streams

Sampling

General idea:

I sample (= select) items from stream according to some rule

I rule is randomized or deterministic

I use sampled items to get information about whole stream

I only need to store sampled items⇒ good for streaming

Mariano Zelke Basic algorithmic techniques for data streams 4/12

Page 12: Basic algorithmic techniques for data streams

Sampling

General idea:

I sample (= select) items from stream according to some rule

I rule is randomized or deterministic

I use sampled items to get information about whole stream

I only need to store sampled items⇒ good for streaming

Mariano Zelke Basic algorithmic techniques for data streams 4/12

Page 13: Basic algorithmic techniques for data streams

Sampling

General idea:

I sample (= select) items from stream according to some rule

I rule is randomized or deterministic

I use sampled items to get information about whole stream

I only need to store sampled items⇒ good for streaming

Mariano Zelke Basic algorithmic techniques for data streams 4/12

Page 14: Basic algorithmic techniques for data streams

Sampling

General idea:

I sample (= select) items from stream according to some rule

I rule is randomized or deterministic

I use sampled items to get information about whole stream

I only need to store sampled items⇒ good for streaming

Mariano Zelke Basic algorithmic techniques for data streams 4/12

Page 15: Basic algorithmic techniques for data streams

Sampling

General idea:

I sample (= select) items from stream according to some rule

I rule is randomized or deterministic

I use sampled items to get information about whole stream

I only need to store sampled items⇒ good for streaming

Mariano Zelke Basic algorithmic techniques for data streams 4/12

Page 16: Basic algorithmic techniques for data streams

Sampling

General idea:

I sample (= select) items from stream according to some rule

I rule is randomized or deterministic

I use sampled items to get information about whole stream

I only need to store sampled items⇒ good for streaming

Mariano Zelke Basic algorithmic techniques for data streams 4/12

Page 17: Basic algorithmic techniques for data streams

Easy Starter

Sample out a single item uniformly at random from the stream

I Easy if m is known in advance:

I Pick a random number r ∈ 1, 2, . . . ,mI Go over the stream and snatch out sampled item

ar

But what if m is not known in advance?

Mariano Zelke Basic algorithmic techniques for data streams 5/12

Page 18: Basic algorithmic techniques for data streams

Easy Starter

Sample out a single item uniformly at random from the stream

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ar , . . . , am

I Easy if m is known in advance:

I Pick a random number r ∈ 1, 2, . . . ,mI Go over the stream and snatch out sampled item

ar

But what if m is not known in advance?

Mariano Zelke Basic algorithmic techniques for data streams 5/12

Page 19: Basic algorithmic techniques for data streams

Easy Starter

Sample out a single item uniformly at random from the stream

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ar , . . . , am

I Easy if m is known in advance:

I Pick a random number r ∈ 1, 2, . . . ,mI Go over the stream and snatch out sampled item

ar

But what if m is not known in advance?

Mariano Zelke Basic algorithmic techniques for data streams 5/12

Page 20: Basic algorithmic techniques for data streams

Easy Starter

Sample out a single item uniformly at random from the stream

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ar , . . . , am

I Easy if m is known in advance:

I Pick a random number r ∈ 1, 2, . . . ,m

I Go over the stream and snatch out sampled item

ar

But what if m is not known in advance?

Mariano Zelke Basic algorithmic techniques for data streams 5/12

Page 21: Basic algorithmic techniques for data streams

Easy Starter

Sample out a single item uniformly at random from the stream

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ar , . . . , am

I Easy if m is known in advance:

I Pick a random number r ∈ 1, 2, . . . ,mI Go over the stream and snatch out sampled item

ar

But what if m is not known in advance?

Mariano Zelke Basic algorithmic techniques for data streams 5/12

Page 22: Basic algorithmic techniques for data streams

Easy Starter

Sample out a single item uniformly at random from the stream

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ar , . . . , am

I Easy if m is known in advance:

I Pick a random number r ∈ 1, 2, . . . ,mI Go over the stream and snatch out sampled item

ar

But what if m is not known in advance?

Mariano Zelke Basic algorithmic techniques for data streams 5/12

Page 23: Basic algorithmic techniques for data streams

Easy Starter

Sample out a single item uniformly at random from the stream

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ar , . . . , am

I Easy if m is known in advance:

I Pick a random number r ∈ 1, 2, . . . ,mI Go over the stream and snatch out sampled item

ar

But what if m is not known in advance?

Mariano Zelke Basic algorithmic techniques for data streams 5/12

Page 24: Basic algorithmic techniques for data streams

Easy Starter

Sample out a single item uniformly at random from the stream

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ar , . . . , am

I Easy if m is known in advance:

I Pick a random number r ∈ 1, 2, . . . ,mI Go over the stream and snatch out sampled item

ar

But what if m is not known in advance?

Mariano Zelke Basic algorithmic techniques for data streams 5/12

Page 25: Basic algorithmic techniques for data streams

Easy Starter

Sample out a single item uniformly at random from the stream

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ar , . . . , am

I Easy if m is known in advance:

I Pick a random number r ∈ 1, 2, . . . ,mI Go over the stream and snatch out sampled item

ar

But what if m is not known in advance?

Mariano Zelke Basic algorithmic techniques for data streams 5/12

Page 26: Basic algorithmic techniques for data streams

Easy Starter

Sample out a single item uniformly at random from the stream

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ar , . . . , am

I Easy if m is known in advance:

I Pick a random number r ∈ 1, 2, . . . ,mI Go over the stream and snatch out sampled item

ar

But what if m is not known in advance?

Mariano Zelke Basic algorithmic techniques for data streams 5/12

Page 27: Basic algorithmic techniques for data streams

Easy Starter

Sample out a single item uniformly at random from the stream

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ar , . . . , am

I Easy if m is known in advance:

I Pick a random number r ∈ 1, 2, . . . ,mI Go over the stream and snatch out sampled item

ar

But what if m is not known in advance?

Mariano Zelke Basic algorithmic techniques for data streams 5/12

Page 28: Basic algorithmic techniques for data streams

Easy Starter

Sample out a single item uniformly at random from the stream

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ar , . . . , am

I Easy if m is known in advance:

I Pick a random number r ∈ 1, 2, . . . ,mI Go over the stream and snatch out sampled item

ar

But what if m is not known in advance?

Mariano Zelke Basic algorithmic techniques for data streams 5/12

Page 29: Basic algorithmic techniques for data streams

Easy Starter

Sample out a single item uniformly at random from the stream

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ar , . . . , am

I Easy if m is known in advance:

I Pick a random number r ∈ 1, 2, . . . ,mI Go over the stream and snatch out sampled item

ar

But what if m is not known in advance?

Mariano Zelke Basic algorithmic techniques for data streams 5/12

Page 30: Basic algorithmic techniques for data streams

Easy Starter

Sample out a single item uniformly at random from the stream

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ar , . . . , am

I Easy if m is known in advance:

I Pick a random number r ∈ 1, 2, . . . ,mI Go over the stream and snatch out sampled item

ar

But what if m is not known in advance?

Mariano Zelke Basic algorithmic techniques for data streams 5/12

Page 31: Basic algorithmic techniques for data streams

Easy Starter

Sample out a single item uniformly at random from the stream

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ar , . . . , am

I Easy if m is known in advance:

I Pick a random number r ∈ 1, 2, . . . ,mI Go over the stream and snatch out sampled item

ar

But what if m is not known in advance?

Mariano Zelke Basic algorithmic techniques for data streams 5/12

Page 32: Basic algorithmic techniques for data streams

Easy Starter

Sample out a single item uniformly at random from the stream

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ar , . . . , am

I Easy if m is known in advance:

I Pick a random number r ∈ 1, 2, . . . ,mI Go over the stream and snatch out sampled item

ar

But what if m is not known in advance?

Mariano Zelke Basic algorithmic techniques for data streams 5/12

Page 33: Basic algorithmic techniques for data streams

Easy Starter

Sample out a single item uniformly at random from the stream

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ar , . . . , am

I Easy if m is known in advance:

I Pick a random number r ∈ 1, 2, . . . ,mI Go over the stream and snatch out sampled item

ar

But what if m is not known in advance?

Mariano Zelke Basic algorithmic techniques for data streams 5/12

Page 34: Basic algorithmic techniques for data streams

Easy Starter

Sample out a single item uniformly at random from the stream

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ar , . . . , am

I Easy if m is known in advance:

I Pick a random number r ∈ 1, 2, . . . ,mI Go over the stream and snatch out sampled item

ar

But what if m is not known in advance?

Mariano Zelke Basic algorithmic techniques for data streams 5/12

Page 35: Basic algorithmic techniques for data streams

Easy Starter

Sample out a single item uniformly at random from the stream

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ar , . . . , am

I Easy if m is known in advance:

I Pick a random number r ∈ 1, 2, . . . ,mI Go over the stream and snatch out sampled item

ar

But what if m is not known in advance?

Mariano Zelke Basic algorithmic techniques for data streams 5/12

Page 36: Basic algorithmic techniques for data streams

Easy Starter

Sample out a single item uniformly at random from the stream

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ar , . . . , am

I Easy if m is known in advance:

I Pick a random number r ∈ 1, 2, . . . ,mI Go over the stream and snatch out sampled item ar

But what if m is not known in advance?

Mariano Zelke Basic algorithmic techniques for data streams 5/12

Page 37: Basic algorithmic techniques for data streams

Easy Starter

Sample out a single item uniformly at random from the stream

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ar , . . . , am

I Easy if m is known in advance:

I Pick a random number r ∈ 1, 2, . . . ,mI Go over the stream and snatch out sampled item ar

But what if m is not known in advance?

Mariano Zelke Basic algorithmic techniques for data streams 5/12

Page 38: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out a single item uniformly at random from the streamwithout knowing its length

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 39: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out a single item uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , ai+1, ai+2, . . . , am

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 40: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out a single item uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , ai+1, ai+2, . . . , am

current sample:

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 41: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out a single item uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , ai+1, ai+2, . . . , am

current sample: a1

take as sample

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 42: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out a single item uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , ai+1, ai+2, . . . , am

current sample: ar

replace with probability 1/2

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 43: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out a single item uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , ai+1, ai+2, . . . , am

current sample: ar

replace with probability 1/3

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 44: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out a single item uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , ai+1, ai+2, . . . , am

current sample: ar

replace with probability 1/4

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 45: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out a single item uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , ai+1, ai+2, . . . , am

current sample: ar

replace with probability 1/i

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 46: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out a single item uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , ai+1, ai+2, . . . , am

current sample: ar

replace with probability 1/m

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 47: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out a single item uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , ai+1, ai+2, . . . , am

final sample: ar

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 48: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out a single item uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , ai+1, ai+2, . . . , am

current sample:

Pr [final sample ai ] =

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 49: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out a single item uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , ai+1, ai+2, . . . , am

current sample: ai

replace

Pr [final sample ai ] = 1i

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 50: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out a single item uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , ai+1, ai+2, . . . , am

current sample: ai

do not replace

Pr [final sample ai ] = 1i × (1− 1

i+1 )

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 51: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out a single item uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , ai+1, ai+2, . . . , am

current sample: ai

do not replace

Pr [final sample ai ] = 1i × (1− 1

i+1 )× (1− 1i+2 )

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 52: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out a single item uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , ai+1, ai+2, . . . , am

current sample: ai

do not replace

. . .

Pr [final sample ai ] = 1i × (1− 1

i+1 )× (1− 1i+2 )× . . . × (1− 1

m )

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 53: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out a single item uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , ai+1, ai+2, . . . , am

final sample: ai

Pr [final sample ai ] = 1i × (1− 1

i+1 )× (1− 1i+2 )× . . . × (1− 1

m )

= 1i ×

ii+1 × i+1

i+2 × . . . × m−1m

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 54: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out a single item uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , ai+1, ai+2, . . . , am

final sample: ai

Pr [final sample ai ] = 1i × (1− 1

i+1 )× (1− 1i+2 )× . . . × (1− 1

m )

= 1i ×

ii+1 × i+1

i+2 × . . . × m−1m

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 55: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out a single item uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , ai+1, ai+2, . . . , am

final sample: ai

Pr [final sample ai ] = 1i × (1− 1

i+1 )× (1− 1i+2 )× . . . × (1− 1

m )

= 1m

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 56: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out a single item uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , ai+1, ai+2, . . . , am

final sample: ai

Pr [final sample ai ] = 1i × (1− 1

i+1 )× (1− 1i+2 )× . . . × (1− 1

m )

= 1m

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 57: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out several items uniformly at random from the streamwithout knowing its length

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 58: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out several items uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 59: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out several items uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

current samples

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 60: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out several items uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

current samples a1 a2 a3 a4

take into sample

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 61: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out several items uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

current samples a1 a2 a3 a4

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 62: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out several items uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

current samples a1 a2 a3 a4

a5

sample with probability 1/5

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 63: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out several items uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

current samples a1 a2 a3 a4

a5

replace a sampled elementuniformly at random 1

414

14

14

sample with probability 1/5

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 64: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out several items uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

current samples a1 a2 a3 a4

a5

replace a sampled elementuniformly at random

sample with probability 1/5

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 65: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out several items uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

current samples a1 a5 a3 a4

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 66: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out several items uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

current samples a1 a5 a3 a4

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 67: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out several items uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

current samples a1 a5 a3 a4

a6

sample with probability 1/6

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 68: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out several items uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

current samples a1 a5 a3 a4

a6

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 69: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out several items uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

current samples a1 a5 a3 a4

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 70: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out several items uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

current samples ar 1 ar 2 ar 3 ar 4

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 71: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out several items uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

current samples ar 1 ar 2 ar 3 ar 4

ai

sample with probability 1/i

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 72: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out several items uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

current samples ar 1 ar 2 ar 3 ar 4

ai

replace a sampled elementuniformly at random 1

414

14

14

sample with probability 1/i

Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 73: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out several items uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

current samples ar 1 ar 2 ar 3 ar 4

ai

replace a sampled elementuniformly at random 1

414

14

14

sample with probability 1/i

Memory usage for sampling k items:Mariano Zelke Basic algorithmic techniques for data streams 6/12

Page 74: Basic algorithmic techniques for data streams

Reservoir Sampling [J. Vitter ’85]

Sample out several items uniformly at random from the streamwithout knowing its length

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

current samples ar 1 ar 2 ar 3 ar 4

ai

replace a sampled elementuniformly at random 1

414

14

14

sample with probability 1/i

Memory usage for sampling k items: k · log nMariano Zelke Basic algorithmic techniques for data streams 6/12

Page 75: Basic algorithmic techniques for data streams

Sampling Applications

stream:

Mariano Zelke Basic algorithmic techniques for data streams 7/12

Page 76: Basic algorithmic techniques for data streams

Sampling Applications

stream:

sampling

samples:

Mariano Zelke Basic algorithmic techniques for data streams 7/12

Page 77: Basic algorithmic techniques for data streams

Sampling Applications

stream:

sampling

samples:

Goal: determine query selectivity on items of stream

(assume query to be invariant of stream order)

Mariano Zelke Basic algorithmic techniques for data streams 7/12

Page 78: Basic algorithmic techniques for data streams

Sampling Applications

stream:

sampling

samples:

determine selectivity of query on samples

Mariano Zelke Basic algorithmic techniques for data streams 7/12

Page 79: Basic algorithmic techniques for data streams

Sampling Applications

stream:

sampling

samples:

determine selectivity of query on samples

To get (1± ε)-estimate with probability 1− δ: What sample size s ?

Mariano Zelke Basic algorithmic techniques for data streams 7/12

Page 80: Basic algorithmic techniques for data streams

Sampling Applications

stream:

sampling

samples:

determine selectivity of query on samples

To get (1± ε)-estimate with probability 1− δ: What sample size s ?

Query selects mc stream items ⇒ Exp[ s+ ] = s

c

Mariano Zelke Basic algorithmic techniques for data streams 7/12

Page 81: Basic algorithmic techniques for data streams

Sampling Applications

stream:

sampling

samples:

determine selectivity of query on samples

To get (1± ε)-estimate with probability 1− δ: What sample size s ?

Query selects mc stream items ⇒ Exp[ s+ ] = s

c

Chernoff-Hoeffding-Ineq.: Pr [ |s+ − Exp[ s+ ]| > ε · s+ ] ≤ e−Θ(ε2s)

Mariano Zelke Basic algorithmic techniques for data streams 7/12

Page 82: Basic algorithmic techniques for data streams

Sampling Applications

stream:

sampling

samples:

determine selectivity of query on samples

To get (1± ε)-estimate with probability 1− δ: What sample size s ?

Query selects mc stream items ⇒ Exp[ s+ ] = s

c

Chernoff-Hoeffding-Ineq.: Pr [ |s+ − Exp[ s+ ]| > ε · s+ ] ≤ e−Θ(ε2s)

sample O(

1ε2 · log 1

δ

)items

Mariano Zelke Basic algorithmic techniques for data streams 7/12

Page 83: Basic algorithmic techniques for data streams

Sampling Applications

stream:

sampling

samples:

determine selectivity of query on samples

To get (1± ε)-estimate with probability 1− δ: What sample size s ?

Query selects mc stream items ⇒ Exp[ s+ ] = s

c

Chernoff-Hoeffding-Ineq.: Pr [ |s+ − Exp[ s+ ]| > ε · s+ ] ≤ e−Θ(ε2s)

Memory usage: O(

1ε2 · log 1

δ

)items × log n bits

Mariano Zelke Basic algorithmic techniques for data streams 7/12

Page 84: Basic algorithmic techniques for data streams

Sampling Applications

stream:

sampling

samples:

determine selectivity of query on samples

To get (1± ε)-estimate with probability 1− δ: What sample size s ?

Query selects mc stream items ⇒ Exp[ s+ ] = s

c

Chernoff-Hoeffding-Ineq.: Pr [ |s+ − Exp[ s+ ]| > ε · s+ ] ≤ e−Θ(ε2s)

Memory usage: O(

1ε2 · log 1

δ · log n)

bits

Mariano Zelke Basic algorithmic techniques for data streams 7/12

Page 85: Basic algorithmic techniques for data streams

Sampling Applications

stream:

sampling

samples:

Mariano Zelke Basic algorithmic techniques for data streams 7/12

Page 86: Basic algorithmic techniques for data streams

Sampling Applications

stream:

sampling

samples:

Goal: find the median of the stream

Mariano Zelke Basic algorithmic techniques for data streams 7/12

Page 87: Basic algorithmic techniques for data streams

Sampling Applications

stream:

sampling

samples:

sort samples

sorted

samples: ≤ ≤ ≤ ≤ ≤ ≤ ≤ ≤ ≤ ≤

Mariano Zelke Basic algorithmic techniques for data streams 7/12

Page 88: Basic algorithmic techniques for data streams

Sampling Applications

stream:

sampling

samples:

sort samples

sorted

samples: ≤ ≤ ≤ ≤ ≤ ≤ ≤ ≤ ≤ ≤

Pick median of samples as an estimate of stream’s median

median of samples

Mariano Zelke Basic algorithmic techniques for data streams 7/12

Page 89: Basic algorithmic techniques for data streams

Sampling Applications

stream:

sampling

samples:

sort samples

sorted

samples: ≤ ≤ ≤ ≤ ≤ ≤ ≤ ≤ ≤ ≤median of samples

Pick median of samples as an estimate of stream’s median

To get (1± ε)-estimate with probability 1− δ:

sample O(

1ε2 · log 1

δ

)items

Mariano Zelke Basic algorithmic techniques for data streams 7/12

Page 90: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

Stream a1, a2, a3, . . . , am

Frequency of an item: f i =∣∣j : a j = i

∣∣kth frequency moment Fk =

n∑i=1

f ki

Frequency moments provide useful statistics:

I F0: number of distinct elements in streamI F1: length of stream, mI F2: size of self joinI Fk , k ≥ 2: skew of distribution

Trivial determination of Fk : maintain counters for each f i

⇒ Ω(n) bits needed

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 91: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

Stream a1, a2, a3, . . . , am

Frequency of an item: f i =∣∣j : a j = i

∣∣kth frequency moment Fk =

n∑i=1

f ki

Frequency moments provide useful statistics:

I F0: number of distinct elements in streamI F1: length of stream, mI F2: size of self joinI Fk , k ≥ 2: skew of distribution

Trivial determination of Fk : maintain counters for each f i

⇒ Ω(n) bits needed

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 92: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

Stream a1, a2, a3, . . . , am

Frequency of an item: f i =∣∣j : a j = i

∣∣

kth frequency moment Fk =n∑

i=1f ki

Frequency moments provide useful statistics:

I F0: number of distinct elements in streamI F1: length of stream, mI F2: size of self joinI Fk , k ≥ 2: skew of distribution

Trivial determination of Fk : maintain counters for each f i

⇒ Ω(n) bits needed

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 93: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

Stream a1, a2, a3, . . . , am

Frequency of an item: f i =∣∣j : a j = i

∣∣Example: stream 2, 3, 3, 2, 3, 1, 2, 3

f 1 = 1, f 2 = 3, f 3 = 4

Frequency moments provide useful statistics:

I F0: number of distinct elements in streamI F1: length of stream, mI F2: size of self joinI Fk , k ≥ 2: skew of distribution

Trivial determination of Fk : maintain counters for each f i

⇒ Ω(n) bits needed

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 94: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

Stream a1, a2, a3, . . . , am

Frequency of an item: f i =∣∣j : a j = i

∣∣kth frequency moment Fk =

n∑i=1

f ki

Frequency moments provide useful statistics:

I F0: number of distinct elements in streamI F1: length of stream, mI F2: size of self joinI Fk , k ≥ 2: skew of distribution

Trivial determination of Fk : maintain counters for each f i

⇒ Ω(n) bits needed

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 95: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

Stream a1, a2, a3, . . . , am

Frequency of an item: f i =∣∣j : a j = i

∣∣kth frequency moment Fk =

n∑i=1

f ki

Frequency moments provide useful statistics:I F0: number of distinct elements in streamI F1: length of stream, mI F2: size of self joinI Fk , k ≥ 2: skew of distribution

Trivial determination of Fk : maintain counters for each f i

⇒ Ω(n) bits needed

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 96: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

Stream a1, a2, a3, . . . , am

Frequency of an item: f i =∣∣j : a j = i

∣∣kth frequency moment Fk =

n∑i=1

f ki

Frequency moments provide useful statistics:I F0: number of distinct elements in streamI F1: length of stream, mI F2: size of self joinI Fk , k ≥ 2: skew of distribution

Trivial determination of Fk : maintain counters for each f i

⇒ Ω(n) bits needed

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 97: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

Stream a1, a2, a3, . . . , am

Frequency of an item: f i =∣∣j : a j = i

∣∣kth frequency moment Fk =

n∑i=1

f ki

Frequency moments provide useful statistics:I F0: number of distinct elements in streamI F1: length of stream, mI F2: size of self joinI Fk , k ≥ 2: skew of distribution

Trivial determination of Fk : maintain counters for each f i

⇒ Ω(n) bits needed

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 98: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

AMS Sampling (for estimating Fk)

1. Pick random item aj from stream2. Compute r =

∣∣j ′ : j ′ ≥ j , aj′ = aj∣∣

3. At the end of the stream calculate

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 99: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

AMS Sampling (for estimating Fk)1. Pick random item aj from stream

2. Compute r =∣∣j ′ : j ′ ≥ j , aj′ = aj

∣∣3. At the end of the stream calculate

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 100: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

AMS Sampling (for estimating Fk)1. Pick random item aj from stream2. Compute r =

∣∣j ′ : j ′ ≥ j , aj′ = aj∣∣

3. At the end of the stream calculate

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 101: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

AMS Sampling (for estimating Fk)1. Pick random item aj from stream2. Compute r =

∣∣j ′ : j ′ ≥ j , aj′ = aj∣∣

Example: stream 2, 3, 3, 2, 3, 1, 2, 3

aj = 3, r = 3

3. At the end of the stream calculate

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 102: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

AMS Sampling (for estimating Fk)1. Pick random item aj from stream2. Compute r =

∣∣j ′ : j ′ ≥ j , aj′ = aj∣∣

3. At the end of the stream calculate

m(rk − (r − 1)k)

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 103: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

AMS Sampling (for estimating Fk)1. Pick random item aj from stream2. Compute r =

∣∣j ′ : j ′ ≥ j , aj′ = aj∣∣

3. At the end of the stream calculate

m(rk − (r − 1)k)

for k ≥ 1 : Exp[m(rk − (r − 1)k) ]

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 104: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

AMS Sampling (for estimating Fk)1. Pick random item aj from stream2. Compute r =

∣∣j ′ : j ′ ≥ j , aj′ = aj∣∣

3. At the end of the stream calculate

m(rk − (r − 1)k)

for k ≥ 1 : Exp[m(rk − (r − 1)k) ]

=[m(f k1 −(f1−1)k)

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 105: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

AMS Sampling (for estimating Fk)1. Pick random item aj from stream2. Compute r =

∣∣j ′ : j ′ ≥ j , aj′ = aj∣∣

3. At the end of the stream calculate

m(rk − (r − 1)k)

for k ≥ 1 : Exp[m(rk − (r − 1)k) ]

=[m(f k1 −(f1−1)k) + m((f1−1)k−(f1−2)k)

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 106: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

AMS Sampling (for estimating Fk)1. Pick random item aj from stream2. Compute r =

∣∣j ′ : j ′ ≥ j , aj′ = aj∣∣

3. At the end of the stream calculate

m(rk − (r − 1)k)

for k ≥ 1 : Exp[m(rk − (r − 1)k) ]

=[m(f k1 −(f1−1)k) + m((f1−1)k−(f1−2)k) +. . .+ m(2k−1k)

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 107: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

AMS Sampling (for estimating Fk)1. Pick random item aj from stream2. Compute r =

∣∣j ′ : j ′ ≥ j , aj′ = aj∣∣

3. At the end of the stream calculate

m(rk − (r − 1)k)

for k ≥ 1 : Exp[m(rk − (r − 1)k) ]

=[m(f k1 −(f1−1)k) + m((f1−1)k−(f1−2)k) +. . .+ m(2k−1k) + m · 1k

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 108: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

AMS Sampling (for estimating Fk)1. Pick random item aj from stream2. Compute r =

∣∣j ′ : j ′ ≥ j , aj′ = aj∣∣

3. At the end of the stream calculate

m(rk − (r − 1)k)

for k ≥ 1 : Exp[m(rk − (r − 1)k) ]

=[m(f k1 −(f1−1)k) + m((f1−1)k−(f1−2)k) +. . .+ m(2k−1k) + m · 1k

+m(f k2 −(f2−1)k) + m((f2−1)k−(f2−2)k) +. . .+ m(2k−1k) + m · 1k

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 109: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

AMS Sampling (for estimating Fk)1. Pick random item aj from stream2. Compute r =

∣∣j ′ : j ′ ≥ j , aj′ = aj∣∣

3. At the end of the stream calculate

m(rk − (r − 1)k)

for k ≥ 1 : Exp[m(rk − (r − 1)k) ]

=[m(f k1 −(f1−1)k) + m((f1−1)k−(f1−2)k) +. . .+ m(2k−1k) + m · 1k

+m(f k2 −(f2−1)k) + m((f2−1)k−(f2−2)k) +. . .+ m(2k−1k) + m · 1k

. . .

+m(f kn −(fn−1)k) + m((fn−1)k−(fn−2)k) +. . .+ m(2k−1k) + m · 1k]

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 110: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

AMS Sampling (for estimating Fk)1. Pick random item aj from stream2. Compute r =

∣∣j ′ : j ′ ≥ j , aj′ = aj∣∣

3. At the end of the stream calculate

m(rk − (r − 1)k)

for k ≥ 1 : Exp[m(rk − (r − 1)k) ]

=[m(f k1 −(f1−1)k) + m((f1−1)k−(f1−2)k) +. . .+ m(2k−1k) + m · 1k

+m(f k2 −(f2−1)k) + m((f2−1)k−(f2−2)k) +. . .+ m(2k−1k) + m · 1k

. . .

+m(f kn −(fn−1)k) + m((fn−1)k−(fn−2)k) +. . .+ m(2k−1k) + m · 1k]

1m

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 111: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

AMS Sampling (for estimating Fk)1. Pick random item aj from stream2. Compute r =

∣∣j ′ : j ′ ≥ j , aj′ = aj∣∣

3. At the end of the stream calculate

m(rk − (r − 1)k)

for k ≥ 1 : Exp[m(rk − (r − 1)k) ]

= f k1 + f k2 + f k3 + . . . + f kn

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 112: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

AMS Sampling (for estimating Fk)1. Pick random item aj from stream2. Compute r =

∣∣j ′ : j ′ ≥ j , aj′ = aj∣∣

3. At the end of the stream calculate

m(rk − (r − 1)k)

for k ≥ 1 : Exp[m(rk − (r − 1)k) ]

= f k1 + f k2 + f k3 + . . . + f kn = Fk

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 113: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

AMS Sampling (for estimating Fk)1. Pick random item aj from stream2. Compute r =

∣∣j ′ : j ′ ≥ j , aj′ = aj∣∣

3. At the end of the stream calculate

m(rk − (r − 1)k)

To get (1± ε)-estimate of Fk with probability 1− δ:

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 114: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

AMS Sampling (for estimating Fk)1. Pick random item aj from stream2. Compute r =

∣∣j ′ : j ′ ≥ j , aj′ = aj∣∣

3. At the end of the stream calculate

m(rk − (r − 1)k)

To get (1± ε)-estimate of Fk with probability 1− δ:

Run O(n1−1/k

ε2

)parallel instances, take average A

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 115: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

AMS Sampling (for estimating Fk)1. Pick random item aj from stream2. Compute r =

∣∣j ′ : j ′ ≥ j , aj′ = aj∣∣

3. At the end of the stream calculate

m(rk − (r − 1)k)

To get (1± ε)-estimate of Fk with probability 1− δ:

Run O(n1−1/k

ε2

)parallel instances, take average A

⇒ Chebyshev-Ineq.: Pr [ |A− Fk | > ε · Fk ] ≤ p < 12

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 116: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

AMS Sampling (for estimating Fk)1. Pick random item aj from stream2. Compute r =

∣∣j ′ : j ′ ≥ j , aj′ = aj∣∣

3. At the end of the stream calculate

m(rk − (r − 1)k)

To get (1± ε)-estimate of Fk with probability 1− δ:

Run O(n1−1/k

ε2

)parallel instances, take average A

⇒ Chebyshev-Ineq.: Pr [ |A− Fk | > ε · Fk ] ≤ p < 12

Take median M over O(log 1

δ

)such averages

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 117: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

AMS Sampling (for estimating Fk)1. Pick random item aj from stream2. Compute r =

∣∣j ′ : j ′ ≥ j , aj′ = aj∣∣

3. At the end of the stream calculate

m(rk − (r − 1)k)

To get (1± ε)-estimate of Fk with probability 1− δ:

Run O(n1−1/k

ε2

)parallel instances, take average A

⇒ Chebyshev-Ineq.: Pr [ |A− Fk | > ε · Fk ] ≤ p < 12

Take median M over O(log 1

δ

)such averages

⇒ Chernoff-Ineq.: Pr [ |M − Fk | > ε · Fk ] ≤ δ

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 118: Basic algorithmic techniques for data streams

AMS Sampling [Alon, Matias, Szegedy ’96]

AMS Sampling (for estimating Fk)1. Pick random item aj from stream2. Compute r =

∣∣j ′ : j ′ ≥ j , aj′ = aj∣∣

3. At the end of the stream calculate

m(rk − (r − 1)k)

To get (1± ε)-estimate of Fk with probability 1− δ:

Run O(n1−1/k

ε2

)parallel instances, take average A

⇒ Chebyshev-Ineq.: Pr [ |A− Fk | > ε · Fk ] ≤ p < 12

Take median M over O(log 1

δ

)such averages

⇒ Chernoff-Ineq.: Pr [ |M − Fk | > ε · Fk ] ≤ δ

Memory consumption: O(n1−1/k

ε2 · log 1δ · (log n + logm)

)bits

Mariano Zelke Basic algorithmic techniques for data streams 8/12

Page 119: Basic algorithmic techniques for data streams

Sliding Window Sampling

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 120: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 121: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 122: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

current sample: a2

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 123: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

current sample: a2

Current sample might be ”far away“ from actual item

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 124: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

current sample: a2

Current sample might be ”far away“ from actual item

- fine for some applications

- for others only w recent items matter

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 125: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

current sample: a2

Current sample might be ”far away“ from actual item

- fine for some applications

- for others only w recent items matter

⇒ only consider items in a sliding window of size w

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 126: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Current sample might be ”far away“ from actual item

- fine for some applications

- for others only w recent items matter

⇒ only consider items in a sliding window of size w

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 127: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Current sample might be ”far away“ from actual item

- fine for some applications

- for others only w recent items matter

⇒ only consider items in a sliding window of size w

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 128: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Current sample might be ”far away“ from actual item

- fine for some applications

- for others only w recent items matter

⇒ only consider items in a sliding window of size w

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 129: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Current sample might be ”far away“ from actual item

- fine for some applications

- for others only w recent items matter

⇒ only consider items in a sliding window of size w

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 130: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Current sample might be ”far away“ from actual item

- fine for some applications

- for others only w recent items matter

⇒ only consider items in a sliding window of size w

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 131: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Current sample might be ”far away“ from actual item

- fine for some applications

- for others only w recent items matter

⇒ only consider items in a sliding window of size w

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 132: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Current sample might be ”far away“ from actual item

- fine for some applications

- for others only w recent items matter

⇒ only consider items in a sliding window of size w

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 133: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Current sample might be ”far away“ from actual item

- fine for some applications

- for others only w recent items matter

⇒ only consider items in a sliding window of size w

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 134: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Current sample might be ”far away“ from actual item

- fine for some applications

- for others only w recent items matter

⇒ only consider items in a sliding window of size w

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 135: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Current sample might be ”far away“ from actual item

- fine for some applications

- for others only w recent items matter

⇒ only consider items in a sliding window of size w

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 136: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Goal: sample an item uniformly from sliding window of size w

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 137: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Goal: sample an item uniformly from sliding window of size w

Trivial: - memorize whole window content ⇒ w · log n bits

- impractical if w is large

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 138: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Goal: sample an item uniformly from sliding window of size w

Trivial: - memorize whole window content ⇒ w · log n bits

- impractical if w is large

Better idea:

Algorithm:

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 139: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Goal: sample an item uniformly from sliding window of size w

Trivial: - memorize whole window content ⇒ w · log n bits

- impractical if w is large

Better idea:

Algorithm: 1: For each ai pick random value ri ∈ (0, 1)

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 140: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Goal: sample an item uniformly from sliding window of size w

Trivial: - memorize whole window content ⇒ w · log n bits

- impractical if w is large

Better idea:

Algorithm: 1: For each ai pick random value ri ∈ (0, 1)

2: In window (ai−w+1, . . . , ai ) choose aj with smallest rj

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 141: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Goal: sample an item uniformly from sliding window of size w

Trivial: - memorize whole window content ⇒ w · log n bits

- impractical if w is large

Better idea:

Algorithm: 1: For each ai pick random value ri ∈ (0, 1)

2: In window (ai−w+1, . . . , ai ) choose aj with smallest rj

3: Only maintain items in window whose r -value is

3: minimal among subsequent r -values

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 142: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Algorithm: 1: For each ai pick random value ri ∈ (0, 1)

2: In window (ai−w+1, . . . , ai ) choose aj with smallest rj

3: Only maintain items in window whose r -value is

3: minimal among subsequent r -values

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 143: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Algorithm: 1: For each ai pick random value ri ∈ (0, 1)

2: In window (ai−w+1, . . . , ai ) choose aj with smallest rj

3: Only maintain items in window whose r -value is

3: minimal among subsequent r -values

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 144: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Algorithm: 1: For each ai pick random value ri ∈ (0, 1)

2: In window (ai−w+1, . . . , ai ) choose aj with smallest rj

3: Only maintain items in window whose r -value is

3: minimal among subsequent r -values

random value r 1 = 0.2

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 145: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Algorithm: 1: For each ai pick random value ri ∈ (0, 1)

2: In window (ai−w+1, . . . , ai ) choose aj with smallest rj

3: Only maintain items in window whose r -value is

3: minimal among subsequent r -values

random value r 1 = 0.2

a1

0.2

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 146: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Algorithm: 1: For each ai pick random value ri ∈ (0, 1)

2: In window (ai−w+1, . . . , ai ) choose aj with smallest rj

3: Only maintain items in window whose r -value is

3: minimal among subsequent r -values

current

sample

random value r 1 = 0.2

a1

0.2

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 147: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Algorithm: 1: For each ai pick random value ri ∈ (0, 1)

2: In window (ai−w+1, . . . , ai ) choose aj with smallest rj

3: Only maintain items in window whose r -value is

3: minimal among subsequent r -values

current

sample

random value r 2 = 0.4

a1

0.2

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 148: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Algorithm: 1: For each ai pick random value ri ∈ (0, 1)

2: In window (ai−w+1, . . . , ai ) choose aj with smallest rj

3: Only maintain items in window whose r -value is

3: minimal among subsequent r -values

current

sample

random value r 2 = 0.4

a1

0.2

a2

0.4

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 149: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Algorithm: 1: For each ai pick random value ri ∈ (0, 1)

2: In window (ai−w+1, . . . , ai ) choose aj with smallest rj

3: Only maintain items in window whose r -value is

3: minimal among subsequent r -values

current

sample

random value r 2 = 0.4

a1

0.2

a2

0.4

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 150: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Algorithm: 1: For each ai pick random value ri ∈ (0, 1)

2: In window (ai−w+1, . . . , ai ) choose aj with smallest rj

3: Only maintain items in window whose r -value is

3: minimal among subsequent r -values

current

sample

random value r 3 = 0.7

a1

0.2

a2

0.4

a3

0.7

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 151: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Algorithm: 1: For each ai pick random value ri ∈ (0, 1)

2: In window (ai−w+1, . . . , ai ) choose aj with smallest rj

3: Only maintain items in window whose r -value is

3: minimal among subsequent r -values

current

sample

random value r 4 = 0.8

a1

0.2

a2

0.4

a3

0.7

a4

0.8

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 152: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Algorithm: 1: For each ai pick random value ri ∈ (0, 1)

2: In window (ai−w+1, . . . , ai ) choose aj with smallest rj

3: Only maintain items in window whose r -value is

3: minimal among subsequent r -values

current

sample

random value r 5 = 0.6

a1

0.2

a2

0.4

a3

0.7

a4

0.8

a5

0.6

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 153: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Algorithm: 1: For each ai pick random value ri ∈ (0, 1)

2: In window (ai−w+1, . . . , ai ) choose aj with smallest rj

3: Only maintain items in window whose r -value is

3: minimal among subsequent r -values

current

sample

random value r 5 = 0.6

a1

0.2

a2

0.4

a3

0.7

a4

0.8

a5

0.6

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 154: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Algorithm: 1: For each ai pick random value ri ∈ (0, 1)

2: In window (ai−w+1, . . . , ai ) choose aj with smallest rj

3: Only maintain items in window whose r -value is

3: minimal among subsequent r -values

current

sample

random value r 5 = 0.6

a1

0.2

a2

0.4

a3

0.7

a4

0.8

a5

0.6

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 155: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Algorithm: 1: For each ai pick random value ri ∈ (0, 1)

2: In window (ai−w+1, . . . , ai ) choose aj with smallest rj

3: Only maintain items in window whose r -value is

3: minimal among subsequent r -values

current

sample

random value r 5 = 0.6

a1

0.2

a2

0.4

a5

0.6

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 156: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Algorithm: 1: For each ai pick random value ri ∈ (0, 1)

2: In window (ai−w+1, . . . , ai ) choose aj with smallest rj

3: Only maintain items in window whose r -value is

3: minimal among subsequent r -values

current

sample

random value r 6 = 0.5

a1

0.2

a2

0.4

a5

0.6

a6

0.5

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 157: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Algorithm: 1: For each ai pick random value ri ∈ (0, 1)

2: In window (ai−w+1, . . . , ai ) choose aj with smallest rj

3: Only maintain items in window whose r -value is

3: minimal among subsequent r -values

current

sample

random value r 6 = 0.5

a1

0.2

a2

0.4

a5

0.6

a6

0.5

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 158: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Algorithm: 1: For each ai pick random value ri ∈ (0, 1)

2: In window (ai−w+1, . . . , ai ) choose aj with smallest rj

3: Only maintain items in window whose r -value is

3: minimal among subsequent r -values

current

sample

random value r 6 = 0.5

a1

0.2

a2

0.4

a6

0.5

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 159: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Algorithm: 1: For each ai pick random value ri ∈ (0, 1)

2: In window (ai−w+1, . . . , ai ) choose aj with smallest rj

3: Only maintain items in window whose r -value is

3: minimal among subsequent r -values

current

sample

random value r 6 = 0.5

a1

0.2

a2

0.4

a6

0.5

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 160: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Algorithm: 1: For each ai pick random value ri ∈ (0, 1)

2: In window (ai−w+1, . . . , ai ) choose aj with smallest rj

3: Only maintain items in window whose r -value is

3: minimal among subsequent r -values

current

sample

random value r 6 = 0.5

a2

0.4

a6

0.5

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 161: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

current

sample

axrx

a5

0.6

ayry

· · ·

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 162: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Analysis: What is the length of ` ?

current

sample

axrx

a5

0.6

ayry

· · ·

linked list `

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 163: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Analysis: What is the length of ` ?

Worst case: |`| = w But that is very unlikely.

current

sample

axrx

a5

0.6

ayry

· · ·

linked list `

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 164: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Analysis: What is the length of ` ?

Exp[ |`| ] =

current

sample

axrx

a5

0.6

ayry

· · ·

linked list `

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 165: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Analysis: What is the length of ` ?

Exp[ |`| ] = Pr [ai ∈ ` ] +

current

sample

axrx

a5

0.6

ayry

· · ·

linked list `

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 166: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Analysis: What is the length of ` ?

Exp[ |`| ] = Pr [ai ∈ ` ] + Pr [ai−1 ∈ ` ] +

current

sample

axrx

a5

0.6

ayry

· · ·

linked list `

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 167: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Analysis: What is the length of ` ?

Exp[ |`| ] = Pr [ai ∈ ` ] + Pr [ai−1 ∈ ` ] + . . .+ Pr [ai−w+1 ∈ ` ]

current

sample

axrx

a5

0.6

ayry

· · ·

linked list `

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 168: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Analysis: What is the length of ` ?

Exp[ |`| ] = Pr [ai ∈ ` ] + Pr [ai−1 ∈ ` ] + . . .+ Pr [ai−w+1 ∈ ` ]

= 1 +

current

sample

axrx

a5

0.6

ayry

· · ·

linked list `

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 169: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Analysis: What is the length of ` ?

Exp[ |`| ] = Pr [ai ∈ ` ] + Pr [ai−1 ∈ ` ] + . . .+ Pr [ai−w+1 ∈ ` ]

= 1 + 12 +

current

sample

axrx

a5

0.6

ayry

· · ·

linked list `

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 170: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Analysis: What is the length of ` ?

Exp[ |`| ] = Pr [ai ∈ ` ] + Pr [ai−1 ∈ ` ] + . . .+ Pr [ai−w+1 ∈ ` ]

= 1 + 12 + 1

3 +

current

sample

axrx

a5

0.6

ayry

· · ·

linked list `

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 171: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Analysis: What is the length of ` ?

Exp[ |`| ] = Pr [ai ∈ ` ] + Pr [ai−1 ∈ ` ] + . . .+ Pr [ai−w+1 ∈ ` ]

= 1 + 12 + 1

3 + 14 + . . .+ 1

w

current

sample

axrx

a5

0.6

ayry

· · ·

linked list `

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 172: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Analysis: What is the length of ` ?

Exp[ |`| ] = Pr [ai ∈ ` ] + Pr [ai−1 ∈ ` ] + . . .+ Pr [ai−w+1 ∈ ` ]

= 1 + 12 + 1

3 + 14 + . . .+ 1

w = O(logw)

current

sample

axrx

a5

0.6

ayry

· · ·

linked list `

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 173: Basic algorithmic techniques for data streams

Sliding Window Sampling

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , ai , . . . , am

Analysis:

Exp[ memory usage ] = O(logw · log n)

current

sample

axrx

a5

0.6

ayry

· · ·

linked list `

Mariano Zelke Basic algorithmic techniques for data streams 9/12

Page 174: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 175: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

Point query: fi = ?

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 176: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

Point query: fi = ?

Trivial solution: maintain n counters

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 177: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

2/ε

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 178: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

2/ε

rand. hash funct. h1

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 179: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

2/ε

rand. hash funct. h1

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 180: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

2/ε

rand. hash funct. h1

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 181: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

2/ε

rand. hash funct. h1 +1

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 182: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

2/ε

rand. hash funct. h1 +1

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 183: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

2/ε

rand. hash funct. h1 +1

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 184: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

2/ε

rand. hash funct. h1

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 185: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

Point query: fi = ?

2/ε

rand. hash funct. h1

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 186: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

Point query: fi = ?

2/ε

rand. hash funct. h1

i

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 187: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

Point query: fi = ?

2/ε

rand. hash funct. h1

i

fi

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 188: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

Point query: fi = ?

Give fi as estimate for fi

2/ε

rand. hash funct. h1

i

fi

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 189: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

Point query: fi = ?

Give fi as estimate for fi

Analysis:

2/ε

rand. hash funct. h1

i

fi

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 190: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

Point query: fi = ?

Give fi as estimate for fi

Analysis: fi ≥ fi

2/ε

rand. hash funct. h1

i

fi

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 191: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

Point query: fi = ?

Give fi as estimate for fi

Analysis: fi ≥ fi

Exp[ fi ] ≤ fi + ε ·m/2

2/ε

rand. hash funct. h1

i

fi

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 192: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

Point query: fi = ?

Give fi as estimate for fi

Analysis: fi ≥ fi

Exp[ fi ] ≤ fi + ε ·m/2

Markov-Ineq.: Pr [ fi > fi + ε ·m ] ≤ 12

2/ε

rand. hash funct. h1

i

fi

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 193: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

2/ε

rand. hash funct. h1

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 194: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

2/ε

rand. hash funct. h1

rand. hash funct. h2

rand. hash funct. h3

rand. hash funct. hd

d =

log 1δ

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 195: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

2/ε

rand. hash funct. h1

rand. hash funct. h2

rand. hash funct. h3

rand. hash funct. hd

d =

log 1δ

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 196: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

2/ε

rand. hash funct. h1

rand. hash funct. h2

rand. hash funct. h3

rand. hash funct. hd

d =

log 1δ

+1

+1

+1

+1

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 197: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

2/ε

rand. hash funct. h1

rand. hash funct. h2

rand. hash funct. h3

rand. hash funct. hd

d =

log 1δ

+1

+1

+1

+1

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 198: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

2/ε

rand. hash funct. h1

rand. hash funct. h2

rand. hash funct. h3

rand. hash funct. hd

d =

log 1δ

Point query: fi = ?

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 199: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

2/ε

rand. hash funct. h1

rand. hash funct. h2

rand. hash funct. h3

rand. hash funct. hd

d =

log 1δ

i

Point query: fi = ?

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 200: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

2/ε

rand. hash funct. h1

rand. hash funct. h2

rand. hash funct. h3

rand. hash funct. hd

d =

log 1δ

i

Point query: fi = ?

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 201: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

2/ε

rand. hash funct. h1

rand. hash funct. h2

rand. hash funct. h3

rand. hash funct. hd

d =

log 1δ

i

Point query: fi = ?

Give fi := minimum of -values as estimate for fi

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 202: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

2/ε

rand. hash funct. h1

rand. hash funct. h2

rand. hash funct. h3

rand. hash funct. hd

d =

log 1δ

i

Analysis: Pr [ fi > fi + ε ·m ] ≤(

12

)d= δ

Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 203: Basic algorithmic techniques for data streams

Count-Min Sketch [Cormode, Muthukrishnan ’04]

a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, . . . , aj , . . . , am

2/ε

rand. hash funct. h1

rand. hash funct. h2

rand. hash funct. h3

rand. hash funct. hd

d =

log 1δ

i

Analysis: Pr [ fi > fi + ε ·m ] ≤(

12

)d= δ

Memory consumption: O(

1ε · log 1

δ · logm + log 1δ · log n

)Mariano Zelke Basic algorithmic techniques for data streams 10/12

Page 204: Basic algorithmic techniques for data streams

Recap

I Reservoir sampling for uniform selection

I AMS sampling for frequency moments

I Sliding window sampling

I Count-Min sketch for point queries

Mariano Zelke Basic algorithmic techniques for data streams 11/12

Page 205: Basic algorithmic techniques for data streams

References

I N. Alon, Y. Matias, and M. Szegedy: The space complexity ofapproximating the frequency moments. J. Comput. Syst. Sci.58, pp. 137-147, 1999.

I G. Cormode and S. Muthukrishnan: An Improved DataStream Summary: The Count-Min Sketch and itsApplications. Journal of Algorithms 55(1), pp. 58-75, 2005.

I B. Lahiri and S. Tirthapura: Stream Sampling. Pages2838-2842 in: Ling Liu, M. Tamer zsu (Eds.): Encyclopedia ofDatabase Systems. Springer US, 2009.

I S. Muthukrishnan: Data Streams: Algorithms andApplications. Foundations and Trends in TheoreticalComputer Science, 1(2), 2005.

I J. Vitter: Random Sampling with a Reservoir. ACMTransactions on Mathematical Software, 11(1), pp. 37-57,1985.

Mariano Zelke Basic algorithmic techniques for data streams 12/12