Leveraging Big Data: Lecture 2
Instructors: Edith Cohen, Amos Fiat, Haim Kaplan, Tova Milo
http://www.cohenwang.com/edith/bigdataclass2013


TRANSCRIPT

Page 1: Leveraging Big Data:  Lecture 2

Leveraging Big Data: Lecture 2

Instructors:

http://www.cohenwang.com/edith/bigdataclass2013

Edith Cohen, Amos Fiat, Haim Kaplan, Tova Milo

Page 2: Leveraging Big Data:  Lecture 2

Counting Distinct Elements

Elements occur multiple times; we want to count the number of distinct elements.

The number of distinct elements is $n$ ($n = 6$ in the example). The total number of elements is 11 in this example.

32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4,

Exact counting of distinct elements requires a structure of size $\Omega(n)$. We are happy with an approximate count that uses a small working memory.

Page 3: Leveraging Big Data:  Lecture 2

Distinct Elements: Approximate Counting
32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4

We want to be able to compute and maintain a small sketch of the set $N$ of distinct items seen so far.

Page 4: Leveraging Big Data:  Lecture 2

Distinct Elements: Approximate Counting

Requirements from the sketch $s(N)$:
• Size of the sketch $\ll n$.
• We can query the sketch to get a good estimate $\hat{n}$ of $n$ (small relative error).
• For a new element $x$: $s(N \cup \{x\})$ is easy to compute from $s(N)$ and $x$ (for data stream computation).
• If $N_1$ and $N_2$ are (possibly overlapping) sets, then we can compute the union sketch from their sketches: $s(N_1 \cup N_2)$ from $s(N_1)$ and $s(N_2)$ (for distributed computation).

Page 5: Leveraging Big Data:  Lecture 2

Distinct Elements: Approximate Counting

Size estimation / minimum value technique [Flajolet-Martin 85, Cohen 94]

32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4,

$h$ is a random hash function from element IDs to uniform random numbers in $[0,1]$.

Maintain the min-hash value $y$: Initialize $y \leftarrow 1$. Processing an element $x$: $y \leftarrow \min\{y, h(x)\}$.
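A minimal sketch of this update rule in Python (not from the slides; the helper `hash01` is a hypothetical stand-in for the random hash function $h$):

```python
import hashlib

def hash01(x):
    # Deterministic stand-in for a random hash: map an element ID to a
    # pseudo-uniform value in [0, 1) via the first 8 bytes of its MD5 digest.
    digest = hashlib.md5(str(x).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def min_hash(stream):
    y = 1.0                    # initialize y to the largest possible value
    for x in stream:
        y = min(y, hash01(x))  # repeated elements never change y
    return y

print(min_hash([32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]))
```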

Page 6: Leveraging Big Data:  Lecture 2

Distinct Elements: Approximate Counting

[Slide table: the stream 32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4 with the hash value $h(x)$ of each element, the running minimum $y$, and the number of distinct elements $n$ seen so far. The running minimum after each arrival is 0.45, 0.35, 0.35, 0.35, 0.21, 0.21, 0.21, 0.21, 0.14, 0.14, 0.14, while $n$ grows 1, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6.]

The minimum hash value $y = \min_{x \in N} h(x)$ is: unaffected by repeated elements; non-increasing with the number of distinct elements $n$.

Page 7: Leveraging Big Data:  Lecture 2

Distinct Elements: Approximate Counting

How does the minimum hash value give information on the number of distinct elements $n$?

The $n$ distinct hash values are independent and uniform on $[0,1]$. The expectation of the minimum is $\mathsf{E}[y] = \frac{1}{n+1}$.

A single value gives only limited information. To boost the information, we maintain $k$ values.

Page 8: Leveraging Big Data:  Lecture 2

Why is the expectation $\frac{1}{n+1}$?

Take a circle of circumference 1. Throw a random red point to "mark" the start of the segment (circle points map to $[0,1)$). Throw $n$ more points independently at random: these are the hash values of the $n$ distinct elements. The circle is cut into $n+1$ segments by these points. The expected length of each segment is $\frac{1}{n+1}$. The same holds for the segment clockwise from the red point, whose length is distributed exactly like the minimum hash value.
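A quick Monte Carlo check of this fact (a sketch, not part of the lecture): the average minimum of $n$ i.i.d. uniform values concentrates near $1/(n+1)$.

```python
import random

def mean_min_of_uniforms(n, trials=100_000):
    # Average over many trials of the minimum of n i.i.d. U[0,1] values.
    return sum(min(random.random() for _ in range(n))
               for _ in range(trials)) / trials

n = 6
print(mean_min_of_uniforms(n))  # close to 1/(n+1) = 0.1428...
print(1 / (n + 1))
```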

Page 9: Leveraging Big Data:  Lecture 2

Min-Hash Sketches

k-mins sketch: Use $k$ "independent" hash functions $h_1, \dots, h_k$. Track the respective minimum $y_i = \min_{x \in N} h_i(x)$ for each function.

Bottom-k sketch: Use a single hash function $h$. Track the $k$ smallest hash values.

k-partition sketch: Use a single hash function. Use the first $\log_2 k$ bits of the hash to map an element uniformly to one of $k$ parts. Call the remaining bits $h(x)$. For $i = 1, \dots, k$: track the minimum hash value $y_i$ of the elements in part $i$.

These sketches maintain $k$ values from the range of the hash function (distribution).

All sketches are the same for $k = 1$.
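The following Python sketch illustrates the three variants on the running example (assumptions: `h01` is a hypothetical seeded hash standing in for the "independent" hash functions, and the k-partition variant uses two seeded hashes in place of splitting the bits of a single hash value):

```python
import hashlib

def h01(x, seed=0):
    # Pseudo-random hash of (seed, x) into [0, 1).
    d = hashlib.md5(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def k_mins_sketch(elements, k):
    # k "independent" hash functions; track the minimum for each one.
    return [min(h01(x, i) for x in elements) for i in range(k)]

def bottom_k_sketch(elements, k):
    # A single hash function; track the k smallest distinct hash values.
    return sorted({h01(x) for x in elements})[:k]

def k_partition_sketch(elements, k):
    # Each element is hashed once to a part and once to a value;
    # track the minimum value within each part (None = empty part).
    mins = [None] * k
    for x in elements:
        part = int(h01(x, "part") * k)
        value = h01(x, "value")
        if mins[part] is None or value < mins[part]:
            mins[part] = value
    return mins

elements = [32, 12, 14, 7, 6, 4]
print(k_mins_sketch(elements, 3))
print(bottom_k_sketch(elements, 3))
print(k_partition_sketch(elements, 3))
```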

Page 10: Leveraging Big Data:  Lecture 2

Min-Hash Sketches: k-mins, bottom-k, k-partition

Why study all 3 variants? Different tradeoffs between update cost, accuracy, usage…

Beyond distinct counting, Min-Hash sketches correspond to sampling schemes of large data sets:
• Similarity queries between datasets
• Selectivity/subset queries

These patterns generally apply as methods to gather increased confidence from a random “projection”/sample.

Page 11: Leveraging Big Data:  Lecture 2

Min-Hash Sketches: Examples (k-mins, k-partition, bottom-k)

The min-hash value and sketches only depend on:
• The random hash function/s
• The set of distinct elements

Not on the order in which elements appear or on their multiplicity.

$N = \{32, 12, 14, 7, 6, 4\}$

Page 12: Leveraging Big Data:  Lecture 2

Min-Hash Sketches: Example, k-mins

x:       32    12    14    7     6     4
h1(x):   0.92  0.45  0.74  0.35  0.21  0.14
h2(x):   0.20  0.19  0.07  0.51  0.70  0.55
h3(x):   0.18  0.10  0.93  0.71  0.50  0.89

$(y_1, y_2, y_3) = (0.14, 0.07, 0.10)$

Page 13: Leveraging Big Data:  Lecture 2

Min-Hash Sketches: k-mins

k-mins sketch: Use $k$ "independent" hash functions $h_1, \dots, h_k$. Track the respective minimum $y_i$ for each function.

Processing a new element $x$: For $i = 1, \dots, k$: $y_i \leftarrow \min\{y_i, h_i(x)\}$.

Computation: $O(k)$, whether the sketch is actually updated or not.

Example: $h_1(x) = 0.35$, $h_2(x) = 0.51$, $h_3(x) = 0.71$.

Page 14: Leveraging Big Data:  Lecture 2

Min-Hash Sketches: Example, k-partition

x:                  32    12    14    7     6     4
i(x) (part-hash):   3     2     1     3     1     2
h(x) (value-hash):  0.20  0.19  0.07  0.51  0.70  0.55

Per-part minima: $(y_1, y_2, y_3) = (0.07, 0.19, 0.20)$

Page 15: Leveraging Big Data:  Lecture 2

Min-Hash Sketches: k-partition

Processing a new element $x$: compute the part $i(x)$ and update $y_{i(x)} \leftarrow \min\{y_{i(x)}, h(x)\}$.

Computation: $O(1)$ to test or update.

k-partition sketch: Use a single hash function. Use the first $\log_2 k$ bits of the hash to map an element uniformly to one of $k$ parts. Call the remaining bits $h(x)$. For $i = 1, \dots, k$: track the minimum hash value of the elements in part $i$.

Page 16: Leveraging Big Data:  Lecture 2

Min-Hash Sketches: Example, bottom-k

x:      32    12    14    7     6     4
h(x):   0.20  0.19  0.07  0.51  0.70  0.55

The three smallest values: $(y_1, y_2, y_3) = (0.07, 0.19, 0.20)$

Page 17: Leveraging Big Data:  Lecture 2

Min-Hash Sketches: bottom-k

Processing a new element $x$: if $h(x) < y_k$ (the largest value in the sketch), insert $h(x)$ and discard $y_k$. Computation: the sketch is maintained as a sorted list or as a priority queue. $O(1)$ to test if an update is needed; $O(k)$ to update a sorted list; $O(\log k)$ to update a priority queue.

Bottom-k sketch: Use a single hash function $h$. Track the $k$ smallest hash values.

We will see that the expected number of changes is much smaller than the number of distinct elements.
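A possible priority-queue implementation of this update in Python (a sketch using `heapq` with negated values so that the largest kept value $y_k$ sits at the root; it assumes distinct elements receive distinct hash values):

```python
import heapq

class BottomK:
    """Bottom-k sketch: keep the k smallest hash values seen so far."""

    def __init__(self, k):
        self.k = k
        self.neg = []  # max-heap via negation: -self.neg[0] is y_k

    def process(self, hx):
        if len(self.neg) < self.k:
            heapq.heappush(self.neg, -hx)      # still filling: O(log k)
        elif hx < -self.neg[0]:                # O(1) test against y_k
            heapq.heapreplace(self.neg, -hx)   # O(log k) update
        # otherwise the sketch is unchanged

    def values(self):
        return sorted(-v for v in self.neg)

sk = BottomK(3)
for h in [0.20, 0.19, 0.07, 0.51, 0.70, 0.55]:
    sk.process(h)
print(sk.values())   # [0.07, 0.19, 0.20]
```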

Page 18: Leveraging Big Data:  Lecture 2

Min-Hash Sketches: Number of updates

Claim: The expected number of actual updates (changes) of the min-hash sketch is $O(k \ln n)$.

Proof: First consider $k = 1$. Look at the distinct elements in the order they first occur. The $i$th distinct element has a lower hash value than the current minimum with probability $1/i$: this is the probability of being first in a random permutation of $i$ elements. The total expected number of updates is $\sum_{i=1}^{n} 1/i = H_n \approx \ln n$.

Stream:        32   12   14   32   7    12   32   7    6    12   4
Update prob.:  1    1/2  1/3  0    1/4  0    0    0    1/5  0    1/6

Page 19: Leveraging Big Data:  Lecture 2

Min-Hash Sketches: Number of updates

Claim: The expected number of actual updates (changes) of the min-hash sketch is $O(k \ln n)$.

Proof (continued): Recap for $k = 1$ (a single min-hash value): the $i$th distinct element causes an update with probability $1/i$; the expected total is $\sum_{i=1}^{n} 1/i \approx \ln n$.

k-mins: $k$ min-hash values; apply the argument $k$ times to get $\approx k \ln n$.

Bottom-k: We keep the $k$ smallest values, so the update probability of the $i$th distinct element is $\min\{1, k/i\}$ (the probability of being among the first $k$ in a random permutation of $i$ elements); the expected total is $\approx k(1 + \ln(n/k))$.

k-partition: $k$ min-hash values, each over roughly $n/k$ distinct elements, giving $\approx k \ln(n/k)$.
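A small simulation consistent with the $k = 1$ case (a sketch, not from the slides): the empirical number of changes over $n$ distinct elements is close to the harmonic number $H_n \approx \ln n$.

```python
import random

def avg_updates(n, trials=200):
    # Average number of times the single min-hash value changes while
    # processing n distinct elements (repeats never cause changes).
    total = 0
    for _ in range(trials):
        y = 1.0
        for _ in range(n):
            h = random.random()
            if h < y:
                y = h
                total += 1
    return total / trials

n = 10_000
print(avg_updates(n))
print(sum(1 / i for i in range(1, n + 1)))  # H_n, about ln(n) + 0.577
```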

Page 20: Leveraging Big Data:  Lecture 2

Merging Min-Hash Sketches

The union sketch $s(N_1 \cup N_2)$ from the sketches of two sets $N_1$, $N_2$:
• k-mins: take the minimum per hash function.
• k-partition: take the minimum per part.
• Bottom-k: the $k$ smallest values in the union of the data must be among the $k$ smallest of their own set, so take the $k$ smallest values in $s(N_1) \cup s(N_2)$.

!! We apply the same set of hash functions to all elements / data sets / streams.
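Hedged Python sketches of the three merge rules (they assume both sketches were built with the same hash functions, as stressed above, and that k-partition sketches use `None` for an empty part):

```python
def merge_k_mins(s1, s2):
    # Coordinate-wise minimum, one entry per hash function.
    return [min(a, b) for a, b in zip(s1, s2)]

def merge_k_partition(s1, s2):
    # Minimum per part; None marks a part that is still empty.
    return [min((v for v in (a, b) if v is not None), default=None)
            for a, b in zip(s1, s2)]

def merge_bottom_k(s1, s2, k):
    # The k smallest values of the union must appear in one of the sketches.
    return sorted(set(s1) | set(s2))[:k]
```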

Page 21: Leveraging Big Data:  Lecture 2

Using Min-Hash Sketches

Recap: We defined Min-Hash sketches (3 types), adding elements, merging Min-Hash sketches, and some properties of these sketches.

Next: We put Min-Hash sketches to work:
• Estimating the distinct count from a Min-Hash sketch
• Tools from estimation theory

Page 22: Leveraging Big Data:  Lecture 2

The Exponential Distribution $\mathrm{Exp}(\lambda)$: PDF $f(y) = \lambda e^{-\lambda y}$; CDF $F(y) = 1 - e^{-\lambda y}$. Very useful properties:

Memorylessness: $\Pr[y > s + t \mid y > t] = \Pr[y > s]$.

Min-to-Sum conversion: the minimum of independent exponential random variables is exponential with rate equal to the sum of the rates; in particular, the minimum of $n$ i.i.d. $\mathrm{Exp}(1)$ values is $\mathrm{Exp}(n)$.

Relation with uniform: if $u \sim U[0,1]$ then $-\ln u \sim \mathrm{Exp}(1)$, so a uniform hash can be converted to an exponential one.
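A quick numerical illustration of the last two properties (a sketch, not from the slides): transforming uniform values by $-\ln(\cdot)$ gives Exp(1) values, and the minimum of $n$ of them averages to about $1/n$.

```python
import math, random

n, trials = 6, 200_000
mins = [min(-math.log(1.0 - random.random()) for _ in range(n))
        for _ in range(trials)]
print(sum(mins) / trials)  # close to 1/n, the mean of Exp(n)
print(1 / n)
```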

Page 23: Leveraging Big Data:  Lecture 2

Estimating Distinct Count from a Min-Hash Sketch: k-mins

• Change the hash range to the exponential distribution: $h(x) \sim \mathrm{Exp}(1)$.
• Using the Min-to-Sum property, each min-hash value satisfies $y_i \sim \mathrm{Exp}(n)$.
– In fact, we can keep uniform hash values and apply $-\ln(\cdot)$ only when estimating.
• The number of distinct elements becomes a parameter estimation problem:

Given $k$ independent samples $y_1, \dots, y_k$ from $\mathrm{Exp}(n)$, estimate $n$.

Page 24: Leveraging Big Data:  Lecture 2

Estimating Distinct Count from a Min-Hash Sketch: k-mins

Each $y_i \sim \mathrm{Exp}(n)$ has expectation $1/n$ and variance $1/n^2$. The average $\bar{y} = \frac{1}{k}\sum_{i=1}^{k} y_i$ has expectation $1/n$ and variance $1/(k n^2)$. The CV (coefficient of variation, standard deviation over mean) is $1/\sqrt{k}$.

$\bar{y}$ is a good unbiased estimator for $1/n$. But $1/n$ is the inverse of what we want. What about estimating $n$?

Page 25: Leveraging Big Data:  Lecture 2

Estimating Distinct Count from a Min-Hash Sketch: k-mins

What about estimating $n$?

1) We can use the biased estimator $\hat{n} = 1/\bar{y} = k / \sum_i y_i$.

To say something useful about the estimate quality, we apply Chebyshev's inequality to bound the probability that $\bar{y}$ is far from its expectation $1/n$, and thus that $\hat{n}$ is far from $n$.

2) Maximum Likelihood Estimation (a general and powerful technique).

Page 26: Leveraging Big Data:  Lecture 2

Chebyshev's Inequality: For any random variable $X$ with expectation $\mu$ and standard deviation $\sigma$, for any $c > 0$:
$\Pr[\,|X - \mu| \ge c\,\sigma\,] \le \frac{1}{c^2}$

For $X = \bar{y}$: $\mu = 1/n$ and $\sigma = \frac{1}{n\sqrt{k}}$.

Using $c = \epsilon \sqrt{k}$: $\Pr\!\left[\,|\bar{y} - 1/n| \ge \frac{\epsilon}{n}\,\right] \le \frac{1}{k\,\epsilon^2}$.

Page 27: Leveraging Big Data:  Lecture 2

Using Chebyshev's Inequality

For example, for $k = \frac{1}{\epsilon^2 \delta}$, the probability that the relative error of $\bar{y}$ (and hence, up to lower-order terms, of $\hat{n} = 1/\bar{y}$) exceeds $\epsilon$ is at most $\delta$.

Page 28: Leveraging Big Data:  Lecture 2

Maximum Likelihood Estimation: We have a set of independent samples $y_1, \dots, y_k$ from a distribution $f(y; \theta)$; we do not know the parameter $\theta$. The MLE $\hat{\theta}$ is the value of $\theta$ that maximizes the likelihood (joint density) function $f(y_1, \dots, y_k; \theta)$: the maximum over $\theta$ of the probability (density) of observing $y_1, \dots, y_k$.

Properties:
• A principled way of deriving estimators.
• Converges in probability to the true value (with enough i.i.d. samples)… but is generally biased.
• (Asymptotically!) optimal: minimizes the MSE (mean square error) and meets the Cramér-Rao lower bound.

Page 29: Leveraging Big Data:  Lecture 2

Estimating Distinct Count from a Min-Hash Sketch: k-mins MLE

Likelihood function for $n$ (joint density): $f(y; n) = \prod_{i=1}^{k} n e^{-n y_i} = n^k e^{-n \sum_i y_i}$.

Take a logarithm (does not change the maximizer): $\ell(y; n) = k \ln n - n \sum_i y_i$.

Differentiate to find the maximum: $\frac{\partial \ell}{\partial n} = \frac{k}{n} - \sum_i y_i = 0$.

MLE estimate: $\hat{n} = \frac{k}{\sum_{i=1}^{k} y_i}$.

Given $k$ independent samples from $\mathrm{Exp}(n)$, estimate $n$.

We get the same estimator as before; it depends on the sample only through the sum!

Page 30: Leveraging Big Data:  Lecture 2

We can think of several ways to combine these samples to decrease the variance:
• average (sum)
• median
• remove outliers and average the remaining, …

Given $k$ independent samples from $\mathrm{Exp}(n)$, estimate $n$.

We want to get the most value (the best estimate) from the information we have (the sketch). What combinations should we consider?

Page 31: Leveraging Big Data:  Lecture 2

Sufficient Statistic: A function $T(y_1, \dots, y_k)$ is a sufficient statistic for estimating (some function of) the parameter $\theta$ if the likelihood function has the factored form $f(y; \theta) = g(T(y), \theta)\, h(y)$.

Likelihood function (joint density) for $k$ i.i.d. random variables from $\mathrm{Exp}(n)$: $f(y; n) = n^k e^{-n \sum_i y_i}$. The sum $\sum_{i=1}^{k} y_i$ is a sufficient statistic for $n$.

Page 32: Leveraging Big Data:  Lecture 2

Sufficient Statistic: A function $T(y)$ is a sufficient statistic for estimating (some function of) the parameter $\theta$ if the likelihood function has the factored form $f(y; \theta) = g(T(y), \theta)\, h(y)$.

In particular:
• The MLE depends on the sample $y$ only through $T(y)$.
• The maximum with respect to $\theta$ does not depend on $h(y)$.
• The maximum of $g(T(y), \theta)$, computed by differentiating with respect to $\theta$, is a function of $T$.

Page 33: Leveraging Big Data:  Lecture 2

Sufficient Statistic: $T(y)$ is a sufficient statistic for $\theta$ if the likelihood function has the form $f(y; \theta) = g(T(y), \theta)\, h(y)$.

Lemma: The conditional distribution of $y$ given $T(y)$ does not depend on $\theta$.

If we fix $T(y) = t$, the density is proportional to $h(y)$ on $\{y : T(y) = t\}$, which does not involve $\theta$. If we know a density up to a fixed factor, it is determined completely by normalizing it to integrate to 1.

Page 34: Leveraging Big Data:  Lecture 2

Rao-Blackwell Theorem. Recap: $T(y)$ is a sufficient statistic for $\theta$; the conditional distribution of $y$ given $T(y)$ does not depend on $\theta$.

Rao-Blackwell Theorem: Given an estimator $\hat{\theta}(y)$ of $\theta$ that is not a function of the sufficient statistic, we can get an estimator with at most the same MSE that depends only on $T$:
$\hat{\theta}'(t) = \mathsf{E}\left[\hat{\theta}(y) \mid T(y) = t\right]$

This conditional expectation does not depend on $\theta$ (critical: otherwise it would not be an estimator). The process is called Rao-Blackwellization of $\hat{\theta}$.

Page 35: Leveraging Big Data:  Lecture 2

Rao-Blackwell Theorem

[Slide figure: sample points $(y_1, y_2)$: (1,3), (4,0), (3,1), (2,2), (1,2), (2,1), (3,0), (1,4), (3,2), shown with the joint density $f(y_1, y_2; \theta)$ for a given parameter $\theta$.]

Page 36: Leveraging Big Data:  Lecture 2

Rao-Blackwell Theorem

[Slide figure: the same points, now grouped by the value of the sufficient statistic $T$.]

Page 37: Leveraging Big Data:  Lecture 2

Rao-Blackwell Theorem

[Slide figure: the same points grouped by the sufficient statistic $T$ (continued).]

Page 38: Leveraging Big Data:  Lecture 2

Rao-Blackwell Theorem

[Slide figure: within each level set of the sufficient statistic $T = y_1 + y_2$, the conditional distribution of $(y_1, y_2)$ given $T$ does not depend on $\theta$.]

Page 39: Leveraging Big Data:  Lecture 2

Rao-Blackwell Theorem

[Slide figure: an estimator $\hat{\theta}(y_1, y_2)$ assigning a value to each sample point (the values shown are 3, 0, 2, 1, 0, 4, 2, 1, 2), alongside the sufficient statistic $T$.]

Page 40: Leveraging Big Data:  Lecture 2

Rao-Blackwell Theorem

[Slide figure: Rao-Blackwellization of $\hat{\theta}$: within each level set of $T$, the estimator values are replaced by their conditional average (the averaged values shown include 1, 1.5, 2 and 3).]

Page 41: Leveraging Big Data:  Lecture 2

Rao-Blackwell Theorem: from $\hat{\theta}(y_1, y_2)$ to an estimator that depends only on $T$.

Rao-Blackwell: $\hat{\theta}'(t) = \mathsf{E}\left[\hat{\theta}(y) \mid T(y) = t\right]$.

Law of total expectation: $\mathsf{E}[\hat{\theta}'] = \mathsf{E}\big[\mathsf{E}[\hat{\theta} \mid T]\big] = \mathsf{E}[\hat{\theta}]$.

The expectation (bias) remains the same; the MSE (mean square error) can only decrease.

Page 42: Leveraging Big Data:  Lecture 2

Why does the MSE decrease?

Suppose we have two points with equal probabilities, and an estimator of $\theta$ that gives estimates $a$ and $b$ on these points.

We replace it by an estimator that instead returns their average $\frac{a+b}{2}$ on both points.

The (scaled) contribution of these two points to the square error changes from
$(a - \theta)^2 + (b - \theta)^2$
to
$2\left(\frac{a+b}{2} - \theta\right)^2$.

Page 43: Leveraging Big Data:  Lecture 2

Why does the MSE decrease?

Exercise: Show that $(a - \theta)^2 + (b - \theta)^2 \ge 2\left(\frac{a+b}{2} - \theta\right)^2$.

Page 44: Leveraging Big Data:  Lecture 2

Sufficient statistic for estimating $n$ from k-mins sketches

The sum $\sum_{i=1}^{k} y_i$ is a sufficient statistic for estimating any function of $n$ (including $n$ and $1/n$). By Rao-Blackwell, we cannot gain by using estimators with a different dependence on the sample (e.g., functions of the median or of a partial sum).

Given $k$ independent samples from $\mathrm{Exp}(n)$, estimate $n$.

Page 45: Leveraging Big Data:  Lecture 2

Estimating Distinct Count from a Min-Hash Sketch: k-mins MLE

• $s = \sum_{i=1}^{k} y_i$, the sum of $k$ i.i.d. $\mathrm{Exp}(n)$ random variables, has the Gamma PDF $f(s) = \frac{n^k s^{k-1} e^{-n s}}{(k-1)!}$.

The expectation of the MLE estimate is $\mathsf{E}\!\left[\frac{k}{s}\right] = \frac{k}{k-1}\, n$, so the MLE is biased upward.

MLE estimate: $\hat{n} = \frac{k}{\sum_{i=1}^{k} y_i}$.

Page 46: Leveraging Big Data:  Lecture 2

Estimating Distinct Count from a Min-Hash Sketch: k-mins

The variance of the unbiased estimate $\hat{n} = \frac{k-1}{\sum_i y_i}$ is $\frac{n^2}{k-2}$.

The CV is $\frac{1}{\sqrt{k-2}}$. Is this the best we can do?

Unbiased estimator (for $k \ge 2$): $\hat{n} = \frac{k-1}{\sum_{i=1}^{k} y_i}$.
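A small simulation of the unbiased estimator (a sketch; instead of hashing a stream, the $k$ minima are drawn directly as Exp(n) samples), checking that its mean is about $n$ and its CV about $1/\sqrt{k-2}$:

```python
import math, random

def kmins_sketch_exp(n, k):
    # k i.i.d. Exp(n) values: the minima of a k-mins sketch over
    # n distinct elements with Exp(1) hash values.
    return [random.expovariate(n) for _ in range(k)]

n, k, trials = 1000, 66, 5000
est = [(k - 1) / sum(kmins_sketch_exp(n, k)) for _ in range(trials)]
mean = sum(est) / trials
sd = math.sqrt(sum((e - mean) ** 2 for e in est) / trials)
print(mean)                # about n (unbiased)
print(sd / mean)           # empirical CV, about 1/sqrt(k-2)
print(1 / math.sqrt(k - 2))
```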

Page 47: Leveraging Big Data:  Lecture 2

Cramér-Rao lower bound (CRLB)

Are we using the information in the sketch in the best possible way?

Page 48: Leveraging Big Data:  Lecture 2

Cramér-Rao lower bound (CRLB): an information-theoretic lower bound on the variance of any unbiased estimator of $\theta$.

Likelihood function: $f(y; \theta)$. Log likelihood: $\ell(y; \theta) = \ln f(y; \theta)$.

Fisher Information: $I(\theta) = -\mathsf{E}\!\left[\frac{\partial^2 \ell(y; \theta)}{\partial \theta^2}\right]$

CRLB: Any unbiased estimator $\hat{\theta}$ has $\mathrm{Var}[\hat{\theta}] \ge \frac{1}{I(\theta)}$.

Page 49: Leveraging Big Data:  Lecture 2

CRLB for estimating $n$

Likelihood function for $n$: $f(y; n) = n^k e^{-n \sum_i y_i}$. Log likelihood: $\ell = k \ln n - n \sum_i y_i$. Negated second derivative: $-\frac{\partial^2 \ell}{\partial n^2} = \frac{k}{n^2}$. Fisher information: $I(n) = \frac{k}{n^2}$. CRLB: $\mathrm{Var}[\hat{n}] \ge \frac{n^2}{k}$.

Page 50: Leveraging Big Data:  Lecture 2

Estimating Distinct Count from a Min-Hash Sketch: k-mins

Unbiased estimator (for $k \ge 2$): $\hat{n} = \frac{k-1}{\sum_{i=1}^{k} y_i}$.

Our estimator has CV $\frac{1}{\sqrt{k-2}}$. The Cramér-Rao lower bound on the CV is $\frac{1}{\sqrt{k}}$, so we are using the information in the sketch nearly optimally!

Page 51: Leveraging Big Data:  Lecture 2

Estimating Distinct Count from a Min-Hash Sketch: Bottom-k

Bottom-k sketch $y_1 < y_2 < \dots < y_k$: can we specify the distribution? Use the exponential distribution $\mathrm{Exp}(1)$ for the hash values.

What is the relation with k-mins sketches?

$y_1 \sim \mathrm{Exp}(n)$, same as k-mins. The minimum of the remaining $n-1$ elements is $\mathrm{Exp}(n-1)$; since the exponential distribution is memoryless, $y_2 - y_1 \sim \mathrm{Exp}(n-1)$. More generally, $y_{i+1} - y_i \sim \mathrm{Exp}(n-i)$, independently.

Page 52: Leveraging Big Data:  Lecture 2

Bottom-k versus k-mins sketches. Bottom-k sketch: the spacings $y_{i+1} - y_i$ are independent samples from $\mathrm{Exp}(n-i)$ for $i = 0, \dots, k-1$.

Bottom-k sketches carry strictly more information than k-mins sketches!

k-mins sketch: $k$ i.i.d. samples from $\mathrm{Exp}(n)$.

To obtain a k-mins sketch from a bottom-k sketch (without knowing $n$) we can apply a suitable randomized transformation of the values.

We can therefore use k-mins estimators with bottom-k sketches. We can do even better by taking the expectation over the random choices of the transformation.

Page 53: Leveraging Big Data:  Lecture 2

Estimating Distinct Count from a Min-Hash Sketch: Bottom-k

Likelihood function of the bottom-k sketch $y_1 < \dots < y_k$ (the $k$ smallest of $n$ i.i.d. $\mathrm{Exp}(1)$ values):
$f(y; n) = e^{-\sum_{i=1}^{k-1} y_i} \cdot \frac{n!}{(n-k)!}\, e^{-(n-k+1) y_k}$
The first factor does not depend on $n$; the second factor depends on $n$ only through $y_k$.

What does estimation theory tell us?

Page 54: Leveraging Big Data:  Lecture 2

Estimating Distinct Count from a Min-Hash Sketch: Bottom-k

Likelihood function: $f(y; n) = e^{-\sum_{i=1}^{k-1} y_i} \cdot \frac{n!}{(n-k)!}\, e^{-(n-k+1) y_k}$

$y_k$ (the maximum value in the sketch) is a sufficient statistic for estimating $n$ (or any function of $n$). It captures everything we can glean from the bottom-k sketch about $n$.

What does estimation theory tell us?

Page 55: Leveraging Big Data:  Lecture 2

Bottom-k: MLE for Distinct Count. The likelihood function (probability density) is
$f(y; n) = \frac{n!}{(n-k)!}\, e^{-\sum_{i=1}^{k-1} y_i}\, e^{-(n-k+1) y_k}$

Find the value of $n$ which maximizes $f(y; n)$: look only at the part that depends on $n$ and take the logarithm (same maximizer):

$\ell(y; n) = \sum_{i=0}^{k-1} \ln(n-i) - n\, y_k$

Page 56: Leveraging Big Data:  Lecture 2

Bottom-k: MLE for Distinct Count

$\ell(y; n) = \sum_{i=0}^{k-1} \ln(n-i) - n\, y_k$

We look for the $n$ which maximizes $\ell(y; n)$:

$\frac{\partial \ell(y; n)}{\partial n} = \sum_{i=0}^{k-1} \frac{1}{n-i} - y_k$

The MLE is the solution of: $\sum_{i=0}^{k-1} \frac{1}{n-i} = y_k$

Need to solve numerically
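One way to solve this numerically (a bisection sketch; the numbers in the example are hypothetical):

```python
def bottom_k_mle(y_k, k, hi=1e12):
    """Solve sum_{i=0}^{k-1} 1/(n - i) = y_k for n by bisection;
    the left-hand side is decreasing in n (treated as continuous, n > k - 1)."""
    def lhs(n):
        return sum(1.0 / (n - i) for i in range(k))
    lo = k - 1 + 1e-9
    for _ in range(200):
        mid = (lo + hi) / 2
        if lhs(mid) > y_k:
            lo = mid   # sum still too large: the solution is larger
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical numbers: k = 100 and y_k = 0.01 give an estimate near 10,000,
# matching the intuition n ~ k / y_k when n >> k.
print(round(bottom_k_mle(0.01, 100)))
```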

Page 57: Leveraging Big Data:  Lecture 2

Summary: k-mins count estimators

• k-mins sketch with $U[0,1]$ dist; with $\mathrm{Exp}(1)$ dist: $y_1, \dots, y_k$ i.i.d. $\mathrm{Exp}(n)$.
• Sufficient statistic for (any function of) $n$: $\sum_{i=1}^{k} y_i$.
• MLE / unbiased estimator for $1/n$: $\frac{1}{k}\sum_i y_i$; CV: $1/\sqrt{k}$; CRLB: matched.
• MLE for $n$: $k / \sum_i y_i$. Unbiased estimator for $n$: $(k-1) / \sum_i y_i$; CV: $1/\sqrt{k-2}$; CRLB on the CV: $1/\sqrt{k}$.

Page 58: Leveraging Big Data:  Lecture 2

Summary: bottom-k count estimators

• Bottom-k sketch with $U[0,1]$ dist; with $\mathrm{Exp}(1)$ dist: spacings $y_{i+1} - y_i \sim \mathrm{Exp}(n-i)$.
• Sufficient statistic for (any function of) $n$: $y_k$, the largest value in the sketch.
• Contains strictly more information than k-mins.
• When $n \gg k$, approximately the same as k-mins.
• MLE for $n$ is the solution of: $\sum_{i=0}^{k-1} \frac{1}{n-i} = y_k$

Page 59: Leveraging Big Data:  Lecture 2

Bibliography

• See lecture 3.
• We will continue with Min-Hash sketches:
• Use as random samples
• Applications to similarity
• Inverse-probability based distinct count estimators