Leveraging Big Data: Lecture 2
Instructors: Edith Cohen, Amos Fiat, Haim Kaplan, Tova Milo
http://www.cohenwang.com/edith/bigdataclass2013


TRANSCRIPT

Page 1: Leveraging Big Data:  Lecture 2

Leveraging Big Data: Lecture 2

Instructors:

http://www.cohenwang.com/edith/bigdataclass2013

Edith Cohen, Amos Fiat, Haim Kaplan, Tova Milo

Page 2: Leveraging Big Data:  Lecture 2

Counting Distinct Elements

Elements occur multiple times; we want to count the number of distinct elements.

The number of distinct elements is $n$ ($n = 6$ in the example). The total number of elements is 11 in this example.

32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4,

Exact counting of distinct elements requires a structure of size $\Omega(n)$. We are happy with an approximate count that uses a small working memory.

Page 3: Leveraging Big Data:  Lecture 2

Distinct Elements: Approximate Counting
32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4

We want to be able to compute and maintain a small sketch of the set $N$ of distinct items seen so far.

Page 4: Leveraging Big Data:  Lecture 2

Distinct Elements: Approximate Counting

Requirements from the sketch $s(N)$:
• Size of the sketch $\ll n$.
• We can query the sketch to get a good estimate $\hat{n}$ of $n$ (small relative error).
• For a new element $x$: $s(N \cup \{x\})$ is easy to compute from $s(N)$ and $x$ (for data stream computation).
• If $N_1$ and $N_2$ are (possibly overlapping) sets, then we can compute the union sketch from their sketches: $s(N_1 \cup N_2)$ from $s(N_1)$ and $s(N_2)$ (for distributed computation).

Page 5: Leveraging Big Data:  Lecture 2

Distinct Elements: Approximate Counting

Size estimation / minimum value technique [Flajolet-Martin 85, Cohen 94]

32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4,

$h$ is a random hash function from element IDs to uniform random numbers in $[0,1]$.

Maintain the min-hash value $y$: Initialize $y \leftarrow 1$. Processing an element $x$: $y \leftarrow \min\{y, h(x)\}$.
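A minimal sketch of this update rule in Python (not from the slides; the helper `hash01` is a hypothetical stand-in for the random hash function $h$):

```python
import hashlib

def hash01(x):
    # Deterministic stand-in for a random hash: map an element ID to a
    # pseudo-uniform value in [0, 1) via the first 8 bytes of its MD5 digest.
    digest = hashlib.md5(str(x).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def min_hash(stream):
    y = 1.0                    # initialize y to the largest possible value
    for x in stream:
        y = min(y, hash01(x))  # repeated elements never change y
    return y

print(min_hash([32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]))
```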

Page 6: Leveraging Big Data:  Lecture 2

Distinct Elements: Approximate Counting

[Slide table: the stream 32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4 with the hash value $h(x)$ of each element, the running minimum $y$, and the number of distinct elements $n$ seen so far. The running minimum after each arrival is 0.45, 0.35, 0.35, 0.35, 0.21, 0.21, 0.21, 0.21, 0.14, 0.14, 0.14, while $n$ grows 1, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6.]

The minimum hash value $y = \min_{x \in N} h(x)$ is: unaffected by repeated elements; non-increasing with the number of distinct elements $n$.

Page 7: Leveraging Big Data:  Lecture 2

Distinct Elements: Approximate Counting

How does the minimum hash value give information on the number of distinct elements $n$?

The $n$ distinct hash values are independent and uniform on $[0,1]$. The expectation of the minimum is $\mathsf{E}[y] = \frac{1}{n+1}$.

A single value gives only limited information. To boost the information, we maintain $k$ values.

Page 8: Leveraging Big Data:  Lecture 2

Why is the expectation $\frac{1}{n+1}$?

Take a circle of circumference 1. Throw a random red point to "mark" the start of the segment (circle points map to $[0,1)$). Throw $n$ more points independently at random: these are the hash values of the $n$ distinct elements. The circle is cut into $n+1$ segments by these points. The expected length of each segment is $\frac{1}{n+1}$. The same holds for the segment clockwise from the red point, whose length is distributed exactly like the minimum hash value.
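A quick Monte Carlo check of this fact (a sketch, not part of the lecture): the average minimum of $n$ i.i.d. uniform values concentrates near $1/(n+1)$.

```python
import random

def mean_min_of_uniforms(n, trials=100_000):
    # Average over many trials of the minimum of n i.i.d. U[0,1] values.
    return sum(min(random.random() for _ in range(n))
               for _ in range(trials)) / trials

n = 6
print(mean_min_of_uniforms(n))  # close to 1/(n+1) = 0.1428...
print(1 / (n + 1))
```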

Page 9: Leveraging Big Data:  Lecture 2

Min-Hash Sketches

k-mins sketch: Use $k$ "independent" hash functions $h_1, \dots, h_k$. Track the respective minimum $y_i = \min_{x \in N} h_i(x)$ for each function.

Bottom-k sketch: Use a single hash function $h$. Track the $k$ smallest hash values.

k-partition sketch: Use a single hash function. Use the first $\log_2 k$ bits of the hash to map an element uniformly to one of $k$ parts. Call the remaining bits $h(x)$. For $i = 1, \dots, k$: track the minimum hash value $y_i$ of the elements in part $i$.

These sketches maintain $k$ values from the range of the hash function (distribution).

All sketches are the same for $k = 1$.
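The following Python sketch illustrates the three variants on the running example (assumptions: `h01` is a hypothetical seeded hash standing in for the "independent" hash functions, and the k-partition variant uses two seeded hashes in place of splitting the bits of a single hash value):

```python
import hashlib

def h01(x, seed=0):
    # Pseudo-random hash of (seed, x) into [0, 1).
    d = hashlib.md5(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def k_mins_sketch(elements, k):
    # k "independent" hash functions; track the minimum for each one.
    return [min(h01(x, i) for x in elements) for i in range(k)]

def bottom_k_sketch(elements, k):
    # A single hash function; track the k smallest distinct hash values.
    return sorted({h01(x) for x in elements})[:k]

def k_partition_sketch(elements, k):
    # Each element is hashed once to a part and once to a value;
    # track the minimum value within each part (None = empty part).
    mins = [None] * k
    for x in elements:
        part = int(h01(x, "part") * k)
        value = h01(x, "value")
        if mins[part] is None or value < mins[part]:
            mins[part] = value
    return mins

elements = [32, 12, 14, 7, 6, 4]
print(k_mins_sketch(elements, 3))
print(bottom_k_sketch(elements, 3))
print(k_partition_sketch(elements, 3))
```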

Page 10: Leveraging Big Data:  Lecture 2

Min-Hash Sketches: k-mins, bottom-k, k-partition

Why study all 3 variants? Different tradeoffs between update cost, accuracy, usage…

Beyond distinct counting, Min-Hash sketches correspond to sampling schemes of large data sets:
• Similarity queries between datasets
• Selectivity/subset queries

These patterns generally apply as methods to gather increased confidence from a random “projection”/sample.

Page 11: Leveraging Big Data:  Lecture 2

Min-Hash Sketches: Examples (k-mins, k-partition, bottom-k)

The min-hash value and sketches only depend on:
• The random hash function/s
• The set of distinct elements

Not on the order in which elements appear or on their multiplicity.

$N = \{32, 12, 14, 7, 6, 4\}$

Page 12: Leveraging Big Data:  Lecture 2

Min-Hash Sketches: Example, k-mins

x:       32    12    14    7     6     4
h1(x):   0.92  0.45  0.74  0.35  0.21  0.14
h2(x):   0.20  0.19  0.07  0.51  0.70  0.55
h3(x):   0.18  0.10  0.93  0.71  0.50  0.89

$(y_1, y_2, y_3) = (0.14, 0.07, 0.10)$

Page 13: Leveraging Big Data:  Lecture 2

Min-Hash Sketches: k-mins

k-mins sketch: Use $k$ "independent" hash functions $h_1, \dots, h_k$. Track the respective minimum $y_i$ for each function.

Processing a new element $x$: For $i = 1, \dots, k$: $y_i \leftarrow \min\{y_i, h_i(x)\}$.

Computation: $O(k)$, whether the sketch is actually updated or not.

Example: $h_1(x) = 0.35$, $h_2(x) = 0.51$, $h_3(x) = 0.71$.

Page 14: Leveraging Big Data:  Lecture 2

Min-Hash Sketches: Example, k-partition

x:                  32    12    14    7     6     4
i(x) (part-hash):   3     2     1     3     1     2
h(x) (value-hash):  0.20  0.19  0.07  0.51  0.70  0.55

Per-part minima: $(y_1, y_2, y_3) = (0.07, 0.19, 0.20)$

Page 15: Leveraging Big Data:  Lecture 2

Min-Hash Sketches: k-partition

Processing a new element $x$: compute the part $i(x)$ and update $y_{i(x)} \leftarrow \min\{y_{i(x)}, h(x)\}$.

Computation: $O(1)$ to test or update.

k-partition sketch: Use a single hash function. Use the first $\log_2 k$ bits of the hash to map an element uniformly to one of $k$ parts. Call the remaining bits $h(x)$. For $i = 1, \dots, k$: track the minimum hash value of the elements in part $i$.

Page 16: Leveraging Big Data:  Lecture 2

Min-Hash Sketches: Example, bottom-k

x:      32    12    14    7     6     4
h(x):   0.20  0.19  0.07  0.51  0.70  0.55

The three smallest values: $(y_1, y_2, y_3) = (0.07, 0.19, 0.20)$

Page 17: Leveraging Big Data:  Lecture 2

Min-Hash Sketches: bottom-k

Processing a new element $x$: if $h(x) < y_k$ (the largest value in the sketch), insert $h(x)$ and discard $y_k$. Computation: the sketch is maintained as a sorted list or as a priority queue. $O(1)$ to test if an update is needed; $O(k)$ to update a sorted list; $O(\log k)$ to update a priority queue.

Bottom-k sketch: Use a single hash function $h$. Track the $k$ smallest hash values.

We will see that the expected number of changes is much smaller than the number of distinct elements.
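A possible priority-queue implementation of this update in Python (a sketch using `heapq` with negated values so that the largest kept value $y_k$ sits at the root; it assumes distinct elements receive distinct hash values):

```python
import heapq

class BottomK:
    """Bottom-k sketch: keep the k smallest hash values seen so far."""

    def __init__(self, k):
        self.k = k
        self.neg = []  # max-heap via negation: -self.neg[0] is y_k

    def process(self, hx):
        if len(self.neg) < self.k:
            heapq.heappush(self.neg, -hx)      # still filling: O(log k)
        elif hx < -self.neg[0]:                # O(1) test against y_k
            heapq.heapreplace(self.neg, -hx)   # O(log k) update
        # otherwise the sketch is unchanged

    def values(self):
        return sorted(-v for v in self.neg)

sk = BottomK(3)
for h in [0.20, 0.19, 0.07, 0.51, 0.70, 0.55]:
    sk.process(h)
print(sk.values())   # [0.07, 0.19, 0.20]
```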

Page 18: Leveraging Big Data:  Lecture 2

Min-Hash Sketches: Number of updates

Claim: The expected number of actual updates (changes) of the min-hash sketch is $O(k \ln n)$.

Proof: First consider $k = 1$. Look at the distinct elements in the order they first occur. The $i$th distinct element has a lower hash value than the current minimum with probability $1/i$: this is the probability of being first in a random permutation of $i$ elements. The total expected number of updates is $\sum_{i=1}^{n} 1/i = H_n \approx \ln n$.

Stream:        32   12   14   32   7    12   32   7    6    12   4
Update prob.:  1    1/2  1/3  0    1/4  0    0    0    1/5  0    1/6

Page 19: Leveraging Big Data:  Lecture 2

Min-Hash Sketches: Number of updates

Claim: The expected number of actual updates (changes) of the min-hash sketch is $O(k \ln n)$.

Proof (continued): Recap for $k = 1$ (a single min-hash value): the $i$th distinct element causes an update with probability $1/i$; the expected total is $\sum_{i=1}^{n} 1/i \approx \ln n$.

k-mins: $k$ min-hash values; apply the argument $k$ times to get $\approx k \ln n$.

Bottom-k: We keep the $k$ smallest values, so the update probability of the $i$th distinct element is $\min\{1, k/i\}$ (the probability of being among the first $k$ in a random permutation of $i$ elements); the expected total is $\approx k(1 + \ln(n/k))$.

k-partition: $k$ min-hash values, each over roughly $n/k$ distinct elements, giving $\approx k \ln(n/k)$.
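A small simulation consistent with the $k = 1$ case (a sketch, not from the slides): the empirical number of changes over $n$ distinct elements is close to the harmonic number $H_n \approx \ln n$.

```python
import random

def avg_updates(n, trials=200):
    # Average number of times the single min-hash value changes while
    # processing n distinct elements (repeats never cause changes).
    total = 0
    for _ in range(trials):
        y = 1.0
        for _ in range(n):
            h = random.random()
            if h < y:
                y = h
                total += 1
    return total / trials

n = 10_000
print(avg_updates(n))
print(sum(1 / i for i in range(1, n + 1)))  # H_n, about ln(n) + 0.577
```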

Page 20: Leveraging Big Data:  Lecture 2

Merging Min-Hash Sketches

The union sketch $s(N_1 \cup N_2)$ from the sketches of two sets $N_1$, $N_2$:
• k-mins: take the minimum per hash function.
• k-partition: take the minimum per part.
• Bottom-k: the $k$ smallest values in the union of the data must be among the $k$ smallest of their own set, so take the $k$ smallest values in $s(N_1) \cup s(N_2)$.

!! We apply the same set of hash functions to all elements / data sets / streams.
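Hedged Python sketches of the three merge rules (they assume both sketches were built with the same hash functions, as stressed above, and that k-partition sketches use `None` for an empty part):

```python
def merge_k_mins(s1, s2):
    # Coordinate-wise minimum, one entry per hash function.
    return [min(a, b) for a, b in zip(s1, s2)]

def merge_k_partition(s1, s2):
    # Minimum per part; None marks a part that is still empty.
    return [min((v for v in (a, b) if v is not None), default=None)
            for a, b in zip(s1, s2)]

def merge_bottom_k(s1, s2, k):
    # The k smallest values of the union must appear in one of the sketches.
    return sorted(set(s1) | set(s2))[:k]
```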

Page 21: Leveraging Big Data:  Lecture 2

Using Min-Hash Sketches

Recap: We defined Min-Hash sketches (3 types), adding elements, merging Min-Hash sketches, and some properties of these sketches.

Next: We put Min-Hash sketches to work:
• Estimating the distinct count from a Min-Hash sketch
• Tools from estimation theory

Page 22: Leveraging Big Data:  Lecture 2

The Exponential Distribution $\mathrm{Exp}(\lambda)$: PDF $f(y) = \lambda e^{-\lambda y}$; CDF $F(y) = 1 - e^{-\lambda y}$. Very useful properties:

Memorylessness: $\Pr[y > s + t \mid y > t] = \Pr[y > s]$.

Min-to-Sum conversion: the minimum of independent exponential random variables is exponential with rate equal to the sum of the rates; in particular, the minimum of $n$ i.i.d. $\mathrm{Exp}(1)$ values is $\mathrm{Exp}(n)$.

Relation with uniform: if $u \sim U[0,1]$ then $-\ln u \sim \mathrm{Exp}(1)$, so a uniform hash can be converted to an exponential one.
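A quick numerical illustration of the last two properties (a sketch, not from the slides): transforming uniform values by $-\ln(\cdot)$ gives Exp(1) values, and the minimum of $n$ of them averages to about $1/n$.

```python
import math, random

n, trials = 6, 200_000
mins = [min(-math.log(1.0 - random.random()) for _ in range(n))
        for _ in range(trials)]
print(sum(mins) / trials)  # close to 1/n, the mean of Exp(n)
print(1 / n)
```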

Page 23: Leveraging Big Data:  Lecture 2

Estimating Distinct Count from a Min-Hash Sketch: k-mins

• Change the hash range to the exponential distribution: $h(x) \sim \mathrm{Exp}(1)$.
• Using the Min-to-Sum property, each min-hash value satisfies $y_i \sim \mathrm{Exp}(n)$.
– In fact, we can keep uniform hash values and apply $-\ln(\cdot)$ only when estimating.
• The number of distinct elements becomes a parameter estimation problem:

Given $k$ independent samples $y_1, \dots, y_k$ from $\mathrm{Exp}(n)$, estimate $n$.

Page 24: Leveraging Big Data:  Lecture 2

Estimating Distinct Count from a Min-Hash Sketch: k-mins

Each $y_i \sim \mathrm{Exp}(n)$ has expectation $1/n$ and variance $1/n^2$. The average $\bar{y} = \frac{1}{k}\sum_{i=1}^{k} y_i$ has expectation $1/n$ and variance $1/(k n^2)$. The CV (coefficient of variation, standard deviation over mean) is $1/\sqrt{k}$.

$\bar{y}$ is a good unbiased estimator for $1/n$. But $1/n$ is the inverse of what we want. What about estimating $n$?

Page 25: Leveraging Big Data:  Lecture 2

Estimating Distinct Count from a Min-Hash Sketch: k-mins

What about estimating $n$?

1) We can use the biased estimator $\hat{n} = 1/\bar{y} = k / \sum_i y_i$.

To say something useful about the estimate quality, we apply Chebyshev's inequality to bound the probability that $\bar{y}$ is far from its expectation $1/n$, and thus that $\hat{n}$ is far from $n$.

2) Maximum Likelihood Estimation (a general and powerful technique).

Page 26: Leveraging Big Data:  Lecture 2

Chebyshev's Inequality: For any random variable $X$ with expectation $\mu$ and standard deviation $\sigma$, for any $c > 0$:
$\Pr[\,|X - \mu| \ge c\,\sigma\,] \le \frac{1}{c^2}$

For $X = \bar{y}$: $\mu = 1/n$ and $\sigma = \frac{1}{n\sqrt{k}}$.

Using $c = \epsilon \sqrt{k}$: $\Pr\!\left[\,|\bar{y} - 1/n| \ge \frac{\epsilon}{n}\,\right] \le \frac{1}{k\,\epsilon^2}$.

Page 27: Leveraging Big Data:  Lecture 2

Using Chebyshev's Inequality

For example, for $k = \frac{1}{\epsilon^2 \delta}$, the probability that the relative error of $\bar{y}$ (and hence, up to lower-order terms, of $\hat{n} = 1/\bar{y}$) exceeds $\epsilon$ is at most $\delta$.

Page 28: Leveraging Big Data:  Lecture 2

Maximum Likelihood Estimation: We have a set of independent samples $y_1, \dots, y_k$ from a distribution $f(y; \theta)$; we do not know the parameter $\theta$. The MLE $\hat{\theta}$ is the value of $\theta$ that maximizes the likelihood (joint density) function $f(y_1, \dots, y_k; \theta)$: the maximum over $\theta$ of the probability (density) of observing $y_1, \dots, y_k$.

Properties:
• A principled way of deriving estimators.
• Converges in probability to the true value (with enough i.i.d. samples)… but is generally biased.
• (Asymptotically!) optimal: minimizes the MSE (mean square error) and meets the Cramér-Rao lower bound.

Page 29: Leveraging Big Data:  Lecture 2

Estimating Distinct Count from a Min-Hash Sketch: k-mins MLE

Likelihood function for $n$ (joint density): $f(y; n) = \prod_{i=1}^{k} n e^{-n y_i} = n^k e^{-n \sum_i y_i}$.

Take a logarithm (does not change the maximizer): $\ell(y; n) = k \ln n - n \sum_i y_i$.

Differentiate to find the maximum: $\frac{\partial \ell}{\partial n} = \frac{k}{n} - \sum_i y_i = 0$.

MLE estimate: $\hat{n} = \frac{k}{\sum_{i=1}^{k} y_i}$.

Given $k$ independent samples from $\mathrm{Exp}(n)$, estimate $n$.

We get the same estimator as before; it depends on the sample only through the sum!

Page 30: Leveraging Big Data:  Lecture 2

We can think of several ways to combine these samples to decrease the variance:
• average (sum)
• median
• remove outliers and average the remaining, …

Given $k$ independent samples from $\mathrm{Exp}(n)$, estimate $n$.

We want to get the most value (the best estimate) from the information we have (the sketch). What combinations should we consider?

Page 31: Leveraging Big Data:  Lecture 2

Sufficient Statistic: A function $T(y_1, \dots, y_k)$ is a sufficient statistic for estimating (some function of) the parameter $\theta$ if the likelihood function has the factored form $f(y; \theta) = g(T(y), \theta)\, h(y)$.

Likelihood function (joint density) for $k$ i.i.d. random variables from $\mathrm{Exp}(n)$: $f(y; n) = n^k e^{-n \sum_i y_i}$. The sum $\sum_{i=1}^{k} y_i$ is a sufficient statistic for $n$.

Page 32: Leveraging Big Data:  Lecture 2

Sufficient Statistic: A function $T(y)$ is a sufficient statistic for estimating (some function of) the parameter $\theta$ if the likelihood function has the factored form $f(y; \theta) = g(T(y), \theta)\, h(y)$.

In particular:
• The MLE depends on the sample $y$ only through $T(y)$.
• The maximum with respect to $\theta$ does not depend on $h(y)$.
• The maximum of $g(T(y), \theta)$, computed by differentiating with respect to $\theta$, is a function of $T$.

Page 33: Leveraging Big Data:  Lecture 2

Sufficient Statistic: $T(y)$ is a sufficient statistic for $\theta$ if the likelihood function has the form $f(y; \theta) = g(T(y), \theta)\, h(y)$.

Lemma: The conditional distribution of $y$ given $T(y)$ does not depend on $\theta$.

If we fix $T(y) = t$, the density is proportional to $h(y)$ on $\{y : T(y) = t\}$, which does not involve $\theta$. If we know a density up to a fixed factor, it is determined completely by normalizing it to integrate to 1.

Page 34: Leveraging Big Data:  Lecture 2

Rao-Blackwell Theorem. Recap: $T(y)$ is a sufficient statistic for $\theta$; the conditional distribution of $y$ given $T(y)$ does not depend on $\theta$.

Rao-Blackwell Theorem: Given an estimator $\hat{\theta}(y)$ of $\theta$ that is not a function of the sufficient statistic, we can get an estimator with at most the same MSE that depends only on $T$:
$\hat{\theta}'(t) = \mathsf{E}\left[\hat{\theta}(y) \mid T(y) = t\right]$

This conditional expectation does not depend on $\theta$ (critical: otherwise it would not be an estimator). The process is called Rao-Blackwellization of $\hat{\theta}$.

Page 35: Leveraging Big Data:  Lecture 2

Rao-Blackwell Theorem

[Slide figure: sample points $(y_1, y_2)$: (1,3), (4,0), (3,1), (2,2), (1,2), (2,1), (3,0), (1,4), (3,2), shown with the joint density $f(y_1, y_2; \theta)$ for a given parameter $\theta$.]

Page 36: Leveraging Big Data:  Lecture 2

Rao-Blackwell Theorem

[Slide figure: the same points, now grouped by the value of the sufficient statistic $T$.]

Page 37: Leveraging Big Data:  Lecture 2

Rao-Blackwell Theorem

[Slide figure: the same points grouped by the sufficient statistic $T$ (continued).]

Page 38: Leveraging Big Data:  Lecture 2

Rao-Blackwell Theorem

[Slide figure: within each level set of the sufficient statistic $T = y_1 + y_2$, the conditional distribution of $(y_1, y_2)$ given $T$ does not depend on $\theta$.]

Page 39: Leveraging Big Data:  Lecture 2

Rao-Blackwell Theorem

[Slide figure: an estimator $\hat{\theta}(y_1, y_2)$ assigning a value to each sample point (the values shown are 3, 0, 2, 1, 0, 4, 2, 1, 2), alongside the sufficient statistic $T$.]

Page 40: Leveraging Big Data:  Lecture 2

Rao-Blackwell Theorem

[Slide figure: Rao-Blackwellization of $\hat{\theta}$: within each level set of $T$, the estimator values are replaced by their conditional average (the averaged values shown include 1, 1.5, 2 and 3).]

Page 41: Leveraging Big Data:  Lecture 2

Rao-Blackwell Theorem: from $\hat{\theta}(y_1, y_2)$ to an estimator that depends only on $T$.

Rao-Blackwell: $\hat{\theta}'(t) = \mathsf{E}\left[\hat{\theta}(y) \mid T(y) = t\right]$.

Law of total expectation: $\mathsf{E}[\hat{\theta}'] = \mathsf{E}\big[\mathsf{E}[\hat{\theta} \mid T]\big] = \mathsf{E}[\hat{\theta}]$.

The expectation (bias) remains the same; the MSE (mean square error) can only decrease.

Page 42: Leveraging Big Data:  Lecture 2

Why does the MSE decrease?

Suppose we have two points with equal probabilities, and an estimator of $\theta$ that gives estimates $a$ and $b$ on these points.

We replace it by an estimator that instead returns their average $\frac{a+b}{2}$ on both points.

The (scaled) contribution of these two points to the square error changes from
$(a - \theta)^2 + (b - \theta)^2$
to
$2\left(\frac{a+b}{2} - \theta\right)^2$.

Page 43: Leveraging Big Data:  Lecture 2

Why does the MSE decrease?

Exercise: Show that $(a - \theta)^2 + (b - \theta)^2 \ge 2\left(\frac{a+b}{2} - \theta\right)^2$.

Page 44: Leveraging Big Data:  Lecture 2

Sufficient statistic for estimating $n$ from k-mins sketches

The sum $\sum_{i=1}^{k} y_i$ is a sufficient statistic for estimating any function of $n$ (including $n$ and $1/n$). By Rao-Blackwell, we cannot gain by using estimators with a different dependence on the sample (e.g., functions of the median or of a partial sum).

Given $k$ independent samples from $\mathrm{Exp}(n)$, estimate $n$.

Page 45: Leveraging Big Data:  Lecture 2

Estimating Distinct Count from a Min-Hash Sketch: k-mins MLE

• $s = \sum_{i=1}^{k} y_i$, the sum of $k$ i.i.d. $\mathrm{Exp}(n)$ random variables, has the Gamma PDF $f(s) = \frac{n^k s^{k-1} e^{-n s}}{(k-1)!}$.

The expectation of the MLE estimate is $\mathsf{E}\!\left[\frac{k}{s}\right] = \frac{k}{k-1}\, n$, so the MLE is biased upward.

MLE estimate: $\hat{n} = \frac{k}{\sum_{i=1}^{k} y_i}$.

Page 46: Leveraging Big Data:  Lecture 2

Estimating Distinct Count from a Min-Hash Sketch: k-mins

The variance of the unbiased estimate $\hat{n} = \frac{k-1}{\sum_i y_i}$ is $\frac{n^2}{k-2}$.

The CV is $\frac{1}{\sqrt{k-2}}$. Is this the best we can do?

Unbiased estimator (for $k \ge 2$): $\hat{n} = \frac{k-1}{\sum_{i=1}^{k} y_i}$.
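A small simulation of the unbiased estimator (a sketch; instead of hashing a stream, the $k$ minima are drawn directly as Exp(n) samples), checking that its mean is about $n$ and its CV about $1/\sqrt{k-2}$:

```python
import math, random

def kmins_sketch_exp(n, k):
    # k i.i.d. Exp(n) values: the minima of a k-mins sketch over
    # n distinct elements with Exp(1) hash values.
    return [random.expovariate(n) for _ in range(k)]

n, k, trials = 1000, 66, 5000
est = [(k - 1) / sum(kmins_sketch_exp(n, k)) for _ in range(trials)]
mean = sum(est) / trials
sd = math.sqrt(sum((e - mean) ** 2 for e in est) / trials)
print(mean)                # about n (unbiased)
print(sd / mean)           # empirical CV, about 1/sqrt(k-2)
print(1 / math.sqrt(k - 2))
```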

Page 47: Leveraging Big Data:  Lecture 2

Cramér-Rao lower bound (CRLB)

Are we using the information in the sketch in the best possible way?

Page 48: Leveraging Big Data:  Lecture 2

Cramér-Rao lower bound (CRLB): an information-theoretic lower bound on the variance of any unbiased estimator of $\theta$.

Likelihood function: $f(y; \theta)$. Log likelihood: $\ell(y; \theta) = \ln f(y; \theta)$.

Fisher Information: $I(\theta) = -\mathsf{E}\!\left[\frac{\partial^2 \ell(y; \theta)}{\partial \theta^2}\right]$

CRLB: Any unbiased estimator $\hat{\theta}$ has $\mathrm{Var}[\hat{\theta}] \ge \frac{1}{I(\theta)}$.

Page 49: Leveraging Big Data:  Lecture 2

CRLB for estimating $n$

Likelihood function for $n$: $f(y; n) = n^k e^{-n \sum_i y_i}$. Log likelihood: $\ell = k \ln n - n \sum_i y_i$. Negated second derivative: $-\frac{\partial^2 \ell}{\partial n^2} = \frac{k}{n^2}$. Fisher information: $I(n) = \frac{k}{n^2}$. CRLB: $\mathrm{Var}[\hat{n}] \ge \frac{n^2}{k}$.

Page 50: Leveraging Big Data:  Lecture 2

Estimating Distinct Count from a Min-Hash Sketch: k-mins

Unbiased estimator (for $k \ge 2$): $\hat{n} = \frac{k-1}{\sum_{i=1}^{k} y_i}$.

Our estimator has CV $\frac{1}{\sqrt{k-2}}$. The Cramér-Rao lower bound on the CV is $\frac{1}{\sqrt{k}}$, so we are using the information in the sketch nearly optimally!

Page 51: Leveraging Big Data:  Lecture 2

Estimating Distinct Count from a Min-Hash Sketch: Bottom-k

Bottom-k sketch $y_1 < y_2 < \dots < y_k$: can we specify the distribution? Use the exponential distribution $\mathrm{Exp}(1)$ for the hash values.

What is the relation with k-mins sketches?

$y_1 \sim \mathrm{Exp}(n)$, same as k-mins. The minimum of the remaining $n-1$ elements is $\mathrm{Exp}(n-1)$; since the exponential distribution is memoryless, $y_2 - y_1 \sim \mathrm{Exp}(n-1)$. More generally, $y_{i+1} - y_i \sim \mathrm{Exp}(n-i)$, independently.

Page 52: Leveraging Big Data:  Lecture 2

Bottom-k versus k-mins sketches. Bottom-k sketch: the spacings $y_{i+1} - y_i$ are independent samples from $\mathrm{Exp}(n-i)$ for $i = 0, \dots, k-1$.

Bottom-k sketches carry strictly more information than k-mins sketches!

k-mins sketch: $k$ i.i.d. samples from $\mathrm{Exp}(n)$.

To obtain a k-mins sketch from a bottom-k sketch (without knowing $n$) we can apply a suitable randomized transformation of the values.

We can therefore use k-mins estimators with bottom-k sketches. We can do even better by taking the expectation over the random choices of the transformation.

Page 53: Leveraging Big Data:  Lecture 2

Estimating Distinct Count from a Min-Hash Sketch: Bottom-k

Likelihood function of the bottom-k sketch $y_1 < \dots < y_k$ (the $k$ smallest of $n$ i.i.d. $\mathrm{Exp}(1)$ values):
$f(y; n) = e^{-\sum_{i=1}^{k-1} y_i} \cdot \frac{n!}{(n-k)!}\, e^{-(n-k+1) y_k}$
The first factor does not depend on $n$; the second factor depends on $n$ only through $y_k$.

What does estimation theory tell us?

Page 54: Leveraging Big Data:  Lecture 2

Estimating Distinct Count from a Min-Hash Sketch: Bottom-k

Likelihood function: $f(y; n) = e^{-\sum_{i=1}^{k-1} y_i} \cdot \frac{n!}{(n-k)!}\, e^{-(n-k+1) y_k}$

$y_k$ (the maximum value in the sketch) is a sufficient statistic for estimating $n$ (or any function of $n$). It captures everything we can glean from the bottom-k sketch about $n$.

What does estimation theory tell us?

Page 55: Leveraging Big Data:  Lecture 2

Bottom-k: MLE for Distinct Count. The likelihood function (probability density) is
$f(y; n) = \frac{n!}{(n-k)!}\, e^{-\sum_{i=1}^{k-1} y_i}\, e^{-(n-k+1) y_k}$

Find the value of $n$ which maximizes $f(y; n)$: look only at the part that depends on $n$ and take the logarithm (same maximizer):

$\ell(y; n) = \sum_{i=0}^{k-1} \ln(n-i) - n\, y_k$

Page 56: Leveraging Big Data:  Lecture 2

Bottom-k: MLE for Distinct Count

$\ell(y; n) = \sum_{i=0}^{k-1} \ln(n-i) - n\, y_k$

We look for the $n$ which maximizes $\ell(y; n)$:

$\frac{\partial \ell(y; n)}{\partial n} = \sum_{i=0}^{k-1} \frac{1}{n-i} - y_k$

The MLE is the solution of: $\sum_{i=0}^{k-1} \frac{1}{n-i} = y_k$

Need to solve numerically
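One way to solve this numerically (a bisection sketch; the numbers in the example are hypothetical):

```python
def bottom_k_mle(y_k, k, hi=1e12):
    """Solve sum_{i=0}^{k-1} 1/(n - i) = y_k for n by bisection;
    the left-hand side is decreasing in n (treated as continuous, n > k - 1)."""
    def lhs(n):
        return sum(1.0 / (n - i) for i in range(k))
    lo = k - 1 + 1e-9
    for _ in range(200):
        mid = (lo + hi) / 2
        if lhs(mid) > y_k:
            lo = mid   # sum still too large: the solution is larger
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical numbers: k = 100 and y_k = 0.01 give an estimate near 10,000,
# matching the intuition n ~ k / y_k when n >> k.
print(round(bottom_k_mle(0.01, 100)))
```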

Page 57: Leveraging Big Data:  Lecture 2

Summary: k-mins count estimators

• k-mins sketch with $U[0,1]$ dist; with $\mathrm{Exp}(1)$ dist: $y_1, \dots, y_k$ i.i.d. $\mathrm{Exp}(n)$.
• Sufficient statistic for (any function of) $n$: $\sum_{i=1}^{k} y_i$.
• MLE / unbiased estimator for $1/n$: $\frac{1}{k}\sum_i y_i$; CV: $1/\sqrt{k}$; CRLB: matched.
• MLE for $n$: $k / \sum_i y_i$. Unbiased estimator for $n$: $(k-1) / \sum_i y_i$; CV: $1/\sqrt{k-2}$; CRLB on the CV: $1/\sqrt{k}$.

Page 58: Leveraging Big Data:  Lecture 2

Summary: bottom-k count estimators

• Bottom-k sketch with $U[0,1]$ dist; with $\mathrm{Exp}(1)$ dist: spacings $y_{i+1} - y_i \sim \mathrm{Exp}(n-i)$.
• Sufficient statistic for (any function of) $n$: $y_k$, the largest value in the sketch.
• Contains strictly more information than k-mins.
• When $n \gg k$, approximately the same as k-mins.
• MLE for $n$ is the solution of: $\sum_{i=0}^{k-1} \frac{1}{n-i} = y_k$

Page 59: Leveraging Big Data:  Lecture 2

Bibliography

• See lecture 3.
• We will continue with Min-Hash sketches:
• Use as random samples
• Applications to similarity
• Inverse-probability based distinct count estimators