CS B351 STATISTICAL LEARNING

TRANSCRIPT

Page 1: CS  b351 Statistical Learning

CS B351 STATISTICAL LEARNING

Page 2: CS  b351 Statistical Learning

AGENDA
• Learning coin flips, learning Bayes net parameters
• Likelihood functions, maximum likelihood estimation (MLE)
• Priors, maximum a posteriori estimation (MAP)
• Bayesian estimation

Page 3: CS  b351 Statistical Learning

LEARNING COIN FLIPS
• Observe that c out of N draws are cherries (data)
• Intuition: c/N might be a good hypothesis for the fraction of cherries in the bag
• "Intuitive" parameter estimate: the empirical distribution, P(cherry) = c/N
• (Why is this reasonable? Perhaps we got a bad draw!)

Page 4: CS  b351 Statistical Learning

LEARNING COIN FLIPS
• Observe that c out of N draws are cherries (data)
• Let the unknown fraction of cherries be q (hypothesis)
• Probability of drawing a cherry is q
• Assumption: draws are independent and identically distributed (i.i.d.)

Page 5: CS  b351 Statistical Learning

LEARNING COIN FLIPS
• Probability of drawing a cherry is q
• Assumption: draws are independent and identically distributed (i.i.d.)
• Probability of drawing 2 cherries is q*q
• Probability of drawing 2 limes is (1-q)^2
• Probability of drawing 1 cherry and then 1 lime: q*(1-q)

Page 6: CS  b351 Statistical Learning

LIKELIHOOD FUNCTION
• Likelihood: the probability of the data d = {d1, …, dN} given the hypothesis q
• P(d|q) = ∏_j P(d_j|q)   (i.i.d. assumption)

Page 7: CS  b351 Statistical Learning

LIKELIHOOD FUNCTION
• Likelihood: the probability of the data d = {d1, …, dN} given the hypothesis q
• P(d|q) = ∏_j P(d_j|q) = ∏_j [ q if d_j = Cherry, 1-q if d_j = Lime ]
• This is the probability model, assuming q is given

Page 8: CS  b351 Statistical Learning

LIKELIHOOD FUNCTION
• Likelihood: the probability of the data d = {d1, …, dN} given the hypothesis q
• P(d|q) = ∏_j P(d_j|q) = ∏_j [ q if d_j = Cherry, 1-q if d_j = Lime ] = q^c (1-q)^(N-c)
• Gather the c cherry terms together, then the N-c lime terms (a code sketch follows)
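
A minimal sketch of this likelihood in code (the list-of-strings encoding of the draws and the function name are illustrative assumptions, not from the slides):

```python
# Sketch: evaluate the coin-flip likelihood P(d|q) = q^c (1-q)^(N-c).
def likelihood(q, data):
    """Probability of the observed i.i.d. draws, given hypothesis q = P(Cherry)."""
    c = sum(1 for d in data if d == "Cherry")   # number of cherries
    n = len(data)                               # total number of draws
    return q**c * (1 - q)**(n - c)

draws = ["Cherry", "Cherry", "Lime"]            # 2 cherries out of 3 draws
print(likelihood(0.5, draws))                   # 0.125
print(likelihood(2/3, draws))                   # ~0.148, the peak (see the MLE slides below)
```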

Page 9: CS  b351 Statistical Learning

MAXIMUM LIKELIHOOD
• Likelihood of data d = {d1, …, dN} given q: P(d|q) = q^c (1-q)^(N-c)

[Plot: P(data|q) vs. q, 1/1 cherry]

Page 10: CS  b351 Statistical Learning

MAXIMUM LIKELIHOOD
• Likelihood of data d = {d1, …, dN} given q: P(d|q) = q^c (1-q)^(N-c)

[Plot: P(data|q) vs. q, 2/2 cherry]

Page 11: CS  b351 Statistical Learning

MAXIMUM LIKELIHOOD
• Likelihood of data d = {d1, …, dN} given q: P(d|q) = q^c (1-q)^(N-c)

[Plot: P(data|q) vs. q, 2/3 cherry]

Page 12: CS  b351 Statistical Learning

MAXIMUM LIKELIHOOD
• Likelihood of data d = {d1, …, dN} given q: P(d|q) = q^c (1-q)^(N-c)

[Plot: P(data|q) vs. q, 2/4 cherry]

Page 13: CS  b351 Statistical Learning

MAXIMUM LIKELIHOOD
• Likelihood of data d = {d1, …, dN} given q: P(d|q) = q^c (1-q)^(N-c)

[Plot: P(data|q) vs. q, 2/5 cherry]

Page 14: CS  b351 Statistical Learning

MAXIMUM LIKELIHOOD
• Likelihood of data d = {d1, …, dN} given q: P(d|q) = q^c (1-q)^(N-c)

[Plot: P(data|q) vs. q, 10/20 cherry]

Page 15: CS  b351 Statistical Learning

MAXIMUM LIKELIHOOD
• Likelihood of data d = {d1, …, dN} given q: P(d|q) = q^c (1-q)^(N-c)

[Plot: P(data|q) vs. q, 50/100 cherry]

Page 16: CS  b351 Statistical Learning

MAXIMUM LIKELIHOOD
• Peaks of the likelihood function seem to hover around the fraction of cherries…
• Sharpness indicates some notion of certainty…

[Plot: P(data|q) vs. q, 50/100 cherry]

Page 17: CS  b351 Statistical Learning

MAXIMUM LIKELIHOOD
• P(d|q) is the likelihood function
• The quantity argmax_q P(d|q) is known as the maximum likelihood estimate (MLE)

[Plot: P(data|q) vs. q, 1/1 cherry; q = 1 is the MLE]

Page 18: CS  b351 Statistical Learning

MAXIMUM LIKELIHOOD
• P(d|q) is the likelihood function
• The quantity argmax_q P(d|q) is known as the maximum likelihood estimate (MLE)

[Plot: P(data|q) vs. q, 2/2 cherry; q = 1 is the MLE]

Page 19: CS  b351 Statistical Learning

MAXIMUM LIKELIHOOD
• P(d|q) is the likelihood function
• The quantity argmax_q P(d|q) is known as the maximum likelihood estimate (MLE)

[Plot: P(data|q) vs. q, 2/3 cherry; q = 2/3 is the MLE]

Page 20: CS  b351 Statistical Learning

MAXIMUM LIKELIHOOD
• P(d|q) is the likelihood function
• The quantity argmax_q P(d|q) is known as the maximum likelihood estimate (MLE)

[Plot: P(data|q) vs. q, 2/4 cherry; q = 1/2 is the MLE]

Page 21: CS  b351 Statistical Learning

MAXIMUM LIKELIHOOD
• P(d|q) is the likelihood function
• The quantity argmax_q P(d|q) is known as the maximum likelihood estimate (MLE)

[Plot: P(data|q) vs. q, 2/5 cherry; q = 2/5 is the MLE]

Page 22: CS  b351 Statistical Learning

PROOF: EMPIRICAL FREQUENCY IS THE MLE
l(q) = log P(d|q) = log [ q^c (1-q)^(N-c) ]

Page 23: CS  b351 Statistical Learning

PROOF: EMPIRICAL FREQUENCY IS THE MLE
l(q) = log P(d|q) = log [ q^c (1-q)^(N-c) ]
     = log [ q^c ] + log [ (1-q)^(N-c) ]

Page 24: CS  b351 Statistical Learning

PROOF: EMPIRICAL FREQUENCY IS THE MLE
l(q) = log P(d|q) = log [ q^c (1-q)^(N-c) ]
     = log [ q^c ] + log [ (1-q)^(N-c) ]
     = c log q + (N-c) log (1-q)

Page 25: CS  b351 Statistical Learning

PROOF: EMPIRICAL FREQUENCY IS THE MLE
l(q) = log P(d|q) = c log q + (N-c) log (1-q)
Setting dl/dq (q) = 0 gives the maximum likelihood estimate

Page 26: CS  b351 Statistical Learning

PROOF: EMPIRICAL FREQUENCY IS THE MLE
dl/dq (q) = c/q - (N-c)/(1-q)
At the MLE, c/q - (N-c)/(1-q) = 0  =>  c(1-q) = (N-c)q  =>  q = c/N
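
A quick numerical sanity check of this result, as a sketch (the grid resolution and the choice of c and N are arbitrary):

```python
# Sketch: confirm numerically that q = c/N maximizes q^c (1-q)^(N-c).
c, N = 2, 5                                  # 2 cherries out of 5 draws
qs = [i / 1000 for i in range(1, 1000)]      # grid over (0, 1)
q_best = max(qs, key=lambda q: q**c * (1 - q)**(N - c))
print(q_best)                                # 0.4 = c/N
```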

Page 27: CS  b351 Statistical Learning

MAXIMUM LIKELIHOOD FOR BN
• For any BN, the ML parameters of any CPT can be derived from the fraction of observed values in the data, conditioned on matched parent values

BN: Earthquake → Alarm ← Burglar

N = 1000 examples:  E: 500,  B: 200   =>   P(E) = 0.5,  P(B) = 0.2

Observed counts for Alarm:
A | E, B:   19/20
A | ¬E, B:  188/200
A | E, ¬B:  170/500
A | ¬E, ¬B: 1/380

E B | P(A|E,B)
T T | 0.95
F T | 0.95
T F | 0.34
F F | 0.003

Page 28: CS  b351 Statistical Learning

FITTING CPTS VIA MLE
• M examples D = (d[1], …, d[M])
• Each d[i] is a complete example of all variables in the Bayes net
• Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN

Page 29: CS  b351 Statistical Learning

FITTING CPTS VIA MLE
• M examples D = (d[1], …, d[M])
• Each d[i] is a complete example of all variables in the Bayes net
• Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN
• Suppose the BN has a single variable X; estimate X's CPT, P(X)

Page 30: CS  b351 Statistical Learning

FITTING CPTS VIA MLE
• M examples D = (d[1], …, d[M])
• Each d[i] is a complete example of all variables in the Bayes net
• Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN
• Suppose the BN has a single variable X; estimate X's CPT, P(X) (just learning a coin flip as usual)
• P_MLE(X) = empirical distribution of D:
  P_MLE(X=T) = Count(X=T) / M
  P_MLE(X=F) = Count(X=F) / M

Page 31: CS  b351 Statistical Learning

FITTING CPTS VIA MLE
• M examples D = (d[1], …, d[M])
• Each d[i] is a complete example of all variables in the Bayes net
• Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN
• Suppose the BN is X → Y: estimate P(X) and P(Y|X)
• Estimate P_MLE(X) as usual

Page 32: CS  b351 Statistical Learning

FITTING CPTS VIA MLE
• M examples D = (d[1], …, d[M])
• Each d[i] is a complete example of all variables in the Bayes net
• Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN
• For the BN X → Y, estimate P_MLE(Y|X) with…

P(Y|X) | X=T | X=F
Y=T    |     |
Y=F    |     |

Page 33: CS  b351 Statistical Learning

FITTING CPTS VIA MLE
• M examples D = (d[1], …, d[M])
• Each d[i] is a complete example of all variables in the Bayes net
• Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN
• For the BN X → Y, estimate P_MLE(Y|X) with:

P(Y|X) | X=T                          | X=F
Y=T    | Count(Y=T,X=T) / Count(X=T)  | Count(Y=T,X=F) / Count(X=F)
Y=F    | Count(Y=F,X=T) / Count(X=T)  | Count(Y=F,X=F) / Count(X=F)

Page 34: CS  b351 Statistical Learning

FITTING CPTS VIA MLE
• M examples D = (d[1], …, d[M])
• Each d[i] is a complete example of all variables in the Bayes net
• Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN
• In general, for P(Y|X1,…,Xk), for each setting of (y, x1,…,xk):
  • Compute Count(y, x1,…,xk)
  • Compute Count(x1,…,xk)
  • Set P(y|x1,…,xk) = Count(y, x1,…,xk) / Count(x1,…,xk)

(Example BN: X1, X2, X3 → Y; a code sketch of the counting recipe follows)
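
The counting recipe above can be sketched as follows; the dict-based data format and the helper name fit_cpt_mle are illustrative assumptions, not part of the slides:

```python
from collections import Counter

def fit_cpt_mle(examples, child, parents):
    """MLE CPT: map each parent assignment to {child value: Count(y,xs)/Count(xs)}."""
    joint = Counter()    # Count(y, x1,...,xk)
    parent = Counter()   # Count(x1,...,xk)
    for d in examples:
        xs = tuple(d[p] for p in parents)
        joint[(d[child], xs)] += 1
        parent[xs] += 1
    cpt = {}
    for (y, xs), n in joint.items():
        cpt.setdefault(xs, {})[y] = n / parent[xs]
    return cpt

# Toy data for the BN X -> Y
data = [{"X": True, "Y": True}, {"X": True, "Y": False},
        {"X": True, "Y": True}, {"X": False, "Y": False}]
print(fit_cpt_mle(data, child="Y", parents=["X"]))
# roughly {(True,): {True: 0.67, False: 0.33}, (False,): {False: 1.0}}
```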

Page 35: CS  b351 Statistical Learning

OTHER MLE RESULTS
• Categorical distributions (non-binary discrete variables): the empirical distribution is the MLE (make a histogram, divide by N)
• Continuous Gaussian distributions: MLE mean = average of the data, MLE standard deviation = standard deviation of the data (sketch below)

[Plots: histogram of the data and the fitted Gaussian (normal) distribution]
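
A minimal sketch of the Gaussian case (the sample values are illustrative); note that the MLE standard deviation divides by N, not N-1:

```python
import numpy as np

samples = np.array([4.9, 5.2, 5.0, 4.7, 5.3, 5.1])   # illustrative data
mu_mle = samples.mean()                               # MLE mean = sample average
sigma_mle = samples.std(ddof=0)                       # MLE std divides by N, not N-1
print(mu_mle, sigma_mle)
```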

Page 36: CS  b351 Statistical Learning

NICE PROPERTIES OF MLE
• Easy to compute (for certain probability models)
• With enough data, the q_MLE estimate will approach the true unknown value of q

Page 37: CS  b351 Statistical Learning

PROBLEMS WITH MLE
• The MLE was easy to compute… but what happens when we don't have much data?
• Motivation: you hand me a coin from your pocket; 1 flip, and it turns up tails. What's the MLE?

Page 38: CS  b351 Statistical Learning

PROBLEMS WITH MLE
• The MLE was easy to compute… but what happens when we don't have much data?
• Motivation: you hand me a coin from your pocket; 1 flip, and it turns up tails. What's the MLE?
• q_MLE has high variance with small sample sizes

Page 39: CS  b351 Statistical Learning

VARIANCE OF AN ESTIMATOR: INTUITION
• The dataset D is just a sample of the underlying distribution; if we could "do over" the sample, we might get a new dataset D'.
• With D', our MLE estimate q_MLE' might be different. How much? How often?
• Assume all values of q are equally likely. In the case of 1 draw, D would just as likely have been a Lime; in that case, q_MLE = 0.
• So with probability 0.5, q_MLE would be 1, and with the same probability, q_MLE would be 0.
• High variance: typical "do overs" give drastically different results! (simulation sketch below)
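
This "do over" intuition can be simulated directly; a sketch, where the true q, the sample sizes, and the number of repetitions are illustrative assumptions:

```python
import random

def mle_after_draws(q_true, n, rng):
    """Draw n i.i.d. candies and return the MLE estimate (#cherries)/n."""
    cherries = sum(1 for _ in range(n) if rng.random() < q_true)
    return cherries / n

rng = random.Random(0)
for n in (1, 10, 100):
    estimates = [mle_after_draws(0.5, n, rng) for _ in range(10000)]
    mean = sum(estimates) / len(estimates)
    var = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    print(n, round(var, 4))   # variance of q_MLE shrinks roughly like q(1-q)/n
```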

Page 40: CS  b351 Statistical Learning

IS THERE A BETTER WAY? BAYESIAN LEARNING

Page 41: CS  b351 Statistical Learning

AN ALTERNATIVE APPROACH: BAYESIAN ESTIMATION
• P(D|q) is the likelihood
• P(q) is the hypothesis prior
• P(q|D) = 1/Z P(D|q) P(q) is the posterior: the distribution of hypotheses given the data

[Diagram: q → d[1], d[2], …, d[M]]

Page 42: CS  b351 Statistical Learning

BAYESIAN PREDICTION
• For a new draw Y: use the hypothesis posterior to predict P(Y|D)

[Diagram: q → d[1], d[2], …, d[M], Y]

Page 43: CS  b351 Statistical Learning

CANDY EXAMPLE
• Candy comes in 2 flavors, cherry and lime, with identical wrappers
• Manufacturer makes 5 indistinguishable bags
• Suppose we draw…
• What bag are we holding? What flavor will we draw next?

h1: C 100%, L 0%
h2: C 75%, L 25%
h3: C 50%, L 50%
h4: C 25%, L 75%
h5: C 0%, L 100%

Page 44: CS  b351 Statistical Learning

BAYESIAN LEARNING
• Main idea: compute the probability of each hypothesis, given the data
• Data D: …
• Hypotheses: h1, …, h5

h1: C 100%, L 0%
h2: C 75%, L 25%
h3: C 50%, L 50%
h4: C 25%, L 75%
h5: C 0%, L 100%

Page 45: CS  b351 Statistical Learning

BAYESIAN LEARNING
• Main idea: compute the probability of each hypothesis, given the data
• Data D: …
• Hypotheses: h1, …, h5
• P(hi|D): we want this…
• P(D|hi): but all we have is this!

h1: C 100%, L 0%
h2: C 75%, L 25%
h3: C 50%, L 50%
h4: C 25%, L 75%
h5: C 0%, L 100%

Page 46: CS  b351 Statistical Learning

USING BAYES' RULE
• P(hi|D) = α P(D|hi) P(hi) is the posterior
• (Recall, 1/α = P(D) = Σ_i P(D|hi) P(hi))
• P(D|hi) is the likelihood
• P(hi) is the hypothesis prior

h1: C 100%, L 0%
h2: C 75%, L 25%
h3: C 50%, L 50%
h4: C 25%, L 75%
h5: C 0%, L 100%

Page 47: CS  b351 Statistical Learning

COMPUTING THE POSTERIOR
• Assume draws are independent
• Let (P(h1), …, P(h5)) = (0.1, 0.2, 0.4, 0.2, 0.1)
• D = { 10 × lime }

P(D|h1) = 0          P(D|h1)P(h1) = 0
P(D|h2) = 0.25^10    P(D|h2)P(h2) = 9e-8
P(D|h3) = 0.5^10     P(D|h3)P(h3) = 4e-4
P(D|h4) = 0.75^10    P(D|h4)P(h4) = 0.011
P(D|h5) = 1^10       P(D|h5)P(h5) = 0.1

Sum = 1/α = 0.1114

P(h1|D) = 0
P(h2|D) = 0.00
P(h3|D) = 0.00
P(h4|D) = 0.10
P(h5|D) = 0.90

Page 48: CS  b351 Statistical Learning

POSTERIOR HYPOTHESES

Page 49: CS  b351 Statistical Learning

PREDICTING THE NEXT DRAW
• P(Y|D) = Σ_i P(Y|hi, D) P(hi|D) = Σ_i P(Y|hi) P(hi|D)

P(h1|D) = 0       P(Y|h1) = 0
P(h2|D) = 0.00    P(Y|h2) = 0.25
P(h3|D) = 0.00    P(Y|h3) = 0.5
P(h4|D) = 0.10    P(Y|h4) = 0.75
P(h5|D) = 0.90    P(Y|h5) = 1

• Probability that the next candy drawn is a lime: P(Y|D) = 0.975 (computation sketched below)

[Diagram: H → D, Y]
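
The posterior and predictive computations on these slides can be reproduced in a few lines; a sketch, where the list encoding of the hypotheses and data is an illustrative assumption:

```python
# Posterior over the five candy hypotheses, then the predictive P(next is lime | D).
priors = [0.1, 0.2, 0.4, 0.2, 0.1]           # P(h1), ..., P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]         # lime fraction under each hypothesis
D = ["lime"] * 10                            # observed data: 10 limes

likelihoods = [pl ** D.count("lime") * (1 - pl) ** D.count("cherry") for pl in p_lime]
unnorm = [l * p for l, p in zip(likelihoods, priors)]
Z = sum(unnorm)                              # normalization constant 1/α
posterior = [u / Z for u in unnorm]
print([round(p, 3) for p in posterior])      # [0.0, 0.0, 0.003, 0.101, 0.896]

pred = sum(pl * po for pl, po in zip(p_lime, posterior))
print(round(pred, 3))                        # ~0.973 (the slide's 0.975 uses the rounded posteriors)
```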

Page 50: CS  b351 Statistical Learning

P(NEXT CANDY IS LIME | D)

Page 51: CS  b351 Statistical Learning

BACK TO COIN FLIPS: UNIFORM PRIOR, BERNOULLI DISTRIBUTION
• Assume P(q) is uniform
• P(q|D) = 1/Z P(D|q) = 1/Z q^c (1-q)^(N-c)
• What's P(Y|D)?

[Diagram: q → d[1], d[2], …, d[M], Y]

Page 52: CS  b351 Statistical Learning

ASSUMPTION: UNIFORM PRIOR, BERNOULLI DISTRIBUTION
• P(Y|D) = ∫ q P(q|D) dq, with Z = ∫ q^c (1-q)^(N-c) dq
• => Z = c! (N-c)! / (N+1)!
• => P(Y|D) = 1/Z · (c+1)! (N-c)! / (N+2)! = (c+1) / (N+2)
• Can think of this as a "correction" using "virtual counts" (numerical check below)

[Diagram: q → d[1], d[2], …, d[M], Y]
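
A sketch of a numerical check of this (c+1)/(N+2) result, using a simple midpoint sum over q in place of the exact integral (the values of c and N are illustrative):

```python
# Check P(Y|D) = (c+1)/(N+2) under a uniform prior by numerical integration over q.
c, N = 3, 10
qs = [(i + 0.5) / 100_000 for i in range(100_000)]        # midpoints of (0, 1)
post = [q**c * (1 - q)**(N - c) for q in qs]               # unnormalized posterior
Z = sum(post) / len(qs)                                    # ~ c!(N-c)!/(N+1)!
pred = sum(q * p for q, p in zip(qs, post)) / len(qs) / Z  # E[q] under the posterior
print(pred, (c + 1) / (N + 2))                             # both ~0.3333
```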

Page 53: CS  b351 Statistical Learning

NONUNIFORM PRIORS
• P(q|D) ∝ P(D|q) P(q) = q^c (1-q)^(N-c) P(q)
• Define, for all q, the probability that I believe in q

[Plot: a prior density P(q) over q ∈ [0, 1]]

Page 54: CS  b351 Statistical Learning

BETA DISTRIBUTION
• Beta_{a,b}(q) = γ q^(a-1) (1-q)^(b-1)
• a, b are hyperparameters > 0
• γ is a normalization constant
• a = b = 1 is the uniform distribution

Page 55: CS  b351 Statistical Learning

POSTERIOR WITH BETA PRIOR
• Posterior ∝ q^c (1-q)^(N-c) P(q) = γ q^(c+a-1) (1-q)^(N-c+b-1) = Beta_{a+c, b+N-c}(q)
• Prediction = posterior mean: E[q] = (c+a) / (N+a+b) (sketch below)
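
A short sketch of this conjugate update (the function name and arguments are illustrative):

```python
def beta_update(a, b, c, N):
    """Beta(a, b) prior + c cherries out of N draws -> posterior Beta(a+c, b+N-c)."""
    a_post, b_post = a + c, b + (N - c)
    mean = a_post / (a_post + b_post)       # prediction E[q] = (c+a)/(N+a+b)
    return a_post, b_post, mean

print(beta_update(1, 1, c=1, N=1))   # uniform prior, 1/1 cherry: mean 2/3, not the MLE's 1
print(beta_update(5, 5, c=1, N=1))   # stronger prior pulls the estimate toward 1/2
```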

Page 56: CS  b351 Statistical Learning

POSTERIOR WITH BETA PRIOR
• What does this mean? The prior specifies a "virtual count" of a-1 heads and b-1 tails
• See heads, increment a; see tails, increment b
• The effect of the prior diminishes with more data

Page 57: CS  b351 Statistical Learning

CHOOSING A PRIOR
• Part of the design process; must be chosen according to your intuition
• Uninformed belief => a = b = 1; strong belief => a, b high

Page 58: CS  b351 Statistical Learning

FITTING CPTS VIA MAP
• M examples D = (d[1], …, d[M]), virtual counts a, b
• Estimate P_MAP(Y|X) by assuming we've already seen a examples of Y=T and b examples of Y=F (sketch after the table)

P(Y|X) | X=T                                   | X=F
Y=T    | (Count(Y=T,X=T)+a) / (Count(X=T)+a+b) | (Count(Y=T,X=F)+a) / (Count(X=F)+a+b)
Y=F    | (Count(Y=F,X=T)+b) / (Count(X=T)+a+b) | (Count(Y=F,X=F)+b) / (Count(X=F)+a+b)
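
A sketch of the table above in code for binary X and Y, mirroring the earlier MLE counting sketch; the data format and function name are illustrative assumptions:

```python
from collections import Counter

def fit_cpt_map(examples, a=1, b=1):
    """MAP-style estimate of P(Y|X) with virtual counts a (for Y=T) and b (for Y=F)."""
    joint = Counter((d["Y"], d["X"]) for d in examples)   # Count(Y=y, X=x)
    parent = Counter(d["X"] for d in examples)            # Count(X=x)
    cpt = {}
    for x in (True, False):
        denom = parent[x] + a + b
        cpt[x] = {True: (joint[(True, x)] + a) / denom,
                  False: (joint[(False, x)] + b) / denom}
    return cpt

data = [{"X": True, "Y": True}]        # a single observed example
print(fit_cpt_map(data, a=1, b=1))
# X=True: P(Y=T) = 2/3 rather than 1; X=False: falls back to the a:b prior (1/2 each)
```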

Page 59: CS  b351 Statistical Learning

PROPERTIES OF MAP
• Approaches the MLE as the dataset grows large (the effect of the prior diminishes in the face of evidence)
• More stable estimates than MLE with small sample sizes
• Needs a designer's judgment to set the prior

Page 60: CS  b351 Statistical Learning

EXTENSIONS OF BETA PRIORS
• Parameters of multi-valued (categorical) distributions, e.g. histograms: Dirichlet prior
• Mathematical derivation is more complex, but in practice it still takes the form of "virtual counts" (sketch below)

[Plots: four example histograms, labeled 0, 1, 5, and 10]
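
A hedged sketch of the categorical analogue, smoothing a histogram with one virtual count per value (the die-roll data and the parameter name `virtual` are illustrative):

```python
from collections import Counter

def smoothed_histogram(samples, values, virtual=1):
    """Categorical estimate with Dirichlet-style virtual counts added to every value."""
    counts = Counter(samples)
    total = len(samples) + virtual * len(values)
    return {v: (counts[v] + virtual) / total for v in values}

rolls = [1, 3, 3, 6, 2, 3]                               # illustrative die rolls
print(smoothed_histogram(rolls, values=range(1, 7)))
# unseen faces (4 and 5) get nonzero probability instead of the MLE's 0
```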

Page 61: CS  b351 Statistical Learning

RECAP
• Parameter learning via coin flips
• Maximum likelihood
• Bayesian learning with a Beta prior
• Learning Bayes net parameters

Page 62: CS  b351 Statistical Learning

NEXT TIME
• Introduction to machine learning
• R&N 18.1-3