COMS4771, Columbia University
Machine Learning 4771
Instructors:
Adrian Weller and Ilia Vovsha
Topic 2: Basic concepts of Bayesians and Frequentists
•Properties of PDFs
•Bayesians & Frequentists
•ML, MAP and Full Bayes
•Example: Coin Toss
•Bernoulli Priors
•Conjugate Priors
•Bayesian decision theory
Properties of PDFs
(PDF: Probability Distribution Function)
•Review some basics of probability theory
•First, a pdf is a function: multiple inputs, one output:
$p(x_1, \ldots, x_n)$, e.g. $p(X_1 = 0.3, \ldots, X_n = 1) = 0.2$
•Function’s output is always non-negative:
$p(x_1, \ldots, x_n) \geq 0$
•Can have discrete or continuous or both inputs:
$p(X_1 = 1, X_2 = 0, X_3 = 0, X_4 = 3.1415)$
•Summing over the domain of all inputs gives unity (continuous → integral, discrete → sum):
$\int_{x=-\infty}^{\infty} \int_{y=-\infty}^{\infty} p(x,y)\,dx\,dy = 1 \qquad \sum_x \sum_y p(x,y) = 1$
Discrete example, a joint table over $X, Y \in \{0,1\}$ (entries from the slide’s figure; row/column assignment assumed):

| p(x,y) | Y=0 | Y=1 |
|--------|-----|-----|
| X=0    | 0.4 | 0.1 |
| X=1    | 0.3 | 0.2 |
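A quick numeric check of these properties, as a minimal Python sketch (assuming the 2×2 table above, with rows as X and columns as Y):

```python
import numpy as np

# Joint table from the slide: rows X in {0,1}, columns Y in {0,1}
# (a discrete "pdf" with two inputs and one output).
p_xy = np.array([[0.4, 0.1],
                 [0.3, 0.2]])

# Output is always non-negative.
assert (p_xy >= 0).all()

# Summing over the domain of all inputs gives unity (discrete -> sum).
assert np.isclose(p_xy.sum(), 1.0)

print(p_xy.sum())  # 1.0
```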
Properties of PDFs
(Bishop PRML 1.2.1)
CDF: Cumulative Distribution Function
[figure from Bishop PRML 1.2.1: a density p(x) and its cumulative distribution function P(x)]
Properties of PDFs
•Marginalizing: integrating/summing out a variable leaves a marginal distribution over the remaining ones…
$\sum_y p(x,y) = p(x)$
•Conditioning: if a variable ‘y’ is ‘given’ we get a conditional distribution over the remaining ones…
$p(x \mid y) = \frac{p(x,y)}{p(y)}$
•Bayes Rule: mathematically just conditioning redone, but it has a deeper meaning (1764)… if we have x being data and θ being a model:
$\underbrace{p(\theta \mid x)}_{\text{posterior}} = \frac{\overbrace{p(x \mid \theta)}^{\text{likelihood}}\,\overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(x)}_{\text{evidence}}}$
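A minimal Python sketch of marginalizing, conditioning, and Bayes rule on the same assumed 2×2 joint table:

```python
import numpy as np

p_xy = np.array([[0.4, 0.1],    # rows: x in {0,1}
                 [0.3, 0.2]])   # cols: y in {0,1}

# Marginalizing: sum out y to leave a marginal over x.
p_x = p_xy.sum(axis=1)          # [0.5, 0.5]

# Conditioning: p(x | y) = p(x, y) / p(y).
p_y = p_xy.sum(axis=0)
p_x_given_y = p_xy / p_y        # broadcasting divides each column by p(y)

# Bayes rule: recover p(y | x) from the likelihood p(x | y) and prior p(y).
p_y_given_x = (p_x_given_y * p_y) / p_x[:, None]
assert np.allclose(p_y_given_x, p_xy / p_x[:, None])   # sanity check
```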
Properties of PDFs
•Expectation: can use the pdf p(x) to compute averages and expected values for quantities, denoted by:
$E_{p(x)}\{x\} = \int_{-\infty}^{\infty} p(x)\,x\,dx$
$E_{p(x)}\{f(x)\} = \int_x p(x)\,f(x)\,dx \quad \text{or} \quad = \sum_x p(x)\,f(x)$
Example: speeding ticket. Fine = \$0 with probability 0.8, Fine = \$20 with probability 0.2. Expected cost of speeding? $E\{\text{fine}\} = 0.8 \cdot 0 + 0.2 \cdot 20 = \$4$.
•Properties:
$E\{c\,f(x)\} = c\,E\{f(x)\} \qquad E\{f(x)+c\} = E\{f(x)\}+c \qquad E\{E\{f(x)\}\} = E\{f(x)\}$
•Mean: expected value for x
•Variance: expected value of $(x-\text{mean})^2$, how much x varies:
$\mathrm{Var}\{x\} = E\{(x - E\{x\})^2\} = E\{x^2 - 2xE\{x\} + E\{x\}^2\} = E\{x^2\} - 2E\{x\}E\{x\} + E\{x\}^2 = E\{x^2\} - E\{x\}^2$
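The speeding-ticket expectation and the variance identity, checked numerically (a sketch; the probabilities and fines are the slide’s):

```python
import numpy as np

# Speeding-ticket example: fine is $0 w.p. 0.8, $20 w.p. 0.2.
fines = np.array([0.0, 20.0])
probs = np.array([0.8, 0.2])

mean = np.sum(probs * fines)                 # E{x} = 4.0 dollars
var = np.sum(probs * (fines - mean) ** 2)    # E{(x - E{x})^2}

# Same variance via the identity E{x^2} - E{x}^2.
var2 = np.sum(probs * fines ** 2) - mean ** 2
assert np.isclose(var, var2)
print(mean, var)  # 4.0 64.0
```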
The IID Assumption
•Most of the time, we will assume that a dataset is independent and identically distributed (IID)
•In many real situations, data is generated by some black-box phenomenon in an arbitrary order.
•Assume we are given a dataset: $D = \{x_1, \ldots, x_N\}$
•“Independent” means that (given the model Θ) the probability of our data multiplies:
$p(x_1, \ldots, x_N \mid \Theta) = \prod_{i=1}^{N} p_i(x_i \mid \Theta)$
•“Identically distributed” means that each marginal probability is the same for each data point:
$p(x_1, \ldots, x_N \mid \Theta) = \prod_{i=1}^{N} p_i(x_i \mid \Theta) = \prod_{i=1}^{N} p(x_i \mid \Theta)$
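A small sketch of the IID product, assuming a Bernoulli model for the coin-toss data that follows (the value of µ and the encoding H=1, T=0 are illustrative choices):

```python
import numpy as np

# Coin tosses H H T H encoded as 1/0; model parameter mu = P(H).
x = np.array([1, 1, 0, 1])
mu = 0.5

# "Independent": the probability of the data multiplies.
# "Identically distributed": every factor uses the same p(. | mu).
likelihood = np.prod(mu ** x * (1 - mu) ** (1 - x))
print(likelihood)  # 0.0625 = (1/2)^4
```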
Ex: Is a coin fair?
A stranger tells you his coin is fair.
Let’s assume tosses are iid with P(H)=µ.
He tosses it 4 times, gets H H T H.
What can you say about µ?
Bayesians & Frequentists
•Frequentists (Neyman/Pearson/Wald). An orthodox view that sampling is infinite and decision rules can be sharp.
•Bayesians (Bayes/Laplace/de Finetti). Unknown quantities are treated probabilistically and the state of the world can always be updated.
de Finetti: p(event) = the price I would pay for a contract that pays \$1 when the event happens (an actuarially fair price)
•Likelihoodists (Fisher). Single-sample inference based on maximizing the likelihood function.
Bayesians & Frequentists
•Frequentists:
  •Data are a repeatable random sample – there is a frequency
  •Underlying parameters remain constant during this repeatable process
  •Parameters are fixed
•Bayesians:
  •Data are observed from the realized sample
  •Parameters are unknown and described probabilistically
  •Data are fixed
Frequentists
•Frequentists: classical / objective view / no priors
  •every statistician should compute the same p(x), so no priors
  •can’t have a p(event) if it never happened
  •avoid p(θ): there is 1 true model, not a distribution over them
  •permitted: $p_\theta(x,y)$   forbidden: $p(x,y \mid \theta)$
•Frequentist inference: estimate one best model θ
  •use the Maximum Likelihood Estimator (ML) (unbiased & minimum variance)
  •do not depend on Bayes rule for learning
$\theta_{ML} = \arg\max_\theta \, p(D \mid \theta), \qquad \text{where the Data } D = \{x_1, x_2, \ldots, x_n\}$
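A sketch of the frequentist recipe: a grid search for the single θ maximizing p(D | θ). The coin data and the grid are illustrative assumptions:

```python
import numpy as np

# Pick the one best theta = P(H) by maximizing the likelihood of D = H H T H.
D = np.array([1, 1, 0, 1])
grid = np.linspace(0.001, 0.999, 999)

def likelihood(theta):
    return np.prod(theta ** D * (1 - theta) ** (1 - D))

theta_ml = grid[np.argmax([likelihood(t) for t in grid])]
print(theta_ml)  # ~0.75 = m/N
```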
Bayesians
•Bayesians: subjective view / priors are OK
  •put a distribution or pdf on all variables in the problem, even models & deterministic quantities (speed of light)
  •use a prior p(θ) on the model θ before seeing any data
•Bayesian inference: use Bayes rule for learning; integrate over all unknown model (θ) variables
Bayesian Inference
•Bayes rule can lead us to maximum likelihood
•Assume we have a prior over models $p(\theta)$
•How to pick $p(\theta)$?
  •Simpler θ is better
  •Pick a form for mathematical convenience
•We have data (can assume IID): $D = \{x_1, x_2, \ldots, x_N\}$
•Want to get a model to compute: $p(x)$
•Want p(x) given our data… How to proceed?
$\underbrace{p(\theta \mid x)}_{\text{posterior}} = \frac{\overbrace{p(x \mid \theta)}^{\text{likelihood}}\,\overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(x)}_{\text{evidence}}}$
Bayesian Inference
•Want p(x) given our data… $p(x \mid D) = p(x \mid x_1, x_2, \ldots, x_N)$

$\begin{aligned}
p(x \mid D) &= \int_\theta p(x, \theta \mid D)\,d\theta \\
&= \int_\theta p(x \mid \theta, D)\,p(\theta \mid D)\,d\theta \\
&= \int_\theta p(x \mid \theta)\,p(\theta \mid D)\,d\theta \qquad \text{(new $x$ is independent of $D$ given $\theta$)} \\
&= \int_\theta p(x \mid \theta)\,\frac{p(D \mid \theta)\,p(\theta)}{p(D)}\,d\theta \\
&= \int_\theta p(x \mid \theta)\,\frac{\prod_{i=1}^{N} p(x_i \mid \theta)\,p(\theta)}{p(D)}\,d\theta
\end{aligned}$

$p(x \mid \theta)$: many models; $p(\theta \mid D)$: weight on each model.
[figure: example priors p(θ) over discrete models θ = 1, 2, 3]
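The integral above, approximated with a discrete set of models so it becomes a weighted sum (the three θ values and the uniform prior are assumptions for illustration):

```python
import numpy as np

# Full Bayesian prediction with a discrete model set:
# p(x|D) = sum_theta p(x|theta) p(theta|D).
thetas = np.array([0.25, 0.5, 0.75])            # models: theta = P(H)
prior = np.array([1/3, 1/3, 1/3])               # uniform weight on each model
D = np.array([1, 1, 0, 1])                      # H H T H

lik = np.array([np.prod(t ** D * (1 - t) ** (1 - D)) for t in thetas])
posterior = lik * prior / np.sum(lik * prior)   # p(theta | D)

# Predictive probability of heads: weight each model's p(H | theta).
p_heads = np.sum(thetas * posterior)
print(posterior, p_heads)
```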
Bayesian Inference to MAP & ML
•The full Bayesian Inference integral can be mathematically tricky. MAP and ML are approximations of it…

$p(x \mid D) = \int_\theta p(x \mid \theta)\,\frac{\prod_{i=1}^{N} p(x_i \mid \theta)\,p(\theta)}{p(D)}\,d\theta \;\approx\; \int_\theta p(x \mid \theta)\,\delta(\theta - \theta^*)\,d\theta$

i.e. the posterior $p(\theta \mid D)$ is replaced by a point mass at its peak $\theta^*$.

•Maximum A Posteriori (MAP) is like Maximum Likelihood (ML) with a prior p(θ) which lets us prefer some models over others:

$\theta^*_{MAP} = \arg\max_\theta \prod_{i=1}^{N} p(x_i \mid \theta)\,p(\theta) \qquad \theta^*_{ML} = \arg\max_\theta \prod_{i=1}^{N} p(x_i \mid \theta) \quad \text{(where $p(\theta)$ is uniform)}$

In log form:
$l_{MAP}(\theta) = l_{ML}(\theta) + \log p(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta) + \log p(\theta)$
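A sketch contrasting MAP and ML in log form on a grid; the Beta(3,3)-shaped prior favoring fair coins is an assumed example:

```python
import numpy as np

# l_MAP(theta) = sum_i log p(x_i | theta) + log p(theta), maximized over a grid.
D = np.array([1, 1, 0, 1])                      # H H T H
grid = np.linspace(0.001, 0.999, 999)

log_lik = np.array([np.sum(D * np.log(t) + (1 - D) * np.log(1 - t)) for t in grid])
log_prior = 2 * np.log(grid) + 2 * np.log(1 - grid)   # proportional to Beta(3,3)

theta_ml = grid[np.argmax(log_lik)]               # ignores the prior
theta_map = grid[np.argmax(log_lik + log_prior)]  # pulled toward 1/2 by the prior
print(theta_ml, theta_map)  # ~0.75, ~0.625
```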
Ex: Is a coin fair?
A stranger tells you his coin is fair.
Let’s assume tosses are iid with P(H)=µ.
He tosses it 4 times, gets H H T H.
What can you say about µ?
$D = (H, H, T, H)$
Bernoulli Probability ML
•Bernoulli: $p(x) = \mu^x (1-\mu)^{1-x}, \quad x \in \{0,1\}, \; \mu \in [0,1]$
Notation: 0 ~ Tail, 1 ~ Head, µ = P(H); N trials/tosses, m heads/1s, N−m tails/0s.
•Log-Likelihood (IID):
$\log \prod_{i=1}^{N} p(x_i \mid \mu) = \sum_{i=1}^{N} \log \mu^{x_i}(1-\mu)^{1-x_i} = \sum_{i=1}^{N} \big[x_i \log \mu + (1-x_i)\log(1-\mu)\big]$
•Gradient = 0:
$\frac{\partial}{\partial\mu} \sum_{i=1}^{N} \big[x_i \log \mu + (1-x_i)\log(1-\mu)\big] = \frac{\partial}{\partial\mu}\Big[\sum_{i:\,x_i=1} \log\mu + \sum_{i:\,x_i=0} \log(1-\mu)\Big] = \frac{\partial}{\partial\mu}\big[m\log\mu + (N-m)\log(1-\mu)\big] = 0$
$\frac{m}{\mu} - \frac{N-m}{1-\mu} = 0 \;\Rightarrow\; m(1-\mu) - (N-m)\mu = 0 \;\Rightarrow\; m - N\mu = 0 \;\Rightarrow\; \mu_{ML} = \frac{m}{N}$
At the ML solution: $p(x{=}1) = \frac{m}{N}$, $p(x{=}0) = \frac{N-m}{N}$.
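The closed-form estimate for the coin data, as a two-line sketch:

```python
import numpy as np

# mu_ML = m / N: fraction of heads, from setting the gradient to zero.
x = np.array([1, 1, 0, 1])   # H H T H
m, N = x.sum(), x.size
print(m / N)                 # 0.75
```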
Bernoulli Bayes, Prior 1
• Assume the prior is a point mass at µ = 1/2
• Posterior
• If the prior is 0 for some value, the posterior will also be 0 at that value no matter what data we see
$D = (H, H, T, H)$
$P(\mu = r \mid D) \propto P(D \mid \mu = r) \times P(\mu = r)$
Bernoulli Bayes, Prior 2
• Allow some chance of bias
• Prior P(µ=1/2) = 1-b
P(µ=3/4) = b
• Posterior
$D = (H, H, T, H)$
$P(\mu = \tfrac{1}{2} \mid D) \propto P(D \mid \mu = \tfrac{1}{2}) \times P(\mu = \tfrac{1}{2}) = \tfrac{1}{2} \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} \times (1-b) = \left(\tfrac{1}{2}\right)^4 (1-b)$
$P(\mu = \tfrac{3}{4} \mid D) \propto P(D \mid \mu = \tfrac{3}{4}) \times P(\mu = \tfrac{3}{4}) = \tfrac{3}{4} \cdot \tfrac{3}{4} \cdot \tfrac{1}{4} \cdot \tfrac{3}{4} \times b = \left(\tfrac{3}{4}\right)^3 \tfrac{1}{4}\, b$
Bernoulli Bayes, Prior 2
• Prior P(µ=1/2) = 1-b
P(µ=3/4) = b
• Posterior
$D = (H, H, T, H)$
$P(\mu = \tfrac{1}{2} \mid D) \propto P(D \mid \mu = \tfrac{1}{2}) \times P(\mu = \tfrac{1}{2}) = \left(\tfrac{1}{2}\right)^4 (1-b)$
$P(\mu = \tfrac{3}{4} \mid D) \propto P(D \mid \mu = \tfrac{3}{4}) \times P(\mu = \tfrac{3}{4}) = \left(\tfrac{3}{4}\right)^3 \tfrac{1}{4}\, b$
The two are equal when $b = \tfrac{16}{43}$, since $\tfrac{1}{16}(1-b) = \tfrac{27}{256}\,b \;\Leftrightarrow\; 16(1-b) = 27b$.
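An exact-arithmetic sketch of this two-point posterior, confirming the tie at b = 16/43:

```python
from fractions import Fraction as F

# Posterior over mu = 1/2 (prior 1-b) and mu = 3/4 (prior b) after D = HHTH.
def posterior(b):
    w_fair = F(1, 2) ** 4 * (1 - b)          # P(D | mu=1/2) P(mu=1/2)
    w_bias = F(3, 4) ** 3 * F(1, 4) * b      # P(D | mu=3/4) P(mu=3/4)
    z = w_fair + w_bias
    return w_fair / z, w_bias / z

print(posterior(F(16, 43)))  # (1/2, 1/2): the two hypotheses tie exactly
```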
Bernoulli Bayes, Prior 3
• Uniform prior
$D = (H, H, T, H)$, prior $\mu \sim U[0,1]$
$P(\mu = r \mid D) \propto P(D \mid \mu = r)\,P(\mu = r)$
$P(D \mid \mu = r) = r^3(1-r)$
$\int_0^1 r^3(1-r)\,dr = \tfrac{1}{4} - \tfrac{1}{5} = \tfrac{1}{20}$
Hence the posterior is $P(\mu = r \mid D) = 20\,r^3(1-r)$. Where’s the mode?
Notice for N tosses with m heads, N−m tails: $P(D \mid \mu = r) = r^m (1-r)^{N-m}$
Does this suggest a convenient prior?
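A numeric sketch of this uniform-prior posterior, checking normalization and locating the mode:

```python
import numpy as np

# Posterior under a uniform prior: p(mu=r | D) = 20 r^3 (1 - r) for D = HHTH.
r = np.linspace(0, 1, 100001)
post = 20 * r ** 3 * (1 - r)

print(np.trapz(post, r))      # ~1.0: properly normalized
print(r[np.argmax(post)])     # 0.75: the mode matches mu_ML = m/N
```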
Beta Distribution
• Distribution over $\mu \in [0,1]$: $\mathrm{Beta}(\mu \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\,\mu^{a-1}(1-\mu)^{b-1}$
(Bishop PRML 2.1, especially 2.1.1)
Here we switch to typical ‘sloppy’ notation, using µ for both the variable name and its value.
Bernoulli Bayes, Beta Prior
The Beta distribution provides the conjugate prior for the Bernoulli distribution, i.e. the posterior distribution has the same form as the prior: a Beta(a, b) prior and m heads in N tosses give a Beta(a + m, b + N − m) posterior.
The hyperparameters a and b act as effective numbers of observations (+1) of heads and tails respectively.
All distributions in the Exponential Family (includes multinomial, Gaussian, Poisson) have convenient conjugate priors (Bishop PRML 2.4).
(Bishop PRML 2.1, especially 2.1.1)
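A minimal conjugate-update sketch (the prior a = b = 2 is an assumed example):

```python
# Conjugate update: Beta(a, b) prior + m heads, N - m tails -> Beta(a + m, b + N - m).
a, b = 2.0, 2.0              # prior hyperparameters (assumed example)
data = [1, 1, 0, 1]          # H H T H
m, N = sum(data), len(data)

a_post, b_post = a + m, b + N - m        # posterior is Beta(5, 3)
posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, posterior_mean)    # 5.0 3.0 0.625
```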
Beta Distribution
(Bishop PRML 2.1, especially 2.1.1)
[figure: plots of Beta(µ | a, b) for various settings of a and b]
Which distribution is this?
Prior × Likelihood = Posterior (normalized)
(Bishop PRML 2.1, especially 2.1.1)
Example: prior $p(\mu) = \mathrm{Beta}(\mu \mid a{=}2, b{=}2)$; a single observation of H (x = 1) gives likelihood $p(D \mid \mu) = \mu$ and posterior $p(\mu \mid D) = \mathrm{Beta}(\mu \mid a{=}3, b{=}2)$.
Recall our earlier example of a Uniform prior; check this works…
Properties of the Posterior
As the size of the data set, N, grows
(Bishop PRML 2.1, especially 2.1.1)
$E[\mu] = \frac{a_N}{a_N + b_N} = \frac{a_0 + m}{a_0 + b_0 + N} \;\to\; \frac{m}{N} = \mu_{ML}$
$\mathrm{var}[\mu] = \frac{a_N\,b_N}{(a_N + b_N)^2\,(a_N + b_N + 1)} \;\to\; 0$
where $a_0, b_0$ are the prior hyperparameters and $a_N = a_0 + m$, $b_N = b_0 + N - m$ after observing m heads in N tosses.
This is typical behavior for Bayesian learning.
Prediction under the Posterior
What is the probability that the next coin toss will land heads up?
(Bishop PRML 2.1, especially 2.1.1)
$p(x{=}H \mid D) = \int_0^1 \mu\,p(\mu \mid D)\,d\mu = E[\mu \mid D] = \frac{a_N}{a_N + b_N}$
$D = (H, H, T, H) \;\Rightarrow\; a_N = a + 3, \; b_N = b + 1$
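Plugging in an assumed uniform Beta(1,1) prior:

```python
# Predictive probability of heads is the posterior mean a_N / (a_N + b_N).
a, b = 1, 1                      # uniform prior (assumed example)
a_N, b_N = a + 3, b + 1          # 3 heads, 1 tail observed
p_next_heads = a_N / (a_N + b_N)
print(p_next_heads)              # 4/6 ~ 0.667 (Laplace's rule of succession)
```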
Bayesian Decision Theory
(Bishop PRML 1.5)
• Initially assume just 2 classes $C_1, C_2$
• Given input data $x$ we want to determine which class is optimal
• Various possible criteria
• Need $p(C_k \mid x)$ either directly (discriminative)
• Or by Bayes (generative):
$p(C_k \mid x) = \frac{p(x, C_k)}{p(x)} = \frac{p(x \mid C_k)\,p(C_k)}{p(x)}$
• Divide input space into decision regions $R_1, R_2$ separated by decision boundaries such that $x \in R_k \Rightarrow$ assign $C_k$
• How might we choose boundaries?
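A sketch of the generative route, with assumed 1-D Gaussian class-conditionals and equal priors:

```python
import numpy as np

# Generative two-class posterior via Bayes rule:
#   p(C_k | x) = p(x | C_k) p(C_k) / p(x)
priors = np.array([0.5, 0.5])
means, sigmas = np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def class_posterior(x):
    joint = gauss_pdf(x, means, sigmas) * priors   # p(x, C_k) = p(x | C_k) p(C_k)
    return joint / joint.sum()                     # divide by evidence p(x)

post = class_posterior(0.2)
print(post, "assign C%d" % (post.argmax() + 1))    # x in R_k => assign C_k
```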
Criterion 1: Minimize Misclassification Rate
(Bishop PRML 1.5)
Minimize the probability of a mistake:
$p(\text{mistake}) = \int_{R_1} p(x, C_2)\,dx + \int_{R_2} p(x, C_1)\,dx, \qquad p(x, C_k) = p(C_k \mid x)\,p(x)$
[figure: joint densities $p(x, C_1)$, $p(x, C_2)$ with a decision boundary; shaded regions mark “assign class 1 but should be 2” and “assign class 2 but should be 1”; the optimal boundary minimizes their total area. In general the boundary could be more complex.]
Criterion 2: Minimize Expected Loss
(Bishop PRML 1.5)
Example loss matrix: classify medical images as ‘cancer’ or ‘normal’ (rows: Truth, columns: Decision; values as in Bishop PRML 1.5):

| Truth \ Decision | cancer | normal |
|------------------|--------|--------|
| cancer           | 0      | 1000   |
| normal           | 1      | 0      |

Now the expected loss is $E[L] = \sum_k \sum_j \int_{R_j} L_{kj}\,p(x, C_k)\,dx$.
Choose regions $R_j$ to minimize the Expected Loss.
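A sketch of the resulting decision rule for a single x, using the loss matrix above (the posterior values are illustrative):

```python
import numpy as np

# Minimize expected loss: given p(C_k | x), pick the decision j minimizing
# sum_k L[k, j] p(C_k | x).
L = np.array([[0, 1000],     # truth = cancer
              [1,    0]])    # truth = normal

def decide(post):                       # post = [p(cancer|x), p(normal|x)]
    risk = post @ L                     # expected loss of each decision
    return ["cancer", "normal"][int(np.argmin(risk))]

print(decide(np.array([0.01, 0.99])))   # even 1% cancer risk -> decide "cancer"
```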
Reject Option
(Bishop PRML 1.5)
If $p(C_1 \mid x) \approx p(C_2 \mid x)$ then we are less confident about assigning a class.
One possibility is to reject, or refuse to assign.