PGM: Tirgul 10: Parameter Learning and Priors
Why learning?
Knowledge acquisition bottleneck: knowledge acquisition is an expensive process, and often we don't have an expert.
Data is cheap: vast amounts of data become available to us.
Learning allows us to build systems based on the data.
Learning Bayesian networks
[Figure: data + prior information are fed to an Inducer, which outputs a Bayesian network over E, B, A, C, R together with its CPTs; e.g. the CPT P(A | E,B) has rows (.9, .1), (.7, .3), (.99, .01), (.8, .2) for the four combinations of values of E and B.]
Known Structure -- Complete Data

Data (E, B, A): <Y,N,N> <Y,Y,Y> <N,N,Y> <N,Y,Y> … <N,Y,Y>

[Figure: the Inducer receives the data and the fixed structure E → A ← B; the CPT P(A | E,B) starts with unknown entries "?" and is filled in with estimates such as .9/.1, .7/.3, .99/.01, .8/.2.]

The network structure is specified; the Inducer needs to estimate the parameters.
The data does not contain missing values.
Unknown Structure -- Complete Data

Data (E, B, A): <Y,N,N> <Y,Y,Y> <N,N,Y> <N,Y,Y> … <N,Y,Y>

[Figure: the Inducer receives the data, selects a structure over E, B, A, and estimates the CPT P(A | E,B).]

The network structure is not specified; the Inducer needs to select arcs and estimate parameters.
The data does not contain missing values.
Known Structure -- Incomplete Data

Data (E, B, A): <Y,N,N> <Y,?,Y> <N,N,Y> <N,Y,?> … <?,Y,Y>

[Figure: the Inducer receives the data and the fixed structure E → A ← B, and must fill in the unknown CPT entries for P(A | E,B).]

The network structure is specified, but the data contains missing values ("?").
We consider assignments to the missing values.
Known Structure / Complete Data
Given a network structure G and a choice of parametric family for P(X_i | Pa_i), learn the parameters for the network.

Goal: construct a network that is "closest" to the probability distribution that generated the data.
Example: Binomial Experiment (Statistics 101)

When tossed, a thumbtack can land in one of two positions: Head or Tail.
We denote by θ the (unknown) probability P(H).
Estimation task: given a sequence of toss samples x[1], x[2], …, x[M], we want to estimate the probabilities P(H) = θ and P(T) = 1 − θ.
Statistical Parameter Fitting

Consider instances x[1], x[2], …, x[M] such that:
- the set of values that x can take is known
- each is sampled from the same distribution
- each is sampled independently of the rest (i.i.d. samples)

The task is to find a parameter θ so that the data can be summarized by a probability P(x[j] | θ).
This depends on the given family of probability distributions: multinomial, Gaussian, Poisson, etc. For now, we focus on multinomial distributions.
The Likelihood Function

How good is a particular θ? It depends on how likely it is to generate the observed data:

L(θ : D) = P(D | θ) = ∏_m P(x[m] | θ)

The likelihood for the sequence H, T, T, H, H is

L(θ : D) = θ · (1 − θ) · (1 − θ) · θ · θ

[Figure: L(θ : D) plotted as a function of θ over [0, 1].]
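As a minimal sketch (function names are illustrative), the likelihood of a toss sequence can be computed directly from its definition as a product over the samples:

```python
# Sketch of the likelihood function L(theta : D) for the thumbtack
# example; theta is the assumed value of P(H).

def likelihood(theta, tosses):
    """L(theta : D) = prod_m P(x[m] | theta) for i.i.d. tosses."""
    L = 1.0
    for x in tosses:
        L *= theta if x == "H" else (1.0 - theta)
    return L

# The sequence from the slide: H, T, T, H, H
D = ["H", "T", "T", "H", "H"]
print(likelihood(0.6, D))  # theta^3 * (1 - theta)^2 at theta = 0.6 (~0.0346)
```

Evaluating this for many values of θ traces out the likelihood curve in the figure.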
Sufficient Statistics

To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails):

L(θ : D) = θ^(N_H) · (1 − θ)^(N_T)

N_H and N_T are sufficient statistics for the binomial distribution.
Sufficient Statistics

A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
Formally, s(D) is a sufficient statistic if, for any two datasets D and D′:

s(D) = s(D′)  ⇒  L(θ : D) = L(θ : D′)

[Figure: many different datasets map to the same statistics.]
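The sufficiency of (N_H, N_T) can be illustrated with a small check (helper names are illustrative): two datasets with the same counts but different toss orders yield the same likelihood for every θ:

```python
# Sketch: datasets with equal sufficient statistics (N_H, N_T)
# induce the same likelihood function, regardless of toss order.

def counts(tosses):
    """The sufficient statistics s(D) = (N_H, N_T)."""
    return tosses.count("H"), tosses.count("T")

def likelihood(theta, tosses):
    n_h, n_t = counts(tosses)
    return theta**n_h * (1.0 - theta)**n_t

D1 = ["H", "T", "T", "H", "H"]
D2 = ["T", "H", "H", "H", "T"]  # same counts, different order

assert counts(D1) == counts(D2) == (3, 2)
for theta in [0.1, 0.3, 0.5, 0.7, 0.9]:
    assert likelihood(theta, D1) == likelihood(theta, D2)
```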
Maximum Likelihood Estimation
MLE Principle:
Choose parameters that maximize the likelihood function
This is one of the most commonly used estimators in statistics
Intuitively appealing
Example: MLE in Binomial Data

Applying the MLE principle we get

θ̂ = N_H / (N_H + N_T)

(which coincides with what one would expect).

Example: for (N_H, N_T) = (3, 2), the MLE estimate is 3/5 = 0.6.

[Figure: likelihood curve L(θ : D) over θ ∈ [0, 1], peaking at 0.6.]
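A short sketch of the closed-form MLE, cross-checked against a coarse grid search over θ (names are illustrative):

```python
# Sketch: the closed-form binomial MLE theta_hat = N_H / (N_H + N_T),
# verified by brute-force maximization of the likelihood.

def mle(n_h, n_t):
    return n_h / (n_h + n_t)

n_h, n_t = 3, 2
theta_hat = mle(n_h, n_t)  # 3/5 = 0.6

# Grid search confirms the likelihood peaks at the same place.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda t: t**n_h * (1 - t)**n_t)
print(theta_hat, best)  # 0.6 0.6
```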
Learning Parameters for a Bayesian Network

[Network: E → A ← B, with A → C]

Training data has the form:

D = ( E[1] B[1] A[1] C[1] ; … ; E[M] B[M] A[M] C[M] )
Learning Parameters for a Bayesian Network

Since we assume i.i.d. samples, the likelihood function is

L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ)
Learning Parameters for a Bayesian Network

By the definition of the network, we get

L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ)
         = ∏_m P(E[m] : Θ) · P(B[m] : Θ) · P(A[m] | B[m], E[m] : Θ) · P(C[m] | A[m] : Θ)
Learning Parameters for a Bayesian Network

Rewriting terms, we get

L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ)
         = [∏_m P(E[m] : Θ)] · [∏_m P(B[m] : Θ)] · [∏_m P(A[m] | B[m], E[m] : Θ)] · [∏_m P(C[m] | A[m] : Θ)]
General Bayesian Networks

Generalizing for any Bayesian network:

L(Θ : D) = ∏_m P(x_1[m], …, x_n[m] : Θ)          (i.i.d. samples)
         = ∏_m ∏_i P(x_i[m] | Pa_i[m] : Θ_i)      (network factorization)
         = ∏_i ∏_m P(x_i[m] | Pa_i[m] : Θ_i)
         = ∏_i L(Θ_i : D)

The likelihood decomposes according to the structure of the network.
General Bayesian Networks (Cont.)
Decomposition ⇒ independent estimation problems: if the parameters for each family are not related, then they can be estimated independently of each other.
From Binomial to Multinomial

For example, suppose X can take the values 1, 2, …, K. We want to learn the parameters θ_1, θ_2, …, θ_K.

Sufficient statistics: N_1, N_2, …, N_K, the number of times each outcome is observed.

Likelihood function:  L(Θ : D) = ∏_{k=1..K} θ_k^(N_k)

MLE:  θ̂_k = N_k / N
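A minimal sketch of the multinomial MLE as normalized counts (integer-labeled outcomes are used purely for illustration):

```python
from collections import Counter

# Sketch: the MLE for a multinomial over K outcomes is just the
# normalized counts theta_hat_k = N_k / N.

def multinomial_mle(samples):
    n = len(samples)
    return {k: c / n for k, c in Counter(samples).items()}

data = [1, 2, 2, 3, 3, 3, 1, 2, 3, 3]   # N1 = 2, N2 = 3, N3 = 5
print(multinomial_mle(data))  # {1: 0.2, 2: 0.3, 3: 0.5}
```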
Likelihood for Multinomial Networks

When we assume that P(X_i | Pa_i) is multinomial, we get further decomposition:

L(Θ_i : D) = ∏_m P(x_i[m] | Pa_i[m] : Θ_i)
           = ∏_{pa_i} [ ∏_{m : Pa_i[m] = pa_i} P(x_i[m] | pa_i : Θ_i) ]
           = ∏_{pa_i} ∏_{x_i} P(x_i | pa_i : Θ_i)^(N(x_i, pa_i))
           = ∏_{pa_i} ∏_{x_i} θ_{x_i|pa_i}^(N(x_i, pa_i))
Likelihood for Multinomial Networks

For each value pa_i of the parents of X_i we get an independent multinomial problem:

L(Θ_{X_i|pa_i} : D) = ∏_{x_i} θ_{x_i|pa_i}^(N(x_i, pa_i))

The MLE is

θ̂_{x_i|pa_i} = N(x_i, pa_i) / N(pa_i)
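The per-family MLE above reduces to counting. A sketch, assuming complete data given as a list of dicts (all names are illustrative):

```python
from collections import Counter

# Sketch: MLE for a CPT P(X | Pa) from complete data,
# theta_hat[x | pa] = N(x, pa) / N(pa).

def cpt_mle(data, child, parents):
    joint = Counter()   # N(x, pa)
    marg = Counter()    # N(pa)
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(row[child], pa)] += 1
        marg[pa] += 1
    return {(x, pa): n / marg[pa] for (x, pa), n in joint.items()}

# Complete data over the alarm-style network: E, B, A.
data = [
    {"E": "Y", "B": "N", "A": "N"},
    {"E": "Y", "B": "Y", "A": "Y"},
    {"E": "N", "B": "N", "A": "Y"},
    {"E": "N", "B": "Y", "A": "Y"},
]
cpt = cpt_mle(data, "A", ["E", "B"])
print(cpt[("Y", ("Y", "Y"))])  # 1.0
```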
Maximum Likelihood Estimation
Consistency: the estimate converges to the best possible value as the number of examples grows.
To make this formal, we need to introduce some definitions.
KL-Divergence

Let P and Q be two distributions over X. A measure of distance between P and Q is the Kullback-Leibler divergence:

KL(P||Q) = ∑_x P(x) log( P(x) / Q(x) )

KL(P||Q) = 1 (when logs are in base 2) means that, on average, the probability P assigns to an instance is twice the probability Q assigns to it.
KL(P||Q) ≥ 0, and KL(P||Q) = 0 iff P and Q are equal.
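A minimal sketch of the KL divergence with base-2 logs, assuming Q(x) > 0 wherever P(x) > 0 (names are illustrative):

```python
import math

# Sketch: KL(P||Q) = sum_x P(x) * log2(P(x) / Q(x)).
# Terms with P(x) = 0 contribute 0 by convention.

def kl(p, q):
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
print(kl(p, p))  # 0.0
print(kl(p, q))  # > 0 (~0.208)
```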
Consistency

Let P(X | Θ) be a parametric family (we need various regularity conditions that we won't go into now).
Let P*(X) be the distribution that generates the data, and let θ̂_D be the MLE estimate given a dataset D.

Thm: as N → ∞, θ̂_D → θ* with probability 1, where

θ* = argmin_Θ KL( P*(X) || P(X : Θ) )
Consistency -- Geometric Interpretation

[Figure: in the space of probability distributions, P* lies outside the set of distributions representable as P(X | Θ); P(X | θ*) is the member of that set closest to P* in KL divergence.]
Is MLE all we need?
Suppose that after 10 observations, the MLE is P(H) = 0.7 for the thumbtack. Would you bet on heads for the next toss?
Suppose now that after 10 observations, the MLE is P(H) = 0.7 for a coin. Would you place the same bet?
Bayesian Inference
Frequentist approach:
- assumes there is an unknown but fixed parameter θ
- estimates θ with some confidence
- predicts using the estimated parameter value

Bayesian approach:
- represents uncertainty about the unknown parameter θ
- uses probability to quantify this uncertainty: unknown parameters are treated as random variables
- prediction follows from the rules of probability: expectation over the unknown parameters
Bayesian Inference (cont.)
We can represent our uncertainty about the sampling process using a Bayesian network:

[Figure: θ is a parent of X[1], X[2], …, X[m] (observed data) and X[m+1] (query).]

The values of X are independent given θ.
The conditional probabilities P(x[m] | θ) are the parameters in the model.
Prediction is now inference in this network.
Bayesian Inference (cont.)
Prediction as inference in this network:

P(x[M+1] | x[1], …, x[M])
  = ∫ P(x[M+1] | θ, x[1], …, x[M]) · P(θ | x[1], …, x[M]) dθ
  = ∫ P(x[M+1] | θ) · P(θ | x[1], …, x[M]) dθ

where

P(θ | x[1], …, x[M]) = P(x[1], …, x[M] | θ) · P(θ) / P(x[1], …, x[M])

(posterior = likelihood × prior / probability of the data)
Example: Binomial Data Revisited
Prior: uniform for θ in [0, 1], i.e. P(θ) = 1.
Then P(θ | D) is proportional to the likelihood L(θ : D):

P(θ | x[1], …, x[M]) ∝ P(x[1], …, x[M] | θ) · P(θ)

With (N_H, N_T) = (4, 1):
- the MLE for P(X = H) is 4/5 = 0.8
- the Bayesian prediction is

P(x[M+1] = H | D) = ∫ θ · P(θ | D) dθ = 5/7 ≈ 0.7142

[Figure: posterior density over θ ∈ [0, 1].]
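This computation can be sketched directly: with a uniform prior (a Beta(1,1) distribution) the posterior after (N_H, N_T) observations is Beta(N_H + 1, N_T + 1), and the Bayesian prediction is its mean, i.e. Laplace's rule of succession (function names are illustrative):

```python
# Sketch: Bayesian prediction under a uniform prior vs. the MLE.
# Posterior is Beta(N_H + 1, N_T + 1); prediction is its mean.

def bayes_predict_heads(n_h, n_t):
    return (n_h + 1) / (n_h + n_t + 2)

def mle_heads(n_h, n_t):
    return n_h / (n_h + n_t)

print(mle_heads(4, 1))            # 0.8
print(bayes_predict_heads(4, 1))  # 5/7 ~ 0.714
```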
Bayesian Inference and MLE
In our example, MLE and Bayesian prediction differ.
But if the prior is well-behaved (it does not assign 0 density to any "feasible" parameter value), then both MLE and Bayesian prediction converge to the same value: both are consistent.
Dirichlet Priors
Recall that the likelihood function is

L(Θ : D) = ∏_{k=1..K} θ_k^(N_k)

A Dirichlet prior with hyperparameters α_1, …, α_K is defined as

P(Θ) ∝ ∏_{k=1..K} θ_k^(α_k − 1)    for legal θ_1, …, θ_K

Then the posterior has the same form, with hyperparameters α_1 + N_1, …, α_K + N_K:

P(Θ | D) ∝ P(Θ) · P(D | Θ) ∝ ∏_{k=1..K} θ_k^(α_k − 1) · ∏_{k=1..K} θ_k^(N_k) = ∏_{k=1..K} θ_k^(α_k + N_k − 1)
Dirichlet Priors (cont.)
We can compute the prediction on a new event in closed form. If P(Θ) is Dirichlet with hyperparameters α_1, …, α_K, then

P(X[1] = k) = ∫ θ_k · P(Θ) dΘ = α_k / (∑_ℓ α_ℓ)

Since the posterior is also Dirichlet, we get

P(X[M+1] = k | D) = ∫ θ_k · P(Θ | D) dΘ = (α_k + N_k) / (α + N),  where α = ∑_ℓ α_ℓ and N = ∑_ℓ N_ℓ
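The conjugate update and the closed-form prediction can be sketched together (the hyperparameters and counts below are illustrative):

```python
# Sketch: Dirichlet posterior update alpha_k -> alpha_k + N_k,
# and the closed-form predictive (alpha_k + N_k) / (alpha + N).

def dirichlet_posterior(alphas, counts):
    return [a + n for a, n in zip(alphas, counts)]

def predictive(alphas):
    total = sum(alphas)
    return [a / total for a in alphas]

prior = [1.0, 1.0, 1.0]   # uniform prior over a 3-valued X
counts = [2, 3, 5]        # observed N_1, N_2, N_3
post = dirichlet_posterior(prior, counts)
print(post)               # [3.0, 4.0, 6.0]
print(predictive(post))   # [3/13, 4/13, 6/13] ~ [0.231, 0.308, 0.462]
```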
Dirichlet Priors -- Example
[Figure: densities of Dirichlet(0.5, 0.5), Dirichlet(1, 1), Dirichlet(2, 2), and Dirichlet(5, 5) over θ ∈ [0, 1].]
Prior Knowledge
The hyperparameters α_1, …, α_K can be thought of as "imaginary" counts from our prior experience.
Equivalent sample size = α_1 + … + α_K.
The larger the equivalent sample size, the more confident we are in our prior.
Effect of Priors
Prediction of P(X = H) after seeing data with N_H = 0.25 · N_T, for different sample sizes.

[Figure: two panels of predictions vs. number of samples (0–100). Left: different strengths α_H + α_T with a fixed ratio α_H / α_T. Right: fixed strength α_H + α_T with different ratios α_H / α_T.]
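The effect in the left panel can be sketched numerically: the prediction (α_H + N_H) / (α + N) for priors of the same ratio but different strengths, as the data grows (the numbers below are illustrative):

```python
# Sketch: stronger priors (larger equivalent sample size) pull the
# prediction toward the prior for longer; all converge to 0.2 since
# the data has N_H = 0.25 * N_T.

def predict(alpha_h, alpha_t, n_h, n_t):
    return (alpha_h + n_h) / (alpha_h + alpha_t + n_h + n_t)

for strength in [2, 10, 50]:  # equivalent sample size alpha_H + alpha_T
    alpha_h = alpha_t = 0.5 * strength
    row = [round(predict(alpha_h, alpha_t, n, 4 * n), 3)
           for n in [1, 5, 25, 125]]
    print(strength, row)  # stronger priors converge more slowly
```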
Effect of Priors (cont.)

In real data, Bayesian estimates are less sensitive to noise in the data.

[Figure: P(X = 1 | D) as a function of N for the MLE and for Dirichlet(.5,.5), Dirichlet(1,1), Dirichlet(5,5), and Dirichlet(10,10) priors, over a noisy 0/1 toss sequence; stronger priors give smoother estimates.]
Conjugate Families

The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy. The Dirichlet prior is a conjugate family for the multinomial likelihood.

Conjugate families are useful since:
- for many distributions, we can represent them with hyperparameters
- they allow for sequential update within the same representation
- in many cases, we have a closed-form solution for prediction
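The sequential-update property can be sketched as follows: updating the hyperparameters one observation at a time reaches the same posterior as a single batch update (names are illustrative):

```python
from collections import Counter

# Sketch: conjugacy lets us update the Dirichlet hyperparameters
# sequentially and reach the same posterior as a batch update.

def update(alphas, x):
    new = dict(alphas)
    new[x] += 1
    return new

prior = {"H": 1.0, "T": 1.0}
data = ["H", "T", "T", "H", "H"]

seq = prior
for x in data:                 # one observation at a time
    seq = update(seq, x)

batch = {k: prior[k] + Counter(data)[k] for k in prior}
print(seq, batch)              # identical posteriors
```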
Bayesian Networks and Bayesian Prediction
Priors for each parameter group are independent.
Data instances are independent given the unknown parameters.

[Figure: Bayesian network with parameter nodes θ_X and θ_Y|X governing X[1..M+1] and Y[1..M+1]; the observed data X[1..M], Y[1..M] and the query X[M+1], Y[M+1] are shown both unrolled and in plate notation.]
Bayesian Networks and Bayesian Prediction (Cont.)
We can also "read" from the network: given complete data, the posteriors on the parameters are independent.

[Figure: same model as above, unrolled and in plate notation.]
Bayesian Prediction (cont.)

Since the posteriors on the parameters for each family are independent, we can compute them separately.
Posteriors for parameters within families are also independent: given complete data, the posteriors on θ_Y|X=0 and θ_Y|X=1 are independent.

[Figure: refined plate model in which the single node θ_Y|X is split into separate nodes θ_Y|X=0 and θ_Y|X=1.]
Bayesian Prediction (cont.)

Given these observations, we can compute the posterior for each multinomial θ_{X_i | pa_i} independently.
The posterior is Dirichlet with parameters

α(X_i = 1 | pa_i) + N(X_i = 1, pa_i), …, α(X_i = k | pa_i) + N(X_i = k, pa_i)

The predictive distribution is then represented by the parameters

θ̃_{x_i|pa_i} = ( α(x_i, pa_i) + N(x_i, pa_i) ) / ( α(pa_i) + N(pa_i) )
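A minimal sketch of this Bayesian CPT estimate with uniform pseudo-counts α(x, pa) = α, compared with the MLE (the choice α = 1 and all names are illustrative):

```python
from collections import Counter

# Sketch: Bayesian estimate of a CPT P(X | Pa) with Dirichlet
# pseudo-counts, theta_tilde = (alpha + N(x, pa)) / (alpha*K + N(pa)).

def cpt_bayes(data, child, parents, x_values, alpha=1.0):
    joint, marg = Counter(), Counter()
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(row[child], pa)] += 1
        marg[pa] += 1
    k = len(x_values)
    return {(x, pa): (alpha + joint[(x, pa)]) / (alpha * k + marg[pa])
            for pa in marg for x in x_values}

data = [{"E": "Y", "B": "Y", "A": "Y"},
        {"E": "Y", "B": "Y", "A": "Y"}]
cpt = cpt_bayes(data, "A", ["E", "B"], ["Y", "N"])
print(cpt[("Y", ("Y", "Y"))])  # (1 + 2) / (1*2 + 2) = 0.75, not the MLE's 1.0
```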
Assessing Priors for Bayesian Networks

We need the α(x_i, pa_i) for each node X_i.
We can use the initial parameters Θ_0 as prior information, together with an equivalent sample size parameter M_0.
Then we let α(x_i, pa_i) = M_0 · P(x_i, pa_i | Θ_0).
This allows us to update a network using new data.
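A tiny sketch of deriving the pseudo-counts from M_0 and a prior network's joint probabilities (the joint values below are made-up illustrations, not from the slides):

```python
# Sketch: alpha(x, pa) = M_0 * P(x, pa | Theta_0), where the joint
# probabilities come from an initial prior network Theta_0.

def alphas_from_prior(m0, joint_prior):
    """joint_prior maps (x, pa) -> P(x, pa | Theta_0)."""
    return {key: m0 * p for key, p in joint_prior.items()}

# Hypothetical prior joint over (A, (E, B)) for one parent configuration:
joint = {("Y", ("Y", "Y")): 0.02, ("N", ("Y", "Y")): 0.005}
print(alphas_from_prior(10, joint))
```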
Learning Parameters: Case Study (cont.)
Experiment:
- sample a stream of instances from the alarm network
- learn parameters using (a) the MLE estimator and (b) the Bayesian estimator with a uniform prior of different strengths
Learning Parameters: Case Study (cont.)
[Figure: KL divergence between the learned and true networks as a function of the number of samples M (0–5000), for the MLE and for Bayesian estimation with uniform priors of strength M' = 5, 10, 20, 50.]
Learning Parameters: Summary

Estimation relies on sufficient statistics; for multinomials these are of the form N(x_i, pa_i).

Parameter estimation:

MLE:                   θ̂_{x_i|pa_i} = N(x_i, pa_i) / N(pa_i)
Bayesian (Dirichlet):  θ̃_{x_i|pa_i} = ( α(x_i, pa_i) + N(x_i, pa_i) ) / ( α(pa_i) + N(pa_i) )

Bayesian methods also require a choice of priors.
Both MLE and Bayesian estimation are asymptotically equivalent and consistent.
Both can be implemented in an on-line manner by accumulating sufficient statistics.