Structure learning in Bayesian networks
(Transcript of slides: ce.sharif.edu/courses/96-97/2/ce768-1/resources/root/slides/Structure Learning.pdf)
Probabilistic Graphical Models
Sharif University of Technology
Spring 2018
Soleymani
Recall: Learning problem
Target: the true distribution $P^*$, which may correspond to a model $\mathcal{M}^* = \langle \mathcal{K}^*, \boldsymbol{\theta}^* \rangle$
Hypothesis space: a specified family of probabilistic graphical models
Data: a set of instances sampled from $P^*$
Learning goal: select a model $\mathcal{M}$ that constructs the best approximation to $P^*$ according to a performance metric
We may use Bayesian model averaging instead of selecting a single model $\mathcal{M}$
Structure learning of Bayesian networks
For a fixed set of variables (i.e., nodes), we intend to learn the directed graph structure $\mathcal{G}^*$:
Which edges are included in the model?
Too few edges miss dependencies; too many edges cause spurious dependencies
We often prefer sparser structures, which tend to show better generalization
Learning the structure of DGMs
Approaches
Constraint-based
Find conditional independencies in data using statistical tests
Find an I-map graph for the obtained conditional independencies
Score-based
View BN structure and parameter learning as a density estimation task and address structure learning as a model selection problem
Score function measures how well the model fits the observed data
Maximum likelihood
Bayesian score
Score-based approach
Input:
Hypothesis space: set of possible structures
Training data
Output: A network that maximizes the score function
Thus, we need a score function (including priors, if needed)
and a search strategy (i.e. optimization procedure) to find a
structure in the hypothesis space
Structural search
Hypothesis space
General graphs: the number of graphs on $n$ nodes is $O(2^{n^2})$, i.e., super-exponential in $n$
Trees: the number of trees on $n$ nodes is $O(n!)$
This is a combinatorial optimization problem: many heuristics exist to solve it, but none guarantees attaining the optimum
Likelihood score
Find the $\mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}}$ that maximize the likelihood:

$$\max_{\mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}}} P(\mathcal{D} \mid \mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}}) = \max_{\mathcal{G}} \left[ \max_{\boldsymbol{\theta}_{\mathcal{G}}} P(\mathcal{D} \mid \mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}}) \right] = \max_{\mathcal{G}} P(\mathcal{D} \mid \mathcal{G}, \hat{\boldsymbol{\theta}}_{\mathcal{G}})$$

Likelihood score: $\mathrm{Score}_L(\mathcal{G}; \mathcal{D}) = P(\mathcal{D} \mid \mathcal{G}, \hat{\boldsymbol{\theta}}_{\mathcal{G}})$, where $\hat{\boldsymbol{\theta}}_{\mathcal{G}}$ are the maximum-likelihood parameters for structure $\mathcal{G}$
Likelihood score: Information-theoretic interpretation
Given the structure, the log-likelihood of the data decomposes as:

$$\ln P(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_{\mathcal{G}}, \mathcal{G}) = \ln \prod_{n=1}^{N} P(\boldsymbol{x}^{(n)} \mid \hat{\boldsymbol{\theta}}_{\mathcal{G}}, \mathcal{G}) = \sum_{n=1}^{N} \sum_{i} \ln P\!\left(x_i^{(n)} \,\middle|\, \boldsymbol{pa}_i^{(n)}, \hat{\boldsymbol{\theta}}_{X_i \mid \boldsymbol{Pa}_i}, \mathcal{G}\right)$$

Grouping the $N$ samples by the joint configuration of $(X_i, \boldsymbol{Pa}_i)$:

$$= N \sum_{i} \sum_{x_i \in Val(X_i)} \sum_{\boldsymbol{pa}_i \in Val(\boldsymbol{Pa}_i)} \frac{Count(x_i, \boldsymbol{pa}_i)}{N} \ln \hat{P}(x_i \mid \boldsymbol{pa}_i)$$

Since $Count(x_i, \boldsymbol{pa}_i)/N = \hat{P}(x_i, \boldsymbol{pa}_i)$, multiplying and dividing by $\hat{P}(x_i)$ inside the logarithm gives:

$$\ln P(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_{\mathcal{G}}, \mathcal{G}) = N \sum_{i} \sum_{x_i} \sum_{\boldsymbol{pa}_i} \hat{P}(x_i, \boldsymbol{pa}_i) \ln \frac{\hat{P}(x_i \mid \boldsymbol{pa}_i)\,\hat{P}(x_i)}{\hat{P}(x_i)} = N \sum_{i} \left[ I_{\hat{P}}(X_i; \boldsymbol{Pa}_i^{\mathcal{G}}) - H_{\hat{P}}(X_i) \right]$$

where $\hat{P}$ is the empirical distribution. This score measures the strength of the dependencies between variables and their parents.
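The identity above can be checked numerically. Below is a minimal Python sketch (not from the slides; the data and names are illustrative) that computes the maximized log-likelihood of a two-node network X -> Y directly from counts and via the $N \sum_i [I - H]$ form.

```python
import math
from collections import Counter

def entropy(samples):
    """Empirical entropy H_Phat(X) in nats."""
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())

def mutual_info(xs, ys):
    """Empirical mutual information I_Phat(X; Y) in nats."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Toy data for a network X -> Y, so Pa(Y) = {X} and Pa(X) is empty
xs = [0, 0, 0, 1, 1, 1, 1, 0]
ys = [0, 0, 1, 1, 1, 0, 1, 0]
N = len(xs)

# Direct log-likelihood at the MLE: sum over samples of log Phat(x) + log Phat(y|x)
px = Counter(xs)
pxy = Counter(zip(xs, ys))
direct = sum(math.log(px[x] / N) + math.log(pxy[(x, y)] / px[x])
             for x, y in zip(xs, ys))

# Information-theoretic form: N * [ (I(X;Pa_X) - H(X)) + (I(Y;Pa_Y) - H(Y)) ]
# For the root node X, I(X; Pa_X) = 0.
info_form = N * ((0.0 - entropy(xs)) + (mutual_info(ys, xs) - entropy(ys)))
```

The two quantities agree for any dataset, since both expressions are algebraically the same sum over samples.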
Limitations of likelihood score
We have $I_{\hat{P}}(X; Y) \ge 0$, and $I_{\hat{P}}(X; Y) = 0$ iff $\hat{P}(X, Y) = \hat{P}(X)\,\hat{P}(Y)$
In the training data, almost always $I_{\hat{P}}(X; Y) > 0$: due to statistical noise, exact independence almost never occurs in the empirical distribution
Moreover, adding parents never decreases mutual information: $I(X; \boldsymbol{Y} \cup \boldsymbol{Z}) \ge I(X; \boldsymbol{Y})$, with equality iff $X \perp \boldsymbol{Z} \mid \boldsymbol{Y}$
Thus, the ML score never prefers the simpler network; the score is maximized by a fully connected network
Therefore the likelihood score fails to generalize well
Avoiding overfitting
Restrict the hypothesis space explicitly: limit the number of parents or the number of parameters
Impose a penalty on the complexity of the structure: e.g., the BIC and Bayesian scores
Tree structure: Chow-Liu algorithm
Compute $w_{ij} = I_{\hat{P}}(X_i; X_j)$ for each pair of nodes $X_i$ and $X_j$
Find a maximum-weight spanning tree (MWST), i.e., the tree with the greatest total weight:
$$\max_{T} \sum_{(i,j) \in T} w_{ij}$$
Give directions to the edges of the MWST: pick an arbitrary node as the root and draw arrows pointing away from it (e.g., using BFS)
In this way, we can find an optimal tree in polynomial time
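The three steps above can be sketched in Python (illustrative code, not from the slides): empirical mutual-information weights, a Kruskal-style maximum-weight spanning tree, and BFS orientation from an arbitrary root.

```python
import math
from collections import Counter, deque
from itertools import combinations

def mutual_info(xs, ys):
    """Empirical mutual information between two columns (in nats)."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def chow_liu(data):
    """data: list of samples, each a tuple of discrete values.
    Returns directed tree edges (parent, child) over column indices."""
    cols = list(zip(*data))
    d = len(cols)
    # 1) Pairwise MI weights
    w = {(i, j): mutual_info(cols[i], cols[j]) for i, j in combinations(range(d), 2)}
    # 2) Maximum-weight spanning tree (Kruskal with union-find)
    parent = list(range(d))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    tree = []
    for (i, j), _ in sorted(w.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    # 3) Orient edges away from an arbitrary root (BFS from node 0)
    adj = {i: [] for i in range(d)}
    for i, j in tree:
        adj[i].append(j)
        adj[j].append(i)
    directed, seen, q = [], {0}, deque([0])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                directed.append((u, v))
                q.append(v)
    return directed
```

Steps 2 and 3 are both polynomial; computing all pairwise MI weights dominates the cost for wide datasets.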
BIC score
$$\mathrm{Score}_{BIC}(\mathcal{G}; \mathcal{D}) = \log P(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_{\mathcal{G}}, \mathcal{G}) - \frac{\log N}{2} Dim(\mathcal{G}) = N \sum_i I_{\hat{P}}(X_i; \boldsymbol{Pa}_i^{\mathcal{G}}) - N \sum_i H_{\hat{P}}(X_i) - \frac{\log N}{2} Dim(\mathcal{G})$$

$Dim(\mathcal{G})$ is the model complexity (number of independent parameters); the entropy term does not depend on $\mathcal{G}$ and can be ignored when comparing structures
Its negation is also known as Minimum Description Length (MDL)
The larger $N$ is, the more emphasis is given to the fit to data: the mutual-information (likelihood) term grows linearly with $N$, while the complexity penalty grows only logarithmically with $N$
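As a concrete illustration (not from the slides; names are illustrative), the BIC score of a discrete network can be computed family by family from counts, with $Dim(\mathcal{G})$ counted as independent parameters per family.

```python
import math
from collections import Counter

def family_bic(child, parents, data, card):
    """BIC family score for one node (natural log).
    child: column index; parents: list of column indices;
    data: list of tuples; card: cardinality of each column."""
    N = len(data)
    # Counts N(x_i, pa_i) and N(pa_i)
    joint = Counter((row[child], tuple(row[p] for p in parents)) for row in data)
    pa_counts = Counter(tuple(row[p] for p in parents) for row in data)
    # Maximized log-likelihood: sum of N(x,pa) * log( N(x,pa) / N(pa) )
    ll = sum(c * math.log(c / pa_counts[pa]) for (x, pa), c in joint.items())
    # Independent parameters: (|Val(X)| - 1) * prod of parent cardinalities
    dim = (card[child] - 1) * math.prod(card[p] for p in parents)
    return ll - 0.5 * math.log(N) * dim

def bic_score(structure, data, card):
    """structure: dict child -> list of parents; sum of family scores."""
    return sum(family_bic(i, ps, data, card) for i, ps in structure.items())
```

On data where one variable strongly depends on another, the structure with that edge outscores the empty graph once $N$ is moderately large, because the $N \cdot I$ gain outpaces the $\frac{\log N}{2}$ penalty.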
Asymptotic consistency of BIC score
Definition: A scoring function is consistent if, as $N \to \infty$, with probability $\to 1$:
$\mathcal{G}^*$ (or any I-equivalent structure) maximizes the score
All non-I-equivalent structures have strictly lower score
Theorem: As $N \to \infty$, the structure that maximizes the BIC score is the true structure $\mathcal{G}^*$ (or an I-equivalent structure)
Asymptotically, spurious dependencies do not contribute to the likelihood and are penalized
Required dependencies are added, because the likelihood term grows linearly with $N$ while the model-complexity penalty grows only logarithmically
Bayesian score
Uncertainty over both parameters and structure: average over all possible parameter values and also place a prior on the structure

Bayesian score:
$$\mathrm{Score}_B(\mathcal{G}; \mathcal{D}) = \log P(\mathcal{D} \mid \mathcal{G}) + \log P(\mathcal{G})$$

This follows from $P(\mathcal{G} \mid \mathcal{D}) = \dfrac{P(\mathcal{D} \mid \mathcal{G})\, P(\mathcal{G})}{P(\mathcal{D})}$, where $P(\mathcal{D} \mid \mathcal{G})$ is the marginal likelihood, $P(\mathcal{G})$ is the structure prior, and $P(\mathcal{D})$ is independent of $\mathcal{G}$

To find the marginal likelihood, we integrate over the whole distribution of parameters $\boldsymbol{\theta}_{\mathcal{G}}$ for structure $\mathcal{G}$:
$$P(\mathcal{D} \mid \mathcal{G}) = \int_{\boldsymbol{\theta}_{\mathcal{G}}} P(\mathcal{D} \mid \mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}})\, P(\boldsymbol{\theta}_{\mathcal{G}} \mid \mathcal{G})\, d\boldsymbol{\theta}_{\mathcal{G}}$$
Marginal likelihood
By the chain rule over samples:

$$P(\mathcal{D} \mid \mathcal{G}) = P(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N)} \mid \mathcal{G}) = P(\boldsymbol{x}^{(1)} \mid \mathcal{G}) \times P(\boldsymbol{x}^{(2)} \mid \boldsymbol{x}^{(1)}, \mathcal{G}) \times \cdots \times P(\boldsymbol{x}^{(N)} \mid \boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N-1)}, \mathcal{G})$$

For a single variable with $K$ values and a Dirichlet prior with hyperparameters $\alpha_1, \ldots, \alpha_K$, where $\alpha = \sum_{k=1}^{K} \alpha_k$:

$$P(x^{(1)}, \ldots, x^{(N)}) = \frac{\prod_{k=1}^{K} \alpha_k (\alpha_k + 1) \cdots (\alpha_k + N[k] - 1)}{\alpha (\alpha + 1) \cdots (\alpha + N - 1)} = \frac{\Gamma(\alpha)}{\Gamma(\alpha + N)} \times \prod_{k=1}^{K} \frac{\Gamma(\alpha_k + N[k])}{\Gamma(\alpha_k)}$$

where $N[k] = \sum_{n=1}^{N} I(x^{(n)} = k)$ and we use $\alpha(\alpha+1)\cdots(\alpha+N-1) = \dfrac{\Gamma(\alpha+N)}{\Gamma(\alpha)}$
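The equality of the sequential product form and the Gamma-function form can be verified numerically. The sketch below (illustrative, not from the slides) computes the single-variable Dirichlet marginal likelihood both ways, in log space.

```python
import math
from collections import Counter

def marginal_seq(samples, alphas):
    """Sequential form: product over n of the Dirichlet predictive
    P(x_n = k | x_1..x_{n-1}) = (alpha_k + N_prev[k]) / (alpha + n_prev)."""
    alpha = sum(alphas)
    counts = Counter()
    logp = 0.0
    for n, x in enumerate(samples):
        logp += math.log((alphas[x] + counts[x]) / (alpha + n))
        counts[x] += 1
    return logp

def marginal_gamma(samples, alphas):
    """Closed form: Gamma(alpha)/Gamma(alpha+N) * prod_k Gamma(alpha_k+N[k])/Gamma(alpha_k)."""
    alpha, N = sum(alphas), len(samples)
    counts = Counter(samples)
    logp = math.lgamma(alpha) - math.lgamma(alpha + N)
    for k, a in enumerate(alphas):
        logp += math.lgamma(a + counts[k]) - math.lgamma(a)
    return logp
```

Because the closed form depends only on the counts $N[k]$, the marginal likelihood is exchangeable: reordering the samples does not change it, which the sequential form also respects.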
Marginal likelihood in Bayesian networks: global & local parameter independence
Let $P(\boldsymbol{\theta}_{\mathcal{G}} \mid \mathcal{G})$ satisfy global parameter independence:

$$P(\mathcal{D} \mid \mathcal{G}) = \prod_i \int_{\boldsymbol{\theta}_{X_i \mid \boldsymbol{Pa}_i}} \prod_{n=1}^{N} P\!\left(x_i^{(n)} \,\middle|\, \boldsymbol{pa}_i^{(n)}, \boldsymbol{\theta}_{X_i \mid \boldsymbol{Pa}_i}, \mathcal{G}\right) P\!\left(\boldsymbol{\theta}_{X_i \mid \boldsymbol{Pa}_i} \,\middle|\, \mathcal{G}\right) d\boldsymbol{\theta}_{X_i \mid \boldsymbol{Pa}_i}$$

Let $P(\boldsymbol{\theta}_{\mathcal{G}} \mid \mathcal{G})$ satisfy both global and local parameter independence:

$$P(\mathcal{D} \mid \mathcal{G}) = \prod_i \prod_{\boldsymbol{pa}_i \in Val(\boldsymbol{Pa}_i^{\mathcal{G}})} \int_{\boldsymbol{\theta}_{X_i \mid \boldsymbol{pa}_i}} \prod_{n:\, \boldsymbol{pa}_i^{(n)} = \boldsymbol{pa}_i} P\!\left(x_i^{(n)} \,\middle|\, \boldsymbol{pa}_i, \boldsymbol{\theta}_{X_i \mid \boldsymbol{pa}_i}, \mathcal{G}\right) P\!\left(\boldsymbol{\theta}_{X_i \mid \boldsymbol{pa}_i} \,\middle|\, \mathcal{G}\right) d\boldsymbol{\theta}_{X_i \mid \boldsymbol{pa}_i}$$

If the Dirichlet prior $P(\boldsymbol{\theta}_{X_i \mid \boldsymbol{pa}_i} \mid \mathcal{G})$ has hyperparameters $\alpha_{x_i \mid \boldsymbol{pa}_i}^{\mathcal{G}}$ for $x_i \in Val(X_i)$:

$$P(\mathcal{D} \mid \mathcal{G}) = \prod_i \prod_{\boldsymbol{pa}_i \in Val(\boldsymbol{Pa}_i^{\mathcal{G}})} \frac{\Gamma\!\left(\sum_{x_i \in Val(X_i)} \alpha_{x_i \mid \boldsymbol{pa}_i}^{\mathcal{G}}\right)}{\Gamma\!\left(\sum_{x_i \in Val(X_i)} \left[\alpha_{x_i \mid \boldsymbol{pa}_i}^{\mathcal{G}} + N(x_i, \boldsymbol{pa}_i)\right]\right)} \prod_{x_i \in Val(X_i)} \frac{\Gamma\!\left(\alpha_{x_i \mid \boldsymbol{pa}_i}^{\mathcal{G}} + N(x_i, \boldsymbol{pa}_i)\right)}{\Gamma\!\left(\alpha_{x_i \mid \boldsymbol{pa}_i}^{\mathcal{G}}\right)}$$
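Under these independence assumptions, each family contributes one closed-form factor that is cheap to compute from counts. A minimal sketch (names and the BDeu-style uniform choice of hyperparameters are assumptions, not taken from the slides):

```python
import math
from collections import Counter

def family_marginal_loglik(child, parents, data, card, ess=1.0):
    """Log marginal-likelihood contribution of one family, using Dirichlet
    priors with BDeu-style hyperparameters alpha_{x|pa} = ess / (r * q),
    where r = |Val(X)| and q = number of parent configurations."""
    r = card[child]
    q = math.prod(card[p] for p in parents)
    a_xpa = ess / (r * q)          # per-cell hyperparameter
    a_pa = ess / q                 # per-parent-configuration total
    joint = Counter((row[child], tuple(row[p] for p in parents)) for row in data)
    pa_counts = Counter(tuple(row[p] for p in parents) for row in data)
    logp = 0.0
    # Parent configurations with zero counts contribute Gamma(a)/Gamma(a) = 1,
    # so iterating over observed configurations only is enough.
    for pa, npa in pa_counts.items():
        logp += math.lgamma(a_pa) - math.lgamma(a_pa + npa)
        for x in range(r):
            nxpa = joint.get((x, pa), 0)
            logp += math.lgamma(a_xpa + nxpa) - math.lgamma(a_xpa)
    return logp
```

By global and local parameter independence, the full log marginal likelihood is the sum of these family contributions over all nodes.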
Asymptotic behavior of Bayesian score
As $N \to \infty$, a network $\mathcal{G}$ with Dirichlet priors satisfies:
$$\log P(\mathcal{D} \mid \mathcal{G}) = \log P(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_{\mathcal{G}}, \mathcal{G}) - \frac{\log N}{2} Dim(\mathcal{G}) + O(1)$$
where the $O(1)$ term does not grow with $N$
Thus, the Bayesian score is asymptotically equivalent to BIC and is asymptotically consistent
Priors
Structure prior $P(\mathcal{G})$:
Uniform prior: $P(\mathcal{G}) \propto$ constant
Prior penalizing the number of edges: $P(\mathcal{G}) \propto c^{|E(\mathcal{G})|}$ with $0 < c < 1$
Prior penalizing the number of parameters
The normalizing constant can be ignored

Parameter priors $P(\boldsymbol{\theta} \mid \mathcal{G})$: usually instantiated as the BDe prior:
$$\alpha_{x_i, \boldsymbol{pa}_i}^{\mathcal{G}} = \alpha\, P_{B_0}(x_i, \boldsymbol{pa}_i)$$
where $\alpha$ is the equivalent sample size and $B_0$ is a prior network representing $P'$
A single network $B_0$ provides priors for all candidate networks, even though $\boldsymbol{Pa}_i^{\mathcal{G}}$ need not coincide with the parents of $X_i$ in $B_0$
BDe is the unique prior with the property that I-equivalent networks have the same Bayesian score
Score Equivalence
A score function satisfies score equivalence when all I-equivalent networks have the same score
The likelihood score and BIC satisfy score equivalence
General graph structure
Theorem: The problem of learning a BN structure with at most $d$ ($d \ge 2$) parents per node is NP-hard.
Search the space of graph structures using local search algorithms, such as hill climbing and simulated annealing
Exploit score decomposition during the search to alleviate the computational overhead
Optimization by local searches
Graphs as states
Score function is used as the objective function
Neighbors of a state are found using these modifications:
Edge addition
Edge deletion
Edge reversal
We only consider legal neighbors, i.e., modifications that keep the graph a DAG
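The local-search loop described above can be sketched as follows (illustrative code, not from the slides): a structure is a dict mapping each node to its parent set, the three moves generate candidate neighbors, and each candidate is checked for acyclicity before being scored.

```python
def is_dag(structure, d):
    """structure: dict node -> set of parents. Kahn's algorithm for acyclicity."""
    indeg = {i: len(structure[i]) for i in range(d)}
    children = {i: [c for c in range(d) if i in structure[c]] for i in range(d)}
    queue = [i for i in range(d) if indeg[i] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for c in children[u]:
            indeg[c] -= 1
            if indeg[c] == 0:
                queue.append(c)
    return seen == d

def neighbors(structure, d):
    """Legal moves: edge addition, edge deletion, edge reversal (DAGs only)."""
    for i in range(d):
        for j in range(d):
            if i == j:
                continue
            if i in structure[j]:
                # deletion of i -> j
                cand = {**structure, j: structure[j] - {i}}
                if is_dag(cand, d):
                    yield cand
                # reversal of i -> j
                cand = {**structure, j: structure[j] - {i}, i: structure[i] | {j}}
                if is_dag(cand, d):
                    yield cand
            else:
                # addition of i -> j
                cand = {**structure, j: structure[j] | {i}}
                if is_dag(cand, d):
                    yield cand

def hill_climb(score, d):
    """First-improvement greedy ascent from the empty graph;
    score maps a structure dict to a float (e.g., a BIC score)."""
    current = {i: set() for i in range(d)}
    best = score(current)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(current, d):
            s = score(cand)
            if s > best:
                current, best, improved = cand, s, True
                break
    return current, best
```

Hill climbing only guarantees a local optimum; in practice it is combined with random restarts or tabu lists, and with decomposable scores only the changed family needs rescoring.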
Decomposable score
Decomposable structure score:
$$\mathrm{Score}(\mathcal{G}; \mathcal{D}) = \sum_i \mathrm{FamScore}(X_i \mid \boldsymbol{Pa}_i^{\mathcal{G}}; \mathcal{D})$$
$\mathrm{FamScore}(X \mid \boldsymbol{U}; \mathcal{D})$ measures how well the set of variables $\boldsymbol{U}$ serves as parents of $X$ in the data set $\mathcal{D}$
Decomposability reduces the computational overhead of evaluating different structures during search: with a decomposable score, a local change in the structure does not change the score of the remaining parts of the structure
Decomposability of scores
The likelihood score is decomposable
The Bayesian score is decomposable when:
the parameter priors satisfy global parameter independence and parameter modularity:
$$\boldsymbol{Pa}_i^{\mathcal{G}} = \boldsymbol{Pa}_i^{\mathcal{G}'} = \boldsymbol{U} \;\Rightarrow\; P(\boldsymbol{\theta}_{X_i \mid \boldsymbol{U}} \mid \mathcal{G}) = P(\boldsymbol{\theta}_{X_i \mid \boldsymbol{U}} \mid \mathcal{G}')$$
and the structure prior satisfies structure modularity:
$$P(\mathcal{G}) \propto \prod_i P(\boldsymbol{Pa}_i = \boldsymbol{Pa}_i^{\mathcal{G}})$$
Summary
Score functions
Likelihood score
BIC score
Bayesian score
The optimal structure for trees is found in polynomial time using a greedy (maximum-weight spanning tree) algorithm
When the hypothesis space includes general graphs, the model selection problem is NP-hard, and we use local search strategies to optimize the structure
References
D. Koller and N. Friedman, βProbabilistic Graphical Models:
Principles and Techniquesβ, MIT Press, 2009, Chapter 18.1, 18.3,
18.4.