
Structure learning in Bayesian networks

Probabilistic Graphical Models

Sharif University of Technology

Spring 2018

Soleymani

Recall: Learning problem

Target: true distribution $P^*$ that may correspond to a model $\mathcal{M}^* = (\mathcal{K}^*, \boldsymbol{\theta}^*)$

Hypothesis space: a specified family of probabilistic graphical models

Data: a set of instances sampled from $P^*$

Learning goal: select a model $\mathcal{M}$ that is the best approximation to $\mathcal{M}^*$ according to a performance metric

We may use Bayesian model averaging instead of selecting a single model $\mathcal{M}$

Structure learning of Bayesian networks

For a fixed set of variables (i.e., nodes), we intend to learn the directed graph structure $\mathcal{G}^*$:

Which edges are included in the model?

Too few edges miss dependencies

Too many edges introduce spurious dependencies

We often prefer sparser structures, which tend to generalize better

Learning the structure of DGMs

Approaches

Constraint-based

Find conditional independencies in the data using statistical tests

Find an I-map graph for the obtained conditional independencies

Score-based

View BN structure and parameter learning as a density estimation task and address structure learning as a model selection problem

A score function measures how well the model fits the observed data

Maximum likelihood

Bayesian score

Score-based approach

Input:

Hypothesis space: a set of possible structures

Training data

Output: a network that maximizes the score function

Thus, we need a score function (including priors, if needed) and a search strategy (i.e., an optimization procedure) to find a structure in the hypothesis space

Structural search

Hypothesis space

General graphs: the number of graphs on $M$ nodes is $O(2^{M^2})$, i.e., super-exponential

Trees: the number of trees on $M$ nodes is $O(M!)$

This is a combinatorial optimization problem

Many heuristics exist for it, but with no guarantee of attaining optimality

Likelihood score

Find $\mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}}$ that maximize the likelihood:

$$\max_{\mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}}} P(\mathcal{D} \mid \mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}}) = \max_{\mathcal{G}} \max_{\boldsymbol{\theta}_{\mathcal{G}}} P(\mathcal{D} \mid \mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}}) = \max_{\mathcal{G}} P(\mathcal{D} \mid \mathcal{G}, \hat{\boldsymbol{\theta}}_{\mathcal{G}})$$

Likelihood score: $\mathrm{Score}_L(\mathcal{G}; \mathcal{D}) = P(\mathcal{D} \mid \mathcal{G}, \hat{\boldsymbol{\theta}}_{\mathcal{G}})$, where $\hat{\boldsymbol{\theta}}_{\mathcal{G}}$ are the maximum-likelihood parameters for structure $\mathcal{G}$

Likelihood score: information-theoretic interpretation

Given the structure, the log-likelihood of the data is:

$$\ln P(\mathcal{D} \mid \boldsymbol{\theta}_{\mathcal{G}}, \mathcal{G}) = \ln \prod_{n=1}^{N} P\big(\boldsymbol{x}^{(n)} \mid \boldsymbol{\theta}_{\mathcal{G}}, \mathcal{G}\big) = \sum_{n=1}^{N} \sum_{i} \ln P\big(x_i^{(n)} \mid \boldsymbol{pa}_{X_i}^{(n)}, \boldsymbol{\theta}_{X_i \mid Pa_{X_i}}, \mathcal{G}\big)$$

$$= \sum_{i} \sum_{n=1}^{N} \ln P\big(x_i^{(n)} \mid \boldsymbol{pa}_{X_i}^{(n)}, \boldsymbol{\theta}_{X_i \mid Pa_{X_i}}, \mathcal{G}\big) = N \sum_{i} \sum_{x_i \in Val(X_i)} \sum_{\boldsymbol{u}_i \in Val(\boldsymbol{Pa}_{X_i})} \frac{\mathrm{Count}(x_i, \boldsymbol{u}_i)}{N} \ln P\big(x_i \mid \boldsymbol{u}_i, \boldsymbol{\theta}_{X_i \mid Pa_{X_i}}, \mathcal{G}\big)$$

Plugging in the maximum-likelihood parameters (the empirical conditionals $\hat{P}$):

$$\ln P(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_{\mathcal{G}}, \mathcal{G}) = N \sum_{i} \sum_{x_i \in Val(X_i)} \sum_{\boldsymbol{u}_i \in Val(\boldsymbol{Pa}_{X_i})} \hat{P}(x_i, \boldsymbol{u}_i) \ln \frac{\hat{P}(x_i \mid \boldsymbol{u}_i)\,\hat{P}(x_i)}{\hat{P}(x_i)} = N \sum_{i} \Big[ I_{\hat{P}}\big(X_i; \boldsymbol{Pa}_{X_i}\big) - H_{\hat{P}}(X_i) \Big]$$

This score measures the strength of the dependencies between variables and their parents; the entropy terms do not depend on the structure, so only the mutual-information terms matter when comparing structures.
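To make the decomposition concrete, here is a minimal Python sketch that computes the (log) likelihood score of a candidate structure from a discrete data matrix via empirical mutual information and entropy. The data layout (an `(N, d)` NumPy array of discrete values) and the `parents_of` mapping from each variable index to its parent indices are assumptions for illustration, not part of the original slides.

```python
import numpy as np

def empirical_mi(data, child, parents):
    """Empirical mutual information I_hat(X_child; X_parents) from a discrete
    (N, d) data matrix. Returns 0 if the parent set is empty."""
    if not parents:
        return 0.0
    child_vals = data[:, child]
    parent_rows = [tuple(r) for r in data[:, parents]]
    mi = 0.0
    for x in np.unique(child_vals):
        for u in set(parent_rows):
            mask_u = np.array([r == u for r in parent_rows])
            p_xu = np.mean((child_vals == x) & mask_u)
            if p_xu == 0:
                continue
            p_x = np.mean(child_vals == x)
            p_u = np.mean(mask_u)
            mi += p_xu * np.log(p_xu / (p_x * p_u))
    return mi

def likelihood_score(data, parents_of):
    """Log likelihood score: N * sum_i [ I_hat(X_i; Pa_i) - H_hat(X_i) ]."""
    N, d = data.shape
    score = 0.0
    for i in range(d):
        _, counts = np.unique(data[:, i], return_counts=True)
        p = counts / N                      # empirical marginal of X_i
        H = -np.sum(p * np.log(p))          # empirical entropy H_hat(X_i)
        score += N * (empirical_mi(data, i, parents_of[i]) - H)
    return score
```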

Limitations of likelihood score

We have $I_P(X; Y) \ge 0$, and $I_P(X; Y) = 0$ iff $P(X, Y) = P(X)P(Y)$

In the training data, almost always $I_{\hat{P}}(X; Y) > 0$: due to statistical noise, exact independence almost never occurs in the empirical distribution

Thus, the ML score never prefers the simpler network; the score is maximized by a fully connected network

Therefore the likelihood score fails to generalize well

Note that $I(X; Y \cup Z) \ge I(X; Y)$, and $I(X; Y \cup Z) = I(X; Y)$ iff $X \perp Z \mid Y$, so adding parents can never decrease the likelihood score

Avoiding overfitting

Restricting the hypothesis space explicitly: restrict the number of parents or the number of parameters

Imposing a penalty on the complexity of the structure: Bayesian score and BIC

Tree structure: Chow-Liu algorithm

Compute $w_{ij} = I_{\hat{P}}(X_i; X_j)$, the empirical mutual information, for each pair of nodes $X_i$ and $X_j$

Find a maximum weight spanning tree (MWST), i.e., the tree with the greatest total weight:

$$\max_{T} \sum_{(i,j) \in T} w_{ij}$$

Give directions to the edges of the MWST: pick an arbitrary node as the root and draw arrows pointing away from it (e.g., using BFS)

In this way we can find an optimal tree structure in polynomial time
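Below is a minimal sketch of the Chow-Liu procedure under the same assumed data layout, reusing `empirical_mi` from the earlier sketch: compute pairwise empirical mutual information, build a maximum-weight spanning tree with Prim's algorithm, then orient edges away from an arbitrary root (node 0) by BFS.

```python
from collections import deque
import numpy as np

def chow_liu_tree(data):
    """Chow-Liu sketch: maximum-weight spanning tree over pairwise empirical
    mutual information, with edges oriented away from node 0 by BFS.
    Returns a dict mapping each node to its (at most one-element) parent list."""
    d = data.shape[1]
    w = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            w[i, j] = w[j, i] = empirical_mi(data, i, [j])

    # Prim's algorithm for a maximum-weight spanning tree
    in_tree, undirected = {0}, []
    while len(in_tree) < d:
        i, j = max(((a, b) for a in in_tree for b in range(d) if b not in in_tree),
                   key=lambda e: w[e])
        undirected.append((i, j))
        in_tree.add(j)

    # orient edges away from the root by breadth-first search
    adj = {v: [] for v in range(d)}
    for i, j in undirected:
        adj[i].append(j)
        adj[j].append(i)
    parents_of = {v: [] for v in range(d)}
    visited, queue = {0}, deque([0])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in visited:
                parents_of[v] = [u]          # directed edge u -> v
                visited.add(v)
                queue.append(v)
    return parents_of
```

The spanning-tree and orientation steps are both polynomial, which is what makes the tree case tractable in contrast to general graphs.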

BIC score

$$\mathrm{Score}_{BIC}(\mathcal{G}) = \log P(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_{\mathcal{G}}, \mathcal{G}) - \frac{\log N}{2} \mathrm{Dim}(\mathcal{G}) = N \sum_{i} I_{\hat{P}}\big(X_i; \boldsymbol{Pa}_{X_i}\big) - N \sum_{i} H_{\hat{P}}(X_i) - \frac{\log N}{2} \mathrm{Dim}(\mathcal{G})$$

Here $\mathrm{Dim}(\mathcal{G})$ is the model complexity (i.e., the number of independent parameters), and the entropy term can be ignored since it does not depend on the structure

Its negation is also known as the Minimum Description Length (MDL) score

The larger $N$ is, the more emphasis is given to the fit to the data: the mutual-information term grows linearly with $N$ while the complexity penalty grows only logarithmically with $N$
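As a hedged illustration, the BIC penalty can be added on top of the likelihood-score sketch above; for discrete variables the number of independent parameters of a family is $(|Val(X_i)| - 1)$ times the number of parent configurations. Estimating the cardinalities from the observed data is an assumption of the sketch.

```python
def bic_score(data, parents_of):
    """BIC sketch: log-likelihood at the MLE minus (log N / 2) * Dim(G),
    reusing likelihood_score above; variable cardinalities are taken from the data."""
    N, d = data.shape
    card = [len(np.unique(data[:, i])) for i in range(d)]          # |Val(X_i)|
    dim = sum((card[i] - 1) * int(np.prod([card[p] for p in parents_of[i]]))
              for i in range(d))
    return likelihood_score(data, parents_of) - 0.5 * np.log(N) * dim
```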

Asymptotic consistency of BIC score

Definition: a scoring function is consistent if, as $N \to \infty$ and with probability tending to 1:

$\mathcal{G}^*$ (or any I-equivalent structure) maximizes the score

All structures that are not I-equivalent to $\mathcal{G}^*$ have strictly lower score

Theorem: as $N \to \infty$, the structure that maximizes the BIC score is the true structure $\mathcal{G}^*$ (or an I-equivalent structure)

Asymptotically, spurious dependencies will not contribute to the likelihood and will be penalized

Required dependencies will be added, due to the linear growth of the likelihood term compared to the logarithmic growth of the model complexity penalty

Bayesian score

Uncertainty over parameters and structure: average over all possible parameter values and also consider a structure prior

$$P(\mathcal{G} \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \mathcal{G})\, P(\mathcal{G})}{P(\mathcal{D})}$$

where $P(\mathcal{D} \mid \mathcal{G})$ is the marginal likelihood, $P(\mathcal{G})$ is the structure prior, and $P(\mathcal{D})$ is independent of $\mathcal{G}$

Bayesian score:

$$\mathrm{Score}_B(\mathcal{G}; \mathcal{D}) = \log P(\mathcal{D} \mid \mathcal{G}) + \log P(\mathcal{G})$$

To find the marginal likelihood, we integrate over the whole distribution of parameters $\boldsymbol{\theta}_{\mathcal{G}}$ for the structure $\mathcal{G}$:

$$P(\mathcal{D} \mid \mathcal{G}) = \int_{\boldsymbol{\theta}_{\mathcal{G}}} P(\mathcal{D} \mid \mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}})\, P(\boldsymbol{\theta}_{\mathcal{G}} \mid \mathcal{G})\, d\boldsymbol{\theta}_{\mathcal{G}}$$

Marginal likelihood

$$P(\mathcal{D} \mid \mathcal{G}) = P\big(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N)} \mid \mathcal{G}\big) = P\big(\boldsymbol{x}^{(1)} \mid \mathcal{G}\big) \times P\big(\boldsymbol{x}^{(2)} \mid \boldsymbol{x}^{(1)}, \mathcal{G}\big) \times \cdots \times P\big(\boldsymbol{x}^{(N)} \mid \boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N-1)}, \mathcal{G}\big)$$

For a Dirichlet prior and a single variable with $K$ values:

$$P\big(x^{(1)}, \ldots, x^{(N)}\big) = \frac{\prod_{k=1}^{K} \alpha_k (\alpha_k + 1) \cdots (\alpha_k + M[k] - 1)}{\alpha (\alpha + 1) \cdots (\alpha + N - 1)} = \frac{\Gamma(\alpha)}{\Gamma(\alpha + N)} \times \prod_{k=1}^{K} \frac{\Gamma(\alpha_k + M[k])}{\Gamma(\alpha_k)}$$

where $M[k] = \sum_{n=1}^{N} I\big(x^{(n)} = k\big)$, $\alpha = \sum_{k=1}^{K} \alpha_k$, and we used $\alpha (\alpha + 1) \cdots (\alpha + N - 1) = \frac{\Gamma(\alpha + N)}{\Gamma(\alpha)}$

Marginal likelihood in Bayesian networks: global & local parameter independence

Let $P(\boldsymbol{\theta}_{\mathcal{G}} \mid \mathcal{G})$ satisfy the global parameter independence assumption:

$$P(\mathcal{D} \mid \mathcal{G}) = \prod_{i} \int_{\boldsymbol{\theta}_{X_i \mid Pa_{X_i}^{\mathcal{G}}}} \prod_{n=1}^{N} P\big(x_i^{(n)} \mid \boldsymbol{pa}_{X_i}^{\mathcal{G},(n)}, \boldsymbol{\theta}_{X_i \mid Pa_{X_i}^{\mathcal{G}}}, \mathcal{G}\big)\, P\big(\boldsymbol{\theta}_{X_i \mid Pa_{X_i}^{\mathcal{G}}} \mid \mathcal{G}\big)\, d\boldsymbol{\theta}_{X_i \mid Pa_{X_i}^{\mathcal{G}}}$$

Let $P(\boldsymbol{\theta}_{\mathcal{G}} \mid \mathcal{G})$ satisfy both the global and local parameter independence assumptions:

$$P(\mathcal{D} \mid \mathcal{G}) = \prod_{i} \prod_{\boldsymbol{u}_i \in Val(Pa_{X_i}^{\mathcal{G}})} \int_{\boldsymbol{\theta}_{X_i \mid \boldsymbol{u}_i}} \prod_{n:\, \boldsymbol{pa}_{X_i}^{\mathcal{G},(n)} = \boldsymbol{u}_i} P\big(x_i^{(n)} \mid \boldsymbol{u}_i, \boldsymbol{\theta}_{X_i \mid \boldsymbol{u}_i}, \mathcal{G}\big)\, P\big(\boldsymbol{\theta}_{X_i \mid \boldsymbol{u}_i} \mid \mathcal{G}\big)\, d\boldsymbol{\theta}_{X_i \mid \boldsymbol{u}_i}$$

If the Dirichlet priors $P(\boldsymbol{\theta}_{X_i \mid \boldsymbol{u}_i} \mid \mathcal{G})$ have hyperparameters $\big\{\alpha_{x_i \mid \boldsymbol{u}_i}^{\mathcal{G}} : x_i \in Val(X_i)\big\}$:

$$P(\mathcal{D} \mid \mathcal{G}) = \prod_{i} \prod_{\boldsymbol{u}_i \in Val(Pa_{X_i}^{\mathcal{G}})} \frac{\Gamma\Big(\sum_{x_i \in Val(X_i)} \alpha_{x_i \mid \boldsymbol{u}_i}^{\mathcal{G}}\Big)}{\Gamma\Big(\sum_{x_i \in Val(X_i)} \big(\alpha_{x_i \mid \boldsymbol{u}_i}^{\mathcal{G}} + M_i[x_i, \boldsymbol{u}_i]\big)\Big)} \prod_{x_i \in Val(X_i)} \frac{\Gamma\big(\alpha_{x_i \mid \boldsymbol{u}_i}^{\mathcal{G}} + M_i[x_i, \boldsymbol{u}_i]\big)}{\Gamma\big(\alpha_{x_i \mid \boldsymbol{u}_i}^{\mathcal{G}}\big)}$$

where $M_i[x_i, \boldsymbol{u}_i]$ is the number of instances with $X_i = x_i$ and $\boldsymbol{Pa}_{X_i}^{\mathcal{G}} = \boldsymbol{u}_i$
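A per-family version of this formula can be sketched in Python in the same spirit. Splitting a single equivalent sample size uniformly over the cells (a BDeu-style choice) and iterating only over parent configurations observed in the data are simplifying assumptions of the sketch, not part of the slides.

```python
import numpy as np
from scipy.special import gammaln

def family_log_marginal(data, child, parents, alpha=1.0):
    """Log marginal likelihood of one family (X_child | parents) under Dirichlet priors:
    sum over parent configurations u of
      log Gamma(alpha_u) - log Gamma(alpha_u + M[u])
      + sum_x [ log Gamma(alpha_{x|u} + M[x, u]) - log Gamma(alpha_{x|u}) ]."""
    child_vals = np.unique(data[:, child])
    if parents:
        parent_rows = [tuple(r) for r in data[:, parents]]
        parent_configs = sorted(set(parent_rows))
    else:
        parent_rows = [()] * data.shape[0]
        parent_configs = [()]
    a = alpha / (len(child_vals) * len(parent_configs))   # per-cell hyperparameter
    a_u = a * len(child_vals)                              # alpha_u = sum_x alpha_{x|u}
    total = 0.0
    for u in parent_configs:
        mask_u = np.array([r == u for r in parent_rows])
        total += gammaln(a_u) - gammaln(a_u + mask_u.sum())
        for x in child_vals:
            M_xu = np.sum(mask_u & (data[:, child] == x))
            total += gammaln(a + M_xu) - gammaln(a)
    return total
```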

Asymptotic behavior of Bayesian score

As $N \to \infty$, a network $\mathcal{G}$ with Dirichlet priors satisfies:

$$\log P(\mathcal{D} \mid \mathcal{G}) = \log P(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_{\mathcal{G}}, \mathcal{G}) - \frac{\log N}{2} \mathrm{Dim}(\mathcal{G}) + O(1)$$

where the $O(1)$ term does not grow with $N$

Thus, the Bayesian score is asymptotically equivalent to BIC and is asymptotically consistent

Priors

Structure prior $P(\mathcal{G})$:

Uniform prior: $P(\mathcal{G}) \propto$ constant

Prior penalizing the number of edges: $P(\mathcal{G}) \propto c^{|E(\mathcal{G})|}$ with $0 < c < 1$

Prior penalizing the number of parameters

The normalizing constant of the structure prior can be ignored

Parameter priors: $P(\boldsymbol{\theta} \mid \mathcal{G})$ is usually instantiated as the BDe prior:

$$\alpha\big(X_i, \boldsymbol{Pa}_{X_i}^{\mathcal{G}}\big) = \alpha \cdot P'\big(X_i, \boldsymbol{Pa}_{X_i}^{\mathcal{G}}\big)$$

where $\alpha$ is the equivalent sample size and $P'$ is represented by a prior network $B_0$; note that $\boldsymbol{Pa}_{X_i}^{\mathcal{G}}$ are not necessarily the same as the parents of $X_i$ in $B_0$

BDe prior: a single prior network provides priors for all candidate networks

It is the unique prior with the property that I-equivalent networks have the same Bayesian score

Score Equivalence

A score function satisfies score equivalence when all I-equivalent networks have the same score

The likelihood score and the BIC score satisfy score equivalence

General graph structure

Theorem: the problem of learning a BN structure with at most $d$ parents ($d \ge 2$) is NP-hard.

Search in the space of graph structures using local search algorithms, such as hill climbing and simulated annealing

Exploit score decomposition during the search to reduce the computational overhead

Optimization by local searches

Graphs as states

The score function is used as the objective function

Neighbors of a state are found using these modifications:

Edge addition

Edge deletion

Edge reversal

We only consider legal neighbors, i.e., modifications that result in DAGs
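A minimal hill-climbing sketch over these three move types follows. It scores each full candidate structure with a user-supplied `score_fn` (for example the `bic_score` sketch above); a practical implementation would instead exploit score decomposability to re-score only the changed families. The `parents_of` dictionary representation is an assumption carried over from the earlier sketches.

```python
def is_acyclic(parents_of):
    """Return True if the parent map defines a DAG (depth-first cycle check)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in parents_of}

    def visit(v):
        color[v] = GRAY
        for p in parents_of[v]:
            if color[p] == GRAY or (color[p] == WHITE and not visit(p)):
                return False
        color[v] = BLACK
        return True

    return all(color[v] == BLACK or visit(v) for v in parents_of)

def neighbors(parents_of):
    """Yield legal neighbor structures: single-edge additions, deletions, reversals."""
    nodes = list(parents_of)
    for i in nodes:
        for j in nodes:
            if i == j:
                continue
            cand = {v: list(ps) for v, ps in parents_of.items()}
            if i in cand[j]:
                cand[j].remove(i)
                yield cand                                  # delete edge i -> j
                rev = {v: list(ps) for v, ps in cand.items()}
                rev[i].append(j)
                if is_acyclic(rev):
                    yield rev                               # reverse to j -> i
            else:
                cand[j].append(i)
                if is_acyclic(cand):
                    yield cand                              # add edge i -> j

def hill_climb(data, score_fn, max_iters=100):
    """Greedy local search: start from the empty graph and repeatedly move to the
    best-scoring legal neighbor until no neighbor improves the score."""
    parents_of = {i: [] for i in range(data.shape[1])}
    current = score_fn(data, parents_of)
    for _ in range(max_iters):
        best, best_score = None, current
        for cand in neighbors(parents_of):
            s = score_fn(data, cand)
            if s > best_score:
                best, best_score = cand, s
        if best is None:
            break                                           # local maximum reached
        parents_of, current = best, best_score
    return parents_of, current
```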

Decomposable score

Decomposable structure score:

$$\mathrm{Score}(\mathcal{G}; \mathcal{D}) = \sum_{i} \mathrm{FamScore}\big(X_i \mid \boldsymbol{Pa}_{X_i}^{\mathcal{G}}; \mathcal{D}\big)$$

$\mathrm{FamScore}(X \mid \boldsymbol{U}; \mathcal{D})$ measures how well the set of variables $\boldsymbol{U}$ serves as parents of $X$ in the data set $\mathcal{D}$

Decomposability is useful for reducing the computational overhead of evaluating different structures during the search: with a decomposable score, a local change in the structure does not change the score of the remaining parts of the structure

Decomposability of scores

The likelihood score is decomposable

The Bayesian score is decomposable when:

the parameter priors satisfy global parameter independence and parameter modularity:

$$\boldsymbol{Pa}_{X_i}^{\mathcal{G}} = \boldsymbol{Pa}_{X_i}^{\mathcal{G}'} = \boldsymbol{U} \;\Rightarrow\; P\big(\boldsymbol{\theta}_{X_i \mid \boldsymbol{U}} \mid \mathcal{G}\big) = P\big(\boldsymbol{\theta}_{X_i \mid \boldsymbol{U}} \mid \mathcal{G}'\big)$$

and the structure prior satisfies structure modularity:

$$P(\mathcal{G}) \propto \prod_{i} P\big(\boldsymbol{Pa}_{X_i} = \boldsymbol{Pa}_{X_i}^{\mathcal{G}}\big)$$

Summary


Score functions

Likelihood score

BIC score

Bayesian score

The optimal structure for trees can be found using a greedy algorithm (the Chow-Liu algorithm)

When the hypothesis space includes general graphs, the model selection problem is NP-hard, and we use local search strategies to optimize the structure

References

D. Koller and N. Friedman, "Probabilistic Graphical Models: Principles and Techniques", MIT Press, 2009, Chapters 18.1, 18.3, and 18.4.
