
Page 1: Locally Averaged Bayesian Dirichlet Metrics

Locally averaged Bayesian Dirichlet metrics

A. Cano, M. Gomez-Olmedo, A. R. Masegosa and S. Moral

Department of Computer Science and Artificial Intelligence

University of Granada (Spain)

Belfast, July 2011

European Conference on Symbolic and Quantitative Approaches to Reasoning under Uncertainty


Page 2: Locally Averaged Bayesian Dirichlet Metrics

Outline

1 Introduction

2 Bayesian Dirichlet Metrics

3 Locally Averaged Bayesian Dirichlet Metrics

4 Experimental Evaluation

5 Conclusions & Future Work

Page 3: Locally Averaged Bayesian Dirichlet Metrics

Introduction

Part I

Introduction


Page 4: Locally Averaged Bayesian Dirichlet Metrics

Introduction

Bayesian Networks

Excellent models for graphically representing the dependency structure of the underlying distribution in multivariate domains.

This dependency structure in a multivariate problem domain represents a very relevant source of knowledge (direct interactions, conditional independencies, ...).


Page 5: Locally Averaged Bayesian Dirichlet Metrics

Introduction

Learning Bayesian Networks from Data

Learning Algorithms

Constraint-based learning, based on hypothesis tests, such as the PC algorithm.

Score+Search methods, which employ a search algorithm guided by a score function.

The model with the highest score is selected.


Page 6: Locally Averaged Bayesian Dirichlet Metrics

Introduction

Bayesian Score Metrics

Marginal Likelihood of the data

$$P(D \mid G) = \int P(D \mid \theta, G)\, P(\theta \mid G)\, d\theta$$

Bayesian Dirichlet Equivalent Metric (BDe)

It satisfies the likelihood equivalence property.

A global Dirichlet distribution is assumed in order to guarantee the likelihood equivalence property.

The parametrization depends on the equivalent sample size (ESS) parameter.

$$\mathrm{score}(G : D) = \prod_{i} \prod_{j=1}^{|U_i|} \frac{\Gamma\!\left(\frac{ESS}{|U_i|}\right)}{\Gamma\!\left(\frac{ESS}{|U_i|} + N_{ij}\right)} \prod_{k=1}^{|X_i|} \frac{\Gamma\!\left(\frac{ESS}{|U_i|\,|X_i|} + N_{ijk}\right)}{\Gamma\!\left(\frac{ESS}{|U_i|\,|X_i|}\right)}$$

where $|U_i|$ is the number of configurations of the parents of $X_i$.


Page 8: Locally Averaged Bayesian Dirichlet Metrics

Introduction

Sensitivity to ESS parameter

Experimental Evaluations [Silander et al.2007]

The global MAP BN was computed with an exhaustive-search-based algorithm for 20 UCI data sets.

They found that different ESS values lead to different optimal BN models.

For some data sets (e.g., the Yeast database), the optimal BN model monotonically goes from the empty graph to the fully connected graph as the ESS value increases.

[Figure: number of arcs in the optimal BN vs. ESS value]


Page 9: Locally Averaged Bayesian Dirichlet Metrics

Introduction

Our approach

Solution: Marginalizing the ESS parameter

As first suggested in [Silander et al. 2007], a possible solution is to employ a Bayesian approach:

Assume a prior distribution on the ESS parameter and marginalize it out.

Locally Averaged Bayesian Dirichlet Metrics

It is based on a local averaging approach to marginalize the ESS parameter.

We experimentally justify that this approach is superior:

It is able to adapt to more complex parameter spaces.

This approach removes the sensitivity of the Bayesian Dirichlet metric to the ESS parameter.



Page 11: Locally Averaged Bayesian Dirichlet Metrics

Bayesian Dirichlet Metrics

Part II

Bayesian Dirichlet Metrics


Page 12: Locally Averaged Bayesian Dirichlet Metrics

Bayesian Dirichlet Metrics

Notation

Let X = (X1, ..., Xn) be a set of n multinomial random variables.

|Xi| is the number of values of Xi.

We also assume a fully observed multinomial data set D.

A Bayesian network B can be described by:

G is a directed acyclic graph:

G = (Pa(X1), ..., Pa(Xn)).

θG is a set of parameter vectors:

P(Xi | Pa(Xi) = j) = θij.

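To make the notation concrete, the following is a minimal sketch (not part of the slides) of a discrete Bayesian network as a data structure; the class and field names are illustrative only.

```python
# Minimal sketch (not from the slides): the notation above as a data structure.
# A discrete BN is a parent-set list (the DAG G) plus one conditional
# probability table per variable (the parameter vectors theta_ij).
from dataclasses import dataclass

import numpy as np


@dataclass
class DiscreteBN:
    parents: list[list[int]]   # parents[i] = indices of Pa(X_i) in G
    cpts: list[np.ndarray]     # cpts[i][j, k] = P(X_i = k | Pa(X_i) = j)


# Example: X1 -> X2 (indices 0 and 1), both binary; each CPT row sums to one.
bn = DiscreteBN(
    parents=[[], [0]],
    cpts=[np.array([[0.7, 0.3]]),
          np.array([[0.9, 0.1],
                    [0.2, 0.8]])],
)
```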

Page 13: Locally Averaged Bayesian Dirichlet Metrics

Bayesian Dirichlet Metrics

Bayesian Dirichlet equivalent metric

Marginal likelihood of a graph structure:

$$P(D \mid G) = \int P(D \mid \theta, G)\, P(\theta \mid G)\, d\theta$$

It is computed under the following assumptions:

Complete labelled training data.

The prior distributions over the parameters are Dirichlet distributions:

$$\theta_{ij} \sim \mathrm{Dirichlet}(\alpha_{ij1}, \ldots, \alpha_{ij|X_i|})$$

Parameters are globally and locally independent:

$$\mathrm{score}_{BDeu}(G \mid D) = \prod_{i=1}^{n} \prod_{j=1}^{|Pa^G(X_i)|} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{|X_i|} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}$$

where $\alpha_{ij} = \sum_k \alpha_{ijk}$.

The BDe metric sets the alpha values as follows, in order to guarantee the likelihood equivalence property:

$$\alpha_{ijk} = \frac{S}{|X_i|\,|Pa(X_i)|}$$
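As a concrete reading of the score, the following minimal sketch (not part of the slides) computes the log of the BDeu product from the count tables N_ijk; the count-table layout and function names are assumptions of this example.

```python
# Minimal sketch (not from the slides): log BDeu score from the counts N_ijk,
# with alpha_ijk = S / (|X_i| * |Pa(X_i)|). counts[i] is assumed to be a
# q_i x r_i array, where q_i is the number of parent configurations of X_i
# and r_i = |X_i|; the value returned is the log of the product on the slide.
import numpy as np
from scipy.special import gammaln  # log Gamma, to avoid overflow


def log_bdeu_family(counts_i: np.ndarray, ess: float) -> float:
    """Log BDeu term for one variable X_i, given its q_i x r_i count table."""
    q_i, r_i = counts_i.shape
    alpha_ij = ess / q_i               # prior mass per parent configuration
    alpha_ijk = ess / (q_i * r_i)      # prior mass per table cell
    n_ij = counts_i.sum(axis=1)        # N_ij = sum_k N_ijk
    return float(np.sum(gammaln(alpha_ij) - gammaln(alpha_ij + n_ij))
                 + np.sum(gammaln(alpha_ijk + counts_i) - gammaln(alpha_ijk)))


def log_bdeu(counts: list, ess: float) -> float:
    """Log BDeu score of a graph: the sum of its local family terms."""
    return sum(log_bdeu_family(c, ess) for c in counts)


# Hypothetical example: one binary variable with one binary parent.
counts = [np.array([[30.0, 5.0], [4.0, 61.0]])]
print(log_bdeu(counts, ess=1.0))
```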


Page 16: Locally Averaged Bayesian Dirichlet Metrics

Bayesian Dirichlet Metrics

Sensitivity to the ESS

The problem is that the αijk values become exponentially small with either the number or the cardinality of the parents: $\alpha_{ijk} = \frac{S}{|X_i|\,|Pa(X_i)|}$.

[Figure: Beta(1, 1), Beta(0.5, 0.5), Beta(0.25, 0.25), Beta(0.125, 0.125)]

[Steck & Jaakkola 2002, Steck 2008, Ueno 2010]: small αijk values tend to favor the absence of an edge Y → X over its presence (even if X and Y are not conditionally independent).

Especially if the empirical P̂(X|Y) is not very extreme (i.e., it does not match the prior assumptions).


Page 18: Locally Averaged Bayesian Dirichlet Metrics

Bayesian Dirichlet Metrics

Sensitivity to the ESS

If we increase the S value, we implicitly assume that the marginal distributions P(Xi) = θi are very symmetrical.

[Figure: Beta(1, 1), Beta(2, 2), Beta(4, 4), Beta(8, 8)]

[Steck & Jaakkola 2002, Steck 2008, Ueno 2010]: larger S values tend to favor the presence of an edge Y → X over its absence (even if X and Y are conditionally independent).

Especially if there is notable skewness in both distributions P(X|PaX) and P(Y|PaY).
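The tendency above can be checked numerically. The sketch below (not part of the slides) scores a hypothetical 2x2 count table and prints, for a range of S values, the log BDe difference between the model with the edge Y → X and the empty model; the counts and function name are illustrative assumptions.

```python
# Minimal sketch (not from the slides): how the choice of S shifts the BDe
# comparison between "edge Y -> X" and "no edge" for one fixed count table.
# Y's own family term is identical in both models (Y has no parents either
# way), so it cancels in the difference and is omitted here.
import numpy as np
from scipy.special import gammaln


def log_bde_family(counts_i: np.ndarray, s: float) -> float:
    """Log BDe term for one variable from its q_i x r_i table of N_ijk."""
    q, r = counts_i.shape
    a_j, a_jk = s / q, s / (q * r)
    return float(np.sum(gammaln(a_j) - gammaln(a_j + counts_i.sum(axis=1)))
                 + np.sum(gammaln(a_jk + counts_i) - gammaln(a_jk)))


# Hypothetical joint counts N(Y=j, X=k) for 1000 samples with mild dependence.
joint = np.array([[480.0, 320.0], [80.0, 120.0]])
x_alone = joint.sum(axis=0, keepdims=True)       # X with no parents: 1 x 2

for s in [2.0**k for k in range(-6, 7)]:
    diff = log_bde_family(joint, s) - log_bde_family(x_alone, s)
    print(f"S = {s:9.4f}   log-score(Y -> X) - log-score(no edge) = {diff:+.3f}")
```

Positive differences favour adding the edge; running this over the S grid shows how the preference shifts for a given count table.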


Page 20: Locally Averaged Bayesian Dirichlet Metrics

Locally Averaged Bayesian Dirichlet Metrics

Part III

Locally Averaged Bayesian Dirichlet Metrics

Page 21: Locally Averaged Bayesian Dirichlet Metrics

Locally Averaged Bayesian Dirichlet Metrics

Globally Averaged Bayesian Dirichlet Metrics

[Silander et al. 2007] proposed a Bayesian solution to the problem of selecting an optimal ESS:

Consider S as a random variable, place a prior on S and marginalize it out.

$$P(D \mid G) = \int P(D \mid G, s)\, P(s \mid G)\, ds$$

where P(D|G, s) is the classic marginal likelihood, which depends on the equivalent sample size.

It is assumed that P(S|G) is uniform, and the integral is approximated by simple averaging:

$$P(D \mid G) = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \prod_{i} \prod_{j=1}^{|U_i|} \frac{\Gamma\!\left(\frac{s}{|U_i|}\right)}{\Gamma\!\left(\frac{s}{|U_i|} + N_{ij}\right)} \prod_{k=1}^{|X_i|} \frac{\Gamma\!\left(\frac{s}{|U_i|\,|X_i|} + N_{ijk}\right)}{\Gamma\!\left(\frac{s}{|U_i|\,|X_i|}\right)}$$

where $\mathcal{S}$ is a finite set of different S values.

It satisfies the likelihood equivalence property, but it is not locally decomposable.
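A minimal sketch (not part of the slides) of this globally averaged score: the plain BDe likelihoods are averaged over a finite grid of S values, using logsumexp because the average is taken in probability space, not in log space. The count-table layout and names are assumptions of the example.

```python
# Minimal sketch (not from the slides): globally averaged BDe score, i.e. the
# average of the BDe marginal likelihoods over a finite grid of S values,
# with a single S shared by the whole network in each term of the average.
import numpy as np
from scipy.special import gammaln, logsumexp


def log_bde(counts, s):
    """Log BDe score for one S value; counts[i] is a q_i x r_i table of N_ijk."""
    total = 0.0
    for c in counts:
        q, r = c.shape
        a_j, a_jk = s / q, s / (q * r)
        total += np.sum(gammaln(a_j) - gammaln(a_j + c.sum(axis=1)))
        total += np.sum(gammaln(a_jk + c) - gammaln(a_jk))
    return total


def log_global_avg_bde(counts, s_grid):
    """log of (1/|S|) * sum over s of BDe(G, s)."""
    logs = np.array([log_bde(counts, s) for s in s_grid])
    return logsumexp(logs) - np.log(len(s_grid))


# Hypothetical example with the S_1 grid {0.5, 1, 2} from the slides.
counts = [np.array([[30.0, 5.0], [4.0, 61.0]])]
print(log_global_avg_bde(counts, [0.5, 1.0, 2.0]))
```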


Page 24: Locally Averaged Bayesian Dirichlet Metrics

Locally Averaged Bayesian Dirichlet Metrics

Sensitivity to the ESS

A toy example:

Z and Y have very skewed marginal distributions.

P(X|Z) is not notably far from uniform.

We generate 1000 data samples.

We evaluate the BN with the highest score.


Page 25: Locally Averaged Bayesian Dirichlet Metrics

Locally Averaged Bayesian Dirichlet Metrics

Globally Averaged Bayesian Dirichlet Metrics

Different averaging sets S_L were tested:

S_1 = {0.5, 1, 2}, S_2 = {0.25, 0.5, 1, 2, 4}, ..., S_10 = {2^-10, 2^-9, ..., 2^9, 2^10}.

S ≪ 1 (very skewed), S < 1 (skewed), S ≈ 1 (uniform), S ≫ 1 (strongly uniform).

Results

It always retrieves the empty graph, without any edges.

Reasons:

We assume a global distribution (either strongly uniform, uniform, skewed, or very skewed) for all parameters at the same time.

This assumption does not fit the parameter space of this Bayesian network.


Page 28: Locally Averaged Bayesian Dirichlet Metrics

Locally Averaged Bayesian Dirichlet Metrics

Locally Averaged Bayesian Dirichlet Metrics

The marginalization of the parameter S is carried out locally:

We assume that each parameter vector θij is drawn from a different Dirichlet distribution, where the S parameters are independent.

$$P(D \mid G) = \frac{1}{|\mathcal{S}|} \prod_{i} \prod_{j=1}^{|Pa(X_i)|} \sum_{s \in \mathcal{S}} \frac{\Gamma\!\left(\frac{s}{|Pa(X_i)|}\right)}{\Gamma\!\left(\frac{s}{|Pa(X_i)|} + N_{ij}\right)} \prod_{k=1}^{|X_i|} \frac{\Gamma\!\left(\frac{s}{|Pa(X_i)|\,|X_i|} + N_{ijk}\right)}{\Gamma\!\left(\frac{s}{|Pa(X_i)|\,|X_i|}\right)}$$

where $\mathcal{S}$ is a finite set of different S values.

The metric is now locally decomposable, but it loses the likelihood equivalence property.
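A minimal sketch (not part of the slides) of the locally averaged score. Following the stated assumption that each θij has its own independent S, the uniform 1/|S| weight is applied inside each local (i, j) factor in this sketch (the slide writes the constant once in front); the count-table layout and names are assumptions of the example.

```python
# Minimal sketch (not from the slides): locally averaged BDe score, with one
# independent average over the S grid per parent configuration (i, j).
import numpy as np
from scipy.special import gammaln, logsumexp


def log_local_avg_bde(counts, s_grid):
    """counts[i]: q_i x r_i table of N_ijk; s_grid: the finite set of S values."""
    s = np.asarray(s_grid, dtype=float)
    total = 0.0
    for c in counts:
        q, r = c.shape
        a_j = s / q                      # one alpha_ij per candidate S value
        a_jk = s / (q * r)               # one alpha_ijk per candidate S value
        for j in range(q):               # independent average per theta_ij
            n_ij = c[j].sum()
            log_terms = (gammaln(a_j) - gammaln(a_j + n_ij)
                         + np.sum(gammaln(a_jk[:, None] + c[j])
                                  - gammaln(a_jk)[:, None], axis=1))
            total += logsumexp(log_terms) - np.log(len(s))   # uniform 1/|S|
    return total


# Hypothetical example with the S_5 grid {2^-5, ..., 2^5} from the slides.
counts = [np.array([[30.0, 5.0], [4.0, 61.0]])]
print(log_local_avg_bde(counts, [2.0**k for k in range(-5, 6)]))
```

Because the sum over S sits inside the product over families, the score decomposes over (i, j) factors, which is the local decomposability mentioned above.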


Page 31: Locally Averaged Bayesian Dirichlet Metrics

Locally Averaged Bayesian Dirichlet Metrics

Locally Averaged Bayesian Dirichlet Metrics

Different averaging sets S_L were tested:

S_1 = {0.5, 1, 2}, S_2 = {0.25, 0.5, 1, 2, 4}, ..., S_10 = {2^-10, 2^-9, ..., 2^9, 2^10}.

S ≪ 1 (very skewed), S < 1 (skewed), S ≈ 1 (uniform), S ≫ 1 (strongly uniform).

Results

When L ≥ 5, we always retrieve the right graph.

We assume that each parameter vector follows a different Dirichlet distribution (either strongly uniform, uniform, skewed, or very skewed), independent of the rest of the parameters.

This assumption allows fitting much more complex parameter spaces.


Page 34: Locally Averaged Bayesian Dirichlet Metrics

Experimental Evaluation

Part IV

Experimental Evaluation


Page 35: Locally Averaged Bayesian Dirichlet Metrics

Experimental Evaluation

Experimental Set-up

Bayesian Networks:

alarm (37 nodes), boblo (23 nodes), boerlage-92 (23 nodes), hailfinder (56 nodes), insurance (27 nodes).

Data Sets:

We ran the algorithms 10 times with 1000 data samples (other sample sizes were also evaluated).

Evaluation Measures

Number of missing/extra links, Kullback-Leibler distance, ...

Algorithms

A greedy search algorithm is used, assuming we are given a correct topological order of the variables.

Different S_L sets are used to perform the averaging: L = 1, ..., 10 (displayed on the x-axis).
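For reference, a one-line sketch (not part of the slides) of how the nested S_L grids described above can be built; the variable name is illustrative.

```python
# Minimal sketch (not from the slides): the nested averaging grids
# S_L = {2^-L, ..., 2^0, ..., 2^L} for L = 1, ..., 10.
s_grids = {L: [2.0**k for k in range(-L, L + 1)] for L in range(1, 11)}
print(s_grids[1])   # [0.5, 1.0, 2.0] -> the S_1 set from the slides
```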


Page 39: Locally Averaged Bayesian Dirichlet Metrics

Experimental Evaluation

BDe with different S values I

[Figure: Missing+Extra Links (left) and KL Distance (right) vs. log of the S value, for the Alarm, Boblo, Boerlage, Hailfinder and Insurance networks]

Analysis

The BDe metric is very sensitive to the S value in some problem domains.

There is an optimal S value, which is different for each problem.

Page 40: Locally Averaged Bayesian Dirichlet Metrics

Experimental Evaluation

BDe with different S values II

[Figure: Missing Links (left) and Extra Links (right) vs. log of the S value, for the Alarm, Boblo, Boerlage, Hailfinder and Insurance networks]

Analysis

We can see that the theoretically predicted tendencies appear:

Higher S values have a tendency to add edges.

Lower S values have a tendency to remove edges.

Page 41: Locally Averaged Bayesian Dirichlet Metrics

Experimental Evaluation

Locally Averaged Bayesian Dirichlet metrics

[Figure: Missing+Extra Links (left) and KL Distance (right) vs. L value, for the Alarm, Boblo, Boerlage, Hailfinder and Insurance networks]

Analysis

The higher the L value, the wider the set of averaged S values.

In some domains, the error measures improve with the size of the averaged S set.

In other domains, the error does not improve, but it does not get worse either.

Page 42: Locally Averaged Bayesian Dirichlet Metrics

Experimental Evaluation

Globally Averaged Bayesian Dirichlet metrics

[Figure: Missing+Extra Links (left) and KL Distance (right) vs. L value, for the Alarm, Boblo, Boerlage, Hailfinder and Insurance networks]

Analysis

Similar behavior to the locally averaged metrics.

Page 43: Locally Averaged Bayesian Dirichlet Metrics

Experimental Evaluation

Globally vs Locally Averaged Bayesian Dirichlet metrics

Global-AvBD error minus Local-AvBD error

[Figure: Missing+Extra difference (Global-AvBD error minus Local-AvBD error) vs. L value, for the Alarm, Boblo, Boerlage, Hailfinder and Insurance networks]

Analysis

In Alarm, Boblo and Boerlage, there are hardly any differences between them.

In Hailfinder and Insurance, the Local-AvBD metric performs better.

The performance depends on the complexity of the parameter space.

Page 44: Locally Averaged Bayesian Dirichlet Metrics

Experimental Evaluation

BDe metric vs Locally Averaged Bayesian Dirichlet metrics

BD error minus Local-AvBD error

[Figure: Missing+Extra difference (BD error minus Local-AvBD error) vs. L value, for the Alarm, Boblo, Boerlage, Hailfinder and Insurance networks]

Analysis

For the BD metric, the model with the lowest error over all S values in the set S_L is selected.

The Local-AvBD metric performs at least as well as the BD metric with an optimal S value.

In some domains (Hailfinder and Insurance), the Local-AvBD metric yields better inferences.

Page 45: Locally Averaged Bayesian Dirichlet Metrics

Conclusions and Future Work

Part V

Conclusions and Future Work

Page 46: Locally Averaged Bayesian Dirichlet Metrics

Conclusions and Future Work

Conclusions and Future Work

Conclusions

The locally averaged Bayesian Dirichlet metric robustly infers more accurate models than the BDe metric with an optimal selection of the ESS parameter.

It is able to adapt to complex parameter spaces.

This metric is worthwhile for knowledge discovery tasks: the inferences do not depend on any free parameter, and it matches the performance of an optimal solution.

Future Work

Extend this method to the parameter estimation of a BN model:

$$P(X_i = k \mid Pa(X_i) = j) = \frac{n_{ijk} + \frac{S}{|X_i|\,|Pa(X_i)|}}{n_{ij} + \frac{S}{|Pa(X_i)|}}$$
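As a concrete reading of this estimator, here is a minimal sketch (not part of the slides) of the Dirichlet-smoothed conditional probability table for one variable and a fixed S; the function name and count layout are assumptions.

```python
# Minimal sketch (not from the slides): the future-work estimator, i.e. a
# Dirichlet-smoothed CPT built from the counts n_ijk of one variable X_i.
import numpy as np


def smoothed_cpt(counts_i: np.ndarray, s: float) -> np.ndarray:
    """counts_i[j, k] = n_ijk; returns P(X_i = k | Pa(X_i) = j)."""
    q, r = counts_i.shape
    num = counts_i + s / (q * r)                        # n_ijk + S/(|X_i||Pa(X_i)|)
    den = counts_i.sum(axis=1, keepdims=True) + s / q   # n_ij  + S/|Pa(X_i)|
    return num / den


# Hypothetical counts for a binary variable with one binary parent, S = 2.
print(smoothed_cpt(np.array([[30.0, 5.0], [4.0, 61.0]]), s=2.0))
```

Averaging this estimator over the S grid, in the same spirit as the structure scores above, is what the future-work item proposes to explore.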


Page 48: Locally Averaged Bayesian Dirichlet Metrics

Conclusions and Future Work

Thanks for your attention!