scalable map inference in bayesian networks based on a map-reduce approach (pgm2016)

35
Scalable MAP inference in Bayesian networks based on a Map-Reduce approach Darío Ramos-López 1 , Antonio Salmerón 1 , Rafael Rumí 1 , Ana M. Martínez 2 , Thomas D. Nielsen 2 , Andrés R. Masegosa 3 , Helge Langseth 3 , Anders L. Madsen 2,4 1 Department of Mathematics, University of Almería, Spain 2 Department of Computer Science, Aalborg University, Denmark 3 Department of Computer and Information Science, The Norwegian University of Science and Technology, Norway 4 Hugin Expert A/S, Aalborg, Denmark 8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 1

Upload: amidst-toolbox

Post on 15-Jan-2017

50 views

Category:

Science


0 download

TRANSCRIPT

Scalable MAP inference in Bayesiannetworks based on a Map-Reduce

approach

Darío Ramos-López1, Antonio Salmerón1, Rafael Rumí1,Ana M. Martínez2, Thomas D. Nielsen2, Andrés R. Masegosa3,

Helge Langseth3, Anders L. Madsen2,4

1Department of Mathematics, University of Almería, Spain2Department of Computer Science, Aalborg University, Denmark

3 Department of Computer and Information Science,The Norwegian University of Science and Technology, Norway

4 Hugin Expert A/S, Aalborg, Denmark

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 1

Outline

1 Motivation

2 MAP in CLG networks

3 Scalable MAP

4 Experimental results

5 Conclusions

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 2

Outline

1 Motivation

2 MAP in CLG networks

3 Scalable MAP

4 Experimental results

5 Conclusions

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 3

Motivation

I Aim: Provide scalable solutions to the MAP problem.

I Challenges:I Data coming in streams at high speed, and

a quick response is required.

I For each observation in the stream, themost likely configuration of a set ofvariables of interest is sought.

I MAP inference is highly complex.

I Hybrid models come along with specificdifficulties.

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 4

Context

I The AMiDST project: Analysis of MassIve Data STreamshttp://www.amidst.eu

I Large number of variablesI Queries to be answered in real timeI Hybrid Bayesian networks (involving discrete and continuous

variables)I Conditional linear Gaussian networks

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 5

Outline

1 Motivation

2 MAP in CLG networks

3 Scalable MAP

4 Experimental results

5 Conclusions

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 6

Conditional Linear Gaussian networks

Y

W

TU

S

P(Y ) = (0.5, 0.5)P(S) = (0.1, 0.9)f (w |Y = 0) = N (w ;−1, 1)f (w |Y = 1) = N (w ; 2, 1)f (t|w , S = 0) = N (t;−w , 1)f (t|w , S = 1) = N (t;w , 1)f (u|w) = N (u;w , 1)

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 7

Querying a Bayesian network

I Belief update: Computing the posterior distribution of a variable:

p(xi |xE ) =

∑xD

∫xC

p(x , xE )dxC

∑xDi

∫xCi

p(x , xE )dxCi

I Maximum a posteriori (MAP): For a set of target variables X I , seek

x∗I = argmaxx I

p(x I |XE = xE )

where p(x I |XE = xE ) is obtained by first marginalizing out fromp(x) the variables not in X I and not in XE

I Most probable explanation (MPE): A particular case of MAP whereX I includes all the unobserved variables

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 8

Querying a Bayesian network

I Belief update: Computing the posterior distribution of a variable:

p(xi |xE ) =

∑xD

∫xC

p(x , xE )dxC

∑xDi

∫xCi

p(x , xE )dxCi

I Maximum a posteriori (MAP): For a set of target variables X I , seek

x∗I = argmaxx I

p(x I |XE = xE )

where p(x I |XE = xE ) is obtained by first marginalizing out fromp(x) the variables not in X I and not in XE

I Most probable explanation (MPE): A particular case of MAP whereX I includes all the unobserved variables

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 8

Outline

1 Motivation

2 MAP in CLG networks

3 Scalable MAP

4 Experimental results

5 Conclusions

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 9

MAP by Hill Climbing

HC_MAP(B ,x I ,xE ,r)while stopping criterion not satisfied do

x∗I ← GenerateConfiguration(B , x I , xE , r)

if p(x∗I , xE ) ≥ p(x I , xE ) then

x I ← x∗I

endendreturn x I

I Max. number of non-improving iterationsI Target prob. thresholdI Max. number of iterations

I Never move to a worse configuration

I Estimated using importance sampling:

p(x I , xE ) =∑

x∗∈ΩX∗

p(x I , xE , x∗) =∑

x∗∈ΩX∗

p(x I , xE , x∗)f ∗(x∗)

f ∗(x∗)

= Ef ∗

[p(x I , xE , x∗)

f ∗(x∗)

]≈ 1

n

n∑i=1

p(x I , xE , x∗(i)

)f ∗(x∗(i)

) ,

I We’ll see later

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 10

MAP by Hill Climbing

HC_MAP(B ,x I ,xE ,r)while stopping criterion not satisfied do

x∗I ← GenerateConfiguration(B , x I , xE , r)

if p(x∗I , xE ) ≥ p(x I , xE ) then

x I ← x∗I

endendreturn x I

I Max. number of non-improving iterationsI Target prob. thresholdI Max. number of iterations

I Never move to a worse configuration

I Estimated using importance sampling:

p(x I , xE ) =∑

x∗∈ΩX∗

p(x I , xE , x∗) =∑

x∗∈ΩX∗

p(x I , xE , x∗)f ∗(x∗)

f ∗(x∗)

= Ef ∗

[p(x I , xE , x∗)

f ∗(x∗)

]≈ 1

n

n∑i=1

p(x I , xE , x∗(i)

)f ∗(x∗(i)

) ,

I We’ll see later

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 10

MAP by Hill Climbing

HC_MAP(B ,x I ,xE ,r)while stopping criterion not satisfied do

x∗I ← GenerateConfiguration(B , x I , xE , r)

if p(x∗I , xE ) ≥ p(x I , xE ) then

x I ← x∗I

endendreturn x I

I Max. number of non-improving iterationsI Target prob. thresholdI Max. number of iterations

I Never move to a worse configuration

I Estimated using importance sampling:

p(x I , xE ) =∑

x∗∈ΩX∗

p(x I , xE , x∗) =∑

x∗∈ΩX∗

p(x I , xE , x∗)f ∗(x∗)

f ∗(x∗)

= Ef ∗

[p(x I , xE , x∗)

f ∗(x∗)

]≈ 1

n

n∑i=1

p(x I , xE , x∗(i)

)f ∗(x∗(i)

) ,

I We’ll see later

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 10

MAP by Hill Climbing

HC_MAP(B ,x I ,xE ,r)while stopping criterion not satisfied do

x∗I ← GenerateConfiguration(B , x I , xE , r)

if p(x∗I , xE ) ≥ p(x I , xE ) then

x I ← x∗I

endendreturn x I

I Max. number of non-improving iterationsI Target prob. thresholdI Max. number of iterations

I Never move to a worse configuration

I Estimated using importance sampling:

p(x I , xE ) =∑

x∗∈ΩX∗

p(x I , xE , x∗) =∑

x∗∈ΩX∗

p(x I , xE , x∗)f ∗(x∗)

f ∗(x∗)

= Ef ∗

[p(x I , xE , x∗)

f ∗(x∗)

]≈ 1

n

n∑i=1

p(x I , xE , x∗(i)

)f ∗(x∗(i)

) ,

I We’ll see later

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 10

MAP by Hill Climbing

HC_MAP(B ,x I ,xE ,r)while stopping criterion not satisfied do

x∗I ← GenerateConfiguration(B , x I , xE , r)

if p(x∗I , xE ) ≥ p(x I , xE ) then

x I ← x∗I

endendreturn x I

I Max. number of non-improving iterationsI Target prob. thresholdI Max. number of iterations

I Never move to a worse configuration

I Estimated using importance sampling:

p(x I , xE ) =∑

x∗∈ΩX∗

p(x I , xE , x∗) =∑

x∗∈ΩX∗

p(x I , xE , x∗)f ∗(x∗)

f ∗(x∗)

= Ef ∗

[p(x I , xE , x∗)

f ∗(x∗)

]≈ 1

n

n∑i=1

p(x I , xE , x∗(i)

)f ∗(x∗(i)

) ,

I We’ll see later

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 10

MAP by Simulated Annealing

SA_MAP(B ,x I ,xE ,r)T ← 1 000; α← 0.90; ε > 0while T ≥ ε do

x∗I ← GenerateConfiguration(B , x I , xE , r)

Simulate a random number τ ∼ U(0, 1)if p(x∗

I , xE ) > p(x I , xE )/(T · ln(1/τ)) thenx I ← x∗

I

endT ← α · T

endreturn x I

I Default values of the temperature parametersI T 1: almost completely randomI T 1: almost completely greedy

I Accept x∗I if its prob. increases

or decreases < T · ln(1/τ)I Cool down the temperature

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 11

MAP by Simulated Annealing

SA_MAP(B ,x I ,xE ,r)T ← 1 000; α← 0.90; ε > 0while T ≥ ε do

x∗I ← GenerateConfiguration(B , x I , xE , r)

Simulate a random number τ ∼ U(0, 1)if p(x∗

I , xE ) > p(x I , xE )/(T · ln(1/τ)) thenx I ← x∗

I

endT ← α · T

endreturn x I

I Default values of the temperature parametersI T 1: almost completely randomI T 1: almost completely greedy

I Accept x∗I if its prob. increases

or decreases < T · ln(1/τ)I Cool down the temperature

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 11

MAP by Simulated Annealing

SA_MAP(B ,x I ,xE ,r)T ← 1 000; α← 0.90; ε > 0while T ≥ ε do

x∗I ← GenerateConfiguration(B , x I , xE , r)

Simulate a random number τ ∼ U(0, 1)if p(x∗

I , xE ) > p(x I , xE )/(T · ln(1/τ)) thenx I ← x∗

I

endT ← α · T

endreturn x I

I Default values of the temperature parametersI T 1: almost completely randomI T 1: almost completely greedy

I Accept x∗I if its prob. increases

or decreases < T · ln(1/τ)

I Cool down the temperature

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 11

MAP by Simulated Annealing

SA_MAP(B ,x I ,xE ,r)T ← 1 000; α← 0.90; ε > 0while T ≥ ε do

x∗I ← GenerateConfiguration(B , x I , xE , r)

Simulate a random number τ ∼ U(0, 1)if p(x∗

I , xE ) > p(x I , xE )/(T · ln(1/τ)) thenx I ← x∗

I

endT ← α · T

endreturn x I

I Default values of the temperature parametersI T 1: almost completely randomI T 1: almost completely greedy

I Accept x∗I if its prob. increases

or decreases < T · ln(1/τ)

I Cool down the temperature

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 11

Generating a new configuration of variables

Only discrete variablesI The new values are chosen at random

A variable whose parents are discrete or observed

Y

W

TU

S

P(Y ) = (0.5, 0.5)P(S) = (0.1, 0.9)f (w |Y = 0) = N (w ;−1, 1)f (w |Y = 1) = N (w ; 2, 1)f (t|w ,S = 0) = N (t;−w , 1)f (t|w ,S = 1) = N (t;w , 1)f (u|w) = N (u;w , 1)

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 12

Generating a new configuration of variables

Hybrid modelsI We take advantage of the properties of the CLG distribution

A variable whose parents are discrete or observed

Y

W

TU

S

P(Y ) = (0.5, 0.5)P(S) = (0.1, 0.9)f (w |Y = 0) = N (w ;−1, 1)f (w |Y = 1) = N (w ; 2, 1)f (t|w ,S = 0) = N (t;−w , 1)f (t|w ,S = 1) = N (t;w , 1)f (u|w) = N (u;w , 1)

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 12

Generating a new configuration of variables

Hybrid modelsI We take advantage of the properties of the CLG distribution

A variable whose parents are discrete or observed

Y

W

TU

S

P(Y ) = (0.5, 0.5)P(S) = (0.1, 0.9)f (w |Y = 0) = N (w ;−1, 1)f (w |Y = 1) = N (w ; 2, 1)f (t|w ,S = 0) = N (t;−w , 1)f (t|w ,S = 1) = N (t;w , 1)f (u|w) = N (u;w , 1)

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 12

Generating a new configuration of variables

Hybrid modelsI We take advantage of the properties of the CLG distribution

A variable whose parents are discrete or observed

Y

W

TU

S

Return its conditional mean:

f (w |Y = 0) = N (w ;−1, 1)f (w |Y = 1) = N (w ; 2, 1)

A variable with unobserved continuous parents

Y

W

TU

S

Simulate a value usingits conditional distribution:

f (t|w ,S = 0) = N (t;−w , 1)f (t|w ,S = 1) = N (t;w , 1)

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 13

Generating a new configuration of variables

Hybrid modelsI We take advantage of the properties of the CLG distribution

A variable with unobserved continuous parents

Y

W

TU

S

Simulate a value usingits conditional distribution:

f (t|w ,S = 0) = N (t;−w , 1)f (t|w ,S = 1) = N (t;w , 1)

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 13

Scalable implementation

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 14

Outline

1 Motivation

2 MAP in CLG networks

3 Scalable MAP

4 Experimental results

5 Conclusions

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 15

Experimental analysis

PurposeAnalyze the scalability in terms of

I SpeedI Accuracy

Experimental setupI Synthetic networks with 200 variables (50% discrete)I 70% of the variables observed at randomI 10% of the variables selected as target ⇒ 20% to be

marginalized out

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 16

Experimental analysis

PurposeAnalyze the scalability in terms of

I SpeedI Accuracy

Experimental setupI Synthetic networks with 200 variables (50% discrete)I 70% of the variables observed at randomI 10% of the variables selected as target ⇒ 20% to be

marginalized out

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 16

Experimental analysis

Computing environmentI AMIDST Toolbox with Apache FlinkI Multi-core environment based on a dual-processor AMD

Opteron 2.8 GHz server with 32 cores and 64 GB of RAM,running Ubuntu Linux 14.04.1 LTS

I Multi-node environment based on Amazon Web Services(AWS)

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 17

Scalability: run times

100

200

300

400

500

600

1 2 4 8 16 24 32Number of cores

Exe

cutio

n tim

e (

s)

method HC Global HC Local SA Global SA Local

Scalability of MAP in a multi−core node

5

10

15

Sp

ee

d-u

p fa

ctor 5

10

15

20

1 2 4 8 16 24 32Number of nodes (4 vCPUs per node)

Sp

ee

d−

up fact

or

method HC Global HC Local SA Global SA Local

Scalability of MAP in a multi−node cluster

25

50

75

100

125

1 2 4 8 16 24 32Number of nodes (4 vCPUs per node)

Exe

cutio

n t

ime

(s)

method HC Global HC Local SA Global SA Local

Scalability of MAP in a multi−node cluster

5

10

15

20

Sp

ee

d-u

p fa

ctor

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 18

Scalability: accuracy (Simulated Annealing)

−230

−220

−210

−200

−190

1 2 4 8 16 32Number of cores

log

Pro

ba

bility

cores

1

2

4

8

16

32

Simulated Annealing global

−200

−195

−190

−185

1 2 4 8 16 32Number of cores

log

Pro

ba

bility

cores

1

2

4

8

16

32

Simulated Annealing local

Estimated log-probabilities of the MAP configurations found by each algorithm

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 19

Scalability: accuracy (Hill Climbing)

−240

−220

−200

1 2 4 8 16 32Number of cores

log

Pro

ba

bility

cores

1

2

4

8

16

32

Hill Climbing global

−240

−220

−200

1 2 4 8 16 32Number of cores

log

Pro

ba

bility

cores

1

2

4

8

16

32

Hill Climbing local

Estimated log-probabilities of the MAP configurations found by each algorithm

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 20

Outline

1 Motivation

2 MAP in CLG networks

3 Scalable MAP

4 Experimental results

5 Conclusions

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 21

Conclusions

I Scalable MAP for CLG models in terms of accuracy and runtime

I Available in the AMIDST ToolboxI Valid for multi-cores and cluster systemsI MapReduce-based design on top of Apache Flink

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 22

Thank you for your attentionYou can download our open source Java toolbox:

http://www.amidsttoolbox.com

Acknowledgments: This project has received funding from the European Union’sSeventh Framework Programme for research, technological development and

demonstration under grant agreement no 619209

8th Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016 23