
Page 1: Causal Modeling with Generative Neural Networks

Causal Modeling with Generative Neural Networks

Michele Sebag, TAO, CNRS − INRIA − LRI − Université Paris-Sud

Joint work: D. Kalainathan, O. Goudet, I. Guyon, M. Hajaiej, A. Decelle, C. Furtlehner

https://arxiv.org/abs/1709.05321
Credit for slides: Yann LeCun

Leiden − Sept. 2017

1 / 27

Page 2: Causal Modeling with Generative Neural Networks

Motivation

State of the art

Causal Generative Neural Nets

Naive ML Approach to SW

2 / 27

Page 3: Causal Modeling with Generative Neural Networks

ML: discriminative or generative modelling

Given a training set, usually iid samples ∼ P(X, Y):

E = {(x_i, y_i), x_i ∈ ℝ^d, i ∈ [[1, n]]}

Find

• Supervised learning: h : X → Y, or P(Y|X)

• Generative model: P(X, Y)

Predictive modelling might be based on correlations:
If umbrellas in the street, Then it rains

3 / 27

Page 4: Causal Modeling with Generative Neural Networks

The big data promise: ML models are expected to support interventions:

• health and nutrition

• education

• economics/management

• climate

Intervention (Pearl 2009)

Intervention do(X = x) forces variables X to value x

Direct cause X_i → X_j:

P(X_j | do(X_i = x, X_∖ij = c)) ≠ P(X_j | do(X_i = x′, X_∖ij = c))

Example: C: Cancer, S: Smoking, G: Genetic factors
P(C | do{S = 0, G = 0}) ≠ P(C | do{S = 1, G = 0})
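To make the do-notation concrete, here is a minimal simulation sketch. The structural equations and all probabilities below are illustrative assumptions, not from the slides; the point is that do(S = s) overrides S's own mechanism, and the two interventional distributions differ:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def sample(do_s=None):
    # Toy FCM (hypothetical): G -> S and (S, G) -> C.
    g = rng.random(n) < 0.3
    # do(S = s) replaces S's mechanism; otherwise S depends on G.
    s = np.full(n, do_s) if do_s is not None else rng.random(n) < np.where(g, 0.6, 0.2)
    c = rng.random(n) < 0.05 + 0.30 * s + 0.10 * g
    return g, s, c

# Estimate P(C | do(S = s), G = 0) for s = 0, 1; they differ, so S -> C.
for s in (False, True):
    g, _, c = sample(do_s=s)
    print(f"P(C | do(S={int(s)}), G=0) ~ {c[~g].mean():.3f}")
```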

4 / 27

Page 5: Causal Modeling with Generative Neural Networks

Correlations do not support interventions

Causal models are needed to support interventions

5 / 27

Page 6: Causal Modeling with Generative Neural Networks

Why is this relevant to space weather?

Causal models support understanding

Causal models are more robust, e.g., to concept drift

• Given observations drawn after P(X), P(Y|X),
  find P̂(Y|X) that minimizes

  E_{x∼P(X)} [ argmax_y P(y|x) − argmax_y P̂(y|x) ]

• But P(X) in production might differ from P(X) in training

6 / 27

Page 7: Causal Modeling with Generative Neural Networks

Causal modelling, how?

Historically, based on interventions. However, interventions are often:

• impossible (e.g., climate)

• unethical (e.g., making people smoke)

• too expensive (e.g., in economics)

Machine Learning alternatives

• Observational data

• Statistical tests

• Learned models

• Prior knowledge / Assumptions / Constraints

7 / 27

Page 8: Causal Modeling with Generative Neural Networks

Motivation

State of the art

Causal Generative Neural Nets

Naive ML Approach to SW

8 / 27

Page 9: Causal Modeling with Generative Neural Networks

Functional Causal Models, a.k.a. Structural Equation Models

X_i = f_i(Pa(X_i), E_i)

Pa(X_i): the direct causes of X_i
E_i: a noise variable gathering all unobserved influences

X_1 = f_1(E_1)
X_2 = f_2(X_1, E_2)
X_3 = f_3(X_1, E_3)
X_4 = f_4(E_4)
X_5 = f_5(X_3, X_4, E_5)

Tasks

• Finding the structure of the graph (no cycles)

• Finding the functions (f_i)
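As a concrete illustration, the example graph above can be sampled ancestrally, parents before children. The mechanisms f_i and the Gaussian noise below are hypothetical choices; the slide leaves them unspecified:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Exogenous noise variables E_1 ... E_5 (assumed standard Gaussian).
E = rng.normal(size=(5, n))

# Hypothetical mechanisms f_i, evaluated in topological order.
X1 = E[0]                        # X1 = f1(E1)
X2 = np.tanh(X1) + 0.1 * E[1]    # X2 = f2(X1, E2)
X3 = 0.5 * X1**2 + 0.1 * E[2]    # X3 = f3(X1, E3)
X4 = E[3]                        # X4 = f4(E4)
X5 = X3 * X4 + 0.1 * E[4]        # X5 = f5(X3, X4, E5)
```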

9 / 27

Page 10: Causal Modeling with Generative Neural Networks

Conducting a causal modelling study

Milestones

• Testing bivariate independence (statistical tests): find edges X − Y, Y − Z

• Conditional independence: prune the edges, X ⊥⊥ Z | Y

• Full causal graph modelling: orient the edges, X → Y → Z

Challenges

• Computational complexity → tractable approximations

• Conditional independence: data-hungry tests

• Assuming causal sufficiency: can be relaxed

10 / 27

Page 11: Causal Modeling with Generative Neural Networks

X − Y independence

P(X, Y) =? P(X) · P(Y)

Categorical variables

• Entropy: H(X) = − Σ_x p(x) log p(x), where x ranges over the values taken by X and p(x) is its frequency

• Mutual information: M(X, Y) = H(X) + H(Y) − H(X, Y)

• Others: χ², G-test

Continuous variables

• t-test, z-test

• Hilbert-Schmidt Independence Criterion (HSIC), Gretton et al. 05:

  Cov(f, g) = E_{x,y}[f(x) g(y)] − E_x[f(x)] E_y[g(y)]

  • Given f : X → ℝ and g : Y → ℝ
  • Cov(f, g) = 0 for all f, g iff X and Y are independent
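A minimal plug-in estimator for the categorical case, directly from the two formulas above (empirical frequencies stand in for p(x)):

```python
from collections import Counter
from math import log

def entropy(values):
    # H(X) = -sum_x p(x) log p(x), with p(x) the empirical frequency of x
    n = len(values)
    return -sum(c / n * log(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    # M(X, Y) = H(X) + H(Y) - H(X, Y)
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# M(X, Y) = 0 iff X and Y are (empirically) independent;
# here X and Y are fully dependent, so M(X, Y) = H(X) = log 2.
print(mutual_information([0, 1, 0, 1], [1, 0, 1, 0]))
```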

11 / 27

Page 12: Causal Modeling with Generative Neural Networks

An ML approach (Guyon et al., 2014-2015)

Learn a classifier that maps a pair of variables to its causal label, from examples

E = {(A_i, B_i, ℓ_i), ℓ_i ∈ {→, ←, ⊥⊥}}

12 / 27

Page 13: Causal Modeling with Generative Neural Networks

Exploiting the distribution asymmetry

Hoyer et al. 09; Mooij et al. 2016

[Figure: scatter plot of (X, Y) with both fits Y = f(X) and X = g(Y); residual f(X) − Y plotted against X; residual g(Y) − X plotted against Y]

True model, with noise ε independent of X:

Y = X + ε

Learn Y = f(X), plot the residual Y − f(X)
Learn X = g(Y), plot the residual X − g(Y)

13 / 27


Page 16: Causal Modeling with Generative Neural Networks

Exploiting the asymmetry, 2

Given A, B, learn

• A = f(B)

• B = g(A)

Retain the model with the best fit: A → B

A: Altitude of city, B: Temperature
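A sketch of this fit-and-compare heuristic. The regressor choice and the raw residual-variance score are assumptions for illustration; methods such as ANM instead test independence of the residual from the input:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def residual_variance(cause, effect):
    # Regress effect on cause; the leftover variance measures the lack of fit.
    gp = GaussianProcessRegressor().fit(cause.reshape(-1, 1), effect)
    return np.var(effect - gp.predict(cause.reshape(-1, 1)))

def orient(a, b):
    # Standardize so the two directions are comparable.
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return "A -> B" if residual_variance(a, b) < residual_variance(b, a) else "B -> A"
```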

15 / 27

Page 17: Causal Modeling with Generative Neural Networks

Find V-structures: A ⊥⊥ C and ¬(A ⊥⊥ C | B)

Explaining away: two independent causes become dependent once their common effect B is observed
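A quick numerical check of the explaining-away effect, on a toy linear collider (an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=100_000)
c = rng.normal(size=100_000)
b = a + c                                   # V-structure: A -> B <- C

print(np.corrcoef(a, c)[0, 1])              # ~ 0: A and C are independent
near_zero = np.abs(b) < 0.1                 # condition on B ~ 0
print(np.corrcoef(a[near_zero], c[near_zero])[0, 1])  # ~ -1: dependent given B
```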

16 / 27

Page 18: Causal Modeling with Generative Neural Networks

Motivation

State of the art

Causal Generative Neural Nets

Naive ML Approach to SW

17 / 27

Page 19: Causal Modeling with Generative Neural Networks

Auto-Encoders

Training set: E = {x_i ∈ ℝ^d, i = 1 . . . n}

Structure of Auto-Encoder

Minimization of the Mean Squared Error (MSE):

Minimize Σ_i ||x_i − x′_i||²

Output: z, a compressed representation of x
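A minimal training-step sketch in PyTorch (layer sizes, depth, and optimizer are illustrative assumptions):

```python
import torch
import torch.nn as nn

d, k = 784, 32                             # input dim, bottleneck dim (assumed)
encoder = nn.Sequential(nn.Linear(d, k), nn.ReLU())
decoder = nn.Linear(k, d)
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

def train_step(x):                         # x: batch of inputs, shape (b, d)
    z = encoder(x)                         # z: compressed representation
    x_rec = decoder(z)                     # x': reconstruction of x
    loss = ((x - x_rec) ** 2).sum(dim=1).mean()   # MSE, sum_i ||x_i - x'_i||^2
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```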

18 / 27

Page 20: Causal Modeling with Generative Neural Networks

Stacked Auto-Encoders

E = {x_i ∈ ℝ^d, i = 1 . . . n}

Differences

• Several hidden layers

• Minimize the MSE or the cross-entropy loss:

Minimize − Σ_{i,j} [ x_{i,j} log x′_{i,j} + (1 − x_{i,j}) log(1 − x′_{i,j}) ]

19 / 27

Page 21: Causal Modeling with Generative Neural Networks

Variational Auto-Encoders

Kingma et al. 13

E = {x_i ∈ ℝ^d, i = 1 . . . n}

Differences

• Hidden layer: parameters (µ, σ) of a distribution N(µ, σ²)

• Distribution used to generate values: z = µ + σ × ε, with ε ∼ N(0, 1)
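The corresponding sampling step, a sketch of the reparameterization trick (predicting log σ² rather than σ is a common convention, assumed here):

```python
import torch

def sample_latent(mu, log_var):
    # z = mu + sigma * eps, eps ~ N(0, 1): the sample stays differentiable
    # with respect to the encoder outputs (mu, log_var).
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps
```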

20 / 27


Page 23: Causal Modeling with Generative Neural Networks

Causal Generative Neural Nets (Goudet et al. 17)

Observed data: E = {x_i ∈ ℝ^d, i = 1 . . . n}

Generated data: E′ = {x′_i ∈ ℝ^d, i = 1 . . . n′}

• Train the generator to minimize the “distance” between original and generated data in ℝ^d:

MMD(G) = (1/n²) Σ_{i,j} k(x_i, x_j) + (1/n′²) Σ_{i,j} k(x′_i, x′_j) − (2/(n n′)) Σ_{i,j} k(x_i, x′_j)

k(x, z) = Σ_i exp(−γ_i ||x − z||²), with γ_i in {10⁻² . . . 10²}
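A plug-in estimate of this quantity in NumPy. The bandwidth grid follows the slide; everything else is a sketch, not the authors' implementation:

```python
import numpy as np

def mmd(x, x_gen, gammas=np.logspace(-2, 2, 5)):
    # Multi-bandwidth Gaussian kernel k(x, z) = sum_i exp(-gamma_i ||x - z||^2)
    def gram(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # pairwise ||a - b||^2
        return sum(np.exp(-g * d2) for g in gammas)
    n, m = len(x), len(x_gen)
    return (gram(x, x).sum() / n**2
            + gram(x_gen, x_gen).sum() / m**2
            - 2 * gram(x, x_gen).sum() / (n * m))
```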

22 / 27

Page 24: Causal Modeling with Generative Neural Networks

Relaxing the causal sufficiency assumption

Shared noise variables E_{i,j} model unobserved common causes (confounders) of X_i and X_j:

X_2 = f_2(E_2, E_{2,3})
X_3 = f_3(E_3, E_{2,3}, E_{3,5})
X_4 = f_4(E_4, E_{4,5})
X_5 = f_5(X_3, X_4, E_5, E_{3,5}, E_{4,5})

23 / 27

Page 25: Causal Modeling with Generative Neural Networks

Graph inference

Results: Area under the precision/recall curve

Algorithm              G2              G3              G4
Constraint-based
  PC-Gaussian          82.3 ±4 (87.8)  80.0 ±7 (89.2)  88.1 ±10 (95.7)
  PC-HSIC              93.4 ±3 (78.5)  93.0 ±4 (77.9)  98.9 ±2  (88.0)
Score-based
  GES                  75.3 ±7 (81.2)  73.6 ±7 (77.7)  69.3 ±11 (78.6)
Pairwise orientation
  LiNGAM               64.4 ±4 (100)   71.1 ±1 (100)   71.6 ±7  (100)
  ANM                  72.9 ±9 (100)   72.5 ±4 (100)   79.9 ±5  (100)
  Jarfo                69.9 ±9 (100)   87.3 ±3 (100)   88.5 ±5  (100)
  CGNN-Fourier         94.5 ±2 (100)   84.9 ±9 (100)   93.6 ±3  (100)
  CGNN-MMD             96.9 ±1 (100)   96.5 ±3 (100)   97.2 ±3  (100)

Python framework available at:
https://github.com/Diviyan-Kalainathan/CausalDiscoveryToolbox

Caveat: scales up to 50 variables

24 / 27

Page 26: Causal Modeling with Generative Neural Networks

Motivation

State of the art

Causal Generative Neural Nets

Naive ML Approach to SW

25 / 27

Page 27: Causal Modeling with Generative Neural Networks

Compact solar state representations

Page 28: Causal Modeling with Generative Neural Networks

Principle

9

Page 29: Causal Modeling with Generative Neural Networks

Image preprocessing

10

Page 30: Causal Modeling with Generative Neural Networks

Autoencoders

• Dimensionality reduction

• Input and output similarity

• Bottleneck

• 256x256 → 512

11


Page 34: Causal Modeling with Generative Neural Networks

Autoencoders

• 512x512 → 512

12

Page 35: Causal Modeling with Generative Neural Networks

Autoencoders

• 256x256 → 64

13

Page 36: Causal Modeling with Generative Neural Networks

Variational Autoencoder

• Assumption on the latent space distribution

• 256x256 → 90

14

Page 37: Causal Modeling with Generative Neural Networks

Autoencoders training

• Intermediate image size

• Custom loss:

  loss = (y_true − y_pred)² / (y_true + ε)^α + (y_true − y_pred)² / (1 − y_true + ε)^α
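The same loss written out in NumPy (the ε and α values are unspecified on the slide and assumed here); it up-weights errors on pixels near 0 and near 1:

```python
import numpy as np

def custom_loss(y_true, y_pred, eps=1e-6, alpha=1.0):
    # Squared error reweighted so that very dark (y ~ 0) and very bright
    # (y ~ 1) pixels both contribute strongly; eps avoids division by zero.
    se = (y_true - y_pred) ** 2
    return np.mean(se / (y_true + eps) ** alpha
                   + se / (1 - y_true + eps) ** alpha)
```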

15


Page 39: Causal Modeling with Generative Neural Networks

Results

Autoencoder      Conv    Conv + Dense  Conv + PCA  Variational
Reduction rate   1/128   1/1024        1/524       1/728

• Visual similarity

• Smoothness over time

• Classification for verification

16


Page 42: Causal Modeling with Generative Neural Networks

Results

Event            precision  recall  accuracy  F1-score
Coronal hole     0.74       0.36    0.62      0.48
Lepping          0.90       0.51    0.77      0.65
Pseudo streamer  0.66       0.93    0.78      0.77
Strahl           0.55       0.98    0.73      0.70

* Random-predictor performance: 0.625 for accuracy, 0.25 for the other metrics

• Only 8000 labeled images

• Time distribution

• Prediction at L1

• Low performance

• Let’s extract more information

17


Page 47: Causal Modeling with Generative Neural Networks

Going further

Classification of solar events

• More data

• Caveat: the train/test split

Predicting data at L1

• The propagation time from the Sun to L1

• Help needed!

26 / 27

Page 48: Causal Modeling with Generative Neural Networks

Thanks

• Olivier Goudet, Diviyan Kalainathan, Isabelle Guyon, Aris Tritas

• Mhamed Hajaiej, Cyril Furtlehner, Aurélien Decelle

27 / 27