
Feature Selection and Causal Discovery

Isabelle Guyon, Clopinet

André Elisseeff, IBM Zürich

Constantin Aliferis, Vanderbilt University

Road Map

• What is feature selection?

• Why is it hard?

• What works best in practice?

• How to make progress using causality?

• Can causal discovery benefit from feature selection?

[Diagram: the two topics of the talk, feature selection and causal discovery.]

Introduction

Causal discovery

• What affects your health?

• What affects the economy?

• What affects climate change?

and…

Which actions will have beneficial effects?

Feature Selection

Remove features Xi to improve (or at least not degrade) the prediction of Y.

[Diagram: input features X and target Y.]

Uncovering Dependencies

Factors of variability:

• actual vs. artifactual
• known vs. unknown
• observable vs. unobservable
• controllable vs. uncontrollable

Predictions and Actions

[Diagram: features X and target Y, under prediction vs. under action.]

See e.g. Judea Pearl, “Causality”, 2000

Predictive power of causes and effects

[Causal graph over: Anxiety, Smoking, Lung disease, Allergy, Coughing.]

Smoking is a better predictor of lung disease than coughing.

“Causal feature selection”

• Abandon the usual motto of predictive modeling: “we don’t care about causality”.

• Feature selection may benefit from introducing a notion of causality:
  – to be able to predict the consequences of given actions;
  – to add robustness to the predictions if the input distribution changes;
  – to get more compact and robust feature sets.

“FS-enabled causal discovery”

Isn’t causal discovery solved with experiments?

• No! Randomized Controlled Trials (RCTs) may be:
  – unethical (e.g. an RCT about the effects of smoking);
  – costly and time consuming;
  – impossible (e.g. astronomy).

• Observational data may be available to help plan future experiments. Causal discovery may benefit from feature selection.

Feature selection basics

Individual Feature Irrelevance

P(Xi, Y) = P(Xi) P(Y), or equivalently P(Xi | Y) = P(Xi)

[Figure: density of xi for an individually irrelevant feature.]
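This criterion can be checked empirically with any independence test. Below is a minimal Python sketch (assuming NumPy and SciPy are available) using a chi-square test on discrete data; the function name and toy data are illustrative, not from the slides.

```python
# Minimal illustration: chi-square test of P(Xi, Y) = P(Xi) P(Y) for one
# discrete feature. All names and data are made up for this example.
import numpy as np
from scipy.stats import chi2_contingency

def individually_irrelevant(xi, y, alpha=0.05):
    """True if we find no evidence against Xi being independent of Y."""
    xs, ys = np.unique(xi), np.unique(y)
    # contingency table of joint counts n(xi = a, y = b)
    table = np.array([[np.sum((xi == a) & (y == b)) for b in ys] for a in xs])
    return chi2_contingency(table)[1] >= alpha   # [1] is the p-value

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)              # binary target
x_noise = rng.integers(0, 3, size=500)        # generated independently of y
print(individually_irrelevant(x_noise, y))    # expected: True
```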

Individual Feature Relevance

[Figure: ROC curve (sensitivity vs. specificity); the area under the curve (AUC), between 0 and 1, scores the relevance of an individual feature.]
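For concreteness, here is a short Python sketch (assuming scikit-learn) of ranking features by the AUC each achieves on its own; the helper name and toy data are made up.

```python
# Minimal sketch: score each feature by its individual AUC, as in the
# ROC/AUC criterion above. Names and data are illustrative only.
import numpy as np
from sklearn.metrics import roc_auc_score

def rank_by_auc(X, y):
    aucs = np.array([roc_auc_score(y, X[:, j]) for j in range(X.shape[1])])
    relevance = np.abs(aucs - 0.5)              # 0.5 = chance level
    return np.argsort(relevance)[::-1], aucs    # most relevant feature first

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = np.column_stack([y + rng.normal(size=300),  # informative feature
                     rng.normal(size=300)])     # pure noise feature
order, aucs = rank_by_auc(X, y)
print(order, np.round(aucs, 2))
```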

Univariate selection may fail

Guyon-Elisseeff, JMLR 2004; Springer 2006

Multivariate FS is complex

n features, 2^n possible feature subsets!

Kohavi-John, 1997

FS strategies

• Wrappers:
  – Use the target risk functional to evaluate feature subsets.
  – Train one learning machine for each feature subset investigated.

• Filters:
  – Use an evaluation function other than the target risk functional.
  – Often no learning machine is involved in the feature selection process.

Reducing complexity

• For wrappers (see the sketch after this list):
  – Use forward or backward selection: O(n²) steps.
  – Mix forward and backward search, e.g. floating search.

• For filters:
  – Use a cheap evaluation function (no learning machine).
  – Make independence assumptions: n evaluations.

• Embedded methods:
  – Do not retrain the learning machine at every step: e.g. RFE, n steps.
  – Search the feature-subset space and the parameter space of the learning machine simultaneously: e.g. 1-norm/Lasso approaches.
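As a concrete illustration, here is a minimal sketch of the wrapper strategy with greedy forward selection (O(n²) model fits, as noted above). It assumes scikit-learn; the estimator, scoring, and function name are illustrative choices, not prescribed by the slides.

```python
# Greedy forward selection as a wrapper: at each step, add the feature whose
# inclusion gives the best cross-validated score of the learning machine.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, n_select, cv=5):
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select and remaining:
        scores = {j: cross_val_score(LogisticRegression(max_iter=1000),
                                     X[:, selected + [j]], y, cv=cv).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# usage with hypothetical data: subset = forward_selection(X, y, n_select=3)
```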

In practice…

• Univariate feature selection often yields better accuracy results than multivariate feature selection.

• NO feature selection at all sometimes gives the best accuracy results, even in the presence of known distracters.

• Multivariate methods usually claim only better “parsimony”.

• How can we make multivariate FS work better?

NIPS 2003 and WCCI 2006 challenges : http://clopinet.com/challenges

Definition of “irrelevance”

• We want to determine whether one variable Xi is “relevant” to the target Y.

• Surely irrelevant feature:

P(Xi, Y | S\i) = P(Xi | S\i) P(Y | S\i)

for all S\i ⊆ X\i (every subset of the remaining features)
for all assignments of values to S\i

(A brute-force illustration of this definition follows below.)

Are all non-irrelevant features relevant?
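The definition quantifies over every conditioning subset S\i and every assignment of values to it, which is exponential in the number of features. The Python sketch below (assuming SciPy, discrete data, and a chi-square test as an illustrative choice) is only meant to make those quantifiers concrete on toy-sized problems.

```python
# Brute-force check of "surely irrelevant": Xi passes only if no dependence
# between Xi and Y is detected in any conditioning subset S\i and any
# assignment of values to it. Exponential cost -- illustration only.
import itertools
import numpy as np
from scipy.stats import chi2_contingency

def surely_irrelevant(X, y, i, alpha=0.05):
    rest = [j for j in range(X.shape[1]) if j != i]
    for size in range(len(rest) + 1):
        for S in itertools.combinations(rest, size):
            cols = list(S)
            # one stratum per assignment of values to the conditioning set S
            strata = ([np.ones(len(y), bool)] if not cols else
                      [np.all(X[:, cols] == v, axis=1)
                       for v in np.unique(X[:, cols], axis=0)])
            for mask in strata:
                xi, yy = X[mask, i], y[mask]
                xs, ys = np.unique(xi), np.unique(yy)
                if len(xs) < 2 or len(ys) < 2:
                    continue  # Xi or Y is constant in this stratum: nothing to test
                table = np.array([[np.sum((xi == a) & (yy == b)) for b in ys]
                                  for a in xs])
                if chi2_contingency(table)[1] < alpha:
                    return False  # dependence found in some context
    return True
```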

Causality enters the picture

Causal Bayesian networks

• Bayesian network:
  – Graph with random variables X1, X2, …, Xn as nodes.
  – Dependencies represented by edges.
  – Allows us to compute P(X1, X2, …, Xn) as ∏i P(Xi | Parents(Xi)) (small numeric sketch below).
  – Edge directions have no meaning.

• Causal Bayesian network: edge directions indicate causality.
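To make the factorization concrete, here is a small Python sketch over the lung-disease variables used in this talk; the network edges and all probability values are assumptions for illustration, not taken from the slides.

```python
# Joint probability from the factorization P(X1,...,Xn) = prod_i P(Xi | Parents(Xi)).
# Binary variables; the parent sets and conditional probability tables are made up.
parents = {"Anxiety": [], "Allergy": [], "Smoking": ["Anxiety"],
           "LungDisease": ["Smoking"], "Coughing": ["LungDisease", "Allergy"]}
p_true = {  # P(node = 1 | parent values), keyed by the tuple of parent values
    "Anxiety":     {(): 0.3},
    "Allergy":     {(): 0.2},
    "Smoking":     {(0,): 0.2, (1,): 0.6},
    "LungDisease": {(0,): 0.05, (1,): 0.3},
    "Coughing":    {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.9},
}

def joint(assignment):
    """P(assignment) as the product over nodes of P(xi | parents(xi))."""
    prob = 1.0
    for node, pa in parents.items():
        p1 = p_true[node][tuple(assignment[q] for q in pa)]
        prob *= p1 if assignment[node] == 1 else 1.0 - p1
    return prob

print(joint({"Anxiety": 1, "Smoking": 1, "LungDisease": 0,
             "Allergy": 0, "Coughing": 1}))
```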

Markov blanket

[Causal graph over Anxiety, Smoking, Lung disease, Allergy, Coughing (same variables as before).]

A node is conditionally independent of all other nodes given its Markov blanket.
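In a causal graph, the Markov blanket of a node is its parents, its children, and its children's other parents ("spouses"). Below is a tiny Python sketch; the edge list is my assumed reading of the lung-disease figure, which is not given explicitly in the transcript.

```python
# Markov blanket of a node in a DAG: parents + children + children's other parents.
edges = [("Anxiety", "Smoking"), ("Smoking", "LungDisease"),
         ("LungDisease", "Coughing"), ("Allergy", "Coughing")]  # assumed edges

def markov_blanket(node, edges):
    parents  = {a for a, b in edges if b == node}
    children = {b for a, b in edges if a == node}
    spouses  = {a for a, b in edges if b in children and a != node}
    return parents | children | spouses

print(markov_blanket("LungDisease", edges))
# {'Smoking', 'Coughing', 'Allergy'}: given these, LungDisease is independent
# of the remaining node (Anxiety).
```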

Relevance revisited

In terms of Bayesian networks, for “faithful” distributions:

• Strongly relevant features = members of the Markov Blanket

• Weakly relevant features = variables with a path to the Markov Blanket but not in the Markov Blanket

• Irrelevant features = variables with no path to the Markov Blanket

Koller-Sahami, 1996; Kohavi-John, 1997; Aliferis et al., 2002.

Is X2 “relevant”? (Example 1)

Variables: peak (X1), baseline (X2), health (Y).

P(X1, X2, Y) = P(X1 | X2, Y) P(X2) P(Y)

X2 ⊥ Y marginally, but X2 and Y become dependent once X1 is known: the baseline looks individually irrelevant, yet it carries information about Y together with the peak.

[Scatter plot: x1 (peak) vs. x2 (baseline), classes normal vs. disease.]

Are X1 and X2 “relevant”? (Example 2)

Variables: peak (X1), sample processing time (X2), health (Y).

P(X1, X2, Y) = P(X1 | X2, Y) P(X2) P(Y)

From this factorization, X2 ⊥ Y, while X1 generically depends on both Y and X2.

[Scatter plot: peak vs. sample processing time, classes normal vs. disease.]

XOR and unfaithfulness

Y = X1 ⊕ X2. Pairwise, X1 ⊥ Y, X2 ⊥ Y, and X1 ⊥ X2, yet Y is a deterministic function of the pair:

X1   X2    Y
 1    1    1
 1   -1   -1
-1    1   -1
-1   -1    1

Example: X1 and X2 are two fair coins tossed at random; Y = win if both coins end on the same side.

[Diagrams: candidate causal graphs over X1, Y, X2; scatter plot of x1 vs. y.]
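A quick numeric check of the XOR example (made-up simulation code, NumPy assumed): each coin alone carries no information about Y, yet together they determine it, which is exactly why univariate selection can fail and why the distribution is unfaithful.

```python
# XOR / unfaithfulness check: pairwise statistics see nothing, the pair sees everything.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.choice([-1, 1], size=10_000)   # fair coin 1
x2 = rng.choice([-1, 1], size=10_000)   # fair coin 2
y = x1 * x2                             # win iff both coins show the same side

print(round(np.corrcoef(x1, y)[0, 1], 3))   # ~0: X1 alone looks irrelevant
print(round(np.corrcoef(x2, y)[0, 1], 3))   # ~0: X2 alone looks irrelevant
print(np.mean(y == x1 * x2))                # 1.0: jointly, X1 and X2 determine Y
```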

Adding a variable…

… can make another one irrelevant.

[Scatter plot: x1 vs. y, with strata defined by X2.]

Simpson’s paradox

X1 ⊥ Y | X2   (Example 3)

Is chocolate good for your health? (Example 3)

Variables: chocolate intake (X1), life expectancy (Y), gender (X2).

[Scatter plot: chocolate intake vs. life expectancy, with separate male and female groups.]

X1 ⊥ Y | X2 … conclusion: no evidence that eating chocolate makes you live longer.
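The gender version of this pattern is easy to reproduce with a few lines of made-up simulation (Python, NumPy assumed): life expectancy depends only on gender, gender also drives chocolate intake, so the pooled correlation is positive while every stratum is flat.

```python
# Simpson's-paradox pattern: X2 (gender) is a common cause of X1 and Y,
# so X1 and Y look associated when pooled but satisfy X1 _|_ Y | X2.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
gender = rng.integers(0, 2, size=n)                      # X2
chocolate = rng.normal(loc=2 + 3 * gender, scale=1.0)    # X1 depends on gender
life_exp = rng.normal(loc=76 + 5 * gender, scale=2.0)    # Y depends on gender only

print(round(np.corrcoef(chocolate, life_exp)[0, 1], 2))  # pooled: clearly positive
for g in (0, 1):                                         # within each gender: ~0
    m = gender == g
    print(g, round(np.corrcoef(chocolate[m], life_exp[m])[0, 1], 2))
```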

Is chocolate good for your health? (Example 3, continued)

Variables: chocolate intake (X1), life expectancy (Y), mood (X2: depressed or happy).

[Scatter plot: chocolate intake vs. life expectancy, with separate depressed and happy groups.]

X1 ⊥ Y | X2 … conclusion: eating chocolate may make you live longer! Really?

Same independence relations, different causal relations

All three models imply X1 ⊥ Y | X2:

• Common cause (X1 ← X2 → Y):  P(X1, X2, Y) = P(X1 | X2) P(Y | X2) P(X2)
• Chain (X1 → X2 → Y):  P(X1, X2, Y) = P(Y | X2) P(X2 | X1) P(X1)
• Chain (Y → X2 → X1):  P(X1, X2, Y) = P(X1 | X2) P(X2 | Y) P(Y)

Is X1 “relevant”?

X1 ⊥ Y | X2 holds in both scenarios:

• chocolate intake (X1), life expectancy (Y), gender (X2)
• chocolate intake (X1), life expectancy (Y), mood (X2)

(Example 3)

Non-causal features may be predictive yet not “relevant”

• Example 1: peak (X1), baseline (X2), health (Y)
• Example 2: peak (X1), time (X2), health (Y)
• Example 3: chocolate intake (X1), life expectancy (Y), gender or mood (X2)

Causal feature discovery

Two factorizations of the same data suggest opposite causal directions:

• P(X, Y) = P(X | Y) P(Y)  (Y → X1, X2)
• P(X, Y) = P(Y | X) P(X)  (X1, X2 → Y)

[Scatter plots of x1 vs. x2 under each model.]

Sun-Janzing-Schoelkopf, 2005

Conclusion

• Feature selection focuses on uncovering subsets of variables X1, X2, … predictive of the target Y.

• Taking a closer look at the type of dependencies may help refine the notion of variable relevance.

• Uncovering causal relationships may yield better feature selection, robust under distribution changes.

• These “causal features” may be better targets of action.
