TRANSCRIPT
Ockham’s Razor in Causal Discovery: A New Explanation
Kevin T. Kelly
Conor Mayo-Wilson
Department of Philosophy
Joint Program in Logic and Computation
Carnegie Mellon University
www.hss.cmu.edu/philosophy/faculty-kelly.php
I. Prediction vs. Policy
Predictive Links
Correlation or co-dependency allows one to predict Y from X.
[Diagram: ash trays — lung cancer]
Scientist (to policy maker): Ash trays linked to lung cancer!
Policy
Policy manipulates X to achieve a change in Y.
[Diagram: ash trays — lung cancer]
Prohibit ash trays!
Ash trays linked to lung cancer!
Policy
Policy manipulates X to achieve a change in Y.
[Diagram: ash trays — lung cancer]
We failed!
Correlation is not Causation
Manipulation of X can destroy the correlation of X with Y.
[Diagram: ash trays — lung cancer, correlation destroyed by manipulation]
We failed!
Standard Remedy
Randomized controlled study
[Diagram: randomized controlled study of ash trays and lung cancer]
That’s what happens if you carry out the policy.
Infeasibility
Expense
Morality
Lead
IQ
Let me force a few thousand children to eat lead.
Infeasibility
Expense
Morality
Lead
IQ
Just joking!
Ironic Alliance
Lead
IQ
Ha! You will never prove that lead affects IQ…
industry
Ironic Alliance
Lead
IQ
And you can’t throw my people out of work on a mere whim.
Lead
IQ
So I will keep on polluting, which will never settle the matter because it is not a randomized trial.
Ironic Alliance
II. Causes From Correlations
Causal Discovery
Protein A
Protein B
Protein C Cancer protein
Patterns of conditional correlation can imply unambiguous causal conclusions
(Pearl, Spirtes, Glymour, Scheines, etc.)
Eliminate protein C!
Basic Idea
Causation is a directed, acyclic network over variables.
What makes a network causal is a relation of compatibility between networks and joint probability distributions.
[Diagram: network G over X, Y, Z — compatibility — joint distribution p]
Joint distribution p is compatible with directed, acyclic network G iff:
Causal Markov Condition: each variable X is independent of its non-effects given its immediate causes.
Faithfulness Condition: every conditional independence relation that holds in p is a consequence of the Causal Markov Condition.
Compatibility
[Diagram: example network over V, W, X, Y, Z]
Common Cause
• B yields info about C (Faithfulness);
• B yields no further info about C given A (Markov).
[Diagram: common cause A with effects B and C]
Causal Chain
• B yields info about C (Faithfulness);
• B yields no further info about C given A (Markov).
[Diagram: causal chain B — A — C]
Common Effect
• B yields no info about C (Markov);
• B yields extra info about C given A (Faithfulness).
[Diagram: common effect A with causes B and C]
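The three independence signatures above can be checked by simulation. A minimal sketch under an assumed linear-Gaussian model (illustration only, not the authors' code): marginal correlation plays the role of "info", partial correlation the role of "info given A".

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def corr(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing out z."""
    rx = x - z * (np.cov(x, z)[0, 1] / np.var(z))
    ry = y - z * (np.cov(y, z)[0, 1] / np.var(z))
    return corr(rx, ry)

# Common cause: B <- A -> C
A = rng.normal(size=n)
B = A + rng.normal(size=n)
C = A + rng.normal(size=n)
dep_BC = corr(B, C)                       # nonzero (Faithfulness)
dep_BC_given_A = partial_corr(B, C, A)    # near zero (Markov)

# Common effect: B -> A <- C
B2 = rng.normal(size=n)
C2 = rng.normal(size=n)
A2 = B2 + C2 + rng.normal(size=n)
dep_B2C2 = corr(B2, C2)                        # near zero (Markov)
dep_B2C2_given_A2 = partial_corr(B2, C2, A2)   # nonzero (Faithfulness)
```

Conditioning on the common effect A2 induces dependence between its causes (the "explaining away" pattern), which is exactly what makes the collider distinctive.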
Distinguishability
[Diagram: common cause and chains are indistinguishable; the common effect is distinctive]
Immediate Connections
• There is an immediate causal connection between X and Y iff X is dependent on Y given every subset of variables not containing X and Y (Spirtes, Glymour and Scheines).
[Diagram: X — Y: no intermediate conditioning set breaks the dependency]
[Diagram: X — Z — Y with W: some conditioning set breaks the dependency]
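The adjacency criterion can be sketched directly, given an oracle for conditional independence. A toy sketch (the `indep(x, y, s)` interface and the chain oracle are assumptions for illustration):

```python
from itertools import combinations

def skeleton(variables, indep):
    """SGS-style skeleton recovery sketch: keep edge x-y iff x and y
    are dependent given EVERY subset of the remaining variables.
    `indep(x, y, s)` is an oracle for conditional independence."""
    edges = set()
    for x, y in combinations(variables, 2):
        rest = [v for v in variables if v not in (x, y)]
        separated = any(
            indep(x, y, frozenset(s))
            for k in range(len(rest) + 1)
            for s in combinations(rest, k))
        if not separated:
            edges.add(frozenset((x, y)))
    return edges

# Oracle for the chain X -> Z -> Y: the only independence is X _|_ Y | {Z}.
chain_indep = lambda x, y, s: {x, y} == {"X", "Y"} and "Z" in s
```

Running `skeleton(["X", "Y", "Z"], chain_indep)` drops the X—Y edge (Z separates them) and keeps X—Z and Z—Y, recovering the unoriented skeleton.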
Recovery of Skeleton
• Apply preceding condition to recover every non-oriented immediate causal connection.
[Diagram: recovered skeleton over X, Y, Z vs. the true network]
Orientation of Skeleton
• Look for the distinctive pattern of common effects.
[Diagram: common-effect pattern in the skeleton vs. the true network]
Orientation of Skeleton
• Look for the distinctive pattern of common effects.
• Draw all deductive consequences of these orientations.
Common effect
[Diagram: skeleton over X, Y, Z]
Y is not a common effect, so the remaining orientation must be downward.
[Diagram: true network]
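The collider-orientation step can likewise be sketched. Here `sepset` is an assumed record mapping each non-adjacent pair to a set that separated it during skeleton recovery (hypothetical names, illustration only):

```python
from itertools import combinations

def orient_colliders(edges, sepset):
    """For each unshielded triple x - z - y (x, y not adjacent), orient
    x -> z <- y exactly when z is absent from the set that separated
    x from y: the distinctive common-effect signature."""
    arrows = set()
    nodes = sorted({v for e in edges for v in e})
    for z in nodes:
        nbrs = [v for v in nodes if frozenset((v, z)) in edges]
        for x, y in combinations(nbrs, 2):
            if (frozenset((x, y)) not in edges
                    and z not in sepset[frozenset((x, y))]):
                arrows.update({(x, z), (y, z)})
    return arrows

# Skeleton X - Z - Y with X, Y non-adjacent:
edges = {frozenset(("X", "Z")), frozenset(("Y", "Z"))}
collider_sep = {frozenset(("X", "Y")): frozenset()}       # Z excluded
chain_sep = {frozenset(("X", "Y")): frozenset({"Z"})}     # Z included
```

With `collider_sep` both edges are oriented into Z; with `chain_sep` nothing is oriented, matching the indistinguishability of the non-collider patterns.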
Causation from Correlation
Protein A
Protein B
Protein C Cancer protein
The following network is causally unambiguous if all variables are observed.
Causation from Correlation
Protein A
Protein B
Protein C Cancer protein
The red arrow is also immune to latent confounding causes
Brave New World for Policy
Protein A
Protein B
Protein C Cancer protein
Experimental (confounder-proof) conclusions from correlational data!
Eliminate protein C!
III. The Catch
Metaphysics vs. Inference
The above results all assume that the true statistical independence relations for p are given.
But they must be inferred from finite samples.
Sample → inferred statistical dependencies → causal conclusions
Problem of Induction
Independence is indistinguishable from sufficiently small dependence at sample size n.
[Diagram: independence vs. small dependence in the data]
Bridging the Inductive Gap
Assume conditional independence until the data show otherwise.
Ockham’s razor: assume no more causal complexity than necessary.
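This policy — assume conditional independence until the data show otherwise — is typically implemented with a significance test. A minimal sketch using Fisher's z-transform for a (partial) correlation (`judge_independent` is a hypothetical helper, not the authors' code):

```python
import math

def judge_independent(r, n, k, z_crit=1.96):
    """Ockham-style decision rule: treat an estimated (partial)
    correlation r, from n samples with k conditioning variables, as
    zero unless Fisher's z-test rejects (two-sided, ~5% level).
    Dependencies below the detection threshold are assumed away."""
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - k - 3)
    return abs(z) < z_crit
```

Note how the same small correlation r = 0.01 counts as "independence" at n = 100 but as a detected dependency at n = 100,000: exactly the sample-size relativity behind the problem of induction on the previous slide.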
Inferential Instability
No guarantee that small dependencies will not be detected later.
Can have a spectacular impact on prior causal conclusions.
Current Policy Analysis
Protein A
Protein B
Protein C Cancer protein
Eliminate protein C!
Protein A
Protein B
Protein C Cancer protein
As Sample Size Increases…
Rescind that order!
Protein A
Protein B
Protein C Cancer protein (weak)
Protein D
As Sample Size Increases Again…
Eliminate protein C again!
Protein A
Protein B
Protein C Cancer protein (weak)
Protein D
Protein E (weak)
weak
As Sample Size Increases Again…
Protein A
Protein B
Protein C Cancer protein (weak)
Protein D
Protein E (weak)
weak
Etc.
Eliminate protein C again!
Typical Applications
Linear Causal Case: each variable X is a linear function of its parents and a normally distributed hidden variable called an “error term”. The error terms are mutually independent.
Discrete Multinomial Case: each variable X takes on a finite range of values.
No unobserved latent confounding causes.
An Optimistic Concession
Genetics
Smoking Cancer
Causal Flipping Theorem
No matter what a consistent causal discovery procedure has seen so far, there exists a pair G, p satisfying the above assumptions so that the current sample is arbitrarily likely in p and the procedure produces arbitrarily many opposite conclusions in p about an arbitrary causal arrow in G as sample size increases.
oops… I meant… oops… I meant…
Causal Flipping Theorem
Every consistent causal inference method is covered.
Therefore, multiple instability is an intrinsic feature of the causal discovery problem.
oops… I meant… oops… I meant…
The Crooked Course
“Living in the midst of ignorance and considering themselves intelligent and enlightened, the senseless people go round and round, following crooked courses, just like the blind led by the blind.” — Katha Upanishad, I. ii. 5.
Extremist Reaction
Since causal discovery cannot lead straight to the truth, it is not justified.
I must remain silent. Therefore, I win.
Moderate Reaction
Many explanations have been offered to make sense of the here-today-gone-tomorrow nature of medical wisdom — what we are advised with confidence one year is reversed the next — but the simplest one is that it is the natural rhythm of science.
(Do We Really Know What Makes us Healthy?, NY Times Magazine, Sept. 16, 2007).
Skepticism Inverted
Unavoidable retractions are justified because they are unavoidable.
Avoidable retractions are not justified because they are avoidable.
So the best possible methods for causal discovery are those that minimize causal retractions.
The best possible means for finding the truth are justified.
Larger Proposal
The same holds for Ockham’s razor in general when the aim is to find the true theory.
IV. Ockham’s Razor
Which Theory is Right?
???
Ockham Says:
Choose the simplest!
But Why?
Gotcha!
Puzzle
An indicator must be sensitive to what it indicates.
simple
Puzzle
An indicator must be sensitive to what it indicates.
complex
Puzzle
But Ockham’s razor always points at simplicity.
simple
Puzzle
But Ockham’s razor always points at simplicity.
complex
Puzzle
How can a broken compass help you find something unless you already know where it is?
complex
Standard Accounts
1. Prior Simplicity Bias: Bayes, BIC, MDL, MML, etc.
2. Risk Minimization: SRM, AIC, cross-validation, etc.
1. Bayesian Account
Ockham’s razor is a feature of one’s personal prior belief state.
Short run: no objective connection with finding the truth (flipping theorem applies).
Long run: converges to the truth, but other prior biases would also lead to convergence.
2. Risk Minimization Account
Risk minimization is about prediction rather than truth.
Urges using a false causal theory rather than the known true theory for predictive purposes.
Therefore, not suited to exact science or to practical policy applications.
V. A New Foundation for
Ockham’s Razor
Connections to the Truth
Short-run reliability: too strong to be feasible when theory matters.
Long-run convergence: too weak to single out Ockham’s razor.
[Diagram: simple vs. complex paths to the truth]
Middle Path
Short-run reliability: too strong to be feasible when theory matters.
“Straightest” convergence: just right?
Long-run convergence: too weak to single out Ockham’s razor.
[Diagram: simple vs. complex paths to the truth]
Empirical Problems
T1 T2 T3
Set K of infinite input sequences.
Partition of K into alternative theories.
Empirical Methods
T1 T2 T3
Map finite input sequences to theories or to “?”.
K
T3
e
Method Choice
T1 T2 T3
e1 e2 e3 e4
Input history
Output history
At each stage, the scientist can choose a new method (agreeing with past theory choices).
Aim: Converge to the Truth
T1 T2 T3
K
T3 ? T2 ? T1 T1 T1 T1 T1 T1 T1 . . .
Retraction
Choosing T and then not choosing T next.
T’
T
?
Aim: Eliminate Needless Retractions
Truth
Aim: Eliminate Needless Retractions
Truth
Aim: Eliminate Needless Delays to Retractions
[Diagram: theory with dependent applications and corollaries]
Aim: Eliminate Needless Delays to Retractions
Why Timed Retractions?
Retraction minimization = generalized significance level.
Retraction time minimization = generalized power.
Easy Retraction Time Comparisons
Method 1: T1 T1 T2 T2 T2 T2 T4 T4 T4 . . .
Method 2: T1 T1 T2 T2 T3 T3 T2 T4 T4 . . .
Method 2’s retractions: at least as many, at least as late.
Worst-case Retraction Time Bounds
[Diagram: worst-case timed retraction bounds (1, 2, ∞) over output sequences converging to T1, T2, T3, T4]
Curve Fitting
Data = open intervals around Y at rational values of X.
Curve Fitting
No effects:
Curve Fitting
First-order effect:
Curve Fitting
Second-order effect:
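Ockham's choice among polynomial hypotheses can be sketched as: posit the least degree whose best fit explains every data point, and add a higher-order effect only when the data force it. Here a scalar tolerance `tol` stands in for the open observation intervals (`ockham_degree` is a hypothetical helper, not the authors' method):

```python
import numpy as np

def ockham_degree(x, y, tol, max_deg=5):
    """Return the least polynomial degree whose least-squares fit
    passes within tol of every observation -- i.e. posit no
    higher-order effect until the data demand it."""
    for d in range(max_deg + 1):
        coef = np.polynomial.polynomial.polyfit(x, y, d)
        fit = np.polynomial.polynomial.polyval(x, coef)
        if np.max(np.abs(fit - y)) <= tol:
            return d
    return max_deg

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
```

On constant data this returns 0, on linear data 1, on quadratic data 2: the method climbs the complexity ladder only when kicked, which is the Ockham path traced in the next slides.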
Ockham
ConstantLinear
Quadratic
Cubic
There yet? Maybe.
Ockham
ConstantLinear
Quadratic
Cubic
There yet?
Maybe.
Ockham
ConstantLinear
Quadratic
Cubic
There yet? Maybe.
Ockham
ConstantLinear
Quadratic
Cubic
There yet? Maybe.
Ockham Violation
ConstantLinear
Quadratic
Cubic
There yet? Maybe.
Ockham Violation
ConstantLinear
Quadratic
Cubic
I know you’re coming!
Ockham Violation
ConstantLinear
Quadratic
Cubic
Maybe.
Ockham Violation
ConstantLinear
Quadratic
Cubic
!!!
Hmm, it’s quite nice here…
Ockham Violation
ConstantLinear
Quadratic
Cubic
You’re back! Learned your lesson?
Violator’s Path
ConstantLinear
Quadratic
Cubic
See, you shouldn’t run ahead, even if you are right!
Ockham Path
ConstantLinear
Quadratic
Cubic
More General Argument Required
Cover case in which demon has branching paths (causal discovery)
More General Argument Required
Cover case in which scientist lags behind (using time as a cost)
Come on!
Empirical Effects
[Diagram: effects appearing over time]
May take arbitrarily long to discover, but can’t be taken back.
Empirical Theories
True theory determined by which effects appear.
Empirical Complexity
More complex
Background Constraints
More complex
Background Constraints
More complex
Ockham’s Razor
Don’t select a theory unless it is uniquely simplest in light of experience.
Weak Ockham’s Razor
Don’t select a theory unless it is among the simplest in light of experience.
Stalwartness
Don’t retract your answer while it is uniquely simplest.
Timed Retraction Bounds
r(M, e, n) = the least timed retraction bound covering the total timed retractions of M along input streams of complexity n that extend e.
[Diagram: timed retraction bounds of M over empirical complexity 0, 1, 2, 3, …]
Efficiency of Method M at e
M converges to the truth no matter what;
For each convergent M’ that agrees with M up to the end of e, and for each n: r(M, e, n) ≤ r(M’, e, n).
[Diagram: bounds of M vs. M’ over empirical complexity]
M is Beaten at e
There exists convergent M’ that agrees with M up to the end of e, such that:
for each n, r(M, e, n) ≥ r(M’, e, n);
for some n, r(M, e, n) > r(M’, e, n).
[Diagram: bounds of M vs. M’ over empirical complexity]
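The pointwise comparison in this definition can be stated as a short predicate. A sketch, assuming the bounds are given as finite lists indexed by complexity n (infinite bounds could be represented by `math.inf`):

```python
def beats(r_Mp, r_M):
    """M' beats M at e: M' retracts no later at every complexity n
    and strictly earlier at some n (bounds compared pointwise)."""
    return (all(a <= b for a, b in zip(r_Mp, r_M))
            and any(a < b for a, b in zip(r_Mp, r_M)))
```

So `beats([1, 2, 3], [1, 2, 4])` holds, while equal bound profiles, or profiles that trade a win at one n for a loss at another, do not beat each other.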
Ockham Efficiency Theorem
Let M be a solution. The following are equivalent:
M is always strongly Ockham and stalwart;
M is always efficient;
M is never weakly beaten.
Example: Causal Inference
Effects are conditional statistical dependence relations.
X dep Y | {Z}, {W}, {Z,W}
Y dep Z | {X}, {W}, {X,W}
X dep Z | {Y}, {Y,W}
. . .
Causal Discovery = Ockham’s Razor
X Y Z W
Ockham’s Razor
X Y Z W
X dep Y | {Z}, {W}, {Z,W}
Causal Discovery = Ockham’s Razor
X Y Z W
X dep Y | {Z}, {W}, {Z,W}
Y dep Z | {X}, {W}, {X,W}
X dep Z | {Y}, {Y,W}
Causal Discovery = Ockham’s Razor
X Y Z W
X dep Y | {Z}, {W}, {Z,W}
Y dep Z | {X}, {W}, {X,W}
X dep Z | {Y}, {W}, {Y,W}
Causal Discovery = Ockham’s Razor
X Y Z W
X dep Y | {Z}, {W}, {Z,W}
Y dep Z | {X}, {W}, {X,W}
X dep Z | {Y}, {W}, {Y,W}
Z dep W | {X}, {Y}, {X,Y}
Y dep W | {Z}, {X,Z}
Causal Discovery = Ockham’s Razor
X Y Z W
X dep Y | {Z}, {W}, {Z,W}
Y dep Z | {X}, {W}, {X,W}
X dep Z | {Y}, {W}, {Y,W}
Z dep W | {X}, {Y}, {X,Y}
Y dep W | {X}, {Z}, {X,Z}
VI. Simplicity Defined
Approach
Empirical complexity reflects nested problems of induction posed by the problem.
Hence, simplicity is problem-relative but topologically invariant.
Empirical Problems
T1 T2 T3
Set K of infinite input sequences.
Partition Q of K into alternative theories.
K
Simplicity Concepts
A simplicity concept for (K, Q) is just a well-founded order < on a partition S of K with ascending chains of order type not exceeding omega such that:
1. Each element of S is included in some answer in Q;
2. Each downward union in (S, <) is closed;
3. Incomparable sets share no boundary point;
4. Each element of S is included in the boundary of its successor.
Empirical Complexity Defined
Let K|e denote the set of all possibilities compatible with observations e.
Let (S, <) be a simplicity concept for (K|e, Q).
Define c(w, e) = the length of the longest < path to the cell of S that contains w.
Define c(T, e) = the least c(w, e) such that T is true in w.
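As a toy illustration of c(w, e): represent the simplicity order by listing each cell's immediate <-predecessors and take the length of the longest descending path. (A sketch of the definition only; the hypothetical `preds` table stands in for (S, <).)

```python
def complexity(preds, cell):
    """c(cell) = length of the longest <-path down from a minimal
    cell to `cell`; preds maps each cell to the list of its
    immediate <-predecessors (empty for minimal cells)."""
    below = preds.get(cell, [])
    if not below:
        return 0
    return 1 + max(complexity(preds, p) for p in below)

# Toy simplicity order for curve fitting: constant < linear < quadratic.
preds = {"linear": ["constant"], "quadratic": ["linear"]}
```

This reproduces the polynomial application on the next slide: the complexity of a degree-d hypothesis is d, the number of nested inductive steps below it.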
Applications
Polynomial laws: complexity = degree.
Conservation laws: complexity = particle types – conserved quantities.
Causal networks: complexity = number of logically independent conditional dependencies entailed by faithfulness.
General Ockham Efficiency Theorem
Let M be a solution. The following are equivalent:
M is always strongly Ockham and stalwart;
M is always efficient;
M is never beaten.
Conclusions
Causal truths are necessary for counterfactual predictions.
Ockham’s razor is necessary for staying on the straightest path to the true theory but does not point at the true theory.
No evasions or circles are required.
Future Directions
Extension of unique efficiency theorem to stochastic model selection.
Latent variables as Ockham conclusions.
Degrees of retraction.
Pooling of marginal Ockham conclusions.
Retraction efficiency assessment of MDL, SRM.
Suggested Reading
"Ockham’s Razor, Truth, and Information", in Handbook of the Philosophy of Information, J. van Behthem and P. Adriaans, eds., to appear.
"Ockham’s Razor, Empirical Complexity, and Truth-finding Efficiency", Theoretical Computer Science, 383: 270-289, 2007.
Both available as pre-prints at: www.hss.cmu.edu/philosophy/faculty-kelly.php
1. Prior Simplicity Bias
The simple theory is more plausible now because it was more plausible yesterday.
More Subtle Version
Simple data are a miracle in the complex theory but not in the simple theory.
P C
Regularity: retrograde motion of Venus at solar conjunction
Has to be!
However…
e would not be a miracle given P(θ);
Why not this?
CP
The Real Miracle
Ignorance about model: p(C) ≈ p(P);
+ Ignorance about parameter setting: p(P(θ) | P) ≈ p(P(θ′) | P);
= Knowledge about C vs. P(θ): p(P(θ)) << p(C).
[Diagram: C vs. P with parameter settings θ]
Lead into gold. Perpetual motion. Free lunch.
Sounds good!
Standard Paradox of Indifference
Ignorance of red vs. not-red
+ Ignorance over not-red
= Knowledge about red vs. white.
Knognorance = all the privileges of knowledge with none of the responsibilities. Sounds good!
The Ellsberg Paradox
1/3 ? ?
Human Preference
1/3 ? ?
a ≻ b
a ∨ c ≺ b ∨ c
Human View
1/3 ? ?
a ≻ b (knowledge preferred to ignorance)
a ∨ c ≺ b ∨ c (knowledge preferred to ignorance)
Bayesian “Rationality”
1/3 ? ?
a ≻ b
a ∨ c ≻ b ∨ c (knognorance throughout)
In Any Event
The coherentist foundations of Bayesianism have nothing to do with short-run truth-conduciveness.
Not so loud!
Bayesian Convergence
Too-simple theories get shot down…
[Diagram: theories ordered by complexity]
Updated opinion
Bayesian Convergence
Plausibility is transferred to the next-simplest theory…
Blam! [Diagram: theories ordered by complexity]
Updated opinion
Plink!
Bayesian Convergence
The true theory is never shot down.
Blam! [Diagram: theories ordered by complexity]
Updated opinion
Zing!
Convergence
But alternative strategies also converge: any theory choice in the short run is compatible with convergence in the long run.
Summary of Bayesian Approach
Prior-based explanations of Ockham’s razor are circular and based on a faulty model of ignorance.
Convergence-based explanations of Ockham’s razor fail to single out Ockham’s razor.
2. Risk Minimization
Ockham’s razor minimizes expected distance of empirical estimates from the true value.
Truth
Unconstrained Estimates
Centered on truth but spread around it.
[Diagram: shots scattered around the truth]
Unconstrained aim
Off-center but less spread.
Clamped aim
Truth
Constrained Estimates
Off-center but less spread: overall improvement in expected distance from truth…
Truth
Clamped aim
Doesn’t Find True Theory
The theory that minimizes estimation risk can be quite false…
Four eyes!
Clamped aim
Makes Sense…
…when the loss of an answer is similar in nearby distributions.
[Diagram: loss vs. similarity to p — “Close is good enough!”]
But Not When Truth Matters
…i.e., when the loss of an answer is discontinuous with similarity.
[Diagram: loss vs. similarity to p — “Close is no cigar!”]