TRANSCRIPT
Ockham’s Razor in Causal Discovery: A New Explanation
Kevin T. Kelly
Conor Mayo-Wilson
Department of Philosophy
Joint Program in Logic and Computation
Carnegie Mellon University
www.hss.cmu.edu/philosophy/faculty-kelly.php
I. Prediction vs. Policy
Predictive Links
Correlation or co-dependency allows one to predict Y from X.
[Diagram: ash trays — lung cancer]
Scientist (to policy maker): Ash trays linked to lung cancer!
Policy
Policy manipulates X to achieve a change in Y.
[Diagram: ash trays — lung cancer]
Prohibit ash trays!
Ash trays linked to lung cancer!
Policy
Policy manipulates X to achieve a change in Y.
[Diagram: ash trays — lung cancer]
We failed!
Correlation is not Causation
Manipulation of X can destroy the correlation of X with Y.
[Diagram: ash trays — lung cancer, correlation destroyed by manipulation]
We failed!
Standard Remedy
Randomized controlled study
[Diagram: randomized controlled study of ash trays and lung cancer]
That’s what happens if you carry out the policy.
Infeasibility
Expense
Morality
Lead
IQ
Let me force a few thousand children to eat lead.
Infeasibility
Expense
Morality
Lead
IQ
Just joking!
Ironic Alliance
Lead
IQ
Ha! You will never prove that lead affects IQ…
industry
Ironic Alliance
Lead
IQ
And you can’t throw my people out of work on a mere whim.
Lead
IQ
So I will keep on polluting, which will never settle the matter because it is not a randomized trial.
Ironic Alliance
II. Causes From Correlations
Causal Discovery
Protein A
Protein B
Protein C Cancer protein
Patterns of conditional correlation can imply unambiguous causal conclusions
(Pearl, Spirtes, Glymour, Scheines, etc.)
Eliminate protein C!
Basic Idea
Causation is a directed, acyclic network over variables.
What makes a network causal is a relation of compatibility between networks and joint probability distributions.
[Diagram: network G over X, Y, Z — compatibility — joint distribution p]
Joint distribution p is compatible with directed, acyclic network G iff:
Causal Markov Condition: each variable X is independent of its non-effects given its immediate causes.
Faithfulness Condition: every conditional independence relation that holds in p is a consequence of the Causal Markov Condition.
Compatibility
[Diagram: example network over V, W, X, Y, Z]
Common Cause
• B yields info about C (Faithfulness);
• B yields no further info about C given A (Markov).
[Diagram: common cause A with effects B and C]
Causal Chain
• B yields info about C (Faithfulness);
• B yields no further info about C given A (Markov).
[Diagram: causal chain B — A — C]
Common Effect
• B yields no info about C (Markov);
• B yields extra info about C given A (Faithfulness).
[Diagram: common effect A with causes B and C]
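The three independence signatures above can be checked by simulation. A minimal sketch under an assumed linear-Gaussian model (illustration only, not the authors' code): marginal correlation plays the role of "info", partial correlation the role of "info given A".

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def corr(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing out z."""
    rx = x - z * (np.cov(x, z)[0, 1] / np.var(z))
    ry = y - z * (np.cov(y, z)[0, 1] / np.var(z))
    return corr(rx, ry)

# Common cause: B <- A -> C
A = rng.normal(size=n)
B = A + rng.normal(size=n)
C = A + rng.normal(size=n)
dep_BC = corr(B, C)                       # nonzero (Faithfulness)
dep_BC_given_A = partial_corr(B, C, A)    # near zero (Markov)

# Common effect: B -> A <- C
B2 = rng.normal(size=n)
C2 = rng.normal(size=n)
A2 = B2 + C2 + rng.normal(size=n)
dep_B2C2 = corr(B2, C2)                        # near zero (Markov)
dep_B2C2_given_A2 = partial_corr(B2, C2, A2)   # nonzero (Faithfulness)
```

Conditioning on the common effect A2 induces dependence between its causes (the "explaining away" pattern), which is exactly what makes the collider distinctive.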
Distinguishability
[Diagram: common cause and chains are indistinguishable; the common effect is distinctive]
Immediate Connections
• There is an immediate causal connection between X and Y iff X is dependent on Y given every subset of variables not containing X and Y (Spirtes, Glymour and Scheines).
[Diagram: X — Y: no intermediate conditioning set breaks the dependency]
[Diagram: X — Z — Y with W: some conditioning set breaks the dependency]
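The adjacency criterion can be sketched directly, given an oracle for conditional independence. A toy sketch (the `indep(x, y, s)` interface and the chain oracle are assumptions for illustration):

```python
from itertools import combinations

def skeleton(variables, indep):
    """SGS-style skeleton recovery sketch: keep edge x-y iff x and y
    are dependent given EVERY subset of the remaining variables.
    `indep(x, y, s)` is an oracle for conditional independence."""
    edges = set()
    for x, y in combinations(variables, 2):
        rest = [v for v in variables if v not in (x, y)]
        separated = any(
            indep(x, y, frozenset(s))
            for k in range(len(rest) + 1)
            for s in combinations(rest, k))
        if not separated:
            edges.add(frozenset((x, y)))
    return edges

# Oracle for the chain X -> Z -> Y: the only independence is X _|_ Y | {Z}.
chain_indep = lambda x, y, s: {x, y} == {"X", "Y"} and "Z" in s
```

Running `skeleton(["X", "Y", "Z"], chain_indep)` drops the X—Y edge (Z separates them) and keeps X—Z and Z—Y, recovering the unoriented skeleton.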
Recovery of Skeleton
• Apply preceding condition to recover every non-oriented immediate causal connection.
[Diagram: recovered skeleton over X, Y, Z vs. the true network]
Orientation of Skeleton
• Look for the distinctive pattern of common effects.
[Diagram: common-effect pattern in the skeleton vs. the true network]
Orientation of Skeleton
• Look for the distinctive pattern of common effects.
• Draw all deductive consequences of these orientations.
Common effect
[Diagram: skeleton over X, Y, Z]
Y is not a common effect, so the remaining orientation must be downward.
[Diagram: true network]
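The collider-orientation step can likewise be sketched. Here `sepset` is an assumed record mapping each non-adjacent pair to a set that separated it during skeleton recovery (hypothetical names, illustration only):

```python
from itertools import combinations

def orient_colliders(edges, sepset):
    """For each unshielded triple x - z - y (x, y not adjacent), orient
    x -> z <- y exactly when z is absent from the set that separated
    x from y: the distinctive common-effect signature."""
    arrows = set()
    nodes = sorted({v for e in edges for v in e})
    for z in nodes:
        nbrs = [v for v in nodes if frozenset((v, z)) in edges]
        for x, y in combinations(nbrs, 2):
            if (frozenset((x, y)) not in edges
                    and z not in sepset[frozenset((x, y))]):
                arrows.update({(x, z), (y, z)})
    return arrows

# Skeleton X - Z - Y with X, Y non-adjacent:
edges = {frozenset(("X", "Z")), frozenset(("Y", "Z"))}
collider_sep = {frozenset(("X", "Y")): frozenset()}       # Z excluded
chain_sep = {frozenset(("X", "Y")): frozenset({"Z"})}     # Z included
```

With `collider_sep` both edges are oriented into Z; with `chain_sep` nothing is oriented, matching the indistinguishability of the non-collider patterns.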
Causation from Correlation
Protein A
Protein B
Protein C Cancer protein
The following network is causally unambiguous if all variables are observed.
Causation from Correlation
Protein A
Protein B
Protein C Cancer protein
The red arrow is also immune to latent confounding causes
Brave New World for Policy
Protein A
Protein B
Protein C Cancer protein
Experimental (confounder-proof) conclusions from correlational data!
Eliminate protein C!
III. The Catch
Metaphysics vs. Inference
The above results all assume that the true statistical independence relations for p are given.
But they must be inferred from finite samples.
Sample → inferred statistical dependencies → causal conclusions
Problem of Induction
Independence is indistinguishable from sufficiently small dependence at sample size n.
[Diagram: independence vs. small dependence in the data]
Bridging the Inductive Gap
Assume conditional independence until the data show otherwise.
Ockham’s razor: assume no more causal complexity than necessary.
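This policy — assume conditional independence until the data show otherwise — is typically implemented with a significance test. A minimal sketch using Fisher's z-transform for a (partial) correlation (`judge_independent` is a hypothetical helper, not the authors' code):

```python
import math

def judge_independent(r, n, k, z_crit=1.96):
    """Ockham-style decision rule: treat an estimated (partial)
    correlation r, from n samples with k conditioning variables, as
    zero unless Fisher's z-test rejects (two-sided, ~5% level).
    Dependencies below the detection threshold are assumed away."""
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - k - 3)
    return abs(z) < z_crit
```

Note how the same small correlation r = 0.01 counts as "independence" at n = 100 but as a detected dependency at n = 100,000: exactly the sample-size relativity behind the problem of induction on the previous slide.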
Inferential Instability
No guarantee that small dependencies will not be detected later.
Can have a spectacular impact on prior causal conclusions.
Current Policy Analysis
Protein A
Protein B
Protein C Cancer protein
Eliminate protein C!
Protein A
Protein B
Protein C Cancer protein
As Sample Size Increases…
Rescind that order!
Protein A
Protein B
Protein C Cancer protein (weak)
Protein D
As Sample Size Increases Again…
Eliminate protein C again!
Protein A
Protein B
Protein C Cancer protein (weak)
Protein D
Protein E (weak)
weak
As Sample Size Increases Again…
Protein A
Protein B
Protein C Cancer protein (weak)
Protein D
Protein E (weak)
weak
Etc.
Eliminate protein C again!
Typical Applications
Linear Causal Case: each variable X is a linear function of its parents and a normally distributed hidden variable called an “error term”. The error terms are mutually independent.
Discrete Multinomial Case: each variable X takes on a finite range of values.
No unobserved latent confounding causes.
An Optimistic Concession
Genetics
Smoking Cancer
Causal Flipping Theorem
No matter what a consistent causal discovery procedure has seen so far, there exists a pair G, p satisfying the above assumptions so that the current sample is arbitrarily likely in p and the procedure produces arbitrarily many opposite conclusions in p about an arbitrary causal arrow in G as sample size increases.
oops… I meant… oops… I meant…
Causal Flipping Theorem
Every consistent causal inference method is covered.
Therefore, multiple instability is an intrinsic feature of the causal discovery problem.
oops… I meant… oops… I meant…
The Crooked Course
“Living in the midst of ignorance and considering themselves intelligent and enlightened, the senseless people go round and round, following crooked courses, just like the blind led by the blind.” — Katha Upanishad, I. ii. 5.
Extremist Reaction
Since causal discovery cannot lead straight to the truth, it is not justified.
I must remain silent. Therefore, I win.
Moderate Reaction
Many explanations have been offered to make sense of the here-today-gone-tomorrow nature of medical wisdom — what we are advised with confidence one year is reversed the next — but the simplest one is that it is the natural rhythm of science.
(Do We Really Know What Makes us Healthy?, NY Times Magazine, Sept. 16, 2007).
Skepticism Inverted
Unavoidable retractions are justified because they are unavoidable.
Avoidable retractions are not justified because they are avoidable.
So the best possible methods for causal discovery are those that minimize causal retractions.
The best possible means for finding the truth are justified.
Larger Proposal
The same holds for Ockham’s razor in general when the aim is to find the true theory.
IV. Ockham’s Razor
Which Theory is Right?
???
Ockham Says:
Choose the simplest!
But Why?
Gotcha!
Puzzle
An indicator must be sensitive to what it indicates.
simple
Puzzle
An indicator must be sensitive to what it indicates.
complex
Puzzle
But Ockham’s razor always points at simplicity.
simple
Puzzle
But Ockham’s razor always points at simplicity.
complex
Puzzle
How can a broken compass help you find something unless you already know where it is?
complex
Standard Accounts
1. Prior Simplicity Bias: Bayes, BIC, MDL, MML, etc.
2. Risk Minimization: SRM, AIC, cross-validation, etc.
1. Bayesian Account
Ockham’s razor is a feature of one’s personal prior belief state.
Short run: no objective connection with finding the truth (flipping theorem applies).
Long run: converges to the truth, but other prior biases would also lead to convergence.
2. Risk Minimization Account
Risk minimization is about prediction rather than truth.
Urges using a false causal theory rather than the known true theory for predictive purposes.
Therefore, not suited to exact science or to practical policy applications.
V. A New Foundation for
Ockham’s Razor
Connections to the Truth
Short-run reliability: too strong to be feasible when theory matters.
Long-run convergence: too weak to single out Ockham’s razor.
[Diagram: simple vs. complex paths to the truth]
Middle Path
Short-run reliability: too strong to be feasible when theory matters.
“Straightest” convergence: just right?
Long-run convergence: too weak to single out Ockham’s razor.
[Diagram: simple vs. complex paths to the truth]
Empirical Problems
T1 T2 T3
Set K of infinite input sequences.
Partition of K into alternative theories.
Empirical Methods
T1 T2 T3
Map finite input sequences to theories or to “?”.
K
T3
e
Method Choice
T1 T2 T3
e1 e2 e3 e4
Input history
Output history
At each stage, the scientist can choose a new method (agreeing with past theory choices).
Aim: Converge to the Truth
T1 T2 T3
K
T3 ? T2 ? T1 T1 T1 T1 T1 T1 T1 . . .
Retraction
Choosing T and then not choosing T next.
T’
T
?
Aim: Eliminate Needless Retractions
Truth
Aim: Eliminate Needless Retractions
Truth
Aim: Eliminate Needless Delays to Retractions
[Diagram: theory with dependent applications and corollaries]
Aim: Eliminate Needless Delays to Retractions
Why Timed Retractions?
Retraction minimization = generalized significance level.
Retraction time minimization = generalized power.
Easy Retraction Time Comparisons
Method 1: T1 T1 T2 T2 T2 T2 T4 T4 T4 . . .
Method 2: T1 T1 T2 T2 T3 T3 T2 T4 T4 . . .
Method 2’s retractions: at least as many, at least as late.
Worst-case Retraction Time Bounds
[Diagram: worst-case timed retraction bounds (1, 2, ∞) over output sequences converging to T1, T2, T3, T4]
Curve Fitting
Data = open intervals around Y at rational values of X.
Curve Fitting
No effects:
Curve Fitting
First-order effect:
Curve Fitting
Second-order effect:
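Ockham's choice among polynomial hypotheses can be sketched as: posit the least degree whose best fit explains every data point, and add a higher-order effect only when the data force it. Here a scalar tolerance `tol` stands in for the open observation intervals (`ockham_degree` is a hypothetical helper, not the authors' method):

```python
import numpy as np

def ockham_degree(x, y, tol, max_deg=5):
    """Return the least polynomial degree whose least-squares fit
    passes within tol of every observation -- i.e. posit no
    higher-order effect until the data demand it."""
    for d in range(max_deg + 1):
        coef = np.polynomial.polynomial.polyfit(x, y, d)
        fit = np.polynomial.polynomial.polyval(x, coef)
        if np.max(np.abs(fit - y)) <= tol:
            return d
    return max_deg

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
```

On constant data this returns 0, on linear data 1, on quadratic data 2: the method climbs the complexity ladder only when kicked, which is the Ockham path traced in the next slides.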
Ockham
ConstantLinear
Quadratic
Cubic
There yet? Maybe.
Ockham
ConstantLinear
Quadratic
Cubic
There yet?
Maybe.
Ockham
ConstantLinear
Quadratic
Cubic
There yet? Maybe.
Ockham
ConstantLinear
Quadratic
Cubic
There yet? Maybe.
Ockham Violation
ConstantLinear
Quadratic
Cubic
There yet? Maybe.
Ockham Violation
ConstantLinear
Quadratic
Cubic
I know you’re coming!
Ockham Violation
ConstantLinear
Quadratic
Cubic
Maybe.
Ockham Violation
ConstantLinear
Quadratic
Cubic
!!!
Hmm, it’s quite nice here…
Ockham Violation
ConstantLinear
Quadratic
Cubic
You’re back! Learned your lesson?
Violator’s Path
ConstantLinear
Quadratic
Cubic
See, you shouldn’t run ahead, even if you are right!
Ockham Path
ConstantLinear
Quadratic
Cubic
More General Argument Required
Cover case in which demon has branching paths (causal discovery)
More General Argument Required
Cover case in which scientist lags behind (using time as a cost)
Come on!
Empirical Effects
[Diagram: effects appearing over time]
May take arbitrarily long to discover, but can’t be taken back.
Empirical Theories
True theory determined by which effects appear.
Empirical Complexity
More complex
Background Constraints
More complex
Background Constraints
More complex
Ockham’s Razor
Don’t select a theory unless it is uniquely simplest in light of experience.
Weak Ockham’s Razor
Don’t select a theory unless it is among the simplest in light of experience.
Stalwartness
Don’t retract your answer while it is uniquely simplest.
Timed Retraction Bounds
r(M, e, n) = the least timed retraction bound covering the total timed retractions of M along input streams of complexity n that extend e.
[Diagram: timed retraction bounds of M over empirical complexity 0, 1, 2, 3, …]
Efficiency of Method M at e
M converges to the truth no matter what;
For each convergent M’ that agrees with M up to the end of e, and for each n: r(M, e, n) ≤ r(M’, e, n).
[Diagram: bounds of M vs. M’ over empirical complexity]
M is Beaten at e
There exists convergent M’ that agrees with M up to the end of e, such that:
for each n, r(M, e, n) ≥ r(M’, e, n);
for some n, r(M, e, n) > r(M’, e, n).
[Diagram: bounds of M vs. M’ over empirical complexity]
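The pointwise comparison in this definition can be stated as a short predicate. A sketch, assuming the bounds are given as finite lists indexed by complexity n (infinite bounds could be represented by `math.inf`):

```python
def beats(r_Mp, r_M):
    """M' beats M at e: M' retracts no later at every complexity n
    and strictly earlier at some n (bounds compared pointwise)."""
    return (all(a <= b for a, b in zip(r_Mp, r_M))
            and any(a < b for a, b in zip(r_Mp, r_M)))
```

So `beats([1, 2, 3], [1, 2, 4])` holds, while equal bound profiles, or profiles that trade a win at one n for a loss at another, do not beat each other.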
Ockham Efficiency Theorem
Let M be a solution. The following are equivalent:
M is always strongly Ockham and stalwart;
M is always efficient;
M is never weakly beaten.
Example: Causal Inference
Effects are conditional statistical dependence relations.
X dep Y | {Z}, {W}, {Z,W}
Y dep Z | {X}, {W}, {X,W}
X dep Z | {Y}, {Y,W}
. . .
Causal Discovery = Ockham’s Razor
X Y Z W
Ockham’s Razor
X Y Z W
X dep Y | {Z}, {W}, {Z,W}
Causal Discovery = Ockham’s Razor
X Y Z W
X dep Y | {Z}, {W}, {Z,W}
Y dep Z | {X}, {W}, {X,W}
X dep Z | {Y}, {Y,W}
Causal Discovery = Ockham’s Razor
X Y Z W
X dep Y | {Z}, {W}, {Z,W}
Y dep Z | {X}, {W}, {X,W}
X dep Z | {Y}, {W}, {Y,W}
Causal Discovery = Ockham’s Razor
X Y Z W
X dep Y | {Z}, {W}, {Z,W}
Y dep Z | {X}, {W}, {X,W}
X dep Z | {Y}, {W}, {Y,W}
Z dep W | {X}, {Y}, {X,Y}
Y dep W | {Z}, {X,Z}
Causal Discovery = Ockham’s Razor
X Y Z W
X dep Y | {Z}, {W}, {Z,W}
Y dep Z | {X}, {W}, {X,W}
X dep Z | {Y}, {W}, {Y,W}
Z dep W | {X}, {Y}, {X,Y}
Y dep W | {X}, {Z}, {X,Z}
VI. Simplicity Defined
Approach
Empirical complexity reflects nested problems of induction posed by the problem.
Hence, simplicity is problem-relative but topologically invariant.
Empirical Problems
T1 T2 T3
Set K of infinite input sequences.
Partition Q of K into alternative theories.
K
Simplicity Concepts
A simplicity concept for (K, Q) is just a well-founded order < on a partition S of K with ascending chains of order type not exceeding omega such that:
1. Each element of S is included in some answer in Q;
2. Each downward union in (S, <) is closed;
3. Incomparable sets share no boundary point;
4. Each element of S is included in the boundary of its successor.
Empirical Complexity Defined
Let K|e denote the set of all possibilities compatible with observations e.
Let (S, <) be a simplicity concept for (K|e, Q).
Define c(w, e) = the length of the longest < path to the cell of S that contains w.
Define c(T, e) = the least c(w, e) such that T is true in w.
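As a toy illustration of c(w, e): represent the simplicity order by listing each cell's immediate <-predecessors and take the length of the longest descending path. (A sketch of the definition only; the hypothetical `preds` table stands in for (S, <).)

```python
def complexity(preds, cell):
    """c(cell) = length of the longest <-path down from a minimal
    cell to `cell`; preds maps each cell to the list of its
    immediate <-predecessors (empty for minimal cells)."""
    below = preds.get(cell, [])
    if not below:
        return 0
    return 1 + max(complexity(preds, p) for p in below)

# Toy simplicity order for curve fitting: constant < linear < quadratic.
preds = {"linear": ["constant"], "quadratic": ["linear"]}
```

This reproduces the polynomial application on the next slide: the complexity of a degree-d hypothesis is d, the number of nested inductive steps below it.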
Applications
Polynomial laws: complexity = degree.
Conservation laws: complexity = particle types – conserved quantities.
Causal networks: complexity = number of logically independent conditional dependencies entailed by faithfulness.
General Ockham Efficiency Theorem
Let M be a solution. The following are equivalent:
M is always strongly Ockham and stalwart;
M is always efficient;
M is never beaten.
Conclusions
Causal truths are necessary for counterfactual predictions.
Ockham’s razor is necessary for staying on the straightest path to the true theory but does not point at the true theory.
No evasions or circles are required.
Future Directions
Extension of unique efficiency theorem to stochastic model selection.
Latent variables as Ockham conclusions.
Degrees of retraction.
Pooling of marginal Ockham conclusions.
Retraction efficiency assessment of MDL, SRM.
Suggested Reading
"Ockham’s Razor, Truth, and Information", in Handbook of the Philosophy of Information, J. van Behthem and P. Adriaans, eds., to appear.
"Ockham’s Razor, Empirical Complexity, and Truth-finding Efficiency", Theoretical Computer Science, 383: 270-289, 2007.
Both available as pre-prints at: www.hss.cmu.edu/philosophy/faculty-kelly.php
1. Prior Simplicity Bias
The simple theory is more plausible now because it was more plausible yesterday.
More Subtle Version
Simple data are a miracle in the complex theory but not in the simple theory.
P C
Regularity: retrograde motion of Venus at solar conjunction
Has to be!
However…
e would not be a miracle given P(θ);
Why not this?
CP
The Real Miracle
Ignorance about model: p(C) ≈ p(P);
+ Ignorance about parameter setting: p(P(θ) | P) ≈ p(P(θ′) | P);
= Knowledge about C vs. P(θ): p(P(θ)) << p(C).
[Diagram: C vs. P with parameter settings θ]
Lead into gold. Perpetual motion. Free lunch.
Sounds good!
Standard Paradox of Indifference
Ignorance of red vs. not-red
+ Ignorance over not-red
= Knowledge about red vs. white.
Knognorance = all the privileges of knowledge with none of the responsibilities. Sounds good!
The Ellsberg Paradox
1/3 ? ?
Human Preference
1/3 ? ?
a ≻ b
a ∨ c ≺ b ∨ c
Human View
1/3 ? ?
a ≻ b (knowledge preferred to ignorance)
a ∨ c ≺ b ∨ c (knowledge preferred to ignorance)
Bayesian “Rationality”
1/3 ? ?
a ≻ b
a ∨ c ≻ b ∨ c (knognorance throughout)
In Any Event
The coherentist foundations of Bayesianism have nothing to do with short-run truth-conduciveness.
Not so loud!
Bayesian Convergence
Too-simple theories get shot down…
[Diagram: theories ordered by complexity]
Updated opinion
Bayesian Convergence
Plausibility is transferred to the next-simplest theory…
Blam! [Diagram: theories ordered by complexity]
Updated opinion
Plink!
Bayesian Convergence
The true theory is never shot down.
Blam! [Diagram: theories ordered by complexity]
Updated opinion
Zing!
Convergence
But alternative strategies also converge: any theory choice in the short run is compatible with convergence in the long run.
Summary of Bayesian Approach
Prior-based explanations of Ockham’s razor are circular and based on a faulty model of ignorance.
Convergence-based explanations of Ockham’s razor fail to single out Ockham’s razor.
2. Risk Minimization
Ockham’s razor minimizes expected distance of empirical estimates from the true value.
Truth
Unconstrained Estimates
Centered on truth but spread around it.
[Diagram: shots scattered around the truth]
Unconstrained aim
Off-center but less spread.
Clamped aim
Truth
Constrained Estimates
Off-center but less spread: overall improvement in expected distance from truth…
Truth
Clamped aim
Doesn’t Find True Theory
The theory that minimizes estimation risk can be quite false…
Four eyes!
Clamped aim
Makes Sense…
…when the loss of an answer is similar in nearby distributions.
[Diagram: loss vs. similarity to p — “Close is good enough!”]
But Not When Truth Matters
…i.e., when the loss of an answer is discontinuous with similarity.
[Diagram: loss vs. similarity to p — “Close is no cigar!”]