Ockham’s Razor in Causal Discovery: A New Explanation
Kevin T. Kelly and Conor Mayo-Wilson
Department of Philosophy, Joint Program in Logic and Computation
Carnegie Mellon University
www.hss.cmu.edu/philosophy/faculty-kelly.php
I. Prediction vs. Policy
Predictive Links
Correlation or co-dependency allows one to predict Y from X.
[Slide graphic: scientist to policy maker: "Ash trays linked to lung cancer!"]
Policy
Policy manipulates X to achieve a change in Y.
[Slide graphics: "Prohibit ash trays!" … lung cancer is unchanged. "We failed!"]
Correlation is not Causation
Manipulation of X can destroy the correlation of X with Y.
[Slide graphic: with ash trays manipulated, the ash tray–cancer correlation vanishes. "We failed!"]
Standard Remedy
Randomized controlled study: "That's what happens if you carry out the policy."
Infeasibility, Expense, Morality
"Let me force a few thousand children to eat lead." … "Just joking!"
[Slide graphic: lead and IQ]
Ironic Alliance
Industry: "Ha! You will never prove that lead affects IQ…"
"And you can't throw my people out of work on a mere whim."
"So I will keep on polluting, which will never settle the matter because it is not a randomized trial."
[Slide graphic: lead and IQ]
II. Causes From Correlations
Causal Discovery
Patterns of conditional correlation can imply unambiguous causal conclusions (Pearl, Spirtes, Glymour, Scheines, etc.).
[Slide graphic: network over protein A, protein B, protein C, and a cancer protein. "Eliminate protein C!"]
Basic Idea
Causation is a directed, acyclic network over variables.
What makes a network causal is a relation of compatibility between networks and joint probability distributions.
[Slide graphic: network G over X, Y, Z paired with a compatible joint distribution p]
Compatibility
Joint distribution p is compatible with directed, acyclic network G iff:
• Causal Markov Condition: each variable X is independent of its non-effects given its immediate causes.
• Faithfulness Condition: every conditional independence relation that holds in p is a consequence of the Causal Markov Condition.
[Slide graphic: network over V, W, X, Y, Z]
Common Cause (B ← A → C)
• B yields info about C (Faithfulness);
• B yields no further info about C given A (Markov).
Causal Chain (B → A → C, or its reverse)
• B yields info about C (Faithfulness);
• B yields no further info about C given A (Markov).
Common Effect (B → A ← C)
• B yields no info about C (Markov);
• B yields extra info about C given A (Faithfulness).
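These patterns can be checked numerically. Below is a minimal sketch (my own illustration, not part of the slides), assuming linear-Gaussian variables and using correlation as the dependence measure; it simulates the common effect B → A ← C:

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after regressing out z from each."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(0)
n = 50_000
# Common effect: B -> A <- C, with B and C independent causes.
B = rng.standard_normal(n)
C = rng.standard_normal(n)
A = B + C + rng.standard_normal(n)

marginal = np.corrcoef(B, C)[0, 1]   # ~0: B yields no info about C (Markov)
conditional = partial_corr(B, C, A)  # ~-0.5: extra info given A (Faithfulness)
print(marginal, conditional)
```

With these coefficients the population partial correlation of B and C given A is exactly −1/2, so the conditional dependence is large even though the marginal dependence vanishes; this is the distinctive collider signature the slides exploit below.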
Distinguishability
[Slide graphic: the common cause and the two chains are indistinguishable by their independence relations; the common effect pattern is distinctive.]
Immediate Connections
• There is an immediate causal connection between X and Y iff X is dependent on Y given every subset of variables not containing X and Y (Spirtes, Glymour and Scheines).
[Slide graphic: for adjacent X, Y no conditioning set breaks the dependency; for X, Y connected only through Z and W, some conditioning set breaks it.]
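A sketch of this criterion (my own illustration, with partial correlation standing in for the statistical dependence test): declare X and Y adjacent only if no conditioning subset of the remaining variables drives their dependence below threshold.

```python
import itertools
import numpy as np

def partial_corr(data, i, j, cond):
    """Sample partial correlation of columns i, j given the columns in cond."""
    x, y = data[:, i], data[:, j]
    if cond:
        Z = np.column_stack([data[:, k] for k in cond] + [np.ones(len(data))])
        x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
        y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(x, y)[0, 1]

def immediately_connected(data, i, j, threshold=0.05):
    """Adjacent iff no conditioning set breaks the dependency."""
    rest = [k for k in range(data.shape[1]) if k not in (i, j)]
    return all(abs(partial_corr(data, i, j, list(c))) > threshold
               for r in range(len(rest) + 1)
               for c in itertools.combinations(rest, r))

# Chain X -> Z -> Y: X and Y are correlated, but {Z} screens them off.
rng = np.random.default_rng(1)
n = 50_000
X = rng.standard_normal(n)
Z = X + rng.standard_normal(n)
Y = Z + rng.standard_normal(n)
data = np.column_stack([X, Z, Y])

print(immediately_connected(data, 0, 1))  # X - Z: True
print(immediately_connected(data, 0, 2))  # X - Y: False (screened off by Z)
```

The threshold plays the role of the significance test discussed later: it decides how small a sample dependence counts as "no dependence."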
Recovery of Skeleton
• Apply the preceding condition to recover every non-oriented immediate causal connection.
[Slide graphic: recovered skeleton over X, Y, Z vs. the true network]
Orientation of Skeleton
• Look for the distinctive pattern of common effects.
• Draw all deductive consequences of these orientations.
[Slide graphic: a common effect X → Y ← Z is found; since Y is not a common effect of Z and its other neighbor, the remaining edge must be oriented away from Y.]
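The orientation step is purely logical once the skeleton and separating sets are known. Here is a minimal sketch (my own illustration) of the collider rule plus one deductive consequence (do not create new, undetected colliders):

```python
def orient(skeleton, sepsets):
    """skeleton: set of frozenset({x, y}) edges.
    sepsets: for each nonadjacent pair (as a sorted tuple), the set that separated them.
    Returns the oriented edges as (tail, head) pairs."""
    nodes = {v for edge in skeleton for v in edge}
    directed = set()
    # Collider rule: for x - y - z with x, z nonadjacent, orient
    # x -> y <- z iff y was NOT in the set that separated x and z.
    for y in nodes:
        neighbors = [v for v in nodes if frozenset({v, y}) in skeleton]
        for x in neighbors:
            for z in neighbors:
                if x < z and frozenset({x, z}) not in skeleton:
                    if y not in sepsets[tuple(sorted((x, z)))]:
                        directed |= {(x, y), (z, y)}
    # Deductive consequence: if x -> y and y - z with x, z nonadjacent,
    # orient y -> z, else y would be an undetected common effect.
    changed = True
    while changed:
        changed = False
        for (x, y) in list(directed):
            for z in nodes:
                if (frozenset({y, z}) in skeleton
                        and frozenset({x, z}) not in skeleton
                        and (y, z) not in directed and (z, y) not in directed):
                    directed.add((y, z))
                    changed = True
    return directed

# Skeleton A - C, B - C, C - D, where A and B were separated by the empty
# set (so C is a common effect) and A, D and B, D were separated by {C}.
skeleton = {frozenset(e) for e in [("A", "C"), ("B", "C"), ("C", "D")]}
sepsets = {("A", "B"): set(), ("A", "D"): {"C"}, ("B", "D"): {"C"}}
print(orient(skeleton, sepsets))  # orients A -> C <- B, then C -> D
```

The variable names and the example graph are mine; the two rules are the ones described on the slides.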
Causation from Correlation
The following network is causally unambiguous if all variables are observed.
[Slide graphic: protein A, protein B, protein C, cancer protein]
The red arrow is also immune to latent confounding causes.
Brave New World for Policy
Experimental (confounder-proof) conclusions from correlational data! "Eliminate protein C!"
III. The Catch
Metaphysics vs. Inference
The above results all assume that the true statistical independence relations for p are given.
But they must be inferred from finite samples.
[Slide graphic: sample → inferred statistical dependencies → causal conclusions]
Problem of Induction
Independence is indistinguishable from sufficiently small dependence at sample size n.
[Slide graphic: the same data are compatible with independence and with small dependence.]
Bridging the Inductive Gap
Assume conditional independence until the data show otherwise.
Ockham’s razor: assume no more causal complexity than necessary.
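The size of this inductive gap can be made concrete with Fisher's z-test for correlation (a standard test; the particular numbers are my own illustration): a fixed small dependence is statistically invisible at small samples and obvious at large ones.

```python
import math

def fisher_z(r, n):
    """Test statistic for H0: rho = 0; approximately standard normal under H0."""
    return 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - 3)

r = 0.05  # a small but real dependence
print(fisher_z(r, 100))      # ~0.49: indistinguishable from independence
print(fisher_z(r, 100_000))  # ~15.8: overwhelming evidence of dependence
```

At n = 100 the test cannot reject independence at any conventional level, so the Ockham policy of presuming independence is forced; only at much larger samples does the same dependence become detectable.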
Inferential Instability
There is no guarantee that small dependencies will not be detected later.
Such detections can have a spectacular impact on prior causal conclusions.
Current Policy Analysis
[Slide graphic: protein A, protein B, protein C, cancer protein. "Eliminate protein C!"]

As Sample Size Increases…
[A weak dependency involving protein D appears and the network is re-oriented. "Rescind that order!"]

As Sample Size Increases Again…
[Further weak dependencies involving protein E appear and the orientation flips back. "Eliminate protein C again!"]

Etc.
Typical Applications
• Linear Causal Case: each variable X is a linear function of its parents and a normally distributed hidden variable called an "error term". The error terms are mutually independent.
• Discrete Multinomial Case: each variable X takes on a finite range of values.
• No unobserved latent confounding causes.
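A model of the first kind can be sampled directly. A minimal sketch (my own illustration, with an arbitrary two-edge network and unit coefficients):

```python
import numpy as np

def sample_linear_sem(n, rng):
    """X -> Z <- Y: each variable is a linear function of its parents
    plus an independent standard normal error term."""
    X = rng.standard_normal(n)
    Y = rng.standard_normal(n)
    Z = 1.0 * X + 1.0 * Y + rng.standard_normal(n)
    return X, Y, Z

rng = np.random.default_rng(2)
X, Y, Z = sample_linear_sem(100_000, rng)
# Under Markov + Faithfulness this network predicts: X dep Z, Y dep Z, X indep Y.
print(np.corrcoef(X, Z)[0, 1])  # ~0.577
print(np.corrcoef(X, Y)[0, 1])  # ~0.0
```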
An Optimistic Concession
[Slide graphic: genetics as a potential common cause of smoking and cancer]
Causal Flipping Theorem
No matter what a consistent causal discovery procedure has seen so far, there exists a pair (G, p) satisfying the above assumptions such that the current sample is arbitrarily likely in p and the procedure produces arbitrarily many opposite conclusions in p about an arbitrary causal arrow in G as sample size increases.
[Slide graphic: "oops … I meant …" repeated without end]
Causal Flipping Theorem (continued)
Every consistent causal inference method is covered.
Therefore, multiple instability is an intrinsic feature of the causal discovery problem.
The Crooked Course
"Living in the midst of ignorance and considering themselves intelligent and enlightened, the senseless people go round and round, following crooked courses, just like the blind led by the blind." (Katha Upanishad, I.ii.5)
Extremist Reaction
Since causal discovery cannot lead straight to the truth, it is not justified.
"I must remain silent. Therefore, I win."
Moderate Reaction
"Many explanations have been offered to make sense of the here-today-gone-tomorrow nature of medical wisdom — what we are advised with confidence one year is reversed the next — but the simplest one is that it is the natural rhythm of science."
("Do We Really Know What Makes Us Healthy?", NY Times Magazine, Sept. 16, 2007)
Skepticism Inverted
Unavoidable retractions are justified because they are unavoidable.
Avoidable retractions are not justified because they are avoidable.
So the best possible methods for causal discovery are those that minimize causal retractions.
The best possible means for finding the truth are justified.
Larger Proposal
The same holds for Ockham's razor in general when the aim is to find the true theory.
IV. Ockham’s Razor
Which Theory is Right?
Ockham says: choose the simplest! But why? "Gotcha!"
Puzzle
An indicator must be sensitive to what it indicates: it should point one way when the truth is simple and another way when it is complex.
But Ockham's razor always points at simplicity.
How can a broken compass help you find something unless you already know where it is?
Standard Accounts
1. Prior Simplicity Bias: Bayes, BIC, MDL, MML, etc.
2. Risk Minimization: SRM, AIC, cross-validation, etc.
1. Bayesian Account
Ockham's razor is a feature of one's personal prior belief state.
Short run: no objective connection with finding the truth (the flipping theorem applies).
Long run: converges to the truth, but other prior biases would also lead to convergence.
2. Risk Minimization Account
Risk minimization is about prediction rather than truth.
It urges using a false causal theory rather than the known true theory for predictive purposes.
Therefore, it is not suited to exact science or to practical policy applications.
V. A New Foundation for Ockham's Razor
Connections to the Truth
• Short-run reliability: too strong to be feasible when theory matters.
• Long-run convergence: too weak to single out Ockham's razor.
Middle Path
• Short-run reliability: too strong to be feasible when theory matters.
• "Straightest" convergence: just right?
• Long-run convergence: too weak to single out Ockham's razor.
Empirical Problems
• Set K of infinite input sequences.
• Partition of K into alternative theories T1, T2, T3, ….

Empirical Methods
• Map finite input sequences to theories or to "?".
Method Choice
• Input history: e1, e2, e3, e4, ….
• Output history: at each stage, the scientist can choose a new method (agreeing with past theory choices).
Aim: Converge to the Truth
Example output sequence: T3, ?, T2, ?, T1, T1, T1, … (stabilizing to the true theory).
Retraction
Choosing T and then not choosing T next (outputting some T′ or "?").
Aim: Eliminate Needless Retractions
Aim: Eliminate Needless Delays to Retractions
[Slide graphic: a theory accrues applications and corollaries; the later the retraction, the more collapses with it.]
Why Timed Retractions?
Retraction minimization = generalized significance level.
Retraction time minimization = generalized power.
Easy Retraction Time Comparisons
[Slide graphic: Method 1 and Method 2 output sequences over T1, …, T4; Method 2's retractions are at least as many and at least as late.]
Worst-case Retraction Time Bounds
[Slide graphic: output sequences over T1, …, T4 with worst-case timed retraction bounds (1, 2, ∞).]
Curve Fitting
• Data = open intervals around Y at rational values of X.
• No effects: constant. First-order effect: linear. Second-order effect: quadratic.
Ockham
[Slide sequence over Constant, Linear, Quadratic, Cubic: "There yet?" "Maybe." repeated as each new effect appears.]
Ockham Violation
[Slide sequence: the violator leaps ahead to a complex theory: "I know you're coming!" … "Hmm, it's quite nice here…" … "You're back! Learned your lesson?"]
Violator's Path vs. Ockham Path
[Slide graphic: the violator's path incurs extra retractions: "See, you shouldn't run ahead, even if you are right!"]
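The Ockham strategy in this curve-fitting game can be sketched in a few lines (my own illustration: least-squares fits with a tolerance standing in for the open intervals): report the lowest polynomial degree consistent with the data seen so far.

```python
import numpy as np

def ockham_degree(xs, ys, tol, max_degree=10):
    """Lowest degree whose least-squares fit stays within tol of every point."""
    for d in range(max_degree + 1):
        coeffs = np.polyfit(xs, ys, d)
        if np.max(np.abs(np.polyval(coeffs, xs) - ys)) < tol:
            return d
    return max_degree

xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
ys = xs ** 2  # data generated by a second-order law
print(ockham_degree(xs, ys, tol=0.5))  # 2: constant and linear fits miss by up to 2
```

Like the Ockham path on the slides, this learner climbs Constant → Linear → Quadratic only when the data force it to.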
More General Argument Required
• Cover the case in which the demon has branching paths (causal discovery).
• Cover the case in which the scientist lags behind (using time as a cost). "Come on!"
Empirical Effects
May take arbitrarily long to discover, but can't be taken back once discovered.
Empirical Theories
The true theory is determined by which effects appear.
Empirical Complexity and Background Constraints
[Slide graphics: theories ordered from simpler to more complex; background constraints restrict the ordering.]
Ockham's Razor
Don't select a theory unless it is uniquely simplest in light of experience.

Weak Ockham's Razor
Don't select a theory unless it is among the simplest in light of experience.

Stalwartness
Don't retract your answer while it is uniquely simplest.
Timed Retraction Bounds
r(M, e, n) = the least timed retraction bound covering the total timed retractions of M along input streams of complexity n that extend e.
[Slide graphic: bounds for M plotted against empirical complexity 0, 1, 2, 3, …]
Efficiency of Method M at e
• M converges to the truth no matter what;
• for each convergent M′ that agrees with M up to the end of e, and for each n: r(M, e, n) ≤ r(M′, e, n).
M is Beaten at e
There exists a convergent M′ that agrees with M up to the end of e, such that:
• for each n, r(M, e, n) ≥ r(M′, e, n);
• for some n, r(M, e, n) > r(M′, e, n).
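In symbols, the efficiency and beaten conditions can be stated as follows (a sketch, using the notation of the slides):

```latex
M \text{ is efficient at } e \iff M \text{ converges to the truth, and for every convergent } M'
\text{ agreeing with } M \text{ along } e:\quad \forall n\;\; r(M, e, n) \le r(M', e, n).
\\[1ex]
M \text{ is beaten at } e \iff \text{some such } M' \text{ satisfies}\quad
\forall n\;\; r(M, e, n) \ge r(M', e, n) \;\text{ and }\; \exists n\;\; r(M, e, n) > r(M', e, n).
```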
Ockham Efficiency Theorem
Let M be a solution. The following are equivalent:
• M is always strongly Ockham and stalwart;
• M is always efficient;
• M is never weakly beaten.
Example: Causal Inference
Effects are conditional statistical dependence relations:
X dep Y | {Z}, {W}, {Z,W}
Y dep Z | {X}, {W}, {X,W}
X dep Z | {Y}, {Y,W}
…
Causal Discovery = Ockham's Razor
[Slide sequence: as conditional dependencies accumulate over X, Y, Z, W, the inferred network grows, ending with:]
X dep Y | {Z}, {W}, {Z,W}
Y dep Z | {X}, {W}, {X,W}
X dep Z | {Y}, {W}, {Y,W}
Z dep W | {X}, {Y}, {X,Y}
Y dep W | {X}, {Z}, {X,Z}
VI. Simplicity Defined
Approach
Empirical complexity reflects nested problems of induction posed by the problem.
Hence, simplicity is problem-relative but topologically invariant.
Empirical Problems
• Set K of infinite input sequences.
• Partition Q of K into alternative theories.
Simplicity Concepts
A simplicity concept for (K, Q) is a well-founded order < on a partition S of K, with ascending chains of order type not exceeding omega, such that:
1. Each element of S is included in some answer in Q;
2. Each downward union in (S, <) is closed;
3. Incomparable sets share no boundary point;
4. Each element of S is included in the boundary of its successor.
Empirical Complexity Defined
• Let K|e denote the set of all possibilities compatible with observations e.
• Let (S, <) be a simplicity concept for (K|e, Q).
• Define c(w, e) = the length of the longest <-path to the cell of S that contains w.
• Define c(T, e) = the least c(w, e) such that T is true in w.
Applications
• Polynomial laws: complexity = degree.
• Conservation laws: complexity = particle types − conserved quantities.
• Causal networks: complexity = number of logically independent conditional dependencies entailed by faithfulness.
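The causal-network measure can be made concrete by counting d-connection facts in the candidate graph. A compact sketch (my own illustration) using the moralized ancestral graph criterion for d-separation:

```python
from itertools import combinations

def d_connected(parents, x, y, cond):
    """True iff x and y are d-connected given cond in the DAG described by
    parents (dict: node -> tuple of parents)."""
    # 1. Keep only ancestors of {x, y} and the conditioning set.
    relevant, frontier = set(), {x, y} | set(cond)
    while frontier:
        v = frontier.pop()
        if v not in relevant:
            relevant.add(v)
            frontier |= set(parents.get(v, ()))
    # 2. Moralize: link each node to its parents and marry co-parents.
    adj = {v: set() for v in relevant}
    for v in relevant:
        ps = [p for p in parents.get(v, ()) if p in relevant]
        for p in ps:
            adj[v].add(p)
            adj[p].add(v)
        for p, q in combinations(ps, 2):
            adj[p].add(q)
            adj[q].add(p)
    # 3. Delete the conditioning set and test whether x still reaches y.
    stack, seen = [x], {x}
    while stack:
        v = stack.pop()
        if v == y:
            return True
        for w in adj[v]:
            if w not in seen and w not in cond:
                seen.add(w)
                stack.append(w)
    return False

def complexity(parents, variables):
    """Number of conditional dependencies (pair + conditioning set)
    entailed by faithfulness: exactly the d-connection facts."""
    return sum(1
               for x, y in combinations(variables, 2)
               for r in range(len(variables) - 1)
               for c in combinations([v for v in variables if v not in (x, y)], r)
               if d_connected(parents, x, y, c))

chain = {"Y": ("X",), "Z": ("Y",)}  # X -> Y -> Z
print(complexity(chain, ["X", "Y", "Z"]))  # 5
print(complexity({}, ["X", "Y", "Z"]))     # 0
```

By this count the chain is strictly more complex than the empty network, matching the identification of causal complexity with the number of faithfulness-entailed dependencies.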
General Ockham Efficiency Theorem
Let M be a solution. The following are equivalent:
• M is always strongly Ockham and stalwart;
• M is always efficient;
• M is never beaten.
Conclusions
• Causal truths are necessary for counterfactual predictions.
• Ockham's razor is necessary for staying on the straightest path to the true theory, but does not point at the true theory.
• No evasions or circles are required.
Future Directions
• Extension of the unique efficiency theorem to stochastic model selection.
• Latent variables as Ockham conclusions.
• Degrees of retraction.
• Pooling of marginal Ockham conclusions.
• Retraction efficiency assessment of MDL, SRM.
Suggested Reading
• "Ockham's Razor, Truth, and Information", in Handbook of the Philosophy of Information, J. van Benthem and P. Adriaans, eds., to appear.
• "Ockham's Razor, Empirical Complexity, and Truth-finding Efficiency", Theoretical Computer Science, 383: 270–289, 2007.
Both available as pre-prints at: www.hss.cmu.edu/philosophy/faculty-kelly.php
1. Prior Simplicity Bias
The simple theory is more plausible now because it was more plausible yesterday.
More Subtle Version
Simple data are a miracle in the complex theory but not in the simple theory.
[Slide graphic: theories P and C; regularity: retrograde motion of Venus at solar conjunction. "Has to be!"]
However…
e would not be a miracle given P(q); why not this prior instead?
The Real Miracle
Ignorance about model: p(C) ≈ p(P);
+ ignorance about parameter setting: p(P(q) | P) ≈ p(P(q′) | P);
= knowledge about C vs. P(q): p(P(q)) << p(C).
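The arithmetic behind the "real miracle" (the particular numbers are my own toy illustration): splitting the prior evenly between C and P, and then evenly over P's parameter settings, turns professed ignorance into a strong bias against any particular P(q).

```python
# Ignorance about model: p(C) = p(P) = 1/2.
p_C = p_P = 0.5
# Ignorance about P's parameter: say 1000 settings, equally likely given P.
settings = 1000
p_Pq = p_P / settings  # p(P(q)) for each particular setting q

print(p_Pq)        # 0.0005
print(p_C / p_Pq)  # ~1000: "knowledge" that C is far more probable than P(q)
```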
"Lead into gold. Perpetual motion. Free lunch." "Sounds good!"

Standard Paradox of Indifference
Ignorance of red vs. not-red
+ ignorance over the not-red alternatives
= knowledge about red vs. white.

Knognorance = all the privileges of knowledge with none of the responsibilities. "Sounds good!"
The Ellsberg Paradox
[Slide graphic: an urn with 1/3 red balls; the proportions of the other two colors are unknown.]

Human Preference
a > b, yet a∨c < b∨c: bets on known chances are preferred to bets on unknown ones.

Human View
The preference tracks the difference between knowledge and ignorance.

Bayesian "Rationality"
a > b forces a∨c > b∨c: a single prior treats knognorance exactly like knowledge.
In Any Event
The coherentist foundations of Bayesianism have nothing to do with short-run truth-conduciveness. ("Not so loud!")
Bayesian Convergence
• Too-simple theories get shot down… (Blam!)
• Plausibility is transferred to the next-simplest theory… (Plink!)
• The true theory is never shot down. (Zing!)
[Slide sequence: updated opinion sweeping across theories ordered by complexity]
Convergence
But alternative strategies also converge: any theory choice in the short run is compatible with convergence in the long run.
Summary of Bayesian Approach
• Prior-based explanations of Ockham's razor are circular and based on a faulty model of ignorance.
• Convergence-based explanations of Ockham's razor fail to single out Ockham's razor.
2. Risk Minimization
Ockham's razor minimizes expected distance of empirical estimates from the true value.

Unconstrained Estimates
Centered on the truth but spread around it. (Pop! Pop! Pop! Pop!)

Constrained Estimates
Clamped aim: off-center but less spread, for an overall improvement in expected distance from the truth…
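The risk-minimization point is a bias-variance trade: clamping a parameter to a false value can lower expected squared distance from the truth. A minimal Monte Carlo sketch (my own illustration, with arbitrary numbers):

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 0.1  # the true (small but nonzero) effect
n, trials = 10, 20_000

# Unconstrained estimate: sample mean, centered on truth but spread around it.
samples = rng.normal(theta, 1.0, size=(trials, n))
unconstrained = samples.mean(axis=1)
mse_unconstrained = np.mean((unconstrained - theta) ** 2)  # ~ 1/n = 0.1

# Clamped estimate: always report 0 (the simpler, strictly false theory).
mse_clamped = theta ** 2  # ~0.01

print(mse_unconstrained, mse_clamped)
```

The clamped estimate wins on expected distance even though it asserts a falsehood, which is exactly why risk minimization suits prediction but not finding the true theory.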
Doesn't Find True Theory
The theory that minimizes estimation risk can be quite false…
Makes Sense…
…when the loss of an answer is similar in nearby distributions. ("Close is good enough!")
But Not When Truth Matters
…i.e., when the loss of an answer is discontinuous with similarity. ("Close is no cigar!")