Ockham’s Razor in Causal Discovery: A New Explanation
Kevin T. Kelly and Conor Mayo-Wilson
Department of Philosophy, Joint Program in Logic and Computation
Carnegie Mellon University
www.hss.cmu.edu/philosophy/faculty-kelly.php
I. Prediction vs. Policy
Predictive Links
Correlation or co-dependency allows one to predict Y from X.
[Slide graphic: scientist to policy maker: "Ash trays linked to lung cancer!"]
Policy
Policy manipulates X to achieve a change in Y.
[Slide graphics: "Prohibit ash trays!" … lung cancer is unchanged. "We failed!"]
Correlation is not Causation
Manipulation of X can destroy the correlation of X with Y.
[Slide graphic: with ash trays manipulated, the ash tray–cancer correlation vanishes. "We failed!"]
Standard Remedy
Randomized controlled study: "That's what happens if you carry out the policy."
Infeasibility, Expense, Morality
"Let me force a few thousand children to eat lead." … "Just joking!"
[Slide graphic: lead and IQ]
Ironic Alliance
Industry: "Ha! You will never prove that lead affects IQ…"
"And you can't throw my people out of work on a mere whim."
"So I will keep on polluting, which will never settle the matter because it is not a randomized trial."
[Slide graphic: lead and IQ]
II. Causes From Correlations
Causal Discovery
Patterns of conditional correlation can imply unambiguous causal conclusions (Pearl, Spirtes, Glymour, Scheines, etc.).
[Slide graphic: network over protein A, protein B, protein C, and a cancer protein. "Eliminate protein C!"]
Basic Idea
Causation is a directed, acyclic network over variables.
What makes a network causal is a relation of compatibility between networks and joint probability distributions.
[Slide graphic: network G over X, Y, Z paired with a compatible joint distribution p]
Compatibility
Joint distribution p is compatible with directed, acyclic network G iff:
• Causal Markov Condition: each variable X is independent of its non-effects given its immediate causes.
• Faithfulness Condition: every conditional independence relation that holds in p is a consequence of the Causal Markov Condition.
[Slide graphic: network over V, W, X, Y, Z]
Common Cause (B ← A → C)
• B yields info about C (Faithfulness);
• B yields no further info about C given A (Markov).
Causal Chain (B → A → C, or its reverse)
• B yields info about C (Faithfulness);
• B yields no further info about C given A (Markov).
Common Effect (B → A ← C)
• B yields no info about C (Markov);
• B yields extra info about C given A (Faithfulness).
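These patterns can be checked numerically. Below is a minimal sketch (my own illustration, not part of the slides), assuming linear-Gaussian variables and using correlation as the dependence measure; it simulates the common effect B → A ← C:

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after regressing out z from each."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(0)
n = 50_000
# Common effect: B -> A <- C, with B and C independent causes.
B = rng.standard_normal(n)
C = rng.standard_normal(n)
A = B + C + rng.standard_normal(n)

marginal = np.corrcoef(B, C)[0, 1]   # ~0: B yields no info about C (Markov)
conditional = partial_corr(B, C, A)  # ~-0.5: extra info given A (Faithfulness)
print(marginal, conditional)
```

With these coefficients the population partial correlation of B and C given A is exactly −1/2, so the conditional dependence is large even though the marginal dependence vanishes; this is the distinctive collider signature the slides exploit below.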
Distinguishability
[Slide graphic: the common cause and the two chains are indistinguishable by their independence relations; the common effect pattern is distinctive.]
Immediate Connections
• There is an immediate causal connection between X and Y iff X is dependent on Y given every subset of variables not containing X and Y (Spirtes, Glymour and Scheines).
[Slide graphic: for adjacent X, Y no conditioning set breaks the dependency; for X, Y connected only through Z and W, some conditioning set breaks it.]
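A sketch of this criterion (my own illustration, with partial correlation standing in for the statistical dependence test): declare X and Y adjacent only if no conditioning subset of the remaining variables drives their dependence below threshold.

```python
import itertools
import numpy as np

def partial_corr(data, i, j, cond):
    """Sample partial correlation of columns i, j given the columns in cond."""
    x, y = data[:, i], data[:, j]
    if cond:
        Z = np.column_stack([data[:, k] for k in cond] + [np.ones(len(data))])
        x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
        y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(x, y)[0, 1]

def immediately_connected(data, i, j, threshold=0.05):
    """Adjacent iff no conditioning set breaks the dependency."""
    rest = [k for k in range(data.shape[1]) if k not in (i, j)]
    return all(abs(partial_corr(data, i, j, list(c))) > threshold
               for r in range(len(rest) + 1)
               for c in itertools.combinations(rest, r))

# Chain X -> Z -> Y: X and Y are correlated, but {Z} screens them off.
rng = np.random.default_rng(1)
n = 50_000
X = rng.standard_normal(n)
Z = X + rng.standard_normal(n)
Y = Z + rng.standard_normal(n)
data = np.column_stack([X, Z, Y])

print(immediately_connected(data, 0, 1))  # X - Z: True
print(immediately_connected(data, 0, 2))  # X - Y: False (screened off by Z)
```

The threshold plays the role of the significance test discussed later: it decides how small a sample dependence counts as "no dependence."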
Recovery of Skeleton
• Apply the preceding condition to recover every non-oriented immediate causal connection.
[Slide graphic: recovered skeleton over X, Y, Z vs. the true network]
Orientation of Skeleton
• Look for the distinctive pattern of common effects.
• Draw all deductive consequences of these orientations.
[Slide graphic: a common effect X → Y ← Z is found; since Y is not a common effect of Z and its other neighbor, the remaining edge must be oriented away from Y.]
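The orientation step is purely logical once the skeleton and separating sets are known. Here is a minimal sketch (my own illustration) of the collider rule plus one deductive consequence (do not create new, undetected colliders):

```python
def orient(skeleton, sepsets):
    """skeleton: set of frozenset({x, y}) edges.
    sepsets: for each nonadjacent pair (as a sorted tuple), the set that separated them.
    Returns the oriented edges as (tail, head) pairs."""
    nodes = {v for edge in skeleton for v in edge}
    directed = set()
    # Collider rule: for x - y - z with x, z nonadjacent, orient
    # x -> y <- z iff y was NOT in the set that separated x and z.
    for y in nodes:
        neighbors = [v for v in nodes if frozenset({v, y}) in skeleton]
        for x in neighbors:
            for z in neighbors:
                if x < z and frozenset({x, z}) not in skeleton:
                    if y not in sepsets[tuple(sorted((x, z)))]:
                        directed |= {(x, y), (z, y)}
    # Deductive consequence: if x -> y and y - z with x, z nonadjacent,
    # orient y -> z, else y would be an undetected common effect.
    changed = True
    while changed:
        changed = False
        for (x, y) in list(directed):
            for z in nodes:
                if (frozenset({y, z}) in skeleton
                        and frozenset({x, z}) not in skeleton
                        and (y, z) not in directed and (z, y) not in directed):
                    directed.add((y, z))
                    changed = True
    return directed

# Skeleton A - C, B - C, C - D, where A and B were separated by the empty
# set (so C is a common effect) and A, D and B, D were separated by {C}.
skeleton = {frozenset(e) for e in [("A", "C"), ("B", "C"), ("C", "D")]}
sepsets = {("A", "B"): set(), ("A", "D"): {"C"}, ("B", "D"): {"C"}}
print(orient(skeleton, sepsets))  # orients A -> C <- B, then C -> D
```

The variable names and the example graph are mine; the two rules are the ones described on the slides.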
Causation from Correlation
The following network is causally unambiguous if all variables are observed.
[Slide graphic: protein A, protein B, protein C, cancer protein]
The red arrow is also immune to latent confounding causes.
Brave New World for Policy
Experimental (confounder-proof) conclusions from correlational data! "Eliminate protein C!"
III. The Catch
Metaphysics vs. Inference
The above results all assume that the true statistical independence relations for p are given.
But they must be inferred from finite samples.
[Slide graphic: sample → inferred statistical dependencies → causal conclusions]
Problem of Induction
Independence is indistinguishable from sufficiently small dependence at sample size n.
[Slide graphic: the same data are compatible with independence and with small dependence.]
Bridging the Inductive Gap
Assume conditional independence until the data show otherwise.
Ockham’s razor: assume no more causal complexity than necessary.
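The size of this inductive gap can be made concrete with Fisher's z-test for correlation (a standard test; the particular numbers are my own illustration): a fixed small dependence is statistically invisible at small samples and obvious at large ones.

```python
import math

def fisher_z(r, n):
    """Test statistic for H0: rho = 0; approximately standard normal under H0."""
    return 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - 3)

r = 0.05  # a small but real dependence
print(fisher_z(r, 100))      # ~0.49: indistinguishable from independence
print(fisher_z(r, 100_000))  # ~15.8: overwhelming evidence of dependence
```

At n = 100 the test cannot reject independence at any conventional level, so the Ockham policy of presuming independence is forced; only at much larger samples does the same dependence become detectable.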
Inferential Instability
There is no guarantee that small dependencies will not be detected later.
Such detections can have a spectacular impact on prior causal conclusions.
Current Policy Analysis
[Slide graphic: protein A, protein B, protein C, cancer protein. "Eliminate protein C!"]

As Sample Size Increases…
[A weak dependency involving protein D appears and the network is re-oriented. "Rescind that order!"]

As Sample Size Increases Again…
[Further weak dependencies involving protein E appear and the orientation flips back. "Eliminate protein C again!"]

Etc.
Typical Applications
• Linear Causal Case: each variable X is a linear function of its parents and a normally distributed hidden variable called an "error term". The error terms are mutually independent.
• Discrete Multinomial Case: each variable X takes on a finite range of values.
• No unobserved latent confounding causes.
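A model of the first kind can be sampled directly. A minimal sketch (my own illustration, with an arbitrary two-edge network and unit coefficients):

```python
import numpy as np

def sample_linear_sem(n, rng):
    """X -> Z <- Y: each variable is a linear function of its parents
    plus an independent standard normal error term."""
    X = rng.standard_normal(n)
    Y = rng.standard_normal(n)
    Z = 1.0 * X + 1.0 * Y + rng.standard_normal(n)
    return X, Y, Z

rng = np.random.default_rng(2)
X, Y, Z = sample_linear_sem(100_000, rng)
# Under Markov + Faithfulness this network predicts: X dep Z, Y dep Z, X indep Y.
print(np.corrcoef(X, Z)[0, 1])  # ~0.577
print(np.corrcoef(X, Y)[0, 1])  # ~0.0
```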
An Optimistic Concession
[Slide graphic: genetics as a potential common cause of smoking and cancer]
Causal Flipping Theorem
No matter what a consistent causal discovery procedure has seen so far, there exists a pair (G, p) satisfying the above assumptions such that the current sample is arbitrarily likely in p and the procedure produces arbitrarily many opposite conclusions in p about an arbitrary causal arrow in G as sample size increases.
[Slide graphic: "oops … I meant …" repeated without end]
Causal Flipping Theorem (continued)
Every consistent causal inference method is covered.
Therefore, multiple instability is an intrinsic feature of the causal discovery problem.
The Crooked Course
"Living in the midst of ignorance and considering themselves intelligent and enlightened, the senseless people go round and round, following crooked courses, just like the blind led by the blind." (Katha Upanishad, I.ii.5)
Extremist Reaction
Since causal discovery cannot lead straight to the truth, it is not justified.
"I must remain silent. Therefore, I win."
Moderate Reaction
"Many explanations have been offered to make sense of the here-today-gone-tomorrow nature of medical wisdom — what we are advised with confidence one year is reversed the next — but the simplest one is that it is the natural rhythm of science."
("Do We Really Know What Makes Us Healthy?", NY Times Magazine, Sept. 16, 2007)
Skepticism Inverted
Unavoidable retractions are justified because they are unavoidable.
Avoidable retractions are not justified because they are avoidable.
So the best possible methods for causal discovery are those that minimize causal retractions.
The best possible means for finding the truth are justified.
Larger Proposal
The same holds for Ockham's razor in general when the aim is to find the true theory.
IV. Ockham’s Razor
Which Theory is Right?
Ockham says: choose the simplest! But why? "Gotcha!"
Puzzle
An indicator must be sensitive to what it indicates: it should point one way when the truth is simple and another way when it is complex.
But Ockham's razor always points at simplicity.
How can a broken compass help you find something unless you already know where it is?
Standard Accounts
1. Prior Simplicity Bias: Bayes, BIC, MDL, MML, etc.
2. Risk Minimization: SRM, AIC, cross-validation, etc.
1. Bayesian Account
Ockham's razor is a feature of one's personal prior belief state.
Short run: no objective connection with finding the truth (the flipping theorem applies).
Long run: converges to the truth, but other prior biases would also lead to convergence.
2. Risk Minimization Account
Risk minimization is about prediction rather than truth.
It urges using a false causal theory rather than the known true theory for predictive purposes.
Therefore, it is not suited to exact science or to practical policy applications.
V. A New Foundation for Ockham's Razor
Connections to the Truth
• Short-run reliability: too strong to be feasible when theory matters.
• Long-run convergence: too weak to single out Ockham's razor.
Middle Path
• Short-run reliability: too strong to be feasible when theory matters.
• "Straightest" convergence: just right?
• Long-run convergence: too weak to single out Ockham's razor.
Empirical Problems
• Set K of infinite input sequences.
• Partition of K into alternative theories T1, T2, T3, ….

Empirical Methods
• Map finite input sequences to theories or to "?".
Method Choice
• Input history: e1, e2, e3, e4, ….
• Output history: at each stage, the scientist can choose a new method (agreeing with past theory choices).
Aim: Converge to the Truth
Example output sequence: T3, ?, T2, ?, T1, T1, T1, … (stabilizing to the true theory).
Retraction
Choosing T and then not choosing T next (outputting some T′ or "?").
Aim: Eliminate Needless Retractions
Aim: Eliminate Needless Delays to Retractions
[Slide graphic: a theory accrues applications and corollaries; the later the retraction, the more collapses with it.]
Why Timed Retractions?
Retraction minimization = generalized significance level.
Retraction time minimization = generalized power.
Easy Retraction Time Comparisons
[Slide graphic: Method 1 and Method 2 output sequences over T1, …, T4; Method 2's retractions are at least as many and at least as late.]
Worst-case Retraction Time Bounds
[Slide graphic: output sequences over T1, …, T4 with worst-case timed retraction bounds (1, 2, ∞).]
Curve Fitting
• Data = open intervals around Y at rational values of X.
• No effects: constant. First-order effect: linear. Second-order effect: quadratic.
Ockham
[Slide sequence over Constant, Linear, Quadratic, Cubic: "There yet?" "Maybe." repeated as each new effect appears.]
Ockham Violation
[Slide sequence: the violator leaps ahead to a complex theory: "I know you're coming!" … "Hmm, it's quite nice here…" … "You're back! Learned your lesson?"]
Violator's Path vs. Ockham Path
[Slide graphic: the violator's path incurs extra retractions: "See, you shouldn't run ahead, even if you are right!"]
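The Ockham strategy in this curve-fitting game can be sketched in a few lines (my own illustration: least-squares fits with a tolerance standing in for the open intervals): report the lowest polynomial degree consistent with the data seen so far.

```python
import numpy as np

def ockham_degree(xs, ys, tol, max_degree=10):
    """Lowest degree whose least-squares fit stays within tol of every point."""
    for d in range(max_degree + 1):
        coeffs = np.polyfit(xs, ys, d)
        if np.max(np.abs(np.polyval(coeffs, xs) - ys)) < tol:
            return d
    return max_degree

xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
ys = xs ** 2  # data generated by a second-order law
print(ockham_degree(xs, ys, tol=0.5))  # 2: constant and linear fits miss by up to 2
```

Like the Ockham path on the slides, this learner climbs Constant → Linear → Quadratic only when the data force it to.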
More General Argument Required
• Cover the case in which the demon has branching paths (causal discovery).
• Cover the case in which the scientist lags behind (using time as a cost). "Come on!"
Empirical Effects
May take arbitrarily long to discover, but can't be taken back once discovered.
Empirical Theories
The true theory is determined by which effects appear.
Empirical Complexity and Background Constraints
[Slide graphics: theories ordered from simpler to more complex; background constraints restrict the ordering.]
Ockham's Razor
Don't select a theory unless it is uniquely simplest in light of experience.

Weak Ockham's Razor
Don't select a theory unless it is among the simplest in light of experience.

Stalwartness
Don't retract your answer while it is uniquely simplest.
Timed Retraction Bounds
r(M, e, n) = the least timed retraction bound covering the total timed retractions of M along input streams of complexity n that extend e.
[Slide graphic: bounds for M plotted against empirical complexity 0, 1, 2, 3, …]
Efficiency of Method M at e
• M converges to the truth no matter what;
• for each convergent M′ that agrees with M up to the end of e, and for each n: r(M, e, n) ≤ r(M′, e, n).
M is Beaten at e
There exists a convergent M′ that agrees with M up to the end of e, such that:
• for each n, r(M, e, n) ≥ r(M′, e, n);
• for some n, r(M, e, n) > r(M′, e, n).
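In symbols, the efficiency and beaten conditions can be stated as follows (a sketch, using the notation of the slides):

```latex
M \text{ is efficient at } e \iff M \text{ converges to the truth, and for every convergent } M'
\text{ agreeing with } M \text{ along } e:\quad \forall n\;\; r(M, e, n) \le r(M', e, n).
\\[1ex]
M \text{ is beaten at } e \iff \text{some such } M' \text{ satisfies}\quad
\forall n\;\; r(M, e, n) \ge r(M', e, n) \;\text{ and }\; \exists n\;\; r(M, e, n) > r(M', e, n).
```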
Ockham Efficiency Theorem
Let M be a solution. The following are equivalent:
• M is always strongly Ockham and stalwart;
• M is always efficient;
• M is never weakly beaten.
Example: Causal Inference
Effects are conditional statistical dependence relations:
X dep Y | {Z}, {W}, {Z,W}
Y dep Z | {X}, {W}, {X,W}
X dep Z | {Y}, {Y,W}
…
Causal Discovery = Ockham's Razor
[Slide sequence: as conditional dependencies accumulate over X, Y, Z, W, the inferred network grows, ending with:]
X dep Y | {Z}, {W}, {Z,W}
Y dep Z | {X}, {W}, {X,W}
X dep Z | {Y}, {W}, {Y,W}
Z dep W | {X}, {Y}, {X,Y}
Y dep W | {X}, {Z}, {X,Z}
VI. Simplicity Defined
Approach
Empirical complexity reflects nested problems of induction posed by the problem.
Hence, simplicity is problem-relative but topologically invariant.
Empirical Problems
• Set K of infinite input sequences.
• Partition Q of K into alternative theories.
Simplicity Concepts
A simplicity concept for (K, Q) is a well-founded order < on a partition S of K, with ascending chains of order type not exceeding omega, such that:
1. Each element of S is included in some answer in Q;
2. Each downward union in (S, <) is closed;
3. Incomparable sets share no boundary point;
4. Each element of S is included in the boundary of its successor.
Empirical Complexity Defined
• Let K|e denote the set of all possibilities compatible with observations e.
• Let (S, <) be a simplicity concept for (K|e, Q).
• Define c(w, e) = the length of the longest <-path to the cell of S that contains w.
• Define c(T, e) = the least c(w, e) such that T is true in w.
Applications
• Polynomial laws: complexity = degree.
• Conservation laws: complexity = particle types − conserved quantities.
• Causal networks: complexity = number of logically independent conditional dependencies entailed by faithfulness.
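The causal-network measure can be made concrete by counting d-connection facts in the candidate graph. A compact sketch (my own illustration) using the moralized ancestral graph criterion for d-separation:

```python
from itertools import combinations

def d_connected(parents, x, y, cond):
    """True iff x and y are d-connected given cond in the DAG described by
    parents (dict: node -> tuple of parents)."""
    # 1. Keep only ancestors of {x, y} and the conditioning set.
    relevant, frontier = set(), {x, y} | set(cond)
    while frontier:
        v = frontier.pop()
        if v not in relevant:
            relevant.add(v)
            frontier |= set(parents.get(v, ()))
    # 2. Moralize: link each node to its parents and marry co-parents.
    adj = {v: set() for v in relevant}
    for v in relevant:
        ps = [p for p in parents.get(v, ()) if p in relevant]
        for p in ps:
            adj[v].add(p)
            adj[p].add(v)
        for p, q in combinations(ps, 2):
            adj[p].add(q)
            adj[q].add(p)
    # 3. Delete the conditioning set and test whether x still reaches y.
    stack, seen = [x], {x}
    while stack:
        v = stack.pop()
        if v == y:
            return True
        for w in adj[v]:
            if w not in seen and w not in cond:
                seen.add(w)
                stack.append(w)
    return False

def complexity(parents, variables):
    """Number of conditional dependencies (pair + conditioning set)
    entailed by faithfulness: exactly the d-connection facts."""
    return sum(1
               for x, y in combinations(variables, 2)
               for r in range(len(variables) - 1)
               for c in combinations([v for v in variables if v not in (x, y)], r)
               if d_connected(parents, x, y, c))

chain = {"Y": ("X",), "Z": ("Y",)}  # X -> Y -> Z
print(complexity(chain, ["X", "Y", "Z"]))  # 5
print(complexity({}, ["X", "Y", "Z"]))     # 0
```

By this count the chain is strictly more complex than the empty network, matching the identification of causal complexity with the number of faithfulness-entailed dependencies.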
General Ockham Efficiency Theorem
Let M be a solution. The following are equivalent:
• M is always strongly Ockham and stalwart;
• M is always efficient;
• M is never beaten.
Conclusions
• Causal truths are necessary for counterfactual predictions.
• Ockham's razor is necessary for staying on the straightest path to the true theory, but does not point at the true theory.
• No evasions or circles are required.
Future Directions
• Extension of the unique efficiency theorem to stochastic model selection.
• Latent variables as Ockham conclusions.
• Degrees of retraction.
• Pooling of marginal Ockham conclusions.
• Retraction efficiency assessment of MDL, SRM.
Suggested Reading
• "Ockham's Razor, Truth, and Information", in Handbook of the Philosophy of Information, J. van Benthem and P. Adriaans, eds., to appear.
• "Ockham's Razor, Empirical Complexity, and Truth-finding Efficiency", Theoretical Computer Science, 383: 270–289, 2007.
Both available as pre-prints at: www.hss.cmu.edu/philosophy/faculty-kelly.php
1. Prior Simplicity Bias
The simple theory is more plausible now because it was more plausible yesterday.
More Subtle Version
Simple data are a miracle in the complex theory but not in the simple theory.
[Slide graphic: theories P and C; regularity: retrograde motion of Venus at solar conjunction. "Has to be!"]
However…
e would not be a miracle given P(q); why not this prior instead?
The Real Miracle
Ignorance about model: p(C) ≈ p(P);
+ ignorance about parameter setting: p(P(q) | P) ≈ p(P(q′) | P);
= knowledge about C vs. P(q): p(P(q)) << p(C).
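The arithmetic behind the "real miracle" (the particular numbers are my own toy illustration): splitting the prior evenly between C and P, and then evenly over P's parameter settings, turns professed ignorance into a strong bias against any particular P(q).

```python
# Ignorance about model: p(C) = p(P) = 1/2.
p_C = p_P = 0.5
# Ignorance about P's parameter: say 1000 settings, equally likely given P.
settings = 1000
p_Pq = p_P / settings  # p(P(q)) for each particular setting q

print(p_Pq)        # 0.0005
print(p_C / p_Pq)  # ~1000: "knowledge" that C is far more probable than P(q)
```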
"Lead into gold. Perpetual motion. Free lunch." "Sounds good!"

Standard Paradox of Indifference
Ignorance of red vs. not-red
+ ignorance over the not-red alternatives
= knowledge about red vs. white.

Knognorance = all the privileges of knowledge with none of the responsibilities. "Sounds good!"
The Ellsberg Paradox
[Slide graphic: an urn with 1/3 red balls; the proportions of the other two colors are unknown.]

Human Preference
a > b, yet a∨c < b∨c: bets on known chances are preferred to bets on unknown ones.

Human View
The preference tracks the difference between knowledge and ignorance.

Bayesian "Rationality"
a > b forces a∨c > b∨c: a single prior treats knognorance exactly like knowledge.
In Any Event
The coherentist foundations of Bayesianism have nothing to do with short-run truth-conduciveness. ("Not so loud!")
Bayesian Convergence
• Too-simple theories get shot down… (Blam!)
• Plausibility is transferred to the next-simplest theory… (Plink!)
• The true theory is never shot down. (Zing!)
[Slide sequence: updated opinion sweeping across theories ordered by complexity]
Convergence
But alternative strategies also converge: any theory choice in the short run is compatible with convergence in the long run.
Summary of Bayesian Approach
• Prior-based explanations of Ockham's razor are circular and based on a faulty model of ignorance.
• Convergence-based explanations of Ockham's razor fail to single out Ockham's razor.
2. Risk Minimization
Ockham's razor minimizes expected distance of empirical estimates from the true value.

Unconstrained Estimates
Centered on the truth but spread around it. (Pop! Pop! Pop! Pop!)

Constrained Estimates
Clamped aim: off-center but less spread, for an overall improvement in expected distance from the truth…
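The risk-minimization point is a bias-variance trade: clamping a parameter to a false value can lower expected squared distance from the truth. A minimal Monte Carlo sketch (my own illustration, with arbitrary numbers):

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 0.1  # the true (small but nonzero) effect
n, trials = 10, 20_000

# Unconstrained estimate: sample mean, centered on truth but spread around it.
samples = rng.normal(theta, 1.0, size=(trials, n))
unconstrained = samples.mean(axis=1)
mse_unconstrained = np.mean((unconstrained - theta) ** 2)  # ~ 1/n = 0.1

# Clamped estimate: always report 0 (the simpler, strictly false theory).
mse_clamped = theta ** 2  # ~0.01

print(mse_unconstrained, mse_clamped)
```

The clamped estimate wins on expected distance even though it asserts a falsehood, which is exactly why risk minimization suits prediction but not finding the true theory.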
Doesn't Find True Theory
The theory that minimizes estimation risk can be quite false…
Makes Sense…
…when the loss of an answer is similar in nearby distributions. ("Close is good enough!")
But Not When Truth Matters
…i.e., when the loss of an answer is discontinuous with similarity. ("Close is no cigar!")