Likelihood, Inference, and Model Comparison


Page 1: Likelihood, Inference, and Model Comparison

Likelihood, Inference, and Model Comparison

FEE Course, January 2013

http://www.sortie-nd.org/lme/lme_course.html

Likelihood Methods and Models in Ecology

Page 2: Likelihood, Inference, and Model Comparison

Outline

Probability and probability density functions

Maximum likelihood estimates (versus traditional “method of moments” estimates)

Statistical inference
- Classical “frequentist” statistics: limitations and mental gyrations...
- The “likelihood” alternative: basic principles and definitions

Model comparison as a generalization of hypothesis testing

Page 3: Likelihood, Inference, and Model Comparison

Probability defined more generally...

Consider an outcome X from some process that has a set of possible outcomes S:

- If X and S are discrete (and the outcomes in S are equally likely), then P{X} = (number of outcomes equal to X) / (number of possible outcomes in S)

- If X is continuous, then the probability has to be defined in the limit:

\( P\{a \le X \le b\} = \int_a^b g(x)\,dx \)

Where g(x) is a probability density function (PDF)
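For example, in R, the probability that a standard normal observation falls between a = -1 and b = 1 is just the integral of its PDF over that interval:

# P{a <= X <= b} as the integral of a PDF g(x):
# here g is the standard normal density, a = -1, b = 1
integrate(dnorm, lower = -1, upper = 1)   # ~0.683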

Page 4: Likelihood, Inference, and Model Comparison

The Normal Probability Density Function (PDF)

[Figure: Normal PDF with mean = 0 and variance = 0.25, 0.5, 1, 2, 5, and 10; x on the horizontal axis (-5 to 5), prob(x) on the vertical axis.]

\( \mathrm{prob}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \)

where μ = mean and σ² = variance

Properties of a PDF: (1) prob(x) ≥ 0

(2) ∫ prob(x) dx = 1
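For example, a few lines of R will draw these curves (note that dnorm() takes a standard deviation, not a variance):

# Normal PDFs with mean 0 and variance 0.25, 0.5, 1, 2, 5, 10
x <- seq(-5, 5, 0.01)
vars <- c(0.25, 0.5, 1, 2, 5, 10)
plot(x, dnorm(x, 0, sqrt(vars[1])), type = "l", xlab = "x", ylab = "prob(x)")
for (v in vars[-1]) lines(x, dnorm(x, 0, sqrt(v)))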

Page 5: Likelihood, Inference, and Model Comparison

Common PDFs...

For continuous data:
- Normal
- Lognormal
- Gamma

For discrete data:
- Poisson
- Binomial
- Multinomial
- Negative Binomial

[Figure: Poisson PDF for mean m = 2.5, 5, and 10; x on the horizontal axis (0 to 30), Prob(x) on the vertical axis.]

See McLaughlin (1993) “A compendium of common probability distributions” in the reading list

Page 6: Likelihood, Inference, and Model Comparison

Why are PDFs important?

Answer: because they are used to calculate likelihood…

(And in that case, they are called “likelihood functions”)

Page 7: Likelihood, Inference, and Model Comparison

Statistical “Estimators”

A statistical estimator is a function applied to a sample of data and used to estimate an unknown population parameter

(and an “estimate” is just the result of applying an “estimator” to a sample)

A common estimator for the population mean: \( \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \)

Page 8: Likelihood, Inference, and Model Comparison

Properties of Estimators

Some desirable properties of “point estimators” (functions used to estimate a fixed parameter):

- Bias: if the average error is zero, the estimator is unbiased
- Efficiency: the estimator with the minimum variance is the most efficient (note: the most efficient estimator is often biased)
- Consistency: as sample size increases, the probability that the estimate is close to the parameter increases
- Asymptotic normality: a consistent estimator whose distribution around the true parameter θ approaches a normal distribution, with standard deviation shrinking in proportion to \( 1/\sqrt{n} \) as the sample size n grows

Page 9: Likelihood, Inference, and Model Comparison

Maximum likelihood (ML) estimates versus method of moments (MOM) estimates

Bottom line: MOM was born in the era before computers and was adequate for its time; ML needs computing power, but has more desirable properties…

Page 10: Likelihood, Inference, and Model Comparison

Doing it MOM’s way: Central Moments

If the sample (arithmetic) mean is \( \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \), then:

First central moment: \( \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x}) = 0 \)

Second moment: \( \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = s^2 \) (the sample variance)

Third moment: \( \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3 \); skew \( = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3 / s^3 \)

Fourth moment: \( \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4 \); kurtosis \( = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4 / s^4 - 3 \)
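In R, with an arbitrary illustrative sample, these become:

# MOM's way: central moments of a sample
x <- rnorm(100, mean = 10, sd = 2)      # any sample will do
m  <- mean(x)                           # sample (arithmetic) mean
s2 <- mean((x - m)^2)                   # second central moment = sample variance
skew <- mean((x - m)^3) / s2^(3/2)      # third moment, standardized by s^3
kurt <- mean((x - m)^4) / s2^2 - 3      # fourth moment, standardized by s^4, minus 3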

Page 11: Likelihood, Inference, and Model Comparison

What’s wrong with MOM’s way?

Nothing, if all you are interested in is calculating properties of your sample…

But MOM’s formulas are generally not the best way¹ to infer estimates of the statistical properties of the population from which the sample was drawn…

For example, the population variance is better estimated by

\( s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \)

(because the second central moment, with n in the denominator, is a biased underestimate of the population variance)

¹ … in the formal terms of bias, efficiency, consistency, and asymptotic normality
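A quick R comparison makes the bias visible: the raw second central moment divides by n, while var() applies the n - 1 correction:

x <- rnorm(10)
mean((x - mean(x))^2)   # second central moment: biased low
var(x)                  # divides by n - 1 instead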

Page 12: Likelihood, Inference, and Model Comparison

The Maximum Likelihood alternative…

Going back to PDFs: in plain language, a PDF allows you to calculate the probability that an observation will take on a value (x), given the underlying (true?) parameters of the population

[Figure: Poisson PDF for mean m = 2.5, 5, and 10; x on the horizontal axis (0 to 30), Prob(x) on the vertical axis.]

Poisson PDF: \( P(x \mid a) = \frac{e^{-a}\,a^x}{x!} \), where a = the mean (and variance)
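As a quick check in R, the formula and the built-in dpois() agree:

# The Poisson PDF by hand vs. dpois(), with a = 2.5 and x = 4
a <- 2.5; x <- 4
exp(-a) * a^x / factorial(x)   # 0.1336
dpois(x, lambda = a)           # 0.1336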

Page 13: Likelihood, Inference, and Model Comparison

Inference defined...

“a : the act of passing from one proposition, statement, or judgment considered as true to another whose truth is believed to follow from that of the former

b : the act of passing from statistical sample data to generalizations (as of the value of population parameters) usually with calculated degrees of certainty”

Source: Merriam-Webster Online Dictionary

Page 14: Likelihood, Inference, and Model Comparison

Statistical Inference...

... Typically concerns inferring properties of an unknown distribution from data generated by that distribution ...

Components:

-- Point estimation

-- Hypothesis testing

-- Model comparison

Page 15: Likelihood, Inference, and Model Comparison

Probability and Inference

How do you choose the “correct inference” from your data, given inevitable uncertainty and error?

Can you assign a probability to your certainty in the correctness of a given inference?
- (hint: if this is really important to you, then you should consider becoming a Bayesian, as long as you can accept what I consider to be some fairly objectionable baggage…)

How do you choose between alternate hypotheses?
- Can you assess the strength of your evidence for alternate hypotheses?

Page 16: Likelihood, Inference, and Model Comparison

The crux of the problem...

“Thus, our general problem is to assess the relative merits of rival hypotheses in the light of observational or experimental data that bear upon them....” (Edwards, pg. 1)

Edwards, A.W.F. 1992. Likelihood. Expanded Edition. Johns Hopkins University Press.

Page 17: Likelihood, Inference, and Model Comparison

Assigning Probabilities to Hypotheses

Unfortunately, hypotheses (or even different parameter estimates) cannot generally be treated as “data” (outcomes of trials)

Statisticians have debated alternate solutions to this problem for centuries (with no generally agreed-upon solution)

Page 18: Likelihood, Inference, and Model Comparison

One Way Out: Classical “Frequentist” Statistics and Tests of Null Hypotheses

Probability is defined in terms of the outcome of a series of repeated trials…

Hypothesis testing via “significance” of pre-defined “statistics”

- What is the probability of observing a particular value of a predefined test statistic, given an assumed hypothesis about the underlying scientific model, and assumptions about the probability model of the test statistic...

- Hypotheses are never “accepted”, but are “rejected” (categorically) if the probability of obtaining the observed value of the test statistic is very small (“p-value”)

Page 19: Likelihood, Inference, and Model Comparison

An Implicit Assumption

The data are an approximate “sample” of an underlying “true” reality –

i.e., there is a true population mean, and the sample provides an estimate of it...

Page 20: Likelihood, Inference, and Model Comparison

Limitations of Frequentist Statistics

Do not provide a means of measuring relative strength of observational support for alternate hypotheses (merely helps decide when to “reject” individual hypotheses in comparison to a single “null” hypothesis...)

- So you conclude the slope of the line is not = 0. How strong is your evidence that the slope is really 0.45 vs. 0.50?

Extremely non-intuitive: just what is a “confidence interval” anyway...

Page 21: Likelihood, Inference, and Model Comparison

So what is our alternative? Likelihood as a basis for inference

Remember that the PDF defines the probability of observing an outcome (x), given that you already know the true population parameter (θ)

But we want to generate an estimate of θ, given our data (x)

And, unfortunately, the two are not identical:

\( P(x \mid \theta) \neq P(\theta \mid x) \)

[Figure: Poisson PDF for mean m = 2.5, 5, and 10; x on the horizontal axis (0 to 30), Prob(x) on the vertical axis.]

Page 22: Likelihood, Inference, and Model Comparison

Fisher and the concept of “Likelihood”...

\( L(\theta \mid x) \propto P(x \mid \theta) \)

In plain English: “The likelihood (L) of the parameter estimates (θ), given a sample (x) is proportional to the probability of observing the data, given the parameters...”

{and this probability is something we can calculate, using the appropriate underlying probability model (i.e. a PDF)}

The “Likelihood Principle”

Page 23: Likelihood, Inference, and Model Comparison

R.A. Fisher (1890-1962) [Photo: Fisher at age 22]

“Likelihood and Probability in R. A. Fisher’s Statistical Methods for Research Workers” (John Aldrich): a good summary of the evolution of Fisher’s ideas on probability, likelihood, and inference… Contains links to PDFs of Fisher’s early papers… A second page shows the evolution of his ideas through changes in successive editions of Fisher’s books…

http://www.economics.soton.ac.uk/staff/aldrich/fisherguide/prob+lik.htm

Page 24: Likelihood, Inference, and Model Comparison

Calculating Likelihood and Log-Likelihood for Datasets

More generally, for i = 1..n independent observations and a vector X of observations (x_i):

Likelihood \( = L(\theta \mid X) = P(X \mid \theta) = \prod_{i=1}^{n} g(x_i \mid \theta) \)

But logarithms are easier to work with, so...

Log-likelihood \( = \ln L(\theta \mid X) = \sum_{i=1}^{n} \ln g(x_i \mid \theta) \)

where \( g(x_i \mid \theta) \) is the appropriate PDF

From basic probability theory: If two events (A and B) are independent, then P(A,B) = P(A)P(B)

Page 25: Likelihood, Inference, and Model Comparison

A simple example…

A sample of 10 observations… Assume they are normally distributed, with an unknown population mean and standard deviation.

What is the (log) likelihood that the mean is 4.5 and the standard deviation is 1.2?

\( \mathrm{prob}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \), with \( \mu = 4.5 \) and \( \sigma = 1.2 \):

observation (x) | likelihood = prob(x|μ,σ) | log-likelihood
6.11 | 0.136 | -1.998
6.40 | 0.095 | -2.354
5.73 | 0.196 | -1.629
5.71 | 0.200 | -1.610
5.91 | 0.166 | -1.796
4.96 | 0.309 | -1.174
5.36 | 0.257 | -1.358
6.29 | 0.110 | -2.210
5.54 | 0.229 | -1.475
6.02 | 0.149 | -1.901

Product of likelihoods = 2.4964E-08; summed log-likelihood = -17.506

[Figure: the normal PDF with mean 4.5 and sd 1.2; x on the horizontal axis (0 to 10), prob(x) on the vertical axis (0 to 0.35).]
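In R, the whole table collapses to a couple of lines, reproducing the totals above:

x <- c(6.11, 6.40, 5.73, 5.71, 5.91, 4.96, 5.36, 6.29, 5.54, 6.02)
sum(dnorm(x, mean = 4.5, sd = 1.2, log = TRUE))   # -17.506  (summed log-likelihood)
prod(dnorm(x, mean = 4.5, sd = 1.2))              # ~2.4964e-08  (product of likelihoods)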

Page 26: Likelihood, Inference, and Model Comparison

Likelihood “Surfaces”

The variation in likelihood for any given set of parameter values defines a likelihood “surface”...

For a model with just 1 parameter, the surface is simply a curve (aka a “likelihood profile”)

[Figure: a likelihood profile; parameter estimate on the horizontal axis (2.0 to 2.8), log-likelihood on the vertical axis (-155 to -147).]

Page 27: Likelihood, Inference, and Model Comparison

“Support” and “Support Limits”

Log-likelihood = “Support” (Edwards 1992)

[Figure: the same likelihood profile, with the maximum likelihood estimate and the 2-unit support interval marked; parameter estimate on the horizontal axis, log-likelihood on the vertical axis.]

Page 28: Likelihood, Inference, and Model Comparison

Another (still somewhat trivial) example…

MOM vs ML estimates of the probability of survival for a population:
- Data: a quadrat in which 16 of 20 seedlings survived during a census interval. (Note that in this case, the quadrat is the unit of observation, so sample size = 1)

[Figure: Binomial PDF with 16 successes out of 20 trials, plotted as a function of the survival probability p (0 to 1); P(x) on the vertical axis (0 to 0.20).]

x <- seq(0, 1, 0.005)
y <- dbinom(16, 20, x)
plot(x, y)
x[which.max(y)]

Binomial PDF: \( P(x \mid N, p) = \binom{N}{x} p^x (1-p)^{N-x} \)

i.e. Given N = 20, x = 16, what is p?

binomial coefficient: \( \binom{N}{x} = \frac{N!}{x!\,(N-x)!} \)
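As a sanity check in R, the formula and dbinom() agree at the MLE, p = 16/20 = 0.8:

choose(20, 16) * 0.8^16 * 0.2^4   # ~0.218
dbinom(16, 20, 0.8)               # same value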

Page 29: Likelihood, Inference, and Model Comparison

A more realistic example

[Figure: the binomial log-likelihood profile; p on the horizontal axis (0 to 1), log-likelihood on the vertical axis (-300 to 0).]

# Create some data (5 quadrats)
N <- c(11,14,8,22,50)
x <- c(8,7,5,17,35)

# Calculate the log-likelihood for each probability of survival
p <- seq(0,1,0.005)
log_likelihood <- rep(0,length(p))
for (i in 1:length(p)) {
  log_likelihood[i] <- sum(log(dbinom(x,N,p[i])))
}

# Plot the likelihood profile
plot(p,log_likelihood)

# What probability of survival maximizes log likelihood?
p[which.max(log_likelihood)]
# 0.685

# How does this compare to the average across the 5 quadrats?
mean(x/N)
# 0.665
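Using the profile just computed, the 2-unit support interval (see the earlier “support limits” slide) can be read straight off the grid (endpoints approximate, given the 0.005 grid step):

# Parameters whose support is within 2 log-likelihood units of the maximum
range(p[log_likelihood > max(log_likelihood) - 2])   # ~0.595 to ~0.77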

Page 30: Likelihood, Inference, and Model Comparison

Focus in on the MLE…

[Figure: the same profile zoomed in on the MLE; p on the horizontal axis (0.55 to 0.80), log-likelihood on the vertical axis (-15 to -9).]

# What is the log-likelihood of the MLE?
max(log_likelihood)
# [1] -9.46812

Things to note about log-likelihoods:

• For discrete data they are always negative (the log of a probability < 1); a positive log-likelihood usually signals a problem with your likelihood function (though a continuous density can legitimately exceed 1)

• The absolute magnitude of the log-likelihood increases as sample size increases

\( \text{Log-likelihood} = \sum_{i=1}^{n} \ln g(x_i \mid \theta) \)

Page 31: Likelihood, Inference, and Model Comparison

An example with continuous data…

The normal PDF:

\( \mathrm{prob}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \)

where x = observed value, μ = mean, σ² = variance

In R: dnorm(x, mean = 0, sd = 1, log = FALSE)

> dnorm(2, 2.5, 1)
[1] 0.3520653
> dnorm(2, 2.5, 1, log = TRUE)
[1] -1.043939

Problem: Now there are TWO unknowns needed to calculate likelihood (the mean and the variance)!

Solution: treat the variance just like another parameter in the model, and find the ML estimate of the variance just like you would any other parameter…
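For example, a minimal grid-search sketch, reusing the 10 observations from the earlier example (the grid ranges here are arbitrary choices): evaluate the log-likelihood over a grid of both parameters and pick the joint maximum.

x <- c(6.11, 6.40, 5.73, 5.71, 5.91, 4.96, 5.36, 6.29, 5.54, 6.02)
mu    <- seq(4, 8, 0.01)
sigma <- seq(0.1, 3, 0.01)
# Log-likelihood over the full (mean, sd) grid
ll <- outer(mu, sigma, Vectorize(function(m, s) sum(dnorm(x, m, s, log = TRUE))))
best <- which(ll == max(ll), arr.ind = TRUE)
mu[best[1]]      # ~5.80: the sample mean
sigma[best[2]]   # ~0.42: the sqrt of the n-denominator (biased) variance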

Page 32: Likelihood, Inference, and Model Comparison

Likelihood and Model Comparison as a basis for Hypothesis Testing

When and where is “strong inference” really useful?

When is it just an impediment to progress?

Stephens et al. 2005. Information theory and hypothesis testing: a call for pluralism. Journal of Applied Ecology 42:4-12.

Platt, J. R. 1964. Strong inference. Science 146:347-353

Page 33: Likelihood, Inference, and Model Comparison

Chamberlin’s alternative: multiple working hypotheses

Science rarely progresses through a series of dichotomously branched decisions…

Instead, we are constantly trying to choose among a large set of alternate hypotheses
- The concept is very old, but the computational power needed to adopt this approach has only recently become available…

Chamberlin, T. C. 1890. The method of multiple working hypotheses. Science 15:92.

“Conscientiously followed, the method of the (single) working hypothesis … has some serious defects. … To avoid this grave danger, the method of multiple working hypotheses is urged. It differs… in that it distributes the effort and divides the affections. …the effort is to bring up into review every rational explanation of the phenomenon in hand… and to give to all of these as impartially as possible a working form and a due place in the investigation.”

Page 34: Likelihood, Inference, and Model Comparison

Hypothesis testing and “significance”

Nester’s (1996) Creed:

• TREATMENTS: all treatments differ
• FACTORS: all factors interact
• CORRELATIONS: all variables are correlated
• POPULATIONS: no two populations are identical in any respect
• NORMALITY: no data are normally distributed
• VARIANCES: variances are never equal
• MODELS: all models are wrong
• EQUALITY: no two numbers are the same
• SIZE: many numbers are very small

Nester, M. R. 1996. An applied statistician’s creed. The Statistician 45:401-410

Page 35: Likelihood, Inference, and Model Comparison

Hypothesis testing vs. estimation

“The problem of estimation is of more central importance (than hypothesis testing)… for in almost all situations we know that the effect whose significance we are measuring is perfectly real, however small; what is at issue is its magnitude.” (Edwards, 1992, pg. 2)

“An insignificant result, far from telling us that the effect is non-existent, merely warns us that the sample was not large enough to reveal it.” (Edwards, 1992, pg. 2)

Page 36: Likelihood, Inference, and Model Comparison

The most important point of the lecture…

Any hypothesis test can be framed as a comparison of alternate models…

(and being free of the constraints imposed by the alternate models embedded in classical statistical tests is perhaps the most important benefit of the likelihood approach…)

Page 37: Likelihood, Inference, and Model Comparison

Differences in Frequentist vs. Likelihood Approaches

Traditional Frequentist Approach:
- Report the “significance” of a test … based on a test statistic calculated from sums of squares (F statistic), with a necessary assumption of a homogeneous and normally distributed error

Likelihood Approach:
- Compare a set of alternate models, assess the strength of evidence in your data for each of them, and identify the “best” model
- If the assumption about the error term isn’t appropriate, use a different error term!

Page 38: Likelihood, Inference, and Model Comparison

Remember that the error term is part of the model…

And you don’t just have to accept that a simple, normally distributed, homogeneous error is appropriate…

Estimate a separate error term for each group:

\( y_{ij} = A_i + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma_i^2) \)

Or an error term that varies as a function of the predicted value:

\( y_{ij} = A_i + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma_{ij}^2) \), with \( \sigma_{ij} \) a function of the predicted value

Or where the error isn’t normally distributed:

\( \hat{y}_{ij} = A_i, \quad y_{ij} \sim \mathrm{Gamma}(\mathrm{shape}, \mathrm{scale}) \)
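For example, a minimal sketch (with made-up data and illustrative names) that fits one mean per group and a separate normal error term per group by directly maximizing the likelihood:

set.seed(42)
group <- rep(1:2, each = 25)
y <- rnorm(50, mean = c(5, 8)[group], sd = c(1, 3)[group])   # made-up data

negLL <- function(par) {
  mu    <- par[1:2]    # one mean per group (the A_i)
  sigma <- par[3:4]    # a separate error sd per group
  if (any(sigma <= 0)) return(Inf)   # keep the sds positive
  -sum(dnorm(y, mu[group], sigma[group], log = TRUE))
}

fit <- optim(c(mean(y), mean(y), sd(y), sd(y)), negLL)
fit$par   # ML estimates: two group means, two group sds

Swapping dnorm for dgamma (with a suitable mean-to-shape/scale mapping) gives the non-normal variant.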

Page 39: Likelihood, Inference, and Model Comparison

An Example: Analysis of Covariance

A traditional frequentist ANCOVA model (homogeneous slopes):

\( y_{ij} = a_j + b\,x_{ij} + \varepsilon_{ij} \quad \text{for } j = 1..A \text{ groups} \)

What is restrictive about this model?

How would you generalize this in a likelihood framework?
- What alternate models are you testing with the standard frequentist statistics?
- What more general alternate models might you like to test?

Page 40: Likelihood, Inference, and Model Comparison

But is likelihood enough? The challenge of parsimony

“It will not be sufficient, when faced with a mass of observations, to plead special creation, even though, as we shall see, such a hypothesis commands a higher numerical likelihood than any other.”

(Edwards, 1992, pg. 1, in explaining the need for a rigorous basis for scientific inference, given uncertainty in nature...)

The importance of seeking simple answers...

Page 41: Likelihood, Inference, and Model Comparison

Models, Truth, and “Full Reality” (the Burnham and Anderson view...)

“We believe that “truth” (full reality) in the biological sciences has essentially infinite dimension, and hence ... cannot be revealed with only ... finite data and a “model” of those data...

... We can only hope to identify a model that provides a good approximation to the data available.”

(Burnham and Anderson 2002, pg. 20)

Page 42: Likelihood, Inference, and Model Comparison

The “full” model

What I irreverently call the “god” model: everything is the way it is because it is…

In statistical terms, this is simply a model with as many parameters as observations
- i.e.: \( x_i = \theta_i \)

This will always be the model with the highest likelihood! (but it won’t be the most parsimonious)…

Page 43: Likelihood, Inference, and Model Comparison

Parsimony, Ockham’s razor, and drawing elephants...

William of Ockham (1285-1349):

“Pluralitas non est ponenda sine necessitate” (“entities should not be multiplied unnecessarily”)

“Parsimony: ... 2 : economy in the use of means to an end; especially : economy of explanation in conformity with Occam's razor”

(Merriam-Webster Online Dictionary)

Page 44: Likelihood, Inference, and Model Comparison

So how many parameters DOES it take to draw an elephant...?*

*30 would “carry a chemical engineer into preliminary design” (Wei, 1975) (cited in B&A, pg 30)

Information Theory perspective:

“How much information is lost when using a simple model to approximate reality?”

Answer: the Kullback-Leibler Distance (generally unknowable)

More Practical Answer: Akaike’s Information Criterion (AIC) identifies the model that minimizes KL distance

\( \mathrm{AIC} = -2\ln(L(\theta \mid x)) + 2K \)
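For example, plugging in the binomial survival model from the earlier slides (maximized log-likelihood -9.468, K = 1 parameter):

-2 * (-9.468) + 2 * 1   # AIC = 20.94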

Page 45: Likelihood, Inference, and Model Comparison

The brave new world…

Science is the development of simplified models as explanations (approximations) of reality…

The “quality” of the explanation (the model) will be a balance of many factors (both quantitative and qualitative)