REML and residual likelihood

Peter McCullagh
Department of Statistics, University of Chicago

Nelder Lecture, Imperial College, March 8, 2012
JAN: Some personal remarks...

IC 1974–1977: The MS/PhD program in Statistics
  Computing strategies: GLIM, ...
  Ordinal data and log-linear models ...
Chicago 1977–79: consulting work
IC 1979–1984:
  Plans for the GLM book I: London 1980–81
  Writing the GLM book II: Vancouver 1982
  Writing the GLM book III: London/Rothamsted 1982/83
  Toronto ASA Mtg 1984
Chicago 1985–1987:
  The second edition...
  Random effects models: the salamander data
Outline
1 Maximum likelihood
  REML and residual likelihood
  Likelihood ratios
2 Applications and examples
  Example I: fumigants for eelworm control
  Example II: kernel smoothing
  Box-Cox and REML
Symmetric functions
Estimation of moments/cumulants:
Thiele 1891; Fisher 1929; Dressel 1940; Tukey 1950
Y₁, ..., Yₙ iid, mean κ₁, variance κ₂, ...

Polynomial symmetric functions (n↓r = n(n−1)···(n−r+1)):
  k₁ = (Y₁ + ··· + Yₙ)/n for κ₁
  k₂ = Σ(Yᵢ − k₁)²/(n − 1) for κ₂
  k₁₁ = Σ_{i≠j} YᵢYⱼ/n↓2 = k₁² − k₂/n for ??
  k₃ = Σ(Yᵢ − k₁)³ · n²/n↓3 for κ₃
  k₂₁ = ...  k₁₁₁ = Σ_{i≠j≠k} YᵢYⱼYₖ/n↓3
  k₄ = ((n+1)Σ(Yᵢ − k₁)⁴/n − 3(n−1)(Σ(Yᵢ − k₁)²/n)²) · n³/n↓4
  k₃₁, k₂₂, k₂₁₁, k₁₁₁₁
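A small numerical sketch of the k-statistic formulas above, with made-up data: it checks k₂ against the stdlib sample variance, and verifies the identity k₁₁ = k₁² − k₂/n by brute-force enumeration of distinct pairs.

```python
from itertools import permutations

# made-up illustrative data
y = [2.0, 3.5, 1.0, 4.0, 2.5]
n = len(y)

k1 = sum(y) / n                                      # unbiased for kappa_1
k2 = sum((yi - k1) ** 2 for yi in y) / (n - 1)       # unbiased for kappa_2
# k3 uses the descending factorial n↓3 = n(n-1)(n-2)
k3 = sum((yi - k1) ** 3 for yi in y) * n ** 2 / (n * (n - 1) * (n - 2))

# k11 = sum over ordered distinct pairs YiYj / n↓2, equal to k1^2 - k2/n
k11_direct = sum(y[i] * y[j] for i, j in permutations(range(n), 2)) / (n * (n - 1))
k11_identity = k1 ** 2 - k2 / n

print(k1, k2, k3, k11_direct, k11_identity)
```
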
Maximum likelihood estimation
Design with n units/plots/subjects i = 1, ..., n
Covariate x(i) ≡ xᵢ in Rᵖ given
Response Y(i) = Yᵢ, a real number
Observation space Y ∈ S = Rⁿ
Covariate space X = span(X) ⊂ S
Linear model: for some β ∈ Rᵖ and σ² > 0,
  Y ∼ N(Xβ, σ²Iₙ)
Log likelihood function: l(β, σ; y) = −‖y − Xβ‖²/(2σ²) − n log σ
  β̂ = (X′X)⁻¹X′y;  µ̂ = Xβ̂;  σ̂² = ‖y − µ̂‖²/n
E(σ̂²) = (n − p)σ²/n: too small!
Conventional estimate s² = ‖y − µ̂‖²/(n − p)
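The bias factor above can be illustrated in the simplest design X = 1ₙ (intercept only, p = 1), where the ML estimate divides the residual sum of squares by n and the conventional estimate by n − p. A sketch with made-up data:

```python
# made-up illustrative data
y = [1.2, 0.7, 2.1, 1.5, 0.9, 1.8]
n, p = len(y), 1

mu_hat = sum(y) / n                       # beta-hat for X = 1_n
rss = sum((yi - mu_hat) ** 2 for yi in y)
sigma2_ml = rss / n                       # ML estimate: too small
s2 = rss / (n - p)                        # conventional unbiased estimate

# the two estimates differ by exactly the factor (n - p)/n
print(sigma2_ml / s2)
```
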
Residuals
One definition: R = Y − X(X′X)⁻¹X′Y = QY
Another definition: R′ = AY where ker(A) = X
But R′ = AR ..., so all definitions are equivalent
... for likelihood computations
Distributions: R ∼ N(0, σ²Q),  R′ ∼ N(0, σ²AA′)
Likelihoods? No density function for R
Variance-components estimation
Design with n units/plots/subjects i = 1, ..., n
Block factor relationship: B(i, j) = 1 if i ∼_B j (given)
Covariate x(i) ≡ xᵢ in Rᵖ (given treatment level)
Response Y(i) = Yᵢ, a real number
Linear model: for some β ∈ Rᵖ and σ₀², σ₁² > 0,
  Y ∼ N(Xβ, σ₀²Iₙ + σ₁²B)
mean µ = Xβ;  variance Σ = σ₀²Iₙ + σ₁²B;  W = Σ⁻¹
Log likelihood function: l(β, σ; y) = −½‖y − µ‖² − ½ log |Σ|
Sufficient statistics (balance and µ = 0):
  E(Y′Y) = tr(Σ) = nσ₀² + nσ₁²
  E(Y′BY) = tr(ΣB) = nσ₀² + σ₁² Σⱼ nⱼ²
(ML estimates: typically too small)
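In the balanced case the two moment equations above can be solved in closed form; a sketch, assuming b blocks of equal size m (so n = bm and Σⱼ nⱼ² = nm):

```latex
% Equating observed quadratic forms to their expectations,
%   E(Y'Y)  = n\sigma_0^2 + n\sigma_1^2,
%   E(Y'BY) = n\sigma_0^2 + nm\,\sigma_1^2,
% and solving the 2x2 linear system gives
\hat\sigma_1^2 = \frac{Y'BY - Y'Y}{n(m-1)}, \qquad
\hat\sigma_0^2 = \frac{Y'Y}{n} - \hat\sigma_1^2 .
```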
Gaussian likelihoods
Density of the Gaussian N(µ, Σ) distn at y ∈ Rⁿ:
  |W|^{1/2} exp(−½‖y − µ‖²) dy
(Rⁿ, W) regarded as an inner product space: W = Σ⁻¹, ⟨x, y⟩ = x′Wy
K ⊂ Rⁿ a subspace of dimension k spanned by cols of K
Orthogonal projections: P = K(K′WK)⁻¹K′W,  Q = I − P
A: a linear transformation with kernel K
Marginal likelihood based on AY ∼ N(Aµ, AΣA′) is
  |AΣA′|^{−1/2} exp(−½(y − µ)′A′(AΣA′)⁻¹A(y − µ))
Gaussian likelihoods contd.
Marginal likelihood based on AY ∼ N(Aµ, AΣA′) is
  |AΣA′|^{−1/2} exp(−½(y − µ)′A′(AΣA′)⁻¹A(y − µ))
Equivalent expressions:
  |W|^{1/2} / |K′WK|^{1/2} · exp(−½(y − µ)′WQ(y − µ))
  |W|^{1/2} |K′K|^{1/2} / |K′WK|^{1/2} · exp(−½(y − µ)′WQ(y − µ))
  Det^{1/2}(WQ) · exp(−½(y − µ)′WQ(y − µ))
Det(WQ) is the product of the n − k non-zero eigenvalues
Rⁿ/K regarded as an inner product space: ⟨x, y⟩ = x′WQy
REML and residual likelihood
Family of distributions: N(Xβ, Σ(θ)): β ∈ Rᵖ, θ ∈ Θ
Full log likelihood:
  l(β, Σ; y) = −½ log det(Σ) − ½(y − Xβ)′W(y − Xβ)
Profile log likelihood: β̂_θ = (X′WX)⁻¹X′Wy, W = Σ_θ⁻¹:
  l(β̂, Σ; y) = −½ log det(Σ) − ½ y′WQy
Residual: Y ↦ AY where ker(A) = X
Residual log likelihood:
  l(Σ; Qy) = −½ log det(Σ) − ½ log det(X′WX) − ½ y′WQy
           = ½ log Det(WQ) − ½ y′WQy + const(X)
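As a sanity check (not on the slide): specializing the residual log likelihood to the simplest case Σ = σ²Iₙ recovers the conventional unbiased estimator s².

```latex
% Sigma = sigma^2 I_n gives W = sigma^{-2} I_n and WQ = sigma^{-2} Q,
% with Q idempotent of rank n - p, so Det(WQ) = sigma^{-2(n-p)}:
l(\sigma^2; Qy) = \tfrac12\log\mathrm{Det}(WQ) - \tfrac12\,y'WQy
   = -\tfrac{n-p}{2}\log\sigma^2 - \frac{\|Qy\|^2}{2\sigma^2} + \text{const},
\qquad
\frac{\partial l}{\partial\sigma^2} = 0 \;\Rightarrow\;
\hat\sigma^2_{\mathrm{REML}} = \frac{\|Qy\|^2}{n-p} = s^2 .
```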
Summary: Marginal likelihood: K ≠ X
Model subspace X = {µ(β): β ∈ Rᵖ},  X = span(X)
Kernel subspace K = span(K)
Covariance matrix Σ_θ;  W = Σ⁻¹
Log likelihood based on observation y + K
 = log likelihood based on Ay where ker(A) = K:
  l(β, θ; y + K) = ½ log Det(WQ) − ½(y − µ)′WQ(y − µ)
where WQ = W(I − K(K′WK)⁻¹K′W) is the inner product on Rⁿ/K
Special cases:
  K = 0: ordinary likelihood
  K = 1ₙ = span(e₁ + ··· + eₙ): likelihood based on contrasts
  K = X: standard REML
  K = span(eₙ): likelihood with yₙ unobserved
Likelihood ratio tests
Simple likelihood ratio: P_θ(event)/P_θ′(event)
Maximized likelihood ratio:
  sup_{θ∈H_A} P_θ(event) / sup_{θ∈H₀} P_θ(event)
Event in numerator = event in denominator, usually dy
For marginal likelihood, event = dy + K
Marginal likelihood ratio statistic:
  sup_Θ P_θ(dy + K) / sup_{Θ′} P_θ(dy + K)
Same K in numerator and denominator
Example: Eelworm control using fumigants
[Field layout diagram: 48 plots in four blocks I–IV]

Actual field layout of 48 plots in four blocks. Experiment using
fumigants to control eelworms in oat field. (Bailey, 2008, p. 73).

Data (eelworm counts) from Cochran and Cox (1950, Table 3.1)

Blk 1 (I)           Blk 2 (IV)
269 283 252 212      95 127  80 134
138 100 197 263     107  89  41  74
282 230 216 145      88  25  42  62

Blk 3 (II)          Blk 4 (III)
124 211 194 222     193 209 109 153
102 193 128  42      29   9  17  19
162 191 107  67      23  19  44  48
Variance models: taking off from JAN (1965)
Block-structured effects: η ∼iid N(0, σ₁²), constant on each block
  Yᵢ = trt effects + η_b(i) + εᵢ
  cov(Yᵢ, Yⱼ) = σ₁²B(i, j) + σ₀²δᵢⱼ
Stationary isotropic spatial effect:
  η ∼ GP(0, σ₁²K),  cov(η(x), η(x′)) = σ₁²K(|x − x′|)
  Y(A) = trt effects + ∫_A η(x) dx + εᵢ
  cov(Yᵢ, Yⱼ) = σ₁²K̄(xᵢ, xⱼ) + σ₀²δᵢⱼ
K(x, x′) = exp(−|x − x′|/ρ) with range ρ > 0 for illustration
In practice, ρ̂ = ∞:  K(x, x′) = −|x − x′|
Comparison of variance models for eelworm expt
Y(i): response for plot i (log ratio of eelworm counts)
Block relation: B(i, j) = 1 if i ∼ j in same block
Distance relation: Dᵢⱼ = d(i, j);  Vᵢⱼ = exp(−Dᵢⱼ/ρ)
Take K = fumigant ∗ dose as kernel
Maximal model: cov(Y(i), Y(j)) = σ₀²δᵢⱼ + σ₁²Bᵢⱼ + σ₂²Vᵢⱼ
  H₀: σ₁² = σ₂² = 0
  H₁: σ₂² = 0 (no spatial effect beyond blocks)
  H₂: σ₁² = 0 (no block effect)
Log likelihood values: 6.47, 12.28, 20.53, 20.53 (both)
(Max ‘always’ occurs at ρ → ∞; V = −D is pos def on contrasts: K ⊃ 1)
R syntax: regress(y~1, ~blk+V, kernel=K)
Treatment comparisons via likelihood
Fix covariance model at cov(Y) = σ₀²Iₙ + σ₂²V  (V = −D)
Treatments: four fumigants and three dose levels including zero
  Null-null model: nothing has any effect (X = 1), dim 1
  Null model: all fumigants equally effective: 1 + dose, dim 3
  Alternative: fumigant*dose, dim 9

regress(y~1, ~V, kernel=~1)                 llik = 14.4
regress(y~dose, ~V, kernel=~1)              llik = 17.3
regress(y~dose, ~V, kernel=~dose)           llik = 16.3
regress(y~fumigant:dose, ~V, kernel=~dose)  llik = 26.7

Comparisons involve only models having the same kernel
Default kernel is K = X (REML)
Marginal likelihood and kernel smoothing
(Y₀, Y₁) ∼ N(0, Σ) with Σ partitioned as (Σ₀₀ Σ₀₁; Σ₁₀ Σ₁₁)
  Y₀ | Y₁ = y₁ ∼ N(Σ₀₁Σ₁₁⁻¹y₁, Σ₀₀ − Σ₀₁Σ₁₁⁻¹Σ₁₀)
Implications: observe Y₁ = y₁ only (n-component vector)
Predictive distn: mean = Σ₀₁Σ₁₁⁻¹y₁;  cov = W₀₀⁻¹
Typical application: observe (y₁, ..., yₙ) at (x₁, ..., xₙ)
  Σᵢⱼ = σ₀²δᵢⱼ + σ₁²K(xᵢ, xⱼ),  K(x, x′) = e^{−|x−x′|} or ...
Predictions: E(Y(x∗) | data) = Σᵢⱼ K(x∗, xᵢ)(Σ⁻¹)ᵢⱼ yⱼ
  a ‘smooth’ fn of x∗, called a kernel spline
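A minimal numerical sketch of the predictive mean above, assuming a zero-mean process, the exponential kernel, and made-up data; `solve` and `kernel_spline` are hypothetical helper names, the linear solve is a tiny Gaussian elimination to keep the example dependency-free, and the variances σ₀², σ₁² are simply assumed rather than estimated.

```python
import math

def solve(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def kernel_spline(xs, ys, xstar, s0=0.1, s1=1.0):
    """Predictive mean k(x*)' Sigma11^{-1} y with
    Sigma11 = s0 I + s1 K and kernel K(x, x') = exp(-|x - x'|)."""
    n = len(xs)
    S11 = [[s1 * math.exp(-abs(xs[i] - xs[j])) + (s0 if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    w = solve(S11, ys)                       # Sigma11^{-1} y
    return sum(s1 * math.exp(-abs(xstar - xs[j])) * w[j] for j in range(n))

# made-up illustrative data
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.3, 0.5, 0.4, 0.8]
print(kernel_spline(xs, ys, 2.5))
```

As σ₀² → 0 the predictor interpolates the data exactly; σ₀² > 0 shrinks the fit toward a smooth curve.
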
[Figure: four kernel-spline fits to the same scatterplot (x from 2 to 10, y from 0.3 to 0.9):
 C_1 spline, constant mean model; C_1 spline, linear mean model;
 C_2 spline, linear mean model; C_2 spline, quadratic mean model]
R code:
d <- abs(outer(x, x, "-")); rho <- 100
K <- (1 + d/rho) * exp(-d/rho)    # C_1 kernel
fit1 <- regress(y~1, ~K, kernel=1)
blp <- fit1$fitted + fit1$sigma[2] * K %*% fit1$W %*% (y - fit1$fitted)
plot(x, y, cex=0.5); lines(x, blp)

Example of an improper covariance function:
K3 <- d^3                         # C_2 kernel
xsq <- x^2                        # quadratic term for the mean model
fit3 <- regress(y~1+x+xsq, ~K3, kernel=~1+x)
blp <- fit3$fitted + fit3$sigma[2] * K3 %*% fit3$W %*% (y - fit3$fitted)
plot(x, y, cex=0.5); lines(x, blp)
The Box-Cox technique for transformation
Family of transformations y ↦ g_λ(y) = (y^λ − 1)/λ,
indexed by λ and applied component-wise
Model: for some λ, g_λ(Y) ∼ N(Xβ, Σ)
Density at y ∈ Rⁿ is
  det(W)^{1/2} exp(−½‖g_λ(y) − Xβ‖²_W) × J_λ(y) dy
  W = Σ⁻¹,  J_λ(y) = Π |g′_λ(yᵢ)|
Log likelihood is
  ½ log det(W) − ½‖g_λ(y) − Xβ‖²_W + Σ log |g′_λ(yᵢ)|
Profile log likelihood for λ is
  ½ log det(W) − ½‖g_λ(y)‖²_WQ + Σ log |g′_λ(yᵢ)|
Box-Cox and REML
Profile log likelihood for λ:
  ½ log det(W) − ½‖g_λ(y)‖²_WQ + Σ log |g′_λ(yᵢ)|
  g_λ(y) = (y^λ − 1)/λ,  g′_λ(y) = y^{λ−1}
REML likelihood (Shi and Tsai, JRSSB, 2002):
  l(λ, W; y, X) = ½ log Det(WQ) − ½‖g_λ(y)‖²_WQ + Σ log |g′_λ(yᵢ)|
...by adopting the results of Verbyla (1990) or Diggle (1994)...
Is this right/reasonable/OK?
(i) seems reasonable by analogy with REML to adjust for d.f.
(ii) but not a function of the residuals Qy
(iii) put X = Iₙ: resid = 0 but l(λ, ...) = (λ − 1) Σ log(yᵢ)
... so it cannot be right!
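Point (iii) can be written out in one line (with Det taken over the non-zero eigenvalues, of which there are none when Q = 0):

```latex
% With X = I_n the projection is the identity:
%   P = X(X'WX)^{-1}X'W = I_n,  Q = 0,
% so log Det(WQ) = 0 and \|g_\lambda(y)\|^2_{WQ} = 0, leaving
l(\lambda, W;\, y, I_n) = \sum_i \log|g'_\lambda(y_i)|
                        = (\lambda - 1)\sum_i \log y_i ,
% which is monotone and unbounded in \lambda whenever
% \sum_i \log y_i \ne 0, even though the residual Qy = 0
% carries no information about \lambda.
```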
Box-Cox and REML, contd
Is there a right way to combine Box-Cox with REML? No!
Why not?
Ans I: because the transformation y ↦ y^λ: Rⁿ → Rⁿ
  is not measurable with respect to B(Rⁿ/K);
  the transformation does not preserve cosets
Ans II: model says Y^λ ∼ N(µ ∈ X, Σ), or Y ∼ N(µ, Σ, λ)
  Then E(Y) ∉ X implies the distn of QY depends on µ
References
Bailey, R.A. (2008) Design of Comparative Experiments. Cambridge University Press.
Box, G.E.P. and Cox, D.R. (1964) An analysis of transformations. JRSS B 26, 211–252.
Harville, D.A. (1974) Bayesian inference for variance components using only error contrasts. Biometrika 61, 383–385.
Harville, D.A. (1977) Maximum likelihood approaches to variance component estimation. JASA 72, 320–340.
Nelder, J.A. (1965) The analysis of randomized experiments with orthogonal block structure. Proc. Roy. Soc. A 283.
Patterson, H.D. and Thompson, R. (1971) Recovery of inter-block information when block sizes are unequal. Biometrika 58, 545–554.
Shi, P. and Tsai, C.-L. (2002) Regression model selection: a residual likelihood approach. JRSS B 64.