
Modern Regression Basics

T. W. Yee

University of Auckland

October 2008 @ Cagliari

[email protected]  http://www.stat.auckland.ac.nz/~yee


Outline of This Talk

1 Linear Models

2 Generalized Linear Models (GLMs)

3 Smoothing

4 Generalized Additive Models (GAMs)

5 Introduction to VGLMs and VGAMs

6 Concluding Remarks


Linear Models

Data $(x_i, y_i, w_i)$, $i = 1, \ldots, n$, with $\mathrm{Var}(\varepsilon_i) = \sigma^2/w_i$, and

$$E(Y_i) = \eta(x_i) = \sum_{k=1}^{p} x_{ik}\,\beta_k.$$

That is,

$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N_n\!\left(0,\ \sigma^2 W^{-1}\right). \qquad (1)$$

$X$ is an $n \times p$ matrix (assumed of rank $p$), and $\beta$ is a $p$-vector of regression coefficients (parameters).

The t-test, ANOVA, multiple linear regression etc. are special cases of (1).
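As a quick illustration (not from the original slides), model (1) with known weights $w_i$ can be fitted in R with lm(); the data below are simulated purely for this sketch.

# Simulated data for model (1): Var(eps_i) = sigma^2 / w_i with known weights w_i
set.seed(123)
n  <- 50
x2 <- runif(n)
w  <- rep(c(1, 4), length = n)                   # known weights
y  <- 1 + 2 * x2 + rnorm(n, sd = 1 / sqrt(w))    # sigma = 1 here
fit <- lm(y ~ x2, weights = w)                   # weighted least squares
summary(fit)$coefficients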


Estimation I

Estimate $\beta$ by weighted least squares (WLS):

$$\hat\beta = \arg\min_{\beta} \sum_{i=1}^{n} w_i \Big( y_i - \sum_{k=1}^{p} x_{ik}\beta_k \Big)^{2} = \arg\min_{\beta}\ (y - X\beta)^{T} W (y - X\beta).$$

The solution is (from the normal equations)

$$\hat\beta = \left(X^{T} W X\right)^{-1} X^{T} W y, \qquad (2)$$

$$\hat y = X \hat\beta. \qquad (3)$$

Also, the variance-covariance matrix of $\hat\beta$ is

$$\mathrm{Var}(\hat\beta) = \sigma^2 \left(X^{T} W X\right)^{-1}. \qquad (4)$$
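A small numerical check of (2), assuming nothing beyond base R: the normal-equations solution agrees with lm(..., weights = w). The data are simulated for this sketch.

# Verify beta-hat = (X'WX)^{-1} X'Wy against lm() on simulated data
set.seed(123)
n  <- 50
x2 <- runif(n)
w  <- rep(c(1, 4), length = n)
y  <- 1 + 2 * x2 + rnorm(n, sd = 1 / sqrt(w))
X  <- cbind(1, x2)                               # n x p design matrix
W  <- diag(w)
betahat <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)
cbind(normal.eqns = betahat, lm = coef(lm(y ~ x2, weights = w)))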


Estimation II

Suppose $W = I_n$. Then LS has a very nice geometric interpretation. Also,

$$\hat y = H y \quad \text{where} \quad H = X\left(X^{T} X\right)^{-1} X^{T}. \qquad (5)$$

Note that $H = H^2$ (idempotent) and $H = H^{T}$ (symmetric), hence $H$ is a projection matrix. Such a matrix represents an orthogonal projection.

The eigenvalues of $H$ are $p$ 1's and $(n-p)$ 0's. Consequently, $\mathrm{trace}(H) = \mathrm{rank}(H)$.
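A short sketch (with a simulated design matrix) confirming the projection-matrix properties of H stated above.

# H = X (X'X)^{-1} X' is symmetric, idempotent, and has trace equal to p
set.seed(1)
X <- cbind(1, runif(30), rnorm(30))          # n = 30, p = 3
H <- X %*% solve(t(X) %*% X) %*% t(X)
isTRUE(all.equal(H %*% H, H))                # idempotent
isTRUE(all.equal(t(H), H))                   # symmetric
sum(diag(H))                                 # trace(H) = rank(H) = p = 3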


Figure: http://www.R-project.org


S Model Formulae I

The S model formula language is adopted from Wilkinson and Rogers (1973).

Form: response ∼ expression

LHS = the response (usually a vector in a data frame or a matrix).

RHS = explanatory variables.


S Model Formulae II

Consider

> y ~ x1 + x2 + x3 + f1:f2 + f1 * x1 + f2/f3 + f3:f4:f5 +

+ (f6 + f7)^2

where variables beginning with an x are numeric and those beginning with an f are factors.

By default an intercept (denoted 1) is fitted. Suppress the intercept with -1.

The interaction f1*f2 is expanded to 1 + f1 + f2 + f1:f2. The terms f1 and f2 are main effects.

A second-order interaction between two factors is expressed using factor:factor and contributes terms $\gamma_{ij}$. There are other types of interactions. Interactions between a factor and a numeric, factor:numeric, produce terms of the form $\beta_j x$.


S Model Formulae III

Interactions between two numerics, numeric:numeric, produce a cross-product term such as $\beta x_2 x_3$.

The term (f6 + f7)^2 expands to f6 + f7 + f6:f7. A term (f6 + f7 + f8)^2 - f7:f8 would expand to all main effects and all second-order interactions except for f7:f8.

Nesting is achieved by /, e.g., f2/f3 is shorthand for 1 + f2 + f3:f2, or equivalently,

> 1 + f2 + f3 %in% f2

Example: f2 = state and f3 = county.


S Model Formulae IV

There are times when you need to use the identity function I(), e.g., because "^" has special meaning,

> lm(y ~ -1 + offset(a) + x1 + I(x2 - 1) + I(x3^3))

fits

$$y_i = a_i + \beta_1 x_{i1} + \beta_2 (x_{i2} - 1) + \beta_3 x_{i3}^{3} + \varepsilon_i, \qquad \varepsilon_i \sim \text{iid } N(0, \sigma^2), \quad i = 1, \ldots, n,$$

where a is a vector containing the (known) $a_i$.

Other functions: factor(), as.factor(), ordered(), terms(), levels(), options().


S generics

Generic functions are available for lm objects. They include

add1()

anova()

coef()

deviance()

drop1()

plot()

predict()

print()

residuals()

step()

summary()

update().

Other less-used generic functions are alias(), effects(), family(), kappa(), labels(), proj().

Some other functions are model.matrix(), options().


The lm() Function

> args(lm)

function (formula, data, subset, weights, na.action, method = "qr",

model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,

contrasts = NULL, offset, ...)

NULL

The most useful arguments are

weights,

subset,

na.action—na.fail(), na.omit(),

contrasts.

Data frames: read.table(), write.table(), na.omit().


Factors I

> options()$contrasts

        unordered           ordered 
"contr.treatment"     "contr.poly" 

Note:

1 contr.treatment is used so that each coefficient compares that level with level 1 (omitting level 1 itself).

2 contr.sum constrains the coefficients to sum to zero.

3 contr.poly is used for equally spaced, equally replicated orthogonal polynomial contrasts.


Factors II

One can change them by, for example,

> options(contrasts = c("contr.treatment", "contr.poly"))

Here, the first level of the factor is the baseline level.

Table: Dummy variables—partial method.

RACE       D1   D2   D3
White       0    0    0
Black       1    0    0
Hispanic    0    1    0
Other       0    0    1
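The dummy coding in the table can be inspected directly in R; the factor levels below are chosen to match the table (an illustrative sketch, not taken from the slides).

# contr.treatment dummy variables for a 4-level factor (baseline = White)
race <- factor(c("White", "Black", "Hispanic", "Other"),
               levels = c("White", "Black", "Hispanic", "Other"))
model.matrix(~ race)   # columns: (Intercept), raceBlack, raceHispanic, raceOther
contr.sum(4)           # the sum-to-zero coding, for comparison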


Factors III

An example:

> options(contrasts = c("contr.treatment", "contr.poly"))

> y <- 1:9

> x <- rep(1:3, len = 9)

> lm(y ~ as.factor(x))

Call:
lm(formula = y ~ as.factor(x))

Coefficients:
  (Intercept)  as.factor(x)2  as.factor(x)3
            4              1              2


Factors IV

Another example:

> options(contrasts = c("contr.sum", "contr.poly"))

> lm(y ~ as.factor(x))

Call:
lm(formula = y ~ as.factor(x))

Coefficients:
  (Intercept)  as.factor(x)1  as.factor(x)2
    5.000e+00     -1.000e+00     -7.493e-17


Topics not done . . .

Other important topics not done:

residual analysis

influential observations

robust regression

variable selection

. . .


Generalized Linear Models (GLMs)

Y ∼ Exponential family (normal, binomial, Poisson, . . . )

$$g(\mu) = \eta(x) = \beta^{T} x = \beta_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$

$g$ is the link function (known, monotonic, twice differentiable).

$\eta = \sum_{k=1}^{p} \beta_k x_k$ is known as the linear predictor.

Proposed by Nelder and Wedderburn (1972), GLMs include the general linear model, logistic regression, probit analysis, Poisson regression, gamma and inverse Gaussian regression, etc. The unification was a major breakthrough in statistical theory.

Estimation: iteratively reweighted least squares (IRLS; see later)
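To make the IRLS idea concrete, here is a bare-bones sketch of the algorithm for Poisson regression with a log link (simulated data; glm() does all of this internally, with proper starting values and convergence checks).

# IRLS for Poisson regression, log link: working response z and working weights mu
set.seed(2)
x <- runif(100)
y <- rpois(100, lambda = exp(0.5 + 1.2 * x))
X <- cbind(1, x)
beta <- c(log(mean(y)), 0)                   # crude starting values
for (iter in 1:25) {
  eta <- drop(X %*% beta)
  mu  <- exp(eta)
  z   <- eta + (y - mu) / mu                 # working response
  beta <- coef(lm(z ~ X - 1, weights = mu))  # weighted least squares step
}
cbind(IRLS = beta, glm = coef(glm(y ~ x, family = poisson)))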


The Exponential Family I

The distribution of a univariate r.v. $Y$ belongs to the exponential family if its p.(d.)f. $f(y; \theta)$ can be written as

$$f(y; \theta) = \exp\{p(y)\, q(\theta) + r(y) + s(\theta)\}. \qquad (6)$$

Here $\theta$ is the parameter of interest, and the functions $p$, $q$, $r$, $s$ are known. Other parameters can be accommodated provided they are known—we simply incorporate them in $p$, $q$, $r$ and $s$.

The exponential family has a canonical form where $p(y) = y$. We also want to be able to explicitly consider scale parameters such as $\sigma$ in $N(\mu, \sigma^2)$, so we write

$$f(y; \theta, \phi) = \exp\left\{ \frac{\omega\left[\, y\, d(\theta) - b(\theta) \right]}{\phi} + c(y, \phi, \omega) \right\}. \qquad (7)$$


The Exponential Family II

Equation (7) belongs to (6) provided $\phi$ is known ($\omega$ is some known constant here); $\phi > 0$, $\omega \ge 0$.

$$\theta^{*} = d(\theta)$$

is often called the natural parameter of the distribution.


The Exponential Family III

(i) $Y \sim N(\mu, \sigma^2)$:

$$f(y; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2}(y - \mu)^2 \right\} = \exp\left\{ \frac{y\mu - \frac{1}{2}\mu^2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2) \right\}.$$

(ii) $Y \sim \mathrm{Poisson}(\theta)$:

$$f(y; \theta) = \frac{e^{-\theta}\theta^{y}}{y!} = \exp\{ y\log\theta - \theta - \log y! \}.$$


The Exponential Family IV

(iii) $Z \sim \mathrm{Binomial}(m, p)$. For $z = 0, 1, \ldots, m$,

$$P(Z = z; p) = \binom{m}{z} p^{z}(1-p)^{m-z} = \exp\left\{ z\log p + (m-z)\log(1-p) + \log\binom{m}{z} \right\} = \exp\left\{ z\log\frac{p}{1-p} + m\log(1-p) + \log\binom{m}{z} \right\}.$$

If we look at the sample proportion, $Y = Z/m$, for $y = 0, 1/m, \ldots, m/m$,

$$P(Y = y; p) = P(Z = my; p) = \exp\left[ m\left\{ y\log\frac{p}{1-p} + \log(1-p) \right\} + \log\binom{m}{my} \right].$$


The Exponential Family V

Let $\ell(\theta) = \log L(\theta) = \log f(Y; \theta)$, $u(\theta) = \dfrac{\partial \ell(\theta)}{\partial \theta}$, and $\mathcal{I}(\theta) = -\dfrac{\partial^2 \ell(\theta)}{\partial \theta^2}$.

One can show that

$$E[Y] = \mu = \frac{b'(\theta)}{d'(\theta)} \qquad (8)$$

and

$$\mathrm{Var}(Y) = \frac{\phi}{\omega\, d'(\theta)^2}\left\{ b''(\theta) - \mu\, d''(\theta) \right\}, \quad \text{or} \quad \mathrm{Var}(Y) = \frac{b''(\theta)\,\phi}{d'(\theta)^2\,\omega} - \frac{\phi}{\omega}\cdot\frac{b'(\theta)\, d''(\theta)}{\{d'(\theta)\}^{3}}. \qquad (9)$$


The Exponential Family VI

If the model is parameterized in terms of the natural parameter $\theta^{*}$ (e.g., $\theta^{*} = \log\frac{p}{1-p}$ instead of $p$) then $d'(\theta^{*}) = 1$ and

$$\mathrm{Var}(Y) = b''(\theta^{*})\,\frac{\phi}{\omega}. \qquad (10)$$

One then gets the following table.


The Exponential Family VII

Table: Exponential family components for three common distributions.

Normal$(\mu, \sigma^2)$:
  $f(y;\theta,\phi) = \frac{1}{\sqrt{2\pi}\sigma}\, e^{-\frac{1}{2}(y-\mu)^2/\sigma^2}$;  range of $Y$: $(-\infty, \infty)$;  $E(Y) = \mu$;
  "usual" parameter $\theta = \mu$;  "natural" parameter $\theta^{*} = \mu$;  $b(\theta) = \frac{1}{2}\mu^2\ (= \frac{1}{2}\theta^{*2})$;
  $\phi = \sigma^2$;  $\omega = 1$;  $c(y,\phi,\omega) = -\frac{1}{2}\{y^2/\phi + \log(2\pi\phi)\}$;  variance function: constant $(\sigma^2)$.

Poisson$(\lambda)$:
  $f(y;\theta,\phi) = e^{-\lambda}\lambda^{y}/y!$;  range of $Y$: $0, 1, 2, \ldots$;  $E(Y) = \mu = \lambda$;
  "usual" parameter $\theta = \lambda$;  "natural" parameter $\theta^{*} = \log\lambda = \log\mu$;  $b(\theta) = \lambda\ (= e^{\theta^{*}})$;
  $\phi = 1$;  $\omega = 1$;  $c(y,\phi,\omega) = -\log y!$;  variance function: $\mu$.

$\frac{1}{m}$Binomial$(m, p)$ ($Y$ = sample proportion):
  $f(y;\theta,\phi) = \binom{m}{my} p^{my}(1-p)^{m-my}$;  range of $Y$: $0, 1/m, \ldots, m/m$;  $E(Y) = \mu = p = \frac{e^{\theta^{*}}}{1+e^{\theta^{*}}}$;
  "usual" parameter $\theta = p$;  "natural" parameter $\theta^{*} = \log\frac{p}{1-p} = \log\frac{\mu}{1-\mu}$;  $b(\theta) = -\log(1-p) = \log(1+e^{\theta^{*}})$;
  $\phi = 1$;  $\omega = m$;  $c(y,\phi,\omega) = \log\binom{m}{my}$;  variance function: $\frac{\mu(1-\mu)}{m}$.

(The variance function is $\mathrm{Var}(Y)$ expressed as a function of $\mu$.)


S and GLMs I

In S use, e.g., glm(y ~ x2 + x3 + x4, family = binomial, data = d).

Family functions are gaussian(), binomial(), poisson(), Gamma(), inverse.gaussian(), quasi().

Generic functions include anova(), coef(), fitted(), plot(), predict(), print(), resid(), summary(), update().

Recall the Wilkinson and Rogers (1973) formula language, e.g., if f1 and f2 are factors and x1 and x2 are numeric, then

f1 * f2 ≡ 1 + f1 + f2 + f1:f2,

f1/f2 ≡ "f1 and then f2 within factor f1",

x1 + x2 ≡ $\beta_1 X_1 + \beta_2 X_2$.

Data frames to hold all the data. Columns are the variables.


S and GLMs II

> data(nzc)

> with(nzc, plot(year, female/(male + female), ylab = "Proportion",

+ main = "Proportion of NZ Chinese that are female",

+ col = "blue", las = 1))

> abline(h = 0.5, lty = "dashed")

> fit.nzc = vglm(cbind(female, male) ~ year, fam = binomialff,

+ data = nzc)

> with(nzc, lines(year, fitted(fit.nzc), col = "red"))


S and GLMs III

[Figure: "Proportion of NZ Chinese that are female": the proportion female versus year, with the fitted logistic regression curve.]


S and GLMs IV

> with(nzc, plot(year, female/(male + female), ylab = "Proportion",

+ main = "Proportion of NZ Chinese that are female",

+ col = "blue", las = 1))

> abline(h = 0.5, lty = "dashed")

> fit.nzc = vglm(cbind(female, male) ~ poly(year,

+ 2), fam = binomialff, data = nzc)

[Figure: the same plot of proportion female versus year, with the quadratic (poly(year, 2)) logistic fit added.]


Logistic regression I

> options(contrasts = c("contr.treatment", "contr.poly"))

> y <- cbind(c(5, 20, 15, 10), c(20, 10, 10, 10))

> x <- 1:4

> fit <- glm(y ~ as.factor(x), family = binomial)

> fit

Call: glm(formula = y ~ as.factor(x), family = binomial)

Coefficients:

(Intercept) as.factor(x)2 as.factor(x)3 as.factor(x)4

-1.386 2.079 1.792 1.386

Degrees of Freedom: 3 Total (i.e. Null); 0 Residual

Null Deviance: 14.04

Residual Deviance: 4.441e-15 AIC: 22.14

> exp(coef(fit)[-1])


Logistic regression II

as.factor(x)2 as.factor(x)3 as.factor(x)4

8 6 4

Here the model is

$$\mathrm{logit}\, p(x) = \beta_0 + \beta_j, \qquad j = 1, 2, 3, 4,$$

where $\beta_1 = 0$.


Logistic regression III

If $\eta(x) = \beta_0 + \beta_1 x$ then the log odds for a change of $c$ units in $x$ is obtained from the logit difference

$$\eta(x + c) - \eta(x) = c\,\beta_1 \qquad (11)$$

and the associated odds ratio is

$$\psi(c) = \psi(x + c,\, x) = \exp(c\,\beta_1). \qquad (12)$$

Example If

logit P(D|AGE) = 0.2 + 0.13× AGE

then an increase in age of 10 years multiplies the odds of disease by exp(1.3) = 3.67.


Logistic regression IV

In general, for

$$\mathrm{logit}\, p(x) = \beta_0 + \beta^{T} x \qquad (13)$$

we have

$$\log\psi = \log\left\{ \frac{p(x_1)/(1-p(x_1))}{p(x_0)/(1-p(x_0))} \right\} = \beta^{T}(x_1 - x_0). \qquad (14)$$

Thus, for confidence intervals etc., use

$$\widehat{\mathrm{Var}}\!\left(\log\hat\psi\right) = (x_1 - x_0)^{T}\, \widehat{\mathrm{Var}}(\hat\beta)\, (x_1 - x_0). \qquad (15)$$

Note:

$$\frac{p(x_1)/(1-p(x_1))}{p(x_0)/(1-p(x_0))}$$

is the odds ratio for $Y = 1$ for a person with $x_1$ relative to a person with $x_0$.
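A hedged sketch of (14)-(15) in R: the log odds ratio for a 10-unit increase in a covariate and its Wald confidence interval, computed from vcov() of a fitted glm(). The data and the contrast vector d are invented for the illustration.

# Odds ratio and 95% Wald CI for the contrast x1 - x0 = (0, 10), i.e. a 10-year age increase
set.seed(3)
age <- runif(200, 20, 70)
y   <- rbinom(200, size = 1, prob = plogis(-3 + 0.06 * age))
fit <- glm(y ~ age, family = binomial)
d       <- c(0, 10)                                   # x1 - x0 (intercept slot first)
log.psi <- sum(d * coef(fit))                         # equation (14)
se      <- sqrt(drop(t(d) %*% vcov(fit) %*% d))       # square root of (15)
exp(log.psi + c(lower = -1.96, estimate = 0, upper = 1.96) * se)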


Some extensions of GLMs

Quasi-likelihood (Wedderburn, 1974)

Composite link functions (Thompson and Baker, 1981)

IRLS for other models (Green, 1984)

Double Exponential Families (Efron, 1986)

Generalized Estimating Equations (GEE; Liang and Zeger, 1986)

Generalized linear mixed models (GLMMs)

ANOVA Splines (Wahba and co-workers, 1995)

Polychotomous Regression (Kooperberg and co-workers, 1997)

Multivariate GLMs (Fahrmeir and Tutz, 2001)

Hierarchical GLMs (Nelder and Lee, late 1990s)

Generalized additive models (GAMs; Hastie and Tibshirani, 1986)

Generalized additive mixed models (GAMMs; Lin, 1998)

Vector GLMs and VGAMs (Yee and Wild, 1996)


Smoothing

Smoothing is a powerful tool for exploratory data analysis. It allows a data-driven approach rather than a model-driven approach. It allows the data to "speak for itself".

Probably the central idea is localness, i.e., local behaviour versus global behaviour of a function.

Scatterplot data $(x_i, y_i)$, $i = 1, \ldots, n$.

The classical smoothing problem is

$$y_i = f(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim (0, \sigma_i^2) \qquad (16)$$

independently. Here, $f$ is an arbitrary smooth function, and $i = 1, \ldots, n$.

Q: How can $f$ be estimated?

A: If there is no a priori functional form for $f$, one solution is the smoother.


Uses of Smoothing

Smoothing has many uses, e.g.,

data visualization and EDA

prediction

derivative estimation, e.g., growth curves, acceleration

used as a basis for many modern statistical techniques


Example I

[Figure: the simulated scatterplot data (x, y) used in this smoothing example.]


Example I

[Figure: the simulated scatterplot data (x, y) again.]


There are four broad categories of smoothers:

1 series or regression smoothers (polynomials, Fourier regression, regression splines, filtering),

2 kernel smoothers (N-W, locally weighted averages, local regression, loess),

3 smoothing splines (roughness penalties),

4 near-neighbour smoothers (running means, medians, Tukey smoothers).

We will look at kernel smoothers and splines.


Scatterplot data $(y_i, x_i)$, $i = 1, \ldots, n$.

The classical smoothing problem is

$$Y_i = f(X_i) + \varepsilon_i \qquad (17)$$

where $f$ is a smooth function estimated from the data, $E(\varepsilon_i) = 0$, and $\mathrm{Var}(\varepsilon_i) = \sigma^2$, independently.

We let $\mathrm{Var}(\varepsilon_i) \propto w_i^{-1}$ (known), written $\mathrm{Var}(\varepsilon) = W^{-1}$, where $W = \mathrm{diag}(w_1, \ldots, w_n) = \Sigma^{-1}$.

WLOG assume the data are ordered so that $x_1 < x_2 < \cdots < x_n$.


Kernel Smoothers I: Nadaraya-Watson Estimator

Kernel regression estimators are well known, easily understood and mathematically tractable.

The Nadaraya-Watson (N-W) estimator estimates $f(x)$ by

$$\hat f_{nw}(x) = \frac{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) y_i}{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)} = \frac{\sum_{i=1}^{n} K_h(x - x_i)\, y_i}{\sum_{i=1}^{n} K_h(x - x_i)} \qquad (18)$$

where

$$K_h(u) = h^{-1} K\!\left(\frac{u}{h}\right). \qquad (19)$$
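In R, a kernel regression smooth of this type is available through ksmooth(); a small simulated example (the bandwidth values are arbitrary choices for this sketch):

# Kernel regression smooths with a Gaussian kernel and two bandwidths
set.seed(4)
x <- sort(runif(100))
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
plot(x, y, col = "blue")
lines(ksmooth(x, y, kernel = "normal", bandwidth = 0.1), col = "red", lwd = 2)
lines(ksmooth(x, y, kernel = "normal", bandwidth = 0.5), col = "green", lwd = 2)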


Kernel Smoothers II: Nadaraya-Watson Estimator

$K$, a symmetric unimodal function about 0 which integrates to unity, creates the local averaging of values of $y_i$ whose corresponding values of $x_i$ are close to the point of estimation $x$.

The amount of smoothing is controlled by the bandwidth $h$. Some popular kernel functions are given in the table below.

As $h$ decreases, the bias decreases and the variance increases.

In practice, the choice of bandwidth $h$ is more crucial than the choice of kernel function.


Kernel Smoothers III: Nadaraya-Watson Estimator

[Figure: a kernel estimate of the regression function for the simulated data.]


Kernel Smoothers IV: Nadaraya-Watson Estimator

Table: Popular kernel functions. Nb. the quartic is also known as the biweight.

Kernel         $K(u)$
Uniform        $\frac{1}{2}\, I(|u| \le 1)$
Triangle       $(1 - |u|)\, I(|u| \le 1)$
Epanechnikov   $\frac{3}{4}(1 - u^2)\, I(|u| \le 1)$
Quartic        $\frac{15}{16}(1 - u^2)^2\, I(|u| \le 1)$
Tricube        $\frac{70}{81}(1 - |u|^3)^3\, I(|u| \le 1)$
Triweight      $\frac{35}{32}(1 - u^2)^3\, I(|u| \le 1)$
Gaussian       $\exp(-\frac{1}{2}u^2)/\sqrt{2\pi}$
Cosinus        $\frac{\pi}{4}\cos(\pi u/2)\, I(|u| \le 1)$


Local Regression I

Local polynomial regression is theoretically elegant; also called local polynomial kernel estimation, it has favourable asymptotic properties and boundary behaviour.

Idea: estimate $f(x_0)$ by "locally" fitting an $r$th degree polynomial to the data via weighted least squares (WLS).

Example: a local linear kernel estimate for data generated from

$$y_i = 2\exp\!\left(-x_i^2/0.3^2\right) + 3\exp\!\left(-(x_i - 1)^2/0.7^2\right) + \varepsilon_i, \qquad x_i = (i-1)/n,$$

with $\varepsilon_i \sim N(0, \sigma = 0.115)$ independently.


Local Regression II

Figure: Local linear kernel estimate (solid red) of the regression function f given in the text, based on 100 simulated observations (crosses). The solid black curve is the true function. The red dashed curves are the kernel weights.


We now derive an explicit expression for the local polynomial kernel estimator.

Let $r$ be the degree of the polynomial being fitted. At a point $x$, the estimator $\hat f(x; r, h)$ is obtained by fitting the polynomial $\beta_0 + \beta_1(\cdot - x) + \cdots + \beta_r(\cdot - x)^r$ to the $(x_i, y_i)$ using WLS with kernel weights $K_h(x_i - x)$.

The value of $\hat f(x; r, h)$ is the height of the fit, $\hat\beta_0$, where $\hat\beta = (\hat\beta_0, \ldots, \hat\beta_r)^T$ minimizes

$$\sum_{i=1}^{n} \left\{ y_i - \beta_0 - \beta_1(x_i - x) - \cdots - \beta_r(x_i - x)^r \right\}^2 K_h(x_i - x). \qquad (20)$$

The solution is

$$\hat\beta = (X_x^T W_x X_x)^{-1} X_x^T W_x\, y \qquad (21)$$

where $y = (y_1, \ldots, y_n)^T$,

$$X_x = \begin{pmatrix} 1 & (x_1 - x) & \cdots & (x_1 - x)^r \\ \vdots & \vdots & & \vdots \\ 1 & (x_n - x) & \cdots & (x_n - x)^r \end{pmatrix}$$

is $n \times (r+1)$, and $W_x = \mathrm{diag}(K_h(x_1 - x), \ldots, K_h(x_n - x))$. Since the estimator of $f(x)$ is the intercept, we have

$$\hat f(x; r, h) = e_1^T (X_x^T W_x X_x)^{-1} X_x^T W_x\, y. \qquad (22)$$
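A minimal sketch of (20)-(22): the local linear estimate at a single point $x_0$, obtained with a weighted lm() fit in $(x - x_0)$ using a Gaussian kernel. All data values and tuning constants are simulated/arbitrary.

# Local linear (r = 1) estimate of f at x0 via WLS with kernel weights K_h(x_i - x0)
set.seed(5)
x  <- runif(200)
y  <- cos(3 * x) + rnorm(200, sd = 0.2)
x0 <- 0.5
h  <- 0.1
kwts <- dnorm((x - x0) / h) / h               # Gaussian K_h
fit0 <- lm(y ~ I(x - x0), weights = kwts)     # degree r = 1 polynomial in (x - x0)
coef(fit0)[1]                                 # the intercept is f-hat(x0; 1, h), cf. (22)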


Simple explicit formulae exist for the N-W estimator ($r = 0$):

$$\hat f(x; 0, h) = \frac{\sum_{i=1}^{n} K_h(x_i - x)\, y_i}{\sum_{i=1}^{n} K_h(x_i - x)} \qquad (23)$$

and the local linear estimator ($r = 1$):

$$\hat f(x; 1, h) = n^{-1} \sum_{i=1}^{n} \frac{\{ s_2(x; h) - s_1(x; h)(x_i - x) \}\, K_h(x_i - x)\, y_i}{s_2(x; h)\, s_0(x; h) - s_1(x; h)^2} \qquad (24)$$

where

$$s_r(x; h) = n^{-1} \sum_{i=1}^{n} (x_i - x)^r\, K_h(x_i - x). \qquad (25)$$


Derivative Estimation I

Uses include the study of human growth curves, where the first two derivatives of height as a function of age ("speed" and "acceleration" of growth) have important biological significance.

The extension of local polynomial ideas to estimate the $\nu$th derivative is straightforward. One can estimate $f^{(\nu)}(x)$ via the intercept coefficient of the $\nu$th derivative of the local polynomial being fitted at $x$, assuming $\nu \le r$. In general,

$$\hat f^{(\nu)}(x; r, h) = \nu!\; e_{\nu+1}^T (X_x^T W_x X_x)^{-1} X_x^T W_x\, y, \qquad \nu = 0, \ldots, r, \qquad (26)$$

from (22). Note that $\hat f^{(\nu)}(x; r, h)$ is not in general equal to the $\nu$th derivative of $\hat f(x; r, h)$.


Derivative Estimation II

Choosing r

In the early 1990's Fan and co-workers showed that, for estimating $f^{(\nu)}(x)$, there is no increase in variability when passing from an even (i.e., $r - \nu$ even) $r = \nu + 2q$ order fit to an odd $r = \nu + 2q + 1$ order fit, but when passing from an odd $r = \nu + 2q + 1$ order fit to the consecutive even $r = \nu + 2q + 2$ order there is a price to be paid in terms of increased variability. Therefore, even order fits $r = \nu + 2q$ are not recommended. Fan and Gijbels (1996) recommend using the lowest odd order, i.e., $r = \nu + 1$, or occasionally $r = \nu + 3$.

For $f$ choose $r = 1$ (maybe 3) . . .

For $f'$ choose $r = 2$ (maybe 4) . . .


Lowess and Loess I

A popular method based on local regression is lowess (Cleveland, 1979) and loess (Cleveland and Devlin, 1988).

Lowess = locally weighted scatterplot smoother; it robustifies the locally-weighted LS method above.

The basic idea is to fit a polynomial of degree r locally via (20) and obtain the fitted values. Then calculate the residuals and assign weights to each residual: large/small residuals receive small/large weights respectively. Then perform another local polynomial fit of order r with weights given by the product of the initial weight and the new weight. Thus observations showing large residuals at the initial fit are downweighted in the second fit. The above process is repeated a number of times.

Cleveland (1979) recommended r = 1 and 3 iterations (the default).


Lowess and Loess II

> par(mfrow = c(2, 2), mar = c(5, 4, 2, 1) + 0.1)

> set.seed(761)

> x <- sort(rnorm(100))

> eps <- rnorm(100, 0, 0.1)

> y <- sin(x) + eps

> plot(x, y, col = "blue", pch = 4)

> title("Default: lowess(x, y)", cex = 0.5)

> lo1 <- lowess(x, y)

> lines(lo1, lty = 1, col = "red")

> plot(x, y, col = "blue", pch = 4)

> title("lowess(x, y, f=0.5)", cex = 0.5)

> lo2 <- lowess(x, y, f = 0.5)

> lines(lo2, lty = 1, col = "red")

> plot(x, y, col = "blue", pch = 4)

> title("lowess(x, y, f=0.2)", cex = 0.5)

> lo3 <- lowess(x, y, f = 0.2)

> lines(lo3, lty = 1, col = "red")

> plot(x, y, col = "blue", pch = 4)

> title("Default: loess(y ~ x)", cex = 0.5)


Lowess and Loess III

> lo4 <- loess(y ~ x)

> lines(x, fitted(lo4), lty = 1, col = "red")

[Figure: four panels of the simulated sin(x) data with fitted curves, titled "Default: lowess(x, y)", "lowess(x, y, f=0.5)", "lowess(x, y, f=0.2)" and "Default: loess(y ~ x)".]


Lowess and Loess IV

Once again choosing a good bandwidth is crucial.


Local Likelihood I

Local likelihood replaces the local least squares criterion by an appropriate local log-likelihood criterion.

Example: for binary data $(x_i, y_i)$, $i = 1, \ldots, n$, $y_i = 0$ or $1$, the local log-likelihood is

$$\sum_{i=1}^{n} K\!\left( \frac{|x_i - x|}{h} \right) \left\{ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right\} \qquad (27)$$

where $p_i = p(x_i) = P(Y = 1 \mid x_i)$.

We could model $p(x)$ directly using local polynomials; however, it is usually preferable to use $\theta(x) = \mathrm{logit}\, p(x)$. We approximate $\theta(x)$ locally by a polynomial, then choose the polynomial coefficients to maximize the likelihood.


Local Likelihood II

Local likelihood can also be applied to other regression models and to density estimation.

Local likelihood was developed by Tibshirani (1984). A good book on the topic is Loader (1999).


Regression Splines I

Idea: fit a higher-degree polynomial (polynomial regression).

Some drawbacks:

Polynomials aren't very local but have a global nature, so they are usually not satisfactory at the boundaries, especially if the degree of the polynomial is high [cf. the Stone-Weierstrass Theorem];

Individual observations can have a large influence on remote parts of the curve;

The polynomial degree cannot be controlled continuously.

Polynomial regression can be fitted using the poly() function, e.g.,

> fit <- lm(y ~ poly(x, 5))

fits a 5th degree polynomial.


Regression Splines II

Regression splines use a piecewise polynomial. The regions are separated by knots (or breakpoints). The positions where each pair of segments join are called joints. The more knots, the more flexible the family of curves becomes.

It is customary to force the piecewise polynomials to join smoothly at these knots. A popular choice is piecewise cubic polynomials with continuous 0th, 1st and 2nd derivatives, called cubic splines. Using splines of degree > 3 seldom yields any advantage.

Given a set of knots, the smooth is computed by multiple regression on a set of basis vectors.


Regression Splines III

Here’s a regression spline.

> pos <- function(x) ifelse(x > 0, x, 0)

> x <- 1:7

> y <- c(8, 3, 8, 5, 9, 14, 11)

> knot <- 4

> plot(x, y, col = "blue")

> X <- cbind(1, x, x^2, x^3, pos(x - knot)^3)

> fit <- lm(y ~ X - 1)

> xx <- seq(1, 7, length = 200)

> XX <- cbind(1, xx, xx^2, xx^3, pos(xx - knot)^3)

> lines(xx, XX %*% coef(fit))

> abline(v = knot, lty = "dashed", col = "purple")

> X


Regression Splines IV

x

[1,] 1 1 1 1 0

[2,] 1 2 4 8 0

[3,] 1 3 9 27 0

[4,] 1 4 16 64 0

[5,] 1 5 25 125 1

[6,] 1 6 36 216 8

[7,] 1 7 49 343 27


Regression Splines V

[Figure: the data and the fitted cubic regression spline, with the knot at x = 4 marked by a dashed vertical line.]


Regression Splines VI

Definitions: A function $f \in C^k[a, b]$ if the derivatives $f', f'', \ldots, f^{(k)}$ all exist and are continuous in $[a, b]$; e.g., $|x| \notin C^1[a, b]$ (when 0 is interior to $[a, b]$).

Notes:

1 $f \in C^k[a, b] \implies f \in C^{k-1}[a, b]$.

2 $C[a, b] \equiv C^0[a, b] = \{ f(t) : f(t) \text{ continuous and real valued}, a \le t \le b \}$.

There are at least two bases for cubic splines:

1 the truncated power series, which is easier to understand but is not used in practice,

2 B-splines, which are harder to understand but are used in practice.


Regression Splines VII

Advantages of regression splines:

computationally and statistically simple,

standard parametric inferences are available. For example, whether a knot can be removed and the same polynomial equation used to explain two adjacent segments can be tested by $H_0: \theta_j = 0$, which is one of the t-test statistics always printed by a regression program.

Disadvantages of regression splines:

difficult to choose the number of knots,

difficult to choose the positions of the knots,

the smoothness of the estimate cannot be varied continuously as a function of a single smoothing parameter.


Regression Splines VIII

Here is a more formal definition of a spline.

In mathematics, a 'spline' denotes a function $s(x)$ which is essentially a piecewise polynomial over an interval $(a, b)$, such that a certain number of its derivatives are continuous for all points in $(a, b)$. More precisely, $s(x)$ is a spline of degree $r$ (some given positive integer) with knots $\xi_1, \ldots, \xi_K$ (such that $a < \xi_1 < \xi_2 < \cdots < \xi_K < b$) if it satisfies the following properties:

on any subinterval $(\xi_j, \xi_{j+1})$, $s(x)$ is a polynomial of degree $r$ (order $r + 1$);

the derivatives $s'(x), \ldots, s^{(r-1)}(x)$ are continuous, i.e., $s \in C^{r-1}(a, b)$;

the $r$th derivative of $s(x)$ is a step function with jumps at $\xi_1, \ldots, \xi_K$.

Often $r$ is chosen to be 3, and the term cubic spline is then used for the associated curve.


Regression Splines IX

Wold (1974), in a paper reflecting a lot of experience fitting regression splines, made the following recommendations when using cubic splines:

1 Knot points should be located at data points,

2 Have as few knots as possible, ensuring that a minimum of 4 or 5 observations fall between knot points,

3 No more than one extremum point and one inflexion point should fall between knots (because a cubic cannot approximate more variation than that),

4 Extrema should be centred in intervals and inflexion points should be located near knot points.


B-Splines I

B-splines form a numerically stable basis for splines. It is convenient to consider splines of a general order, $M$ say.

1 $M = 4$: cubic spline.

2 $M = 3$: quadratic spline, which has continuous derivatives up to order $M - 2 = 1$ at the knots—this is aka a parabolic spline.

3 $M = 2$: linear spline, which has continuous derivatives up to order $M - 2 = 0$ at the knots—i.e., the function is continuous.

Let $\xi_0\ (< \xi_1)$ and $\xi_{K+1}\ (> \xi_K)$ be 2 boundary knots. Define the augmented knot sequence $\{\tau\}$ such that

$\tau_1 \le \tau_2 \le \cdots \le \tau_M \le \xi_0$;

$\tau_{j+M} = \xi_j$, $j = 1, \ldots, K$;

$\xi_{K+1} \le \tau_{K+M+1} \le \cdots \le \tau_{K+2M}$.


B-Splines II

The actual values of these additional knots beyond the boundary are arbitrary, and it is customary to make them all the same and equal to $\xi_0$ and $\xi_{K+1}$ respectively.

Denote by $B_{i,m}(x)$ the $i$th B-spline basis function of order $m$ for the knot sequence $\{\tau\}$, $m \le M$. They are defined recursively as follows.

For $i = 1, \ldots, K + 2M - 1$,

$$B_{i,1}(x) = \begin{cases} 1, & \tau_i \le x < \tau_{i+1}, \\ 0, & \text{otherwise;} \end{cases} \qquad (28)$$


B-Splines III

Then for $i = 1, \ldots, K + 2M - m$,

$$B_{i,m}(x) = \frac{x - \tau_i}{\tau_{i+m-1} - \tau_i}\, B_{i,m-1}(x) + \frac{\tau_{i+m} - x}{\tau_{i+m} - \tau_{i+1}}\, B_{i+1,m-1}(x) \qquad (29)$$

(de Boor, 1978), who gave stable and efficient recursive algorithms for computing them.

Thus with $m = 4$, the $B_{i,4}$, $i = 1, \ldots, K + 4$, are the $K + 4$ cubic B-spline basis functions for the knot sequence $\{\xi\}$. This recursion can be continued and will generate the B-spline basis for any order spline.
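The recursion (29) is what splines::splineDesign() evaluates; a small sketch (the knot values are arbitrary) showing the $K + M$ basis columns of a cubic ($M = 4$) spline when each boundary knot is repeated $M$ times:

# Cubic B-spline basis (order M = 4) with K = 3 interior knots at 1, 2, 3
library(splines)
aug.knots <- c(0, 0, 0, 0, 1, 2, 3, 4, 4, 4, 4)   # boundary knots 0 and 4, each repeated M times
x <- seq(0, 4, length = 401)
B <- splineDesign(aug.knots, x, ord = 4)          # K + M = 7 basis functions
dim(B)                                            # 401 x 7
matplot(x, B, type = "l", lty = 1)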


B-Splines IV

> knots <- c(1:3, 5, 7, 8, 10)

> atx <- seq(0, 11, by = 0.01)

> mycol = (1:(22 + 1))[-7]

> for (ord in 2:5) {

+ B <- bs(x = atx, degree = ord - 1, knots = knots,

+ intercept = TRUE)

+ matplot(atx, B[, 1], type = "l", ylim = 0:1,

+ lty = 2, ylab = "", xlab = "")

+ matlines(atx, B[, -1], col = mycol, lty = 1)

+ title(paste("B-splines of order", ord))

+ abline(v = knots, lty = 2, col = "purple")

+ }

> attr(B, "degree")

[1] 4

> attr(B, "knots")

[1] 1 2 3 5 7 8 10


B-Splines V

> attr(B, "Boundary.knots")

[1] 0 11

> attr(B, "intercept")

[1] TRUE

> attr(B, "class")

[1] "bs" "basis"


B-Splines VI

[Figure: four panels showing the B-spline basis functions of orders 2, 3, 4 and 5 for the given knots, with the knot locations marked by dashed vertical lines.]


B-Splines VII

In general, bs() adds ord boundary knots to each end, where the boundary knot values are min(xi) and max(xi). If intercept = FALSE then the left-most function/column is omitted.


B-Splines VIII

To illustrate that linear combinations of the B-spline basis functions do accommodate smooth curves,

> matplot(atx, 2 * B[, 3] - 5 * B[, 4] + 3 * B[,

+ 7], type = "l", lwd = 2, col = "blue")

[Figure: the linear combination 2 * B[, 3] - 5 * B[, 4] + 3 * B[, 7] plotted against atx.]


B-Splines IX

Here are some additional notes:

1 > args(bs)

function (x, df = NULL, knots = NULL, degree = 3, intercept = FALSE,
    Boundary.knots = range(x))
NULL

In fact, in Value:, df should be length(knots) + degree + intercept.

2 Safe prediction is not as good as smart prediction, e.g., for terms such as I(bs(x)) and poly(scale(x), 2).


B-Splines X

3 B-splines are actually defined by means of divided differences. $B_{i,m}$, which is based on knots $\tau_i, \ldots, \tau_{i+m}$, is defined as

$$B_{i,m}(x) = (\tau_{i+m} - \tau_i) \sum_{j=i}^{i+m} \frac{(x - \tau_j)_{+}^{m-1}}{\prod_{s=i,\ s \ne j}^{i+m} (\tau_j - \tau_s)}. \qquad (30)$$

Equation (29) follows from this.


B-Splines XI

As an illustration,

> library(splines)

> n <- 50

> set.seed(760)

> knots <- 1:5

> x <- seq(0, 2 * pi, length = n)

> y <- sin(x) + rnorm(n, sd = 0.5)

> plot(x, y, col = "blue", pch = 4)

> fit <- lm(y ~ bs(x, knots = knots))

> abline(v = knots, lty = "dashed", col = "purple")

> lines(x, sin(x), col = "black")

> aknots = c(-Inf, knots, Inf)

> for (ii in 2:length(aknots)) {

+ newx = seq(max(aknots[ii - 1], min(x)), min(aknots[ii],

+ max(x)), len = 200)

+ lines(newx, predict(fit, data.frame(x = newx)),

+ col = ii - 1, lwd = 2)

+ }


B-Splines XII

Overall, the fit is OK, but it could be improved by decreasing the number of knots and heeding the recommendations of Wold (1974).

[Figure: the simulated sin(x) data with the B-spline regression fit drawn segment by segment; knots are marked by dashed vertical lines and the true function is the black curve.]


B-Splines XIII

Knots with varying multiplicities have an effect illustrated by the following.

[Figure: four panels, "Multiplicity 1" to "Multiplicity 4", showing spline fits to the same data with a knot of multiplicity 1, 2, 3 and 4.]


Natural Splines I

A cubic spline on $[a, b]$ is a natural cubic spline (NCS) if its 2nd and 3rd derivatives are 0 at $a$ and $b$ (natural boundary conditions).

Natural splines, a restricted form of B-splines, have been implemented in the function ns(). Given knots $\xi_1, \ldots, \xi_K$, ns() is linear on $(-\infty, \xi_0]$ and $[\xi_{K+1}, \infty)$, where $\xi_0$ and $\xi_{K+1}$ are two extra knots.

ns() chooses these to be the minimum and maximum of the $x_i$ respectively. The result is $K + 2$ parameters.


Natural Splines II

Here’s an example.

> set.seed(21)

> nn = 20

> x = seq(0, 1, len = nn)

> y = runif(nn)

> myknots = c(0.3, 0.7)

> plot(x, y, xlim = c(-0.5, 1.5), col = "blue")

> fit = lm(y ~ ns(x, knot = myknots))

> newx = seq(-0.5, 2.5, len = 100)

> lines(newx, predict(fit, data.frame(x = newx)),

+ col = "blue")

> abline(v = c(range(x), myknots), col = "purple",

+ lty = "dashed")

> coef(fit)


Natural Splines III

(Intercept) ns(x, knot = myknots)1

0.4918062 -0.2425835

ns(x, knot = myknots)2 ns(x, knot = myknots)3

0.1631501 -0.1865069


Natural Splines IV

[Figure: the simulated data and the natural cubic spline fit, extrapolated linearly beyond the boundary knots; dashed vertical lines mark the boundary and interior knots.]


Smoothing splines I

Cubic smoothing splines minimize

$$S(f) = \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 + \lambda \int_a^b \{ f''(x) \}^2\, dx, \qquad (31)$$

over a Sobolev space of order 2. Here, $a < x_1 < \cdots < x_n < b$ for some $a$ and $b$, and $\lambda \ge 0$.

The terms of $S(f)$:

1 the first penalizes lack of fit;

2 the second penalizes wiggliness.

These two conflicting quantities are weighted by the non-negative smoothing parameter $\lambda$.


Smoothing splines II

Larger values of $\lambda$ produce smoother curves.

As $\lambda \to \infty$, $f''(x) \to 0$ and the solution is the least squares line.

As $\lambda \to 0$, the solution tends to an interpolating twice-differentiable function.

(31) fits into the "penalty function" approach (Green and Silverman, 1994). Penalized least squares minimizes

$$(y - f)^T \Sigma^{-1} (y - f) + f^T K f.$$

Solution:

$$\hat f = A(\lambda)\, y$$

where $A(\lambda) = (I_n + \Sigma K)^{-1}$ is the influence or smoother matrix.


Some notes I

Here are some notes:

1 The smoothing parameter $\lambda$ can be regarded as the 'turning knob' which controls the tradeoff between fidelity to the data and smoothness. One can select $\lambda$ by trial and error.

2 The penalty term can be justified by physics: the bending energy of a thin rod is proportional to $\int_a^b \text{curvature}^2$, which is approximately proportional to $\int_a^b f''(t)^2\, dt$ (cf. Hooke's Law).

3 Importantly, Reinsch (1967) showed, using the calculus of variations, that the solution of (31) is a cubic spline with knots at the unique values of the $x_i$. It can be shown that minimizing (31) is equivalent to minimizing

$$\int_a^b \{ f''(x) \}^2\, dx \quad \text{subject to} \quad \sum_{i=1}^{n} \{ y_i - f(x_i) \}^2 \le \sigma.$$

4 As $n \to \infty$, $\lambda$ should become smaller.


Some notes II

5 There are alternative regularizations, e.g.,

$$\int_a^b f'(x)^2\, dx \qquad (32)$$

whose solution is a linear spline. In general,

$$\int_a^b f^{(\nu)}(x)^2\, dx$$

produces a spline of degree $2\nu - 1$. Note we never get an even-degree spline—not unless fractional derivatives are used.

6 $S_2[a, b]$ is actually a Sobolev space of order 2. In general, a Sobolev space of order $m$ is $W_2^m[a, b] = \{ f : f^{(j)},\ j = 0, \ldots, m-1, \text{ absolutely continuous on } [a, b],\ f^{(m)} \in L_2[a, b] \}$, i.e., $\int_a^b \{ f^{(m)}(t) \}^2\, dt < \infty$.


Some notes III

How can we compute a cubic smoothing spline? There are several ways:

1 Direct method. Not recommended ($O(n^3)$).

2 State-space approach ($O(n)$).

3 B-splines—this is a numerically stable method ($O(n)$).

4 Reinsch algorithm ($O(n)$).


> args(smooth.spline)

function (x, y = NULL, w = NULL, df, spar = NULL, cv = FALSE,

all.knots = FALSE, nknots = NULL, keep.data = TRUE, df.offset = 0,

penalty = 1, control.spar = list())

NULL

For basic use, use the argument df. Here's an example.

> data(cars)

> with(cars, plot(speed, dist, main = "data(cars) & smoothing splines"))

> cars.spl <- with(cars, smooth.spline(speed, dist))

> cars.spl

Call:

smooth.spline(x = speed, y = dist)

Smoothing Parameter spar= 0.7801305 lambda= 0.1112206 (11 iterations)

Equivalent Degrees of Freedom (Df): 2.635278

Penalized Criterion: 4187.776

GCV: 244.1044


This example has duplicate points, so avoid cv=TRUE.

> lines(cars.spl, col = "blue")

> with(cars, lines(smooth.spline(speed, dist, df = 10),

+ lty = 2, col = "red", lwd = 2))

> with(cars.spl, legend(5, 120, c(paste("default [C.V.] => df =",

+ round(df, 1)), "s( * , df = 10)"), col = c("blue",

+ "red"), lty = 1:2, lwd = 1:2, bg = "bisque"))

[Figure: "data(cars) & smoothing splines": dist versus speed with the default (GCV-chosen) smoothing spline, df = 2.6 (solid blue), and a df = 10 smoothing spline (dashed red); legend: "default [C.V.] => df = 2.6", "s( * , df = 10)".]


Smoothing—Some General Theory I

In scatterplot smoothing, there is a fundamental trade-off between the bias and variance of the estimate, and this phenomenon is governed by the smoothing parameter.

An optimal choice of span would trade the bias against the variance. One such criterion is the mean square error (MSE):

$$E\left[ \left( \hat f_k(x_i) - f(x_i) \right)^2 \right] = \mathrm{Var}\!\left( \hat f_k(x_i) \right) + \left( E \hat f_k(x_i) - f(x_i) \right)^2.$$


Linear Smoothers I

A smoother is linear if

$$S(a\, y_1 + b\, y_2 \mid x) = a\, S(y_1 \mid x) + b\, S(y_2 \mid x) \qquad (33)$$

for any constants $a$ and $b$. That is,

$$\hat y = S y \qquad (34)$$

where $S$ does not depend on $y$.

$S$ is referred to as the influence (or smoother) matrix.

Examples: the bin, running-mean, running-line, regression spline, cubic spline, kernel and local polynomial kernel smoothers are all linear smoothers (with a fixed smoothing parameter).


Linear Smoothers II

The theory for linear smoothers is much simpler than for nonlinear smoothers. Many properties of smoothers are seen through the eigenvalues and eigenvectors of $S$. For example, for a cubic spline, $S(\lambda)$ has all eigenvalues in $(0, 1]$, with exactly two unit eigenvalues whose corresponding eigenvectors are $\mathbf{1}$ and $x$. That is,

$$S\,\mathbf{1} = \mathbf{1} \quad \text{and} \quad S\,x = x.$$

These correspond to constant and linear functions.


Linear Smoothers III

Figure: Eigenvalues of a cubic smoothing spline.


Degrees of Freedom I

All smoothers allow the user to vary the amount of smoothing done via the smoothing parameter, e.g., the bandwidth, the span, or $\lambda$.

However, it would be useful to have some measure of the amount of smoothing done. One such measure is the effective degrees of freedom (EDF) of a smooth. It is useful for a number of reasons, e.g., comparing different types of smoothers while keeping the amount of smoothing roughly equal.

The theory of EDF is a natural extension of standard results for the general linear model

$$Y = X\beta + \varepsilon, \qquad \mathrm{Var}(\varepsilon) = \sigma^2 I.$$

Recall that, if $\beta$ is $p \times 1$,


Degrees of Freedom II

1 $\hat Y = P y$ where $P = X(X^T X)^{-1} X^T$ is idempotent of rank $p$. Then $\mathrm{trace}(P) = p$,

2 $\mathrm{trace}(\mathrm{Var}(\hat Y)) = \sigma^2\, \mathrm{trace}(P P^T) = \sigma^2 p$,

3 $E[(n - p)\, S^2] = E[\mathrm{ResSS}] = E[(Y - \hat Y)^T (Y - \hat Y)] = \sigma^2 (n - p)$.

By replacing $P$ by $S$, these results suggest the following three definitions for the effective degrees of freedom of a smooth (a small numerical sketch follows below):

1 $df = \mathrm{trace}(S)$,

2 $df_{var} = \mathrm{trace}(S S^T)$, and

3 $df_{err} = n - \mathrm{trace}(2S - S S^T)$.

More generally, with weights $W$, these are

1 $df = \mathrm{trace}(S)$,

2 $df_{var} = \mathrm{trace}(W S W^{-1} S^T)$, and

3 $df_{err} = n - \mathrm{trace}(2S - S^T W S W^{-1})$.
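A brute-force sketch (feasible only for small n) of df = trace(S): because a smoothing spline with fixed df is a linear smoother, its smoother matrix S can be recovered column by column by smoothing the unit vectors. All values here are simulated.

# Recover S for a smoothing spline with df = 6, then check trace(S) (approximately 6)
set.seed(6)
n <- 50
x <- sort(runif(n))
S <- matrix(0, n, n)
for (i in 1:n) {
  e <- rep(0, n); e[i] <- 1
  S[, i] <- predict(smooth.spline(x, e, df = 6), x)$y   # i-th column of S
}
c(df = sum(diag(S)), df.var = sum(diag(S %*% t(S))))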


Degrees of Freedom III

It can be shown that if $S$ is a symmetric projection matrix then $\mathrm{trace}(S)$, $\mathrm{trace}(2S - S S^T)$ and $\mathrm{trace}(S S^T)$ coincide. For cubic splines, it can be shown that

$$\mathrm{trace}(S S^T) \le \mathrm{trace}(S) \le \mathrm{trace}(2S - S S^T)$$

and that all three of these functions are decreasing in $\lambda$.

Notes:

1 df is the most popular and easiest to compute. The cost of computing df is O(n) for most smoothers.


Degrees of Freedom IV

2 We have 2 ≤ effective degrees of freedom ≤ number of distinct $x_i$. A linear fit corresponds to 2; the number of distinct $x_i$ corresponds to an interpolant. As the degrees of freedom increases the fit becomes more wiggly. A smooth with 3 degrees of freedom has approximately the same flexibility as a quadratic. A value of 4 or 5 degrees of freedom is often used as the default value in software, as this can accommodate a reasonable amount of nonlinearity without being excessive.


Standard Errors I

$$\hat f = S y \implies \mathrm{Var}(\hat f) = \sigma^2\, S S^T \qquad (35)$$

One can form pointwise SE bands for $\hat f$ (useful in preventing the over-interpretation of a plot of the estimated function).

But (35) is impractical if $n$ is large (all of $S$ is needed).

Trick: for cubic splines, Silverman (1985) uses a Bayesian derivation to discuss the use of the alternative

$$\sigma^2\, S.$$

The cost is $O(n)$.


Equivalent Kernels I

Consider $\hat y = S y$ for a linear smoother. Plotting the $j$th row of $S$ versus the $x_i$ gives the weights used for the estimate $\hat y_j$. This mimics the kernel function of a kernel smoother.

The equivalent kernel (EK) for a cubic spline is

$$\kappa(u) = \frac{1}{2} \exp\!\left( -\frac{|u|}{\sqrt{2}} \right) \sin\!\left( \frac{|u|}{\sqrt{2}} + \frac{\pi}{4} \right)$$

as $n \to \infty$ (Silverman, 1984).


Equivalent Kernels II

[Figure: the equivalent kernel κ(u) for a cubic smoothing spline, plotted against u.]

If the design points $x_i$ have a local density $g(x)$, and if $x$ is not too near the boundary and $\lambda$ is not too big or too small, then the local bandwidth $h(x)$ satisfies

$$h(x) = \{ \lambda\, n\, g(x) \}^{-1/4}.$$


Automatic Smoothing Parameter Selection I

Choosing the bandwidth/smoothing parameter is the most important decision for a specified method. We want an automatic way of choosing the 'right' smoothing parameter.

A popular method is cross-validation (CV); we restrict attention to linear smoothers.

CV idea: leave the point $(x_i, y_i)$ out one at a time and estimate the smooth at $x_i$ based on the remaining $n - 1$ points.

Choose $\lambda_{CV}$ to minimize the cross-validation sum of squares

$$CV(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \left\{ y_i - \hat f_\lambda^{-i}(x_i) \right\}^2 \qquad (36)$$


Automatic Smoothing Parameter Selection II

where $\hat f_\lambda^{-i}(x_i)$ is the fitted value at $x_i$, computed by leaving out the $i$th data point.

One can compute (36) naïvely. But there is a trick.

Define $\hat f_\lambda^{-i}(x_i)$ to be the fit obtained by setting the weight of the $i$th observation to zero and increasing the remaining weights so that they sum to unity, i.e.,

$$\hat f_\lambda^{-i}(x_i) = \sum_{j=1,\ j \ne i}^{n} \frac{s_{ij}}{1 - s_{ii}}\, y_j. \qquad (37)$$

This means

$$\hat f_\lambda^{-i}(x_i) = \sum_{j=1,\ j \ne i}^{n} s_{ij}\, y_j + s_{ii}\, \hat f_\lambda^{-i}(x_i) \qquad (38)$$


Automatic Smoothing Parameter Selection III

and

$$y_i - \hat f_\lambda^{-i}(x_i) = \frac{y_i - \hat f_\lambda(x_i)}{1 - s_{ii}}. \qquad (39)$$

Thus, $CV(\lambda)$ can be written

$$CV(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \left\{ \frac{y_i - \hat f_\lambda(x_i)}{1 - s_{ii}(\lambda)} \right\}^2. \qquad (40)$$

So there is no need to compute $\hat f_\lambda^{-i}(x_i)$ naïvely.

In practice, CV sometimes gives questionable performance.
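A small check (using lm(), for which the leave-one-out identity is exact) that the shortcut (40) based on the diagonal elements $s_{ii}$ reproduces brute-force leave-one-out CV; the data are simulated.

# Leave-one-out CV via the 1/(1 - s_ii) shortcut versus explicit refitting
set.seed(7)
x <- runif(60)
y <- 1 + 2 * x + rnorm(60)
fit <- lm(y ~ x)
sii <- hatvalues(fit)                                  # diagonal of the hat (smoother) matrix
cv.short <- mean(((y - fitted(fit)) / (1 - sii))^2)    # equation (40)
cv.brute <- mean(sapply(1:60, function(i) {
  fi <- lm(y ~ x, subset = -i)
  (y[i] - predict(fi, data.frame(x = x[i])))^2
}))
c(shortcut = cv.short, brute.force = cv.brute)         # identical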


Generalized Cross-Validation

A variant of $CV(\lambda)$ is generalized cross-validation (GCV). The GCV idea is to replace $s_{ii}$ by its average value, $\mathrm{trace}(S)/n$, which is easier to compute:

$$GCV(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \left\{ \frac{y_i - \hat f_\lambda(x_i)}{1 - \mathrm{trace}(S)/n} \right\}^2.$$

GCV tends to undersmooth.


Testing for nonlinearity

Suppose we wish to compare two smooths, $\hat f_1 = S_1 y$ and $\hat f_2 = S_2 y$. For example, the smooth $\hat f_2$ might be rougher than $\hat f_1$, and we wish to test whether it picks up any significant bias. A standard case that often arises is when $\hat f_1$ is linear, in which case we want to test whether the linearity is real. We must assume that $\hat f_2$ is unbiased, and that $\hat f_1$ is unbiased under $H_0$. Letting $\mathrm{ResSS}_j$ be the residual sum of squares for the $j$th smooth and $\gamma_j$ be $\mathrm{trace}(2 S_j - S_j^T S_j)$, then

$$\frac{(\mathrm{ResSS}_1 - \mathrm{ResSS}_2)/(\gamma_2 - \gamma_1)}{\mathrm{ResSS}_2/(n - \gamma_1)} \sim F_{\gamma_2 - \gamma_1,\ n - \gamma_1} \qquad (41)$$

approximately.


The Curse of Dimensionality I

Sometimes multidimensional smoothers can work with a moderate numberof inputs. But the curse of dimensionality hinders them in higherdimensions:

local neighbourhoods are empty, or

nearest-neighbourhoods are not local

all points are close to the boundary

sample sizes need to grow exponentially.


The Curse of Dimensionality II

That is, neighbourhoods with a fixed number of points become less local as the dimension increases. For fixed n, the data become more isolated in d-space and smoothers require a larger neighbourhood to find enough data points in order to calculate the variance of an estimate. Hence the estimate is no longer local and can be severely biased.

The following illustrates the curse of dimensionality. Suppose we have data uniformly distributed in a d-dimensional unit cube. We spread out a subcube from the origin to capture span% of the data. What distance do we have to reach out on each axis? The next figure gives the answer. Most reasonable high-dimensional procedures assume some structure.


The Curse of Dimensionality III

Figure: Distance on each axis of a subcube required to capture span% of the data of a d-dimensional unit cube (curves shown for d = 1, 2, 3 and 10).


Generalized Additive Models (GAMs) I

For general p, the linear model is

$$Y = \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2) \text{ independently.} \qquad (42)$$

This model has some strong assumptions:

1 Linearity, i.e., the effect of each $X_k$ on $E(Y)$ is linear,

2 Normal errors with zero mean, constant variance, and independence,

3 Additivity, i.e., $X_k$ and $X_\ell$ do not interact; they have an additive effect on the response.


Generalized Additive Models (GAMs) II

We relax the linearity assumption.

The linear predictor becomes an additive predictor:

$$\eta(x) = f_1(x_1) + \cdots + f_p(x_p), \qquad (43)$$

a sum of arbitrary smooth functions.

Additivity is still assumed. Easy to interpret.

Identifiability: the $f_k(x_k)$ are centred.

Very useful for exploratory data analysis. Allows the data to "speak for itself".

Some GAM books are Hastie and Tibshirani (1990) and Wood (2006).


Generalized Additive Models (GAMs) I

An additive model is fitted by backfitting. This is an iterative procedure that smooths partial residuals against each covariate in turn: since

$$E(y \mid x) = f_t(x_t) + \sum_{k=1,\ k \ne t}^{p} f_k(x_k),$$

we have

$$f_t(x_t) = E\left[\, y - \sum_{k=1,\ k \ne t}^{p} f_k(x_k)\ \Big|\ X_t \right].$$

Modified backfitting is possible—and is implemented. It decomposes

$$\eta(x) = X\beta + \sum_{k=1}^{p} r_k(x_k),$$

i.e., into linear and nonlinear components.
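A bare-bones backfitting sketch for a two-term additive model, using smoothing splines as the smoother; the data, df values and iteration count are arbitrary (gam() and vgam() do this properly, including centring and convergence checks).

# Backfitting for y = alpha + f1(x1) + f2(x2) + error
set.seed(8)
n  <- 200
x1 <- runif(n); x2 <- runif(n)
y  <- sin(2 * pi * x1) + (x2 - 0.5)^2 + rnorm(n, sd = 0.2)
alpha <- mean(y)
f1 <- f2 <- rep(0, n)
for (iter in 1:20) {
  r1 <- y - alpha - f2                                   # partial residuals for term 1
  f1 <- predict(smooth.spline(x1, r1, df = 5), x1)$y
  f1 <- f1 - mean(f1)                                    # centre for identifiability
  r2 <- y - alpha - f1                                   # partial residuals for term 2
  f2 <- predict(smooth.spline(x2, r2, df = 5), x2)$y
  f2 <- f2 - mean(f2)
}
plot(x1, f1, col = "blue")                               # estimated f1 (roughly a sine curve)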


Generalized Additive Models (GAMs) I

Example 1 Kauri data

Y = presence/absence of a tree species, agaaus, which is Agathis australis, better known as Kauri, NZ's most famous tree. The data are from 392 sites in the Hunua forest near Auckland.

Figure: Big Kauri tree.


Generalized Additive Models (GAMs) I

Figure: The Hunua and Waitakere Ranges.


Generalized Additive Models (GAMs) II

$$\mathrm{logit}\, P[Y_{\mathrm{agaaus}} = 1] = f(\mathrm{altitude})$$

where $f$ is a smooth function determined from the data.

> data(hunua)
> fit.h = vgam(agaaus ~ s(altitude), binomialff, hunua)
> plot(fit.h, se = TRUE, lcol = "blue", scol = "red", llwd = 2, slwd = 2)
> o = with(hunua, order(altitude))
> with(hunua, plot(altitude[o], fitted(fit.h)[o], type = "l", ylim = 0:1,
+     lwd = 2, col = "blue", xlab = "altitude", ylab = "Fitted value"))
> with(hunua, points(altitude, agaaus + (runif(nrow(hunua)) - 0.5)/30,
+     col = "red"))



Generalized Additive Models (GAMs) III

Figure: GAM plots of the Kauri data in the Hunua ranges. Left panel: the fitted smooth s(altitude) with standard error bands. Right panel: the fitted probability against altitude, with the jittered 0/1 responses overlaid.



Generalized Additive Models (GAMs) IV

Example 2 NZ Chinese data

> with(nzc, plot(year, female/(male + female), ylab = "Proportion female",
+     col = "blue", las = 1))
> abline(h = 0.5, lty = "dashed")
> fit.nzc = vgam(cbind(female, male) ~ s(year, df = 3), fam = binomialff,
+     data = nzc)
> with(nzc, lines(year, fitted(fit.nzc), col = "red"))

Figure: Proportion female among NZ Chinese against year (circa 1880–2000), with the fitted smooth from fit.nzc and a dashed reference line at 0.5.



GAMs I

There are some packages in R which fit GAMs. These include

gam, written by Trevor Hastie, is similar to the S-PLUS version;

mgcv, by Simon Wood, has an emphasis on smoothing parameter selection;

gamlss, by Londoners (Rigby and Stasinopoulos); limited.
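For comparison, a hedged sketch of the same Kauri model fitted with mgcv; this assumes the hunua data frame shipped with the VGAM package:

library(mgcv)
data("hunua", package = "VGAM")   # Kauri presence/absence data from VGAM
fit.mgcv <- gam(agaaus ~ s(altitude), family = binomial, data = hunua)
summary(fit.mgcv)                 # smoothness chosen automatically
plot(fit.mgcv, se = TRUE)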


Introduction to VGLMs and VGAMs

Introduction to VGLMs and VGAMs I

The VGAM package implements several large classes of regression models, of which vector generalized linear and additive models (VGLMs/VGAMs) are most commonly used.

The primary key words are

maximum likelihood estimation,

iteratively reweighted least squares (IRLS),

Fisher scoring,

additive models.

Other concepts are

reduced rank regression,

constrained ordination,

vector smoothing.



Introduction to VGLMs and VGAMs II

Basically, VGLMs model each parameter, transformed if necessary, as a linear combination of the explanatory variables. That is,

gj(θj) = ηj = βj^T x (44)

where gj is a parameter link function.

VGAMs extend (44) to

gj(θj) = ηj = ∑_{k=1}^{p} f(j)k(xk), (45)

i.e., an additive model for each parameter.



Introduction to VGLMs and VGAMs III

Example: negative binomial

Y has a probability function that can be written as

P(Y = y) = C(y + k − 1, y) (µ/(µ + k))^y (k/(k + µ))^k,

where C(·, ·) is the binomial coefficient, y = 0, 1, 2, . . ., and the parameters satisfy µ (= E(Y)) > 0 and k > 0.

The MASS implementation is restricted to an intercept-only estimate of k, e.g., it cannot fit log k = β(2)1 + β(2)2 x2. In contrast, VGAM can fit

log µ = η1 = β1^T x,

log k = η2 = β2^T x.

using vglm(..., negbinomial(zero=NULL)).
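A hedged sketch with simulated data (the object names here are ours) in which both parameters depend on a covariate:

library(VGAM)
set.seed(1)
n  <- 500
x2 <- runif(n)
mu <- exp(1 + 2 * x2)                    # log mu = 1 + 2 x2
k  <- exp(0.5 + 1 * x2)                  # log k  = 0.5 + x2
nbdat <- data.frame(x2 = x2, y = rnbinom(n, mu = mu, size = k))
fit.nb <- vglm(y ~ x2, negbinomial(zero = NULL), data = nbdat)
coef(fit.nb, matrix = TRUE)              # columns correspond to eta1 and eta2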



Introduction to VGLMs and VGAMs IV

Note that it is natural for any positive parameter to have the log link as the default.

There are also variants of the negative binomial in the VGAM package, e.g., the zero-inflated and zero-altered versions.



Introduction to VGLMs and VGAMs V

The new framework extends GLMs and GAMs in three main ways:

(i) multivariate responses y and/or linear/additive predictors η are handled,

(ii) y not restricted to the exponential family,

(iii) ηj need not be a function of a mean µ: ηj = gj(θj) for any parameter θj.

This formulation is deliberately general so that it encompasses as many distributions and models as possible. We wish to be limited only by the assumption that the regression coefficients enter through a set of linear or additive predictors.

Given the covariates, the conditional distribution of the response is intended to be completely general.
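As a small illustration of a multivariate response with several linear predictors, here is a proportional odds fit to the pneumo data that ships with VGAM (a standard VGAM example, shown here only as a sketch):

library(VGAM)
pneumo <- transform(pneumo, let = log(exposure.time))
fit.po <- vglm(cbind(normal, mild, severe) ~ let, propodds, data = pneumo)
coef(fit.po, matrix = TRUE)   # two intercepts (one per eta) and a common slope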



Introduction to VGLMs and VGAMs VI

The scope of VGAM is very broad; it potentially covers

univariate and multivariate distributions,

categorical data analysis,

quantile and expectile regression,

time series,

survival analysis,

mixture models,

extreme value analysis,

nonlinear regression,

. . . .

It brings GLM/GAM-type modelling to a much broader range of models.



Introduction to VGLMs and VGAMs VII

Figure: Flowchart for different classes of models, running from normal errors to generalized and from parametric to nonparametric: LM, VLM, VGLM, VGAM, VAM, RR-VLM, RR-VGLM, RR-VGAM and QRR-VGLM. Legend: LM = linear model, V = vector, G = generalized, A = additive, RR = reduced-rank, Q = quadratic.



Introduction to VGLMs and VGAMs VIII

η                                                  Model      S function   Reference
B1^T x1 + B2^T x2 (= B^T x)                        VGLM       vglm()       Yee & Hastie (2003)
B1^T x1 + ∑_{k=p1+1}^{p1+p2} Hk f*k(xk)            VGAM       vgam()       Yee & Wild (1996)
B1^T x1 + A ν                                      RR-VGLM    rrvglm()     Yee & Hastie (2003)
B1^T x1 + A ν + (ν^T D1 ν, ..., ν^T DM ν)^T        QRR-VGLM   cqo()        Yee (2004)
B1^T x1 + ∑_{r=1}^{R} fr(νr)                       RR-VGAM    cao()        Yee (2006)

Table: A summary of VGAM and its framework. The latent variables ν = C^T x2, or ν = c^T x2 if rank R = 1. Here, x^T = (x1^T, x2^T). Abbreviations: A = additive, C = constrained, L = linear, O = ordination, Q = quadratic, RR = reduced-rank, VGLM = vector generalized linear model.



Concluding Remarks

1 The LM is fundamental to statistical regression modelling.

2 GLMs provide an important unification of several regression models, e.g., normal, Poisson, binomial, . . . .

3 Smoothing allows the data to “speak for themselves”.

4 GAMs are a very useful data-driven variant of GLMs: a combination of GLMs and smoothing. But only about six models are covered.

5 VGLMs/VGAMs are GLMs/GAMs but applied to a much wider class of models. We will look at VGLMs/VGAMs in detail later.



Finito



The End
