Modern Regression Basics - unica2.unica.itunica2.unica.it/conversano/ThomasYee1.pdf · Outline of...
Transcript of Modern Regression Basics - unica2.unica.itunica2.unica.it/conversano/ThomasYee1.pdf · Outline of...
Modern Regression Basics
T. W. Yee
University of Auckland
October 2008 @ Cagliari
[email protected]://www.stat.auckland.ac.nz/~yee
T. W. Yee (University of Auckland) Modern Regression Basics 1/167October 2008 @ Cagliari 1 / 167
Outline of This Talk
Outline of This Talk
1 Linear Models
2 Generalized Linear Models (GLMs)
3 Smoothing
4 Generalized Additive Models (GAMs)
5 Introduction to VGLMs and VGAMs
6 Concluding Remarks
T. W. Yee (University of Auckland) Modern Regression Basics 2/167October 2008 @ Cagliari 2 / 167
Linear Models
Linear Models
Data (xi , yi ,wi ), i = 1, . . . , n, Var(εi ) = σ2/wi , and
E (Yi ) = η(xi ) =
p∑k=1
xik βk .
That is,y = Xβ + ε, ε ∼ Np
(0, σ2 W−1
). (1)
X is an n × p matrix (assumed of rank p), and β is a p-vector ofregression coefficients (parameters).
t-test, ANOVA, multiple linear regression etc. are special cases of (1).
T. W. Yee (University of Auckland) Modern Regression Basics 3/167October 2008 @ Cagliari 3 / 167
Linear Models
Estimation I
Estimate β by weighted least squares (WLS):
β = argminn∑
i=1
wi
(yi −
p∑k=1
xik βk
)2
= argmin (y − Xβ)T W (y − Xβ) .
Solution is (from the normal equations)
β =(XT WX
)−1XT Wy, (2)
y = X β. (3)
Also, the variance-covariance matrix of β is
Var(β) = σ2(XT WX
)−1. (4)
T. W. Yee (University of Auckland) Modern Regression Basics 5/167October 2008 @ Cagliari 5 / 167
Linear Models
Estimation II
Suppose W = In. Then LS has a very nice geometric interpretation. Also,
y = Hy where H = X(XT X
)−1XT . (5)
Note that H = H2 (idempotent) and H = HT (symmetric), hence H is aprojection matrix . Such a matrix represents an orthogonal projection.
The eigenvalues of H are p 1’s and (n − p) 0’s. Consequently,trace(H) = rank(H).
T. W. Yee (University of Auckland) Modern Regression Basics 6/167October 2008 @ Cagliari 6 / 167
Linear Models
Figure: http://www.R-project.org
T. W. Yee (University of Auckland) Modern Regression Basics 8/167October 2008 @ Cagliari 8 / 167
Linear Models
S Model Formulae I
The S model formula adopted from Wilkinson and Rogers (1973).
Form: response ∼ expression
LHS = the response (usually a vector in a data frame or a matrix).
RHS = explanatory variables.
T. W. Yee (University of Auckland) Modern Regression Basics 10/167October 2008 @ Cagliari 10 / 167
Linear Models
S Model Formulae II
Consider
> y ~ x1 + x2 + x3 + f1:f2 + f1 * x1 + f2/f3 + f3:f4:f5 +
+ (f6 + f7)^2
where variables beginning with an x are numeric and those beginning withan f are factors.
By default an intercept is fitted, which is 1. Suppress intercepts by -1.
The interaction f1*f2 is expanded to 1 + f1 + f2 + f1:f2. The termsf1 and f2 are main effects.
A second-order interaction between two factors can be expressed usingfactor:factor: γij . There are other types of interactions. Interactionsbetween a factor and numeric, factor:numeric, produce βjx .
T. W. Yee (University of Auckland) Modern Regression Basics 11/167October 2008 @ Cagliari 11 / 167
Linear Models
S Model Formulae III
Interactions between two numerics, numeric:numeric, produce across-product term such as βx2x3.
The term (f6 + f7)^2 expands to f6 + f7 + f6:f7. A term (f6 + f7+ f8)^2 - f7:f8 would expand to all main effects and all second-orderinteractions except for f7:f8.
Nesting is achieved by /, e.g., f2/f3 is shorthand for 1 + f2 + f3:f2,or equivalently,
> 1 + f2 + f3 %in% f2
Example: f2 = state and f3 = county.
T. W. Yee (University of Auckland) Modern Regression Basics 12/167October 2008 @ Cagliari 12 / 167
Linear Models
S Model Formulae IV
There are times when you need to use the identity function I(), e.g.,because“^”has special meaning,
> lm(y ~ -1 + offset(a) + x1 + I(x2 - 1) + I(x3^3))
fits
yi = ai + β1xi1 + β2(xi2 − 1) + β3x3i3 + εi ,
εi ∼ iid N(0, σ2), i = 1, . . . , n,
where a is a vector containing the (known) ai .
Other functions: factor(), as.factor(), ordered() terms(),levels(), options().
T. W. Yee (University of Auckland) Modern Regression Basics 13/167October 2008 @ Cagliari 13 / 167
Linear Models
S generics
Generic functions are available for lm objects. They include
add1()
anova()
coef()
deviance()
drop1()
plot()
predict()
print()
residuals()
step()
summary()
update().
Other less used generic functions are alias(), effects(), family(),kappa(), labels(), proj().
Some other functions are model.matrix(), options().
T. W. Yee (University of Auckland) Modern Regression Basics 14/167October 2008 @ Cagliari 14 / 167
Linear Models
The lm() Function
> args(lm)
function (formula, data, subset, weights, na.action, method = "qr",
model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
contrasts = NULL, offset, ...)
NULL
The most useful arguments areweights,
subset,
na.action—na.fail(), na.omit(),
contrasts.
Data frames: read.table(), write.table(), na.omit().
T. W. Yee (University of Auckland) Modern Regression Basics 16/167October 2008 @ Cagliari 16 / 167
Linear Models
Factors I
> options()$contrasts
unordered ordered"contr.treatment" "contr.poly"
Note:1 contr.treatment is used so that each coefficient compares that
level with level 1 (omitting level 1 itself).
2 contr.sum constrains the coefficients to sum to zero.
3 contr.poly is used for equally spaced, equally replicated orthogonalpolynomial contrasts
T. W. Yee (University of Auckland) Modern Regression Basics 18/167October 2008 @ Cagliari 18 / 167
Linear Models
Factors II
One can change them by, for example,
> options(contrasts = c("contr.treatment", "contr.poly"))
Here, the first level of the factor is the baseline level.
Table: Dummy variables—partial method.
RACE D1 D2 D3
White 0 0 0Black 1 0 0Hispanic 0 1 0Other 0 0 1
T. W. Yee (University of Auckland) Modern Regression Basics 19/167October 2008 @ Cagliari 19 / 167
Linear Models
Factors III
An example:
> options(contrasts = c("contr.treatment", "contr.poly"))
> y <- 1:9
> x <- rep(1:3, len = 9)
> lm(y ~ as.factor(x))
Call:lm(formula = y ~ as.factor(x))
Coefficients:(Intercept) as.factor(x)2 as.factor(x)3
4 1 2
T. W. Yee (University of Auckland) Modern Regression Basics 20/167October 2008 @ Cagliari 20 / 167
Linear Models
Factors IV
Another example:
> options(contrasts = c("contr.sum", "contr.poly"))
> lm(y ~ as.factor(x))
Call:lm(formula = y ~ as.factor(x))
Coefficients:(Intercept) as.factor(x)1 as.factor(x)25.000e+00 -1.000e+00 -7.493e-17
T. W. Yee (University of Auckland) Modern Regression Basics 21/167October 2008 @ Cagliari 21 / 167
Linear Models
Topics not done . . .
Other important topics not done:
residual analysis
influential observations
robust regression
variable selection
. . .
T. W. Yee (University of Auckland) Modern Regression Basics 22/167October 2008 @ Cagliari 22 / 167
Generalized Linear Models (GLMs)
Generalized Linear Models (GLMs)
Y ∼ Exponential family (normal, binomial, Poisson, . . . )
g(µ) = η(x) = βTx = β1 + β2 x2 + · · ·+ βp xp
g is the link function (known, monotonic, twice differentiable).
η =p∑
k=1βk xk is known as the linear predictor .
Proposed by Nelder and Wedderburn (1972), GLMs include the generallinear model, logistic regression, probit analysis, Poisson regression,gamma, inverse Gaussian etc. The unification was a major breakthrough instatistical theory.
Estimation: iteratively reweighted least squares (IRLS; see later)
T. W. Yee (University of Auckland) Modern Regression Basics 23/167October 2008 @ Cagliari 23 / 167
Generalized Linear Models (GLMs)
The Exponential Family I
The distribution of a univariate r.v. Y belongs to the exponential family ifits p.(d).f. f (y ; θ) can be written as
f (y ; θ) = exp{p(y)q(θ) + r(y) + s(θ)}. (6)
Here θ = parameter of interest, and the functions p, q, r , s are known.Other parameters can be accomodated provided they are known—wesimply incorporate them in p, q, r and s.
The exponential family has a canonical form where p(y) = y . We alsowant to be able to explicitly consider scale parameters such as σ inN(µ, σ2) so we write
f (y ; θ, φ) = exp
{ω
y d(θ)− b(θ)
φ+ c(y , φ, ω)
}. (7)
T. W. Yee (University of Auckland) Modern Regression Basics 25/167October 2008 @ Cagliari 25 / 167
Generalized Linear Models (GLMs)
The Exponential Family II
Equation (7) belongs to (6) provided φ is known (ω is some knownconstant here). (φ > 0, ω ≥ 0).
θ∗ = d(θ)
is often called the natural parameter of the distribution.
T. W. Yee (University of Auckland) Modern Regression Basics 26/167October 2008 @ Cagliari 26 / 167
Generalized Linear Models (GLMs)
The Exponential Family III
(i) Y ∼ N(µ, σ2)
f (y ;µ, σ) =1√
2πσ2exp
{− 1
2σ2(y − µ)2
}= exp
{yµ− 1
2µ2
σ2− y2
2σ2− 1
2log(2πσ2)
}.
(ii) Y ∼ Poisson(θ)
f (y ; θ) =e−θθy
y != exp{y log θ − θ − log y !}.
T. W. Yee (University of Auckland) Modern Regression Basics 27/167October 2008 @ Cagliari 27 / 167
Generalized Linear Models (GLMs)
The Exponential Family IV
(iii) Z ∼ Binomial(m, p). For z = 0, 1, . . . ,m,
P(Z = z ; p) =
(mz
)pz(1− p)m−z
= exp
{z log p + (m − z) log(1− p) + log
(mz
)}= exp
{z log
p
1− p+ m log(1− p) + log
(mz
)}.
If we look at the sample proportion, Y = Z/m, for y = 0, 1/m, . . . ,m/m,
P(Y = y ; p) = P(Z = my ; p)
= exp
[m
{y log
p
1− p+ log(1− p)
}+ log
(mmy
)].
T. W. Yee (University of Auckland) Modern Regression Basics 28/167October 2008 @ Cagliari 28 / 167
Generalized Linear Models (GLMs)
The Exponential Family V
Let `(θ) = log L(θ) = log f (Y ; θ), u(θ) =∂`(θ)
∂θ, and
I(θ) = −∂2`(θ)
∂θ2.
Can show
E [Y ] = µ =b′(θ)
d ′(θ)(8)
and Var(Y ) =φ2
ω2d ′(θ)2
[ω
b′′(θ)− µd ′′(θ)
φ
], or
Var(Y ) =b′′(θ)φ
d ′(θ)2ω− φ
ω
b′(θ)d ′′(θ)
{d ′(θ)}3. (9)
T. W. Yee (University of Auckland) Modern Regression Basics 29/167October 2008 @ Cagliari 29 / 167
Generalized Linear Models (GLMs)
The Exponential Family VI
If the model is parameterized in terms of the natural parameter θ∗, (e.g.,θ∗ = log p
1−p instead of p) then d ′(θ∗) = 1 and
Var(Y ) = b′′(θ∗)φ
ω. (10)
One then gets the following table.
T. W. Yee (University of Auckland) Modern Regression Basics 30/167October 2008 @ Cagliari 30 / 167
Generalized Linear Models (GLMs)
The Exponential Family VII
tNormal(µ, σ2) Poisson(λ) 1
mBinomial(m, p)
(Y = Sample proportion)
f (y ; θ, φ)1
√2πσ
e− 1
2(y−u)2
σ2e−λλy
y !
„mmy
«pmy (1− p)m−my
Range of Y (−∞,∞) 0, 1, 2, . . . 0, 1/m, 2/m, . . . , m/mMean = E(Y ) µ µ = λ µ = p“Usual” parameter, θ µ λ p
“Natural” param, θ∗ µ log λ = log µ logp
1− p= log
µ
1− µ
b(θ) 12µ2 (= 1
2θ∗2) λ (= eθ∗ ) − log(1− p) = log(1 + eθ∗ )
φ σ2 1 1ω 1 1 m
c(y, φ, ω) −1
2
(y2
φ+ log(2πφ)
)− log y ! log
„mmy
«
µ = E [Y ] µ λ (= eθ∗ ) p =eθ∗
1 + eθ∗
Variance Function Constant (σ2) µµ(1− µ)
m(Var(Y ) as fn of µ)
T. W. Yee (University of Auckland) Modern Regression Basics 31/167October 2008 @ Cagliari 31 / 167
Generalized Linear Models (GLMs)
S and GLMs I
In S use, e.g., glm(y x2 + x3 + x4, family=binomial, data=d)
Family functions are gaussian(), binomial(), poisson(),Gamma(), inverse.gaussian(), quasi().
Generic functions include anova(), coef(), fitted(), plot(),predict(), print(), resid(), summary(), update().
Recall the Wilkinson and Rogers (1973) formula language, e.g.,if f1 and f2 are factors and x1 and x2 are numeric, then
f1 ∗ f2 ≡ 1 + f1 + f2 + f1 : f2,
f1/f2 ≡ “f1 and then f2 within factor f1”,
x1 + x2 ≡ β1X1 + β2X2.
Data frames to hold all the data. Columns are the variables.
T. W. Yee (University of Auckland) Modern Regression Basics 33/167October 2008 @ Cagliari 33 / 167
Generalized Linear Models (GLMs)
S and GLMs II
> data(nzc)
> with(nzc, plot(year, female/(male + female), ylab = "Proportion",
+ main = "Proportion of NZ Chinese that are female",
+ col = "blue", las = 1))
> abline(h = 0.5, lty = "dashed")
> fit.nzc = vglm(cbind(female, male) ~ year, fam = binomialff,
+ data = nzc)
> with(nzc, lines(year, fitted(fit.nzc), col = "red"))
T. W. Yee (University of Auckland) Modern Regression Basics 34/167October 2008 @ Cagliari 34 / 167
Generalized Linear Models (GLMs)
S and GLMs III
● ● ● ● ● ●● ● ●
●
●
●
●
●
●
●
●●
●
●
●
●●
●● ●
1880 1900 1920 1940 1960 1980 2000
0.0
0.1
0.2
0.3
0.4
0.5
Proportion of NZ Chinese that are female
year
Pro
port
ion
T. W. Yee (University of Auckland) Modern Regression Basics 35/167October 2008 @ Cagliari 35 / 167
Generalized Linear Models (GLMs)
S and GLMs IV
> with(nzc, plot(year, female/(male + female), ylab = "Proportion",
+ main = "Proportion of NZ Chinese that are female",
+ col = "blue", las = 1))
> abline(h = 0.5, lty = "dashed")
> fit.nzc = vglm(cbind(female, male) ~ poly(year,
+ 2), fam = binomialff, data = nzc)
● ● ● ● ● ●● ● ●
●
●
●
●
●
●
●
●●
●
●
●
●●
●● ●
1880 1900 1920 1940 1960 1980 2000
0.0
0.1
0.2
0.3
0.4
0.5
Proportion of NZ Chinese that are female
year
Pro
port
ion
T. W. Yee (University of Auckland) Modern Regression Basics 36/167October 2008 @ Cagliari 36 / 167
Generalized Linear Models (GLMs)
Logistic regression I
> options(contrasts = c("contr.treatment", "contr.poly"))
> y <- cbind(c(5, 20, 15, 10), c(20, 10, 10, 10))
> x <- 1:4
> fit <- glm(y ~ as.factor(x), family = binomial)
> fit
Call: glm(formula = y ~ as.factor(x), family = binomial)
Coefficients:
(Intercept) as.factor(x)2 as.factor(x)3 as.factor(x)4
-1.386 2.079 1.792 1.386
Degrees of Freedom: 3 Total (i.e. Null); 0 Residual
Null Deviance: 14.04
Residual Deviance: 4.441e-15 AIC: 22.14
> exp(coef(fit)[-1])
T. W. Yee (University of Auckland) Modern Regression Basics 38/167October 2008 @ Cagliari 38 / 167
Generalized Linear Models (GLMs)
Logistic regression II
as.factor(x)2 as.factor(x)3 as.factor(x)4
8 6 4
Here the model is
logit p(x) = β0 + βj , j = 1, 2, 3, 4
where β1 = 0.
T. W. Yee (University of Auckland) Modern Regression Basics 39/167October 2008 @ Cagliari 39 / 167
Generalized Linear Models (GLMs)
Logistic regression III
If η(x) = β0 + β1x then the log odds for a change in c units in x isobtained from the logit difference
η(x + c)− η(x) = c β1 (11)
and the associated odds ratio is
ψ(c) = ψ(x + c, x) = exp(c β1). (12)
Example If
logit P(D|AGE) = 0.2 + 0.13× AGE
then an increase in age of 10 years will increase the odds of disease byexp(1.3) = 3.67. �
T. W. Yee (University of Auckland) Modern Regression Basics 40/167October 2008 @ Cagliari 40 / 167
Generalized Linear Models (GLMs)
Logistic regression IV
In general, for
logit p(x) = β0 + βTx (13)
we have
log ψ = log
{p(x1)/(1− p(x1))
p(x0)/(1− p(x0))
}= βT (x1 − x0). (14)
Thus, for confidence intervals etc., use
log ψ = (x1 − x0)T Var(β) (x1 − x0). (15)
Note:
p(x1)/(1− p(x1))
p(x0)/(1− p(x0))
is the odds ratio for Y = 1 for a person with x1 relative to a person withx0.T. W. Yee (University of Auckland) Modern Regression Basics 41/167October 2008 @ Cagliari 41 / 167
Generalized Linear Models (GLMs)
Some extensions of GLMs
Quasi-likelihood (Wedderburn, 1974)
Composite link functions (Thompson and Baker, 1981)
IRLS for other models (Green, 1984)
Double Exponential Families (Efron, 1986)
Generalized Estimating Equations (GEE; Liang and Zeger, 1986)
Generalized linear mixed models (GLMMs)
ANOVA Splines (Wahba and co-workers, 1995)
Polychotomous Regression (Kooperberg and co-workers, 1997)
Multivariate GLMs (Fahrmeir and Tutz, 2001)
Heirarchial GLMs (Nelder and Lee, late 1990’s)
Generalized additive models (GAMs; Hastie and Tibshirani, 1986)
Generalized additive mixed models (GAMMs; Lin, 1998)
Vector GLMs and VGAMs (Yee and Wild, 1996)
T. W. Yee (University of Auckland) Modern Regression Basics 42/167October 2008 @ Cagliari 42 / 167
Smoothing
Smoothing
Smoothing is a powerful tool for exploratory data analysis. It allows adata-driven approach rather than model-driven approach. Allows the datato“speak for itself”.
Probably the central idea is localness, i.e., local behaviour versus globalbehaviour of a function.
Scatterplot data (xi , yi ), i = 1, . . . , n.
The classical smoothing problem is
yi = f (xi ) + εi , εi ∼ (0, σi ) (16)
independently. Here, f is an arbitary smooth function, and i = 1, . . . , n.
Q: How can f be estimated?
A: If there is no a priori function form for f , one solution is the smoother.T. W. Yee (University of Auckland) Modern Regression Basics 43/167October 2008 @ Cagliari 43 / 167
Smoothing
Uses of Smoothing
Smoothing has many uses, e.g.,
data visualization and EDA
prediction
derivative estimation, e.g., growth curves, acceleration
used as a basis for many modern statistical techniques
T. W. Yee (University of Auckland) Modern Regression Basics 45/167October 2008 @ Cagliari 45 / 167
Smoothing
Example I
0.0 0.2 0.4 0.6 0.8 1.0
1.5
2.0
2.5
3.0
x
y
T. W. Yee (University of Auckland) Modern Regression Basics 47/167October 2008 @ Cagliari 47 / 167
Smoothing
Example I
0.0 0.2 0.4 0.6 0.8 1.0
1.5
2.0
2.5
3.0
x
y
T. W. Yee (University of Auckland) Modern Regression Basics 49/167October 2008 @ Cagliari 49 / 167
Smoothing
There are four broad categories of smoothers:
1 series or regression smoothers (polynomials, Fourier regression,regression splines, filtering),
2 Kernel smoothers (N-W, locally weighted averages, local regression,loess),
3 Smoothing splines (roughness penalties),
4 Near neighbour smoothers (running means, medians, Tukeysmoothers).
We will look at kernel smoothers and splines.
T. W. Yee (University of Auckland) Modern Regression Basics 50/167October 2008 @ Cagliari 50 / 167
Smoothing
Scatterplot data (yi , xi ), i = 1, . . . , n.
The classical smoothing problem is
Yi = f (Xi ) + εi (17)
where
f = a smooth function estimated from the data,E (εi ) = 0,
Var(εi ) = σ2 independently.
We let Var(εi ) ∝ w−1i (known), written Var(ε) = W−1,
W = diag(w1, . . . ,wn) = Σ−1.
WLOG let the data is ordered so that x1 < x2 < · · · < xn.
T. W. Yee (University of Auckland) Modern Regression Basics 52/167October 2008 @ Cagliari 52 / 167
Smoothing
Kernel Smoothers INadaraya-Watson Estimator
Kernel regression estimators are well-known, easily understood andmathematically tractable.
The Nadaraya-Watson (N-W) estimator estimates f (x) by
fnw (x) =
n∑i=1
K
(x − xi
h
)yi
n∑i=1
K
(x − xi
h
) =
n∑i=1
Kh(x − xi ) yi
n∑i=1
Kh(x − xi )
(18)
where
Kh(u) = h−1K(u
h
). (19)
T. W. Yee (University of Auckland) Modern Regression Basics 54/167October 2008 @ Cagliari 54 / 167
Smoothing
Kernel Smoothers IINadaraya-Watson Estimator
K , a symmetric unimodal function about 0 which integrates to unity,creates the local averaging of values of yi whose corresponding values of xi
are close to the point of estimation x .
The amount of smoothing is controlled by the bandwidth h. Some popularkernel functions are given in Table 2.
As h decreases, the bias decreases and the variance increases.
In practice, the choice of bandwidth h is more crucial than the choice ofkernel function.
T. W. Yee (University of Auckland) Modern Regression Basics 55/167October 2008 @ Cagliari 55 / 167
Smoothing
Kernel Smoothers IIINadaraya-Watson Estimator
0.0 0.2 0.4 0.6 0.8 1.0
1.5
2.0
2.5
3.0
Reg
ress
ion
func
tion
T. W. Yee (University of Auckland) Modern Regression Basics 56/167October 2008 @ Cagliari 56 / 167
Smoothing
Kernel Smoothers IVNadaraya-Watson Estimator
Table: Popular kernel functions. Nb. the quartic is also known as the biweight.
Kernel K (u)
Uniform 12 I (|u| ≤ 1)
Triangle (1− |u|) I (|u| ≤ 1)Epanechnikov 3
4(1− u2) I (|u| ≤ 1)Quartic 15
16(1− u2)2 I (|u| ≤ 1)Tricube 70
81(1− |u|3)3 I (|u| ≤ 1)Triweight 35
32(1− u2)3 I (|u| ≤ 1)
Gaussian exp(−12u2)/
√2π
Cosinus π4 cos(πu/2) I (|u| ≤ 1)
T. W. Yee (University of Auckland) Modern Regression Basics 57/167October 2008 @ Cagliari 57 / 167
Smoothing
Local Regression I
Theoretically elegant and also called local polynomial kernel estimators, ithas favourable asymptotic properties and boundary behaviour.
Idea: estimate f (x0) by“locally”fitting a rth degree polynomial to data viaweighted least squares (WLS).
Example: a local linear kernel estimate for
f (xi ) = 2 exp(−x2i /0.3
2) + 3 exp(−(xi − 1)2/0.72) + εi ,
xi = (i − 1)/n,
εi ∼ N(0, σ = 0.115) independently.
T. W. Yee (University of Auckland) Modern Regression Basics 59/167October 2008 @ Cagliari 59 / 167
Smoothing
Local Regression II
0.0 0.2 0.4 0.6 0.8 1.0
1.5
2.0
2.5
3.0
Reg
ress
ion
func
tion
Figure: Local linear kernel estimate (solid red) of the regression function f givenin the text based on 100 simulated observations (crosses). The solid black curveis the true function. The red dashed curves are the kernel weights.
T. W. Yee (University of Auckland) Modern Regression Basics 60/167October 2008 @ Cagliari 60 / 167
Smoothing
We now derive an explicit expression for the local polynomial kernelestimator.
Let r be the degree of the polynomial being fitted. At a point x , theestimator f (x ; r , h) is obtained by fitting the polynomialβ0 + β1(· − x) + · · ·+ βr (· − x)r to the (xi , yi ) using WLS with kernelweights Kh(xi − x).
The value of f (x ; r , h) is the height of the fit β0, where β = (β0, . . . , βr )T
minimizes
n∑i=1
{yi − β0 − β1(xi − x)− · · · − βr (xi − x)r}2 Kh(xi − x) . (20)
The solution is
β = (XTx WxXx)
−1XTx Wx y (21)
T. W. Yee (University of Auckland) Modern Regression Basics 62/167October 2008 @ Cagliari 62 / 167
Smoothing
where y = (y1, . . . , yn)T ,
Xx =
1 (x1 − x) . . . (x1 − x)r
......
...1 (xn − x) . . . (xn − x)r
is n × (r + 1), and Wx=Diag(Kh(x1 − x), . . . ,Kh(xn − x)). Since theestimator of f (x) is the intercept, we have
f (x ; r , h) = eT1 (XT
x WxXx)−1XT
x Wx y . (22)
T. W. Yee (University of Auckland) Modern Regression Basics 63/167October 2008 @ Cagliari 63 / 167
Smoothing
Simple explicit formulae exist for the N-W estimator (r = 0):
f (x ; 0, h) =
n∑i=1
Kh(xi − x) yi
n∑i=1
Kh(xi − x)
(23)
and the local linear estimator (r = 1):
f (x ; 1, h) = n−1n∑
i=1
{s2(x ; h)− s1(x ; h)(xi − x)} Kh(xi − x) yi
s2(x ; h)s0(x ; h)− s1(x ; h)2(24)
where
sr (x ; h) = n−1n∑
i=1
(xi − x)r Kh(xi − x) . (25)
T. W. Yee (University of Auckland) Modern Regression Basics 65/167October 2008 @ Cagliari 65 / 167
Smoothing
Derivative Estimation I
Uses include the study of human growth curves where the first twoderivatives of height as a function of age (“speed”and“acceleration”ofgrowth) have important biological significance.
The extension of local polynomial ideas to estimate the νth derivative isstraightforward. One can estimate f (ν)(x) via the intercept coefficient ofthe νth derivative of the local polynomial being fitted at x , assumingν ≤ r . In general,
f (ν)(x ; r , h) = ν! eTν+1(X
Tx WxXx)
−1XTx Wx y, for all ν = 0, . . . , r (26)
from (22). Note that f (ν)(x ; r , h) is not in general equal to the νthderivative of f (x ; r , h).
T. W. Yee (University of Auckland) Modern Regression Basics 67/167October 2008 @ Cagliari 67 / 167
Smoothing
Derivative Estimation II
Choosing rIn the early 1990’s Fan and co-workers showed that, for estimating f (ν)(x),there is no increase in variability when passing from an even (i.e., r − νeven) r = ν + 2q order fit to an odd r = ν + 2q + 1 order fit, but whenpassing from an odd r = ν + 2q + 1 order fit to the consecutive evenr = ν + 2q + 2 order there is a price to be paid in terms of increasedvariability. Therefore, even order fits r = ν + 2q are not recommended.Fan and Gijbels (1996) recommend using the lowest odd order, i.e.,r = ν + 1, or occasionally r = ν + 3.
For f choose r = 1 (maybe 3) . . .
For f ′ choose r = 2 (maybe 4) . . .
T. W. Yee (University of Auckland) Modern Regression Basics 68/167October 2008 @ Cagliari 68 / 167
Smoothing
Lowess and Loess I
A popular method based on local regression is Lowess (Cleveland, 1979)and Loess (Cleveland and Devlin, 1988).
Lowess = locally weighted scatterplot smoother, and it robustifies thelocally WLS method above.
The basic idea is to fit a polynomial of degree r locally via (20) and obtainthe fitted values. Then calculate the residuals and assign weights to eachresidual: large/small residuals receive small/large weights respectively.Then perform another local polynomial fit of order r with weights given bythe product of the initial weight and new weight. Thus observationsshowing large residuals at the initial fit are downweighted in the second fit.The above process is repeated a number of times.
Cleveland (1979) recommended r = 1 and 3 iterations (default).
T. W. Yee (University of Auckland) Modern Regression Basics 70/167October 2008 @ Cagliari 70 / 167
Smoothing
Lowess and Loess II
> par(mfrow = c(2, 2), mar = c(5, 4, 2, 1) + 0.1)
> set.seed(761)
> x <- sort(rnorm(100))
> eps <- rnorm(100, 0, 0.1)
> y <- sin(x) + eps
> plot(x, y, col = "blue", pch = 4)
> title("Default: lowess(x, y)", cex = 0.5)
> lo1 <- lowess(x, y)
> lines(lo1, lty = 1, col = "red")
> plot(x, y, col = "blue", pch = 4)
> title("lowess(x, y, f=0.5)", cex = 0.5)
> lo2 <- lowess(x, y, f = 0.5)
> lines(lo2, lty = 1, col = "red")
> plot(x, y, col = "blue", pch = 4)
> title("lowess(x, y, f=0.2)", cex = 0.5)
> lo3 <- lowess(x, y, f = 0.2)
> lines(lo3, lty = 1, col = "red")
> plot(x, y, col = "blue", pch = 4)
> title("Default: loess(y ~ x)", cex = 0.5)
T. W. Yee (University of Auckland) Modern Regression Basics 71/167October 2008 @ Cagliari 71 / 167
Smoothing
Lowess and Loess III
> lo4 <- loess(y ~ x)
> lines(x, fitted(lo4), lty = 1, col = "red")
−2 −1 0 1 2
−1.0
−0.5
0.0
0.5
1.0
x
y
Default: lowess(x, y)
−2 −1 0 1 2
−1.0
−0.5
0.0
0.5
1.0
x
y
lowess(x, y, f=0.5)
−2 −1 0 1 2
−1.0
−0.5
0.0
0.5
1.0
x
y
lowess(x, y, f=0.2)
−2 −1 0 1 2
−1.0
−0.5
0.0
0.5
1.0
x
y
Default: loess(y ~ x)
T. W. Yee (University of Auckland) Modern Regression Basics 72/167October 2008 @ Cagliari 72 / 167
Smoothing
Lowess and Loess IV
Once again choosing a good bandwidth is crucial.
T. W. Yee (University of Auckland) Modern Regression Basics 73/167October 2008 @ Cagliari 73 / 167
Smoothing
Local Likelihood I
Local likelihood replaces the local least squares criterion by an appropriatelocal log-likelihood criterion.
Example: for binary data (xi , yi ), i = 1, . . . , n, yi = 0 or 1, the locallog-likelihood is
n∑i=1
K
(|xi − x |
h
){yi log pi + (1− yi ) log (1− pi )} (27)
where pi = p(xi ) = P(Y = 1|xi ).
We could model p(x) directly using local polynomials, however, it isusually preferable to use θ(x) = logit p(x). We approximate θ(x) locallyby a polynomial, then choose the polynomial coefficients to maximize thelikelihood.
T. W. Yee (University of Auckland) Modern Regression Basics 75/167October 2008 @ Cagliari 75 / 167
Smoothing
Local Likelihood II
Local likelihood can also be applied to other regression models and densityestimation.
Local likelihood developed by Tibshirani (1984). A good book on the topicis Loader (1999).
T. W. Yee (University of Auckland) Modern Regression Basics 76/167October 2008 @ Cagliari 76 / 167
Smoothing
Regression Splines I
Idea: fit a higher degree polynomial (polynomial regression.)
Some drawbacks:
Polynomials aren’t very local but have a global nature. So usually arenot ok at the boundaries, especially if the degree of the polynomial ishigh [cf. Stone-Weierstrass Theorem];
Individual observations can have a large influence on remote parts ofthe curve;
The polynomial degree cannot be controlled continuously.
Polynomial regression can be fitted using the poly() function, e.g.,
> fit <- lm(y ~ poly(x, 5))
fits a 5th degree polynomial.
T. W. Yee (University of Auckland) Modern Regression Basics 78/167October 2008 @ Cagliari 78 / 167
Smoothing
Regression Splines II
Regression splines use a piecewise polynomial. The regions are separatedby knots (or breakpoints). The positions where each pair of segments joinare called joints. The more knots, the more flexible the family of curvesbecome.
It is customary to force the piecewise polynomials to join smoothly atthese knots. A popular choice are piecewise cubic polynomials withcontinuous 0th, 1st and 2nd derivatives called cubic splines. Using splinesof degree > 3 seldom yields any advantage.
Given a set of knots, the smooth is computed by multiple regression on aset of basis vectors.
T. W. Yee (University of Auckland) Modern Regression Basics 79/167October 2008 @ Cagliari 79 / 167
Smoothing
Regression Splines III
Here’s a regression spline.
> pos <- function(x) ifelse(x > 0, x, 0)
> x <- 1:7
> y <- c(8, 3, 8, 5, 9, 14, 11)
> knot <- 4
> plot(x, y, col = "blue")
> X <- cbind(1, x, x^2, x^3, pos(x - knot)^3)
> fit <- lm(y ~ X - 1)
> xx <- seq(1, 7, length = 200)
> XX <- cbind(1, xx, xx^2, xx^3, pos(xx - knot)^3)
> lines(xx, XX %*% coef(fit))
> abline(v = knot, lty = "dashed", col = "purple")
> X
T. W. Yee (University of Auckland) Modern Regression Basics 80/167October 2008 @ Cagliari 80 / 167
Smoothing
Regression Splines IV
x
[1,] 1 1 1 1 0
[2,] 1 2 4 8 0
[3,] 1 3 9 27 0
[4,] 1 4 16 64 0
[5,] 1 5 25 125 1
[6,] 1 6 36 216 8
[7,] 1 7 49 343 27
T. W. Yee (University of Auckland) Modern Regression Basics 81/167October 2008 @ Cagliari 81 / 167
Smoothing
Regression Splines V
●
●
●
●
●
●
●
1 2 3 4 5 6 7
4
6
8
10
12
14
x
y
T. W. Yee (University of Auckland) Modern Regression Basics 82/167October 2008 @ Cagliari 82 / 167
Smoothing
Regression Splines VI
Definitions: A function f ∈ C k [a, b] if derivatives f ′, f ′′, . . . , f (k) all existand are continuous in [a, b], e.g., |x | /∈ C 1[a, b].Notes:
1 f ∈ C k [a, b] =⇒ f ∈ C k−1[a, b].
2 C [a, b] ≡ C 0[a, b] = {f (t) : f (t) continuous and real valued,a ≤ t ≤ b}.
There are at least two bases for cubic splines:
1 truncated power series easier to understand but is not used inpractice,
2 B-splines harder to understand but is used in practice.
T. W. Yee (University of Auckland) Modern Regression Basics 83/167October 2008 @ Cagliari 83 / 167
Smoothing
Regression Splines VII
Advantages of regression splines:
computationally and statistically simple,
standard parametric inferences are available. For example, testingwhether a knot can be removed and the same polynomial equationused to explain two adjacent segments can be tested by H0 : θj = 0,which is one of the t-tests statistics always printed by a regressionprogram.
Disadvantages of regression splines:
difficult to choose the number of knots,
difficult to choose the position of the knots,
the smoothness of the estimate cannot be varied continuously as afunction of a single smoothing parameter.
T. W. Yee (University of Auckland) Modern Regression Basics 84/167October 2008 @ Cagliari 84 / 167
Smoothing
Regression Splines VIII
Here is a more formal definition of a spline.
In mathematics, a ‘spline’ denotes a function s(x) which is essentially apiecewise polynomial over an interval (a, b), such that a certain number ofits derivatives are continuous for all points in (a, b). More precisely, s(x) isa spline of degree r (some given positive integer) with knots ξ1, . . . , ξK(such that a < ξ1 < ξ2 < · · · < ξK < b) if it satisfies the followingproperties:
for any subinterval (ξj , ξj+1), s(x) is a polynomial of degree r ;(order r + 1);
s ′(x), . . . , s(r−1)(x) are continuous (derivatives), i.e., s ∈ C r−1(a, b);
the rth derivative of s(x) is a step function with jumps at ξ1, . . . , ξK .
Often r is chosen to be 3, and the term cubic spline is then used for theassociated curve.
T. W. Yee (University of Auckland) Modern Regression Basics 85/167October 2008 @ Cagliari 85 / 167
Smoothing
Regression Splines IX
Wold (1974), in a paper reflecting a lot of experience fitting regressionsplines, made the following recommendations when using cubic splines:
1 Knot points should be located at data points,
2 Have as few knots as possible, ensuring that a minimum of 4 or 5observations should fall between knot points,
3 No more than one extremum point and one inflexion point should fallbetween knots (because a cubic is not possible of approximating morevariations),
4 Extrema should be centered in intervals and inflexion points should belocated near knot points.
T. W. Yee (University of Auckland) Modern Regression Basics 86/167October 2008 @ Cagliari 86 / 167
Smoothing
B-Splines I
B-splines form a numerically stable basis for splines.It is convenient to consider splines of a general order, M say.
1 M = 4: cubic spline.
2 M = 3: quadratic spline which has continuous derivatives up to orderM − 2 = 1 at the knots—this is aka a parabolic spline.
3 M = 2: linear spline which has continuous derivatives up to orderM − 2 = 0 at the knots—i.e., the function is continuous.
Let ξ0 (< ξ1) and ξK+1 (> ξK ) be 2 boundary knots. Define theaugmented knot sequence {τ} such that
τ1 ≤ τ2 ≤ · · · ≤ τM ≤ ξ0;
τj+M = ξj , j = 1, . . . ,K ;
ξK+1 ≤ τK+M+1 ≤ · · · ≤ τK+2M .
T. W. Yee (University of Auckland) Modern Regression Basics 88/167October 2008 @ Cagliari 88 / 167
Smoothing
B-Splines II
The actual values of these additional knots beyond the boundary arearbitrary, and it is customary to make them all the same and equal to ξ0and ξK+1 respectively.
Denote by Bi ,m(x) the ith B-spline basis function of order m for the knotsequence {τ}, m ≤ M. They are defined recursively as follows:
For i = 1, . . . ,K + 2M − 1,
Bi ,1(x) =
{1, τi ≤ x < τi+1,0, otherwise;
(28)
T. W. Yee (University of Auckland) Modern Regression Basics 89/167October 2008 @ Cagliari 89 / 167
Smoothing
B-Splines III
Then for i = 1, . . . ,K + 2M −m,
Bi ,m(x) =x − τi
τi+m−1 − τiBi ,m−1(x) +
τi+m − x
τi+m − τi+1Bi+1,m−1(x) (29)
(de Boor, 1978). He showed stable and efficient recursive algorithmsfor computing them.
Thus with m = 4, Bi ,4, i = 1, . . . ,K + 4, are the K + 4 cubic B-splinebasis functions for the knot sequence {ξ}. This recursion can be continuedand will generate the B-spline basis for any order spline.
T. W. Yee (University of Auckland) Modern Regression Basics 90/167October 2008 @ Cagliari 90 / 167
Smoothing
B-Splines IV
> knots <- c(1:3, 5, 7, 8, 10)
> atx <- seq(0, 11, by = 0.01)
> mycol = (1:(22 + 1))[-7]
> for (ord in 2:5) {
+ B <- bs(x = atx, degree = ord - 1, knots = knots,
+ intercept = TRUE)
+ matplot(atx, B[, 1], type = "l", ylim = 0:1,
+ lty = 2, ylab = "", xlab = "")
+ matlines(atx, B[, -1], col = mycol, lty = 1)
+ title(paste("B-splines of order", ord))
+ abline(v = knots, lty = 2, col = "purple")
+ }
> attr(B, "degree")
[1] 4
> attr(B, "knots")
[1] 1 2 3 5 7 8 10
T. W. Yee (University of Auckland) Modern Regression Basics 91/167October 2008 @ Cagliari 91 / 167
Smoothing
B-Splines V
> attr(B, "Boundary.knots")
[1] 0 11
> attr(B, "intercept")
[1] TRUE
> attr(B, "class")
[1] "bs" "basis"
T. W. Yee (University of Auckland) Modern Regression Basics 92/167October 2008 @ Cagliari 92 / 167
Smoothing
B-Splines VI
0 2 4 6 8 10
0.0
0.2
0.4
0.6
0.8
1.0
B−splines of order 2
0 2 4 6 8 10
0.0
0.2
0.4
0.6
0.8
1.0
B−splines of order 3
0 2 4 6 8 10
0.0
0.2
0.4
0.6
0.8
1.0
B−splines of order 4
0 2 4 6 8 10
0.0
0.2
0.4
0.6
0.8
1.0
B−splines of order 5
T. W. Yee (University of Auckland) Modern Regression Basics 93/167October 2008 @ Cagliari 93 / 167
Smoothing
B-Splines VII
In general, bs() adds ord boundary knots to each end, where theboundary knot values are min(xi ) and max(xi ). If intercept=FALSE thenthe left-most function/column is omitted.
T. W. Yee (University of Auckland) Modern Regression Basics 94/167October 2008 @ Cagliari 94 / 167
Smoothing
B-Splines VIII
To illustrate that linear combinations of the B-spline basis functions doaccommodate smooth curves,
> matplot(atx, 2 * B[, 3] - 5 * B[, 4] + 3 * B[,
+ 7], type = "l", lwd = 2, col = "blue")
0 2 4 6 8 10
−2
−1
0
1
atx
2 *
B[,
3] −
5 *
B[,
4] +
3 *
B[,
7]
T. W. Yee (University of Auckland) Modern Regression Basics 95/167October 2008 @ Cagliari 95 / 167
Smoothing
B-Splines IX
Here are some additional notes:1 > args(bs)
function (x, df = NULL, knots = NULL, degree = 3, intercept = FALSE,
Boundary.knots = range(x))
NULL
In fact, in Value:, df should be length(knots) + degree +intercept.
2 Safe prediction not as good as smart prediction, e.g., I(bs(x)),poly(scale(x), 2).
T. W. Yee (University of Auckland) Modern Regression Basics 96/167October 2008 @ Cagliari 96 / 167
Smoothing
B-Splines X
3 B-splines are actually defined by means of divided differences. Bi ,m,which is based on knots τi , . . . , τi+m, is defined as
Bi ,m(x) = (τi+m − τi )i+m∑j=i
(x − τj)m−1+
i+m∏s=i ,s 6=j
(τj − τs)
. (30)
Equation (29) follows from this.
T. W. Yee (University of Auckland) Modern Regression Basics 97/167October 2008 @ Cagliari 97 / 167
Smoothing
B-Splines XI
As an illustration,
> library(splines)
> n <- 50
> set.seed(760)
> knots <- 1:5
> x <- seq(0, 2 * pi, length = n)
> y <- sin(x) + rnorm(n, sd = 0.5)
> plot(x, y, col = "blue", pch = 4)
> fit <- lm(y ~ bs(x, knots = knots))
> abline(v = knots, lty = "dashed", col = "purple")
> lines(x, sin(x), col = "black")
> aknots = c(-Inf, knots, Inf)
> for (ii in 2:length(aknots)) {
+ newx = seq(max(aknots[ii - 1], min(x)), min(aknots[ii],
+ max(x)), len = 200)
+ lines(newx, predict(fit, data.frame(x = newx)),
+ col = ii - 1, lwd = 2)
+ }
T. W. Yee (University of Auckland) Modern Regression Basics 98/167October 2008 @ Cagliari 98 / 167
Smoothing
B-Splines XII
Overall, the fit is ok, but could be improved by decreasing the number ofknots and heeding the recommendations of Wold (1974).
0 1 2 3 4 5 6
−1.5
−1.0
−0.5
0.0
0.5
1.0
1.5
x
y
T. W. Yee (University of Auckland) Modern Regression Basics 99/167October 2008 @ Cagliari 99 / 167
Smoothing
B-Splines XIII
Knots with varying multiplicities have an effect illustrated by the following.
●
●●
●
●
●●
●
●
●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●●
●
●
●●
●
●
●
●
●
●
●
●●
0 1 2 3 4 5 6
−1
0
1
2Multiplicity 1
●
●●
●
●
●●
●
●
●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●●
●
●
●●
●
●
●
●
●
●
●
●●
0 1 2 3 4 5 6
−1
0
1
2Multiplicity 2
●
●●
●
●
●●
●
●
●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●●
●
●
●●
●
●
●
●
●
●
●
●●
0 1 2 3 4 5 6
−1
0
1
2Multiplicity 3
●
●●
●
●
●●
●
●
●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●●
●
●
●●
●
●
●
●
●
●
●
●●
0 1 2 3 4 5 6
−1
0
1
2Multiplicity 4
T. W. Yee (University of Auckland) Modern Regression Basics 100/167October 2008 @ Cagliari 100 / 167
Smoothing
Natural Splines I
A cubic spline on [a, b] is a natural cubic splines (NCS) if its 2nd and 3rdderivatives are 0 at a and b (natural boundary conditions).
Natural splines, a restricted form of B-splines, has been implemented bythe function ns(). Given knots ξ1, . . . , ξK , ns() is linear on (−∞, ξ0] and[ξK+1,∞) where ξ0 and ξK+1 are two extra knots.
ns() chooses these to be the minimum and maximum of the xi
respectively. The result is K + 2 parameters.
T. W. Yee (University of Auckland) Modern Regression Basics 102/167October 2008 @ Cagliari 102 / 167
Smoothing
Natural Splines II
Here’s an example.
> set.seed(21)
> nn = 20
> x = seq(0, 1, len = nn)
> y = runif(nn)
> myknots = c(0.3, 0.7)
> plot(x, y, xlim = c(-0.5, 1.5), col = "blue")
> fit = lm(y ~ ns(x, knot = myknots))
> newx = seq(-0.5, 2.5, len = 100)
> lines(newx, predict(fit, data.frame(x = newx)),
+ col = "blue")
> abline(v = c(range(x), myknots), col = "purple",
+ lty = "dashed")
> coef(fit)
T. W. Yee (University of Auckland) Modern Regression Basics 103/167October 2008 @ Cagliari 103 / 167
Smoothing
Natural Splines III
(Intercept) ns(x, knot = myknots)1
0.4918062 -0.2425835
ns(x, knot = myknots)2 ns(x, knot = myknots)3
0.1631501 -0.1865069
T. W. Yee (University of Auckland) Modern Regression Basics 104/167October 2008 @ Cagliari 104 / 167
Smoothing
Natural Splines IV
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
−0.5 0.0 0.5 1.0 1.5
0.0
0.2
0.4
0.6
0.8
1.0
x
y
T. W. Yee (University of Auckland) Modern Regression Basics 105/167October 2008 @ Cagliari 105 / 167
Smoothing
Smoothing splines I
Cubic smoothing splines minimize
S(f ) =n∑
i=1
(yi − f (xi ))2 + λ
∫ b
a{f ′′(x)}2 dx , (31)
over a Sobolev space of order 2. Here, a < x1 < · · · < xn < b for some aand b, and λ ≥ 0.
The terms of S(f ):
1 The first penalizes lack-of-fit;
2 the second penalizes wiggliness.
These two conflicting quantities are weighted by the non-negativesmoothing parameter λ.
T. W. Yee (University of Auckland) Modern Regression Basics 107/167October 2008 @ Cagliari 107 / 167
Smoothing
Smoothing splines II
Larger values of λ produce more smoother curves.
As λ→∞, f ′′(x) → 0 and the solution is a least squares line.
As λ→ 0, the solution tends to an interpolating twice-differentiablefunction.
(31) fits into the“penalty function”approach (Green and Silverman, 1994).Penalized least squares minimizes
(y − f)T Σ−1(y − f) + fTKf.
Solution:f = A(λ) y
where A(λ) = (In + ΣK)−1 is the influence or smoother matrix .
T. W. Yee (University of Auckland) Modern Regression Basics 108/167October 2008 @ Cagliari 108 / 167
Smoothing
Some notes I
Here are some notes:
1 The smoothing parameter λ can be regarded as the ‘turning knob’which controls the tradeoff between fidelity to the data andsmoothness. Can select λ by trial and error.
2 The justification of the penalty term by physics (energy ∝∫ ba curvature2), which ≈∝
∫ ba f ′′(t)2 dt. Hooke’s Law.
3 Importantly, Reinsch (1967) showed, using the calculus of variations,that the solution of (31) is a cubic spline with knots at the uniquevalues of the xi . It can be shown that minimizing (31) is equivalent tominimizing∫ b
a{f ′′(x)}2 dx subject to
n∑i=1
{yi − f (xi )}2 ≤ σ.
4 As n →∞, λ should become smaller.
T. W. Yee (University of Auckland) Modern Regression Basics 110/167October 2008 @ Cagliari 110 / 167
Smoothing
Some notes II5 There are alternative regularizations, e.g.,∫ b
af ′(x)2 dx (32)
whose solution is a linear spline. In general,∫ b
af (ν)(x)2 dx
produces a spline of degree 2ν − 1. Note we never get an even degreespline—not unless fractional derivatives are used.
6 S2[a, b] is actually a Sobolev space of order 2. In general, a Sobolevspace of order m is W m
2 [a, b] = {f : f (j), j =0, . . . ,m − 1, is absolutely continuous on [a, b], f (m) ∈ L2[a, b]},i.e.,
∫ ba {f
(m)(t)}2 dt <∞.
T. W. Yee (University of Auckland) Modern Regression Basics 111/167October 2008 @ Cagliari 111 / 167
Smoothing
Some notes III
How can we compute a cubic smoothing spline? There are several ways:
1 Direct method. Not recommended (O(n3)).
2 State-space approach (O(n)).
3 B-splines—this is a numerically stable method (O(n)).
4 Reinsch algorithm (O(n)).
T. W. Yee (University of Auckland) Modern Regression Basics 112/167October 2008 @ Cagliari 112 / 167
Smoothing
> args(smooth.spline)
function (x, y = NULL, w = NULL, df, spar = NULL, cv = FALSE,
all.knots = FALSE, nknots = NULL, keep.data = TRUE, df.offset = 0,
penalty = 1, control.spar = list())
NULL
For basic use, use the arguments df.Here’s an example.
> data(cars)
> with(cars, plot(speed, dist, main = "data(cars) & smoothing splines"))
> cars.spl <- with(cars, smooth.spline(speed, dist))
> cars.spl
Call:
smooth.spline(x = speed, y = dist)
Smoothing Parameter spar= 0.7801305 lambda= 0.1112206 (11 iterations)
Equivalent Degrees of Freedom (Df): 2.635278
Penalized Criterion: 4187.776
GCV: 244.1044
T. W. Yee (University of Auckland) Modern Regression Basics 114/167October 2008 @ Cagliari 114 / 167
Smoothing
This example has duplicate points, so avoid cv=TRUE.
> lines(cars.spl, col = "blue")
> with(cars, lines(smooth.spline(speed, dist, df = 10),
+ lty = 2, col = "red", lwd = 2))
> with(cars.spl, legend(5, 120, c(paste("default [C.V.] => df =",
+ round(df, 1)), "s( * , df = 10)"), col = c("blue",
+ "red"), lty = 1:2, lwd = 1:2, bg = "bisque"))
●
●
●
●
●
●
●
●
●
●
●
●
●●● ●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
● ●
●
●
●●
●
●
5 10 15 20 25
0
20
40
60
80
100
120
data(cars) & smoothing splines
speed
dist
default [C.V.] => df = 2.6s( * , df = 10)
T. W. Yee (University of Auckland) Modern Regression Basics 115/167October 2008 @ Cagliari 115 / 167
Smoothing
Smoothing—Some General Theory I
In scatterplot smoothing, there is a fundamental trade-off between the biasand variance of the estimate, and this phenomenon is governed by thesmoothing parameter.
An optimal choice of span would trade the bias against the variance. Onesuch criterion is the mean square error (MSE):
E
[(fk(xi )− f (xi )
)2]
= Var(fk(xi )
)+(E fk(xi )− f (xi )
)2.
T. W. Yee (University of Auckland) Modern Regression Basics 117/167October 2008 @ Cagliari 117 / 167
Smoothing
Linear Smoothers I
A smoother is linear if
S(a y1 + b y2|x) = aS(y1|x) + b S(y2|x) (33)
for any constants a and b. That is,
y = Sy (34)
where S does not depend on y.
S is referred to as the influence (or smoother) matrix .
Examples: For the bin, running-mean, running-line, regression spline,cubic spline, kernel and local polynomial kernel smoothers are all linearsmoothers (with a fixed smoothing parameter).
T. W. Yee (University of Auckland) Modern Regression Basics 119/167October 2008 @ Cagliari 119 / 167
Smoothing
Linear Smoothers II
The theory for linear smoothers is much simpler than for nonlinearsmoothers. Many properties of smoothers seen by the eigenvalues andeigenvectors of S. For example, for a cubic spline, S(λ) has all eigenvaluesvalues in (0, 1] with exactly two unit eigenvalues with correspondingeigenvectors 1 and x. That is,
S1 = 1 1 andS x = 1 x.
These correspond to constant and linear functions.
T. W. Yee (University of Auckland) Modern Regression Basics 120/167October 2008 @ Cagliari 120 / 167
Smoothing
Linear Smoothers III● ●
●
●
●
●
●
●
●
●●
●●
●●
●● ● ● ●
1e−05
1e−03
1e−01
Figure: Eigenvalues of a cubic smoothing spline.
T. W. Yee (University of Auckland) Modern Regression Basics 121/167October 2008 @ Cagliari 121 / 167
Smoothing
Degrees of Freedom I
All smoothers allow the user to vary the amount of smoothing done viathe smoothing parameter, e.g., bandwidth, the span, or λ.
However, it would be useful to have some measure of the amount ofsmoothing done. One such measure is the effective degrees of freedom(EDF) of a smooth. It is useful for a number of reasons, e.g., comparingdifferent types of smoothers while keeping the amount of smoothingroughly equal.
The theory of EDF is a natural extension of standard results from thegeneral linear model
Y = Xβ + ε , Var(ε) = σ2 I.
Recall that, if β is p × 1, that
T. W. Yee (University of Auckland) Modern Regression Basics 123/167October 2008 @ Cagliari 123 / 167
Smoothing
Degrees of Freedom II
1 Y = Py where P = X(XTX)−1XT is idempotent of rank p. Thentrace(P) = p,
2 trace(Var(Y)) = σ2 trace(PPT ) = σ2p,
3 E [(n − p) S2] = E [ResSS] = E [(Y − Y)T (Y − Y)] = σ2(n − p).
By replacing P by S, these results suggest the following three definitionsfor the effective degrees of freedom of a smooth:
1 df = trace(S),
2 df var = trace(SST ) and
3 df err = n − trace(2S− SST ).
More generally, with weights W, these are
1 df = trace(S),
2 df var = trace(WSW−1ST ) and
3 df err = n − trace(2S− STWSW−1).
T. W. Yee (University of Auckland) Modern Regression Basics 124/167October 2008 @ Cagliari 124 / 167
Smoothing
Degrees of Freedom III
It can be shown that if S is a symmetric projection matrix then trace(S),trace(2S− SST ) and trace(SST ) coincide. For cubic splines, it can beshown that
trace(SST ) ≤ trace(S) ≤ trace(2S− SST )
and that all three of these functions are decreasing in λ.
Notes:
1 df is the most popular and easiest to compute. Cost of df is O(n) formost smoothers.
T. W. Yee (University of Auckland) Modern Regression Basics 125/167October 2008 @ Cagliari 125 / 167
Smoothing
Degrees of Freedom IV
2 2 ≤ the degrees of freedom ≤ the number of distinct xi .Linear fit = 2.The number of distinct xi = interpolant.As the degrees of freedom increases the fit becomes more wiggly. Asmooth with 3 degrees of freedom has approximately the sameflexibility as a quadratic. A value of 4 or 5 degrees of freedom is oftenused as the default value in software, as this can accommodate areasonable amount of nonlinearity without being excessive.
T. W. Yee (University of Auckland) Modern Regression Basics 126/167October 2008 @ Cagliari 126 / 167
Smoothing
Standard Errors I
f = S y =⇒ Var(f) = σ2 SST (35)
Can form pointwise SE bands for f (useful in preventing theover-interpretation of a plot of the estimated function).
But (35) impractical if n is large (all of S needed).
Trick: for cubic splines, Silverman (1985) uses a Bayesian derivation todiscuss the use of the alternative
σ2 S.
Cost is O(n)
T. W. Yee (University of Auckland) Modern Regression Basics 128/167October 2008 @ Cagliari 128 / 167
Smoothing
Equivalent Kernels I
Consider y = Sy for a linear smoother. Plotting the jth row of S versus xi
gives the weights used for the estimate yj . This mimics the kernel functionof a kernel smoother.
The EK for a cubic spline is
κ(u) =1
2exp
(− |u|√
2
)sin
(|u|√
2+π
4
)as n →∞ (Silverman, 1984).
T. W. Yee (University of Auckland) Modern Regression Basics 130/167October 2008 @ Cagliari 130 / 167
Smoothing
Equivalent Kernels II
−5 0 5
0.0
0.1
0.2
0.3
u
EK
If the design points xi have a local density g(x), and if x is not too nearthe boundary and λ is not too big or too small, then the local bandwidthh(x) satisfies
h(x) = {λ n g(x)}−1/4 .
T. W. Yee (University of Auckland) Modern Regression Basics 131/167October 2008 @ Cagliari 131 / 167
Smoothing
Automatic Smoothing Parameter Selection I
Choosing the bandwidth/smoothing parameter is the most importantdecision for a specified method. We want an automatic way of choosingthe ‘right’ smoothing parameter.
A popular method is cross-validation (CV), and restrict attention to linearsmoothers.
CV idea: leave point (xi , yi ) out one at a time and estimating the smoothat xi based on the remaining n − 1 points.
Choose λCV to minimize the cross-validation sum of squares
CV (λ) =1
n
n∑i=1
{yi − f −i
λ (xi )}2
(36)
T. W. Yee (University of Auckland) Modern Regression Basics 133/167October 2008 @ Cagliari 133 / 167
Smoothing
Automatic Smoothing Parameter Selection II
where f −iλ (xi ) is the fitted value at xi , computed by leaving out the ith
data point.
One can compute (36) naıvely. But there is a trick.
Define f −iλ (xi ) to be the fit obtained by setting the weight of the ith
observation to zero, and increasing the remaining weights so that they sumto unity, i.e.,
f −iλ (xi ) =
n∑j=1j 6=i
sij1− sii
yj . (37)
This means
f −iλ (xi ) =
n∑j=1j 6=i
sij yj + sii f−iλ (xi ) (38)
T. W. Yee (University of Auckland) Modern Regression Basics 134/167October 2008 @ Cagliari 134 / 167
Smoothing
Automatic Smoothing Parameter Selection III
and
yi − f −iλ (xi ) =
yi − fλ(xi )
1− sii. (39)
Thus, CV (λ) can be written
CV (λ) =1
n
n∑i=1
{yi − fλ (xi )
1− sii (λ)
}2
. (40)
So there is no need to compute f −iλ (xi ) naıvely.
In practice, CV sometimes gives questionable performance.
T. W. Yee (University of Auckland) Modern Regression Basics 135/167October 2008 @ Cagliari 135 / 167
Smoothing
Generalized Cross-Validation
A variant of CV (λ) is generalized cross-validation (GCV). The GCV idea isto replace sii by its average value trace(S)/n which is easier to compute:
GCV (λ) =1
n
n∑i=1
{yi − fλ (xi )
1− trace(S)/n
}2
.
GCV tends to undersmooth.
T. W. Yee (University of Auckland) Modern Regression Basics 136/167October 2008 @ Cagliari 136 / 167
Smoothing
Testing for nonlinearity
Suppose we wish to compare two smooths f1 = S1y and f2 = S2y. Forexample, the smooth f2 might be rougher than f1, and we wish to test if itpicks up any significant bias. A standard case that often arises is when f1is linear, in which case we want to test if the linearity is real. We mustassume that f2 is unbiased, and that f1 is unbiased under H0. LettingResSSj be the residual sum of squares for the jth smooth and γj betrace(2S− STS), then
(ResSS1 − ResSS2)/(γ2 − γ1)
ResSS2/(n − γ1)∼ Fγ2−γ1,n−γ1 (41)
approximately.
T. W. Yee (University of Auckland) Modern Regression Basics 137/167October 2008 @ Cagliari 137 / 167
Smoothing
The Curse of Dimensionality I
Sometimes multidimensional smoothers can work with a moderate numberof inputs. But the curse of dimensionality hinders them in higherdimensions:
local neighbourhoods are empty, or
nearest-neighbourhoods are not local
all points are close to the boundary
sample sizes need to grow exponentially.
T. W. Yee (University of Auckland) Modern Regression Basics 139/167October 2008 @ Cagliari 139 / 167
Smoothing
The Curse of Dimensionality II
That is, neigbourhoods with a fixed number of points become less local asthe dimensions increase. For fixed n, the data becomes more isolated ind-space and smoothers require a larger neighbourhood to find enough datapoints in order to calculate the variance of an estimate. Hence theestimate is no longer local and can be severely biased.
The following illustrates the curse of dimensionality. Suppose we have datauniformly distributed in a d dimensional unit cube. We spread out asubcube from the origin to capture span% of the data. What distance dowe have to reach out on each axis? The next figure gives the answer.Most reasonable high-dimensional procedures assume some structure.
T. W. Yee (University of Auckland) Modern Regression Basics 140/167October 2008 @ Cagliari 140 / 167
Smoothing
The Curse of Dimensionality III
0 20 40 60 80
0.0
0.2
0.4
0.6
0.8
1.0
span (%)
dist
ance
d=1
d=2d=3
d=10
Figure: Distance on each axis of a subcube required to capture span% of the dataof a d dimensional unit cube.
T. W. Yee (University of Auckland) Modern Regression Basics 141/167October 2008 @ Cagliari 141 / 167
Generalized Additive Models (GAMs)
Generalized Additive Models (GAMs) I
For general p, the linear model is
Y = β1X1 + · · ·+ βpXp + ε, ε ∼ N(0, σ2) independently. (42)
This model has some strong assumptions:
1 Linearity, i.e., the effect of each Xk on E (Y ) is linear,
2 Normal errors with zero mean, constant variance, and independent,
3 Additivity, i.e., Xk and X` do not interact; they have an additiveeffect on the response.
T. W. Yee (University of Auckland) Modern Regression Basics 143/167October 2008 @ Cagliari 143 / 167
Generalized Additive Models (GAMs)
Generalized Additive Models (GAMs) II
We relax the linearity assumption.
The linear predictor becomes an additive predictor :
η(x) = f1(x1) + · · ·+ fp(xp), (43)
a sum of arbitary smooth functions.
Additivity is still assumed. Easy to interpret.
Identifiability: the fk(xk) are centred.
Very useful for exploratory data analysis. Allows the data to“speak foritself”.
Some GAM books are Hastie and Tibshirani (1990) and Wood (2006).
T. W. Yee (University of Auckland) Modern Regression Basics 144/167October 2008 @ Cagliari 144 / 167
Generalized Additive Models (GAMs)
Generalized Additive Models (GAMs) I
Fit an additive model by backfitting . It is iterative procedure that smoothspartial residuals to f
E (y|x) = ft(xt) +
p∑k=1, k 6=t
fk(xk)
so
ft(xt) = E
y −p∑
k=1, k 6=t
fk(xk)
∣∣∣∣∣∣Xt
.
Modified backfitting possible—and is implemented. It decomposes
η(x) = Xβ +
p∑k=1
rk(xk)
i.e., into a linear and nonlinear components.T. W. Yee (University of Auckland) Modern Regression Basics 146/167October 2008 @ Cagliari 146 / 167
Generalized Additive Models (GAMs)
Generalized Additive Models (GAMs) I
Example 1 Kauri data
Y = presence/absence of a tree species, agaaus, which is Agathisaustralis, better known as Kauri, NZ’s most famous tree. Data is from 392sites from the Hunua forest near Auckland.
Figure: Big Kauri tree.
T. W. Yee (University of Auckland) Modern Regression Basics 148/167October 2008 @ Cagliari 148 / 167
Generalized Additive Models (GAMs)
Generalized Additive Models (GAMs) I
174.2 174.6 175.0
−37.4
−37.2
−37.0
−36.8
−36.6
−36.4
Longitude
Latit
ude
●W
H
Figure: The Hunua and Waitakere Ranges.
T. W. Yee (University of Auckland) Modern Regression Basics 150/167October 2008 @ Cagliari 150 / 167
Generalized Additive Models (GAMs)
Generalized Additive Models (GAMs) II
logit P[Yagaaus = 1] = f (altitude)
where f is a smooth function determined from the data.
> data(hunua)
> fit.h = vgam(agaaus ~ s(altitude), binomialff,
+ hunua)
> plot(fit.h, se = TRUE, lcol = "blue", scol = "red",
+ llwd = 2, slwd = 2)
> o = with(hunua, order(altitude))
> with(hunua, plot(altitude[o], fitted(fit.h)[o],
+ type = "l", ylim = 0:1, lwd = 2, col = "blue",
+ xlab = "altitude", ylab = "Fitted value"))
> with(hunua, points(altitude, agaaus + (runif(nrow(hunua)) -
+ 0.5)/30, col = "red"))
T. W. Yee (University of Auckland) Modern Regression Basics 151/167October 2008 @ Cagliari 151 / 167
Generalized Additive Models (GAMs)
Generalized Additive Models (GAMs) III
0 100 200 300 400 500 600
−8
−6
−4
−2
0
2
altitude
s(al
titud
e)
0 100 200 300 400 500 600
0.0
0.2
0.4
0.6
0.8
1.0
altitudeF
itted
val
ue
●●●●●●●
●●
●
●●
●
●
●
●
●
●
●
●
●● ●●
●
● ●
●
●●●
●
●● ●
●●
●●●● ●●● ●●
●●●●
●
● ●●
●
●●●
●
●●●●●●
●●
●●●●●
●
●
●
●● ●●●●●
●
●●
●
●
●
●●
●
●
●
●●●●
●● ●●
●
●
●●
●
●
● ●
●
● ●●
●●
● ●● ●
●
●●
●
● ●●
●
●●
●
● ●●●●● ●●● ●
●
●
● ●
●
●
●●
●
●●●●●
●
●
●●●
●●●●●●
●●●●●
●
●●●●●● ●●●●● ●●●●● ● ●●●●● ● ●●
●●●●●●
●
●●
●
●●●● ●● ●●●●●●● ●●●●●●●● ● ● ●●
●●
●● ●● ●●
●●●
●
●
●●●
●
● ●●● ●● ●
●
● ● ● ●●
●●
●●● ●
●
●
●●
●●
●●
●
●
● ●
●● ●●●
●●
● ●
●
●●
● ● ●● ● ●●●●● ● ● ● ●●●●●
● ●
●● ●●
●●●●●
● ●●●●● ●●●●
●
●●
● ●● ●●●●●
●
●● ● ● ●●● ● ●
●●● ● ●●●
● ●●●●●● ● ●●●
●●●●●●●●●●●●●●●●● ●●●●●
●
Figure: GAM plots of Kauri data in the Hunua ranges.
T. W. Yee (University of Auckland) Modern Regression Basics 152/167October 2008 @ Cagliari 152 / 167
Generalized Additive Models (GAMs)
Generalized Additive Models (GAMs) IV
Example 2 NZ Chinese data
> with(nzc, plot(year, female/(male + female), ylab = "Proportion female",
+ col = "blue", las = 1))
> abline(h = 0.5, lty = "dashed")
> fit.nzc = vgam(cbind(female, male) ~ s(year, df = 3),
+ fam = binomialff, data = nzc)
> with(nzc, lines(year, fitted(fit.nzc), col = "red"))
● ● ● ● ● ●● ● ●
●
●
●
●
●
●
●
●●
●
●
●
●●
●● ●
1880 1900 1920 1940 1960 1980 2000
0.0
0.1
0.2
0.3
0.4
0.5
year
Pro
port
ion
fem
ale
T. W. Yee (University of Auckland) Modern Regression Basics 153/167October 2008 @ Cagliari 153 / 167
Generalized Additive Models (GAMs)
GAMs I
There are some packages in R which fit GAMs. These include
gam written by Trevor Hastie, is similar to the S-PLUS version
mgcv by Wood, has an emphasis on smoothing parameter selection
gamlss by Londoners. Limited.
T. W. Yee (University of Auckland) Modern Regression Basics 155/167October 2008 @ Cagliari 155 / 167
Introduction to VGLMs and VGAMs
Introduction to VGLMs and VGAMs I
The VGAM package implements several large classes of regression modelsof which vector generalized linear and additive models (VGLMs/VGAMs)are most commonly used.
The primary key words are
maximum likelihood estimation,
iteratively reweighted least squares (IRLS),
Fisher scoring,
additive models.
Other concepts are
reduced rank regression,
constrained ordination,
vector smoothing.
T. W. Yee (University of Auckland) Modern Regression Basics 157/167October 2008 @ Cagliari 157 / 167
Introduction to VGLMs and VGAMs
Introduction to VGLMs and VGAMs II
Basically, VGLMs model each parameter, transformed if necessary, as alinear combination of the explanatory variables. That is,
gj(θj) = ηj = βTj x (44)
where gj is a parameter link function.
VGAMs extend (44) to
gj(θj) = ηj =
p∑k=1
f(j)k(xk), (45)
i.e., an additive model for each parameter.
T. W. Yee (University of Auckland) Modern Regression Basics 158/167October 2008 @ Cagliari 158 / 167
Introduction to VGLMs and VGAMs
Introduction to VGLMs and VGAMs III
Example: negative binomial
Y has a probability function that can be written as
P(Y = y) =
(y + k − 1
y
) (µ
µ+ k
)y ( k
k + µ
)k
where y = 0, 1, 2, . . .. Parameters µ (= E (Y )) > 0 and k > 0.
The MASS implementation is restricted to a intercept-only estimate of k,e.g., cannot fit log k = β(2)1 + β(2)2x2. In contrast, VGAM can fit
log µ = η1 = βT1 x,
log k = η2 = βT2 x.
using vglm(..., negbinomial(zero=NULL)).
T. W. Yee (University of Auckland) Modern Regression Basics 159/167October 2008 @ Cagliari 159 / 167
Introduction to VGLMs and VGAMs
Introduction to VGLMs and VGAMs IV
Note that it is natural for any positive parameter to have the log link asthe default.
There are also variants of the negative binomial in the VGAM package,e.g., the zero-inflated and zero-altered versions.
T. W. Yee (University of Auckland) Modern Regression Basics 160/167October 2008 @ Cagliari 160 / 167
Introduction to VGLMs and VGAMs
Introduction to VGLMs and VGAMs V
The new framework extends GLMs and GAMs in three main ways:
(i) multivariate responses y and/or linear/additive predictors η arehandled,
(ii) y not restricted to the exponential family,
(iii) ηj need not be a function of a mean µ: ηj = gj(θj) for anyparameter θj .
This formulation is deliberately general so that it encompasses as manydistributions and models as possible. We wish to be limited only by theassumption that the regression coefficients enter through a set of linear oradditive predictors.
Given the covariates, the conditional distribution of the response isintended to be completely general.
T. W. Yee (University of Auckland) Modern Regression Basics 161/167October 2008 @ Cagliari 161 / 167
Introduction to VGLMs and VGAMs
Introduction to VGLMs and VGAMs VI
The scope of VGAM is very broad; it potentially covers
univariate and multivariate distributions,
categorical data analysis,
quantile and expectile regression,
time series,
survival analysis,
mixture models,
extreme value analysis,
nonlinear regression,
. . . .
It conveys GLM/GAM-type modelling to a much broader range of models.
T. W. Yee (University of Auckland) Modern Regression Basics 162/167October 2008 @ Cagliari 162 / 167
Introduction to VGLMs and VGAMs
Introduction to VGLMs and VGAMs VII
LM
RR−VGLM
RR−VLM
VLM
VGLM VGAM
RR−VGAM
QRR−VGLM
VAM
Generalized
Normal errors
Parametric Nonparametric
Figure: Flowchart for different classes of models. Legend: LM = linear model, V= vector, G = generalized, A = additive, RR = reduced-rank, Q = quadratic.
T. W. Yee (University of Auckland) Modern Regression Basics 163/167October 2008 @ Cagliari 163 / 167
Introduction to VGLMs and VGAMs
Introduction to VGLMs and VGAMs VIII
tη Model S function Reference
BT1 x1 + BT
2 x2 (= BT x) VGLM vglm() Yee & Hastie (2003)
BT1 x1 +
p1+p2Pk=p1+1
Hk f∗k (xk ) VGAM vgam() Yee & Wild (1996)
BT1 x1 + A ν RR-VGLM rrvglm() Yee & Hastie (2003)
BT1 x1 + A ν +
0BBB@νT D1ν
.
.
.
νT DMν
1CCCA QRR-VGLM cqo() Yee (2004)
BT1 x1 +
RPr=1
fr (νr ) RR-VGAM cao() Yee (2006)
Table: A summary of VGAM and its framework. The latent variables ν = CTx2,or ν = cTx2 if rank R = 1. Here, xT = (xT
1 , xT2 ). Abbreviations: A = additive, C
= constrained, L = linear, O = ordination, Q = quadratic, RR = reduced-rank,VGLM = vector generalized linear model.
T. W. Yee (University of Auckland) Modern Regression Basics 164/167October 2008 @ Cagliari 164 / 167
Concluding Remarks
Concluding Remarks
1 The LM is fundamental and basic to statistical regression modelling.
2 GLMs provide an important unification of several important regressionmodels, e.g., normal, Poisson, binomial, . . . .
3 Smoothing allows the data to“speak for themselves”.
4 GAMs are a very useful data-driven variant of GLMs. Acombinination of GLMs + smoothing. But only c.6 models.
5 VGLMs/VGAMs are GLMs/GAMs but applied to a much wider classof models. We will look at VGLMs/VGAMs in detail later.
T. W. Yee (University of Auckland) Modern Regression Basics 165/167October 2008 @ Cagliari 165 / 167
Concluding Remarks
Finito
T. W. Yee (University of Auckland) Modern Regression Basics 166/167October 2008 @ Cagliari 166 / 167