Beyond Mean Regression
Thomas Kneib
Lehrstuhl für Statistik, Georg-August-Universität Göttingen
8.3.2013 Innsbruck
Thomas Kneib Introduction
Introduction
• One of the top ten reasons to become a statistician (according to Friedman, Friedman & Amoo):
Statisticians are mean lovers.
⇒ Focus on means, in particular in regression models, to reduce complexity.
• Obviously, a mean is not sufficient to fully describe a distribution.
Beyond Mean Regression 1
• Usual regression models are based on data (yi, zi) for a continuous response variable y and covariates z:
yi = ηi + εi,
where ηi is a regression predictor formed in terms of the covariates zi.
• Assumptions on the error term:
E(εi) = 0, Var(εi) = σ²,
or εi ∼ N(0, σ²).
• The assumptions on the error term imply the following properties of the response distribution:
– The predictor determines the expectation of the response:
E(yi|zi) = ηi.
– Homoscedasticity of the response:
Var(yi|zi) = σ².
– Parallel quantile curves of the response (if the errors are also normal):
Qτ(yi|zi) = ηi + zτ·σ, where zτ is the τ-quantile of the standard normal distribution.
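The parallel-quantile property can be checked numerically. The following minimal sketch (illustrative, not from the talk) uses Python's `statistics.NormalDist`; the values of `eta` and `sigma` are arbitrary assumptions:

```python
from statistics import NormalDist

def normal_quantile_curve(eta, sigma, tau):
    """tau-quantile of y given z under y = eta + eps, eps ~ N(0, sigma^2)."""
    z_tau = NormalDist().inv_cdf(tau)  # tau-quantile of the standard normal
    return eta + z_tau * sigma

# Parallel curves: the gap between two quantile curves does not depend on eta.
gap_low = normal_quantile_curve(0.0, 2.0, 0.9) - normal_quantile_curve(0.0, 2.0, 0.1)
gap_high = normal_quantile_curve(10.0, 2.0, 0.9) - normal_quantile_curve(10.0, 2.0, 0.1)
assert abs(gap_low - gap_high) < 1e-9
```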
• Why could this be problematic?
– The variance of the responses may depend on covariates (heteroscedasticity).
– Other higher-order characteristics (skewness, kurtosis, . . . ) of the responses may depend on covariates.
– Generic interest in extreme observations or the complete conditional distribution of the response.
• Example: Munich rental guide (illustrative application in this talk).
– Explain the net rent for a specific flat in terms of covariates such as living area or year of construction.
– Published to give reference intervals of usual rents for both tenants and landlords.
⇒ We are not interested in average rents but rather in an interval covering typical rents.
[Figure: scatterplots of rent in Euro against living area (20 to 160 sqm) and against year of construction (1920 to 2000).]
• Some further examples:
– Analysing childhood BMI patterns in (post-)industrialized countries, where interest is mainly on extreme forms of overweight (obesity).
– Studying covariate effects on extreme forms of malnutrition in developing countries.
– Efficiency estimation in agricultural production, where interest is on evaluating above-average performance of farms.
– Modelling gas flow networks, where the behavior of the network in high or low demand situations shall be studied.
• More flexible regression approaches considered in the following:
– Regression models for location, scale and shape.
– Quantile regression.
– Expectile regression.
• Regression models for location, scale and shape:
– Retain the assumption of a specific error distribution but allow covariate effects not only on the mean.
– Simplest example: regression for the mean and variance of a normal distribution, where
yi = ηi1 + exp(ηi2)εi, εi ∼ N(0, 1),
such that E(yi|zi) = ηi1 and Var(yi|zi) = exp(ηi2)².
– In general: specify a distribution for the response, where (potentially) all parameters are related to predictors.
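The location-scale construction can be verified by simulation. A minimal sketch (not from the talk; the predictor values `eta1`, `eta2` are arbitrary assumptions for a single covariate setting):

```python
import math
import random

random.seed(1)

eta1, eta2 = 2.0, 0.5  # hypothetical location predictor and log-scale predictor

# Simulate y = eta1 + exp(eta2) * eps with eps ~ N(0, 1).
n = 200_000
y = [eta1 + math.exp(eta2) * random.gauss(0.0, 1.0) for _ in range(n)]

mean_y = sum(y) / n
var_y = sum((v - mean_y) ** 2 for v in y) / n

# Empirical mean and standard deviation match eta1 and exp(eta2).
assert abs(mean_y - eta1) < 0.03
assert abs(math.sqrt(var_y) - math.exp(eta2)) < 0.03
```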
• Quantile and expectile regression:
– Drop the parametric assumption for the error / response distribution and instead estimate separate models for different asymmetries τ ∈ (0, 1):
yi = ηiτ + εiτ.
– Instead of assuming E(εiτ) = 0, we can for example assume
P(εiτ ≤ 0) = τ,
i.e. the τ-quantile of the error term is zero.
– Yields a regression model for the quantiles of the response.
– A dense set of quantiles completely characterizes the conditional distribution of the response.
– Expectiles are a computationally attractive alternative to quantiles.
• Estimated quantile curves for the Munich rental guide with a linear effect of living area and a quadratic effect of year of construction.
– Homoscedastic linear model:
[Figure: estimated quantile curves of rent in Euro against living area and year of construction.]
– Heteroscedastic linear model:
[Figure: estimated quantile curves of rent in Euro against living area and year of construction.]
– Quantile regression:
[Figure: estimated quantile curves of rent in Euro against living area and year of construction.]
• Usually, modern regression data contain more complex structures such that linear predictors are not enough.
• For example, in the Munich rental guide
– the effects of living area and size of the flat may be of complex nonlinear form (instead of simply polynomial) and
– a spatial effect based on the subquarter information may be included to capture effects of missing covariates and spatial correlation.
⇒ Consider semiparametric extensions.
Overview for the Rest of the Talk
• Semiparametric Predictor Specifications.
• More on Models:
– Generalized Additive Models for Location, Scale and Shape.
– Quantile Regression.
– Expectile Regression.
• Inferential Procedures & Comparison of the Approaches.
Semiparametric Regression
• Semiparametric regression provides a generic framework for flexible regression models with predictor
η = β0 + f1(z) + . . . + fr(z)
where f1, . . . , fr are generic functions of the covariate vector z.
• Types of effects:
– Linear effects: f(z) = x′β.
– Nonlinear, smooth effects of continuous covariates: f(z) = f(x).
– Varying coefficients: f(z) = uf(x).
– Interaction surfaces: f(z) = f(x1, x2).
– Spatial effects: f(z) = fspat(s).
– Random effects: f(z) = bc with cluster index c.
• Generic model description based on
– a design matrix Zj, such that the vector of function evaluations fj = (fj(z1), . . . , fj(zn))′ can be written as
fj = Zjγj,
– a quadratic penalty term
pen(fj) = pen(γj) = γ′jKjγj
which operationalises smoothness properties of fj.
• From a Bayesian perspective, the penalty term corresponds to a multivariate Gaussian prior
p(γj) ∝ exp(−γ′jKjγj / (2δ²j)).
• Estimation then relies on a penalised fit criterion, e.g.
∑i=1..n (yi − ηi)² + ∑j=1..r λj γ′jKjγj
with smoothing parameters λj ≥ 0.
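The penalised criterion above has the closed-form minimiser γj = (Z′jZj + λjKj)⁻¹Z′jy per component. A minimal sketch in plain Python (illustrative toy design and data, not the talk's implementation):

```python
# Penalised least squares gamma = (Z'Z + lambda * K)^(-1) Z'y on a toy design.

def transpose(A):
    return [list(r) for r in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def penalised_ls(Z, y, K, lam):
    Zt = transpose(Z)
    ZtZ = matmul(Zt, Z)
    p = len(ZtZ)
    A = [[ZtZ[i][j] + lam * K[i][j] for j in range(p)] for i in range(p)]
    Zty = [sum(Zt[i][k] * y[k] for k in range(len(y))) for i in range(p)]
    return solve(A, Zty)

# Toy data with two coefficients and a first-order difference penalty K = D'D.
Z = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0.1, 1.2, 1.9, 3.1]
D = [[-1.0, 1.0]]  # first-order difference matrix
K = matmul(transpose(D), D)

gamma_ols = penalised_ls(Z, y, K, 0.0)  # lambda = 0: ordinary least squares
gamma_pen = penalised_ls(Z, y, K, 0.5)  # penalty shrinks gamma[1] - gamma[0]
```

With `lam = 0` the estimate reduces to ordinary least squares; increasing `lam` shrinks the penalised difference of adjacent coefficients towards zero.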
• Example 1. Penalised splines for nonlinear effects f(x):
– Approximate f(x) in terms of a linear combination of B-spline basis functions
f(x) = ∑k γk Bk(x).
– Large variability in the estimates corresponds to large differences in adjacent coefficients, yielding the penalty term
pen(γ) = ∑k (∆dγk)² = γ′D′dDdγ
with difference operator ∆d and difference matrix Dd of order d.
– The corresponding Bayesian prior is a random walk of order d, e.g.
γk = γk−1 + uk, γk = 2γk−1 − γk−2 + uk
with uk i.i.d. N(0, δ²).
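The difference matrix Dd can be built by applying first differences d times to an identity matrix. A minimal sketch (illustrative, with made-up coefficient values):

```python
# Build the d-th order difference matrix D_d and check that the penalty
# gamma' D_d' D_d gamma equals the sum of squared d-th differences.

def diff_matrix(k, d):
    """d-th order difference matrix for k coefficients ((k - d) x k)."""
    D = [[float(i == j) for j in range(k)] for i in range(k)]  # identity
    for _ in range(d):
        D = [[D[i + 1][j] - D[i][j] for j in range(k)] for i in range(len(D) - 1)]
    return D

gamma = [1.0, 2.0, 4.0, 7.0, 11.0]
D2 = diff_matrix(5, 2)
Dg = [sum(D2[i][j] * gamma[j] for j in range(5)) for i in range(len(D2))]
pen = sum(v * v for v in Dg)
# Second differences of 1, 2, 4, 7, 11 are 1, 1, 1, so pen = 3.
assert abs(pen - 3.0) < 1e-9
```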
• Example 2. Markov random fields for the estimation of spatial effects based on regional data:
– Estimate a separate regression coefficient γs for each region, i.e. f = Zγ with
Z[i, s] = 1 if observation i belongs to region s, and 0 otherwise.
– Penalty term based on differences of neighboring regions:
pen(γ) = ∑s ∑r∈N(s) (γs − γr)² = γ′Kγ
where N(s) is the set of neighbors of region s and K is a penalty matrix derived from the adjacency structure of the regions.
– An equivalent Bayesian prior structure is obtained based on Gaussian Markov random fields.
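The Markov random field penalty is easy to evaluate directly from a neighborhood structure. A minimal sketch on an invented three-region map (not from the talk):

```python
# MRF penalty on a tiny map: regions 0-1 are neighbours, regions 1-2 are
# neighbours. The double sum over s and r in N(s) counts each pair twice.

neighbours = {0: [1], 1: [0, 2], 2: [1]}
gamma = [0.5, 1.0, 3.0]  # hypothetical regional effects

pen = sum((gamma[s] - gamma[r]) ** 2
          for s in neighbours for r in neighbours[s])

# 2 * ((0.5 - 1.0)^2 + (1.0 - 3.0)^2) = 2 * (0.25 + 4.0) = 8.5
assert abs(pen - 8.5) < 1e-9
```

Large differences between neighbouring effects are penalised, which encourages spatially smooth estimates.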
Inferential Procedures
• For each of the three model classes discussed in the following, we will consider three potential avenues for inference:
– Direct optimization of a fit criterion (e.g. maximum likelihood estimation for GAMLSS).
– Bayesian approaches.
– Functional gradient descent boosting.
• Functional gradient descent boosting:
– Define the estimation problem in terms of a loss function ρ (e.g. the negative log-likelihood).
– Use the negative gradients of the loss function evaluated at the current fit as a measure for lack of fit.
– Iteratively fit simple base-learning procedures to the negative gradients to update the model fit.
– Componentwise updates of only the best-fitting model component yield automatic variable selection and model choice.
– For semiparametric regression, penalized least squares estimates provide suitable base-learners.
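The steps above can be sketched for the simplest case, L2 loss with univariate linear base-learners (a toy illustration, not the talk's implementation; data and step size are arbitrary assumptions):

```python
# Componentwise functional gradient boosting with L2 loss: in each step, fit
# a simple base-learner (univariate least squares through the origin) to the
# negative gradient (here: the residuals) and update only the best component.

def componentwise_l2_boost(X, y, steps=200, nu=0.1):
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    fit = [0.0] * n
    for _ in range(steps):
        resid = [yi - fi for yi, fi in zip(y, fit)]  # negative gradient of L2 loss
        best_j, best_sse, best_beta = 0, float("inf"), 0.0
        for j in range(p):
            xj = [row[j] for row in X]
            beta = sum(a * b for a, b in zip(xj, resid)) / sum(a * a for a in xj)
            sse = sum((r - beta * x) ** 2 for r, x in zip(resid, xj))
            if sse < best_sse:  # componentwise selection of the best fit
                best_j, best_sse, best_beta = j, sse, beta
        coef[best_j] += nu * best_beta  # damped update of one component only
        fit = [sum(c * x for c, x in zip(coef, row)) for row in X]
    return coef

# y depends (almost) only on the first covariate; boosting should select it.
X = [[1.0, 0.3], [2.0, -0.5], [3.0, 0.1], [4.0, 0.4], [5.0, -0.2]]
y = [2.0, 4.1, 5.9, 8.2, 9.9]
coef = componentwise_l2_boost(X, y)
```

Because only the best-fitting component is updated in each step, covariates that never get selected keep a zero coefficient, which is the variable-selection effect mentioned above.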
Generalized Additive Models for Location, Scale and Shape
• GAMLSS provide a unified framework for semiparametric regression models in the case of complex response distributions depending on up to four parameters (µi, σi, νi, ξi), where usually
– µi is the location parameter,
– σi is the scale parameter, and
– νi and ξi are shape parameters determining for example skewness or kurtosis.
• Each parameter is related to a regression predictor via a suitable response function, i.e.
µi = h1(ηi,µ), σi = h2(ηi,σ), . . .
• A very broad class of distributions is supported for both discrete and continuous responses.
• Most important examples for continuous responses:
– Two-parameter normal distribution (location and scale).
– Three-parameter power exponential distribution (location, scale and kurtosis).
– Three-parameter t distribution (location, scale and degrees of freedom).
– Three-parameter gamma distribution (location, scale and shape).
– Four-parameter Box-Cox power distribution (location, scale, skewness and kurtosis).
• Direct optimization:
– For GAMLSS, the likelihood is available due to the explicit assumption made for the distribution of the response.
– Maximization can be achieved by penalized iteratively weighted least squares (IWLS) estimation.
– Estimation and choice of the smoothing parameters is challenging, at least for complex models.
• Bayesian inference:
– Inference based on Markov chain Monte Carlo (MCMC) simulations is in principle straightforward but requires careful choice of the proposal densities.
– Promising results have been obtained based on IWLS proposals.
– Smoothing parameter choice is immediately included.
• Boosting:
– Due to the multiple predictors, the usual boosting framework has to be adapted but basically still works.
• Results for the Munich rental guide obtained with an additive model for location and scale:
[Figure: estimated effects on the mean of area in sqm and year of construction.]
[Figure: estimated effects on the standard deviation of area in sqm and year of construction.]
Quantile Regression
• The theoretical τ-quantile qτ of a continuous random variable is characterized by
P(Y ≤ qτ) ≥ τ and P(Y ≥ qτ) ≥ 1 − τ.
• Estimation of quantiles based on i.i.d. samples y1, . . . , yn can be accomplished by
q̂τ = argminq ∑i=1..n wτ(yi, q)|yi − q|
with asymmetric weights
wτ(yi, q) = 1 − τ if yi < q, 0 if yi = q, and τ if yi > q.
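This minimisation can be checked by brute force on a small sample. A minimal sketch (illustrative data, not from the talk):

```python
# The minimiser of the asymmetric absolute loss over a sample is an empirical
# tau-quantile; here a brute-force search over the sample values themselves.

def w(tau, y, q):
    if y < q:
        return 1.0 - tau
    if y > q:
        return tau
    return 0.0

def quantile_loss(sample, q, tau):
    return sum(w(tau, y, q) * abs(y - q) for y in sample)

sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
tau = 0.2
best = min(sample, key=lambda q: quantile_loss(sample, q, tau))
# For tau = 0.2 and n = 10, any point between the 2nd and 3rd order statistic
# minimises the loss, so the search returns one of those boundary values.
assert best in (2.0, 3.0)
```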
• Plot of the weighted losses wτ(y, q)|y − q| (for q = 0)
• Quantile regression starts with the regression model
yi = ηiτ + εiτ .
• Instead of assuming E(εiτ) = 0 as in mean regression, we assume
Fεiτ(0) = P(εiτ ≤ 0) = τ,
i.e. the τ-quantile of the error is zero.
• This implies that the predictor coincides with the τ-quantile of the conditional distribution of the response, i.e.
Fyi(ηiτ) = P(yi ≤ ηiτ) = τ.
• Quantile regression therefore
– is distribution-free since it does not make any specific assumptions on the type of errors,
– does not even require i.i.d. errors, and
– allows for heteroscedasticity.
• Note that each parametric regression model also induces a quantile regression model.
• Example: the heteroscedastic normal model
y ∼ N(η1, exp(η2)²)
yields
qτ = η1 + exp(η2)·zτ.
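This induced quantile can be verified directly against the quantile function of the normal distribution, e.g. with Python's `statistics.NormalDist` (a sketch with arbitrary predictor values):

```python
import math
from statistics import NormalDist

# The tau-quantile induced by the heteroscedastic normal model
# y ~ N(eta1, exp(eta2)^2) is q_tau = eta1 + exp(eta2) * z_tau.
eta1, eta2, tau = 5.0, 0.3, 0.75

z_tau = NormalDist().inv_cdf(tau)          # standard normal tau-quantile
q_tau = eta1 + math.exp(eta2) * z_tau      # induced quantile

# Check against the quantile of the response distribution directly.
direct = NormalDist(mu=eta1, sigma=math.exp(eta2)).inv_cdf(tau)
assert abs(q_tau - direct) < 1e-9
```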
• Direct optimisation:
– Classical estimation is achieved by minimizing
∑i=1..n wτ(yi, ηiτ)|yi − ηiτ| + ∑j=1..p λj pen(fj).
– Can be solved with linear programming as long as the penalties are also linear functionals, e.g. for total variation penalization
pen(fj) = ∫ |f′′j(x)| dx.
– Does not fit well with the class of quadratic penalties we are considering.
– Smoothing parameter selection is still challenging, in particular with multiple smoothing parameters.
• Bayesian inference:
– Although quantile regression is distribution-free, there is an auxiliary error distribution that links ML estimation to quantile regression.
– Assume an asymmetric Laplace distribution for the responses, i.e.
yi ∼ ALD(ηiτ, σ², τ)
with density proportional to
exp(−wτ(yi, ηiτ) |yi − ηiτ| / σ²).
– Maximizing the resulting likelihood
exp(−∑i=1..n wτ(yi, ηiτ) |yi − ηiτ| / σ²)
is equivalent to minimizing the quantile loss criterion.
– A computationally attractive way of working with the ALD in a Bayesian framework is its scale mixture representation.
– If zi | σ² ∼ Exp(1/σ²) and
yi | zi, ηiτ, σ² ∼ N(ηiτ + ξzi, σ²/wi)
with
ξ = (1 − 2τ) / (τ(1 − τ)), wi = 1/(δ²zi), δ² = 2 / (τ(1 − τ)),
then yi is marginally ALD(ηiτ, σ², τ) distributed.
– Allows one to construct efficient Gibbs samplers or variational Bayes approximations to explore the posterior after imputing the zi as additional unknowns.
• Boosting:
– Boosting can be immediately applied in the quantile regression context since it is formulated in terms of a loss function.
– Negative gradients are defined almost everywhere, i.e. no conceptual problems.
• Results for a geoadditive Bayesian quantile regression model:
[Figure: maps of the estimated spatial effects for τ = 0.1, 0.2, 0.5 and 0.9, each on a common scale from −150 to 150.]
[Figure: estimated nonlinear effects f(living area) and f(year of construction) for several quantiles.]
Expectile Regression
• What is expectile regression? Compare the least-squares-type criteria:
– Median regression: minimize ∑i=1..n |yi − ηi|.
– Mean regression: minimize ∑i=1..n (yi − ηi)².
– Quantile regression: minimize ∑i=1..n wτ(yi, ηiτ)|yi − ηiτ|.
– Expectile regression: minimize ∑i=1..n wτ(yi, ηiτ)(yi − ηiτ)².
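For an i.i.d. sample, the τ-expectile minimising the asymmetrically weighted squared loss can be computed by iterating weighted means, a scalar version of the iteratively weighted least squares idea. A minimal sketch (illustrative data, not from the talk):

```python
# The tau-expectile solves e = sum(w_i * y_i) / sum(w_i) with weights
# w_i = tau for y_i > e and w_i = 1 - tau for y_i < e; iterate to a fixed point.

def expectile(sample, tau, iters=100):
    e = sum(sample) / len(sample)  # start from the mean (the 0.5-expectile)
    for _ in range(iters):
        wts = [tau if y > e else (1.0 - tau) if y < e else 0.5 for y in sample]
        e = sum(w * y for w, y in zip(wts, sample)) / sum(wts)
    return e

sample = [1.0, 2.0, 3.0, 4.0, 10.0]
# tau = 0.5 recovers the ordinary mean.
assert abs(expectile(sample, 0.5) - 4.0) < 1e-9
# Larger asymmetry moves the expectile towards the upper tail.
assert expectile(sample, 0.9) > expectile(sample, 0.5)
```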
• Theoretical expectiles are obtained by solving
τ = ∫_{−∞}^{eτ} |y − eτ| fy(y) dy / ∫_{−∞}^{∞} |y − eτ| fy(y) dy = (Gy(eτ) − eτFy(eτ)) / (2(Gy(eτ) − eτFy(eτ)) + (eτ − µ))
where
– fy(·) and Fy(·) denote the density and cumulative distribution function of y,
– Gy(e) = ∫_{−∞}^{e} y fy(y) dy is the partial moment function of y, and
– Gy(∞) = µ is the expectation of y.
• Direct optimization:
– Since the expectile loss is differentiable, estimates for the basis coefficients can be obtained by iterating the weighted least squares update
γ[t+1]jτ = (Z′j W[t]τ Zj + λjKj)⁻¹ Z′j W[t]τ y.
– A combination with mixed model methodology allows one to estimate the smoothing parameters.
• Bayesian inference:
– Similarly as for quantile regression, an asymmetric normal distribution can be defined as an auxiliary distribution for the responses.
– No scale mixture representation is known so far.
– The Bayesian formulation is probably less important since inference is directly tractable.
• Boosting:
– Boosting can be immediately applied in the expectile regression context.
Comparison
• Advantages of GAMLSS:
– One joint model for the distribution of the responses.
– Interpretability of the estimated effects in terms of parameters of the response distribution.
– Quantiles (or expectiles) derived from GAMLSS will always be coherent, i.e. their ordering will be preserved.
– Readily available in both frequentist and Bayesian formulations.
• Disadvantages of GAMLSS:
– Potential for misspecification of the observation model.
– Model checking is difficult in complex settings.
– If quantiles are of ultimate interest, GAMLSS do not provide direct estimates for these.
• Advantages of quantile regression:
– Completely distribution-free approach.
– Easy interpretation in terms of conditional quantiles.
– The Bayesian formulation enables very flexible, fully data-driven semiparametric specifications of the predictor.
• Disadvantages of quantile regression:
– The Bayesian formulation requires an auxiliary error distribution (which will usually be a misspecification).
– The estimated cumulative distribution function is a step function even for continuous data.
– Additional efforts are required to avoid crossing of quantile curves.
• Advantages of expectile regression:
– Computationally simple (iteratively weighted least squares).
– Still allows to characterize the complete conditional distribution of the response.
– Quantiles (or conditional distributions) can be computed based on expectiles.
– Expectiles seem to be more efficient in close-to-Gaussian situations than quantiles.
– Expectile crossing seems to be less of an issue as compared to quantile crossing.
– The estimated expectile curve is smooth.
• Disadvantages of expectile regression:
– Immediate interpretation of expectiles is difficult.
Summary
• There is more than mean regression!
• Semiparametric extensions are also becoming available for models beyond mean regression.
• You can do this at home:
– Quantile regression: R-package quantreg.
– Bayesian quantile regression: BayesX (MCMC) and a forthcoming R-package on variational Bayes approximations (VA).
– GAMLSS: R-packages gamlss and gamboostLSS.
– Expectile regression: R-package expectreg.
• An interesting addition to the models considered: modal regression (yet to be explored).
• Acknowledgements:
– This talk is mostly based on joint work with Nora Fenske, Benjamin Hofner, Torsten Hothorn, Göran Kauermann, Stefan Lang, Andreas Mayr, Matthias Schmid, Linda Schulze Waltrup, Fabian Sobotka, Elisabeth Waldmann and Yu Yue.
– Financial support has been provided by the German Research Foundation (DFG).
• A place called home:
http://www.statoek.wiso.uni-goettingen.de