Atmospheric Environment 39 (2005) 6223–6230

www.elsevier.com/locate/atmosenv

doi:10.1016/j.atmosenv.2005.07.004

A critical assessment of shrinkage-based regression approaches for estimating the adverse health effects of multiple air pollutants

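Steven Roberts*, Michael Martin

School of Finance and Applied Statistics, Faculty of Economics and Commerce, Australian National University, Canberra ACT 0200, Australia

*Corresponding author. Tel.: +61 2 6125 3470; fax: +61 2 6125 0087. E-mail address: [email protected] (S. Roberts).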

Received 8 April 2005; received in revised form 23 June 2005; accepted 2 July 2005

Abstract

Most investigations of the adverse health effects of multiple air pollutants analyse the time series involved by

simultaneously entering the multiple pollutants into a Poisson log-linear model. Concerns have been raised about this

type of analysis, and it has been stated that new methodology or models should be developed for investigating the

adverse health effects of multiple air pollutants. In this paper, we introduce the use of the lasso for this purpose and

compare its statistical properties to those of ridge regression and the Poisson log-linear model. Ridge regression has

been used in time series analyses on the adverse health effects of multiple air pollutants but its properties for this

purpose have not been investigated. A series of simulation studies was used to compare the performance of the lasso,

ridge regression, and the Poisson log-linear model. In these simulations, realistic mortality time series were generated

with known air pollution mortality effects permitting the performance of the three models to be compared. Both the

lasso and ridge regression produced more accurate estimates of the adverse health effects of the multiple air pollutants

than those produced using the Poisson log-linear model. This increase in accuracy came at the expense of increased bias.

Ridge regression produced more accurate estimates than the lasso, but the lasso produced more interpretable models.

The lasso and ridge regression offer a flexible way of obtaining more accurate estimation of pollutant effects than that

provided by the standard Poisson log-linear model.

© 2005 Elsevier Ltd. All rights reserved.

Keywords: Time series; Air pollution; Mortality; Lasso; Ridge regression

1. Introduction

Numerous time series studies have investigated the

association between daily mortality or morbidity and

daily ambient air pollution concentrations (Derriennic et

al., 1989; Ito et al., 1995; Rahlenbeck and Kahl, 1996;

Ostro et al., 1999; Chock et al., 2000; Cifuentes et al.,

2000; Lee et al., 2000; Moolgavkar, 2003; Roberts, 2004;

Yang et al., 2004). These studies typically fit a Poisson

log-linear model using a generalized additive model

(GAM) (Hastie and Tibshirani, 1990) or generalized

linear model (GLM) (McCullagh and Nelder, 1989) to

concurrent time series of daily mortality or morbidity,

ambient air pollution, and meteorological covariates.

The fitted models are then used to quantify the adverse

health effects of ambient air pollution. Because the US

Environmental Protection Agency regulates pollutants

independently, most of the current time series research

on the adverse health effects of air pollution has focused

on estimating the effects of a single pollutant (Dominici

and Burnett, 2003). However, due to the potentially high

correlation between ambient air pollutants, the results

from studies that focus on a single pollutant can be

difficult to interpret in practice (Vedal et al., 2003). For

example, an observed positive association could occur

because the single air pollutant is a proxy for another air

pollutant or a mixture of air pollutants. To overcome

the limitations of single-pollutant time series studies, a

number of recent studies have investigated the concurrent

adverse health effects of multiple air pollutants

(Hales et al., 1999; Moolgavkar, 2003; Wong et al.,

2002). In the majority of studies of this nature, the

multiple air pollutants are simultaneously entered into a

single Poisson log-linear model.

In this paper, we introduce the use of the lasso

(Tibshirani, 1996) for estimating the adverse health

effects of multiple air pollutants in air pollution time

series studies. The statistical properties of the lasso for

this purpose will be compared to those of ridge

regression and the standard method of using a Poisson

log-linear model. Ridge regression has previously been

used to estimate the adverse health effects of multiple air

pollutants (Tze Wai et al., 1997) but to the best of our

knowledge its statistical properties for this purpose have

not been investigated. Both the lasso and ridge regression

belong to a class of regression techniques called

shrinkage methods. The term shrinkage derives from the

fact that these methods use a penalty term to deliberately

bias, or “shrink”, their coefficient estimates to

account for excessive variation in the original, unbiased

estimates. The consideration of shrinkage methods is

desirable when variables in a regression are highly

correlated, which is often the case with multiple air

pollutants in air pollution time series studies. It will be

shown that both the lasso and ridge regression can offer

an increase in statistical estimation precision compared

to the standard method of simultaneously

entering the multiple pollutants into a single Poisson

log-linear model. The development of new methodology

or models to concurrently estimate the adverse

health effects of multiple air pollutants has been

identified by statisticians, epidemiologists, and policy-makers

as an important area of future research

(Dominici and Burnett, 2003; Cox, 2000). The introduction

of the lasso for this purpose and the investigation of

the statistical properties of both the lasso and ridge

regression for this purpose is a practical step in this

direction.

2. Methods

2.1. Data

The data used in this paper were obtained from the

publicly available National Morbidity, Mortality, and

Air Pollution Study (NMMAPS) database. The data

extracted consist of concurrent daily time series of

mortality, weather, and air pollution for Cook County,

Illinois and Harris County, Texas in the United States

for the period 1987–2000.

The mortality time series data, aggregated at the level

of county, are non-accidental daily deaths of individuals

aged 65 and over. Deaths of non-residents were excluded

from the mortality counts. The weather time series data

are 24 h averages of temperature and dew point

temperature, computed from hourly observations.

The five air pollutants considered are particulate

matter of less than 10 μm in diameter (PM), ozone (O3),

sulphur dioxide (SO2), carbon monoxide (CO), and

nitrogen dioxide (NO2). For PM, SO2, CO, and NO2

average daily concentrations were used. For O3 the

maximum hourly concentration for each day was used.

For both counties the largest pairwise correlation

between the pollutants was 0.70 between NO2 and CO,

with all the other pairwise correlations below 0.60.

3. Simulation study

3.1. Mortality generation

In order to conduct the simulations, a way of

generating realistic mortality time series with known

air pollution mortality effects was required. We used a

method previously shown to generate realistic mortality

time series (Roberts, 2005), which proceeds by fitting the

following Poisson log-linear model similar to those used

in previous NMMAPS analyses (Daniels et al., 2000) to

the actual Cook County mortality and meteorological

time series data

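$$
\begin{aligned}
\log(\mu_t) = {} & \mu + S_{t1}(\text{time}, 4\ \text{df per year}) \\
& + S_{t2}(\text{temp}_0, 6\ \text{df}) + S_{t3}(\text{temp}_{1\text{--}3}, 6\ \text{df}) \\
& + S_{t4}(\text{dew}_0, 3\ \text{df}) + S_{t5}(\text{dew}_{1\text{--}3}, 3\ \text{df}) + \gamma\,\mathrm{DOW}_t,
\end{aligned}
\tag{1}
$$

where the subscript t refers to the day of the study, $\mu_t$ is

the mean number of deaths on day t, and $\mu$ is an

intercept term. The $S_{ti}(\cdot)$ are smooth functions of time,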

temperature and dew point temperature with the

indicated degrees of freedom. The smooth functions

are represented using natural cubic splines. The quantity

temp0 is the current day’s mean 24 h temperature and

temp1–3 is the average of the previous three days’ 24 h

mean temperatures. The values dew0 and dew1–3 are

similarly defined for the 24 h mean dew point temperature,

and $\mathrm{DOW}_t$ is a set of indicator variables for the day

of the week. In this paper, all the models considered

were fit using R, version 2.1.0.

Once model (1) was fit, the estimated mean mortality

counts, denoted $\hat{\mu}_t$, were extracted. The effects of the five

air pollutants on mortality were then explicitly specified

and incorporated into the generated mortality time series,

by generating mortality time series of length 2920 days that

were Poisson distributed with mean $\psi_t$ on day t, where

$$\log(\psi_t) = \log(\hat{\mu}_t) + \theta_1 X_{1t} + \theta_2 X_{2t} + \theta_3 X_{3t} + \theta_4 X_{4t} + \theta_5 X_{5t}. \tag{2}$$

Here, $X_{it}$, $i = 1, \ldots, 5$, are, respectively, the current day’s

daily concentrations of PM, NO2, SO2, CO, and O3, and

$\theta_i$, $i = 1, \ldots, 5$, are the corresponding, explicitly specified,

mortality effects of each air pollutant. Since each air

pollution time series was standardised, $100\theta_i$ is approximately

the percentage increase in mean daily mortality

for a unit standard deviation increment in pollutant i.
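The generation step in model (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' R code: the flat baseline of 50 deaths per day, the independent normal pollutant series, and the function names `poisson_draw`, `simulated_mean` and `generate_mortality` are all hypothetical stand-ins for the fitted baseline means and the standardised NMMAPS pollutant series.

```python
import math
import random


def poisson_draw(mean, rng):
    """Draw one Poisson variate with the given mean (Knuth's multiplication method)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulated_mean(baseline, theta, x_row):
    """Mean on one day: baseline * exp(sum_i theta_i * x_i), model (2) on the natural scale."""
    return baseline * math.exp(sum(t * x for t, x in zip(theta, x_row)))


def generate_mortality(baseline_means, pollutants, theta, seed=0):
    """Generate one Poisson mortality series from baseline means and standardised pollutants."""
    rng = random.Random(seed)
    return [
        poisson_draw(simulated_mean(m, theta, x_row), rng)
        for m, x_row in zip(baseline_means, pollutants)
    ]


# Hypothetical inputs (assumptions for illustration only): a flat baseline of
# 50 deaths per day and five independent standard-normal pollutant series.
input_rng = random.Random(1)
n_days = 2920
baseline = [50.0] * n_days
pollutants = [[input_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(n_days)]
theta = [0.004] * 5  # overall effect 0.02, the largest value used in the simulations

deaths = generate_mortality(baseline, pollutants, theta)
```

With an effect of 0.004 for one pollutant, the exact percentage increase for a one standard deviation increment is 100(exp(0.004) − 1), which is about 0.4008, so the approximation by 100 times the effect (0.4) is very close at effect sizes of this magnitude.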

In the simulations, 9 $(\theta_1, \theta_2, \theta_3, \theta_4, \theta_5)$ combinations

were used. For each of the 9 combinations, the overall

air pollution effect $\sum_{i=1}^{5}\theta_i$ was equal to either 0.02, 0.01,

0.005, 0.0025, or 0. Based on some recent studies

(Moolgavkar, 2003; Vedal et al., 2003), these values of

the overall air pollution effect are plausible. Recently it

has been stated that assessing the overall effect of the

mixture of air pollutants on mortality may be both more

meaningful and more achievable than attempting to

isolate the effect of individual pollutants (Stieb et al.,

2002). The reason for this is that the high correlations

that exist between pollutants often result in the effect

estimates for the individual pollutants being difficult to

interpret (Vedal et al., 2003) and/or unstable. In our

context, this means being more concerned with estimating

the “overall” air pollution effect $\sum_{i=1}^{5}\theta_i$ rather than

the effect of the individual pollutants $(\theta_1, \theta_2, \theta_3, \theta_4, \theta_5)$.

The overall pollution effect $\sum_{i=1}^{5}\theta_i$ represents the

increase in mortality for a simultaneous one standard

deviation increment in each of the individual pollutants.

This quantity is important for regulatory purposes

because it provides an indication of the total increase

in mortality that can be attributed to air pollution in

general. Estimating this quantity over a number of cities

and pooling the results, as has been done for individual

pollutants (Daniels et al., 2000), could provide regulators

with important information on the overall effect of

air pollution on mortality and whether the current air

pollution standards are adequate.

3.2. Statistical analysis

3.2.1. Standard model

As mentioned above, the majority of time series

studies that concurrently investigate the adverse health

effects of multiple air pollutants simultaneously enter

the pollutants into a single Poisson log-linear model.

Under this model, the daily mortality counts are

modelled as independent Poisson random variables with

a time-varying mean $\mu_t$, where

$$\log(\mu_t) = \text{confounders}_t + \beta_1 X_{1t} + \beta_2 X_{2t} + \cdots + \beta_k X_{kt} \tag{3}$$

and where $\text{confounders}_t$ represents other time-varying

confounding variables related to daily mortality. The

confounders included in this model will have the same

specification as the confounders that were used to

generate the mortality time series; that is, $\text{confounders}_t$

will have the same specification as the right-hand side of

model (1). The covariates $X_{it}$, $i = 1, \ldots, k$, represent the k

pollutants that are being investigated; in this paper,

$k = 5$. Hereafter, model (3) will be referred to as the

“standard model”.

3.2.2. Lasso and ridge regression

Both the lasso and ridge regression can be used to

concurrently investigate the adverse health effects of

multiple air pollutants by modelling the square-root of

the daily mortality counts as independent normal

random variables with a time-varying mean $\varpi_t$, subject

to a constraint on the parameters $\beta_j$:

$$\varpi_t = \text{confounders}_t + \beta_1 X_{1t} + \beta_2 X_{2t} + \cdots + \beta_k X_{kt} \quad \text{subject to} \quad \sum_{j=1}^{k} |\beta_j|^p \le s. \tag{4}$$

Here, $\text{confounders}_t$ and $X_{it}$, $i = 1, \ldots, k$, have the same

specification as in the standard model. For the lasso

$p = 1$, while for ridge regression $p = 2$. The value s is a

tuning parameter that controls the amount of shrinkage

that is applied to the estimates: the smaller the value of s,

the greater the shrinkage. A heuristic argument for the

use of shrinkage methods is that in a regression when

variables are highly correlated, a large positive coefficient

for one variable can be effectively cancelled out by

a correspondingly large negative coefficient on a highly

correlated variable. By imposing a size constraint on the

air pollution coefficients, both the lasso and ridge regression

mitigate this undesirable phenomenon (Hastie

et al., 2001). In a sense, in the presence of sufficiently

high correlation, such parameters can become non-identifiable;

that is, they cannot be separately estimated.

Close to this condition, the estimators exhibit very high

variances, and lasso and ridge estimators effectively

make what statisticians term a ‘‘bias-variance trade-off’’,

allowing a certain amount of bias in estimation in

exchange for more palatable (i.e. smaller) variance.

Hereafter, model (4) with $p = 1$ will be referred to as the

“lasso model” and with $p = 2$ as the “ridge model”.
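The difference between the two constraints can be illustrated with a small Python sketch. It is a hedged illustration rather than the authors' implementation: it uses the penalised (Lagrangian) form of each problem, in which a penalty weight `lam` plays the role of the bound s (larger `lam` corresponds to smaller s), and it solves both by cyclic coordinate descent. The function names, the toy design matrix and the value of `lam` are assumptions made for this example.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam, returning exactly 0.0 when |value| <= lam."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def coordinate_descent(X, y, lam, penalty, sweeps=200):
    """Minimise (1/2n) * RSS plus a penalty by cyclic coordinate descent.

    penalty="lasso" adds lam * sum|b_j|; penalty="ridge" adds (lam/2) * sum b_j^2.
    """
    n, k = len(X), len(X[0])
    beta = [0.0] * k
    for _ in range(sweeps):
        for j in range(k):
            # Covariance of column j with the residual that leaves coordinate j out.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][m] * beta[m] for m in range(k) if m != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            if penalty == "lasso":
                beta[j] = soft_threshold(rho, lam) / z
            else:
                beta[j] = rho / (z + lam)
    return beta


# Toy orthogonal design (an assumption for illustration): the response is
# exactly 2.0 * x1 + 0.1 * x2, so the unpenalised estimates are (2.0, 0.1).
X = [[1, 0], [0, 1], [-1, 0], [0, -1]]
y = [2.0, 0.1, -2.0, -0.1]

lasso = coordinate_descent(X, y, lam=0.1, penalty="lasso")
ridge = coordinate_descent(X, y, lam=0.1, penalty="ridge")
```

In this toy case the lasso sets the small second coefficient to exactly zero, while ridge shrinks both coefficients but leaves both non-zero. An orthogonal design is used only because the result can be checked by hand; the correlated-pollutant setting of the paper is where the variance reduction matters most.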

For each simulation, for the lasso model four equally

spaced values of the tuning parameter s ranging from

$0.85\sum_{j=1}^{5}|\hat{\beta}_j|$ to $\sum_{j=1}^{5}|\hat{\beta}_j|$, inclusive, were considered,

where $\hat{\beta}_j$ are the parameter estimates obtained from

model (4) with no constraints placed on the coefficients.

This range of s values was based on an initial

exploratory analysis which suggested that smaller values

of s resulted in too much shrinkage and the fact that

values of $s \ge \sum_{j=1}^{5}|\hat{\beta}_j|$ result in identical estimates being

obtained. For each simulation, the value of s within this

range that was used to obtain the estimates was based on

a test data set of length 365 days—the last 365 days of

each of the generated mortality time series was set-aside

and the s value that was ‘‘best’’ able to predict the 365

days of set-aside mortality was selected. For each

simulation, for the ridge model four values of the tuning

parameter s were also considered. The four values

considered corresponded to a similar level of shrinkage

as used in the lasso model. As for the lasso model, the s

value used to obtain the estimates was based on a test

data set of length 365 days. It is important to reiterate

that for each simulation that was run (i.e., for each

generated mortality time series) the procedures described

above were used to estimate two new values of

the tuning parameter s—one for the lasso model and one

for the ridge model. The estimation of the tuning

parameters was an explicit part of both the lasso and

ridge models’ estimation procedures.
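The tuning procedure described above can be sketched as follows. The grid of four equally spaced values between 0.85 times and 1 times the sum of the absolute unconstrained estimates, and the hold-out of the last 365 days, follow the text. The prediction criterion is an assumption: the text says only that the s value "best" able to predict the set-aside mortality was chosen, so mean squared prediction error is used here as one plausible choice. The `fit` callable and all function names are hypothetical.

```python
def candidate_s_values(beta_unconstrained, n_values=4, lower_fraction=0.85):
    """Equally spaced s values from lower_fraction * sum|b_j| up to sum|b_j|, inclusive."""
    upper = sum(abs(b) for b in beta_unconstrained)
    lower = lower_fraction * upper
    step = (upper - lower) / (n_values - 1)
    return [lower + i * step for i in range(n_values)]


def split_last_days(series, n_test=365):
    """Hold out the final n_test observations as the test set."""
    return series[:-n_test], series[-n_test:]


def select_s(candidates, fit, train, test):
    """Return the candidate s with the smallest mean squared error on the test set.

    `fit(train, s)` is a hypothetical callable that returns a predictor function;
    `train` and `test` are lists of (covariates, response) pairs.
    """
    def test_error(s):
        predict = fit(train, s)
        return sum((resp - predict(cov)) ** 2 for cov, resp in test) / len(test)

    return min(candidates, key=test_error)
```

Because the selection is repeated for every generated series, a full simulation would call `select_s` once per series and per model, which is what makes the choice of s part of the estimation procedure rather than a fixed setting.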

The standard model was implemented using Poisson

log-linear regression based on the daily mortality counts,

while both the lasso and ridge models were implemented

using normal linear regression based on the square-root

of the daily mortality counts. The reason for this choice

is that the lasso and ridge regression were originally

developed in the context of the normal linear regression

framework. Previous studies have shown that for our

purposes Poisson log-linear regression and normal linear

regression with a square-root transformation lead to

similar results (Smith et al., 2000).
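One standard motivation for the square-root transformation, offered here as a textbook delta-method argument and not as a derivation given by the authors, is that it approximately stabilises the variance of Poisson counts. If $Y_t$ is Poisson with mean $\mu_t$, so that $\operatorname{Var}(Y_t) = \mu_t$, then for moderately large $\mu_t$

$$\operatorname{Var}\left(\sqrt{Y_t}\right) \approx \left(\left.\frac{d\sqrt{\mu}}{d\mu}\right|_{\mu = \mu_t}\right)^{2} \operatorname{Var}(Y_t) = \frac{1}{4\mu_t}\,\mu_t = \frac{1}{4}.$$

The transformed counts therefore have roughly constant variance regardless of the daily mean, which is consistent with the constant-variance assumption of normal linear regression.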

3.2.3. Criteria for evaluation

For each of the 9 $(\theta_1, \theta_2, \theta_3, \theta_4, \theta_5)$ combinations, 1000

mortality time series were generated using model (2).

For each generated mortality time series, the individual

pollutant effects $(\theta_1, \theta_2, \theta_3, \theta_4, \theta_5)$ and the overall air

pollution effect $\sum_{i=1}^{5}\theta_i$ were estimated using the

standard, lasso and ridge models. For each set of 1000

simulations, the root mean squared error (rmse),

standard deviation, and bias of the estimates obtained

from each of the three models were calculated. The rmse

is a standard method of comparing estimators or models

that provides a measure of accuracy that incorporates

both bias and variance. Smaller rmse values correspond

to better estimators in the sense of having both relatively

small bias and small variance. The results of the

simulations are reported in Tables 1 and 2.
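The three summary measures are linked by the identity rmse² = bias² + SD², which the following Python sketch makes explicit. The function name `summarise` and the toy estimates are assumptions for illustration. The standard deviation is computed with divisor n so that the identity holds exactly; whether the authors used n or n − 1 is not stated, and with 1000 simulations the difference is negligible.

```python
import math


def summarise(estimates, true_value):
    """Return (rmse, bias, sd) for a list of estimates of a known true value."""
    n = len(estimates)
    mean_estimate = sum(estimates) / n
    bias = mean_estimate - true_value
    # Divisor n (not n - 1) so that rmse**2 == bias**2 + sd**2 holds exactly.
    sd = math.sqrt(sum((e - mean_estimate) ** 2 for e in estimates) / n)
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / n)
    return rmse, bias, sd


# Toy estimates of a true effect of 2.0 (illustrative numbers only).
rmse, bias, sd = summarise([1.0, 2.0, 3.0, 6.0], true_value=2.0)
```

The same identity can be used to check the reported tables. For example, the lasso estimate of the overall effect in the largest setting of Table 1 has bias −1.73 and standard deviation 4.76, and the square root of 1.73² + 4.76² is approximately 5.06, which agrees with the reported rmse.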

4. Results

Table 1 contains the results of the simulations for

mortality generated with each pollutant having either no

effect on mortality or each pollutant having an effect on

mortality. Table 2 contains the results of the simulations

for mortality generated with PM and NO2 having an

effect on mortality and SO2, CO, and O3 having no

effect on mortality. These tables contain the rmse, bias,

and standard deviation of the estimates obtained from

the standard, lasso, and ridge models.

In both tables the rmse values for the estimates

obtained from both the lasso and ridge models are

generally smaller than the rmse values for the estimates

obtained from the standard model. This indicates that

both the lasso and ridge models provide more accurate

estimates of both the effect of the individual pollutants

on mortality and of the overall effect of air pollution on

mortality than the standard model. The increase in

accuracy can sometimes be substantial. For example, in

simulation 9 the rmse value of the estimate obtained for

the effect of O3 on mortality from the lasso model was

30% smaller than the corresponding estimate obtained

from the standard model.

As discussed above, the increased accuracy of the

estimates obtained from both the lasso and ridge models

is due to a ‘‘bias-variance trade-off’’. From Tables 1 and

2, it can be seen that the individual pollutant effect

estimates obtained from both the lasso and ridge models

have a larger bias than the estimates obtained from the

standard model. This bias becomes material in simulation 9

when PM and NO2 have a large effect on

mortality and SO2, CO, and O3 have no effect on

mortality. However, importantly, it can also be seen that

the overall pollution effect estimates obtained from

both the lasso and ridge models do not have a material

bias. The increased bias of the estimates obtained from

both the lasso and ridge models is typically more than

compensated for by a reduction in variance, as

illustrated by the increased accuracy of the estimates

obtained from both the lasso and ridge models compared

to the standard model. The lasso model did not perform

well when large values of the overall effect of air

pollution on mortality were used, most notably simulation 5.

In this case, the bias component of the rmse rose

significantly for the lasso model. Statistically, the reason

for this phenomenon is that shrinkage is largest when

the effects are large resulting in the large increase in the

negative bias of the estimates from the lasso model.

The lasso model has one important advantage over

both the ridge and standard models. The nature of the

constraint on the coefficients in the lasso means that it

can return parameter estimates of exactly zero (Tibshirani,

1996). In our context, this means that the lasso

model will sometimes return air pollutant effect estimates

of exactly zero; this will never be the case for both

Table 1

Results of fitting the standard, lasso and ridge models to the generated mortality time series

| | PM | NO2 | SO2 | CO | O3 | Overall |
|---|---|---|---|---|---|---|
| Pollutant effects^a | 0 | 0 | 0 | 0 | 0 | 0 |
| rmse^b, lasso | 2.61 | 3.56 | 2.83 | 2.95 | 3.86 | 4.50 |
| rmse, ridge | 2.49 | 3.30 | 2.70 | 2.79 | 3.55 | 4.31 |
| rmse, standard | 2.78 | 3.90 | 2.95 | 3.17 | 4.28 | 4.83 |
| Bias (SD)^c, lasso | 0.01 (2.61) | −0.38 (3.54) | 0.06 (2.83) | 0.27 (2.94) | 0.02 (3.86) | −0.02 (4.50) |
| Bias (SD), ridge | 0.01 (2.50) | −0.53 (3.26) | 0.10 (2.70) | 0.36 (2.77) | 0.15 (3.55) | 0.09 (4.31) |
| Bias (SD), standard | 0.05 (2.78) | −0.14 (3.90) | −0.08 (2.95) | 0.11 (3.17) | 0.12 (4.28) | 0.07 (4.83) |
| Pollutant effects | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 2.50 |
| rmse, lasso | 2.63 | 3.53 | 2.78 | 2.96 | 3.89 | 4.50 |
| rmse, ridge | 2.54 | 3.21 | 2.66 | 2.77 | 3.56 | 4.29 |
| rmse, standard | 2.78 | 3.88 | 2.91 | 3.21 | 4.40 | 4.84 |
| Bias (SD), lasso | −0.15 (2.63) | −0.30 (3.52) | 0.11 (2.78) | 0.15 (2.96) | −0.16 (3.89) | −0.36 (4.49) |
| Bias (SD), ridge | −0.16 (2.53) | −0.42 (3.18) | 0.14 (2.65) | 0.21 (2.76) | −0.09 (3.56) | −0.32 (4.28) |
| Bias (SD), standard | −0.09 (2.78) | −0.02 (3.88) | −0.02 (2.91) | −0.04 (3.21) | −0.01 (4.40) | −0.18 (4.84) |
| Pollutant effects | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 5.00 |
| rmse, lasso | 2.61 | 3.48 | 2.72 | 2.84 | 3.81 | 4.41 |
| rmse, ridge | 2.49 | 3.12 | 2.60 | 2.65 | 3.52 | 4.25 |
| rmse, standard | 2.79 | 3.84 | 2.87 | 3.09 | 4.35 | 4.75 |
| Bias (SD), lasso | −0.01 (2.61) | −0.16 (3.48) | −0.05 (2.73) | 0.16 (2.83) | −0.27 (3.80) | −0.34 (4.40) |
| Bias (SD), ridge | −0.02 (2.49) | −0.30 (3.10) | −0.01 (2.60) | 0.24 (2.64) | −0.24 (3.51) | −0.33 (4.24) |
| Bias (SD), standard | 0.08 (2.79) | 0.14 (3.84) | −0.19 (2.87) | −0.02 (3.09) | −0.09 (4.35) | −0.07 (4.75) |
| Pollutant effects | 2.00 | 2.00 | 2.00 | 2.00 | 2.00 | 10.00 |
| rmse, lasso | 2.70 | 3.68 | 2.70 | 2.97 | 3.74 | 4.11 |
| rmse, ridge | 2.58 | 3.36 | 2.59 | 2.83 | 3.63 | 4.11 |
| rmse, standard | 2.87 | 3.95 | 2.83 | 3.20 | 4.37 | 4.55 |
| Bias (SD), lasso | −0.11 (2.70) | −0.52 (3.64) | 0.03 (2.70) | 0.37 (2.95) | −0.20 (3.74) | −0.42 (4.09) |
| Bias (SD), ridge | −0.08 (2.58) | −0.63 (3.30) | 0.08 (2.59) | 0.46 (2.79) | −0.23 (3.63) | −0.40 (4.09) |
| Bias (SD), standard | 0.07 (2.87) | −0.23 (3.95) | −0.04 (2.83) | 0.22 (3.19) | 0.19 (4.37) | 0.20 (4.54) |
| Pollutant effects | 4.00 | 4.00 | 4.00 | 4.00 | 4.00 | 20.00 |
| rmse, lasso | 2.70 | 3.57 | 2.81 | 2.99 | 3.95 | 5.06 |
| rmse, ridge | 2.50 | 3.13 | 2.65 | 2.75 | 3.80 | 4.58 |
| rmse, standard | 2.77 | 3.78 | 2.90 | 3.15 | 4.33 | 4.69 |
| Bias (SD), lasso | −0.43 (2.67) | −0.23 (3.57) | −0.13 (2.81) | −0.08 (2.99) | −0.86 (3.86) | −1.73 (4.76) |
| Bias (SD), ridge | −0.21 (2.49) | −0.45 (3.10) | 0.05 (2.65) | 0.16 (2.75) | −0.69 (3.74) | −1.14 (4.44) |
| Bias (SD), standard | 0.03 (2.77) | 0.06 (3.78) | −0.01 (2.90) | −0.16 (3.15) | 0.04 (4.33) | −0.04 (4.69) |

a: 1000 times the pollutant effects that were used to generate mortality.

b: 1000 times the rmse of the individual and overall pollutant effect estimates.

c: 1000 times the bias and standard deviation of the individual and overall pollutant effect estimates. The standard deviations are in parentheses.

the ridge and standard models. Table 2 contains the

percentage of times in each set of 1000 simulations that

the lasso model returned an effect estimate of exactly

zero for each pollutant in simulations 6–9. From this

table it can be seen that in simulation 9, the lasso model

assigns estimates of zero a large percentage of the time

to the three pollutants with a zero mortality effect, but

apart from this the percentage of zero estimates assigned

to these three pollutants is quite small. Smaller values of

the tuning parameter s would result in a higher

percentage of zero estimates at the expense of further

increased bias. The ability of the lasso model to return

estimates of exactly zero is important because it may lead

to more interpretable models than those obtained from

both the ridge and standard models. This feature is

particularly important given that researchers have

recently concluded that pollutants measured and included

in models of daily mortality might better be interpreted as

indicators of the biologically relevant pollutant mixture

(Moolgavkar, 2003; Venners et al., 2003). The lasso model

with its ability to assign air pollutant effect estimates of

exactly zero clearly fits better into this paradigm, because

it allows for particular pollutant coefficients to be

eliminated completely during the modelling process.

Table 2

Results of fitting the standard, lasso and ridge models to the generated mortality time series

| | PM | NO2 | SO2 | CO | O3 | Overall |
|---|---|---|---|---|---|---|
| Pollutant effects^a | 1.25 | 1.25 | 0 | 0 | 0 | 2.50 |
| rmse^b, lasso | 2.78 | 3.43 | 2.56 | 2.90 | 3.83 | 4.14 |
| Percent zero^c, lasso | 4.6 | 4.1 | 4.5 | 6.3 | 7.5 | |
| rmse, ridge | 2.64 | 3.11 | 2.45 | 2.71 | 3.42 | 3.88 |
| rmse, standard | 2.96 | 3.81 | 2.72 | 3.15 | 4.34 | 4.48 |
| Bias (SD)^d, lasso | −0.14 (2.78) | −0.49 (3.40) | 0.11 (2.56) | 0.40 (2.88) | −0.14 (3.83) | −0.25 (4.13) |
| Bias (SD), ridge | −0.21 (2.64) | −0.67 (3.03) | 0.16 (2.44) | 0.52 (2.66) | 0.04 (3.42) | −0.15 (3.88) |
| Bias (SD), standard | −0.05 (2.96) | −0.15 (3.81) | −0.06 (2.72) | 0.18 (3.15) | −0.04 (4.34) | −0.13 (4.48) |
| Pollutant effects | 2.5 | 2.5 | 0 | 0 | 0 | 5.00 |
| rmse, lasso | 2.71 | 3.37 | 2.69 | 2.79 | 3.59 | 4.30 |
| Percent zero, lasso | 5.9 | 5.1 | 5.8 | 7.0 | 10.6 | |
| rmse, ridge | 2.54 | 2.95 | 2.56 | 2.63 | 3.28 | 4.15 |
| rmse, standard | 2.86 | 3.73 | 2.90 | 3.06 | 4.20 | 4.71 |
| Bias (SD), lasso | −0.22 (2.70) | −0.38 (3.35) | −0.01 (2.69) | 0.39 (2.77) | 0.03 (3.59) | −0.19 (4.30) |
| Bias (SD), ridge | −0.30 (2.53) | −0.70 (2.86) | 0.10 (2.56) | 0.59 (2.56) | 0.22 (3.27) | −0.08 (4.16) |
| Bias (SD), standard | −0.04 (2.86) | 0.09 (3.73) | −0.21 (2.89) | 0.10 (3.06) | 0.11 (4.20) | 0.04 (4.71) |
| Pollutant effects | 5 | 5 | 0 | 0 | 0 | 10.00 |
| rmse, lasso | 2.72 | 3.53 | 2.63 | 2.88 | 3.58 | 4.26 |
| Percent zero, lasso | 1.6 | 1.7 | 11.1 | 10.6 | 17.7 | |
| rmse, ridge | 2.43 | 3.05 | 2.58 | 2.73 | 3.26 | 4.06 |
| rmse, standard | 2.75 | 3.88 | 2.95 | 3.31 | 4.48 | 4.81 |
| Bias (SD), lasso | −0.40 (2.69) | −0.51 (3.49) | 0.11 (2.63) | 0.28 (2.86) | 0.06 (3.58) | −0.47 (4.24) |
| Bias (SD), ridge | −0.63 (2.35) | −1.08 (2.85) | 0.33 (2.56) | 0.66 (2.66) | 0.29 (3.25) | −0.44 (4.04) |
| Bias (SD), standard | 0.00 (2.75) | 0.26 (3.88) | −0.19 (2.94) | −0.22 (3.31) | 0.10 (4.48) | −0.05 (4.81) |
| Pollutant effects | 10 | 10 | 0 | 0 | 0 | 20.00 |
| rmse, lasso | 2.91 | 3.68 | 2.16 | 2.57 | 3.13 | 4.04 |
| Percent zero, lasso | 0.0 | 0.3 | 22.8 | 20.7 | 40.5 | |
| rmse, ridge | 2.71 | 3.51 | 2.41 | 2.78 | 3.09 | 3.92 |
| rmse, standard | 2.78 | 3.92 | 2.75 | 3.15 | 4.44 | 4.63 |
| Bias (SD), lasso | −0.79 (2.80) | −1.12 (3.51) | 0.19 (2.15) | 0.55 (2.51) | 0.08 (3.13) | −1.09 (3.90) |
| Bias (SD), ridge | −1.36 (2.35) | −2.29 (2.67) | 0.84 (2.26) | 1.44 (2.38) | 0.44 (3.06) | −0.92 (3.81) |
| Bias (SD), standard | −0.01 (2.78) | 0.13 (3.92) | −0.09 (2.75) | −0.10 (3.15) | 0.02 (4.44) | −0.04 (4.63) |

a: 1000 times the pollutant effects that were used to generate mortality.

b: 1000 times the rmse of the individual and overall pollutant effect estimates.

c: The percentage of time that the lasso model assigned an effect estimate of exactly zero to the given pollutant.

d: 1000 times the bias and standard deviation of the individual and overall pollutant effect estimates. The standard deviations are in parentheses.

Additional sets of simulations were conducted to

investigate the performance of the lasso, ridge, and

standard models under model misspecification. In these

simulations the three models were fit using either a

subset or a superset of the confounders used to generate

mortality or the mortality time series were generated

allowing for interaction between the different pollutants.

For the simulations that allowed for interaction effects,

mortality time series were generated assuming pairwise

interactions between PM and O3 and between NO2 and CO,

or a single pairwise interaction between PM and O3. The

lasso, ridge, and standard models that were fit to these

generated mortality time series did not allow for these

interactions. The results of these additional simulations

were similar to those described above where the three

models were correctly specified. This suggests that, even

under model misspecification or unaccounted-for interaction

effects, the conclusions regarding the relative

performance of the standard, lasso, and ridge models are

not materially altered.

5. Application

In this section the data from Cook County and Harris

County for the period 1987–2000 were used to compare

the standard, lasso, and ridge models in a real data

setting. For these analyses days with missing values

Table 3

Results of applying the lasso, ridge, and standard models to data from Cook County, Illinois and Harris County, Texas for the period 1987–2000

| | PM | NO2 | SO2 | CO | O3 | Overall^a |
|---|---|---|---|---|---|---|
| Cook County | | | | | | |
| Lasso^b | 0.49 (0.21) | 0.00 (0.26) | −0.45 (0.24) | −0.08 (0.24) | 0.00 (0.27) | −0.05 (0.34) |
| Ridge | 0.53 (0.23) | 0.05 (0.33) | −0.50 (0.24) | −0.14 (0.25) | 0.03 (0.31) | −0.02 (0.36) |
| Standard | 0.55 (0.23) | 0.05 (0.30) | −0.47 (0.22) | −0.15 (0.24) | 0.10 (0.34) | 0.07 (0.37) |
| Harris County | | | | | | |
| Lasso | −0.53 (0.53) | −0.12 (0.84) | 0.93 (0.58) | −0.04 (0.62) | 0.00 (0.67) | 0.25 (0.80) |
| Ridge | −0.43 (0.52) | −0.09 (0.73) | 0.73 (0.51) | −0.07 (0.60) | 0.07 (0.56) | 0.21 (0.78) |
| Standard | −0.50 (0.60) | −0.44 (1.02) | 1.03 (0.62) | 0.10 (0.73) | 0.06 (0.72) | 0.25 (0.82) |

a: The increase in mortality for a simultaneous one standard deviation increment in the concentration of each pollutant.

b: The estimated percentage increase in mortality for a one standard deviation increment in the pollutant. The values in parentheses are the standard deviations of the estimates.

across any of the variables were removed. This left 4835

days of data for Cook County and 1703 days of data for

Harris County. The form of the standard, lasso, and

ridge models that were fit to the data from each county is

specified in models (3) and (4) above.
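As a rough consistency check on the "Overall" column of Table 3, the per-pollutant percentage increases can be combined on the log scale. The sketch below assumes a log-linear model without interaction terms, so that a simultaneous one standard deviation increment in every pollutant multiplies the individual relative risks. The function name `overall_percent_increase` is hypothetical, and the inputs are the rounded point estimates printed in the table, so the result can only agree with the tabulated value up to rounding.

```python
import math


def overall_percent_increase(individual_percent):
    """Combine per-pollutant percentage increases multiplicatively (additive on the log scale)."""
    log_sum = sum(math.log1p(p / 100.0) for p in individual_percent)
    return 100.0 * math.expm1(log_sum)


# Standard-model point estimates for Cook County from Table 3 (PM, NO2, SO2, CO, O3).
cook_standard = [0.55, 0.05, -0.47, -0.15, 0.10]
combined = overall_percent_increase(cook_standard)
print(round(combined, 2))
```

For effects this small the combined value is close to the simple sum of the individual estimates, which is why the tabulated overall effects sit near the row sums.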

Table 3 contains the results of applying the models to

the data from both counties. In both counties the

standard, lasso, and ridge models gave similar results for

the estimates of the effect of the individual pollutants on

mortality and for the overall effect of air pollution on

mortality. The ability of the lasso model to assign air
pollution effect estimates of exactly zero is clearly
illustrated for these two counties, with NO2 and O3
receiving effect estimates of zero in Cook County and O3
receiving an effect estimate of zero in Harris County. As
discussed above, for reasons of interpretability and the
current interest in biologically relevant pollutant mixtures,
the ability of the lasso model to assign zero effect
estimates is an important attribute that is not shared by
either the standard or ridge models.
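Why the lasso can return exact zeros while ridge regression cannot is easiest to see in an idealised setting. The sketch below assumes a linear model with orthonormal predictors, where the lasso reduces to soft-thresholding of the unpenalised estimates and ridge regression to proportional shrinkage. This is a simplification of the penalised Poisson models considered in this paper; the input values, the penalty, and the function names are hypothetical.

```python
def lasso_shrink(estimate, penalty):
    """Soft-thresholding: move the estimate toward zero by `penalty`, stopping at zero."""
    magnitude = abs(estimate) - penalty
    if magnitude <= 0.0:
        return 0.0
    return magnitude if estimate > 0 else -magnitude


def ridge_shrink(estimate, penalty):
    """Proportional shrinkage: scales the estimate but never sets it exactly to zero."""
    return estimate / (1.0 + penalty)


# Hypothetical unpenalised effect estimates for five pollutants.
unpenalised = [0.60, 0.04, -0.50, -0.12, 0.08]
penalty = 0.10

lasso_estimates = [lasso_shrink(b, penalty) for b in unpenalised]
ridge_estimates = [ridge_shrink(b, penalty) for b in unpenalised]
```

With these inputs the two estimates smaller in magnitude than the penalty become exactly zero under `lasso_shrink`, whereas every `ridge_shrink` output remains non-zero.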

6. Discussion

The results of our study demonstrate that both the

lasso and ridge regression can be used to provide more

accurate estimates of both the individual and overall

mortality effects of multiple air pollutants compared to

the standard method of using a Poisson log-linear

model. However, for estimating the mortality effect of

individual pollutants the increased accuracy of both the

lasso and ridge regression sometimes comes at the

expense of larger bias. This fact should be kept in mind

when these shrinkage methods are used to estimate the

adverse health effects of multiple pollutants.

The study also showed that the more accurate
estimates of the overall air pollution effect obtained
from both the lasso and ridge regression, compared to
the standard Poisson log-linear model, came without
the introduction of material bias. This is important

information because it has recently been stated that

assessing the overall effect of the mixture of air

pollutants on mortality may be more meaningful than

assessing the effect of individual pollutants. This study

shows that for this purpose ridge regression should be

preferred to both the lasso and the standard Poisson log-

linear model.

Some other studies have also looked at alternative

models for investigating the adverse health effects of

multiple pollutants. Wong et al. (2002) adopted a

pairwise approach. If more than one pollutant seemed

to be associated with the outcome, the association with

one pollutant stratified by the level of the other pollutant

was sought. Hong et al. (1999) used a number of a priori

fixed air pollution indices to evaluate the combined

effects of various air pollutants. These models differ

from the standard, lasso, and ridge models investigated in

this paper. Unlike these three models which provide

individual effect estimates for each pollutant of interest,

the approach of Wong et al. will only provide estimates

for a subset of two of these five pollutants and the

approach of Hong et al. will only provide an estimate for

an a priori fixed air pollution index on mortality. Since

the approaches of Wong et al. and Hong et al. do not

provide individual effect estimates for each of the five

pollutants considered in this paper they were not

compared here to the standard, lasso, and ridge models.

Use of shrinkage methods such as the lasso or ridge
regression offers a flexible way of obtaining more
accurate estimation of pollutant effects than that
provided by the standard approach. This more accurate
estimation is due to the "bias-variance trade-off":
shrinkage may introduce some bias, but it reduces the
variance of the estimates by a larger amount. The
results presented in this paper should help researchers
investigating the adverse health effects of multiple air
pollutants decide whether shrinkage methods are
appropriate for their needs.
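The trade-off can be made concrete with a minimal numerical sketch. It assumes a single unbiased estimate with known standard error that is multiplied by a fixed shrinkage factor, which is much simpler than the lasso or ridge estimators studied here. The true effect, standard error, and shrinkage factor are hypothetical values chosen only for illustration.

```python
def shrunken_estimate_error(true_effect, std_error, shrink):
    """Bias, variance, and mean squared error of shrink * X, where X is unbiased for true_effect."""
    bias = (shrink - 1.0) * true_effect
    variance = (shrink * std_error) ** 2
    return {"bias": bias, "variance": variance, "mse": bias ** 2 + variance}


# Hypothetical true effect of 0.5 estimated with a standard error of 0.3.
unshrunk = shrunken_estimate_error(0.5, 0.3, shrink=1.0)
shrunk = shrunken_estimate_error(0.5, 0.3, shrink=0.8)
```

In this example the shrunken estimator is biased toward zero but has a smaller mean squared error than the unshrunk one, because the reduction in variance outweighs the squared bias.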


References

Chock, D.P., Winkler, S.L., Chen, C., 2000. A study of the
association between daily mortality and ambient air
pollutant concentrations in Pittsburgh, Pennsylvania.
Journal of Air & Waste Management Association 50, 1481–1500.

Cifuentes, L.A., Vega, J., Kopfer, K., Lave, L.B., 2000. Effect

of the fine fraction of particulate matter versus the coarse

mass and other pollutants on daily mortality in Santiago,

Chile. Journal of Air & Waste Management Association 50,

1287–1298.

Cox, L.H., 2000. Statistical issues in the study of air pollution

involving airborne particulate matter. Environmetrics 11,

611–626.

Daniels, M.J., Dominici, F., Samet, J.M., et al., 2000.
Estimating particulate matter-mortality dose–response
curves and threshold levels: an analysis of daily time-series
for the 20 largest US cities. American Journal of
Epidemiology 152, 397–406.

Derriennic, F., Richardson, S., Mollie, A., et al., 1989. Short-

term effects of sulphur dioxide pollution on mortality in two

French cities. International Journal of Epidemiology 18,

186–197.

Dominici, F., Burnett, R.T., 2003. Risk models for particulate

air pollution. Journal of Toxicology and Environmental

Health Part A 66, 1883–1889.

Hales, S., Salmond, C., Town, G.I., et al., 1999. Daily mortality

in relation to weather and air pollution in Christchurch,

New Zealand. Australian and New Zealand Journal of

Public Health 24, 88–91.

Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of
Statistical Learning. Springer, Berlin.

Hastie, T.J., Tibshirani, R.J., 1990. Generalized Additive

Models. Chapman & Hall, London.

Hong, Y.C., Leem, J.H., Ha, E.H., Christiani, D.C., 1999.

PM10 exposure, gaseous pollutants, and daily mortality in

Inchon, South Korea. Environmental Health Perspectives

107, 873–878.

Ito, K., Kinney, P.L., Thurston, G.D., 1995. Variations in
PM-10 concentrations within two metropolitan areas and their
implications for health effects analyses. Inhalation
Toxicology 7, 735–745.

Lee, J.T., Kim, H., Hong, Y.C., et al., 2000. Air pollution and

daily mortality in seven major cities of Korea, 1991–1997.

Environmental Research 84, 247–254.

McCullagh, P., Nelder, J.A., 1989. Generalized Linear Models.

Chapman & Hall, London.

Moolgavkar, S.H., 2003. Air pollution and daily mortality in

two US counties: season-specific analyses and exposure-

response relationships. Inhalation Toxicology 15, 877–907.

Ostro, B.D., Hurley, S., Lipsett, M.J., 1999. Air pollution and

daily mortality in the Coachella Valley, California: a study

of PM10 dominated by coarse particles. Environmental

Research 81, 231–238.

Rahlenbeck, S.I., Kahl, H., 1996. Air pollution and mortality
in East Berlin during the winters of 1981–1989.
International Journal of Epidemiology 25, 1220–1226.

Roberts, S., 2004. Interactions between particulate air pollution

and temperature in air pollution mortality time series

studies. Environmental Research 96, 328–337.

Roberts, S., 2005. An investigation of distributed lag models in

the context of air pollution and mortality time series

analysis. Journal of Air & Waste Management Association

55, 273–282.

Smith, R.L., Davis, J.M., Sacks, J., et al., 2000. Regression

models for air pollution and daily mortality: analysis of data

from Birmingham, Alabama. Environmetrics 11, 719–743.

Stieb, D.M., Judek, S., Burnett, R.T., 2002. Meta-analysis of
time-series studies of air pollution and mortality: effects of
gases and particles and the influence of cause of death, age,
and season. Journal of Air & Waste Management
Association 52, 470–484.

Tibshirani, R., 1996. Regression shrinkage and selection via the

lasso. Journal of the Royal Statistical Society B 58, 267–288.

Tze Wai, W., Ka Ming, H., Tai Shing, L., et al., 1997. A study of
short-term effects of ambient air pollution on public health.
A consultancy report for Environmental Protection
Department Hong Kong. Available from: http://www.epd.gov.hk/epd/english/environmentinhk/air/studyrpts/cuhk97.html.
Accessed January 3, 2005.

Vedal, S., Brauer, M., White, R., et al., 2003. Air pollution and

daily mortality in a city with low levels of pollution.

Environmental Health Perspectives 111, 45–52.

Venners, S.A., Wang, B., Peng, Z., et al., 2003. Particulate

matter, sulfur dioxide, and daily mortality in Chongqing,

China. Environmental Health Perspectives 111, 562–567.

Wong, T.W., Tam, W.S., Yu, T.S., et al., 2002. Associations
between daily mortalities from respiratory and
cardiovascular diseases and air pollution in Hong Kong, China.
Occupational and Environmental Medicine 59, 30–35.

Yang, C.Y., Chen, Y.S., Yang, C.H., et al., 2004. Relationship

between ambient air pollution and hospital admissions for

cardiovascular diseases in Kaohsiung, Taiwan. Journal of

Toxicology and Environmental Health Part A 67, 483–493.