Ozone Ensemble Forecast with Machine Learning...


Transcript of Ozone Ensemble Forecast with Machine Learning...

  • JOURNAL OF GEOPHYSICAL RESEARCH, VOL. ???, XXXX, DOI:10.1029/

    Ozone Ensemble Forecast with Machine Learning Algorithms

    Vivien Mallet
    INRIA, Paris-Rocquencourt research center, France
    CEREA, joint laboratory ENPC - EDF R&D, Université Paris-Est, Marne la Vallée, France

    Gilles Stoltz
    Département de mathématiques et applications, École normale supérieure, CNRS, 45 rue d'Ulm, 75005 Paris, France
    HEC Paris, CNRS, 1 rue de la Libération, 78351 Jouy-en-Josas, France

    Boris Mauricette

    Abstract. We apply machine learning algorithms to perform sequential aggregation of ozone forecasts. The latter rely on a multimodel ensemble built for ozone forecasting with the modeling system Polyphemus. The ensemble simulations are obtained by changes in the physical parameterizations, the numerical schemes, and the input data to the models. The simulations are carried out for summer 2001 over Western Europe in order to forecast ozone daily peaks and ozone hourly concentrations. Based on past observations and past model forecasts, the learning algorithms produce a weight for each model. A convex or linear combination of the model forecasts is then formed with these weights. This process is repeated for each round of forecasting and is therefore called sequential aggregation. The aggregated forecasts demonstrate good results; for instance, they always show better performance than the best model in the ensemble and they even compete against the best constant linear combination. In addition, the machine learning algorithms come with theoretical guarantees on their performance that hold for all possible sequences of observations, even non-stochastic ones. Our study also demonstrates the robustness of the methods. We therefore conclude that these aggregation methods are very relevant for operational forecasts.

    1. Introduction

    The large uncertainties in air quality forecasts have lately been evaluated with ensemble approaches. Besides the propagation of input-data uncertainties with Monte Carlo simulations [e.g., Hanna et al., 1998; Beekmann and Derognat, 2003], multimodel ensembles have been introduced in order to account for the uncertainties in the formulation of chemistry-transport models [Mallet and Sportisse, 2006b; van Loon et al., 2007]. In this case, the models are based on different physical parameterizations, different numerical discretizations and different input data. Each model in the ensemble brings information that may be used to improve the forecasts.

    A class of methods linearly combines the ensemble members in order to produce a single forecast, hopefully more skillful than any individual model of the ensemble. Henceforth, we refer to these methods as aggregation methods. The simplest example is the ensemble mean, which usually brings limited improvement, if any (depending on the target), compared to the best model in the ensemble [McKeen et al., 2005; van Loon et al., 2007]. Other methods associate weights with the models, using the past performance of the models against the observations. Pagowski et al. [2005, 2006] applied linear regression methods, which resulted in strong improvements. The weights were computed per observation station, which did not enable the forecast of 2D fields of concentrations. Moreover, dynamic linear regression is usually not a robust method [West and Harrison, 1999].

    Copyright 2008 by the American Geophysical Union. 0148-0227/08/$9.00

    In Mallet and Sportisse [2006a], the weights only depend on time (so, not on the location) and were computed with least-squares optimization, as in Krishnamurti et al. [2000]. This led to significant improvements in the forecasts. The method seemed quite robust for ozone forecasts, but it is an empirical method, without any theoretical guarantee.

    In this paper, we apply some new methods developed by the machine learning community. Just like the ones discussed above, they perform sequential aggregation based on ensemble simulations and past observations. These machine learning methods come with a strong mathematical background [Cesa-Bianchi and Lugosi, 2006]. They provide theoretical bounds on the discrepancy between the performance of the best model combinations and the performance of the aggregated forecast, for any possible sequence of observations. Hence, they can provide a reliable and robust framework for ensemble forecasting. The basics of learning algorithms, and detailed explanations of two algorithms and their variants, are addressed in section 2.

    We rely on the same ensemble of ozone forecasts as in Mallet and Sportisse [2006a]. This ensemble was built with the Polyphemus system [Mallet et al., 2007b] and is made of 48 models run during four months in 2001 and over Europe. The experiment setup is briefly explained in section 3.

    The learning algorithms are applied to ozone daily peaks and to ozone hourly concentrations. Their results are reviewed in section 4. The analysis addresses issues like the robustness of the forecasts and the ability of the methods to capture extreme events.

    Comparison to Previous Machine Learning Approaches in the Field

    Machine learning aims at designing and developing algorithms that can be implemented on computers to make automatic decisions or predictions. (Here, we are interested in predicting the concentrations of a pollutant.) One way of making good decisions is of course to consider a statistical procedure based on a preliminary estimation step. However, not all machine learning algorithms rely on the estimation of statistical parameters. The sequential aggregation techniques described in this paper are examples of such machine learning algorithms not resorting to estimation.

    Quite different machine learning techniques have already been intensively used in the atmospheric sciences, in particular neural networks; see, e.g., Lary et al. [2004]; Loyola [2006]. Neural networks combine different input data in a non-linear way. However, their design is usually a difficult task in view of all the possible combinations involved (number of hidden layers, choice of the input categories). In addition to the choice of the structure of the neural network, weights have to be associated with each possible input and hidden layer, and these weights are also chosen by the user. In total, one obtains this way a given model, whose performance is difficult to study from a theoretical point of view.

    In this paper, we are concerned with the aggregation of several models, some of them possibly given by different neural networks. We thus work at a meta-model level. It is true that neural networks could be used at this level too, to combine some base models in a non-linear way. However, we focus below on learning techniques that need only one, or maybe two, user choices, e.g., the penalization factor $\lambda$ of the ridge regression forecaster of section 2.3 or the learning rate $\eta$ of the exponentiated gradient forecaster of section 2.4. Having very few parameters to set up is the only way for the procedure to be carried out in some automatic manner by a computer.

    2. Learning Methods as Ensemble Forecasters

    2.1. Principle and Notation

    Ensemble forecasts are based on a set of models (a multimodel ensemble) $\mathcal{M} = \{1, \ldots, N\}$. Each model may have its own physical formulation, numerical formulation and input data (see section 3). Its forecasts are compared against measurements from a network of monitoring stations $\mathcal{N} = \{1, \ldots, S\}$. (The indexations of $\mathcal{M}$ and $\mathcal{N}$ are made in an arbitrary order.) At station $s \in \mathcal{N}$ and time index $t \in \{1, \ldots, T\}$ (the indexes of the days or of the hours), the prediction $x_{m,t}^s$ of model $m \in \mathcal{M}$ is compared to the observation $y_t^s$. In practice, the performance of model $m$ is assessed with a root mean square error including observations from all stations and all time indexes (see section 2.2.2).

    The general idea is to combine linearly the predictions of the models to get a more accurate forecast. A weight is associated with each model of the ensemble so as to produce an improved aggregated prediction $\hat{y}_t^s = \sum_{m=1}^N v_{m,t}\, x_{m,t}^s$.

    The weights $v_{m,t}$ may also depend on the station $s$. In this paper, we essentially focus on weights independent of the stations so that the linear combination should still be valid away from the stations. Otherwise, the ensemble methods would compete with purely statistical models whose results are very satisfactory at a low cost [e.g., ESQUIF, 2001].

    2.2. Sequential Aggregation of Models

    The weights of the combination depend of course on the past predictions of each model and on the past observations. The formalized dependence is called the aggregation rule, and its result (the linear combination) is the aggregated forecaster. It does not attempt to model the evolution of the concentrations but simply uses the models as black boxes and aims at performing better than the best of them. The aggregation rule does not rely on any stochastic assumption on the evolution of the simulated or observed concentrations. This paper is at a meta-level of prediction: it explains how to get improved performance given an ensemble of models whose preliminary construction is barely dealt with here. One may take any set of models one trusts. Stochastic or statistical modeling might also be used to construct one or several models; the aggregation rule, since it usually improves on the models, will then benefit automatically from it.

    Compared to Mallet and Sportisse [2006a], one contribution of this paper lies in the significant improvements of the root mean square errors of the aggregated forecasts. Another key contribution is that the learning algorithms are theoretically grounded: explicit bounds on their practical performance can be exhibited.

    2.2.1. Definition of a Sequential Aggregation Rule

    At each time index $t$, a linear sequential aggregation rule produces a weight vector $v_t = (v_{1,t}, \ldots, v_{N,t}) \in \mathbb{R}^N$ based on the past observations $y_1^s, \ldots, y_{t-1}^s$ (for all $s \in \mathcal{N}$) and the past predictions $x_{m,1}^s, \ldots, x_{m,t-1}^s$ (for all $s \in \mathcal{N}$ and $m \in \mathcal{M}$). The final prediction at $t$ is then obtained by linearly combining the predictions of the models according to the weights given by the components of the vector $v_t$. More precisely, the aggregated prediction for station $s$ at time index $t$ equals

    $$\hat{y}_t^s = v_t \cdot x_t^s = \sum_{m=1}^N v_{m,t}\, x_{m,t}^s. \qquad (1)$$

    Convex sequential aggregation rules constrain the weight vector $v_t$ to indicate a convex combination of $N$ elements, which means that $\sum_{m=1}^N v_{m,t} = 1$ with $v_{m,t} \geq 0$. In this case, we use the notation $v_t = p_t$. When the weights are left unconstrained and can possibly be any vector of $\mathbb{R}^N$, they are denoted by $v_t = u_t$.

    Note that it is always possible to apply the aggregation rule per station. In this case, a weight vector $v_t^s$ is produced for each station $s$.

    2.2.2. Assessment of the Quality of a Sequential Aggregation Rule

    Not all stations are active at a given time index $t$. We denote by $N_t \subset \mathcal{N}$ the set of active stations on the network $\mathcal{N}$ at time $t$. These are the stations $s$ that monitored the ozone concentrations $y_t^s$. When we indicated above that aggregation rules rely on past predictions and past observations, we of course meant at the active stations. We can assess the quality of our strategies only on active stations. This is why the measure of performance, the well-known root mean square error (rmse), of a rule $\mathcal{A}$ is defined as

    $$\mathrm{rmse}(\mathcal{A}) = \sqrt{\frac{1}{\sum_{t=t_0}^T |N_t|}\, \sum_{t=t_0}^T \sum_{s \in N_t} \bigl(v_t \cdot x_t^s - y_t^s\bigr)^2} \qquad (2)$$

    where $t_0$ is the first time index when the evaluation starts (hence $1 \leq t_0 < T$), and $|N_t|$ is the cardinality (i.e., the number of elements) of the set $N_t$. One may choose $t_0 > 1$ so that the time period $\{1, \ldots, t_0 - 1\}$ serves as a short spin-up period for the aggregation rules.
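    As a concrete illustration of equation (2), here is a minimal sketch in Python. The data layout is ours, not the authors': observations are stored in an array with one row per time index and one column per station, with NaN marking the inactive (t, s) pairs.

    ```python
    # Minimal sketch of the rmse of equation (2); illustrative data layout:
    # weights: (T, N); predictions: (T, S, N); observations: (T, S) with NaN
    # at the (t, s) pairs where station s is inactive at time t.
    import numpy as np

    def rmse(weights, predictions, observations, t0):
        combined = np.einsum("tn,tsn->ts", weights, predictions)  # v_t . x_t^s
        sq_err = (combined - observations)[t0 - 1:] ** 2          # evaluation from t0 on (t0 is 1-based)
        # nanmean sums over the active (t, s) pairs and divides by their number,
        # which is exactly the normalization by sum_t |N_t| in equation (2)
        return np.sqrt(np.nanmean(sq_err))
    ```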

    In the sequel, a model will be said to be the best for a given set of observations if it has the lowest rmse.

    2.2.3. Reference Performance Measures

    We indicate below some challenging performance measures, in terms of rmse, beyond which it is impossible or difficult to go.

    The best performance is the following rmse that no forecaster, even knowing the observations beforehand, can beat:

    $$B_p = \sqrt{\frac{1}{\sum_{t=t_0}^T |N_t|}\, \sum_{t=t_0}^T \min_{u_t \in \mathbb{R}^N} \sum_{s \in N_t} \bigl(u_t \cdot x_t^s - y_t^s\bigr)^2}. \qquad (3)$$


    It should be seen as the potential of sequential aggregation. We introduce three other reference performance measures. The first one picks the best constant linear combination in $\mathbb{R}^N$ over the period $\{t_0, \ldots, T\}$:

    $$B_{\mathbb{R}^N} = \min_{u \in \mathbb{R}^N} \sqrt{\frac{1}{\sum_{t=t_0}^T |N_t|}\, \sum_{t=t_0}^T \sum_{s \in N_t} \bigl(u \cdot x_t^s - y_t^s\bigr)^2}. \qquad (4)$$

    The second one corresponds to the best constant convex combination. We denote by $\mathcal{X}$ the set of all vectors $p$ indicating convex combinations over $N$ elements and define

    $$B_{\mathcal{X}} = \min_{p \in \mathcal{X}} \sqrt{\frac{1}{\sum_{t=t_0}^T |N_t|}\, \sum_{t=t_0}^T \sum_{s \in N_t} \bigl(p \cdot x_t^s - y_t^s\bigr)^2}. \qquad (5)$$

    The third reference performance is that of the best model:

    $$B_{\mathcal{M}} = \min_{m=1,\ldots,N} \sqrt{\frac{1}{\sum_{t=t_0}^T |N_t|}\, \sum_{t=t_0}^T \sum_{s \in N_t} \bigl(x_{m,t}^s - y_t^s\bigr)^2}. \qquad (6)$$

    Note that, by definition and in view of the inclusions between the sets over which the minima are taken, it always holds that $B_p \leq B_{\mathbb{R}^N} \leq B_{\mathcal{X}} \leq B_{\mathcal{M}}$. Competing against convex or even linear combinations of models is by far more challenging in terms of rmse than simply competing against the best model. This will be illustrated in section 4, devoted to numerical results.

    2.2.4. Minimization of the rmse via Minimization of the Regret

    Given a sequential aggregation rule $\mathcal{A}$, and the weight vectors $v_t$ it chose at time indexes $t = 1, \ldots, T$, we want to compare its rmse to one of the reference performance measures. To do so, we define $L_T(\mathcal{A})$ and $L_T(v)$ as the cumulative square errors (over all time indexes, and not only after $t_0$) of the rule $\mathcal{A}$ and of the constant linear combination $v$,

    $$L_T(\mathcal{A}) = \sum_{t=1}^T \sum_{s \in N_t} \bigl(v_t \cdot x_t^s - y_t^s\bigr)^2, \qquad (7)$$

    $$L_T(v) = \sum_{t=1}^T \sum_{s \in N_t} \bigl(v \cdot x_t^s - y_t^s\bigr)^2. \qquad (8)$$

    The rules of interest, like the two discussed in sections 2.3 and 2.4, should ensure that the difference

    $$R_T(v) = L_T(\mathcal{A}) - L_T(v) \qquad (9)$$

    is small, i.e., $o(T)$, for all $v \in W$. The comparison set $W$ is $\mathbb{R}^N$ or $\mathcal{X}$, depending on whether the rule is a linear aggregation rule or a convex aggregation rule.

    The difference $R_T(v)$ is termed the regret of the rule $\mathcal{A}$. The rmse of $\mathcal{A}$ may be bounded in terms of the regret as

    $$\bigl(\mathrm{rmse}(\mathcal{A})\bigr)^2 \leq \inf_{v \in W} \left\{ \bigl(\mathrm{rmse}(v)\bigr)^2 + \frac{R_T(v)}{\sum_{t=t_0}^T |N_t|} + \frac{1}{\sum_{t=t_0}^T |N_t|} \sum_{t=1}^{t_0 - 1} \sum_{s \in N_t} \bigl(v \cdot x_t^s - y_t^s\bigr)^2 \right\}. \qquad (10)$$

    This infimum is small: in the limit as $T \to \infty$, it equals $B_{\mathbb{R}^N}^2$ if $W = \mathbb{R}^N$, or $B_{\mathcal{X}}^2$ if $W = \mathcal{X}$. This is because the regret is sublinear, as is guaranteed by the learning techniques of interest; we refer to the bounds on the regret proposed below in equations (12) and (19) and to the comments that follow them. As a consequence, the averaged regret term $R_T(v) / \sum_{t=t_0}^T |N_t|$ tends to 0. And so does the third term of the right-hand side of equation (10), which is a constant divided by something of the order of $T$. Therefore, in total, the rmse of the aggregation rule $\mathcal{A}$ tends to be at least as small as the rmse of the best constant linear or convex combination.
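    For the reader's convenience, here is the short derivation of the bound (10) from definitions (2) and (7)-(9); this one-line argument is ours, not a quotation from the references. Since the rmse only involves the rounds $t \geq t_0$,

    $$\Bigl(\sum_{t=t_0}^T |N_t|\Bigr)\bigl(\mathrm{rmse}(\mathcal{A})\bigr)^2 = \sum_{t=t_0}^T \sum_{s \in N_t} \bigl(v_t \cdot x_t^s - y_t^s\bigr)^2 \leq L_T(\mathcal{A}) = L_T(v) + R_T(v),$$

    and $L_T(v)$ splits into the spin-up rounds $t < t_0$ (the third term of (10)) plus $\bigl(\sum_{t=t_0}^T |N_t|\bigr)\bigl(\mathrm{rmse}(v)\bigr)^2$; dividing by $\sum_{t=t_0}^T |N_t|$ and taking the infimum over $v \in W$ yields (10).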

    In other words, the machine learning algorithms of interest here guarantee that, in the long run, the overall performance of their aggregated forecasts is at least as good as the performance of the best constant combination. This result holds whatever the sequence of observations and predictions may be, and without any stochastic assumption. This makes the methods of interest very efficient and very robust. We now describe two such methods.

    2.3. A First Aggregated Forecaster: Ridge Regression

    Statement

    The ridge regression forecaster $\mathcal{R}_\lambda$ is presented, for instance, in Cesa-Bianchi and Lugosi [2006, section 11.7]. It is parameterized by $\lambda \geq 0$ and chooses $u_1 = (0, \ldots, 0)$; then, for $t \geq 2$,

    $$u_t = \operatorname*{argmin}_{u \in \mathbb{R}^N} \left\{ \lambda \|u\|_2^2 + \sum_{t'=1}^{t-1} \sum_{s \in N_{t'}} \bigl(u \cdot x_{t'}^s - y_{t'}^s\bigr)^2 \right\}. \qquad (11)$$

    The interpretation is as follows. $\mathcal{R}_0$ is also called the "follow-the-leader forecaster" (in a least-squares regression sense) because it uses, at any time index $t \geq 2$, the linear combination that would have been the best over the past. $\mathcal{R}_\lambda$ is defined similarly, except that its definition includes a penalization term $\lambda \|u\|_2^2$ to keep the magnitude of the chosen $u_t$ small and to have smoother variations from one $u_t$ to the next $u_{t+1}$.

    Bound on the regret

    An adaptation of the performance bound given in Cesa-Bianchi and Lugosi [2006, section 11.7] is presented in Mallet et al. [2007a, section 12]. For all $\lambda > 0$ and all $u \in \mathbb{R}^N$,

    $$R_T(u) \leq \frac{\lambda}{2}\,\|u\|_2^2 + S C^2 \sum_{i=1}^N \ln\left(1 + \frac{\mu_i}{\lambda}\right) \qquad (12)$$

    where $S$ is the total number of monitoring stations,

    $$C = \max_{t=1,\ldots,T}\; \max_{s \in N_t} \bigl|u_t \cdot x_t^s - y_t^s\bigr| \qquad (13)$$

    and $\mu_1, \ldots, \mu_N$ are the eigenvalues of the matrix

    $$\sum_{t=1}^T \sum_{s \in N_t} x_t^s \bigl(x_t^s\bigr)^\top \quad (= A_{T+1} - \lambda I_N \text{ with the notation below}). \qquad (14)$$

    Since each $u_t$ depends in an awkward way on the parameter $\lambda$, we cannot propose any theoretically optimal parameter minimizing the upper bound; we simply mention that it is obvious from the bound that $\lambda$ should not be too large while also remaining bounded away from 0. However, as noted in Cesa-Bianchi and Lugosi [2006, section 11.7], the typical order of magnitude of the bound is $O(\ln T)$, as the $\mu_i = O(ST)$.

    Implementation

    Parameters: penalization factor $\lambda \geq 0$.
    Initialization: $u_1 = (0, \ldots, 0)$, $A_1 = \lambda I_N$.
    For each time index $t = 1, 2, \ldots, T$:

    1. predict with $u_t$;

    2. compute, with the predictions $x_t^s$,

    $$A_{t+1} = A_t + \sum_{s \in N_t} x_t^s \bigl(x_t^s\bigr)^\top; \qquad (15)$$


    3. get the observations $y_t^s$, and compute

    $$u_{t+1} = u_t - A_{t+1}^{-1} \left( \sum_{s \in N_t} \bigl(u_t \cdot x_t^s - y_t^s\bigr)\, x_t^s \right) \qquad (16)$$

    (where $A_{t+1}^{-1}$ denotes the pseudo-inverse of $A_{t+1}$ in case $A_{t+1}$ is not invertible).
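    The steps above translate almost line by line into code. Below is a minimal sketch of this implementation (ours, with the same illustrative data layout as in the rmse sketch of section 2.2.2), not the authors' operational code:

    ```python
    # Sketch of the ridge regression forecaster, implementation steps (15)-(16).
    import numpy as np

    def ridge_forecaster(predictions, observations, lam=100.0):
        """predictions: (T, S, N); observations: (T, S), NaN when inactive.
        Returns the weight vectors u_1, ..., u_T as a (T, N) array."""
        T, S, N = predictions.shape
        u = np.zeros(N)                        # u_1 = (0, ..., 0)
        A = lam * np.eye(N)                    # A_1 = lambda I_N
        weights = np.empty((T, N))
        for t in range(T):
            weights[t] = u                     # step 1: predict with u_t
            active = ~np.isnan(observations[t])
            x = predictions[t, active]         # (|N_t|, N), rows are the x_t^s
            y = observations[t, active]
            A += x.T @ x                       # step 2: equation (15)
            grad = x.T @ (x @ u - y)           # sum_s (u_t . x_t^s - y_t^s) x_t^s
            u = u - np.linalg.pinv(A) @ grad   # step 3: equation (16), pseudo-inverse
        return weights
    ```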

    2.4. A Second Aggregated Forecaster: Exponentiated Gradient

    Statement

    The basic version of the exponentiated gradient forecaster $\mathcal{E}_\eta$ is presented, for instance, in Cesa-Bianchi [1999]. It is parameterized by a learning rate $\eta > 0$ and chooses $p_1 = (1/N, \ldots, 1/N)$; then, for $t \geq 2$, $p_t$ is defined as

    $$p_{m,t} = \frac{\exp\Bigl(-\eta \sum_{t'=1}^{t-1} \tilde{\ell}_{m,t'}\Bigr)}{\sum_{j=1}^N \exp\Bigl(-\eta \sum_{t'=1}^{t-1} \tilde{\ell}_{j,t'}\Bigr)} \qquad (17)$$

    for all $m = 1, \ldots, N$, where

    $$\tilde{\ell}_{m,t'} = \sum_{s \in N_{t'}} 2\bigl(p_{t'} \cdot x_{t'}^s - y_{t'}^s\bigr)\, x_{m,t'}^s. \qquad (18)$$

    These $p_t$ all define convex combinations.

    Bound on the regret

    An adaptation of the performance bound given in Cesa-Bianchi [1999] is presented in Mallet et al. [2007a, section 3]. Denoting by $L$ a bound on the $|\tilde{\ell}_{m,t'}|$ (for all $m$ and $t'$), the regret of $\mathcal{E}_\eta$ against all convex combinations is uniformly bounded as

    $$\sup_{p \in \mathcal{X}} R_T(p) \leq \frac{\ln N}{\eta} + \frac{T\eta}{2}\, L^2 = L\sqrt{2T \ln N} = O(\sqrt{T}), \qquad (19)$$

    where the last two equalities hold for the (theoretical) optimal choice $\eta^* = L^{-1}\sqrt{(2\ln N)/T}$.

    2.5. Two Variants of the Previous Aggregated Forecasters, Obtained by Windowing or Discounting

    2.5.1. Windowing

    Windowing relies on the following heuristic: the recent past indicates a trend in the current air quality, but the far away past is not very informative. Windowing is a way to account only for the most recent past. It formally consists in using the results of at most $t_1$ past dates to form the prediction. This will be referred to by a superscript $w(t_1)$.

    For instance, the windowing version of ridge regression, $\mathcal{R}_\lambda^{w(t_1)}$, is the same as $\mathcal{R}_\lambda$ for times $t \leq t_1 + 1$, but for $t > t_1 + 1$ it uses $u_t$ defined by

    $$u_t = \operatorname*{argmin}_{u \in \mathbb{R}^N} \left\{ \lambda \|u\|_2^2 + \sum_{t'=t-t_1}^{t-1} \sum_{s \in N_{t'}} \bigl(u \cdot x_{t'}^s - y_{t'}^s\bigr)^2 \right\}; \qquad (20)$$

    it differs from (11) only by the starting index of the first summation. See Mallet et al. [2007a, section 14].

    The windowing version of exponentiated gradient, $\mathcal{E}_\eta^{w(t_1)}$, predicts as $\mathcal{E}_\eta$ for times $t \leq t_1 + 1$, but for $t > t_1 + 1$ it uses $p_t$ defined by

    $$p_{m,t} = \frac{\exp\Bigl(-\eta \sum_{t'=t-t_1}^{t-1} \tilde{\ell}_{m,t'}\Bigr)}{\sum_{j=1}^N \exp\Bigl(-\eta \sum_{t'=t-t_1}^{t-1} \tilde{\ell}_{j,t'}\Bigr)} \qquad (21)$$

    for all $m = 1, \ldots, N$, where $\tilde{\ell}_{m,t'}$ is defined in equation (18). Again, the differences to (17) concern the indexes of the summations. See Mallet et al. [2007a, section 5].

    A drawback of this technique is that no non-trivial theoretical bound can be exhibited, because no sublinear bound on the regret can hold.

    2.5.2. Discounting

    Discounting relies on the same heuristic. For the farthest away past to have a limited influence, the discrepancies with the observations are weighted according to their closeness to the present (the closer, the higher the weight). The weights $\beta = (\beta_t)_{t \geq 1}$ are given by a decreasing sequence of positive numbers, which will be referred to by a superscript $\beta$.

    For instance, the $\beta$-discounted version of ridge regression, $\mathcal{R}_\lambda^\beta$, chooses $u_1 = (0, \ldots, 0)$ and, for $t \geq 2$,

    $$u_t = \operatorname*{argmin}_{u \in \mathbb{R}^N} \left\{ \lambda \|u\|_2^2 + \sum_{t'=1}^{t-1} \bigl(1 + \beta_{t-t'}\bigr) \sum_{s \in N_{t'}} \bigl(u \cdot x_{t'}^s - y_{t'}^s\bigr)^2 \right\}. \qquad (22)$$

    See Mallet et al. [2007a, section 13]. In our experiments, we took $\beta_{t-t'} = 100/(t-t')^2$ for ridge regression.

    The $\beta$-discounted version of exponentiated gradient, $\mathcal{E}_\eta^\beta$, chooses $p_1 = (1/N, \ldots, 1/N)$ and, for $t \geq 2$, $p_t$ is defined as

    $$p_{m,t} = \frac{\exp\Bigl(-(\eta/\sqrt{t}) \sum_{t'=1}^{t-1} \bigl(1 + \beta_{t-t'}\bigr)\, \tilde{\ell}_{m,t'}\Bigr)}{\sum_{j=1}^N \exp\Bigl(-(\eta/\sqrt{t}) \sum_{t'=1}^{t-1} \bigl(1 + \beta_{t-t'}\bigr)\, \tilde{\ell}_{j,t'}\Bigr)} \qquad (23)$$

    for all $m = 1, \ldots, N$, where $\tilde{\ell}_{m,t'}$ is defined in equation (18). See Mallet et al. [2007a, section 6], which provides a theoretical performance bound for this forecaster under a suitable choice of $\beta$. In our experiments, we took $\beta_{t-t'} = 1/(t-t')^2$ for the exponentiated gradient forecaster.

    2.6. Remarks

    2.6.1. Addition of Aggregated Forecasts in the Ensemble

    All forecasters presented in this paper (with the exception of their windowing versions) are competitive with respect to all possible constant convex or linear combinations of the $N$ base forecasting models. The parameter $N$ enters merely in a logarithmic way in most of the performance bounds, so that, from a theoretical viewpoint, some more base forecasting models, say $N'$ of them, can be added at a negligible theoretical price.

    For instance, the learning methods do not make use of any stochastic assumption. But if one suspects that a stochastic modeling would be useful, then a model exploiting this can be added to the ensemble. Since the aggregated forecaster usually improves on each of the individual models, it will then benefit from this modeling.

    The additional $N'$ models can also be given by methods that are supported by intuition and for which no precise theoretical guarantee can be exhibited. Running the learning algorithms then leads to an aggregated forecaster that exploits the intuitions that yielded the $N'$ new models. At the same time, the theoretical performance guarantees still hold. They even improve, since all convex or linear combinations over the first $N$ models are combinations over the $N + N'$ models.

    To illustrate this, we tested at some point the addition of the aggregated forecasters $\mathcal{R}_0^{w(10)}$, $\mathcal{R}_0^{w(20)}$ and $\mathcal{R}_0^{w(30)}$ to the 48 base models. This forms the 51-model ensemble. The description of the three additional forecasters follows from a combination of sections 2.3 and 2.5.1: they predict with the weights of the best constant linear combination, in the least-squares sense and over the learning window made of the 10, 20 or 30 previous dates. These three aggregated forecasters were already introduced in Mallet and Sportisse [2006a] under the notation elsd (which stands for ensemble least-squares method per date). In Mallet et al. [2007a], they are denoted by els10, els20 and els30.

    2.6.2. Per Station and Hourly Predictions

    Any learning algorithm may be applied for predictions per station. In this case, each station $s$ has its own sequence of weights $v_t^s$ computed with the algorithm. The algorithm is independently applied to each station (on each singleton network $\mathcal{N} = \{s\}$). The sole difference is that when a station is unavailable on a given day, the learning step is skipped and the last available weights at the station are used for the next day. Some results are shown in the appendix.

    In the case of hourly forecasts, an algorithm is independently applied to each hour of the day. For instance, the algorithm predicts the weights for the forecast at 15:00 UT based only on the observations and the predictions at 15:00 UT on the previous days. This strategy is expected to be more efficient because it uses information from forecasts in comparable situations (e.g., with respect to the emissions or to the height of the planetary boundary layer). It already exhibited better performance in Mallet and Sportisse [2006a].

    3. Experiment Setup

    3.1. Ensemble Design

    We define a numerical model for ozone forecasting by 1. its physical formulation, 2. its numerical discretization, and, for convenience, 3. its input data.

    Eulerian chemistry-transport models all solve the same given reactive-transport equation, but they include different physical parameterizations to estimate the coefficients of the equation. Many alternative parameterizations are available to compute a coefficient or a set of related coefficients in the equation. For instance, a given model includes a limited number of chemical species whose reactions are described by a chemical mechanism. Several chemical mechanisms were developed for ozone [Gery et al., 1989; Carter, 1990; Stockwell et al., 1990, 1997] and incorporated in chemistry-transport models.

    Similarly, several numerical schemes were proposed, e.g., to handle the strong concentration gradients in the vicinity of emission sources, or to deal with the stiffness of gas-phase chemistry [see, e.g., Hundsdorfer and Verwer, 2003]. Other numerical issues are addressed in the models, such as the distribution of the vertical layers, the space steps, and the time steps. Consequently, every model has its own numerical formulation.

    In addition, an air quality simulation involves a wide set of input data: land data, chemical data, meteorological fields, and emissions. There are alternative databases, and the data is often uncertain, with uncertainties often ranging from 30 % to 100 % [e.g., Hanna et al., 1998]. Hence the modeler is doomed to make choices at that stage as well. Thereafter, for convenience, we consider that the selected input data (to a model) is actually part of the model. Thus a model is also defined by a set of physical data.

    In this paper, our ensemble encompasses most sources of uncertainty. We rely on models built with different physical parameterizations, numerical discretizations, and data sets. The same ensemble as in Mallet and Sportisse [2006a] is used; see that paper for further details. A similar ensemble was thoroughly analyzed in Mallet and Sportisse [2006b]. The models are generated within the air quality modeling system Polyphemus [Mallet et al., 2007b], which is flexible enough to build models that behave in drastically different manners, see Figure 1.

    3.2. Simulations over Europe

    The ensemble simulations cover four months, essentially in summer 2001, over Western Europe. In this paragraph, we provide the main characteristics common to all simulations. The domain is [10.25° W, 22.25° E] × [40.25° N, 56.75° N]. The horizontal resolution is 0.5°. The meteorological fields are provided by the ECMWF (12-hour forecast cycles starting from analyzed fields). Raw emissions are retrieved from the EMEP database and chemically processed according to Middleton et al. [1990]. Biogenic emissions are computed as in Simpson et al. [1999]. The boundary conditions are computed by Mozart 2 [Horowitz et al., 2003].

    Among the changes in the models, those related to the vertical diffusion parameterization [Louis, 1979; Troen and Mahrt, 1986] and to the chemical mechanism (RADM 2 [Stockwell et al., 1990] and RACM [Stockwell et al., 1997]) have the most prominent impacts on the ozone concentrations.

    [Figure 1. Ozone daily profiles of the 48 models. The concentrations are in µg m⁻³ and are averaged over Europe (at ground level) and over the four months of the simulation. The ensemble shows a wide spread, even on these strongly-averaged hourly concentrations.]

    [Figure 2. Map of best model indexes. In each cell of the domain, the color shows which model (marked with its index, in {0, ..., 47}) gives the best ozone peak forecast on 7 May 2001 at the closest station to the cell center. Note that the outermost regions have almost no monitoring stations: the best model can barely be identified there. Despite all, there are enough stations to produce a highly fragmented map in which a significant number of models deliver the best forecast in some region. In addition, this map strongly changes from one day to the next.]

    The resulting ensemble contains 48 models or members. It shows a wide spread (Figure 1).

    3.3. Individual Performance Measures of the Ensemble Members

    Although the ensemble shows a wide spread, all models bring useful information to the ensemble. For instance, at a given date, the ozone peaks forecast over Europe may vary strongly from one model to another, and many models turn out to be the best in some region. This is shown in Figure 2.

    The models are evaluated over the last 96 days of the simulation. The first 30 days are excluded from the evaluation because they are considered as a learning period for the ensemble algorithms. With the notation of section 2.2.2, the first forecast date with the ensemble methods is at $t_0 = 31$.

    As in Mallet and Sportisse [2006a], three networks are used:

    – Network 1 is composed of 241 urban and regional stations, primarily in France and Germany (116 and 81 stations respectively). It provides about 619 000 hourly concentrations and 27 500 peaks.

    – Network 2 includes 85 EMEP stations (regional stations distributed over Europe), with about 240 000 hourly observations and 10 400 peaks.

    – Network 3 includes 356 urban and regional stations in France. It provides 997 000 hourly measurements and 42 000 peaks. Note that it includes most of the French stations of network 1.

    The study is chiefly carried out with network 1 (including the calibration of the parameters). The performance of the learning methods on the two other networks is as good as on network 1, even though all parameters are determined from the optimization on network 1. See section 4.4.1 for more details and a summary of the results for networks 2 and 3.

    The performance of the individual models is first measured with the root mean square error (rmse) over network 1 and is presented in Figure 3. The best model, see equation (6), has a rmse of $B_{\mathcal{M}} = 22.43$ µg m⁻³. This is the reference performance to beat with the ensemble methods.

    [Figure 3. Root mean square errors (rmse, in µg m⁻³) for the 48 members in the ensemble. The error is computed with all ozone peak observations of network 1 during the last 96 days of the simulations. Here the models are sorted according to their rmse.]

    Table 1. Reference Performance Measures of Section 2.2.3 (rmse, µg m⁻³) over the Last 96 Simulated Days.

    Daily Peaks

    Network   $B_p$    $B_{\mathbb{R}^N}$   $B_{\mathcal{X}}$   $B_{\mathcal{M}}$
    Net. 1    11.99    19.24                21.45               22.43
    Net. 2     8.47    18.16                20.91               21.90
    Net. 3    12.46    20.26                22.60               23.87

    Hourly Concentrations

    Network   $B_p$    $B_{\mathbb{R}^N}$   $B_{\mathcal{X}}$   $B_{\mathcal{M}}$
    Net. 1    14.88    22.80                24.81               26.68
    Net. 2    12.03    23.52                24.51               25.98
    Net. 3    15.32    23.19                26.28               28.45

    4. Results

    For convenience, the unit of the rmses (µg m⁻³) is omitted in this section. Unless otherwise mentioned, the rmses are computed with the observations of network 1.

    As introduced in section 2.6.1, we consider two ensembles: the 48-model ensemble and a 51-model ensemble including $\mathcal{R}_0^{w(10)}$, $\mathcal{R}_0^{w(20)}$ and $\mathcal{R}_0^{w(30)}$ as additional models. These three aggregated models are added due to their good performance (20.78, 20.27, and 20.18, respectively) and due to their natural formulation, derived from a least-squares minimization.

    Table 2. Performance (rmse) of the Aggregated Forecasters for Ozone Daily Peaks over the Last 96 Simulated Days on Network 1. The Ensemble Is Made of the 48 Base Forecasting Models.

    $B_{\mathbb{R}^N}$   $B_{\mathcal{X}}$   $B_{\mathcal{M}}$
    19.24                21.45               22.43

    $\mathcal{R}_{100}$   $\mathcal{E}_{2\times 10^{-5}}$   $\mathcal{R}_{100}^{w(45)}$   $\mathcal{E}_{2\times 10^{-5}}^{w(83)}$   $\mathcal{R}_{1000}^{\beta}$   $\mathcal{E}_{1.2\times 10^{-4}}^{\beta}$
    20.77                 21.47                             20.03                         21.37                                     19.45                          21.31

    $\mathcal{R}_0^{w(10)}$   $\mathcal{R}_0^{w(20)}$   $\mathcal{R}_0^{w(30)}$
    20.78                     20.27                     20.18

    [Figure 4. Weights associated by $\mathcal{E}_{2\times 10^{-5}}$ to the 48 models against the time indexes (i.e., days). Because of the exponentiated-gradient formulation, the weights are positive and sum up to 1. Several models contribute to the linear combination, and the weights may vary quickly, even at the end of the time period.]

  • MALLET, STOLTZ, AND MAURICETTE: OZONE ENSEMBLES AND MACHINE LEARNING X - 7

    In sections 4.2 and 4.3, all results are for daily peaks and on network 1.

    4.1. Reference Performance Measures

    In order to assess the possible improvements of ensemble forecasts, we provide the reference performance measures of section 2.2.3 in Table 1. The most desirable performance to reach on network 1 and for daily peaks is $B_p \simeq 12$. It is certainly impossible to fulfill this objective because of the erratic time evolution of the associated weights [Mallet and Sportisse, 2006a], but this demonstrates the potential of combining the model forecasts.

    A more reasonable objective is to compete against the best constant linear combination, since several aggregated forecasters are guaranteed to have similar performance in the long run. The reference performance for daily peaks and on network 1 is as small as 19.24, which is a strong improvement compared to the best-model performance, $B_{\mathcal{M}} = 22.43$.

    More generally, Table 1 shows that the reference rmses of convex combinations are roughly 5 % smaller (10 % for hourly concentrations) than those of the best models, and that an additional 10 % improvement is obtained by considering all linear combinations in $\mathbb{R}^N$.

    4.2. Performance of Aggregated Forecasters

    The performance of several aggregated forecasters is shown in Table 2. Note that the parameters of the algorithms were tuned (off-line) on network 1 to get the best performance [Mallet et al., 2007a]. Section 4.4.1 will show that these tunings are robust and seem to be valid for a wide range of situations. Based on our experience, the methods are not too sensitive to the parameters, and optimal parameters (e.g., those proposed here) can be applied to different networks, different periods of time, different target concentrations and even different ensembles. Note also that the theoretical guarantees hold for any value of the parameters (e.g., any penalization factor $\lambda > 0$ in the ridge regression algorithm).

    Table 2 and the results of many other learning algorithms [Mallet et al., 2007a] indicate that ridge regression-type aggregated forecasters show the best performance. Several algorithms get results lower than or close to the best constant convex combination ($B_{\mathcal{X}} = 21.45$), but none except ridge regression may compete against the best constant linear combination ($B_{\mathbb{R}^N} = 19.24$). The use of the aggregated forecaster $\mathcal{R}_{1000}^{\beta}$ results in a rmse of 19.45; thus, the theoretical guarantee of performing similarly, in the long run, to the best constant linear combination is essentially achieved.

    [Figure 5. Weights associated by $\mathcal{R}_{1000}^{\beta}$ to the 48 models against the time indexes (i.e., days). The forecaster attributes significant weights to a large set of models.]

    It is noteworthy that ridge regression is of a similar nature as the least-squares methods. However, the algorithm benefits from the penalization term driven by $\lambda > 0$; taking $\lambda = 0$ leads to worse performance. Including $\mathcal{R}_0^{w(10)}$, $\mathcal{R}_0^{w(20)}$, and $\mathcal{R}_0^{w(30)}$ in the ensemble leads to strong improvements in the performance of several algorithms; see the technical report [Mallet et al., 2007a] for further details. This shows that the aggregated forecasters are generally driven by the best performance in the ensemble. Adding aggregated predictions to the ensemble, as suggested in section 2.6.1, is consequently a meaningful and efficient strategy. Nevertheless, this does not apply to ridge regression with discount, which has a slightly better rmse with the base 48-model ensemble. In this case, its rmse is about 3 µg m⁻³ lower than that of the best model, which is a strong improvement; in Mallet and Sportisse [2006a], the decrease of the rmse was about 2 µg m⁻³, and without theoretical bounds on the method performance.

    A drawback of ridge regression might be its lack of constraints on the weights (and thus, the lack of interpretability of the obtained weights). In contrast, several algorithms produce convex weight vectors. Such weights may be applied far from the stations more safely than the unconstrained weights. The combined predictions will always fall in the envelope of the ensemble predictions, which avoids unrealistic concentrations. For instance, with the 48-model ensemble, the exponentiated-gradient aggregated forecaster performs as well as the best constant convex combination (21.47, while $B_{\mathcal{X}} = 21.45$), which is of course better than the best model ($B_{\mathcal{M}} = 22.43$). The weights computed by $\mathcal{E}_{2\times 10^{-5}}$ are shown in Figure 4.

    In section 4.3, we focus on ridge regression with discount. For further analysis of all results, please refer to the appendix or to Mallet et al. [2007a].

    4.3. Ridge Regression with Discount

    4.3.1. Further Results

    The weights computed by ridge regression with discount are shown in Figure 5. A key feature is the high number of contributing models: significant weights are associated with many models. This is a rather clear difference with most of the other aggregated forecasters (refer to Mallet et al. [2007a] to consult the time evolution of the weights of all algorithms).

    [Figure 6. rmse of $\mathcal{R}_{1000}^{\beta}$ (continuous line) and of the best model (dotted line), against the day (or time index).]

    [Figure 7. rmse against the station index (for 241 stations). In green, $\mathcal{R}_{1000}^{\beta}$; in blue, the best model (over all stations); in black, the best model and the worst model for the station.]

    The correlation (computed with all observations at once, over the last 96 days) between observations and simulated data is improved: the best model (with respect to the rmse) has a correlation of 0.78 while the aggregated forecaster $\mathcal{R}_{1000}^{\beta}$ predicts with a correlation of 0.84. The bias factor, defined as

    $$\frac{1}{\sum_{t=t_0}^T |N_t|} \sum_{t=t_0}^T \sum_{s \in N_t} \frac{v_t \cdot x_t^s}{y_t^s}, \qquad (24)$$

    is 1.06 for the best model and 1.03 for $\mathcal{R}_{1000}^{\beta}$.

    4.3.2. Robustness

    The theory brings guarantees on the performance in the long run. This means that the ridge regression algorithm (with or without discount) is meant to be robust, which is an important feature in day-to-day operational forecasts. In addition to the global performance, one concern is the performance at each single station and on each single day. We need robust aggregation methods that produce reasonable forecasts at all stations and on all days. We must avoid any method that would have a good overall performance but would show disastrous performance at some stations or on some days.

    In practice, we noticed that the aggregated forecasters behave very well not only on average but also at most stations and on most days; $\mathcal{R}_{1000}^{\beta}$ even predicts better than the (overall) best model for 73 % of all observations. In Figure 6, the aggregated forecaster $\mathcal{R}_{1000}^{\beta}$ shows better performance than this best model for most days (83 % of the last 96 days). There is no day when it has an unreasonably high rmse compared to the best model. Interestingly enough, the converse is not true: for some days, the rmse of the best model is much higher than that of $\mathcal{R}_{1000}^{\beta}$. In Figure 7, $\mathcal{R}_{1000}^{\beta}$ performs better than the (overall) best model at 223 stations (93 % of the 241 stations). It performs better than the best model per station (with respect to the rmse at the individual station) at 167 stations (70 % of the stations). Its performance is always better than that of the worst model per station, a non-trivial fact, since the weights for the aggregation are unconstrained, as illustrated, e.g., by Figure 5. One can clearly conclude from these tests that this learning method is robust.

    4.3.3. Do Statistics Miss Extreme Events?

    A usual criticism against statistical methods is that they may be efficient on average but miss the most important events. In daily air-quality forecasts, the most important events to forecast are the highest pollution peaks.

    The methods we apply in this paper are not subject to the same limitations as purely statistical methods because they rely on the physical models. The weights are computed with learning methods, but the forecasts are still carried out by the models. In addition, the methods are responsive enough to allow quick variations of the weights (see Figures 4 and 5).

    [Figure 8. Frequency of improved forecast against the observed concentration, for $\mathcal{R}_{1000}^{\beta}$. The frequency $F_{[o_i, o_{i+1}]}$ is computed as in equation (25). This illustrates that the discounted version of ridge regression is efficient on extreme events. (Remember from section 4.3.2 that $\mathcal{R}_{1000}^{\beta}$ predicts better than the best model for 73 % of all observations, that is, $F_{[0,\infty[} = 0.73$.)]

    To check that an aggregated model still has good skill in predicting extreme events, we count the number of improved forecasts (with respect to the best model) for the highest observed concentrations. We apply this to the discounted version of ridge regression, $\mathcal{R}_{1000}^{\beta}$, whose predictions are denoted $\hat{y}_t^s$. If the predictions of the overall best model are $x_{m,t}^s$ (and $y_t^s$ are the corresponding observations), then we compute the score

    $$F_{[o_i, o_{i+1}]} = \frac{\bigl|\bigl\{(t,s)\ \text{such that}\ |\hat{y}_t^s - y_t^s| < |x_{m,t}^s - y_t^s|\ \text{and}\ y_t^s \in [o_i, o_{i+1}]\bigr\}\bigr|}{\bigl|\bigl\{(t,s)\ \text{such that}\ y_t^s \in [o_i, o_{i+1}]\bigr\}\bigr|} \qquad (25)$$

    for a set of concentration intervals $[o_i, o_{i+1}]$ that cover all observed daily peaks. A perfect score is $F_{[\ldots]} = 1$ (all predictions of $\mathcal{R}_{1000}^{\beta}$ better than the best model) and the worst score is $F_{[\ldots]} = 0$. Figure 8 plots this score. In this figure, the number of observations considered to draw each bar varies of course with the bar; for instance, the left-most and right-most bars are for extreme events, hence with few observations. Starting from the right-most bar and moving to the left, the number of observations per bar is 4, 8, 11, 26, 60, and so on, and the histogram means that, first, the four highest observations are better predicted with learning than with the best model; second, that seven of the eight following observations are better predicted; and so on. Similar results are obtained with the other learning methods. The conclusion is that the extreme events are well caught by our approach.
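    The score (25) is straightforward to compute; a sketch (ours), with predictions and observations flattened over all active (t, s) pairs:

    ```python
    # Sketch of the frequency of improvement of equation (25) on one interval.
    import numpy as np

    def improvement_frequency(y_hat, y_best, y_obs, lo, hi):
        """y_hat: aggregated predictions; y_best: predictions of the overall
        best model; y_obs: observations; [lo, hi]: concentration interval."""
        in_interval = (y_obs >= lo) & (y_obs <= hi)
        improved = np.abs(y_hat - y_obs) < np.abs(y_best - y_obs)
        return (improved & in_interval).sum() / in_interval.sum()
    ```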

    [Figure 9. rmse of $\mathcal{R}_{1000}^{\beta}$ against the number of models in the ensemble. Three sets of ensembles with an increasing number of models are built. They follow the default order [Mallet and Sportisse, 2006a], the increasing rmse order, and the decreasing rmse order.]

    Table 3. Performance (rmse) of $\mathcal{R}_{1000}^{\beta}$ Applied to a Single Model (the First Models of the Ensembles Defined in Section 4.3.4).

    Order             First-model rmse   $\mathcal{R}_{1000}^{\beta}$ rmse
    Default order     24.01              22.43
    Increasing rmse   22.43              21.66
    Decreasing rmse   35.79              24.78

    4.3.4. Influence of the Number of Models

    The 48 models are sorted as in Mallet and Sportisse [2006a], herein referred to as the default order. Note that this order has no link with the performance of the models. 48 ensembles are then built: the first ensemble only includes the first model, the second ensemble includes the first two models, the third ensemble includes the first three models, and so on. $\mathcal{R}_{1000}^{\beta}$ is applied to each ensemble in order to test the influence of the number of models on the performance.

    Since this experiment depends on the inclusion order, two additional orders are tested: 1. from the best model (lowest rmse) to the worst model (highest rmse), and 2. from the worst model to the best model. The results are shown in Figure 9.

    The performance of $\mathcal{R}_{1000}^{\beta}$ increases with the number of models. However, the improvement brought by one additional model rapidly decreases as the ensemble size increases. There is no obvious relation between the performance increase and the individual performance of the models.

    Note that $\mathcal{R}_{1000}^{\beta}$ applied to the first model alone shows a lower rmse than this first model (see Table 3), which indicates that this aggregated forecaster enjoys a property of automatic bias correction. Here, there is only one weight, associated with the single model, and it can indeed be interpreted as a multiplicative bias correction factor.

    4.4. Other Results

    4.4.1. Application to Networks 2 and 3

    Table 4. rmse of the Leading Methods on the Three Networks, with the 48 Base Models Only (Upper Part), and with $\mathcal{R}_0^{w(10)}$, $\mathcal{R}_0^{w(20)}$, and $\mathcal{R}_0^{w(30)}$ in Addition to the 48 Base Models (Lower Part).

    48-Model Ensemble

                $\mathcal{E}_{2\times 10^{-5}}$   $\mathcal{R}_{100}$   $\mathcal{R}_{1000}^{\beta}$   $\mathcal{R}_{100}^{w(45)}$
    Network 1   21.47                             20.77                 19.45                          20.03
    Network 2   21.05                             19.12                 18.12                          18.91
    Network 3   24.12                             21.92                 20.88                          21.10

                $\mathcal{R}_0^{w(10)}$   $\mathcal{R}_0^{w(20)}$   $\mathcal{R}_0^{w(30)}$   $B_{\mathcal{M}}$
    Network 1   20.78                     20.27                     20.18                     22.43
    Network 2   19.50                     19.08                     18.93                     21.90
    Network 3   21.95                     21.31                     21.20                     23.87

    51-Model Ensemble

                $\mathcal{E}_{3\times 10^{-5}}$   $\mathcal{R}_{10^6}$   $\mathcal{R}_{1000}^{\beta}$   $\mathcal{R}_{1000}^{w(45)}$
    Network 1   19.77                             19.92                  19.62                          19.83
    Network 2   18.65                             18.96                  18.26                          18.62
    Network 3   21.55                             20.74                  20.70                          20.82

    Table 5. rmse of Leading Methods on the Three Networks for Hourly Forecasts.

                $B_{\mathcal{M}}$   $\mathcal{E}_{2\times 10^{-5}}$   $\mathcal{R}_{100}$   $\mathcal{R}_{1000}^{\beta}$   $\mathcal{R}_{100}^{w(45)}$
    Network 1   26.68               24.31                             22.69                 22.02                          22.32
    Network 2   25.98               24.67                             23.53                 22.82                          23.29
    Network 3   28.45               26.01                             22.82                 22.27                          22.54


    All results shown in the previous sections were presented for network 1. The two other networks are used for validation: in our study process, the algorithms were intensively analyzed and even tuned on network 1, and they were subsequently applied to networks 2 and 3 without specific optimization. This means that we directly use for them the parameters optimal for network 1. (The precise study of Mallet et al. [2007a] shows, by the way, that these off-line optimal parameters for network 1 are usually larger than those prescribed by theory, leading to faster learning rates and a greater ability for the weight vectors to change when needed, as illustrated, for instance, in Figure 4.)

    The methods achieve good performance on the two other networks; the results of Table 4 show in particular that the reference performance measures $B_{\mathcal{X}}$ and $B_{\mathbb{R}^N}$ are almost achieved again. All in all, this shows that the methods can be successfully applied to new data sets, demonstrating some kind of robustness.

    We do not yet provide satisfactory automatic tuning rules for the parameters. Theoretical optimal tuning rules for exponentiated gradient were proposed in the machine learning literature [Mallet et al., 2007a, section 4], but they lead to lower performance. However, one may note that most of the proposed aggregation rules only depend on one or two learning parameters, to be set by the user. The results, though sensitive to tuning, are always better than those of the best model, even with the worst possible choices of the parameters. This is in contrast to statistical methods, which require several parameters depending on underlying distributions to be estimated on data, or to other machine learning techniques, like neural networks, where the user has to choose the structure of the network and the values of several weights.

    4.4.2. Results on Hourly Forecasts

    The methods are also used for hourly ozone forecasts; see section 2.6.2. Just as in the previous section, only the leading aggregated forecasters for network 1 and daily peaks are applied. Their parameters were tuned on network 1 and for daily peaks, but they perform well for hourly concentrations and on all networks. The scores are summarized in Table 5.

    The aggregated forecasters are more efficient for hourly concentrations than for the daily peaks: the improvements with respect to the best model are stronger (decrease of the rmse of over 4.5 µg m⁻³ on average). The reference performance measure $B_{\mathbb{R}^N}$ is beaten by $\mathcal{R}_{1000}^{\beta}$ on the three networks. Since the physical models predict the daily peaks better than the hourly concentrations, one might state that the machine learning methods compensate for stronger physical limitations in the hourly-predictions case.

    5. Conclusion

    Based on ensemble simulations generated with the Polyphemus modeling system, machine learning algorithms prove to be efficient in forecasting ozone concentrations by sequential aggregation of models. This conclusion applies to one-day forecasts of ozone hourly concentrations and daily peaks. In short, the results improve on the previous work of Mallet and Sportisse [2006a] with respect to both the performance and the mathematical framework. In this study, the learning algorithms always turn out to be better than the best model in the ensemble, and they even compete with the best (constant and unconstrained) linear combination of models. Meanwhile, the algorithms come with theoretical bounds that guarantee high quality forecasts in the long run. In addition, our analysis clearly shows the robustness of the approach: good performance on new observation sets, no unreasonably high rmse per station or per date compared to the best model, and improvements in almost all regions and for a majority of dates. It is noteworthy that the learning methods behave very well on the extreme events. Because of all these desirable properties, the methods are very relevant for operational forecasting.

    This study comes with several algorithms applicable to air pollution. Although they rely on well-known methods from machine learning, they were adapted, primarily to account for the presence of multiple monitoring stations and for the fact that the relevance of past observations decreases with time.

Future developments may concentrate on the spatial validity of the aggregation. Furthermore, the sequential selection of (small subsets of) models in the ensemble may also help improve the results, or at least decrease the computational load. The performance of the sequential aggregation should be compared with classical data assimilation (optimal interpolation, Kalman filtering, variational assimilation). For instance, while classical data assimilation often has a limited impact in time, the learning methods could be more efficient, e.g., for two-day forecasts.

Acknowledgments. The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR), under grant ATLAS (JCJC06 137446) "From Applications to Theory in Learning and Adaptive Statistics". The first author is in the INRIA-ENPC joint project-team CLIME, and in the ENPC-EDF R&D joint laboratory CEREA.

    Appendix

We provided in the main body of this paper a brief overview of the empirical analysis of two families of machine learning forecasting methods. Further details on the analysis, as well as a discussion of other methods, follow. All of them are included in the technical report Mallet et al. [2007a], which is available online at http://www.dma.ens.fr/edition/publis/2007/resu0708.html.

    A1. Per Station Predictions

Any learning algorithm may be applied for predictions per station. In this case, each station s has its own sequence of weights v^s_t computed with the algorithm. The algorithm is independently applied to each station (on each singleton network N = {s}). The sole difference is that when a station is unavailable on a given day, the learning step is skipped and the last available weights at that station are used for the next day. Also, the additional forecasters R^{w(10)}_0, R^{w(20)}_0, and R^{w(30)}_0 cannot be computed, and therefore cannot be added to the ensemble for per station predictions: the associated least-squares problems are under-determined. A schematic implementation of the per station procedure is sketched below.
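As an illustration, here is a minimal sketch of this per station procedure, in Python (a language of our choosing). The array layout and the generic routine `update` are hypothetical: `update` stands for one learning step of any of the aggregation rules considered in this paper.

```python
import numpy as np

def per_station_aggregation(X, y, update):
    """Run a sequential-aggregation rule independently at each station.

    X has shape (T, S, N): forecasts of the N models at S stations over T days.
    y has shape (T, S): observations, with np.nan where a station is unavailable.
    update is a function (w, x, y) -> new weight vector (one learning step).
    """
    T, S, N = X.shape
    w = np.full((S, N), 1.0 / N)       # one weight vector per station
    predictions = np.empty((T, S))
    for t in range(T):
        for s in range(S):
            # Forecast with the last available weights at this station.
            predictions[t, s] = w[s] @ X[t, s]
            if not np.isnan(y[t, s]):
                # Learning step; skipped when the station is unavailable,
                # in which case the previous weights are reused.
                w[s] = update(w[s], X[t, s], y[t, s])
    return predictions
```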

Table 6. rmse of Leading Methods on the Three Networks for Daily Peaks and per Station Predictions.

            E_{2×10⁻⁵}   R_{100}   R^β_{1000}   R^{w(45)}_{100}
Network 1     23.26       19.72       19.73         19.59
Network 2     22.50       17.44       17.43         17.83
Network 3     24.42       19.73       19.72         19.57

Table 7. rmse of Leading Methods for Hourly Forecasts and per Station Predictions.

            E_{2×10⁻⁵}   R_{100}   R^β_{1000}   R^{w(45)}_{100}
Network 1     23.16       19.91       19.98         20.04
Network 2     23.34       18.41       18.37         18.58
Network 3     24.43       20.12       20.30         20.20


We kept the same learning parameters as those used in the main body of the paper. The performance is still computed with the global rmse defined by (2). The results are shown in Table 6 for daily peaks and in Table 7 for hourly concentrations; they should be compared to Tables 4 and 5, respectively. Except for network 1 and daily predictions, the rmses usually improve when the aggregation rules are run station by station.
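For reference, here is a minimal sketch of such a global score, under the assumption (our reading of definition (2) in the main body) that it is the root mean square error taken over all available station-date pairs; the function name is ours.

```python
import numpy as np

def global_rmse(predictions, observations):
    """Root mean square error over all available station-date pairs.

    Both arrays have shape (T, S); missing observations are np.nan
    and are excluded from the average.
    """
    mask = ~np.isnan(observations)
    errors = predictions[mask] - observations[mask]
    return float(np.sqrt(np.mean(errors ** 2)))
```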

We do not provide further details on the per station predictions, as they are not the main objective of this paper; they are essentially given for reference.

    A2. Further Learning Methods

We briefly report in this section the references and results of other learning methods. We only describe in detail the gradient descent forecasters, because they showed promising results on networks and settings not reported here.

A2.1. Gradient Descent Forecasters

The simplest version of gradient descent has already been implemented and studied in Mallet and Sportisse [2006a]. It is parameterized by η > 0 and is referred to below as G_η. It starts with an arbitrary weight vector u_1; we take, for simplicity and efficiency, u_1 = (1/N, ..., 1/N). Then, for t ≥ 1, the update is

$$u_{t+1} = u_t - \eta \tilde{\ell}_t = u_t - \eta \sum_{s \in N_t} 2 \left( u_t \cdot x^s_t - y^s_t \right) x^s_t. \qquad \text{(A1)}$$

This forecaster is competitive with respect to B_{R^N}, its regret being bounded as follows. For all η > 0 and all u ∈ R^N,

$$R_T(u) \le \frac{\| u - u_1 \|_2^2}{2 \eta} + 2 \eta N (SCB)^2 \, T = O\left(\sqrt{T}\right) \qquad \text{(A2)}$$

with the notation C of section 2.3, the bound B ≥ |x^s_{t,m}| for all m, t, and s, and for a choice η = O(1/√T). See Mallet et al. [2007a, section 7] and Cesa-Bianchi [1999] for more details.
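A minimal sketch of G_η, under the same hypothetical array conventions as in the per station example (missing observations encoded as NaN):

```python
import numpy as np

def gradient_descent_forecaster(X, y, eta):
    """Unconstrained gradient descent G_eta implementing update (A1).

    X has shape (T, S, N): forecasts of the N models at S stations over T days.
    y has shape (T, S): observations, with np.nan where a station is unavailable.
    """
    T, S, N = X.shape
    u = np.full(N, 1.0 / N)           # u_1 = (1/N, ..., 1/N)
    predictions = np.empty((T, S))
    for t in range(T):
        predictions[t] = X[t] @ u     # aggregated forecasts u_t . x_t^s
        observed = ~np.isnan(y[t])    # the set N_t of available stations
        residuals = X[t, observed] @ u - y[t, observed]
        # Gradient of the squared loss summed over available stations.
        u = u - eta * 2.0 * X[t, observed].T @ residuals
    return predictions
```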

The previous gradient descent forecaster is unconstrained. A variation with an additional projection step (e.g., onto the set X of all convex combinations) was considered by Zinkevich [2003], leading to the forecaster referred to below as Z_η (with parameter η > 0). For t ≥ 1, the update is

$$p_{t+1} = P_X\left( p_t - \eta \tilde{\ell}_t \right) = P_X\left( p_t - \eta \sum_{s \in N_t} 2 \left( p_t \cdot x^s_t - y^s_t \right) x^s_t \right) \qquad \text{(A3)--(A4)}$$

where P_X denotes the projection onto the set X of convex combinations. This forecaster is competitive with respect to B_X, since its regret is uniformly bounded as

$$\sup_{p \in X} R_T(p) \le \frac{1}{\eta} + 2 \eta N (SCB)^2 \, T = O\left(\sqrt{T}\right) \qquad \text{(A5)}$$

for a choice η = O(1/√T). See Mallet et al. [2007a, section 10] and Zinkevich [2003] for more details. A sketch of this projected variant is given below.
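Here is a minimal sketch of Z_η, again under our hypothetical array conventions. The paper does not spell out how P_X is computed; we use the standard sorting-based Euclidean projection onto the simplex as one possible choice.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the simplex of convex combinations."""
    u = np.sort(v)[::-1]                      # sort in decreasing order
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u > css / np.arange(1, len(v) + 1))[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def projected_gradient_forecaster(X, y, eta):
    """Projected gradient descent Z_eta implementing updates (A3)-(A4)."""
    T, S, N = X.shape
    p = np.full(N, 1.0 / N)
    predictions = np.empty((T, S))
    for t in range(T):
        predictions[t] = X[t] @ p
        observed = ~np.isnan(y[t])
        residuals = X[t, observed] @ p - y[t, observed]
        grad = 2.0 * X[t, observed].T @ residuals
        p = project_simplex(p - eta * grad)   # projection step P_X
    return predictions
```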

A2.2. Overview of Ten Other Forecasters

We do not describe these forecasters in detail; Table 8 summarizes all previously introduced forecasters together with the new ones. We also indicate the reference performance to which each aggregated forecaster is guaranteed to be close, as well as the bound on the regret to be substituted into (10) to control the discrepancy between the reference performance and the performance of the aggregated forecaster. The reader interested in the mathematical details may consult our technical report [Mallet et al., 2007a]; the column "Section" refers to a section of that report. Only the algorithms in the first half of the table were described above.

A2.3. Automatic Bias Correction

This trick, introduced by Kivinen and Warmuth [1997], transforms an aggregated forecaster A proposing convex combinations p_t into an aggregated forecaster A′ proposing linear combinations u_t. More precisely, A′ proposes predictions u_1, u_2, ... in the ℓ₁ ball of a given radius U > 0 centered at the origin, which we denote by B_{‖·‖₁}(0, U) ⊂ R^N. For the converted algorithm, we have a bound on the regret with respect to all elements of B_{‖·‖₁}(0, U). The interest of this trick is that it corrects biases in an automatic way, since the weights u_t do not necessarily sum up to 1. (E.g., if all models propose predictions that are larger than the actual observation, the aggregated forecaster has a chance not to predict too high a value.)

Formally, the conversion algorithm is as follows. We take U > 0, denote z^s_t = (U x^s_t, −U x^s_t), and let the original aggregated forecaster A be fed with the z^s_t (it therefore behaves as if there were 2N models); it outputs predictions q_1, q_2, ... in the set of all convex combinations over 2N elements. The "extended" weights u_t of A′ are defined as follows. For all j = 1, ..., N,

$$u_{j,t} = U \left( q_{j,t} - q_{j+N,t} \right); \qquad \text{(A6)}$$

these u_t satisfy, for all stations s,

$$u_t \cdot x^s_t = q_t \cdot z^s_t. \qquad \text{(A7)}$$

Now, it is easy to see that every u ∈ B_{‖·‖₁}(0, U) may be represented, in the following sense, by a convex combination q over 2N elements:

$$q \cdot z^s_t = u \cdot x^s_t \qquad \text{(A8)}$$

for all t and s. Equations (A7) and (A8) thus show that the regret of A′ against the elements of B_{‖·‖₁}(0, U) is smaller than about U times the regret of A against all convex combinations in X.
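A minimal sketch of this conversion follows, with hypothetical function names of our choosing; A stands for any aggregation rule outputting convex weights over the 2N extended models.

```python
import numpy as np

def extend_forecasts(x, U):
    """Build z_t^s = (U x_t^s, -U x_t^s): the N model forecasts and their
    negations, both scaled by U, so that A sees 2N virtual models."""
    return np.concatenate([U * x, -U * x])

def bias_corrected_weights(q, U):
    """Convert the convex weights q over the 2N virtual models into the
    linear weights u of A', following (A6). The vector u lies in the
    l1 ball of radius U and, by (A7), predicting with u on x equals
    predicting with q on the extended forecasts."""
    N = q.shape[0] // 2
    return U * (q[:N] - q[N:])
```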

See Mallet et al. [2007a, section 20] for more details and more precise bounds. The experiments reported there show that U = 0.99 is sometimes an interesting value for some aggregated forecasters A, in accordance with the automatic bias correction alluded to above. However, this procedure seldom performs better than the aggregated forecaster alone. We mention it here for the sake of completeness, and we plan to further study this ability of automatic bias correction.

We recall that another instance of this automatic bias correction ability was pointed out at the end of section 4.3.4.

    References

Beekmann, M., and C. Derognat, Monte Carlo uncertainty analysis of a regional-scale transport chemistry model constrained by measurements from the atmospheric pollution over the Paris area (ESQUIF) campaign, J. Geophys. Res., 108 (D17), 8,559, 2003.

Carter, W. P. L., A detailed mechanism for the gas-phase atmospheric reactions of organic compounds, Atmos. Env., 24A, 481–518, 1990.

Cesa-Bianchi, N., Analysis of two gradient-based algorithms for on-line regression, Journal of Computer and System Sciences, 59 (3), 392–411, 1999.

Cesa-Bianchi, N., and G. Lugosi, Prediction, Learning, and Games, Cambridge University Press, 2006.

ESQUIF, Étude et simulation de la qualité de l'air en Île-de-France – rapport final, 2001.

Gery, M. W., G. Z. Whitten, J. P. Killus, and M. C. Dodge, A photochemical kinetics mechanism for urban and regional scale computer modeling, J. Geophys. Res., 94, 12,925–12,956, 1989.


Hanna, S. R., J. C. Chang, and M. E. Fernau, Monte Carlo estimates of uncertainties in predictions by a photochemical grid model (UAM-IV) due to uncertainties in input variables, Atmos. Env., 32 (21), 3,619–3,628, 1998.

Horowitz, L. W., et al., A global simulation of tropospheric ozone and related tracers: description and evaluation of MOZART, version 2, J. Geophys. Res., 108 (D24), 2003.

Hundsdorfer, W., and J. G. Verwer, Numerical Solution of Time-Dependent Advection-Diffusion-Reaction Equations, Series in Comput. Math. 33, Springer, 2003.

Kivinen, J., and M. Warmuth, Exponentiated gradient versus gradient descent for linear predictors, Information and Computation, 132 (1), 1–63, 1997.

Krishnamurti, T. N., C. M. Kishtawal, Z. Zhang, D. B. T. LaRow, and E. Williford, Multimodel ensemble forecasts for weather and seasonal climate, J. Climate, 13, 4,196–4,216, 2000.

Lary, D. J., M. D. Müller, and H. Y. Mussa, Using neural networks to describe tracer correlations, Atmos. Chem. Phys., 4 (1), 143–146, 2004.

Louis, J.-F., A parametric model of vertical eddy fluxes in the atmosphere, Boundary-Layer Meteor., 17, 187–202, 1979.

Loyola, D., Applications of neural network methods to the processing of earth observation satellite data, Neural Networks, 19 (2), 168–177, 2006.

Mallet, V., and B. Sportisse, Ensemble-based air quality forecasts: A multimodel approach applied to ozone, J. Geophys. Res., 111 (D18), 2006a.

Mallet, V., and B. Sportisse, Uncertainty in a chemistry-transport model due to physical parameterizations and numerical approximations: An ensemble approach applied to ozone modeling, J. Geophys. Res., 111 (D1), 2006b.

Mallet, V., B. Mauricette, and G. Stoltz, Description of sequential aggregation methods and their performances for ozone ensemble forecasting, Tech. Rep. DMA-07-08, École normale supérieure, Paris, http://www.dma.ens.fr/edition/publis/2007/resu0708.html, 2007a.

Mallet, V., et al., Technical Note: The air quality modeling system Polyphemus, Atmos. Chem. Phys., 7 (20), 5,479–5,487, 2007b.

McKeen, S., et al., Assessment of an ensemble of seven real-time ozone forecasts over eastern North America during the summer of 2004, J. Geophys. Res., 110 (D21), 2005.

Middleton, P., W. R. Stockwell, and W. P. L. Carter, Aggregation and analysis of volatile organic compound emissions for regional modeling, Atmos. Env., 24A (5), 1,107–1,133, 1990.

Pagowski, M., et al., A simple method to improve ensemble-based ozone forecasts, Geophys. Res. Lett., 32, 2005.

Pagowski, M., et al., Application of dynamic linear regression to improve the skill of ensemble-based deterministic ozone forecasts, Atmos. Env., 40, 3,240–3,250, 2006.

Simpson, D., et al., Inventorying emissions from nature in Europe, J. Geophys. Res., 104 (D7), 8,113–8,152, 1999.

Stockwell, W. R., P. Middleton, J. S. Chang, and X. Tang, The second generation regional acid deposition model chemical mechanism for regional air quality modeling, J. Geophys. Res., 95 (D10), 16,343–16,367, 1990.

Stockwell, W. R., F. Kirchner, M. Kuhn, and S. Seefeld, A new mechanism for regional atmospheric chemistry modeling, J. Geophys. Res., 102 (D22), 25,847–25,879, 1997.

Troen, I., and L. Mahrt, A simple model of the atmospheric boundary layer; sensitivity to surface evaporation, Boundary-Layer Meteor., 37, 129–148, 1986.

van Loon, M., et al., Evaluation of long-term ozone simulations from seven regional air quality models and their ensemble, Atmos. Env., 41, 2,083–2,097, 2007.

West, M., and J. Harrison, Bayesian Forecasting and Dynamic Models, Series in Statistics, second ed., Springer, 1999.

Zinkevich, M., Online convex programming and generalized infinitesimal gradient ascent, in Proceedings of the Twentieth International Conference on Machine Learning (ICML '03), 2003.



Table 8. Overview of the Considered Aggregated Forecasters.

We indicate their reference performance measures (B_{R^N}, B_X, or B_M, see section 2.2.3), their theoretical regret bounds (if available), the section of the technical report where more information can be found, the parameters we used, and the results obtained for prediction of daily peaks on network 1.

Section  Notation       Name                                     Reference  Theo. bound  Parameters            rmse
12       R_λ            Ridge regression                         R^N        O(ln T)      R_{100}               20.77
3        E_η            Exponentiated gradient                   X          O(√T)        E_{2×10⁻⁵}            21.47
14       R^{w(t1)}_λ    Windowing ridge regression               R^N        −            R^{w(45)}_{100}       20.03
5        E^{w(t1)}_η    Windowing exponentiated gradient         X          −            E^{w(83)}_{2×10⁻⁵}    21.37
13       R^β_λ          Discounted ridge regression              R^N        o(T)         R^β_{1000}            19.45
6        E^β_η          Discounted exponentiated gradient        X          o(T)         E^β_{1.2×10⁻⁴}        21.31
10       Z_η            Projected gradient descent               X          O(√T)        Z_{10⁻⁶}              21.28
7        G_η            Gradient descent                         R^N        O(√T)        G_{4.5×10⁻⁹}          21.56
17       F′_{α,η}       Gradient fixed-share                     X          O(√T)        F′_{0.02, 2.5×10⁻⁵}   21.35
4        E_{a,b}        Adaptive exponentiated gradient          X          O(√T)        E_{100,1}             21.48
9        P′_p           Gradient polynomially weighted average   X          O(√T)        P′_{14}               21.49
19       O_β            Online Newton step                       X          O(ln T)      O_{6×10⁻⁷}            21.51
16       F_{α,η}        Fixed-share                              M          O(√T)        F_{0.15, 5×10⁻⁵}      21.89
1        A_η            Exponentially weighted average           M          O(1)         A_{3×10⁻⁶}            22.46
15       R^{vaw}_λ      Nonlinear ridge regression               R^N        O(ln T)      R^{vaw}_{10⁶}         22.90
2        M_η            Mixture                                  X          O(ln T)      M_{10⁻³}              23.1
8        P_p            Polynomially weighted average            M          O(√T)        P_{1.2}               23.20
11       Π_η            Prod                                     X          O(√T)        Π_{5.5×10⁻⁷}          23.33