JEFS Calibration: Bayesian Model Averaging
Adrian E. Raftery
J. McLean Sloughter
Tilmann Gneiting
University of Washington
Statistics
Eric P. Grimit
Clifford F. Mass
Jeff Baars
University of Washington
Atmospheric Sciences
Research supported by:Office of Naval Research
Multi-Disciplinary University Research Initiative (MURI)
23 August 2005, 11:30 AM, JEFS Technical Meeting; Monterey, CA
The General Goal
“The general goal in EF [ensemble forecasting] is to produce a probability density function (PDF) for the future state of the atmosphere that is reliable…and sharp…”
-- Plan for the Joint Ensemble Forecast System (2nd Draft),
Maj. F. Anthony Eckel
Calibration and Sharpness
Calibration ~ reliability (also: statistical consistency). A probability forecast p ought to verify with relative frequency p.
The verification ought to be indistinguishable from the forecast ensemble (the verification rank histogram* is uniform).
However, a forecast from climatology is reliable (by definition), so calibration alone is not enough.
Sharpness ~ resolution (also: discrimination, skill). The variance, or confidence interval, should be as small as possible, subject to calibration.
*Verification Rank Histogram
Record of where the verification fell (i.e., its rank) among the ordered ensemble members:
- Flat: well calibrated (truth is indistinguishable from the ensemble members)
- U-shaped: under-dispersive (truth falls outside the ensemble range too often)
- Humped: over-dispersive
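The rank histogram described above is easy to compute. A minimal sketch (synthetic data, not from the talk) that counts the verification's rank among the ordered members, breaking ties at random:

```python
# Sketch (synthetic data): counting verification ranks for an M-member
# ensemble. Ranks run 1..M+1; ties are broken at random so a calibrated
# ensemble still yields a flat histogram.
import numpy as np

def rank_histogram(ens, obs, rng=None):
    """ens: (n_cases, M) member forecasts; obs: (n_cases,) verifications.
    Returns counts of the verification's rank among the ordered members."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, M = ens.shape
    below = (ens < obs[:, None]).sum(axis=1)   # members strictly below obs
    ties = (ens == obs[:, None]).sum(axis=1)   # random tie-breaking
    ranks = below + rng.integers(0, ties + 1) + 1
    return np.bincount(ranks, minlength=M + 2)[1:]

# Calibrated toy ensemble: members and truth drawn from the same distribution,
# so the histogram should come out flat (~1/9 per rank for 8 members).
rng = np.random.default_rng(1)
sample = rng.normal(size=(20000, 9))
counts = rank_histogram(sample[:, :8], sample[:, 8])
print(counts / counts.sum())
```

Drawing truth from a wider (or narrower) distribution than the members reproduces the U-shaped (or humped) histograms above.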
Typical Verification Rank Histograms
[Figure: 36-h verification rank histograms (probability vs. rank 1-9) for *UWME and *UWME+, panels (a) Z500, (b) MSLP, (c) WS10, (d) T2. Synoptic variables (errors depend on analysis uncertainty) show excessive outlier percentages of 5.0%/4.2% (Z500) and 9.0%/6.7% (MSLP); surface/mesoscale variables (errors depend on model uncertainty) show 25.6%/13.3% (WS10) and 43.7%/21.0% (T2), for *UWME/*UWME+ respectively.]
*Excessive Outlier Percentage
[c.f. Eckel and Mass 2005, Wea. Forecasting]
Objective and Constraints
Objective: Calibrate JEFS (JGE and JME) output.
Utilize available analyses/observations as surrogates for truth.
Employ a method thataccounts for ensemble member construction and relative skill.
Bred-mode / ETKF initial conditions (JGE; equally skillful members)
Multiple models (JGE and JME; differing skill for sets of members)
Multi-scheme diversity within a single model (JME)
is adaptive.Can be rapidly relocated to any theatre of interest.
Does not require a long history of forecasts and observations.
accommodates regional/local variations within the domain.Spatial (grid point) dependence of forecast error statistics.
works for any observed variable at any vertical level.
First Step: Mean Bias Correction
Calibrate the first moment: the ensemble mean.
In a multi-model and/or multi-scheme physics ensemble, individual members have unique, often compensatory, systematic errors (biases).
Systematic errors do not represent forecast uncertainty.
Implemented a member-specific bias correction for UWME using a 14-day training period (running mean).
Advantages and disadvantages:
- Ensemble spread is reduced (in an under-dispersive system).
- The ensemble spread-skill relationship is degraded (Grimit 2004, Ph.D. dissertation).
- Forecast probability skill scores improve.
- Excessive outliers are reduced.
- Verification rank histograms become quasi-symmetric.
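The member-specific running-mean correction can be sketched as follows (array shapes and numbers are illustrative, not from UWME):

```python
# Sketch (illustrative data): a member-specific 14-day running-mean bias
# correction. Each member's forecast is debited by that member's own mean
# error over the previous 14 training days.
import numpy as np

def bias_correct(train_fcst, train_obs, today_fcst):
    """train_fcst: (14, M) member forecasts over the training window;
    train_obs: (14,) verifying observations; today_fcst: (M,) raw forecasts."""
    bias = (train_fcst - train_obs[:, None]).mean(axis=0)   # per-member bias
    return today_fcst - bias

rng = np.random.default_rng(0)
truth = rng.normal(15.0, 3.0, size=15)             # 14 training days + today
offsets = np.array([-2.0, -1.0, 0.5, 1.5, 3.0])    # systematic member errors
fcst = truth[:, None] + offsets + rng.normal(0.0, 0.3, size=(15, 5))

corrected = bias_correct(fcst[:14], truth[:14], fcst[14])
print(np.abs(fcst[14] - truth[14]).mean(), np.abs(corrected - truth[14]).mean())
```

Because the offsets here straddle zero (compensatory errors), removing them also shrinks the spread of the corrected members, mirroring the spread reduction noted above.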
Second Step: Calibration
Calibrate the higher moments: the ensemble variance.
Forecast error climatology: add the error variance from a long history of forecasts and observations to the current (deterministic) forecast.
For the ensemble mean, we shall call this forecast mean error climatology (MEC).
MEC is time-invariant (a static forecast of uncertainty; a climatology).
MEC is calibrated for large samples, but not very sharp.
Advantages and disadvantages:
- Simple. Difficult to beat!
- Gaussian.
- Not practical for JGE/JME implementation, since a long history is required.
- A good baseline for comparison of calibration methods.
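A minimal sketch of an MEC-style forecast, assuming the bias has already been removed: a Gaussian centred on the current ensemble mean whose fixed variance is the mean squared error from a long forecast history (names and numbers are illustrative):

```python
# Sketch of an MEC-style predictive distribution: Gaussian, centred on the
# current ensemble-mean forecast, with a *static* spread taken from a long
# history of ensemble-mean errors. Numbers are illustrative.
import math

def mec_forecast(ens_mean, error_history):
    """Return the Gaussian predictive CDF N(ens_mean, mean squared error)."""
    sigma = (sum(e * e for e in error_history) / len(error_history)) ** 0.5
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - ens_mean) / (sigma * math.sqrt(2.0))))
    return cdf

history = [1.2, -0.8, 0.3, -1.5, 2.1, -0.4, 0.9, -1.1]   # past mean errors (C)
cdf = mec_forecast(ens_mean=14.0, error_history=history)
print(round(cdf(14.0), 2))   # 0.5: the median sits at the ensemble mean
```

Only the centre moves from day to day; the uncertainty is a climatology, which is exactly why MEC is calibrated on average but not sharp.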
Mean Error Climatology (MEC) Performance
[Figure: CRPS for FIT vs. MEC]
CRPS = continuous ranked probability score [the probabilistic analog of the mean absolute error (MAE) used to score deterministic forecasts].
Comparison of *UWME 48-h 2-m temperature forecasts, with the member-specific mean bias correction applied to both [14-day running mean]:
- FIT = Gaussian fit to the raw forecast ensemble
- MEC = Gaussian fit to the ensemble mean + the mean error climatology
[00 UTC Cycle; October 2002 – March 2004; 361 cases]
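The CRPS used in this comparison can also be estimated directly from an ensemble via the standard identity CRPS = E|X - y| - 0.5 E|X - X'|; a small sketch with toy numbers:

```python
# Sketch: sample CRPS of an ensemble forecast via the identity
# CRPS = E|X - y| - 0.5 E|X - X'| (X, X' independent draws from the
# forecast, y the verification). For a one-member (deterministic) forecast
# the second term vanishes and CRPS reduces to the absolute error, which is
# why CRPS is the probabilistic analog of the MAE.
import numpy as np

def crps_ensemble(members, y):
    members = np.asarray(members, dtype=float)
    term1 = np.abs(members - y).mean()
    term2 = 0.5 * np.abs(members[:, None] - members[None, :]).mean()
    return term1 - term2

print(crps_ensemble([10.0], 12.0))              # 2.0: the absolute error
print(crps_ensemble([11.0, 12.0, 13.0], 12.0))  # sharp, centred ensemble: lower
```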
Bayesian Model Averaging (BMA)
BMA has several advantages over MEC:
- A time-varying uncertainty forecast.
- A way to keep multi-modality, if it is warranted.
- Maximizes information from short (2-4 week) training periods.
- Allows for different relative skill between members through the BMA weights (multi-model, multi-scheme physics).
Bayesian Model Averaging (BMA) summary:
- Member-specific mean-bias correction parameters
- Member-specific BMA weights
- BMA variance (not member-specific here, but can be)
[c.f. Raftery et al. 2005, Mon. Wea. Rev.]
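The BMA predictive density just summarised can be sketched as a weighted sum of Gaussian kernels, one per bias-corrected member (weights, member forecasts, and sigma below are invented for illustration; a common variance is used, though it can be member-specific):

```python
# Sketch (invented numbers) of a BMA predictive density: a weighted mixture
# of Gaussian kernels, one per bias-corrected member, with weights w_k
# reflecting relative skill and a common variance sigma^2.
import math

def bma_pdf(y, members, weights, sigma):
    """p(y) = sum_k w_k * N(y | f_k, sigma^2), with sum(w_k) = 1."""
    return sum(
        w * math.exp(-0.5 * ((y - f) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        for w, f in zip(weights, members)
    )

members = [12.1, 13.4, 15.0]   # bias-corrected member forecasts (deg C)
weights = [0.5, 0.3, 0.2]      # BMA weights: unequal relative skill
pdf = lambda y: bma_pdf(y, members, weights, sigma=1.2)

# The mixture can stay multi-modal when members disagree enough:
print(pdf(12.1) > pdf(14.0))   # True
```

This is the "keep multi-modality if warranted" property: when members cluster, the mixture collapses to a single mode; when they disagree, it does not.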
BMA Performance Using Analyses
[Figure panels: MEC and BMA]
BMA was initially implemented using training data from the entire UWME 12-km domain (Raftery et al. 2005, MWR):
- No regional variation of BMA weights or variance parameters.
- Used observations as truth.
After several attempts to implement BMA with local or regional training data, using NCEP RUC 20-km analyses as truth, we found that selecting the training data from a neighborhood of grid points with similar land-use type and elevation produced EXCELLENT results! The example application to 48-h 2-m temperature forecasts uses only 14 training days.
BMA-Neighbor* Calibration and Sharpness
[Figure: calibration (PIT histograms) and sharpness for MEC, BMA, and FIT]
*Neighbors have the same land-use type and an elevation difference < 200 m within a search radius of 3 grid points (60 km).
Probability integral transform (PIT) histograms are an analog of verification rank histograms for continuous forecasts.
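A PIT value is simply the predictive CDF evaluated at the verification. The sketch below (synthetic Gaussian forecasts, not UWME data) shows that a calibrated forecast yields a flat PIT histogram:

```python
# Sketch (synthetic forecasts): the probability integral transform (PIT).
# PIT = F(y), the predictive CDF evaluated at the verification; a calibrated
# forecast gives PIT values uniform on [0, 1], i.e. a flat PIT histogram.
import math, random

def gaussian_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

random.seed(0)
pits = []
for _ in range(20000):
    mu = random.gauss(0.0, 5.0)   # the forecast centre varies case to case
    y = random.gauss(mu, 1.0)     # truth drawn from the forecast itself
    pits.append(gaussian_cdf(y, mu, 1.0))

# 10-bar PIT histogram; calibration means roughly 0.1 in every bar.
counts = [0] * 10
for p in pits:
    counts[min(int(p * 10), 9)] += 1
print([round(c / len(pits), 3) for c in counts])
```

An over-confident forecast (predictive sigma too small) would pile PIT values into the outer bars, the continuous analog of a U-shaped rank histogram.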
BMA-Neighbor* CRPS Improvement
[Figure: BMA improvement over MEC]
*Neighbors have the same land-use type and an elevation difference < 200 m within a search radius of 3 grid points (60 km).
BMA-Neighbor Using Observations
Use observations, remote if necessary, to train BMA.
Follow the Mass-Wedam procedure for bias correction to select the BMA training data:
1. Choose the N closest observing locations to the center of the grid box that have similar elevation and land-use characteristics.
2. Find the K occasions during a recent period (up to Kmax days previous) on which the interpolated forecast state was similar to the current interpolated forecast state at each station n = 1, …, N:
   a) similar ensemble-mean forecast states;
   b) similar min/median/max ensemble forecast states.
3. If N*K matches are not found, relax the similarity constraints and repeat (1) and (2).
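Step 1 of the procedure above can be sketched as follows (the station list, field names, and tolerances are hypothetical; distances are ranked with a rough planar metric, which is adequate for ordering nearby stations):

```python
# Sketch (hypothetical stations and thresholds) of step 1: pick the N
# closest observing locations that share the grid box's land-use type and
# are within an elevation tolerance.
import math

def pick_training_stations(grid_pt, stations, n=5, max_elev_diff=200.0):
    """grid_pt/stations: dicts with lat, lon, elev (m), landuse."""
    candidates = [
        s for s in stations
        if s["landuse"] == grid_pt["landuse"]
        and abs(s["elev"] - grid_pt["elev"]) < max_elev_diff
    ]
    def dist(s):  # rough planar distance, fine for ranking nearby stations
        return math.hypot(s["lat"] - grid_pt["lat"], s["lon"] - grid_pt["lon"])
    return sorted(candidates, key=dist)[:n]

grid_pt = {"lat": 47.6, "lon": -122.3, "elev": 50.0, "landuse": "urban"}
stations = [
    {"name": "A", "lat": 47.7, "lon": -122.3, "elev": 40.0, "landuse": "urban"},
    {"name": "B", "lat": 47.6, "lon": -122.2, "elev": 600.0, "landuse": "urban"},
    {"name": "C", "lat": 48.9, "lon": -122.5, "elev": 30.0, "landuse": "urban"},
    {"name": "D", "lat": 47.5, "lon": -122.4, "elev": 60.0, "landuse": "forest"},
]
print([s["name"] for s in pick_training_stations(grid_pt, stations, n=2)])
# ['A', 'C']: B fails the elevation test, D the land-use test
```

Step 3's relaxation could be implemented by retrying with a larger max_elev_diff when fewer than N stations survive the filter.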
Summary and the Way Forward
Mean error climatology:
- Good benchmark to evaluate competing calibration methods.
- Generally beats a raw ensemble, even though it is not state-dependent.
- The ensemble mean contains most of the information we can use.
- The ensemble variance (state-dependent) is generally a poor prediction of uncertainty, at least on the mesoscale.
Bayesian model averaging (BMA):
- A calibration method that is becoming popular (CMC-MSC).
- A calibration method that meets many of the constraints that FNMOC and AFWA will face with JEFS.
- It accounts for differing relative skill of ensemble members (multi-model, multi-scheme physics).
- It is adaptive (short training period).
- It can be rapidly relocated to any theatre.
- It can be extended to any observed variable at any vertical level (although research is ongoing on this point).
Extending BMA to Non-Gaussian Variables
For quantities such as wind speed and precipitation, distributions are not only non-Gaussian but not purely continuous: there are point masses at zero. For probabilistic quantitative precipitation forecasts (PQPF):
- Model P(Y = 0) with a logistic regression.
- Model P(Y > 0) with a finite Gamma mixture distribution.
- Fit the Gamma means as a linear regression of the cube root of the observation on the forecast and an indicator for no precipitation.
- Fit the Gamma variance parameters and BMA weights by the EM algorithm, with some modifications.
[c.f. Sloughter et al. 200x, manuscript in preparation]
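The zero-inflated structure above can be sketched as follows (all parameter values are invented for illustration; the real model fits them by logistic regression and a modified EM, as stated): P(Y = 0) comes from a logistic model, and positive amounts follow a weighted Gamma mixture whose component means track each member's cube-rooted forecast:

```python
# Sketch (invented parameters) of a zero-inflated PQPF density: a logistic
# model for P(Y = 0), and a weighted Gamma mixture for positive rain with
# component means tied to each member's cube-rooted forecast.
import math

def prob_zero(mean_cuberoot_fcst, b0=-0.8, b1=1.1):
    """Logistic model for P(Y = 0): drier forecasts -> higher P(no rain)."""
    z = b0 - b1 * mean_cuberoot_fcst
    return 1.0 / (1.0 + math.exp(-z))

def gamma_pdf(y, shape, scale):
    return y ** (shape - 1) * math.exp(-y / scale) / (math.gamma(shape) * scale ** shape)

def pqpf_density(y, fcsts, weights, a=0.4, b=0.6, shape=2.0):
    """(1 - P0) * sum_k w_k Gamma(y; shape, scale_k), where component k's
    mean a + b * f_k**(1/3) follows member k's forecast."""
    p0 = prob_zero(sum(f ** (1.0 / 3.0) for f in fcsts) / len(fcsts))
    dens = 0.0
    for w, f in zip(weights, fcsts):
        mean_k = a + b * f ** (1.0 / 3.0)
        dens += w * gamma_pdf(y, shape, mean_k / shape)
    return (1.0 - p0) * dens

fcsts, weights = [2.0, 0.5, 4.0], [0.5, 0.3, 0.2]  # member QPFs (mm), BMA weights
print(round(pqpf_density(1.0, fcsts, weights), 3))
```

By construction P(Y = 0) plus the integral of the positive-rain density equals one, so the point mass and the continuous part form a single proper predictive distribution.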
PoP Reliability Diagrams
[Figure: ensemble consensus voting as crosses; the BMA PQPF model as red dots]
Results for January 1, 2003 through December 31, 2004, 24-hour accumulation PoP forecasts, with 25-day training and no regional parameter variations.
[c.f. Sloughter et al. 200x, manuscript in preparation]
PQPF Rank Histograms
Verification Rank Histogram
PIT Histogram
[c.f. Sloughter et al. 200x, manuscript in preparation]
QUESTIONS and DISCUSSION
Forecast Probability Skill Example
[Figure: Brier Skill Score (BSS) vs. lead time (00-48 h) for forecast probabilities of the event 10-m wind speed (WS10) > 18 kt, comparing UWME, *UWME, UWME+, and *UWME+ (* = bias-corrected), with an uncertainty reference. BSS = 1 is perfect; BSS < 0 is worthless.]
(0000 UTC Cycle; October 2002 – March 2003) Eckel and Mass 2005
UWME: Multi-Analysis/Forecast Collection

Abbreviation, Model, Source | Type | Computational resolution (~@45N) | Distributed resolution | Objective analysis
GFS, Global Forecast System (GFS), National Centers for Environmental Prediction | Spectral | T382 / L64 (~35 km) | 1.0 / L14 (~80 km) | SSI 3D-Var
CMCG, Global Environmental Multi-scale (GEM), Canadian Meteorological Centre | Finite Diff. | 0.9 / L28 (~70 km) | 1.25 / L11 (~100 km) | 4D-Var
ETA, North American Mesoscale limited-area model, National Centers for Environmental Prediction | Finite Diff. | 12 km / L45 | 90 km / L37 | SSI 3D-Var
GASP, Global AnalysiS and Prediction model, Australian Bureau of Meteorology | Spectral | T239 / L29 (~60 km) | 1.0 / L11 (~80 km) | 3D-Var
JMA, Global Spectral Model (GSM), Japan Meteorological Agency | Spectral | T213 / L40 (~65 km) | 1.25 / L13 (~100 km) | 4D-Var
NGPS, Navy Operational Global Atmos. Pred. Sys., Fleet Numerical Meteorological & Oceanographic Cntr. | Spectral | T239 / L30 (~60 km) | 1.0 / L14 (~80 km) | 3D-Var
TCWB, Global Forecast System, Taiwan Central Weather Bureau | Spectral | T79 / L18 (~180 km) | 1.0 / L11 (~80 km) | OI
UKMO, Unified Model, United Kingdom Meteorological Office | Finite Diff. | 5/6 x 5/9 / L30 (~60 km) | same / L12 | 4D-Var
- Perturbed surface boundary parameters according to their suspected uncertainty: 1) albedo, 2) roughness length, 3) moisture availability.
- Assumed differences between model physics options approximate model error coming from sub-grid scales.

UWME: MM5 Physics Configuration (January 2005 - current)

Member | Vert. diffusion (PBL/LSM) | Soil | Cloud | Microphysics | Cumulus (36-km) | Cumulus (12-km) | Shlw. cumls. | Radiation | SST Perturbation | Land-Use Table
UWME | MRF | 5-Layer | Y | Reisner II | Kain-Fritsch | Kain-Fritsch | N | CCM2 | none | default
GFS+ | MRF | LSM | Y | Simple Ice | Kain-Fritsch | Kain-Fritsch | Y | RRTM | SST_pert01 | LANDUSE.plus1
CMCG+ | MRF | 5-Layer | Y | Reisner II | Grell | Grell | N | cloud | SST_pert02 | LANDUSE.plus2
ETA+ | Eta | 5-Layer | N | Goddard | Betts-Miller | Grell | Y | RRTM | SST_pert03 | LANDUSE.plus3
GASP+ | MRF | LSM | Y | Shultz | Betts-Miller | Kain-Fritsch | N | RRTM | SST_pert04 | LANDUSE.plus4
JMA+ | Eta | LSM | N | Reisner II | Kain-Fritsch | Kain-Fritsch | Y | cloud | SST_pert05 | LANDUSE.plus5
NGPS+ | Blackadar | 5-Layer | Y | Shultz | Grell | Grell | N | RRTM | SST_pert06 | LANDUSE.plus6
TCWB+ | Blackadar | 5-Layer | Y | Goddard | Betts-Miller | Grell | Y | cloud | SST_pert07 | LANDUSE.plus7
UKMO+ | Eta | LSM | N | Reisner I | Kain-Fritsch | Kain-Fritsch | N | cloud | SST_pert08 | LANDUSE.plus8
(The GFS+ through UKMO+ configurations constitute UWME+.)
Member-Wise Forecast Bias Correction
UWME+ 2-m Temperature
[Figure: average RMSE (C) and (shaded) average bias for members GFS+, CMCG+, ETA+, GASP+, JMA+, NGPS+, TCWB+, UKMO+, and MEAN+ at 12-, 24-, 36-, and 48-h lead times.]
(0000 UTC Cycle; October 2002 – March 2003) Eckel and Mass 2005
Member-Wise Forecast Bias Correction
UWME+ 2-m Temperature, 14-day running-mean bias correction
[Figure: average RMSE and (shaded) average bias (in C and mb) for the bias-corrected members *GFS+, *CMCG+, *ETA+, *GASP+, *JMA+, *NGPS+, *TCWB+, *UKMO+, and *MEAN+ (plus01-plus08) at 12-, 24-, 36-, and 48-h lead times.]
(0000 UTC Cycle; October 2002 – March 2003) Eckel and Mass 2005
[Figure: sample ensemble forecasts]
Post-Processing: Probability Densities
Q: How should we infer forecast probability density functions from a finite ensemble of forecasts?
A: Some options are…
- Democratic Voting (DV): P = x / M, where x = # members above (or below) the threshold and M = total # of members.
- Uniform Ranks (UR)***: assume flat rank histograms; linearly interpolate the DV probabilities between adjacent member forecasts; extrapolate beyond the ensemble range using a fitted Gumbel (extreme-value) distribution.
- Parametric Fitting (FIT): fit a statistical distribution (e.g., normal) to the member forecasts.
***currently operational scheme
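Two of the options above, sketched for a toy 9-member ensemble and the exceedance event P(X > 16):

```python
# Sketch: Democratic Voting (DV) and a Gaussian Parametric Fit (FIT) for
# P(X > threshold) from a toy 9-member ensemble (numbers are invented).
import math

members = [12.0, 13.5, 14.0, 14.2, 15.0, 15.5, 16.1, 17.0, 18.3]
thresh = 16.0

# DV: P = x / M, x = number of members exceeding the threshold.
p_dv = sum(m > thresh for m in members) / len(members)   # 3/9

# FIT: fit a normal to the members, then use its tail probability.
mu = sum(members) / len(members)
var = sum((m - mu) ** 2 for m in members) / (len(members) - 1)
p_fit = 0.5 * (1.0 - math.erf((thresh - mu) / math.sqrt(2.0 * var)))

print(round(p_dv, 3), round(p_fit, 3))
```

Note that DV can only take the M + 1 values 0/M, …, M/M, while FIT (like UR's interpolation) yields smoothly varying probabilities between and beyond the member forecasts.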
A Concrete Example
A Concrete Example
Minimize False Alarms / Minimize Misses
How to Model Zeroes
Logit of the proportion of rain versus the cube root of the bin center.
How to Model Non-Zeroes
mean (left) and variance (right) of fitted gammas on each bin
Power-Transformed Obs
[Figures for each transform: untransformed, square root, cube root, fourth root]
A Possible Fix
Try a more complicated model, fitting a point mass at zero, an exponential for "drizzle," and a gamma for true rain around each member forecast.
[Figure legend: red = no rain, green = drizzle, blue = rain]