JEFS Calibration: Bayesian Model Averaging
Adrian E. Raftery
J. McLean Sloughter
Tilmann Gneiting
University of Washington
Statistics
Eric P. Grimit
Clifford F. Mass
Jeff Baars
University of Washington
Atmospheric Sciences
Research supported by:Office of Naval Research
Multi-Disciplinary University Research Initiative (MURI)
23 August 2005, 11:30 AM, JEFS Technical Meeting; Monterey, CA
The General Goal
“The general goal in EF [ensemble forecasting] is to produce a probability density function (PDF) for the future state of the atmosphere that is reliable…and sharp…”
-- Plan for the Joint Ensemble Forecast System (2nd Draft),
Maj. F. Anthony Eckel
Calibration and Sharpness
Calibration ~ reliability (also: statistical consistency). A probability forecast p ought to verify with relative frequency p.
The verification ought to be indistinguishable from the forecast ensemble (the verification rank histogram* is uniform).
However, a forecast from climatology is reliable (by definition), so calibration alone is not enough.
Sharpness ~ resolution (also: discrimination, skill). The variance, or confidence interval, should be as small as possible, subject to calibration.
*Verification Rank Histogram
Record of where the verification fell (i.e., its rank) among the ordered ensemble members:
- Flat: well calibrated (truth is indistinguishable from the ensemble members)
- U-shaped: under-dispersive (truth falls outside the ensemble range too often)
- Humped: over-dispersive
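The rank histogram described above is easy to compute. A minimal sketch (synthetic data, not from the talk) that counts the verification's rank among the ordered members, breaking ties at random:

```python
# Sketch (synthetic data): counting verification ranks for an M-member
# ensemble. Ranks run 1..M+1; ties are broken at random so a calibrated
# ensemble still yields a flat histogram.
import numpy as np

def rank_histogram(ens, obs, rng=None):
    """ens: (n_cases, M) member forecasts; obs: (n_cases,) verifications.
    Returns counts of the verification's rank among the ordered members."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, M = ens.shape
    below = (ens < obs[:, None]).sum(axis=1)   # members strictly below obs
    ties = (ens == obs[:, None]).sum(axis=1)   # random tie-breaking
    ranks = below + rng.integers(0, ties + 1) + 1
    return np.bincount(ranks, minlength=M + 2)[1:]

# Calibrated toy ensemble: members and truth drawn from the same distribution,
# so the histogram should come out flat (~1/9 per rank for 8 members).
rng = np.random.default_rng(1)
sample = rng.normal(size=(20000, 9))
counts = rank_histogram(sample[:, :8], sample[:, 8])
print(counts / counts.sum())
```

Drawing truth from a wider (or narrower) distribution than the members reproduces the U-shaped (or humped) histograms above.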
Typical Verification Rank Histograms
[Figure: 36-h verification rank histograms (probability vs. rank 1-9) for *UWME and *UWME+, panels (a) Z500, (b) MSLP, (c) WS10, (d) T2. Synoptic variables (errors depend on analysis uncertainty) show excessive outlier percentages of 5.0%/4.2% (Z500) and 9.0%/6.7% (MSLP); surface/mesoscale variables (errors depend on model uncertainty) show 25.6%/13.3% (WS10) and 43.7%/21.0% (T2), for *UWME/*UWME+ respectively.]
*Excessive Outlier Percentage
[c.f. Eckel and Mass 2005, Wea. Forecasting]
Objective and Constraints
Objective: Calibrate JEFS (JGE and JME) output.
Utilize available analyses/observations as surrogates for truth.
Employ a method thataccounts for ensemble member construction and relative skill.
Bred-mode / ETKF initial conditions (JGE; equally skillful members)
Multiple models (JGE and JME; differing skill for sets of members)
Multi-scheme diversity within a single model (JME)
is adaptive.Can be rapidly relocated to any theatre of interest.
Does not require a long history of forecasts and observations.
accommodates regional/local variations within the domain.Spatial (grid point) dependence of forecast error statistics.
works for any observed variable at any vertical level.
First Step: Mean Bias Correction
Calibrate the first moment: the ensemble mean.
In a multi-model and/or multi-scheme physics ensemble, individual members have unique, often compensatory, systematic errors (biases).
Systematic errors do not represent forecast uncertainty.
Implemented a member-specific bias correction for UWME using a 14-day training period (running mean).
Advantages and disadvantages:
- Ensemble spread is reduced (in an under-dispersive system).
- The ensemble spread-skill relationship is degraded (Grimit 2004, Ph.D. dissertation).
- Forecast probability skill scores improve.
- Excessive outliers are reduced.
- Verification rank histograms become quasi-symmetric.
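The member-specific running-mean correction can be sketched as follows (array shapes and numbers are illustrative, not from UWME):

```python
# Sketch (illustrative data): a member-specific 14-day running-mean bias
# correction. Each member's forecast is debited by that member's own mean
# error over the previous 14 training days.
import numpy as np

def bias_correct(train_fcst, train_obs, today_fcst):
    """train_fcst: (14, M) member forecasts over the training window;
    train_obs: (14,) verifying observations; today_fcst: (M,) raw forecasts."""
    bias = (train_fcst - train_obs[:, None]).mean(axis=0)   # per-member bias
    return today_fcst - bias

rng = np.random.default_rng(0)
truth = rng.normal(15.0, 3.0, size=15)             # 14 training days + today
offsets = np.array([-2.0, -1.0, 0.5, 1.5, 3.0])    # systematic member errors
fcst = truth[:, None] + offsets + rng.normal(0.0, 0.3, size=(15, 5))

corrected = bias_correct(fcst[:14], truth[:14], fcst[14])
print(np.abs(fcst[14] - truth[14]).mean(), np.abs(corrected - truth[14]).mean())
```

Because the offsets here straddle zero (compensatory errors), removing them also shrinks the spread of the corrected members, mirroring the spread reduction noted above.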
Second Step: Calibration
Calibrate the higher moments: the ensemble variance.
Forecast error climatology: add the error variance from a long history of forecasts and observations to the current (deterministic) forecast.
For the ensemble mean, we shall call this forecast mean error climatology (MEC).
MEC is time-invariant (a static forecast of uncertainty; a climatology).
MEC is calibrated for large samples, but not very sharp.
Advantages and disadvantages:
- Simple. Difficult to beat!
- Gaussian.
- Not practical for JGE/JME implementation, since a long history is required.
- A good baseline for comparison of calibration methods.
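A minimal sketch of an MEC-style forecast, assuming the bias has already been removed: a Gaussian centred on the current ensemble mean whose fixed variance is the mean squared error from a long forecast history (names and numbers are illustrative):

```python
# Sketch of an MEC-style predictive distribution: Gaussian, centred on the
# current ensemble-mean forecast, with a *static* spread taken from a long
# history of ensemble-mean errors. Numbers are illustrative.
import math

def mec_forecast(ens_mean, error_history):
    """Return the Gaussian predictive CDF N(ens_mean, mean squared error)."""
    sigma = (sum(e * e for e in error_history) / len(error_history)) ** 0.5
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - ens_mean) / (sigma * math.sqrt(2.0))))
    return cdf

history = [1.2, -0.8, 0.3, -1.5, 2.1, -0.4, 0.9, -1.1]   # past mean errors (C)
cdf = mec_forecast(ens_mean=14.0, error_history=history)
print(round(cdf(14.0), 2))   # 0.5: the median sits at the ensemble mean
```

Only the centre moves from day to day; the uncertainty is a climatology, which is exactly why MEC is calibrated on average but not sharp.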
Mean Error Climatology (MEC) Performance
[Figure: CRPS for FIT vs. MEC]
CRPS = continuous ranked probability score [the probabilistic analog of the mean absolute error (MAE) used to score deterministic forecasts].
Comparison of *UWME 48-h 2-m temperature forecasts, with the member-specific mean bias correction applied to both [14-day running mean]:
- FIT = Gaussian fit to the raw forecast ensemble
- MEC = Gaussian fit to the ensemble mean + the mean error climatology
[00 UTC Cycle; October 2002 – March 2004; 361 cases]
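The CRPS used in this comparison can also be estimated directly from an ensemble via the standard identity CRPS = E|X - y| - 0.5 E|X - X'|; a small sketch with toy numbers:

```python
# Sketch: sample CRPS of an ensemble forecast via the identity
# CRPS = E|X - y| - 0.5 E|X - X'| (X, X' independent draws from the
# forecast, y the verification). For a one-member (deterministic) forecast
# the second term vanishes and CRPS reduces to the absolute error, which is
# why CRPS is the probabilistic analog of the MAE.
import numpy as np

def crps_ensemble(members, y):
    members = np.asarray(members, dtype=float)
    term1 = np.abs(members - y).mean()
    term2 = 0.5 * np.abs(members[:, None] - members[None, :]).mean()
    return term1 - term2

print(crps_ensemble([10.0], 12.0))              # 2.0: the absolute error
print(crps_ensemble([11.0, 12.0, 13.0], 12.0))  # sharp, centred ensemble: lower
```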
Bayesian Model Averaging (BMA)
BMA has several advantages over MEC:
- A time-varying uncertainty forecast.
- A way to keep multi-modality, if it is warranted.
- Maximizes information from short (2-4 week) training periods.
- Allows for different relative skill between members through the BMA weights (multi-model, multi-scheme physics).
Bayesian Model Averaging (BMA) summary:
- Member-specific mean-bias correction parameters
- Member-specific BMA weights
- BMA variance (not member-specific here, but can be)
[c.f. Raftery et al. 2005, Mon. Wea. Rev.]
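The BMA predictive density just summarised can be sketched as a weighted sum of Gaussian kernels, one per bias-corrected member (weights, member forecasts, and sigma below are invented for illustration; a common variance is used, though it can be member-specific):

```python
# Sketch (invented numbers) of a BMA predictive density: a weighted mixture
# of Gaussian kernels, one per bias-corrected member, with weights w_k
# reflecting relative skill and a common variance sigma^2.
import math

def bma_pdf(y, members, weights, sigma):
    """p(y) = sum_k w_k * N(y | f_k, sigma^2), with sum(w_k) = 1."""
    return sum(
        w * math.exp(-0.5 * ((y - f) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        for w, f in zip(weights, members)
    )

members = [12.1, 13.4, 15.0]   # bias-corrected member forecasts (deg C)
weights = [0.5, 0.3, 0.2]      # BMA weights: unequal relative skill
pdf = lambda y: bma_pdf(y, members, weights, sigma=1.2)

# The mixture can stay multi-modal when members disagree enough:
print(pdf(12.1) > pdf(14.0))   # True
```

This is the "keep multi-modality if warranted" property: when members cluster, the mixture collapses to a single mode; when they disagree, it does not.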
BMA Performance Using Analyses
[Figure panels: MEC and BMA]
BMA was initially implemented using training data from the entire UWME 12-km domain (Raftery et al. 2005, MWR):
- No regional variation of BMA weights or variance parameters.
- Used observations as truth.
After several attempts to implement BMA with local or regional training data, using NCEP RUC 20-km analyses as truth, we found that selecting the training data from a neighborhood of grid points with similar land-use type and elevation produced EXCELLENT results! The example application to 48-h 2-m temperature forecasts uses only 14 training days.
BMA-Neighbor* Calibration and Sharpness
[Figure: calibration (PIT histograms) and sharpness for MEC, BMA, and FIT]
*Neighbors have the same land-use type and an elevation difference < 200 m within a search radius of 3 grid points (60 km).
Probability integral transform (PIT) histograms are an analog of verification rank histograms for continuous forecasts.
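A PIT value is simply the predictive CDF evaluated at the verification. The sketch below (synthetic Gaussian forecasts, not UWME data) shows that a calibrated forecast yields a flat PIT histogram:

```python
# Sketch (synthetic forecasts): the probability integral transform (PIT).
# PIT = F(y), the predictive CDF evaluated at the verification; a calibrated
# forecast gives PIT values uniform on [0, 1], i.e. a flat PIT histogram.
import math, random

def gaussian_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

random.seed(0)
pits = []
for _ in range(20000):
    mu = random.gauss(0.0, 5.0)   # the forecast centre varies case to case
    y = random.gauss(mu, 1.0)     # truth drawn from the forecast itself
    pits.append(gaussian_cdf(y, mu, 1.0))

# 10-bar PIT histogram; calibration means roughly 0.1 in every bar.
counts = [0] * 10
for p in pits:
    counts[min(int(p * 10), 9)] += 1
print([round(c / len(pits), 3) for c in counts])
```

An over-confident forecast (predictive sigma too small) would pile PIT values into the outer bars, the continuous analog of a U-shaped rank histogram.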
BMA-Neighbor* CRPS Improvement
[Figure: BMA improvement over MEC]
*Neighbors have the same land-use type and an elevation difference < 200 m within a search radius of 3 grid points (60 km).
BMA-Neighbor Using Observations
Use observations, remote if necessary, to train BMA.
Follow the Mass-Wedam procedure for bias correction to select the BMA training data:
1. Choose the N closest observing locations to the center of the grid box that have similar elevation and land-use characteristics.
2. Find the K occasions during a recent period (up to Kmax days previous) on which the interpolated forecast state was similar to the current interpolated forecast state at each station n = 1, …, N:
   a) similar ensemble-mean forecast states;
   b) similar min/median/max ensemble forecast states.
3. If N*K matches are not found, relax the similarity constraints and repeat (1) and (2).
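Step 1 of the procedure above can be sketched as follows (the station list, field names, and tolerances are hypothetical; distances are ranked with a rough planar metric, which is adequate for ordering nearby stations):

```python
# Sketch (hypothetical stations and thresholds) of step 1: pick the N
# closest observing locations that share the grid box's land-use type and
# are within an elevation tolerance.
import math

def pick_training_stations(grid_pt, stations, n=5, max_elev_diff=200.0):
    """grid_pt/stations: dicts with lat, lon, elev (m), landuse."""
    candidates = [
        s for s in stations
        if s["landuse"] == grid_pt["landuse"]
        and abs(s["elev"] - grid_pt["elev"]) < max_elev_diff
    ]
    def dist(s):  # rough planar distance, fine for ranking nearby stations
        return math.hypot(s["lat"] - grid_pt["lat"], s["lon"] - grid_pt["lon"])
    return sorted(candidates, key=dist)[:n]

grid_pt = {"lat": 47.6, "lon": -122.3, "elev": 50.0, "landuse": "urban"}
stations = [
    {"name": "A", "lat": 47.7, "lon": -122.3, "elev": 40.0, "landuse": "urban"},
    {"name": "B", "lat": 47.6, "lon": -122.2, "elev": 600.0, "landuse": "urban"},
    {"name": "C", "lat": 48.9, "lon": -122.5, "elev": 30.0, "landuse": "urban"},
    {"name": "D", "lat": 47.5, "lon": -122.4, "elev": 60.0, "landuse": "forest"},
]
print([s["name"] for s in pick_training_stations(grid_pt, stations, n=2)])
# ['A', 'C']: B fails the elevation test, D the land-use test
```

Step 3's relaxation could be implemented by retrying with a larger max_elev_diff when fewer than N stations survive the filter.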
Summary and the Way Forward
Mean error climatology:
- Good benchmark to evaluate competing calibration methods.
- Generally beats a raw ensemble, even though it is not state-dependent.
- The ensemble mean contains most of the information we can use.
- The ensemble variance (state-dependent) is generally a poor prediction of uncertainty, at least on the mesoscale.
Bayesian model averaging (BMA):
- A calibration method that is becoming popular (CMC-MSC).
- A calibration method that meets many of the constraints that FNMOC and AFWA will face with JEFS.
- It accounts for differing relative skill of ensemble members (multi-model, multi-scheme physics).
- It is adaptive (short training period).
- It can be rapidly relocated to any theatre.
- It can be extended to any observed variable at any vertical level (although research is ongoing on this point).
Extending BMA to Non-Gaussian Variables
For quantities such as wind speed and precipitation, distributions are not only non-Gaussian but not purely continuous: there are point masses at zero. For probabilistic quantitative precipitation forecasts (PQPF):
- Model P(Y = 0) with a logistic regression.
- Model P(Y > 0) with a finite Gamma mixture distribution.
- Fit the Gamma means as a linear regression of the cube root of the observation on the forecast and an indicator for no precipitation.
- Fit the Gamma variance parameters and BMA weights by the EM algorithm, with some modifications.
[c.f. Sloughter et al. 200x, manuscript in preparation]
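The zero-inflated structure above can be sketched as follows (all parameter values are invented for illustration; the real model fits them by logistic regression and a modified EM, as stated): P(Y = 0) comes from a logistic model, and positive amounts follow a weighted Gamma mixture whose component means track each member's cube-rooted forecast:

```python
# Sketch (invented parameters) of a zero-inflated PQPF density: a logistic
# model for P(Y = 0), and a weighted Gamma mixture for positive rain with
# component means tied to each member's cube-rooted forecast.
import math

def prob_zero(mean_cuberoot_fcst, b0=-0.8, b1=1.1):
    """Logistic model for P(Y = 0): drier forecasts -> higher P(no rain)."""
    z = b0 - b1 * mean_cuberoot_fcst
    return 1.0 / (1.0 + math.exp(-z))

def gamma_pdf(y, shape, scale):
    return y ** (shape - 1) * math.exp(-y / scale) / (math.gamma(shape) * scale ** shape)

def pqpf_density(y, fcsts, weights, a=0.4, b=0.6, shape=2.0):
    """(1 - P0) * sum_k w_k Gamma(y; shape, scale_k), where component k's
    mean a + b * f_k**(1/3) follows member k's forecast."""
    p0 = prob_zero(sum(f ** (1.0 / 3.0) for f in fcsts) / len(fcsts))
    dens = 0.0
    for w, f in zip(weights, fcsts):
        mean_k = a + b * f ** (1.0 / 3.0)
        dens += w * gamma_pdf(y, shape, mean_k / shape)
    return (1.0 - p0) * dens

fcsts, weights = [2.0, 0.5, 4.0], [0.5, 0.3, 0.2]  # member QPFs (mm), BMA weights
print(round(pqpf_density(1.0, fcsts, weights), 3))
```

By construction P(Y = 0) plus the integral of the positive-rain density equals one, so the point mass and the continuous part form a single proper predictive distribution.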
PoP Reliability Diagrams
[Figure: ensemble consensus voting as crosses; the BMA PQPF model as red dots]
Results for January 1, 2003 through December 31, 2004, 24-hour accumulation PoP forecasts, with 25-day training and no regional parameter variations.
[c.f. Sloughter et al. 200x, manuscript in preparation]
PQPF Rank Histograms
Verification Rank Histogram
PIT Histogram
[c.f. Sloughter et al. 200x, manuscript in preparation]
QUESTIONS and DISCUSSION
Forecast Probability Skill Example
[Figure: Brier Skill Score (BSS) vs. lead time (00-48 h) for forecast probabilities of the event 10-m wind speed (WS10) > 18 kt, comparing UWME, *UWME, UWME+, and *UWME+ (* = bias-corrected), with an uncertainty reference. BSS = 1 is perfect; BSS < 0 is worthless.]
(0000 UTC Cycle; October 2002 – March 2003) Eckel and Mass 2005
UWME: Multi-Analysis/Forecast Collection

Abbreviation, Model, Source | Type | Computational resolution (~@45N) | Distributed resolution | Objective analysis
GFS, Global Forecast System (GFS), National Centers for Environmental Prediction | Spectral | T382 / L64 (~35 km) | 1.0 / L14 (~80 km) | SSI 3D-Var
CMCG, Global Environmental Multi-scale (GEM), Canadian Meteorological Centre | Finite Diff. | 0.9 / L28 (~70 km) | 1.25 / L11 (~100 km) | 4D-Var
ETA, North American Mesoscale limited-area model, National Centers for Environmental Prediction | Finite Diff. | 12 km / L45 | 90 km / L37 | SSI 3D-Var
GASP, Global AnalysiS and Prediction model, Australian Bureau of Meteorology | Spectral | T239 / L29 (~60 km) | 1.0 / L11 (~80 km) | 3D-Var
JMA, Global Spectral Model (GSM), Japan Meteorological Agency | Spectral | T213 / L40 (~65 km) | 1.25 / L13 (~100 km) | 4D-Var
NGPS, Navy Operational Global Atmos. Pred. Sys., Fleet Numerical Meteorological & Oceanographic Cntr. | Spectral | T239 / L30 (~60 km) | 1.0 / L14 (~80 km) | 3D-Var
TCWB, Global Forecast System, Taiwan Central Weather Bureau | Spectral | T79 / L18 (~180 km) | 1.0 / L11 (~80 km) | OI
UKMO, Unified Model, United Kingdom Meteorological Office | Finite Diff. | 5/6 x 5/9 / L30 (~60 km) | same / L12 | 4D-Var
- Perturbed surface boundary parameters according to their suspected uncertainty: 1) albedo, 2) roughness length, 3) moisture availability.
- Assumed differences between model physics options approximate model error coming from sub-grid scales.

UWME: MM5 Physics Configuration (January 2005 - current)

Member | Vert. diffusion (PBL/LSM) | Soil | Cloud | Microphysics | Cumulus (36-km) | Cumulus (12-km) | Shlw. cumls. | Radiation | SST Perturbation | Land-Use Table
UWME | MRF | 5-Layer | Y | Reisner II | Kain-Fritsch | Kain-Fritsch | N | CCM2 | none | default
GFS+ | MRF | LSM | Y | Simple Ice | Kain-Fritsch | Kain-Fritsch | Y | RRTM | SST_pert01 | LANDUSE.plus1
CMCG+ | MRF | 5-Layer | Y | Reisner II | Grell | Grell | N | cloud | SST_pert02 | LANDUSE.plus2
ETA+ | Eta | 5-Layer | N | Goddard | Betts-Miller | Grell | Y | RRTM | SST_pert03 | LANDUSE.plus3
GASP+ | MRF | LSM | Y | Shultz | Betts-Miller | Kain-Fritsch | N | RRTM | SST_pert04 | LANDUSE.plus4
JMA+ | Eta | LSM | N | Reisner II | Kain-Fritsch | Kain-Fritsch | Y | cloud | SST_pert05 | LANDUSE.plus5
NGPS+ | Blackadar | 5-Layer | Y | Shultz | Grell | Grell | N | RRTM | SST_pert06 | LANDUSE.plus6
TCWB+ | Blackadar | 5-Layer | Y | Goddard | Betts-Miller | Grell | Y | cloud | SST_pert07 | LANDUSE.plus7
UKMO+ | Eta | LSM | N | Reisner I | Kain-Fritsch | Kain-Fritsch | N | cloud | SST_pert08 | LANDUSE.plus8
(The GFS+ through UKMO+ configurations constitute UWME+.)
Member-Wise Forecast Bias Correction
UWME+ 2-m Temperature
[Figure: average RMSE (C) and (shaded) average bias for members GFS+, CMCG+, ETA+, GASP+, JMA+, NGPS+, TCWB+, UKMO+, and MEAN+ at 12-, 24-, 36-, and 48-h lead times.]
(0000 UTC Cycle; October 2002 – March 2003) Eckel and Mass 2005
Member-Wise Forecast Bias Correction
UWME+ 2-m Temperature, 14-day running-mean bias correction
[Figure: average RMSE and (shaded) average bias (in C and mb) for the bias-corrected members *GFS+, *CMCG+, *ETA+, *GASP+, *JMA+, *NGPS+, *TCWB+, *UKMO+, and *MEAN+ (plus01-plus08) at 12-, 24-, 36-, and 48-h lead times.]
(0000 UTC Cycle; October 2002 – March 2003) Eckel and Mass 2005
[Figure: sample ensemble forecasts]
Post-Processing: Probability Densities
Q: How should we infer forecast probability density functions from a finite ensemble of forecasts?
A: Some options are…
- Democratic Voting (DV): P = x / M, where x = # members above (or below) the threshold and M = total # of members.
- Uniform Ranks (UR)***: assume flat rank histograms; linearly interpolate the DV probabilities between adjacent member forecasts; extrapolate beyond the ensemble range using a fitted Gumbel (extreme-value) distribution.
- Parametric Fitting (FIT): fit a statistical distribution (e.g., normal) to the member forecasts.
***currently operational scheme
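Two of the options above, sketched for a toy 9-member ensemble and the exceedance event P(X > 16):

```python
# Sketch: Democratic Voting (DV) and a Gaussian Parametric Fit (FIT) for
# P(X > threshold) from a toy 9-member ensemble (numbers are invented).
import math

members = [12.0, 13.5, 14.0, 14.2, 15.0, 15.5, 16.1, 17.0, 18.3]
thresh = 16.0

# DV: P = x / M, x = number of members exceeding the threshold.
p_dv = sum(m > thresh for m in members) / len(members)   # 3/9

# FIT: fit a normal to the members, then use its tail probability.
mu = sum(members) / len(members)
var = sum((m - mu) ** 2 for m in members) / (len(members) - 1)
p_fit = 0.5 * (1.0 - math.erf((thresh - mu) / math.sqrt(2.0 * var)))

print(round(p_dv, 3), round(p_fit, 3))
```

Note that DV can only take the M + 1 values 0/M, …, M/M, while FIT (like UR's interpolation) yields smoothly varying probabilities between and beyond the member forecasts.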
A Concrete Example
A Concrete Example
Minimize False Alarms / Minimize Misses
How to Model Zeroes
Logit of the proportion of rain versus the cube root of the bin center.
How to Model Non-Zeroes
mean (left) and variance (right) of fitted gammas on each bin
Power-Transformed Obs
[Figures for each transform: untransformed, square root, cube root, fourth root]
A Possible Fix
Try a more complicated model, fitting a point mass at zero, an exponential for "drizzle," and a gamma for true rain around each member forecast.
[Figure legend: red = no rain, green = drizzle, blue = rain]