03 Categorical Jan18 - dtcenter.org · 1/31/2018  · Commonly Used Skill Scores • Gilbert Skill...

Page 1

Categorical Verification

Tina Kalb

Contributions from Tara Jensen, Matt Pocernich, Eric Gilleland, Tressa Fowler, Barbara Brown and others

[Figure: Venn diagram of Observation and Forecast regions labeled M, H and F (misses, hits, false alarms)]

Page 2

Finley Tornado Data (1884)

Forecast answering the question: Will there be a tornado? (YES / NO)

Observation answering the question: Did a tornado occur? (YES / NO)

Answers fall into 1 of 2 categories → Forecasts and Obs are Binary

Copyright 2018, University Corporation for Atmospheric Research, all rights reserved

Page 3

Finley Tornado Data (1884): Contingency Table

                    Observed
Forecast     Yes      No       Total
Yes          28       72       100
No           23       2680     2703
Total        51       2752     2803

Page 4

A Success?

                    Observed
Forecast     Yes      No       Total
Yes          28       72       100
No           23       2680     2703
Total        51       2752     2803

Percent Correct = (28+2680)/2803 = 96.6% !!!!

Page 5

What if the forecaster never forecast a tornado?

                    Observed
Forecast     Yes      No       Total
Yes          0        0        0
No           51       2752     2803
Total        51       2752     2803

Percent Correct = (0+2752)/2803 = 98.2% !!!!
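The two Percent Correct values can be reproduced with a few lines of Python. This is a minimal sketch using the counts from the Finley tables; the function name `percent_correct` is an illustrative choice, not part of MET.

```python
def percent_correct(hits, false_alarms, misses, correct_negatives):
    """Accuracy: fraction of all forecasts that were correct."""
    total = hits + false_alarms + misses + correct_negatives
    return (hits + correct_negatives) / total


# Finley's actual forecasts (counts from the contingency table).
finley = percent_correct(hits=28, false_alarms=72, misses=23, correct_negatives=2680)

# Hypothetical forecaster who always says "no tornado".
never = percent_correct(hits=0, false_alarms=0, misses=51, correct_negatives=2752)

print(f"Finley forecasts:     {finley:.1%}")
print(f"Never forecast 'yes': {never:.1%}")
```

The always-"no" forecaster scores higher even though it never detects a single tornado, because the correct negatives dominate the total.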

Page 6

Maybe Accuracy is not the most informative statistic

But the contingency table concept is good…

Page 7

2 x 2 Contingency Table

                    Observed
Forecast     Yes          No                  Total
Yes          Hit          False Alarm         Forecast Yes
No           Miss         Correct Negative    Forecast No
Total        Obs. Yes     Obs. No             Total

Example: Accuracy = (Hits + Correct Negs)/Total

MET supports both 2x2 and NxN Contingency Tables

Page 8

Common Notation (however not universal notation)

                    Observed
Forecast     Yes      No       Total
Yes          a        b        a+b
No           c        d        c+d
Total        a+c      b+d      n

Example: Accuracy = (a+d)/n

Page 9

What if data are not binary?

Examples:
• Temperature < 0 C
• Precipitation > 1 inch
• CAPE > 1000 J/kg
• Ozone > 20 µg/m³
• Winds at 80 m > 24 m/s
• 500 mb HGTS < 5520 m
• Radar Reflectivity > 40 dBZ
• MSLP < 990 hPa
• LCL < 1000 ft
• Cloud Droplet Concentration > 500/cc

Hint: Pick a threshold that is meaningful to your end-user

Page 10

Contingency Table for Freezing Temps (i.e. T<=0 C)

                    Observed
Forecast     <= 0C    > 0C     Total
<= 0C        a        b        a+b
> 0C         c        d        c+d
Total        a+c      b+d      n

Another Example: Base Rate (aka sample climatology) = (a+c)/n
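Thresholding continuous data into a 2x2 table can be sketched as follows. The temperature values are made up for illustration, and `contingency_counts` is a hypothetical helper rather than a MET function; it simply applies the T <= 0 C event definition to matched forecast/observation pairs.

```python
def contingency_counts(forecasts, observations, threshold=0.0):
    """Return (a, b, c, d) for the event 'value <= threshold'.

    a = hits, b = false alarms, c = misses, d = correct negatives.
    """
    a = b = c = d = 0
    for f, o in zip(forecasts, observations):
        f_yes = f <= threshold
        o_yes = o <= threshold
        if f_yes and o_yes:
            a += 1
        elif f_yes:
            b += 1
        elif o_yes:
            c += 1
        else:
            d += 1
    return a, b, c, d


# Made-up matched temperatures in degrees C (illustration only).
fcst_temps = [-2.0, 1.5, -0.5, 3.0, 0.0, 4.2, -1.0, 2.5]
obs_temps = [-1.0, -0.5, 0.5, 2.0, -0.2, 5.0, -3.0, 0.1]

a, b, c, d = contingency_counts(fcst_temps, obs_temps, threshold=0.0)
n = a + b + c + d
base_rate = (a + c) / n  # sample climatology of the freezing event

print(f"a={a} b={b} c={c} d={d} base rate={base_rate:.2f}")
```

Changing `threshold` changes every cell of the table, which is why the choice should reflect what matters to the end-user.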

Page 11

Alternative Perspective on Contingency Table

[Figure: Venn diagram of two overlapping regions, Forecast = yes and Observed = yes, with areas labeled Hits, False Alarms, Misses and Correct Negatives]

Page 12

Conditioning to form a statistic
• Considers the probability of one event given another event
• Notation: p(X|Y=1) is the probability of X occurring given Y=1, or in other words Y=yes

Conditioning on Fcst provides:
• Info about how your forecast is performing
• Apples-to-Oranges comparison if comparing stats from 2 models

Conditioning on Obs provides:
• Info about ability of forecast to discriminate between event and non-event - also called Conditional Probability or “Likelihood”
• Apples-to-Apples comparison if comparing stats from 2 models

Page 13

Conditioning on forecasts

[Figure: Venn diagram with regions Forecast = yes (f=1) and Observed = yes (x=1)]

p(x|f=1)
p(x=1|f=1) = a / (a ∪ b) = a/(a+b) = Fraction of Hits
p(x=0|f=1) = b / (a ∪ b) = b/(a+b) = False Alarm Ratio

Page 14

Conditioning on observations

[Figure: Venn diagram with regions Forecast = yes (f=1) and Observed = yes (x=1)]

p(f|x=1)
p(f=1|x=1) = a / (a ∪ c) = a/(a+c) = Hit Rate
p(f=0|x=1) = c / (a ∪ c) = c/(a+c) = Fraction of Misses

Page 15

What’s considered good?

Conditioning on Forecast
• Fraction of hits - p(x=1|f=1) = a/(a+b) : close to 1
• False Alarm Ratio - p(x=0|f=1) = b/(a+b) : close to 0

Conditioning on Observations
• Hit Rate - p(f=1|x=1) = a/(a+c) : close to 1 [aka Probability of Detection Yes (PODy)]
• Fraction of misses - p(f=0|x=1) = c/(a+c) : close to 0
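Conditioning can be read as "filter the cases first, then count". The sketch below rebuilds the Finley cases as (forecast, observed) pairs from the table counts and estimates each conditional probability by filtering; `conditional_probability` is an illustrative helper name, not a MET function.

```python
def conditional_probability(pairs, given, outcome):
    """Estimate p(outcome | given) from a list of (f, x) pairs."""
    subset = [p for p in pairs if given(p)]
    return sum(1 for p in subset if outcome(p)) / len(subset)


# Finley cases as (forecast, observed) pairs: 1 = yes, 0 = no.
pairs = [(1, 1)] * 28 + [(1, 0)] * 72 + [(0, 1)] * 23 + [(0, 0)] * 2680

# Conditioning on the forecast (f = 1).
fraction_of_hits = conditional_probability(pairs, lambda p: p[0] == 1, lambda p: p[1] == 1)
false_alarm_ratio = conditional_probability(pairs, lambda p: p[0] == 1, lambda p: p[1] == 0)

# Conditioning on the observation (x = 1).
hit_rate = conditional_probability(pairs, lambda p: p[1] == 1, lambda p: p[0] == 1)
fraction_of_misses = conditional_probability(pairs, lambda p: p[1] == 1, lambda p: p[0] == 0)

print(f"p(x=1|f=1) = {fraction_of_hits:.3f}, p(x=0|f=1) = {false_alarm_ratio:.3f}")
print(f"p(f=1|x=1) = {hit_rate:.3f}, p(f=0|x=1) = {fraction_of_misses:.3f}")
```

Each pair of probabilities shares a condition, so each pair sums to 1.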

Page 16

Examples of Categorical Scores (most based on conditioning)

• PODy (POD, Probability of Detection) = a/(a+c)
• False Alarm Ratio (FAR) = b/(a+b)
• PODn = d/(b+d) = (1 – POFD)
• False Alarm Rate (POFD, Probability of False Detection) = b/(b+d)
• (Frequency) Bias (FBIAS) = (a+b)/(a+c)
• Threat Score or Critical Success Index (CSI) = a/(a+b+c)

[Figure: diagram with regions labeled a, b, c and d]

Page 17

Examples of CTC calculations

                    Observed
Forecast     Yes      No       Total
Yes          28       72       100
No           23       2680     2703
Total        51       2752     2803

Threat Score = 28 / (28 + 72 + 23) = 0.228
Probability of Detection = 28 / (28 + 23) = 0.549
False Alarm Ratio = 72 / (28 + 72) = 0.720
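The same calculations, extended to the other scores listed earlier, can be written as one small function. This is a sketch in the a/b/c/d notation, not MET's implementation; it does not guard against zero denominators (for example, a forecast that never says "yes" makes FAR undefined).

```python
def categorical_scores(a, b, c, d):
    """Common 2x2 contingency table scores.

    a = hits, b = false alarms, c = misses, d = correct negatives.
    Assumes every denominator is non-zero.
    """
    return {
        "PODy": a / (a + c),
        "FAR": b / (a + b),
        "PODn": d / (b + d),
        "POFD": b / (b + d),
        "FBIAS": (a + b) / (a + c),
        "CSI": a / (a + b + c),
    }


scores = categorical_scores(a=28, b=72, c=23, d=2680)
for name, value in scores.items():
    print(f"{name:5s} = {value:.3f}")
```

For the Finley data the frequency bias is close to 2: tornadoes were forecast about twice as often as they were observed.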

Page 18

Example

Page 19

Relationships among scores
• CSI is a nonlinear function of POD and FAR
• CSI depends on base rate (event frequency) and Bias

$$\mathrm{CSI} = \frac{1}{\dfrac{1}{\mathrm{POD}} + \dfrac{1}{1 - \mathrm{FAR}} - 1}$$

$$\mathrm{Bias} = \frac{\mathrm{POD}}{1 - \mathrm{FAR}}$$

Very different combinations of FAR and POD can lead to the same CSI value
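Both relationships follow from the a/b/c definitions and can be checked numerically. The sketch below verifies them on the Finley counts and then shows two very different (POD, FAR) pairs that give the same CSI; the function name `csi_from_pod_far` is illustrative.

```python
import math


def csi_from_pod_far(pod, far):
    """CSI expressed through POD and FAR: 1 / (1/POD + 1/(1-FAR) - 1)."""
    return 1.0 / (1.0 / pod + 1.0 / (1.0 - far) - 1.0)


# Finley counts: hits, false alarms, misses.
a, b, c = 28, 72, 23
pod = a / (a + c)
far = b / (a + b)

csi_direct = a / (a + b + c)
csi_derived = csi_from_pod_far(pod, far)

bias_direct = (a + b) / (a + c)
bias_derived = pod / (1.0 - far)

print(f"CSI  direct={csi_direct:.4f}  derived={csi_derived:.4f}")
print(f"Bias direct={bias_direct:.4f}  derived={bias_derived:.4f}")

# Two very different forecast behaviours with identical CSI:
cautious = csi_from_pod_far(pod=0.5, far=0.0)    # misses half, no false alarms
aggressive = csi_from_pod_far(pod=1.0, far=0.5)  # catches all, half are false alarms
print(f"cautious CSI={cautious:.2f}  aggressive CSI={aggressive:.2f}")
```

This is why CSI alone cannot tell an under-forecasting system from an over-forecasting one; the bias is needed as well.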

Page 20

HMT Performance Diagram

9km - Ensemble Mean – 6h Precip

• On same plot
  • POD
  • 1-FAR (aka Success Ratio)
  • CSI
  • Freq Bias
• Dots: Scores Aggregated Over Lead Time
• Colors: different thresholds
• Results:
  • Decreasing skill with higher thresholds across multiple metrics
  • Highest skill at 18 – 24 h lead times

[Figure: performance diagram with Success Ratio (1-FAR) on one axis, equal lines of CSI labeled 0.1 to 0.9, equal lines of Freq. Bias, and a "Best" corner marker. Colors show thresholds >0.1 in., >0.5 in., >1.0 in. and >2.0 in.; dots are labeled with lead times from 06 to 72 h in 6 h steps.]

Roberts et al. (2011), Roebber (WAF, 2009), Wilson (presentation, 2008)

Page 21

Skill Scores

How do you compare the skill of easy to predict events with difficult to predict events?
• Provides a single value to summarize performance.
• Reference forecast - best naive guess; persistence; climatology.
• Reference forecast must be comparable.
• Perfect forecast implies that the object can be perfectly observed.

Page 22

Generic Skill Score

$$SS = \frac{A - A_{ref}}{A_{perf} - A_{ref}}$$

where A = any measure, ref = reference, perf = perfect

Example:

$$SS_{MSE} = 1 - \frac{MSE}{MSE_{climo}}$$

where MSE = Mean Square Error

• Interpreted as fractional improvement over reference forecast
• Reference could be: Climatology, Persistence, your baseline forecast, etc.
• Climatology could be a separate forecast or a gridded forecast sample climatology
• SS typically positively oriented with 1 as optimal
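The generic form and the MSE example can be tied together in code. In this sketch the observation and forecast values are made up, the sample mean of the observations stands in for climatology (an assumption for illustration), and a perfect MSE is taken to be 0.

```python
import math


def skill_score(score, score_ref, score_perfect):
    """Generic skill score: (A - A_ref) / (A_perf - A_ref)."""
    return (score - score_ref) / (score_perfect - score_ref)


def mse(forecasts, observations):
    """Mean square error of matched forecast/observation pairs."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / len(observations)


# Made-up values (illustration only).
obs = [2.0, 4.0, 6.0, 8.0]
fcst = [3.0, 4.0, 5.0, 8.0]
climo = [sum(obs) / len(obs)] * len(obs)  # sample climatology as the reference

mse_fcst = mse(fcst, obs)
mse_climo = mse(climo, obs)

ss_generic = skill_score(mse_fcst, mse_climo, score_perfect=0.0)
ss_mse = 1.0 - mse_fcst / mse_climo  # the MSE special case

print(f"MSE={mse_fcst:.2f}  MSE_climo={mse_climo:.2f}  SS={ss_generic:.2f}")
```

With a perfect score of 0, the generic formula reduces to the MSE form, so both expressions give the same value here: a 90% improvement over the reference.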

Page 23

Commonly Used Skill Scores

• Gilbert Skill Score - based on the CSI corrected for the number of hits expected by chance.

• Heidke Skill Score - based on Accuracy corrected for the number of correct forecasts (hits and correct negatives) expected by chance.

• Hanssen-Kuipers Discriminant (Peirce Skill Score) - measures ability of forecast to discriminate between (or correctly classify) events and non-events. H-K = POD - POFD

• Brier Skill Score for probabilistic forecasts
• Fractional Skill Score for neighborhood methods
• Intensity-Scale Skill Score for wavelet methods
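The first three scores can be sketched for a 2x2 table as below. The slide gives only H-K = POD - POFD explicitly; the Gilbert and Heidke formulas used here are the standard textbook forms (chance hits = (a+b)(a+c)/n, chance correct = [(a+b)(a+c) + (c+d)(b+d)]/n) and are an assumption about what the slide intends, not taken from it or from MET.

```python
def gilbert_skill_score(a, b, c, d):
    """CSI corrected for the number of hits expected by chance."""
    n = a + b + c + d
    chance_hits = (a + b) * (a + c) / n
    return (a - chance_hits) / (a + b + c - chance_hits)


def heidke_skill_score(a, b, c, d):
    """Accuracy corrected for the correct forecasts expected by chance."""
    n = a + b + c + d
    chance_correct = ((a + b) * (a + c) + (c + d) * (b + d)) / n
    return (a + d - chance_correct) / (n - chance_correct)


def hanssen_kuipers(a, b, c, d):
    """H-K = POD - POFD."""
    return a / (a + c) - b / (b + d)


finley = (28, 72, 23, 2680)  # Finley's forecasts
never = (0, 0, 51, 2752)     # always forecasting "no tornado"

for label, table in (("Finley", finley), ("Never yes", never)):
    gss = gilbert_skill_score(*table)
    hss = heidke_skill_score(*table)
    hk = hanssen_kuipers(*table)
    print(f"{label:10s} GSS={gss:.3f}  HSS={hss:.3f}  H-K={hk:.3f}")
```

Unlike Percent Correct, all three scores give the always-"no" forecaster zero skill and rank Finley's forecasts above it.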

Page 24

Example

Page 25

Thank you!