
A User-Focused Approach to Evaluating Probabilistic and Categorical Forecasts

Nicholas Loveday, Bureau of Meteorology, Melbourne, Victoria, Australia

Robert Taggart, Bureau of Meteorology, Sydney, New South Wales, Australia

Mohammadreza Khanarmuei, Bureau of Meteorology, Brisbane, Queensland, Australia


Abstract

A user-focused verification approach for evaluating probability forecasts of binary outcomes (also known as probabilistic classifiers) is demonstrated that is (i) based on proper scoring rules, (ii) focuses on user decision thresholds, and (iii) provides actionable insights. It is argued that when categorical performance diagrams and the critical success index are used to evaluate overall predictive performance, rather than the discrimination ability of probabilistic forecasts, they may produce misleading results. Instead, Murphy diagrams are shown to provide better understanding of overall predictive performance as a function of user probabilistic decision threshold. It is illustrated how to select a proper scoring rule, based on the relative importance of different user decision thresholds, and how this choice impacts scores of overall predictive performance and supporting measures of discrimination and calibration. These approaches and ideas are demonstrated using several probabilistic thunderstorm forecast systems as well as synthetic forecast data. Furthermore, a fair method for comparing the performance of probabilistic and categorical forecasts is illustrated using the FIxed Risk Multicategorical (FIRM) score, which is a proper scoring rule directly connected to values on the Murphy diagram. While the methods are illustrated using thunderstorm forecasts, they are applicable for evaluating probabilistic forecasts for any situation with binary outcomes.
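To make the idea of a Murphy diagram concrete, the following is a minimal sketch of how mean elementary scores might be computed for probability forecasts of a binary outcome across a range of user decision thresholds. It is not taken from the paper: the function names, the toy data, and the particular normalization (a false alarm costs θ and a miss costs 1 − θ, so that twice the integral over θ recovers the Brier score) are assumptions chosen for illustration.

```python
import numpy as np


def elementary_score(prob, obs, theta):
    """Mean elementary score at decision threshold theta (lower is better).

    Assumed convention: a user acts when prob > theta. A false alarm
    costs theta, a miss costs (1 - theta), and correct decisions cost 0.
    """
    prob = np.asarray(prob, dtype=float)
    obs = np.asarray(obs, dtype=int)
    act = prob > theta
    false_alarm = act & (obs == 0)
    miss = ~act & (obs == 1)
    return float(np.mean(theta * false_alarm + (1.0 - theta) * miss))


def murphy_curve(prob, obs, thetas):
    """Elementary score at each threshold: the values plotted on a Murphy diagram."""
    return np.array([elementary_score(prob, obs, t) for t in thetas])


# Hypothetical toy data: 1 = event observed, 0 = no event.
obs = np.array([1, 0, 0, 1, 0, 0, 0, 1])
sharp = np.array([0.9, 0.1, 0.2, 0.7, 0.1, 0.3, 0.2, 0.6])
climatology = np.full(obs.shape, obs.mean())  # constant base-rate forecast

thetas = np.linspace(0.05, 0.95, 19)
sharp_curve = murphy_curve(sharp, obs, thetas)
clim_curve = murphy_curve(climatology, obs, thetas)

for t, s, c in zip(thetas, sharp_curve, clim_curve):
    print(f"theta={t:.2f}  sharp={s:.4f}  climatology={c:.4f}")
```

Plotting each curve against `thetas` gives one line per forecast system, so a user with a particular decision threshold can read off which system has the lower expected cost at that threshold. Under the same assumed convention, a categorical (yes/no) forecast can be scored at a single fixed threshold, which is the sense in which the abstract describes a threshold-based comparison between probabilistic and categorical forecasts.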

© 2024 American Meteorological Society. This is an Author Accepted Manuscript distributed under the terms of the default AMS reuse license. For information regarding reuse and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Nicholas Loveday, [email protected]
