SRNWP Expert Team on Diagnostics, validation and verification
Workplan for 2008-09

Version 1.0 Clive Wilson

Deputy chairperson: This post is still vacant.

Improved Communication:
Video-conferencing has been proposed for discussions on verification score reviews and key technical points. This will be explored.

Key reference list:
This has been compiled and is kept up-to-date by the ET chair. ET members are encouraged to email additional references.

Extreme events

A list has been started:

EUMETNET programme proposal

The proposal was discussed and drafted by the ET and submitted to the 33rd EUMETNET Council meeting in Reykjavik, 28-29 May 2008. The proposal was approved and forms part of the Programme requirement, and a call for a Responsible Member will be made. The final programme proposal is due for approval at the 34th Council meeting (16-17 October, Brussels); the programme is to start on 1 November 2008 and last 2 years.

The deliverables are to be:

Council commented that D1 should include a review of all available methods used by Members and other NWP centres. Verification of severe weather forecasts is to be added as a new deliverable. D3 was very important and it should start as soon as possible.

The main activity of the ET will be to ensure that the final programme is agreed, approved, initiated and put into operation so that the benefits are realised. A responsible member needs to be identified.

Verification Methods workshop

The WMO WWRP/WGNE Joint Working Group on Verification is to hold its 4th International Verification Methods Workshop in Helsinki from 8 to 11 June 2009. The group asked whether we would be interested in holding a joint SRNWP workshop with them, which would be an opportunity to address the verification of extremes and severe weather, and high-resolution forecast verification. Following consultation with the ET, it has now been agreed that this will be a joint meeting. ET members will help in planning the programme.

Consortia Activities & Plans

ALADIN

The common verification package for the Aladin models is operational in Slovenia. It was not significantly changed during the last year, except for the inclusion of the results of the Météo-France and Poland Aladin forecasts. It allows a comparison of the different versions of Aladin against European surface station and radiosonde data.

The application of fuzzy methods and pattern recognition will be tested in Poland in a multi-year programme.

Fuzzy methods are in quasi-operational use at Météo-France to compare high- and low-resolution models. The deterministic forecasts of a phenomenon are first transformed into probabilities of occurrence by computing the forecast frequency of the phenomenon in a neighbourhood. Two versions of the Brier skill score against the persistence forecast are then used to evaluate the performance of the different models: the probabilities are compared either to the 0 or 1 value observed at the centre of the neighbourhood (BSS_SO, for single observation) or to the observed frequency of the phenomenon in the same neighbourhood (BSS_NO, for neighbourhood observation). These scores have been used to compare a hierarchy of operational and research models against the French climatological raingauge network (Amodei and Stein 2008). They are currently used to validate the improvements proposed for the prototype version of the AROME model at 2.5 km, ahead of its operational use planned before the end of 2008.
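
The two neighbourhood Brier skill scores can be illustrated with a minimal Python sketch. It is not the Météo-France implementation: it assumes that forecast, observation and persistence are all available on the same small grid (the study cited above uses raingauge observations), that neighbourhoods are squares truncated at the domain edge, and that the persistence forecast is converted to probabilities in the same way as the model forecast. All function names are hypothetical.

```python
def neighbourhood_fraction(field, i, j, half_width, threshold):
    """Fraction of points in the square neighbourhood of (i, j) reaching the threshold."""
    rows, cols = len(field), len(field[0])
    hits = total = 0
    for a in range(max(0, i - half_width), min(rows, i + half_width + 1)):
        for b in range(max(0, j - half_width), min(cols, j + half_width + 1)):
            total += 1
            hits += field[a][b] >= threshold
    return hits / total


def brier_score(probabilities, references):
    """Mean squared difference between forecast probabilities and reference values."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, references)) / len(probabilities)


def neighbourhood_bss(forecast, observed, persistence, half_width, threshold):
    """Return (BSS_SO, BSS_NO), using persistence as the reference forecast.

    Assumption: the reference Brier scores are non-zero, i.e. persistence is not perfect.
    """
    points = [(i, j) for i in range(len(forecast)) for j in range(len(forecast[0]))]
    p_fc = [neighbourhood_fraction(forecast, i, j, half_width, threshold) for i, j in points]
    p_ref = [neighbourhood_fraction(persistence, i, j, half_width, threshold) for i, j in points]
    # Single observation: 0 or 1 at the centre of the neighbourhood.
    o_single = [float(observed[i][j] >= threshold) for i, j in points]
    # Neighbourhood observation: observed frequency in the same neighbourhood.
    o_neigh = [neighbourhood_fraction(observed, i, j, half_width, threshold) for i, j in points]
    bss_so = 1 - brier_score(p_fc, o_single) / brier_score(p_ref, o_single)
    bss_no = 1 - brier_score(p_fc, o_neigh) / brier_score(p_ref, o_neigh)
    return bss_so, bss_no
```

A score above 0 indicates an improvement over persistence. In this sketch a forecast identical to the observations reaches BSS_NO = 1 but not BSS_SO = 1, because the neighbourhood probabilities are smoother than the single-point observations.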

The comparison of forecast brightness temperatures with those observed by Meteosat 8 and 9 is being developed for the ALADIN and AROME models as a post-processing of the forecasts. Classical and probabilistic scores are also applied to this comparison in order to quantify the influence of the double penalty on the comparison between two models of different resolution.

A supplementary data set, the radar data provided by the French network, will be used in two complementary forms to evaluate high-resolution forecasts. On the one hand, reflectivities will be computed using the observation operator designed for radar assimilation in AROME and compared with the radar observations. On the other hand, the surface rain analysis (resolution: 1 km and 1 hour) obtained by combining radar and raingauge information (Antilope project) will be used as the reference for precipitation forecasts. The fuzzy approach will also be used to quantify the quality of the forecasts.

COSMO

The Common Verification Suite (Raspanti et al., 2006), which has been developed in recent years at the Italian Met Service, is now installed in all COSMO Met Services and is the official verification software within COSMO. It contains the "traditional" verification against SYNOPs and TEMPs. For mslp, t2m, td2m and wind speed, the mean error and RMS error are calculated in 3 h steps; for precipitation, the frequency bias and equitable threat score are calculated for thresholds of 0.2, 2, 5 and 10 mm/6h and mm/12h, and of 0.2, 2, 10 and 20 mm/24h. Results of this standard verification are published periodically on the COSMO website (protected area).
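
For reference, the scores named above follow standard textbook definitions, sketched here in Python. This is an illustration only, not code from the Common Verification Suite; the function names are hypothetical and an event is assumed to mean "value at or above the threshold".

```python
import math


def mean_error(forecasts, observations):
    """Mean error (bias) of a continuous variable such as t2m."""
    return sum(f - o for f, o in zip(forecasts, observations)) / len(forecasts)


def rms_error(forecasts, observations):
    """Root mean square error of a continuous variable."""
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / len(forecasts))


def contingency_counts(forecasts, observations, threshold):
    """Return (hits, misses, false_alarms, correct_negatives) for one threshold."""
    hits = misses = false_alarms = correct_negatives = 0
    for f, o in zip(forecasts, observations):
        forecast_yes, observed_yes = f >= threshold, o >= threshold
        if forecast_yes and observed_yes:
            hits += 1
        elif observed_yes:
            misses += 1
        elif forecast_yes:
            false_alarms += 1
        else:
            correct_negatives += 1
    return hits, misses, false_alarms, correct_negatives


def frequency_bias(hits, misses, false_alarms):
    """Ratio of forecast events to observed events (1 means unbiased)."""
    return (hits + false_alarms) / (hits + misses)


def equitable_threat_score(hits, misses, false_alarms, correct_negatives):
    """Threat score corrected for the hits expected by chance.

    Undefined (division by zero) when no event is forecast or observed.
    """
    total = hits + misses + false_alarms + correct_negatives
    hits_random = (hits + misses) * (hits + false_alarms) / total
    return (hits - hits_random) / (hits + misses + false_alarms - hits_random)
```

The categorical scores would be evaluated once per threshold and accumulation period, for example with threshold 2 on a list of 6 h precipitation sums.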

Main Activities underway

The following two activities are the main priorities. They can be seen as the minimal developments needed to explore the real quality of NWP forecasts and to have and maintain a "state of the art" verification system for the near future.

Conditional Verification (CV) library

Work is underway to include a Conditional Verification (CV) library in the Common Verification Suite. The typical approach to CV consists of selecting one or several forecast products and one or several mask variables or conditions, which are used to define thresholds for the product verification (e.g. verification of t2m only for grid points with zero cloud cover in both model and observations). Once the desired conditions have been selected, classical verification tools can be used to produce the statistical indices. Once delivered and applied routinely, the library should give developers direct hints about the possible causes of the model deficiencies seen in the operational verification. The most flexible way to select forecasts and observations according to a set of conditions is to store the required data in an "ad hoc" database, where the mask or filter can be a simple or complex SQL statement.
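
As an illustration of the database approach, the clear-sky t2m example above might look as follows with Python's built-in sqlite3 module. The table name, column names and sample values are invented for this sketch and do not describe the planned CV library.

```python
import sqlite3


def conditional_mean_error(connection, condition):
    """Return (mean t2m error, sample size) for the pairs matching an SQL condition.

    The condition is inserted into the query as text, so it must come from a
    trusted source (for example a fixed list of masks), never from user input.
    """
    query = "SELECT AVG(t2m_fc - t2m_obs), COUNT(*) FROM pairs WHERE " + condition
    return connection.execute(query).fetchone()


# Hypothetical forecast/observation pairs: temperatures in K, cloud cover in octas.
rows = [
    ("A", 271.0, 273.0, 0, 0),
    ("B", 280.0, 279.0, 8, 8),
    ("C", 268.5, 271.5, 0, 0),
    ("D", 275.0, 275.0, 0, 4),
]
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE pairs (station TEXT, t2m_fc REAL, t2m_obs REAL, cloud_fc REAL, cloud_obs REAL)"
)
connection.executemany("INSERT INTO pairs VALUES (?, ?, ?, ?, ?)", rows)

# Mask: zero cloud cover in both model and observations.
clear_sky = conditional_mean_error(connection, "cloud_fc = 0 AND cloud_obs = 0")
unconditional = conditional_mean_error(connection, "1 = 1")
```

With these sample values the clear-sky subset shows a larger cold bias than the full sample, which is the kind of hint the conditional approach is meant to give developers.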

Development of object-based and fuzzy verification methods and techniques

One important task of this project is to show whether very high resolution models (~2-3 km) perform better than high resolution models (~7 km), which is not trivial using classical scores. Going down to resolutions of the order of 2 km leads to a proliferation of grid points. Although these give more detail, the details are rarely at the correct place at the correct moment (the double penalty problem). Some type of aggregation is thus needed: the keyword 'fuzzy verification' can be attached to this activity.

Beth Ebert's fuzzy verification package, which includes 13 different methods (Ebert, 2008), has been installed at DWD and MeteoSwiss, and first results have been obtained with the COSMO versions running at 2.2 (resp. 2.8) km and 7 km.

At MeteoSwiss two fuzzy verification methods, "Upscaling" and the "Fractions Skill Score" (FSS), have been used to compare COSMO-7 and COSMO-2 against 3-hourly rain accumulations from the Swiss radar network during summer 2007 (MAP D-PHASE). Both scores give significantly better results for the high-resolution COSMO-2 integration. The strongest improvements of COSMO-2 over COSMO-7 are achieved on coarse scales (~90 km) and for medium rain intensities of the order of 4 mm/3 h, resembling weak convective events. Both methods are recommended for practical use.
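
A minimal Python sketch of the FSS, following its published definition (one minus the mean squared difference between forecast and observed neighbourhood fractions, normalised by its largest possible value). It is not taken from Ebert's package; the function names are hypothetical, and the neighbourhoods are assumed to be squares truncated at the domain edge.

```python
def fraction_field(field, threshold, half_width):
    """Fraction of points reaching the threshold in the neighbourhood of each grid point."""
    rows, cols = len(field), len(field[0])
    result = []
    for i in range(rows):
        row = []
        for j in range(cols):
            window = [
                field[a][b] >= threshold
                for a in range(max(0, i - half_width), min(rows, i + half_width + 1))
                for b in range(max(0, j - half_width), min(cols, j + half_width + 1))
            ]
            row.append(sum(window) / len(window))
        result.append(row)
    return result


def fractions_skill_score(forecast, observed, threshold, half_width):
    """FSS for one threshold and one neighbourhood size (0 = no skill, 1 = perfect)."""
    pf = fraction_field(forecast, threshold, half_width)
    po = fraction_field(observed, threshold, half_width)
    mse = sum((f - o) ** 2 for rf, ro in zip(pf, po) for f, o in zip(rf, ro))
    reference = sum(f ** 2 for rf in pf for f in rf) + sum(o ** 2 for ro in po for o in ro)
    # Undefined when the event is neither forecast nor observed anywhere.
    return 1 - mse / reference if reference else float("nan")


# A rain cell forecast two grid points away from where it was observed.
observed = [[0.0] * 5 for _ in range(5)]
forecast = [[0.0] * 5 for _ in range(5)]
observed[2][1] = 4.0
forecast[2][3] = 4.0
fss_grid_scale = fractions_skill_score(forecast, observed, 1.0, 0)
fss_neighbourhood = fractions_skill_score(forecast, observed, 1.0, 2)
```

At grid scale the displaced cell scores 0 (the double penalty), while the larger neighbourhood gives it partial credit; this is the sense in which aggregation makes forecasts of different resolution comparable.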

At DWD, Beth Ebert's package has been evaluated with COSMO forecasts since 2007. Verification against radar data is nearly operational. Results are calculated for every model run of GME, COSMO-EU and COSMO-DE. Verification will be carried out for different regions and with calibrated radar data.

In this field, possible collaborations could be with the Developmental Testbed Center (NCAR) group that developed the MET software, which already includes some of the features described here, and with the Australian Bureau of Meteorology.

Planned activities

The following items represent what should be done in order to have a more complete common verification system.

Development of a common global score

To better assess the improvement of the different implementations of the COSMO model (and also to compare them), a Global Score has to be developed. This score will be based on a mix of continuous and categorical elements and will provide a yearly trend of the general behaviour of the model. It has to be designed to be useful from both an administrative and a developer's point of view, and is constructed in a similar way to the UK NWP index.
In particular, it will be based on total cloudiness, t2m, 10m wind vector and precipitation, and will be included in the Common Verification Suite.

Development of probabilistic forecasts and ensemble verification

Suitable verification methods also have to be applied to probabilistic and ensemble forecasts. In general there are three methods to evaluate this kind of forecast:

The most widely used measures include the continuous ranked probability score and its related skill score, the rank histogram (Talagrand diagram), the Brier score and its decomposition, and the ROC.
In this framework a common set of "probabilistic measures" has to be chosen and included, in the near future, in the common verification package.
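
Two of the measures listed above, the rank histogram and the continuous ranked probability score of an ensemble, can be sketched in Python as follows. This only illustrates the standard definitions and is not part of the common verification package; the function names are hypothetical, and ties between members and observation are not treated specially.

```python
def rank_histogram(ensembles, observations):
    """Count how often the observation falls at each rank among the sorted members.

    A flat histogram suggests a well-calibrated spread; a U shape suggests
    too little spread.
    """
    size = len(ensembles[0])
    counts = [0] * (size + 1)
    for members, obs in zip(ensembles, observations):
        rank = sum(m < obs for m in members)  # number of members below the observation
        counts[rank] += 1
    return counts


def ensemble_crps(members, obs):
    """CRPS of one ensemble forecast, treating the members as an empirical distribution.

    Uses the identity CRPS = E|X - y| - 0.5 * E|X - X'|.
    """
    m = len(members)
    error_term = sum(abs(x - obs) for x in members) / m
    spread_term = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return error_term - spread_term
```

For a single-member "ensemble" the CRPS reduces to the absolute error, which is what allows deterministic and ensemble forecasts to be compared with the same score.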

HIRLAM

The HIRLAM-A programme maintains a web portal where members can access verification statistics and monitor observation usage and diagnostic output from the operational suites of member institutes. The portal can be used for comparing the performance of different implementations of the HIRLAM and HARMONIE forecasting systems, and for spotting anomalous behaviour in a given forecast suite. A harmonization of production and display among the partners, as well as an extension of the material on forecast charts and meteograms, field verification statistics, departure statistics from the data assimilation and online comparison of forecasts with localized profile and flux measurements, are all planned for 2008.

Observation verification statistics are produced by the new HARMONIE verification package (Andrae, 2007), and presented as tables, maps, time-series, vertical profiles, histograms, scatter plots, and diurnal or seasonal cycles. Comparative descriptors include bias and RMS scores and actual data values, as well as a large number of descriptors related to contingency tables of categorical forecasts. SAL verification (Wernli et al., 2008) is a recent capability of the package.
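
As a rough illustration of SAL, the amplitude component (A) and the first part of the location component (L1) depend only on domain means and centres of mass, and can be sketched in Python as below. The structure component and the second location term require the identification of precipitation objects and are omitted. This is not the HARMONIE implementation; the function names are hypothetical, and the normalising distance is assumed to be the diagonal of a rectangular grid measured in grid units.

```python
import math


def domain_mean(field):
    """Mean precipitation over the whole domain."""
    values = [v for row in field for v in row]
    return sum(values) / len(values)


def sal_amplitude(forecast, observed):
    """Normalised difference of domain-mean precipitation, between -2 and +2."""
    d_fc, d_ob = domain_mean(forecast), domain_mean(observed)
    return (d_fc - d_ob) / (0.5 * (d_fc + d_ob))


def centre_of_mass(field):
    """Precipitation-weighted mean (row, column) position; needs a non-zero field."""
    total = sum(v for row in field for v in row)
    i_c = sum(i * v for i, row in enumerate(field) for v in row) / total
    j_c = sum(j * v for row in field for j, v in enumerate(row)) / total
    return i_c, j_c


def sal_location_l1(forecast, observed):
    """Distance between the two centres of mass, scaled by the domain diagonal."""
    rows, cols = len(forecast), len(forecast[0])
    d_max = math.hypot(rows - 1, cols - 1)
    (i_f, j_f), (i_o, j_o) = centre_of_mass(forecast), centre_of_mass(observed)
    return math.hypot(i_f - i_o, j_f - j_o) / d_max
```

A value of A = 1 corresponds to a forecast with three times the observed domain-mean precipitation, and A = 0 to an unbiased total.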

Calibration and validation of the ensemble prediction system GLAMEPS are carried out using the HPPV verification package developed at AEMET (Santos and Hagel, 2007). The package is suited for use in a multi-model environment, and yields rank histograms, PIT histograms, spread-skill relation, Brier skill score, ROC curves, ROC area, reliability diagrams, sharpness histograms and RV curves.
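
To make the ROC products concrete, the curve and its area can be derived from probability forecasts of a binary event as in the Python sketch below. It is not taken from the HPPV package; the function names are hypothetical, and the area is assumed to be computed with the trapezoidal rule.

```python
def roc_points(probabilities, outcomes, thresholds):
    """Return (false alarm rate, hit rate) pairs, one per probability threshold.

    The end points (1, 1) and (0, 0) are added so that the curve is closed.
    Assumes that the sample contains both events and non-events.
    """
    events = sum(outcomes)
    non_events = len(outcomes) - events
    points = [(1.0, 1.0)]
    for t in sorted(thresholds):
        hits = sum(1 for p, o in zip(probabilities, outcomes) if p >= t and o)
        false_alarms = sum(1 for p, o in zip(probabilities, outcomes) if p >= t and not o)
        points.append((false_alarms / non_events, hits / events))
    points.append((0.0, 0.0))
    return points


def roc_area(points):
    """Area under the ROC curve by the trapezoidal rule (0.5 = no discrimination)."""
    ordered = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]))
```

For an ensemble, the probabilities would typically be the fraction of members forecasting the event, and the thresholds the corresponding member counts.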

Measurements of radar reflectivity by Finnish radars are used for quality assessment of high-resolution forecasts with the aid of the radar simulation model (RSM; Haase and Crewell, 2000; Haase and Fortelius, 2001), applying SAL verification among other methods. It is planned to include the RSM software in the HARMONIE distribution in the near future.

Fuzzy methods for evaluating the information in mesoscale forecasts, using model output statistics (MOS) and traditional verification scores, are being developed and tested (Kok et al., 2008).

Met Office

Ensemble verification has been built into the verification package to enable verification of the MOGREPS forecasts, which are now running routinely in operations and being evaluated. Reliability tables, rank histograms, ROC curves, Brier scores and value plots have been included. Evaluation of multi-model ensembles is also planned.
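
A reliability table of the kind mentioned above can be sketched in Python as follows. This illustrates the general technique only and is not the Met Office code; the function name, the equal-width probability bins and the default of five bins are assumptions.

```python
def reliability_table(probabilities, outcomes, n_bins=5):
    """Return (mean forecast probability, observed frequency, count) per non-empty bin.

    For a reliable system the first two values are close to each other in
    every bin.
    """
    counts = [0] * n_bins
    probability_sums = [0.0] * n_bins
    event_counts = [0] * n_bins
    for p, o in zip(probabilities, outcomes):
        k = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes into the last bin
        counts[k] += 1
        probability_sums[k] += p
        event_counts[k] += o
    return [
        (probability_sums[k] / counts[k], event_counts[k] / counts[k], counts[k])
        for k in range(n_bins)
        if counts[k]
    ]
```

Plotting observed frequency against mean forecast probability gives the reliability diagram, and the counts give the accompanying sharpness histogram.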

The fractions skill score, FSS (Roberts & Lean, 2008), for precipitation has been calculated since March 2007 from operational 12 km and 4 km forecasts over the UK, using the radar composite analyses as truth. The usefulness of the score will be assessed following summer 2008. The intensity/scale technique of Casati et al. (2004) has also been routinely applied to the same forecasts and will be compared and contrasted with the FSS. The resolution of the UK model is planned to improve to 1.5 km in 2009, and the test forecasts now underway will be evaluated with these methods to determine which are most informative, robust and suitable for such high resolution.

The pilot OPERA European radar composite products are being assessed and evaluated for their potential use in verification of forecasts over Europe. Using NAE short period forecasts to quality control the composites has revealed regional characteristics and radar problems that need to be addressed and/or allowed for.

A moderate severe weather index is being developed to measure the skill of limited area model forecasts in predicting high impact weather events. We are collaborating with Exeter University (Chris Ferro and David Stephenson) in exploiting theoretical statistical methods of Extreme Value Theory for evaluating forecasts of extreme events.

A general review of the verification methods for warnings is under way in collaboration with Exeter University (Ian Jolliffe, David Stephenson).

References

Andrae, U., 2007: Verification and monitoring in HARMONIE. HARMONIE workshop on physical parameterizations, Helsinki 10-14 September 2007.

Ebert, E., 2008: Fuzzy verification of high-resolution gridded forecasts: a review and proposed frameworks. Meteorol. Appl., 15, 51-64.

Haase, G., and S. Crewell, 2000: Simulation of radar reflectivities using a mesoscale weather forecast model. Water Resour. Res., 36, 2221-2231.

Haase, G., and C. Fortelius, 2001: Simulation of radar reflectivities using HIRLAM forecasts. Hirlam Tech. Rep. 57. 24 pp.

Kok, K., B. Wichers Schreur, and D. Vogelezang, 2008: Valuing information from mesoscale forecasts. Meteorol. Appl., 15, 103-111.

Raspanti, A., A. Celozzi, and A. Galliani, 2006: Common Verification Suite. User and Reference Manual. 30 pp. CNMCA [Internal Report].

Santos, C., and E. Hagel, 2007: INM-SREPS and GLAMEPS Postprocessing and Verification. HIRLAM-A all staff meeting and ALADIN general assembly 2007, Oslo, 23-26 April 2007.

Wernli, H., P. Paulat, M. Hagen, and C. Frei, 2008: SAL - a novel quality measure for the verification of quantitative precipitation forecasts. To appear in Monthly Weather Review.