The UW APL Implementation of the BMA Post-Processing Technique for the MM5 Ensemble


Patrick Tewson


Applied Physics Laboratory, University of Washington, Seattle, Washington


Last Revised January 4th, 2005



1. Introduction The UW APL Bayesian Model Averaging system describes a method of applying statistical post-processing to output of a mesoscale ensemble weather forecast to create a probabilistic weather forecast for the Pacific Northwest region, available via the web on a real-time basis (i.e., the BMA forecast is available some useful period of time before the time at which the forecast becomes valid). The BMA approach to model post-processing has been described extensively in Using Bayesian Model Averaging to Calibrate Forecast Ensembles (Raftery et al, Monthly Weather Review). Because different forecast parameters (such as surface temperature, precipitation, &c., &c.) behave differently in terms of statistics, data access and organization, and effective presentation, parameters have been and will continue to be added to the system as time progresses. At this point in time, the system is in place and providing a forecast for surface temperature at 0 hours GMT (4 or 5pm local time) as well as a minimum and maximum temperature over a 24-hour period. Future plans exist to add precipitation and wind speed to the system.


The module within the UW APL system that performs the BMA is actually a package developed by the UW Department of Statistics that will soon be available to the public; this package is usable in a general sense for any set of models predicting normally distributed values. The package is treated as somewhat of a “black box” by this system, and so a description of the APL implementation applies chiefly to the particular problem addressed by the system and resulting infrastructure, i.e. forecasting over the 12km domain of the UWMM5 ensemble forecast produced by UW Atmospheric Sciences. However, the methods and overall approach are likely to be useful for any institution wishing to create a similar system to provide post-processing for meteorological ensembles. Specifically, the process can be separated into three basic steps: data acquisition and archival, running the BMA package and generating BMA parameters, and visualization of results. In the UW APL system, the first two steps run on a scheduled basis but are reliant on data becoming available. The last step is triggered by a user request. Viewed as a black box, the system takes ensemble forecasts and observations from several observational networks as input; it produces plots to convey the probabilistic forecast as an output.


Figure 1.1. System diagram


This document is intended to describe the approach UW APL has taken to run this system, but also focus on how the approach could be generalized to anyone working with an equivalent ensemble system. To this end, details specific to the UW APL problem will be described but flagged as such, with thoughts on generalizing the step included.


2. Data acquisition and Archival The BMA system ingests ensemble forecasts and observational data; this first step of the system involves gathering this data in a formalized manner and creating the files that arrange this data in preparation for the next step. Apart from quality control, there is no calculation involved with this step – only data gathering and arrangement.


Model and observational data are both stored on disk locally at APL, although both data sets are pulled from storage at atmospheric sciences on a daily basis. The UW APL system gathers ensemble data and observational data via a copy operation scheduled to occur at a time when the ensemble forecast is thought to be complete. This is functional and simple to set up, but forecast lead time and reliability would be improved if the data gathering were somehow triggered by the data availability, i.e. when the ensemble forecast is complete or the observations are available. Locally, data is arranged by date, ensemble member, init hour and domain in a manner as follows:


<STORAGE_BASE>/YYYYMMDDII/ARCHIVE_????_d1

<STORAGE_BASE>/2004122000/ARCHIVE_ukmo_d1


Where the directory is named by date (year + month + day + init hour) followed by a file name labeled by member name and a domain indicator, and 'STORAGE_BASE' is the path under which all the files are stored. Similarly, observation files are stored as follows:


<STORAGE_BASE>/YYYYMM/obs.YYYYMMDDHH

<STORAGE_BASE>/200412/obs.2004121817


Where the first subdirectory stores all the observations for a single month, and the file is labeled by year, month, day, and hour. This organization arranges the model data in what amounts to an informal database, accessed through building a file name and path from the requested parameters. A possible improvement to this system would use a more formal database to provide access to this data. APL's implementation accesses data through software that pulls a forecast from each ensemble member at each observation location and combines the data into a single file, arranged by calendar date, parameter, and forecast hour that contains the forecast from each ensemble member, at the observation location for a single calendar date. This file is organized as follows:


DDMM YYYYDDMMII LATDECDEG LONDECDEG ENS[0] ... ENS[N] OBS


Where each row corresponds to an observed value. The fields are: two redundant date fields, latitude and longitude, and then a forecast value for each member of an N-member ensemble, and finally the observed value. The ensemble forecasts should be in the same units as the observed value.



Figure 1. Data flow diagram for the data acquisition and archival phase, N training days.


In this system, the data collection process creates not only a file fore ens+observation files for that day, but for any day within the BMA training period (25 days) that does not have an associated file. Thus, if there is a problem creating the file on any particular run, the gap is filled during the next successful run. At the end of the data collection process, when all possible daily files are complete, a 'BMA input file' is created, which is simply the concatenation of each individual daily file within the training period into a single file. This file is used as input to the next step, the running of the BMA algorithm and creation of the BMA parameters.


Daily files are archived by day; this provides the ability to examine past results, and also to arbitrarily assemble a input file over any training period by simply combining these files together. Again, an informal database is created through encoding of file names and directory hierarchy. The archive format use for this system is as follows, by param, domain, init time, and date:


<STORAGE_BASE>/bma_?????_d?_??/YYYYMM/bma.YYYYMMDD

<STORAGE_BASE>/bma_MINT2_d2_00/200412/bma.20041201


Different forecast parameters behave differently. Because of this, some parameter-dependent handling is required in the data access stage. For instance, creating a file for 48 forecast of 2m temperature requires only getting the temperature forecasts and corresponding observation; however, maximum or minimum temperature requires gathering observations and forecasts over a spread of time and pulling the corresponding minimum or maximum out of each. Similarly, precipitation fields are given as cumulative throughout the duration of the (48 hour-long) model run. Processing is required to compare this field to an observation that shows accumulation over a much shorter window of time.


3. Running the BMA Package The UW APL system described here is based upon the BMA package that runs in the R program, a common and freely-available prototype language used in math and statistics. The BMA package was developed in the UW Department of Statistics (Raftery et al) and will soon be available to the public through the CRAN web site. Running the BMA in R for the small set of parameters currently running at APL does not stretch currently available computing resources, but an effort with more ambitious goals to cover diverse parameters and time steps would want to consider mimicking the BMA package in a more efficient language, or attaining a FORTRAN implementation currently under development.


The BMA R package is itself based around an EM algorithm. While the package as distributed provides functions to run the algorithm and some utilities to interpret results, there is no capability to parse input files or write output files. To provide an interface to the rest of the system, for reading the input file from the correct location, and for writing the output file to the appropriate place in the archives, additional companion libraries were developed at APL (they make up the APL.R script) to perform these functions and arrange input data into the format taken by the distributed library. This companion library also includes a single function that triggers the entire BMA computation. With these script files in place, running the BMA is as simple as launching a cron script that starts R, importing the script files, and calling the top-level function.


Figure 3.1. Data flow diagram for running the BMA package.


Output results are archived alongside the daily input files, in the following format:


<STORAGE_BASE>/bma_?????_d?_??/YYYYMM/results.YYYYMMDD

<STORAGE_BASE>/bma_MINT2_d2_00/200412/results.20041201


This file contains a row for each of: “A” bias, “B” bias, weight, and variance. For an N-member ensemble, each of these rows contains N tokens following the row label (“A”, “B”, “W”, or “V”). Although the current system uses a constant variance across the ensemble, a value is written for each member for simplicity and to support a possible variable variance. This output file, when combined with forecast data, serves as the input to the visualization element of the system that brings the probabilistic forecast to the user.


4. Visualization Unlike the previously described steps in the BMA system, which are scheduled to occur on a daily basis, the visualizations are built in response to a request by a user. This is also where much of the the APL development effort has gone, the rest of the process being data arrangement and archival for running the provided BMA package. Servlet elements running under a Apache/Tomcat web server take parameters from the user request and return plots for the requested model init date, parameter, forecast hour if applicable, and for location plots the requested latitude and longitude.


As mentioned above, the visualization process runs in Java by way of a “Servlet”, or a Java program that runs under a web server with the same manner of form-processing capability as a CGI script. The visualization process follows a few basic steps; the user submits a request via the BMA web page, after which plots are generated for the requested parameters using either the Openmap or Ptolemy plotting packages. Finally, the generated plot is streamed to the requesting client. JSP technology is used extensively in the form interaction and dynamic HTML generation. CDF evaluation of Gaussian normals is performed using publicly available Java functions.


All the plots currently displayed are generated in response to user requests, and are streamed back to the requester as part of the HTTP response. In contrast to a system where a pre-determined set of plots are written to disk and made available to the user, plots for the UW APL implantation are never written to a file. Instead, the Servlet that generates the plots sets the response mime type to “image/png” and sends the BufferedImage using a call to :

javax.imageio.ImageIO.write(RenderedImage im, String formatName, OutputStream output)

Where the OutputStream is a FileCacheImageOutputStream created on the response output stream, grabbed in the “doGet()” or “doPost()” method of the Servlet. This technique, where the requested plot is generated dynamically in response to the user's request and streamed directly to the client browser, motivates the body of the visualization process.

The MURI program has focused on providing a set of five fields available from applying the BMA results: 1) A deterministic forecast, 2) an upper and 3) lower bound of a predictive interval covering a certain percent of likely outcomes, the 4) width of that predictive interval, and 5) the probability that a certain threshold is exceeded. These will be referred to as the 'deterministic', 'upper bound', 'lower bound', 'interval width', and 'probability of exceeding threshold', and are shown for a single point in figure 4.1 below. Each of these parameters reduces to a single scalar value once a boundary or threshold is specified, and a grid of such values is assembled to produce the visualization when a color scale is applied to the grid. In addition a box plot or PDF plot is available at any point on the map. While the deterministic field is simply a weighted average of the forecasts of the ensemble, the other fields require evaluation of the BMA CDF, which in turn requires integration of each of the ensemble member Gaussian normals. A bisection method is used to find the correct parameter value at which the requested percent of probability has been covered. Integrating the normal is achieved through a publicly available implementation of the error function 'erf', the original FORTRAN version of which is located at http://www.netlib.org/specfun/erf. Generation of a grid plot requires this evaluation to be performed over the entire grid, which in our case is approximately 100 x 100 points, or 10,000 individual locations. Each CDF requires a Gaussian normal evaluation for each of the 8 ensemble members, and the bisection method usually takes about a dozen or so iterations to converge. Each plot generation requires 960,000 integrations of a Gaussian normal, or roughly one million for a plot request, emphasizing the importance of an efficient error function. While the BMA PDF itself is non-normal, evaluation of the PDF and CDF can be accomplished by a (weighted) sum of the same results for the aggregate normals.

The BMA visualization set also includes a PDF plot for a single point on the map, similar to what is shown in figure 4.1. Although the user can manually enter the desired latitude and longitude, it is by far more convenient to return a plot for a location selected by the user by clicking on the map. This is accomplished by treating the map plot image as a an image map that posts to or gets from the servlet. The X and Y locations in pixel space are then passed in the parameter list, and are converted to a lat/lon pair on the server by performing an inverse projection from the map object. The model data is then interpolated to the lat/lon location and the plot is assembled.



Figure 4.1. The five forecast parameters shown in reference to a single PDF.


5. Conclusion The elements that combine to make the APL BMA system possible – archival of and access to data, archival of results, and the ability to assemble the two on the fly to create the visualizations and display them via the web – are commonly available and all software and tools not developed at APL are (or are soon to be) freely and publicly available. While the implementation developed at APL is specific to the MM5 ensemble and the Pacific Northwest, the techniques are generally applicable for a large range of situations and problems.


Although the values vary over time, generally the BMA system (surface temperature) has good coverage (coverage of the 90% interval is typically within a few percent of its intended value) with the a 90% interval half-width of 3.5 to 5.5 deg C; i.e., the observed value should fall within 3.5 to 5.5 degrees of the forecast value 90% of the time. Over time, there is a high correlation of forecast error to model disagreement, implying that the width of the BMA interval is a useful predictor of error and uncertainty.

References


Raftery, A.E., T. Gneiting, F. Balabdaoui and M. Polakowski (2004) Using Bayesian Model Averaging to Calibrate Forecast Ensembles, to appear in Monthly Weather Review


The Ptolemy package is available from http://ptolemy.eecs.berkeley.edu/


The Openmap package is available from http://openmap.bbn.com/


R is available from http://cran.r-project.org/


Java is available at http://java.sun.com


The Apache Web Server is located at http://www.apache.org/


The Tomcat Servlet Container is located at http://jakarta.apache.org/tomcat/


Appendix I: Principle classes, scripts, and third-party modules


Genearal:

TransferRayobs.sh (script to copy obs from atmos. sciences)

TransferMM5Archives.sh (script to copy mm5 from atmos. sciences)

createBmaWeights.sh (script to make data files, run BMA)

muri.bma.BmaConstants (path structures, param list, etc.)


Data acquisition and archival:

muri.bma.BmaDataUtils

muri.monitor.web.DataExtractorClient

muri.obs.ReadQCObs2

muri.obs.QCO


Running BMA:

APL.R (companion libraries to stat code)

bma_msc2.R (BMA distribution from statistics)

bmaInput (lists commands to feed R to run)


Visualization:

ptolemy (package used to make line plots)

openmap (package used for GIS plotting)

vmap_edge_thin.shp (map edge and shape files)

vmap_edge_thin.ssx (map edge and shape files)

mm5_12km_lats.txt

mm5_12km_lons.txt

mm5_36km_lats.txt

mm5_36km_lons.txt

muri.bma.MM5Domain

muri.bma.BmaResultReader

muri.bma.BmaContourPlotGenerator

muri.bma.BmaPointPlotGenerator

muri.bma.BmaGridParams

muri.bma.BmaPointParams

muri.bma.BmaPDFPlot

muri.bma.web.BmaServlet

muri.bma.web.BmaTag

muri.bma.CDFUtils

muri.util.Stats

muri.util.AtSciColors

muri.openmap.ContourGridLayer

muri.openmap.MapWidgetLayer


Appendix 2: Complete class listing from muri.bma.jar


muri/bma/web/BmaServlet$1.class

muri/bma/web/BmaServlet$StringDate.class

muri/bma/web/BmaServlet.class

muri/bma/web/BmaTag.class

muri/bma/BmaConstants.class

muri/bma/MM5Domain.class

muri/bma/BmaContourPlotGenerator.class

muri/bma/BmaDataUtils.class

muri/bma/BmaEvaluation.class

muri/bma/BmaGridParams.class

muri/bma/BmaPDFPlot$1.class

muri/bma/BmaPDFPlot.class

muri/bma/BmaPlotGenerator.class

muri/bma/BmaPointParams.class

muri/bma/BmaPointPlotGenerator.class

muri/bma/BmaResultReader.class

muri/bma/BmaWindUtils.class

muri/bma/CDFUtils.class

muri/bma/BmaBoxPlotGenerator.class

muri/bma/MoreBmaEvaluation.class

muri/openmap/ContourGridLayer.class

muri/openmap/GridLayer.class

muri/openmap/MapWidgetLayer$CityMapObj.class

muri/openmap/MapWidgetLayer$I90.class

muri/openmap/MapWidgetLayer$1.class

muri/openmap/MapWidgetLayer.class

muri/util/Legend.class

muri/util/Convert.class

muri/util/MMData.class

muri/util/MM5GridCoords.class

muri/util/AtSciColors.class

muri/util/Stats.class

muri/util/BoxPlot.class

muri/util/FunctionPlot$1.class

muri/util/FunctionPlot.class

muri/obs/QCOb.class

muri/obs/ReadQCObs2.class

muri/obs/ReadQCObs.class

muri/meteogram/DataSet.class

muri/monitor/web/DataExtractorClient.class

ptolemy/plot/Plot$1.class

ptolemy/plot/Plot$2.class

ptolemy/plot/Plot$3.class

ptolemy/plot/Plot$4.class

ptolemy/plot/Plot$5.class

ptolemy/plot/Plot$6.class

ptolemy/plot/Plot$7.class

ptolemy/plot/Plot$Format.class

ptolemy/plot/Plot.class

ptolemy/plot/PlotBox$1.class

ptolemy/plot/PlotBox$ButtonListener.class

ptolemy/plot/PlotBox$ZoomListener.class

ptolemy/plot/PlotBox$DragListener.class

ptolemy/plot/PlotBox$CommandListener.class

ptolemy/plot/PlotBox.class

ptolemy/plot/EPSGraphics.class

ptolemy/plot/PlotFormatter$1.class

ptolemy/plot/PlotFormatter$2.class

ptolemy/plot/PlotFormatter.class

ptolemy/plot/PlotPoint.class

ptolemy/gui/Query$QueryActionListener.class

ptolemy/gui/Query$QueryFileChooser.class

ptolemy/gui/Query$QueryFocusListener.class

ptolemy/gui/Query$QueryItemListener.class

ptolemy/gui/Query$QueryScrollPane.class

ptolemy/gui/Query$SliderListener.class

ptolemy/gui/Query.class

ptolemy/gui/QueryListener.class

ptolemy/gui/ComponentDialog$1.class

ptolemy/gui/ComponentDialog$2.class

ptolemy/gui/ComponentDialog.class

ptolemy/gui/CloseListener.class

edu/washington/apl/xml/MM5XmlFileReader.class

edu/washington/apl/metoc/mm5/ensemble/extractor/MMData.class

edu/washington/apl/math/Matrix.class

edu/washington/apl/math/MVector.class

edu/washington/apl/math/Util.class

edu/washington/apl/math/Stats.class

edu/washington/apl/io/Util.class


Page 15