Analyzing Annual Cumulative Site Properties

Research Question:

What kinds of annual cumulative values can be used to characterize weather/climate for particular sites and how can they be used to monitor changes in those sites over time?

Producing Meteorological Cumulatives

Tracking daily meteorological parameters allows us to identify diurnal, day-to-day, and seasonal patterns. When daily values are accumulated over a year, those accumulated values (we will call them "cumulatives") provide a single-value "snapshot" of the year. An obvious example is precipitation. Comparing precipitation cumulatives over several years allows us to say whether a particular site was wetter or drier in one year than in another. The graph below shows precipitation cumulatives derived from hourly data from NOAA's Climate Reference Network (CRN) site at Avondale, Pennsylvania, from 2007 (the first complete year of operation for this site) through 2012.

Charts like these make it easy to summarize a year and to isolate interesting events that might be worth additional study. At this site, the average annual precipitation for 2007-2012 (rainfall plus snow water equivalent) is about 1200 mm (47"). Outstanding precipitation events occurred during Tropical Storm Lee in 2011 and Superstorm Sandy in 2012. Note that even with Sandy, 2012 was the driest year since 2007; without Sandy, it would have been an exceptionally dry year!
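As a concrete illustration of the accumulation step, here is a minimal Python sketch. It assumes a simple two-column CSV of timestamps and hourly precipitation totals; the file name and column names are placeholders, not the actual CRN file layout.

import csv
from collections import defaultdict
from datetime import datetime

# Running precipitation cumulative for each year, built from hourly data.
# "avondale_hourly.csv" and its column names are hypothetical placeholders.
cumulative = defaultdict(float)   # year -> running total (mm)

with open("avondale_hourly.csv", newline="") as f:
    for row in csv.DictReader(f):
        t = datetime.fromisoformat(row["time"])
        cumulative[t.year] += float(row["precip_mm"])

for year in sorted(cumulative):
    print(year, round(cumulative[year], 1), "mm")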


An Example: Growing Degree Days

In agriculture, it is very useful to have a cumulative measure of how far and how often air temperatures rise above some predetermined threshold. In temperate climates, the "base temperature" for crop growth is often taken to be 50°F (10°C). (In the U.S., temperatures are still almost always reported in degrees Fahrenheit.) This value is used to calculate growing degree days (GDD) accumulated over a growing season, often starting March 1. For climatological purposes, GDD can be accumulated over an entire calendar year starting January 1. The most common definition of GDD is:

GDD = Σ[MAX(0, Tday average - Tbase)]

Because hourly CRN data are available, a modified definition based on hourly averages is:

GDD = (1/24) Σ[MAX(0, Thour average - Tbase)]

Dividing by 24 converts the accumulated degree hours into degree days. Typically, this gives a slightly different result from the daily definition: a day whose average temperature is below the base contributes nothing to the daily sum, but its individual warm hours still contribute to the hourly sum. The graph below shows GDD for 2012 at Avondale, PA, calculated from the hourly CRN data. There are a few missing values in this data set, which have been interpolated as described in the discussion below.
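Expressed in code, the two definitions might look like this; a minimal sketch, assuming temperatures are already unpacked into plain Python lists (the function and variable names are ours):

# Daily definition: one term per day, from daily average temperatures (deg C).
def gdd_daily(daily_avg_temps_c, base_c=10.0):
    return sum(max(0.0, t - base_c) for t in daily_avg_temps_c)

# Hourly definition: each hour above the base contributes 1/24 of a degree day.
def gdd_hourly(hourly_avg_temps_c, base_c=10.0):
    return sum(max(0.0, t - base_c) for t in hourly_avg_temps_c) / 24.0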


There are many other cumulatives of interest. A corresponding calculation can be done for freezing degree days – fruit trees need a certain amount of time at temperatures below freezing to set fruit properly. Warm and cold soil-temperature degree days are of interest for predicting the emergence of agricultural pests that develop underground. The equivalent of GDD with a base temperature of 90°F (32°C), applied to daily maximum temperatures, would be useful for monitoring long-term changes in the extent of heat waves. Once the simple defining equations are set up in a spreadsheet, different cumulatives can be calculated simply by changing the base temperature value in a single cell, as the sketch below also illustrates. All of these cumulatives provide ways to compare years and sites.
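For example, two of the cumulatives just mentioned reduce to the same one-parameter pattern; a sketch with illustrative names and the thresholds from the text:

# Freezing degree days: accumulate degrees below 0 deg C, hour by hour.
def fdd_hourly(hourly_avg_temps_c):
    return sum(max(0.0, 0.0 - t) for t in hourly_avg_temps_c) / 24.0

# Heat-wave cumulative: degree days above a 32 deg C (90 deg F) base,
# applied to daily maximum temperatures.
def heat_dd(daily_max_temps_c, base_c=32.0):
    return sum(max(0.0, t - base_c) for t in daily_max_temps_c)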

In addition to precipitation, non-temperature cumulatives include yearly insolation.

Dealing With Missing Data

It is inevitable that large data sets will include some missing values. As a result, what might seem like an inordinate amount of time must be dedicated to dealing with missing data – a step that will make or break any project to generate cumulatives.

In the case of the hourly NOAA CRN data, missing values appear when temporary glitches in hardware or software produce no hourly values at all, or produce values that for one reason or another do not pass the quality control checks applied to that particular parameter. By definition, cumulatives require an unbroken data record, so some way of accounting for missing data must be devised. In the worst case, long stretches of missing data may make it impossible to derive cumulatives for a site. Fortunately, for "good" data sets like those from the CRN sites, it is usually possible to develop physically and mathematically reasonable methods of filling in much of the missing data in a way that does not seriously affect yearly cumulatives.

The graph below, showing 2012 hourly average air temperatures at the CRN site in Avondale, PA, illustrates the problem. Missing air temperature values (and other missing parameters) are represented by values of -9999. With the y-axis minimum set to -20°C, these values show up as vertical lines extending down to the x-axis. Some missing hours are shown in detail in the right-hand image. The shape of the curve, in which hourly average air temperatures change in relatively predictable ways, suggests that for one or two missing points, linear interpolation is a perfectly reasonable way to fill in missing data. This maintains the unbroken data record needed for cumulatives without significantly changing the results that would have been obtained had those values not been missing. It is worth noting that precipitation data are very different from temperature data: precipitation behaves more in an "on/off" mode, and there is not necessarily any continuity that can be exploited, as there is with temperatures.
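Here is a sketch of such a fill for short gaps; the -9999 sentinel is the CRN convention noted above, while the function name and the max_gap cutoff are illustrative choices:

MISSING = -9999.0

def fill_linear(values, max_gap=2):
    # Linearly interpolate runs of missing values no longer than max_gap,
    # provided valid data exist on both sides of the run.
    v = list(values)
    i = 0
    while i < len(v):
        if v[i] == MISSING:
            start = i
            while i < len(v) and v[i] == MISSING:
                i += 1
            end = i  # first valid index after the run
            gap = end - start
            if start > 0 and end < len(v) and gap <= max_gap:
                left, right = v[start - 1], v[end]
                for k in range(gap):
                    v[start + k] = left + (right - left) * (k + 1) / (gap + 1)
        else:
            i += 1
    return v

For example, fill_linear([10.0, -9999.0, -9999.0, 13.0]) returns [10.0, 11.0, 12.0, 13.0]; longer runs are left untouched for the cubic treatment described next.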

When more consecutive points are missing, linear interpolation no longer makes physical sense and the problem becomes more challenging. Fortunately, there are alternatives, one of which is cubic polynomial interpolation. Here's the math. (See THIS LINK or any of numerous other online sources.)

For a cubic interpolation f(x) between two data points,

f(x) = ax^3 + bx^2 + cx + d

There are four unknowns – the coefficients a, b, c, and d. Knowing f(x) at the two endpoints of an interpolation interval provides only two equations, so we need two more. If we also know the derivative (slope) of the function at the endpoints of the interpolation interval, the cubic interpolation will blend smoothly with the existing data.

f'(x) = 3ax^2 + 2bx + c

Suppose we have data at four points, x=-1, x=0, x=1, and x=2, where we wish to interpolate values between x=0 and x=1. Then the four equations needed to define a, b, c, and d are:

f(0) = d
f(1) = a + b + c + d
f'(0) = c
f'(1) = 3a + 2b + c

By straightforward algebraic manipulation,

a = 2f(0) - 2f(1) + f'(0) + f'(1)
b = -3f(0) + 3f(1) - 2f'(0) - f'(1)
c = f'(0)
d = f(0)
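Transcribed directly into code, the solution looks like this; a sketch, with function names of our choosing:

def cubic_coeffs(f0, f1, fp0, fp1):
    # Coefficients of f(x) = a*x**3 + b*x**2 + c*x + d on [0, 1], given the
    # endpoint values f(0), f(1) and the endpoint slopes f'(0), f'(1).
    a = 2*f0 - 2*f1 + fp0 + fp1
    b = -3*f0 + 3*f1 - 2*fp0 - fp1
    return a, b, fp0, f0

def cubic_eval(coeffs, x):
    a, b, c, d = coeffs
    return ((a*x + b)*x + c)*x + d

# Quick check: the cubic reproduces the endpoint values.
coeffs = cubic_coeffs(1.0, 2.0, 0.5, -0.5)
print(cubic_eval(coeffs, 0.0), cubic_eval(coeffs, 1.0))   # 1.0 2.0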

The illustrative example below is taken from the link cited above. First, note that when these data are plotted in Excel using the "smoothed" scatter plot option, it is obvious that Excel applies some kind of non-linear interpolation to create the curve between the data points – possibly a cubic spline interpolation. That method pieces together multiple cubic polynomial functions to produce a curve that passes smoothly through the data points. (Mathematically speaking, cubic spline interpolation over multiple data points produces a set of curves such that where two curves meet at a data point, both the values and the derivatives of the functions match.)

In the example below, assume that the values between x=0 and x=1 are missing. The cubic interpolation first uses the values at x=0 and x=1:

f(0) = p1
f(1) = p2

But what should we use for the slopes f'(0) and f'(1)? We need the interpolated data between x=0 and x=1 to match smoothly with the data before x=0 and after x=1. One possibility is to use the slopes of straight lines drawn between x=-1 and x=1 (through p0 and p2) and between x=0 and x=2 (through p1 and p3):

f'(0) = (p2 - p0)/2
f'(1) = (p3 - p1)/2

Then:

a = -p0/2 + 3p1/2 - 3p2/2 + p3/2
b = p0 - 5p1/2 + 2p2 - p3/2
c = -p0/2 + p2/2
d = p1

The red curve shows the result.
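This first choice of slopes is, in fact, the classic Catmull-Rom construction, and it is easy to experiment with numerically; a minimal sketch:

def catmull_rom_segment(p0, p1, p2, p3):
    # Cubic on [0, 1] through p1 and p2, with central-difference slopes
    # f'(0) = (p2 - p0)/2 and f'(1) = (p3 - p1)/2.
    fp0 = (p2 - p0) / 2.0
    fp1 = (p3 - p1) / 2.0
    a = 2*p1 - 2*p2 + fp0 + fp1
    b = -3*p1 + 3*p2 - 2*fp0 - fp1
    return lambda x: ((a*x + b)*x + fp0)*x + p1

# The segment reproduces the endpoints exactly:
f = catmull_rom_segment(0.0, 1.0, 3.0, 2.0)
print(f(0.0), f(1.0))   # 1.0 3.0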

An alternative is to use the slopes of the lines between x=-1 and x=0 and between x=1 and x=2:

f'(0) = (p1 - p0)/1
f'(1) = (p3 - p2)/1

Then:

a = -p0 + 3p1 - 3p2 + p3
b = 2p0 - 5p1 + 4p2 - p3
c = -p0 + p1
d = p1

This result is the green line.

The second alternative doesn't look as reasonable as the first. But this is because the slopes at x=0 and x=1 are based on values at points that are "far away" from x=0 and x=1. This definition of the slope would work better if we could calculate the slopes from values "close to" x=0 and x=1:

f'(0) = (p1 - p1-dx)/dx
f'(1) = (p2+dx - p2)/dx

where p1-dx is the data value recorded a small interval dx before p1, and p2+dx is the value recorded dx after p2.

In the limit as dx becomes smaller and smaller, this guarantees that a function representing the data has a slope that is the same just "before" and just "after" x=0 and x=1, as it should.

Now let's apply the cubic polynomial interpolation to some air temperature data from the Avondale, Pennsylvania, CRN site. Here are some data from 2012. The graph shows average air temperature, from column 4; the x-axis, time in fractional days, is from column 3. Data for 5 hours are missing. The equations for the coefficients are:

a = 2p1 - 2p2 + (p1 - p1-dx)/dx + (p2+dx - p2)/dx
b = -3p1 + 3p2 - 2(p1 - p1-dx)/dx - (p2+dx - p2)/dx
c = (p1 - p1-dx)/dx
d = p1

To be consistent with the discussion above, the x-axis values are "normalized" so that the last point before the missing data falls at x=0 and the first point after the data resume falls at x=1.
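Putting the pieces together, here is a sketch of the whole fill for a gap in hourly data; the helper name and the sample numbers are illustrative, not the actual Avondale values:

def fill_gap_cubic(t_before, v_before, t0, v0, t1, v1, t_after, v_after, gap_times):
    # Cubic fill between (t0, v0) and (t1, v1). Times are normalized so the
    # gap runs from x=0 to x=1; endpoint slopes use the nearest good points
    # just outside the gap, per the dx-based definition above.
    span = t1 - t0
    fp0 = (v0 - v_before) * span / (t0 - t_before)
    fp1 = (v_after - v1) * span / (t_after - t1)
    a = 2*v0 - 2*v1 + fp0 + fp1
    b = -3*v0 + 3*v1 - 2*fp0 - fp1
    filled = []
    for t in gap_times:
        x = (t - t0) / span
        filled.append((t, ((a*x + b)*x + fp0)*x + v0))
    return filled

# Example: hourly temperatures (deg C) with hours 1-5 missing (times in hours).
print(fill_gap_cubic(-1, 4.8, 0, 5.0, 6, 8.0, 7, 8.1, [1, 2, 3, 4, 5]))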

Is the interpolation "right"? There is, of course, no way of knowing for sure – it is always possible that something unexpected was happening during the five missing hours! But, based on the assumption that the two values just before and after the missing data are accurate, the results are certainly reasonable, and they also look reasonable in the context of the surrounding days.

How many missing hours are too many for this interpolation scheme? There is no definitive answer. One clearly inappropriate case is when data are missing from around noon (or around midnight) on one day until roughly the same time on the following day – times when temperatures are changing slowly. The diurnal variability of the missing day is then lost, because it is not possible to calculate reasonable slopes for defining the polynomial coefficients. In such cases, it might be necessary to copy and average data from the surrounding days in the hope of approximating the temperature cycle of the missing day, as sketched below. It seems unlikely that a few completely modeled missing days would significantly affect annual temperature cumulatives.
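A sketch of that last-resort fill, averaging each hour of the day before and the day after (this assumes both neighboring days are complete):

def fill_missing_day(prev_day_temps, next_day_temps):
    # Model a completely missing day as the hour-by-hour average of its
    # neighbors; inputs are 24-element lists of hourly temperatures.
    return [(a + b) / 2.0 for a, b in zip(prev_day_temps, next_day_temps)]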

Resources

IESRE's online application for accessing CRN data is HERE.
See this link for a summary of the first ten years of operation for the Climate Reference Network.
A brief introduction to some interpolation methods, including cubic interpolation.
Historical weather data for Philadelphia, PA, from the Franklin Institute, dating back to 1872. NOTE: These data are of VERY variable quality and are a blend of data from several sources that have changed over the years. Relatively recently, the Franklin Institute moved its weather station to the roof of its building, which is NOT a good place to record air temperatures and which produces many more very hot summer days than is actually representative of Philadelphia. (You should be able to determine when this move was made by counting the days per year on which the maximum temperature exceeded 90°F.)