Missing data estimation for 1–6 h gaps in energy use and weather data using different statistical methods



    INTERNATIONAL JOURNAL OF ENERGY RESEARCH
    Int. J. Energy Res. 2006; 30:1075–1091
    Published online 17 May 2006 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/er.1207

    Missing data estimation for 1–6 h gaps in energy use and weather data using different statistical methods

    David E. Claridge*,† and Hui Chen

    Energy Systems Laboratory, Texas A&M University System, College Station, Texas, U.S.A.


    Analysing hourly energy use to determine retrofit savings or diagnose system problems frequently requires rehabilitation of short periods of missing data. This paper evaluates four methods for rehabilitating short periods of missing data. Single variable regression, polynomial models, Lagrange interpolation, and linear interpolation models are developed, demonstrated, and used to fill 1–6 h gaps in weather data, heating data and cooling data for commercial buildings. The methodology for comparing the performance of the four different methods for filling data gaps uses 11 1-year data sets to develop different models and fill over 500 000 pseudo-gaps 1–6 h in length for each model. These pseudo-gaps are created within each data set by assuming data is missing, then these gaps are filled and the filled values compared with the measured values. Comparisons are made using four statistical parameters: mean bias error (MBE), root mean square error, sum of the absolute errors, and coefficient of variation of the sum of the absolute errors. Comparison based on frequency within specified error limits is also used.

    A linear interpolation model or a polynomial model with hour-of-day as the independent variable both fill 1–6 missing hours of cooling data, heating data or weather data, with accuracy clearly superior to the single variable linear regression model and to the Lagrange model. The linear interpolation model is the simplest and most convenient method, and generally showed superior performance to the polynomial model when evaluated using root mean square error, sum of the absolute errors, or frequency of filling within set error limits as criteria. The eighth-order polynomial model using time as the independent variable is a relatively simple, yet powerful approach that provided somewhat superior performance for filling heating data and cooling data if MBE is the criterion, as is often the case when evaluating retrofit savings. Likewise, a tenth-order polynomial model provided the best performance when filling dew-point temperature data when MBE is the criterion. It is possible that the results would differ somewhat for other data sets, but the strength of the linear and polynomial models relative to the other models evaluated seems quite robust. Copyright © 2006 John Wiley & Sons, Ltd.

    KEY WORDS: filling data gaps; heating data; cooling data; dry-bulb temperature data; dew-point temperature data

    Received 23 June 2004; Revised 11 December 2005; Accepted 2 February 2006

    †E-mail: dclaridge@tamu.edu

    Contract/grant sponsor: Texas State Energy Conservation Office

    *Correspondence to: David E. Claridge, Energy Systems Laboratory, Texas A&M University System, College Station, Texas, U.S.A.


    Any long-term monitoring effort will have some data records that are missing or bad. These missing or bad records may be due to data processing problems or instrumentation and monitoring hardware problems. The Texas LoanSTAR program monitored energy use data for periods of a year or more from over 200 buildings starting in 1990. This data has been used in conjunction with hourly National Weather Service data to determine retrofit savings and as an aid in diagnosing operating problems in the buildings. About 1% of the weather records are missing (Chen, 1999) and about 2% of the energy records are missing (Haberl et al., 1998). However, since daily totals and daily average values are often used for savings determination, a single missing record in a day requires the missing value to be estimated, or the entire day of data to be discarded. Analysis of the missing data showed that all of the missing NWS data and about 60% of the missing energy data was in gaps 1–6 h in length (Chen, 1999).

    Three different investigators have reported efforts to fill hourly weather data used for energy simulation. Colliver et al. (1995) investigated the use of linear, third-order polynomial, and cubic spline interpolation techniques to obtain 24 hourly readings per day from 3-h data and found that linear interpolation was the best for filling the dew-point temperature gaps and the cubic spline technique provided better results for dry-bulb temperature data. Developing the Typical Meteorological Year data sets used in energy simulation required filling some missing weather data. Gaps of up to 5 h were filled by linear interpolation, except for relative humidity, which was calculated based on measured or filled dry-bulb and dew-point temperature data. Gaps of length 6–47 h were filled by using data from adjacent days for identical hours and then by adjusting the data so that there were no abrupt changes in data values between the filled and measured data (Marion and Urban, 1995). Haberl et al. (1995) reported that the DOE-2 weather packer uses linear interpolation to fill weather gaps of less than 24 h.

    Numerous other investigators have used a variety of techniques to fill other forms of missing data. Kemp et al. (1983) used a linear model and weighted regression to calculate missing daily temperature data within stations in northern and central Idaho. Baker et al. (1988) used a linear model to generate hourly temperature data from the daily highs and lows. Acock and Pachepsky (2000) used data from adjacent days to fill missing maximum and minimum daily temperatures using the so-called group method of data handling. Schneider (2001) used a regularized expectation maximization algorithm to impute missing values of mean July temperatures where spatially adjacent values were present. Others who have investigated techniques for filling missing non-weather data include Beckers and Rixen (2003), Farhangfar et al. (2004), Latini et al. (2001), Junninen et al. (2004), Smith et al. (2003), and Sprott (2004).

    It is significant to note that while many other investigators have examined the use of techniques for interpolating weather data, only Baltazar and Claridge (2002) have examined techniques for interpolating building energy use data. They examined the use of cubic spline and Fourier series techniques for filling 1–6 h gaps in cooling and heating data series.

    This paper evaluates the use of three methods that have not been previously used for filling data gaps of 1–6 h in data sets of dry-bulb and dew-point temperature and commercial building heating and cooling energy use. The methods examined are single variable regression, polynomial interpolation, and Lagrange interpolation. These methods are examined with temperature and with hour-of-day as the independent variable, and the accuracy of these models is compared with one another and with simple linear interpolation.




    Five nearly complete 1-year data sets of dry-bulb temperature, dew-point temperature, and heating data and six 1-year sets of cooling data were used to evaluate the different gap filling techniques. Thousands of artificial data gaps, which will be called pseudo-gaps hereafter, were created within each data set and the values estimated by each gap filling technique within each pseudo-gap were compared with the measured values to evaluate the techniques.

    Each interpolation model is evaluated for filling data gaps of 1–6 consecutive hours. The gaps evaluated are created by creating a pseudo-gap of a particular length (e.g. 6 h) starting with the 13th hour of a 1-year data set, filling the missing data, and evaluating the errors; the second pseudo-gap created begins with the 14th hour of the data set, the gap is filled, evaluated, etc. until all possible pseudo-gaps in the data set have been created and evaluated. The first pseudo-gap starts with the 13th hour of the data set and the last pseudo-gap ends with the 13th hour from the end of the data set, since up to 12 h of data on each side of the gap are required by the models used to fill the pseudo-gaps. All pseudo-gaps that can be created in each data set are evaluated, so the maximum number of pseudo-gaps that are created in a complete 8760 h data set varies from 8731 (for 6-h gaps) to 8736 (for 1-h gaps). The number of pseudo-gaps created is reduced by the presence of some real gaps in the data sets used.

    The single variable regression and polynomial models use the 12 data points on each side of the pseudo-missing data (24 total points) to create a model and fill the gap. 12 h was chosen after investigating the accuracy of shorter and longer periods on either side (Chen, 1999). The linear interpolation model is based on a single measured point on either side of the data gap, and the Lagrange model is based on four measured data values on either side of the data gaps.
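    The pseudo-gap enumeration described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names, the 0-based indexing, and the use of `None` to mark real missing records are assumptions made here.

```python
def pseudo_gap_starts(n_hours, gap_len, margin=12):
    """0-based start indices of every pseudo-gap of length gap_len that
    leaves `margin` hours of data on both sides of the gap."""
    first = margin                      # the 13th hour of the data set
    last = n_hours - margin - gap_len   # gap then ends at the 13th hour from the end
    return range(first, last + 1)


def is_usable(series, start, gap_len, margin=12):
    """Assumed rule: skip a pseudo-gap if any record in the gap or in its
    12-h margins is really missing (marked as None in this sketch)."""
    window = series[start - margin:start + gap_len + margin]
    return all(v is not None for v in window)


# A complete 8760-h year reproduces the counts quoted in the text.
for gap_len in (1, 6):
    print(gap_len, len(pseudo_gap_starts(8760, gap_len)))
# prints "1 8736" and "6 8731"
```

    The count for a complete year is 8760 − 24 − L + 1 for a gap of L hours, which gives 8736 for 1-h gaps and 8731 for 6-h gaps.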

    2.1. Criteria used to evaluate the models

    The criteria used in this paper to evaluate models for filling data gaps are model accuracy and model simplicity. Model accuracy is the primary criterion, but if two models have comparable accuracy, the simpler model is preferred.

    Model accuracy will be expressed in terms of multiple statistical parameters. Minimizing the mean bias error (MBE) is the most important criterion when the data are used for savings determination. However, the root mean square error (RMSE), sum of the absolute value of the errors (SAE), coefficient of variation of SAE (CV-SAE), error percent, and relative error are also used.

    The error % is simply the percent error between a single filled pseudo-gap value and the measured value of that point. Measures of gap filling accuracy presented that are based on error % are the percent of filled points that are within 5, 10, and 15% of the correct heating and cooling values, or % of gaps where SAE is within 1, 2 and 3°F of the correct values.

    Most of the measures above will be presented as average monthly values. These values are determined as follows, using RMSE as an example. The average value of the RMSE for each month is first calculated for all pseudo-gaps that have been filled during each month in the data set from a particular building or weather station. Then the average of these monthly values is computed for all data sets treated in that specific case.

    There are so many individual comparisons to consider that an additional measure has been adopted to simplify the final comparisons. Relative error (RE) has been defined using the normal definition $\mathrm{RE} = 100\% \times (\mathrm{Err}_2 - \mathrm{Err}_1)/\mathrm{Err}_1$, where $\mathrm{Err}_1$ and $\mathrm{Err}_2$ are the monthly average values of the errors of models 1 and 2, respectively.

    Hence a positive value of the relative error indicates that model 1 is superior to model 2 in this comparison (and vice versa) whenever a small value of the quantity compared is desirable. A negative value indicates that model 1 is superior to model 2 (and vice versa) whenever a large value of the quantity compared is desirable.
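    A minimal sketch of these accuracy measures for a single filled pseudo-gap follows. The function names are hypothetical, and the sign convention (error = filled value minus measured value) is an assumption, since the text does not state it. CV-SAE is omitted because its normalisation is not defined in this section.

```python
import math


def gap_errors(filled, measured):
    """Return (MBE, RMSE, SAE) for one filled pseudo-gap.
    Assumed sign convention: error = filled - measured."""
    errs = [f - m for f, m in zip(filled, measured)]
    n = len(errs)
    mbe = sum(errs) / n                             # mean bias error
    rmse = math.sqrt(sum(e * e for e in errs) / n)  # root mean square error
    sae = sum(abs(e) for e in errs)                 # sum of absolute errors
    return mbe, rmse, sae


def relative_error(err1, err2):
    """RE = 100% x (Err2 - Err1) / Err1, comparing model 2 against model 1."""
    return 100.0 * (err2 - err1) / err1


mbe, rmse, sae = gap_errors([10.0, 12.0, 11.0], [11.0, 12.0, 9.0])
print(round(mbe, 3), round(rmse, 3), sae)   # 0.333 1.291 3.0
print(relative_error(2.0, 3.0))             # 50.0: model 1 has the smaller error
```

    In the paper these per-gap values are averaged by month and then across data sets before RE is computed.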

    2.2. Models investigated

    2.2.1. Linear interpolation model. The literature reviewed has heavily used linear interpolation with apparent success, so this approach was included among the techniques to be investigated. The linear interpolation model adopted in this paper is the normal model of the form:

    $$ f_1(x) = f(x_0) + \frac{f(x_1) - f(x_0)}{x_1 - x_0}\,(x - x_0) \tag{1} $$

    The independent variable, x, was considered to be the time, which was at 1 h intervals in all comparisons in this paper.
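    As an illustration of Equation (1), the sketch below fills a gap from the single measured point on either side. The function name and the sample values are hypothetical.

```python
def fill_linear(x0, f0, x1, f1, gap_xs):
    """Equation (1): straight line through (x0, f0) and (x1, f1),
    evaluated at each missing hour in gap_xs."""
    slope = (f1 - f0) / (x1 - x0)
    return [f0 + slope * (x - x0) for x in gap_xs]


# Hours 13-15 are missing; hours 12 and 16 were measured.
print(fill_linear(12, 20.0, 16, 24.0, [13, 14, 15]))  # [21.0, 22.0, 23.0]
```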

    2.2.2. Lagrange interpolation model. It is often convenient or possible to use Lagrange interpolation at both equal and unequal intervals (Steven and Raymond, 1996; Erwin, 1983). The Lagrange interpolating polynomial can be represented concisely as

    $$ p_n(x) = \sum_{j=0}^{n} f(x_j)\,P_j(x) \tag{2} $$

    $$ P_j(x) = \prod_{\substack{i=0 \\ i \neq j}}^{n} \frac{x - x_i}{x_j - x_i}, \qquad P_j(x_k) = \begin{cases} 1, & k = j \\ 0, & k \neq j \end{cases} \tag{3} $$

    The independent variable, x, was considered to be the time, which was at 1 h intervals in all comparisons in this paper.
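    The Lagrange form can be evaluated directly from its definition. The sketch below is a plain-Python illustration with hypothetical names; following the methodology section, the example uses four measured values on either side of the gap, and the data are synthetic.

```python
def lagrange_fill(xs_known, fs_known, x):
    """Evaluate the Lagrange interpolating polynomial through the points
    (xs_known[j], fs_known[j]) at the missing position x."""
    total = 0.0
    for j, (xj, fj) in enumerate(zip(xs_known, fs_known)):
        basis = 1.0
        for i, xi in enumerate(xs_known):
            if i != j:
                basis *= (x - xi) / (xj - xi)  # factor of the basis polynomial P_j(x)
        total += fj * basis
    return total


# Four measured hours on each side of a 2-h gap (hours 4 and 5 missing).
hours = [0, 1, 2, 3, 6, 7, 8, 9]
values = [float(h * h) for h in hours]  # synthetic data: f(h) = h^2
print([lagrange_fill(hours, values, h) for h in (4, 5)])
```

    With eight support points the interpolant is a seventh-degree polynomial, so it reproduces the quadratic test data up to rounding; on noisy measured data a polynomial of this degree can oscillate between the support points.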

    2.2.3. Polynomial model. A one variable polynomial model is defined as

    $$ y = a_0 + a_1 x + a_2 x^2 + \cdots + a_m x^m + e \tag{4} $$

    where y represents the dependent variable and x the independent variable. The largest exponent, or power, of x used in the model is known as the degree of the model, and it is customary for a model of degree m to include all terms with lower powers of the independent variable. The least squares method is used to estimate values of the parameters $a_0, a_1, a_2, \ldots, a_m$ that minimize the sum of the squared differences between the actual y values and the values, $\hat{y}_i$, predicted by the equation (Steven and Raymond, 1996; Erwin, 1983).

    The independent variable, x, was considered to be either the time, at 1 h intervals, or the ambient temperature, in all comparisons in this paper. It was also necessary to investigate the preferred number of points on either side of the gap to use for gap filling and the optimum order of the polynomial to be used.
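    A least-squares polynomial fill with time as the independent variable might look like the sketch below. It assumes NumPy is available and uses `numpy.polynomial.Polynomial.fit`, which rescales x internally and so keeps a high-order fit better conditioned. The function name is hypothetical and the data are a synthetic sinusoid, not measurements from the paper.

```python
import math

import numpy as np
from numpy.polynomial import Polynomial


def fill_polynomial(hours, values, gap_hours, order=8):
    """Least-squares fit of Equation (4) to the measured points around the
    gap, evaluated at the missing hours."""
    fit = Polynomial.fit(hours, values, deg=order)
    return fit(np.asarray(gap_hours, dtype=float))


# 12 measured hours on each side of a 3-h gap (hours 12-14 missing).
hours = list(range(0, 12)) + list(range(15, 27))
values = [20 + 5 * math.sin(2 * math.pi * (h - 9) / 24) for h in hours]
print(fill_polynomial(hours, values, [12, 13, 14]))
```

    The default order of 8 mirrors the eighth-order model highlighted in the abstract for heating and cooling data.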



    2.2.4. Single variable regression model. Building heating and cooling consumption is generally considered to correlate with ambient temperature more closely than with any other variable. While a variety of regression models have been used to model long-term energy use data, the simple two parameter regression model was investigated for use in filling data gaps of 6 h or less with outside air dry-bulb temperature as the only regression variable. The functional form of this model is

    $$ E = B_0 + B_1 T \tag{5} $$

    $B_0$ is the energy consumption at the intercept $T = 0$ and $B_1$ is the temperature slope.

    2.3. Data sets analysed

    2.3.1. Gap length analysis. This study began by examining multi-year hourly data from 87 data channels derived from the National Weather Service (NWS) and from the LoanSTAR database. Data examined included 27 temperature, 20 relative humidity, 14 dew point, eight cooling, eight whole building electricity, seven heating, and three air handler electricity channels. These were multi-year records that included over 300 channel-years of data.

    About 2% of the data were missing in both the LoanSTAR data and the NWS data acquired by the LoanSTAR program. The frequency of the LoanSTAR energy use gaps is far lower than the frequency of the LoanSTAR and NWS weather data gaps, but there are more long gaps in the energy use data. Data gaps 1–6 h in length cover almost all missing weather data and the majority of the missing energy use data.

    For weather data from both the NWS and LoanSTAR, there are more 1-h data gaps than any other length. The frequency of data gaps with 2 or 3 consecutive missing data hours is far lower than the frequency of gaps with 1 h of missing data. For example, the NWS temperature data from 1 January 1992 to 31 August 1997 for College Station, Texas, was examined. This analysis found that data gaps with 1-h d...

