Journal of Computational Science 13 (2016) 83–95
Contents lists available at ScienceDirect

Data synthesis in the Community Land Model for ecosystem simulation

Hongsheng He (a), Dali Wang (b,*), Yang Xu (c), Jindong Tan (a)

a Department of Mechanical, Aerospace and Biomedical Engineering, The University of Tennessee, Knoxville, TN 37996, USA
b Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
c Department of Geography, The University of Tennessee, Knoxville, TN 37996, USA

Article history: Received 17 December 2015; Accepted 21 January 2016; Available online 10 February 2016

Keywords: Data synthesis; Data analysis; Machine learning; Affinity Propagation; ARIMA model

Abstract

Though many ecosystem states are physically observable, the number of measured variables is limited owing to the constraints of practical environments and onsite sensors. It is therefore beneficial to measure only fundamental variables that determine the behavior of the whole ecosystem, and to simulate other variables with the measured ones. This paper proposes an approach to extract fundamental variables from simulated or observed ecosystem data, and to synthesize the other variables using the fundamental variables. Because the relation of variables in the ecosystem depends on sampling time and frequencies, a region of interest (ROI) is determined using a sliding window on time series with a predefined sampling point and frequency. Within each ROI, system variables are clustered in accordance with a group of selective features by a combination of Affinity Propagation and k-Nearest-Neighbor. In each cluster, the unobserved variables are synthesized from selected fundamental variables using a linear fitting model with ARIMA errors.
In the experiment, we studied the performance of variable clustering and data synthesis under a Community-Land-Model-based simulation platform. The performance of data synthesis is evaluated by data fitting errors in prediction and forecasting, and by the change of system dynamics when synthesized data are in the loop. The experiment demonstrates the high accuracy of the proposed approach in time-series analysis and synthesis for ecosystem simulation.

© 2016 Elsevier B.V. All rights reserved.

1. Introduction

Ecosystem variables play different roles in the control and representation of ecosystem states and dynamics. With a limited number of onsite sensors, ecosystem variables are commonly observed only in part, since many variables are unobservable or expensive to observe using onsite sensors. The problem to address is therefore the identification of significant system variables and the synthesis of unobserved system variables, in order to reduce the number of onsite sensors and save the expense of practical monitoring systems. In addition, it is common practice to explore variables in ecosystem simulation for the sake of predicting climatic changes based on incomplete onsite observation. The exploration is subject to the constraints imposed by the underlying physics of geosystem variables, such that the degree of freedom in data exploration is much less than the number of variables. Data synthesis could alleviate the difficulty in data exploration while guaranteeing the physical rationality of the data.

* Corresponding author. E-mail addresses: [email protected] (H. He), [email protected] (D. Wang), [email protected] (J. Tan).

In general, part of the ecosystem variables dominate the dynamics of the whole ecosystem, and these fundamental system variables are commonly of great interest to ecosystem scientists because of their manifest physical meanings, e.g., sunlight, vegetation root growth, and ground temperature.
The other variables are typically correlated to the fundamental variables in the ecosystem. Therefore, it is feasible to synthesize dependent variables with fundamental ones, so as to reduce the number of physically observed variables. Identification of fundamental variables and data synthesis are economically and operationally beneficial in the selection and placement of onsite sensors.

1 http://cem-base.ornl.gov/CLM Web/CLM Web.html
http://dx.doi.org/10.1016/j.jocs.2016.01.005
1877-7503/© 2016 Elsevier B.V. All rights reserved.

This paper aims to identify fundamental system variables from simulated or observed ecosystem data, and to synthesize other variables using selected fundamental system variables. The variable synthesis can avoid unnecessary observation of dependent variables and facilitate ecosystem simulation. A modular ecosystem simulation platform (1) was developed based on the Community Land



Models (CLM) at Oak Ridge National Laboratory, to simulate surface energy, water, carbon, and nitrogen fluxes and state variables for both vegetated and non-vegetated land surfaces [1]. The variable synthesis methods in this paper were implemented in the current simulation platform as a plugin module that simplifies and facilitates geographical studies.

The complexity of ecosystems brings many unique challenges in data analysis and synthesis. Firstly, the relation between system variables highly depends on sampling time and observation scale; in other words, the relation is a function of time, sampling frequency, and time span. Secondly, ecosystem variables are tightly coupled, such that the change of one system variable may influence a group of dependent variables. Finally, the big data obtained during long-term observation render it very difficult to discover the underlying interaction between the variables.

Data synthesis is a problem that incorporates data from a variety of sources to produce new or enhanced information about a system following basic physical principles [2]. A model-based approach was proposed in [3] for the identification and prediction of phenological attributes from satellite image time series. The Nonlinear Harmonic Model was utilized to fit the intra-annual response of land cover multispectral reflectances obtained from satellite image time series. The work focuses on the problem of model fitting for a given time series. A Fourier-series-based approach was presented in [4] to address the data missing problem using multi-temporal analysis. A functional curve, consisting of a group of Fourier series with different coefficients, is optimally fitted to yearly observed data through least squares estimation (LSE). Recent work [5] presented a procedure for producing temporally smoothed and spatially complete NASA Moderate Resolution Imaging Spectroradiometer (MODIS) data sets. A data series was smoothed, and gaps in the series were filled to generate high-quality data from observations with missing points. From time series observed by coarse-spatial-resolution and hyper-temporal earth satellites, land cover changes were detected automatically using different clustering methods and feature extraction processes [6]. In that paper, short-term Fourier transform coefficients were computed over subsequences of MODIS data within a temporal sliding window, and meaningful sequential time series were extracted for analysis and change detection. A function fitting method was proposed in [7] to discover seasonality in time series. The method was based on nonlinear least squares fits of asymmetric Gaussian model functions directly to the time series. Data fitting methods have gained success in system data analysis and prediction [8,9]. These fitting methods, however, cannot be directly applied to ecosystem data synthesis, because many geosystem variables are physically heterogeneous, and inherent properties of geosystem variables are not directly observed in the time domain.

A similar concept that relates to this paper's work is data assimilation, which incorporates observations into a computing model of a real system. Data assimilation is used to estimate variables that are not directly observed from space but are needed for applications [10]. The data assimilation technique was utilized to estimate model parameters from time-series observations to modify the pathways while preserving model complexity [11]. The work [12] demonstrated that data assimilation combining different observations with a dynamics model improved the understanding of ecosystem carbon exchange; an ensemble Kalman filter was used to associate time series with a box model of carbon transformations. The paper [13] proposes automatic time-series generation using ranked data quality indicators and stepwise temporal interpolation of short data gaps. Pixel-level data are employed to filter time series and interpolate invalid data with statistical or contextual methodologies.

The unique problem to solve in this paper is to synthesize unknown or unobserved yet intensely dependent variables using predefined, observed, or measured data in ecosystem simulation and prediction. This paper utilizes machine learning algorithms to better understand the behavior of the ecosystem and to bridge the gap between geosystem simulation and onsite observation. Instead of direct synthesis of time series, the paper synthesizes data using variables with similar features that are categorized in the same cluster, to improve the fitting accuracy of models with reduced complexity.

The scheme of the proposed method is visualized in Fig. 1, which illustrates the main components of the framework: data sampling, feature extraction, data clustering, and data fitting. Interested time series are first resampled by a sliding window in different sampling regions and sub-sampling frequencies. Features in the time and frequency domains are then extracted from the resampled time series, and configured into a hybrid feature according to geoscientists' interest. A fused clustering algorithm of Affinity Propagation and k-Nearest-Neighbor is utilized to classify the features into clusters. In each cluster, a set of fundamental variables is selected to synthesize other variables. We propose to use a linear regression model with ARIMA errors to describe the relation between the fundamental variables and the others to synthesize.

The main contribution of the paper is a novel framework for data analysis and synthesis, which was implemented as a module in the current CLM-based modular ecosystem simulation system. The paper proposes an algorithm to synthesize time series by clusters, where ecosystem variables with similar attributes are grouped together, instead of direct fitting in the time domain. Specifically,

1. the paper proposes a feature extraction method for time series, which is customizable for different physical properties in the time and frequency domains;

2. the paper proposes a data synthesis method within clusters using Affinity Propagation and linear fitting;

3. the paper recovers the physical meanings of geosystem variables in different feature spaces, and models the underlying relation of the variables.

2. CLM-based modular ecosystem simulation

The Community Land Model (CLM) within the Community Earth System Model, developed by NSF and DOE, simulates surface energy, water, carbon, and nitrogen fluxes and state variables for both vegetated and non-vegetated land surfaces [14]. The CLM-based simulation is designed to understand the way that natural and human changes in ecosystems affect the climate. Within CLM, biogeophysical and biogeochemical processes are represented in the simulation independently on a hierarchical landscape surface data structure: grid cell, land unit, column, and Plant Function Type (PFT). Each sub-grid unit maintains its own prognostic variables for water, energy, and fluxes. The same atmospheric forcing is used to force all sub-grid units within a grid cell. The surface variables and fluxes required by the atmosphere are obtained by averaging the sub-grid quantities weighted by their fractional areas. The dynamics of CLM are difficult to understand because of its large number of sub-models and global variables. The response of the CLM to a simulated environmental stimulus is unclear even though the dynamics of a single module are well studied. The flow of information and the propagation of module-level interactions are intractable, especially in extreme conditions.
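The area-weighted sub-grid averaging described above can be sketched as follows. This is a toy illustration: the class names and the flattened two-level hierarchy are our simplification, not CLM's actual data structures.

```python
# Toy sketch of CLM-style sub-grid aggregation: each grid cell holds
# sub-grid units (e.g., PFTs) with fractional areas; the grid-level flux
# returned to the atmosphere is the area-weighted average.
class SubGridUnit:
    def __init__(self, frac_area, sensible_heat_flux):
        self.frac_area = frac_area                    # fraction of the grid cell
        self.sensible_heat_flux = sensible_heat_flux  # W/m^2, prognostic

class GridCell:
    def __init__(self, units):
        # fractional areas of the sub-grid units must cover the cell
        assert abs(sum(u.frac_area for u in units) - 1.0) < 1e-9
        self.units = units

    def grid_flux(self):
        # average sub-grid quantities weighted by their fractional areas
        return sum(u.frac_area * u.sensible_heat_flux for u in self.units)

cell = GridCell([SubGridUnit(0.6, 120.0), SubGridUnit(0.4, 80.0)])
print(cell.grid_flux())  # 0.6*120 + 0.4*80 = 104.0
```

The same weighting applies to any prognostic quantity a sub-grid unit carries, since each unit is forced by the shared atmospheric state but integrates its own variables.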

The paper focuses on the variables in the CanopyFluxes module of the developed CLM-based simulation platform. The ecosystem variables in the CanopyFluxes module are described in Table 1 with explanations of their physical meanings. According to the


property of physical observability, the ecosystem variables are classified into four categories: direct-measurable, indirect-measurable, assessable, and unobserved. Direct-measurable variables are measured using onsite sensors; indirect-measurable variables are not directly measured by sensors, but are evaluated from strongly correlated variables that are direct-measurable; assessable variables are computed with a theoretical or experimental relation of measurable variables; unobserved variables usually have abstract physical meanings and are not directly observable. The main objective of this paper is to synthesize the unobservable variables using the other types of variables. Nonetheless, the proposed approach is applicable to other types of variables provided that relevant variables are obtained.

Fig. 1. Framework of data synthesis using clustering and fitting algorithms.

Many observations of ecosystem variables are non-stationary time series exhibiting both trend and cycle properties, i.e., the autocorrelation for any particular lag does not hold at different times. Ecosystem variables are paradigmatic seasonal series. Almost by definition, it may be necessary to examine differenced data when we have seasonality. Seasonality usually causes the series to be non-stationary because the average values at some particular time within the seasonal span (months, for example) may be different from the average values at other times. Ecosystem variables can be normalized and decomposed into trend and cycle components. An exemplary decomposition is illustrated in Fig. 2 for a system variable, surface air density. The trend component demonstrates the relatively static change across the year, whereas the cycle component depicts the dynamics of the variable at different sampling times.
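A trend/cycle split like the one shown in Fig. 2 can be approximated, for example, with a centered moving average; the paper does not specify its decomposition method, so the following is a generic sketch:

```python
import numpy as np

def trend_cycle(series, window):
    """Split a time series into a smooth trend (centered moving average)
    and a cycle component (the residual); trend + cycle reconstructs the
    input exactly. Edges are padded by reflection to preserve length."""
    x = np.asarray(series, dtype=float)
    pad = window // 2
    xp = np.pad(x, pad, mode="reflect")
    trend = np.convolve(xp, np.ones(window) / window, mode="valid")[: len(x)]
    cycle = x - trend
    return trend, cycle
```

By construction, trend + cycle recovers the original series; the trend captures the slow seasonal drift while the cycle keeps the fast dynamics.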

3. Data clustering

Though CLM has thousands of variables interacting with each other, many system variables behave in a similar dynamic pattern, such as temperature and moisture during the same season of a year. By defining a group of interested features, system variables can be clustered into different categories. As the variables in a category have similar properties, only part of the variables needs to be measured, and the other variables can be synthesized from the observed ones. This section presents the methods of feature selection and data clustering using Affinity Propagation and k-Nearest-Neighbor.

3.1. Feature selection

The fundamental state vector represents the set of independent variables that describe the state of the ecosystem. A CanopyFluxes state vector would contain, for instance, vegetation roots and leaves, photosynthesis, sunshine, and water. In a simulation model of ecosystem dynamics, the observation data of a variable would include measurements at different spatial positions, such as

Fig. 2. Decomposition of the "surface air density" time series of 2013 into trend and cycle components.

Table 1. Categories of the variables in the CLM model (D: direct-measurable, I: indirect-measurable, A: assessable, U: unobserved).

Variable: Physical description
fwet (D): Fraction of canopy that is wet (0–1)
laisun (DA): Sunlit projected leaf area index
elai (DA): One-sided leaf area index with burying by snow
htop (D): Canopy top (m)
t grnd (D): Ground temperature
fdry (D): Fraction of foliage that is green and dry
frac veg nosno (D): Fraction of vegetation not covered by snow
forc hgt u pft (D): Wind forcing height
forc th (I): Atm potential temperature, downscaled to column
forc u (I): Atm wind speed, east direction (m/s)
forc pco2 (I): CO2 partial pressure (Pa)
forc v (I): Atm wind speed, north direction (m/s)
forc q (I): Atm specific humidity, downscaled to column
forc po2 (I): O2 partial pressure (Pa)
forc pbot (I): Surface atm pressure (Pa)
forc rho (I): Surface air density (kg/m3)
forc lwrad (I): Downward IR longwave radiation (W/m2)
rssha (A): Shaded stomatal resistance (s/m)
rssun (A): Sunlit stomatal resistance (s/m)
esai (A): One-sided stem area index with burying by snow
laisha (A): Shaded projected leaf area index
psnsha wp (U): Product-limited shaded leaf photosynthesis
psnsha wj (U): RuBP-limited shaded leaf photosynthesis
rootfr (U): Fraction of roots in each soil layer
alphapsnsun (U): Sunlit 13c fractionation
emv (U): Vegetation emissivity
psnsha (U): Shaded leaf photosynthesis (μmol/m2 s)
thm (U): Intermediate variable
psnsun wc (U): Rubisco-limited sunlit leaf photosynthesis
sabv (U): Solar radiation absorbed by vegetation (W/m2)
rc14 atm (U): C14O2/C12O2 in atmosphere
psnsun (U): Sunlit leaf photosynthesis (μmol/m2 s)
psnsha wc (U): Rubisco-limited shaded leaf photosynthesis
alphapsnsha (U): Shaded 13c fractionation
psnsun wj (U): RuBP-limited sunlit leaf photosynthesis
psnsun wp (U): Product-limited sunlit leaf photosynthesis
rhaf (D): Fractional humidity of canopy air (dimensionless)
vcmaxcintsun (I): Leaf-canopy scaling (sunlit leaf vcmax)
vcmaxcintsha (I): Leaf-canopy scaling (shaded leaf vcmax)

the sunlight at different heights. In this paper, the sequential observation is stored in a matrix of time series D ∈ R^{l×n}, where l is the length of an interested system variable, and n is the total number of variables after flattening each vector variable into individual singular variables. The scales of the variables, which vary with different physical meanings, are normalized before data analysis.

x_t = (∇d_{t,u} − min(∇d_{t,u})) / (max(∇d_{t,u}) − min(∇d_{t,u}))    (1)
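The rescaled first-order differencing of Eq. (1) might be implemented as follows (a sketch; the paper's exact handling of the window endpoints is not spelled out):

```python
import numpy as np

def time_domain_feature(d):
    """Time-domain feature in the spirit of Eq. (1): first-order
    differencing captures the dynamics, and rescaling maps it to [0, 1]
    so that variables with different physical units become comparable."""
    d = np.asarray(d, dtype=float)
    grad = np.diff(d)                  # first-order differencing
    lo, hi = grad.min(), grad.max()
    if hi == lo:                       # constant dynamics: flat feature
        return np.zeros_like(grad)
    return (grad - lo) / (hi - lo)
```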


Climate scientists are interested in the correlation of variables with similar patterns, the response of ecosystem variables to external stimuli, and the trend of physically correlated variables. Corresponding to different purposes in the simulation, the paper proposes to extract customizable features in the time and frequency domains. The customizable feature selection allows climate scientists to optimize the manner in which system variables are clustered with respect to different objectives and observation status. A composite feature vector can be composed by integrating various types of normalized features with different weights. In addition, further feature selection methods are acceptable to the model to cluster and synthesize variables according to different criteria.

3.1.1. Time-domain feature

Time series are preprocessed before feature extraction to meet the observation interest of a climate scientist. The preprocessing includes resampling and the selection of a region of interest (ROI). Some variable dynamics to observe appear only at certain observation frequencies and specific time spans. Denote a time series of an ecosystem variable as d_{t,u}, starting at sampling time t and ending at sampling time u. The length of a time series u − t and the sampling frequency are determined by climate scientists considering the interested time span to observe.

As compared to absolute variable values, the most interesting property of geosystem variables in the time domain is the dynamics and trends of a time series. Without loss of generality, the dynamics of a time series is measured by the first-order differencing ∇d_{t,u} = d_{t+1,u} − d_{t,u}. The scale of the differencing needs to be normalized before variable clustering. In general, there are three types of methods to normalize a feature vector: rescaling, standardization, and unit normalization. Rescaling maps the range of features to a target range, e.g., [0, 1] or [−1, 1]; standardization scales the components of a vector such that the distribution of the normalized features has zero mean and unit variance; and unit normalization is the simplest technique, which normalizes the feature to unit length. We utilize the rescaling technique to normalize the scales of ecosystem variables, and the time-series features in the time domain are given in Eq. (1).

The temporal features of two time series, “ces-t grnd” and “ces-t soisno”, are compared in Fig. 3. The similar dynamics of the two


Fig. 3. Temporal features of two variables with similar dynamics.

time series are represented by the temporal features, and they are indeed classified into the same cluster in the experiment, as given in Table 2.

Table 2. Clusters of variables with the number of variables in each cluster. The emphasized variables are used in data synthesis and forecasting.

Cluster 1 (13): ces-t grnd, ces-t h2osfc, ces-t soisno, ces-thv, clm a2l-forc th, cws-dqgdT, cws-qg h2osfc, cws-qg, cws-qg snow, cws-qg soil, pes-t ref2m, pes-t veg, pes-thm
Cluster 2 (13): pcf-psnsha, pcf-psnsha wj, pcf-psnsha z, pcf-psnsun, pcf-psnsun z, pef-parsha z, pps-laisun, pps-laisun z, ppsyns-ag, ppsyns-aj, ppsyns-an, ppsyns-gb mol, ppsyns-gs mol
Cluster 3 (10): clm a2l-forc lwrad, clm a2l-forc q, clm a2l-forc u, clm a2l-forc v, pef-parsun z, pepv-downreg, pes-q ref2m, pes-t10, pps-fwet, pws-h2ocan
Cluster 4 (9): cws-h2osoi vol, pcf-lmrsha z, pcf-psnsha wc, pcf-psnsun wc, pef-sabv, pps-dt veg (prefer), pps-fdry, pps-ram1, pps-rssun
Cluster 5 (9): clm a2l-forc rho, pes-rh ref2m, pps-laisha, pps-laisha z, pps-rhaf, pps-rssha, pps-rssha z, pps-rssun z, pps-vcmaxcintsha
Cluster 6 (5): pcf-lmrsun z, pcf-psnsun wj, pps-vcmaxcintsun, ppsyns-ac, ppsyns-ap
Cluster 7 (3): clm a2l-forc pbot, clm a2l-forc pco2, clm a2l-forc po2

3.1.2. Frequency-domain feature

In addition to features in the time domain, we extract frequency features from time series to cluster geosystem variables using the Wavelet Packet Transform (WPT), which is a generalization of the wavelet transform (WT). WT captures both the spatial and frequency information of a time series by decomposing it into a coarse approximation via low-pass filtering and into detailed information via high-pass filtering. The approximation coefficients are split into a vector of approximation coefficients and a vector of detail coefficients. In WT, the decomposition is performed recursively on the low-pass approximation coefficients obtained at each level, while in WPT, each detail coefficient vector is also decomposed into two parts using the same approach as in the splitting of approximation vectors. Therefore, WPT extracts more comprehensive information than WT.

Fig. 4. Wavelet decomposition of the ecosystem variable "pps-ram".

We use an orthogonal wavelet to generate a wavelet package for computational simplicity. In multi-resolution signal analysis, the family functions in the wavelet framework can be represented by [15]

W_{j,n,k}(x) = 2^{−j/2} W_n(2^{−j}x − k)    (2)

where j ∈ Z is a scale parameter, k ∈ Z is a time translation parameter, and n ∈ N is an index of wavelet functions at each resolution level j. The wavelet decomposition parameters with respect to these wavelet functions (2) at a specific level are

x_f = [a_{j,0}, a_{j,1}, . . ., a_{j,2^j−1}]^T    (3)

In this paper, we utilize the Daubechies db4 wavelet and choose j = 5, which results in a feature vector of length 2^5. A sample wavelet analysis of the "pps-ram" time series is illustrated in Fig. 4, showing that the response of the time series becomes minor when the frequency is high. We thus restrict the frequency span in the wavelet decomposition to reduce the dimension of feature vectors.

By combining the extracted features after normalization, we form a time-series feature vector,

x = [x_t, αx_f]    (4)

where α is a weighting parameter to balance the contribution of features in the time and frequency domains.
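As a concrete illustration of Eqs. (2)–(4), the sketch below performs a full wavelet-packet decomposition with Haar filters, chosen for brevity in place of the paper's db4 (any orthogonal wavelet preserves signal energy the same way), and forms the weighted composite feature of Eq. (4):

```python
import numpy as np

def haar_step(x):
    # one orthogonal Haar analysis step: approximation and detail halves
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return a, d

def wavelet_packet_feature(x, level):
    """Full wavelet-packet tree: unlike the plain wavelet transform, both
    the approximation and the detail branches are split at every level,
    yielding 2**level coefficient blocks."""
    nodes = [np.asarray(x, dtype=float)]
    for _ in range(level):
        nodes = [half for node in nodes for half in haar_step(node)]
    return np.concatenate(nodes)

def composite_feature(x_t, x_f, alpha):
    # Eq. (4): weighted concatenation of time- and frequency-domain parts
    return np.concatenate([np.asarray(x_t), alpha * np.asarray(x_f)])
```

Because the Haar step is orthogonal, the total energy of the coefficients equals that of the input at every level, which is what makes the coefficient blocks usable as a distance-preserving feature.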

3.2. Data clustering

The independent system variables that dominate geosystem dynamics are identifiable by their physical meanings. Because of their importance, the independent variables are clustered autonomously, without presetting the number of clusters or the cluster centers. In contrast, the dependent variables are allocated to the clusters of the independent variables. Therefore, geosystem variables are clustered in two manners: interested independent variables are clustered using Affinity Propagation (AP) [16], which automatically determines the number and exemplars of clusters; the other system variables are classified by k-Nearest-Neighbor [17] to the exemplars determined by Affinity Propagation. The clustering scheme is illustrated in Fig. 5. Separate processing of variables enables geoscientists to customize the fundamental variables that construct clusters, and reduces the computational complexity.

AP proposes an equivalent formalization of the K-center problem, defined in terms of energy minimization. The concept of Affinity Propagation is to find an optimal configuration of exemplars by iteratively maximizing an energy function. In this paper, we choose the negative squared Euclidean distance s(i, j) = −||x_i − x_j||^2 to measure the pairwise similarity between data features. A set of exemplars

Fig. 5. Data clustering using Affinity Propagation and k-Nearest-Neighbor.

in the independent variables is determined when AP converges or reaches the maximal number of iterations.

The cluster to which a dependent variable x belongs is determined with respect to the exemplars in a nearest-neighbor manner,

argmin_c ||x − x_c||  ∀x_c ∈ E    (5)

where E = {x_c | exemplar x_c in cluster c} is the set of exemplars obtained by AP clustering. Dependent variables are clustered to the groups of independent variables according to the cluster indices. The first reason for the two-step clustering is that geoscientists are normally interested in a specific group of variables that play dominant roles in system dynamics, and these should be clustered automatically without restrictions. The second reason is that nearest-neighbor classification is much faster than AP clustering, and therefore the two-step procedure is faster than clustering all variables with AP.
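The nearest-exemplar assignment of Eq. (5) reduces to an argmin over pairwise distances; a numpy sketch (the exemplar coordinates below are hard-coded stand-ins for AP output, purely for illustration):

```python
import numpy as np

def assign_to_exemplars(X_dep, exemplars):
    """Eq. (5): assign each dependent-variable feature (row of X_dep)
    to the nearest AP exemplar by Euclidean distance."""
    X_dep = np.asarray(X_dep, dtype=float)
    exemplars = np.asarray(exemplars, dtype=float)
    # pairwise squared distances, shape (n_dep, n_exemplars)
    d2 = ((X_dep[:, None, :] - exemplars[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

exemplars = np.array([[0.0, 0.0], [10.0, 10.0]])   # assumed AP output
deps = np.array([[1.0, -1.0], [9.0, 11.0]])
print(assign_to_exemplars(deps, exemplars))        # [0 1]
```

Since the argmin scans only the exemplars rather than all pairs of variables, this step is linear in the number of exemplars, which is why the two-step scheme is cheaper than running AP on every variable.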

. Data synthesis

System variables with similar properties and dynamics are clus-ered in the same cluster. In this section, dependent variables areynthesized and reconstructed through linear function fitting usingbserved variables in the same cluster. The variables in time domainre recoverable from the trained linear model through data synthe-is or prediction.

.1. Dynamic regression model

The relation between the variables to synthesize and the knownariables is modeled using dynamic regression models. On oneand, the geosystem variables in the same cluster have similartatic and dynamic features, and thus their deterministic relationan be described by a regression model. On the other hand, manyon-stationary geosystem variables may exhibit cycle dynamics,nd therefore the dynamic mechanism is introduced into the modelo link observations in different periods. The introduced dynamicart can also model dynamic noises in time series forecasting andrediction.

The unknown time series y are synthesized by the otherbserved time series D in the same cluster using a linear regression

odel with Autoregressive Integrated Moving Average (ARIMA)

rror [18],

= w0 + wD + u (6)

nal Science 13 (2016) 83–95

where w0 is the bias constant, w are the linear weights of the observed time series, and the fitting error is

u_t ∼ ARIMA(p, d, q) (7)

where (p, d, q) are respectively the non-seasonal autoregressive order, the degree of differencing, and the moving average order. A linear model is deemed competent to represent the relation provided that the time series in the same cluster have similar temporal and frequency properties. The ARIMA model is utilized to represent inherent data properties and to predict future points after model fitting. The fitting residue of the linear regression model is assumed to be an ARIMA(p, d, q) stochastic process given as [18]

Φ(L)∇^d u_t = μ + Θ(L)ε_t ∀t ≥ 0 (8)

where ε_t is white noise with variance σ²_ε, μ is a constant, and the lag polynomials are explicitly

Φ(L) = 1 − φ1 L − · · · − φp L^p
Θ(L) = 1 − θ1 L − · · · − θq L^q (9)

with φp ≠ 0 and θq ≠ 0. The differencing operation ∇^d, which is performable in high orders and invertible, is an effective approach to tackle non-stationary time series. The differenced series is the change between consecutive observations in the original series, and can be written as

(1 − L)^d u_t = ∇^d u_t. (10)

Occasionally the differenced data will not appear stationary, and it may be necessary to difference the data a second time to obtain a stationary series. Seasonal differencing aims to reduce seasonal trends in system variables, such as sunshine in the first quarter of each year. Non-seasonal differencing functions to remove trends, such as the sunshine change from the first quarter to the second quarter of a year. It should be noted that the feature extraction process (1) includes a first-order differencing, so that the order of the difference operation in (7) is one order lower.
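The two kinds of differencing can be made concrete with a small numpy sketch. The signal here is an illustrative toy (linear trend plus a period-12 cycle), not CLM data:

```python
import numpy as np

# Illustrative signal: linear trend + period-12 cycle.
t = np.arange(100, dtype=float)
u = 2.0 * t + np.sin(2 * np.pi * t / 12)

d1 = np.diff(u, n=1)         # non-seasonal: (1 - L) u_t, removes the trend
d2 = np.diff(u, n=2)         # second-order differencing, eq. (10) with d = 2
seasonal = u[12:] - u[:-12]  # seasonal differencing: (1 - L^12) u_t

# After one non-seasonal difference the trend collapses to a near-constant
# slope; the period-12 component cancels exactly under seasonal differencing.
```

Each differencing pass shortens the series by its lag, which is why the integration (inverse) step needs the discarded initial values to reconstruct the original series.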

The structure (p, d, q) of ARIMA errors are selected to representdifferent system dynamics, and the model parameters ( , �, ˚, �)in general can be fitted by least squares regression to minimizefitting errors. It is good practice to find the smallest values of pand q that provide an acceptable fit to the data so as to avoid over-fitting, which renders a ARIMA model not invertible. In experiment,we found that ARIMA(2, 0, 1) has sufficient degrees of freedom torepresent most time series in the ecosystem simulation. We willinvestigate the method to estimate model parameters in the fol-lowing section.

4.2. Model learning

The dynamic regression model consists of a linear regression model and ARIMA errors. The general approach to model selection and parameter estimation is iterative fitting and evaluation of ARIMA models [18], which has achieved acceptable performance in time series forecasting and analysis. The general approach, however, cannot be directly applied to estimate the data synthesis model (6). The main purpose of the dynamic model in this paper is data synthesis instead of forecasting, and we are more interested in the accuracy of the linear regression. In addition, the dynamic model is estimated on a group of time series with similar temporal and frequency features preselected by the clustering process. We can control the ARIMA errors of these time series within a limited range in order to avoid repeated selection of ARIMA error models.

The model parameters (w, Φ, Θ) are estimated in two steps. We first estimate the regression model without considering the autocorrelation in the residues, and choose the ARIMA structure of the residues. With the selected ARIMA structure, we reestimate the parameters


of the entire model of linear regression and ARIMA errors using the maximum likelihood estimation (MLE). In the parameter estimation, we decide not to utilize a proxy model for the ARIMA errors as suggested in [18]. Our parameter estimation aims to determine a data synthesis model that is general for time series synthesis, while the method in [18] aims to find the most suitable model for the representation. Since the regressors are within the same cluster sharing similar dynamics, we depend on the linear regression model to fit the time series and the ARIMA model to fit the residues. The estimation method is well known as "spurious regression" since the estimated coefficients are not the optimal estimates. The method is nevertheless effective in time series to determine the model, as we desire to improve the generality of the model to other time series. Another concern is that the final time series are integrated from the estimated differences, and the integration operation accumulates estimation errors in each step. We therefore directly synthesize the time series without taking many steps of differencing.
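The two-step estimation described above (ordinary least squares first, then fitting the error structure to the residues) can be sketched with synthetic data. The AR(2) disturbance and all coefficients below are illustrative stand-ins, not fitted CLM values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for one cluster: regressors D (n x 2) and a target y
# with an AR(2)-correlated error, mimicking eq. (6): y = w0 + wD + u.
n = 500
D = rng.normal(size=(n, 2))
e = rng.normal(scale=0.1, size=n)
u = np.zeros(n)
for t in range(2, n):                      # AR(2) disturbance
    u[t] = 0.6 * u[t - 1] - 0.2 * u[t - 2] + e[t]
y = 1.0 + D @ np.array([2.0, -1.0]) + u

# Step 1: plain least squares, ignoring autocorrelation in the residues.
X = np.column_stack([np.ones(n), D])
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ w_hat

# Step 2: fit the AR structure of the residues (here AR(2)) by
# regressing resid[t] on its own lags.
lags = np.column_stack([resid[1:-1], resid[:-2]])
phi, *_ = np.linalg.lstsq(lags, resid[2:], rcond=None)
```

Because the regressors and the disturbance are uncorrelated here, step 1 recovers (w0, w) closely even before the error structure is modeled, which is exactly why the paper can afford the "spurious regression" shortcut within a cluster.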

The linear regression model cannot correctly represent the relation between non-stationary time series, and thus the estimated parameters could be incorrect if some time series are non-stationary. Since we have put the non-stationarity in the ARIMA residues, we preprocess the target y and the regressors D to make them stationary by differencing. Through differencing, we may convert a regression model with ARIMA errors into a regression model on differences with ARMA errors. The equivalency is straightforward, as it generally holds that

u_t ∼ ARIMA(p, d, q) ⇒ ∇^d u_t ∼ ARMA(p, q). (11)

After the time series become stationary, we estimate the linear fitting parameters w in the linear regression model and determine the ARIMA structure using the residues. The estimated parameters are also used as the initial setting to iteratively estimate the parameters of the data synthesis model. The coefficients are learned by minimizing the fitting error between the ecosystem output and the response predicted by the linear approximation, given by the ridge regression [19]

min_w ‖y − wD‖²₂ + λ‖w‖²₂ (12)

where λ > 0 controls the model's complexity. By introducing a fitting tolerance, we may substantially decrease the variance of the linear regression model following the bias-variance tradeoff, so as to improve the generality of the model and to reduce the fitting error on new data. The ridge regression problem can be solved by

w* = (DᵀD + λI)⁻¹ Dᵀy (13)

A large Tikhonov regularization term will yield a linear model with constrained parameters that control the sensitivity of the model to data fluctuations. It should be noted that the constant w0 disappears after the differencing operation and is not reflected in (13).
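The closed form (13) is a one-liner in numpy. The data below are illustrative (two nearly collinear regressors, a situation where the Tikhonov term matters most), not CLM output:

```python
import numpy as np

def ridge_fit(D, y, lam):
    """Closed-form ridge solution of eq. (13): w* = (D^T D + lam*I)^-1 D^T y."""
    m = D.shape[1]
    return np.linalg.solve(D.T @ D + lam * np.eye(m), D.T @ y)

# Illustrative data: two almost-identical columns make plain least
# squares ill-conditioned; the regularizer stabilizes the weights.
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
D = np.column_stack([base, base + 1e-3 * rng.normal(size=n)])
y = D @ np.array([1.0, 1.0]) + 0.01 * rng.normal(size=n)

w_small = ridge_fit(D, y, lam=1e-8)   # nearly OLS: large, unstable weights
w_reg = ridge_fit(D, y, lam=1.0)      # constrained, stable weights
```

With lam=1.0 the two weights are shrunk toward an equal split whose sum still matches the generating coefficients, so predictions stay accurate while the individual weights remain bounded.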

The estimation process infers the disturbances of the underlying response series and then fits the model to the response data via maximum likelihood. The residue of the linear regression model is given as

u = y − w*D. (14)

Based on the time-series residues, we estimate the structure of the ARIMA models by testing models up to the highest order ARIMA(2, 0, 2) and selecting the model with the smallest Akaike's Information Criterion (AIC). After we have determined the structure (Φ, Θ) of the ARIMA model, the parameters of the entire model are estimated by minimizing the error term

ε = Φ(L)Θ⁻¹(L)y − w{Φ(L)Θ⁻¹(L)}D (15)

which is white noise with zero mean and variance σ²_ε. Thus the parameters can be estimated using Least Square Estimation and Maximum Likelihood Estimation, which finds the values of the parameters that maximize the probability of obtaining the data that we have observed [20–22].
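The AIC-based structure search can be illustrated for the AR part alone; the MA terms are omitted here for brevity, whereas the paper's search runs over full ARIMA structures up to (2, 0, 2). The series below is a synthetic AR(1) process, not residues from the CLM data:

```python
import numpy as np

def fit_ar(x, p):
    """Least-squares AR(p) fit; returns coefficients and residual variance."""
    if p == 0:
        r = x - x.mean()
        return np.array([]), float(r @ r / len(r))
    X = np.column_stack([x[p - k - 1: len(x) - k - 1] for k in range(p)])
    yv = x[p:]
    phi, *_ = np.linalg.lstsq(X, yv, rcond=None)
    r = yv - X @ phi
    return phi, float(r @ r / len(r))

def select_order_aic(x, max_p=2):
    """Pick the AR order by Akaike's Information Criterion,
    AIC = n*log(sigma^2) + 2*(p + 1)."""
    n = len(x)
    best = None
    for p in range(max_p + 1):
        _, s2 = fit_ar(x, p)
        aic = n * np.log(s2) + 2 * (p + 1)
        if best is None or aic < best[1]:
            best = (p, aic)
    return best[0]

rng = np.random.default_rng(2)
e = rng.normal(size=2000)
x = np.zeros(2000)
for t in range(1, 2000):
    x[t] = 0.7 * x[t - 1] + e[t]      # true AR(1)
p_hat = select_order_aic(x, max_p=2)
```

AIC is known to occasionally prefer a slightly over-parameterized model, which is why the paper caps the search at small orders and favors the smallest adequate (p, q).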

4.3. Data synthesis and prediction

Once we have selected the structure and parameters of the model, we use the observed time series to synthesize unobserved data in the same cluster. Two parts of (6) need to be computed to synthesize time series using the dynamic regression models. For the regression part, the regressors are either observed or computed variables in the simulated geosystem. Part of the input variables of the CLM-based simulation platform are practical measurements from onsite sensors, and some explanatory variables are calculated from the observations following the physical relations between them. For the error part, the ARIMA error model generates time series based on the probability distribution of white noise. An unobserved time series is synthesized from the fitted ARIMA models by computing the two parts.

We can use the fitted model to predict time series dynamics if we have access to future observations of the regressors in the model. We can also use the model to forecast time series with optimal predictors that minimize the mean square prediction errors. The non-stationary part, including the differencing and the linear regression, determines long-term predictions, while the stationary part, including the AR and MA terms, generates short-term predictions.
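The two generation modes can be sketched for a fitted model y_t = w0 + w·D_t + u_t with AR(1) errors. All numbers below are illustrative, not fitted CLM values:

```python
import numpy as np

# Illustrative fitted parameters (not CLM values).
w0, w, phi = 0.5, np.array([2.0, -1.0]), 0.6

def predict(D_new):
    """Predicting mode: regress the available variables at any sampling
    region; the ARIMA error enters only through its mean (zero)."""
    return w0 + D_new @ w

def forecast(D_new, u_last):
    """Forecasting mode: extend the error dynamics from the last observed
    residual of the adjacent (consecutive) sampling region."""
    u = u_last
    out = np.empty(len(D_new))
    for t in range(len(D_new)):
        u = phi * u                    # E[u_t | u_{t-1}] under AR(1)
        out[t] = w0 + D_new[t] @ w + u
    return out

D_new = np.array([[1.0, 0.5], [0.8, 0.2]])
```

The forecast converges to the prediction as the AR memory decays, which mirrors the text: the stationary part shapes short-term behavior, the regression part dominates in the long run.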

5. Experiment

Two types of experiment were performed to examine the performance of the proposed data synthesis approach. Firstly, the accuracy of data synthesis is evaluated by the difference between the synthesized data and the ground-truth data in the CLM-based simulation system across different sampling regions. Secondly, the influence of data synthesis on the whole simulation system is studied by comparing the output difference when part of the system input variables are synthesized, since the fitting error in the synthesized data is inevitable.

5.1. Experiment setup

The data synthesis was implemented as a pluggable module in the current simulation platform. We have developed a function test platform to create direct linkages between site measurements and process-based CLM functions within the Community Earth System Model (CESM) [23]. That platform provides the needed integration interfaces for both field experimentalists and ecosystem modelers to improve the model's representation of ecosystem processes. This function test platform is designed to eliminate the majority of software complexity, allowing scientists to interactively select external forcing, manipulate ecophysiological parameters, and compare the mathematical descriptions of ecosystem functions with measurements and observations. More recently, we have further improved the automation of ecosystem function test system generation using compiler-based software analysis, such that we are able to extract a specific scientific function (a single subroutine or a group of subroutines) from CLM and to automatically generate a corresponding function-test module. Using this testing system, we have successfully tested most ecosystem functions, and it can be extended to all other scientific functions in CLM or even other components within CESM. Moreover, the function test platform also supports new CLM-based module design and other customized ecosystem model developments.
Fig. 6. Distribution of variables in each cluster with at least two variables. The top figure is the comparison of the number distribution in each cluster using different feature selection methods, and the bottom figure is the comparison of the number distribution when the data were in different sampling time (temporal feature, annual). (Axes: clusters vs. number of variables in each cluster; top legend: temporal, frequency, and hybrid features; bottom legend: daily, weekly, monthly, semi-annually.)

The experiment data are time series of all geosystem variables that were dumped from the CLM-based simulation system and observed variables from onsite sensors. We used the historical time series in 2008 for evaluation and comparison. There are 123 ecosystem variables that were sampled every half an hour, and hence the total time series are stored in D ∈ ℝ^(123×17520). The length of the sampling windows is one month.

5.2. Data clustering

We performed clustering on all the geosystem variables using the two-step clustering method on the July data of 2008. The damping coefficient of Affinity Propagation was set as 0.5. To investigate the distribution of variables in different clusters, we computed the statistics of the numbers of variables in clusters with at least two variables. Fig. 6 shows the clustering results using three kinds of features: temporal features, frequency features, and hybrid (temporal + frequency) features. Around half of the variables, 68 out of 123, were grouped with other variables, and the other half of the variables were clustered into single-element groups. Therefore, we may reduce the number of variables to observe by about one-fourth in an optimistic case. The distributions of variables in each cluster were slightly different when using different features. In general, clustering using temporal features attempts to balance the numbers in each cluster, clustering using frequency features is prone to group more variables in one cluster, and clustering using hybrid features compromises in between. It should be noted that the distribution of variables in each cluster depends on the data and the weights of the two features in clustering. This is merely a general rule to select and configure features in grouping ecosystem variables.

We also investigated the influence of sampling frequency and range on the clustering performance. The distribution of the variables in each cluster is compared in Fig. 6 with different sampling frequencies from daily, weekly, monthly, to semi-annually. The sampled time series have the same starting time yet different lengths of sampling ranges. The figure reveals that the number of single-element clusters dropped and the number of variables in each cluster increased as the sampling frequency became high. When the sampling frequency was high, the length of the time series was short, such that unique features to identify were lost. It is favorable to have more variables in each cluster, as more potential variables are available for regression during data synthesis; however, the similarity between time series degrades if we increase the sampling frequencies. We need to compromise between the similarity and the number of time series in each cluster.

Table 2 shows the clustering results in the experiment, which indeed reflect the physical meanings of the variables. It is interesting to notice that all the column-level variables and vegetation energy states are grouped together. They are directly responsive to each other in the fine modeling time step (30 minutes), such as the column-level ground temperature, the vegetation temperature, and the 2-m reference temperature. Variables in Cluster 2 present the strong relationship between photosynthesis and LAI estimation. Variables in Cluster 3 are mostly vegetation energy states, which are closely related to atmospheric forcing, and variables in Cluster 7 are atmospheric physical features. Cluster 5 shows the similarity patterns between key variables related to the shaded canopy area, and Cluster 7 shows the direct (linear) relationship among air properties.

5.3. Data synthesis

We evaluated the performance of time-series synthesis by synthesizing the variables of interest using the other variables. We conducted two types of experiment: one experiment was to evaluate the fitting accuracy of the dynamic regression models; the other experiment was to predict and forecast time series using the trained model. In the experiment, we fit the variables of interest with all the other variables in the same cluster, without loss of generality. In practical applications, the observed variables are determined by the onsite sensors and hardware configuration, and the dynamic regression model is conveniently adjustable to be trained using other regressor variables. Alternatively, we can select a group of fundamental variables that serve as the regressors for all the other variables.

In this experiment, the synthesis precision was measured by the relative mean squared error (MSE) and the correlation (COR) between time series. The relative MSE between time series x and y is computed by

mse = (1/n) Σ_{i=1}^{n} (x_i − y_i)² / ȳ² (16)

where n is the length of the time series and ȳ = (1/n) Σ_{i=1}^{n} y_i is the mean, and the COR between time series x and y is calculated by

cor = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / (√(Σ_{i=1}^{n} (x_i − x̄)²) √(Σ_{i=1}^{n} (y_i − ȳ)²)) (17)

where x̄ = (1/n) Σ_{i=1}^{n} x_i is the mean of x. We utilized the relative MSE (16) instead of the standard MSE, as the scales of the time series vary significantly and we wish to evaluate the relative synthesis precision.
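Both metrics are straightforward in numpy; the two short series below are illustrative, not values from the experiment:

```python
import numpy as np

def relative_mse(x, y):
    """Eq. (16): MSE normalized by the squared mean of y, so series on
    very different scales are comparable."""
    return ((x - y) ** 2).mean() / y.mean() ** 2

def cor(x, y):
    """Eq. (17): Pearson correlation between two series."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

truth = np.array([300.0, 301.0, 302.0, 301.0])  # illustrative ground truth
synth = np.array([300.2, 300.9, 302.1, 300.8])  # illustrative synthesis
```

Note the two metrics capture different failure modes: relative MSE penalizes offsets in level, while COR penalizes mismatched dynamics, which is why a series such as "cws-h2osoi_vol" can score a tiny MSE yet a low correlation.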

Fig. 7. Time series fitting. (Per-panel relative MSE and COR of the synthesized versus ground-truth series: ces-t_grnd 5.606e-05 / 0.910; cws-qg 1.528e-29 / 1.000; pcf-psnsha 1.267e-30 / 1.000; pcf-psnsun 1.671e-30 / 1.000; pps-laisun 1.983e-30 / 1.000; clm_a2l-forc_lwrad 1.140e-04 / 0.949; cws-h2osoi_vol 3.643e-04 / 0.332; pps-laisha 5.245e-31 / 1.000; pps-rssun_z 1.189e-02 / 0.972; pps-vcmaxcintsha 6.090e-05 / 0.997; pps-vcmaxcintsun 2.288e-02 / 0.925; clm_a2l-forc_pco2 2.151e-20 / 1.000. Each panel plots the time series against sampling cycles.)

We investigated the performance of the proposed data synthesis method on the geosystem variables in the CanopyFluxes module. The ARIMA error in the dynamic regression model is set as high as ARIMA(2, 0, 1). We utilized the first 20 variables in a cluster to train the regression model when the number of variables in the cluster is more than 20. The variables were selected by climate scientists to the best of their interest, and the first variable in a cluster was selected if the scientists selected none in that cluster. The selected variables are highlighted in Table 2: the direct-measurable "ces-t_grnd", accessible "cws-qg", unobservable "pcf-psnsha", unobservable "pcf-psnsun", indirect-measurable "clm_a2l-forc_lwrad", accessible "cws-h2osoi_vol", assessable "pps-laisha", assessable "pps-rssun_z", indirect-measurable "pps-vcmaxcintsha", indirect-measurable "pps-vcmaxcintsun", and indirect-measurable "clm_a2l-forc_pco2". We synthesized the variables with the other variables
Fig. 8. Generating time series using a trained dynamic regression model for data in the following month of the same year (August 2008). (Per-panel prediction and forecast MSE / COR: ces-t_grnd 5.519e-05 / 0.929 and 3.199e-05 / 0.960; cws-qg 6.039e-29 / 1.000 and 1.457e-06 / 1.000; pcf-psnsha 1.694e-29 / 1.000 and 1.265e-29 / 1.000; pcf-psnsun 2.177e-29 / 1.000 and 8.076e-05 / 1.000; pps-laisun 4.363e-30 / 1.000 and 1.127e-03 / 0.999; cws-h2osoi_vol 9.547e-04 / −0.009 and 5.954e-04 / −0.095; pps-laisha 8.011e-31 / 1.000 and 3.433e-06 / 1.000; pps-vcmaxcintsha 9.449e-05 / 0.996 and 4.921e-05 / 0.998; pps-vcmaxcintsun 3.999e-01 / 0.759 and 3.629e-01 / 0.783; clm_a2l-forc_pco2 3.314e-20 / 1.000 and 2.539e-16 / 1.000; clm_a2l-forc_lwrad 1.958e-04 / 0.928 and 1.130e-04 / 0.960; pps-rssun_z 3.531e-02 / 0.979 and 1.470e-02 / 0.991.)

in the same cluster using the proposed method. We trained the dynamic regression model and synthesized the selected variables. The synthesized time series and the ground truth are compared in Fig. 7. As we can see from the figures, the variables "cws-qg", "pcf-psnsha", "pcf-psnsun", "pps-laisun", "pps-laisha", and "clm_a2l-forc_pco2" were completely reconstructed, with MSE less than 1 × 10⁻²⁰ and a correlation of 1. The high accuracy was due to the clustering process that discovered ecosystem variables with similar dynamics. The variables "ces-t_grnd", "clm_a2l-forc_lwrad", "pps-rssun_z", "pps-vcmaxcintsha", and "pps-vcmaxcintsun" were precisely reconstructed, with MSE < 0.02 and COR > 0.7. The synthesis accuracy of the variable "cws-h2osoi_vol" was the worst of all the variables. Though the correlation is only 0.332, the fitting error is very small, MSE = 3.643 × 10⁻⁴. The reasons for the low accuracy are that the scale of this variable is relatively small and that the variables in its cluster have various dynamics. In general, the proposed data synthesis method was able to recover the original data in both trend and seasonality, as shown in the figures. The trends of the original data were precisely fitted throughout the whole sampling periods. The seasonality of the original data, however, was not precisely reconstructed, with errors in frequencies and local dynamics. The recovery accuracy of the seasonality depends on the dynamics of the other variables in the same cluster, which were determined by feature selection. Thus, to obtain precise
recovery of the seasonality, we have to extract frequency features in data clustering.

Fig. 9. Generating time series using a trained dynamic regression model for a different month in the following year (August 2009). (Per-panel MSE / COR: ces-t_grnd 4.283e-06 / 0.993; cws-qg 2.513e-06 / 1.000; pcf-psnsha 1.190e-29 / 1.000; pcf-psnsun 5.649e-31 / 1.000; pps-laisun 5.637e-30 / 1.000; cws-h2osoi_vol 1.715e-03 / 0.377; pps-laisha 1.442e-30 / 1.000; pps-vcmaxcintsha 9.889e-04 / 0.997; pps-vcmaxcintsun 3.869e-01 / 0.777; clm_a2l-forc_pco2 1.954e-20 / 1.000; clm_a2l-forc_lwrad 2.671e-04 / 0.937; pps-rssun_z 1.574e-01 / 0.980.)

5.4. Data predicting and forecasting

To further investigate the generality of the models in data synthesis and generation, we used the trained dynamic regression models to generate time series in time spans other than the training data. We generated time series in two manners: predicting and forecasting. In the predicting manner, we simply applied the trained dynamic regression model to synthesize a variable by regressing the available variables. The generated time series were essentially the addition of the linear regression and the ARIMA errors. In the forecasting manner, time series are generated by extending the data in the previous sampling cycle following the dynamics defined in the dynamic regression model, instead of regenerating the whole time series as in the predicting manner. We can only forecast time series in the sampling regions consecutive to the training data, whereas we can predict time series in any sampling region where the relation between variables still holds. For instance, we have to use a dynamic regression model in the predicting manner if we need to generate time series for July of the next year with a model trained on the July data of this year.
Page 12: Journal of Computational Science - Yang Xu · H. He et al. / Journal of Computational Science 13 (2016) 83–95 Models (CLM) at Oak Ridge National Laboratory, to simulate sur-face

9 utational Science 13 (2016) 83–95

bmcmttttfatM

yassmtrtfto

ptaotnt

5

msdtstsfd

issapawpTodwa

aosmc

Fig. 10. The differences between output variables with and without data synthesisof input variables for a month.

0 10 20 30 40 50 60 70 80-0.2

0

0.2

0.4

0.6

0.8

1

1.2

CORMSE

4 H. He et al. / Journal of Comp

We compared the performance of data synthesis in two mannersy generating time series in the following sampling span, i.e., oneonth later. The synthesized time series and the ground truth are

ompared in Fig. 8. The generality of the trained dynamic regressionodels was proven by the fact that the trained models were able

o precisely reconstruct time series at different sampling time. Theime series in different sampling time were precisely reconstructedhough the overall performance downgrades a bit as compared tohe fitting results in Fig. 7. The variables, “pcf-psnsha” and “clm a2l-orc pbot”, which were completely reconstructed could also beccurately synthesized by the same model at different samplingime. The other variables were also synthesized with acceptable

SE MSE < 0.2, and correlation CORR > 0.58.Fig. 9 presents the results of data generation in the following

ear using the trained dynamic regression model. The synthesisccuracy of the dynamic regression models trained using timeeries at different sampling time is generally lower than the synthe-is accuracy on the training time series. The dynamics of time seriesay change slightly at different sampling time, and hence the rela-

ion between each other might be time invariant. It is assumed theelation of variables hold at different sampling time as modeled inhe dynamic regression model. This assumption is commonly validor a short time difference, yet not for all cases. When the assump-ion fails, the error terms in the models could represent the changef the relation between variables.

The dynamic regression model in the two modes had com-arable synthesis accuracy for time series at different samplingime. For most variables, the synthesis model in forecasting modechieved high precision than in prediction mode. The advantagef the forecasting mode is to synthesize time series based on therend of adjacent observed data; however, forecasting mode can-ot be applied to time series in arbitrary sampling time, whereashe predicting mode is still applicable.

.5. Hybrid simulation

The ecosystem simulation system becomes hybrid when it keeps measured and synthesized variables in the loop. Since the synthesis error is inevitable, especially when no variables with the same dynamics are available for regression, we need to examine whether the synthesis errors propagate within the system, and whether the simulation system still reflects the correct trends of the most important ecosystem variables. We therefore quantify the influence of the synthesis error on the ecosystem dynamics in the simulation platform by comparing system dynamics with and without synthesized data.

We developed a debugging tool to modify and export variables in the CLM-based simulation system, so that we can input synthesized data into the simulation loop and monitor the output of the system. The tool deconstructs the CLM source code into identifiable tokens (i.e., function calls and variables). During the scanning process, the tool records the name and category of the variables and functions that have been used by a subroutine. By doing so, we are able to create a utility to automatically insert code blocks before and after a particular subroutine is executed in CLM. The purpose of inserting the code blocks is to retrieve the values of the input and output variables of this subroutine at each time step during the CLM simulation. A compiler-assisted workflow analysis was also performed to better understand the internal data structure and scientific workflow of CLM subroutines.

In the experiment, all the input variables highlighted in Table 2 are replaced with synthesized values in the CLM system, while the other input variables remain as computed or measured values. The synthesized data are formatted and aligned with calculations and measurements in sampling time. The CLM system simulation was conducted and the output of the system was recorded. The total number of recorded variables is 82 for a month. The MSE between the output variables with and without data synthesis is given in Fig. 10. The figure reveals that the relative differences are within a small range. We further calculated the averaged MSE and COR for the output variables, as shown in Fig. 11. The COR and MSE values are 1 and 0 for two identical time series; thus the area between the two boundary lines reflects the similarity of the output with and without synthesized input variables. The average correlations of all the output variables are above 0.7, which means that the trends and dynamics of the output variables are similar with and without synthesized input. Despite the small local differences, the dynamics of the output variables of the simulation system are essentially reflected when partial input was synthesized. The synthesis errors were within the disturbance tolerance of the simulation system. The hybrid simulation proves the feasibility of synthesizing unobserved system variables in CLM while guaranteeing the overall simulation accuracy.

Fig. 11. Average COR and MSE between output variables with and without input synthesis.
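The COR/MSE similarity check used above is straightforward to reproduce; a minimal sketch (the function name is ours, not part of the paper's platform):

```python
import numpy as np

def cor_mse(a, b):
    """Similarity of two output time series: Pearson correlation (COR)
    is 1 and mean squared error (MSE) is 0 for identical series."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    cor = np.corrcoef(a, b)[0, 1]
    mse = np.mean((a - b) ** 2)
    return cor, mse
```

Applied per output variable and then averaged, these two statistics produce the boundary lines compared in Fig. 11.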

5.6. Discussion

The sampling configuration may slightly influence the accuracy of data synthesis, including the starting point, the length of the sampling windows, and the sampling frequency. A carefully selected starting point helps capture the phenomena of interest to the observer.


In practice, we usually coincide the starting point with the start of months, quarters, and seasons to capture the climate features for a certain range of interest. Likewise, the length of the sampling windows is chosen as daily, biweekly, monthly, quarterly, or annual. The most influential factor is the sampling frequency, which depends on the hardware configuration and the observation interest. A high sampling frequency is suitable for instantaneous and transient dynamics, such as the change of moisture, while a low sampling frequency is proper to capture a slow process, such as the growth of vegetation roots.

Though many variables can be synthesized by the other variables in the same cluster, some variables are unique in dynamics and no variables can be found to synthesize them. Such variables are required to be observed or computed in the simulation. They are identifiable using the clustering methods, which will classify unique variables into single-element clusters when the parameters are properly tuned.
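Detecting such single-element clusters is simple once cluster labels are available (e.g., from the Affinity Propagation step); a sketch with an illustrative helper name:

```python
import numpy as np

def singleton_variables(labels):
    """Return indices of variables that form single-element clusters;
    these have unique dynamics and must be observed or computed
    rather than synthesized from other variables."""
    labels = np.asarray(labels)
    counts = np.bincount(labels)               # cluster sizes
    return [i for i, lab in enumerate(labels) if counts[lab] == 1]
```

The indices returned mark the variables that must stay on the measured (or computed) side of the hybrid simulation.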

6. Conclusion

This paper has proposed a data synthesis approach using clustering and dynamic regression methods for CLM-based climatic simulation. The number of variables to measure could be reduced by one-fourth by synthesizing those variables using the other variables. The proposed method was evaluated in data synthesis, data prediction and forecasting, and hybrid simulation. The experiments proved the effectiveness and efficiency of the proposed method.

Acknowledgements

The work was supported in part by NSFC Grant 61305114. The authors also would like to acknowledge the support from the Terrestrial Ecosystem Science (TES) project and the Accelerated Climate Modeling for Energy (ACME) project funded by Biological and Environmental Research (BER), Office of Science, Department of Energy (DOE). This research also used computing resources at Oak Ridge National Laboratory (ORNL), which is managed by UT-Battelle LLC for the Department of Energy under contract DE-AC05-00OR22725.

References

[1] R.E. Dickinson, K.W. Oleson, G. Bonan, F. Hoffman, P. Thornton, M. Vertenstein, Z.-L. Yang, X. Zeng, The community land model and its climate statistics as a component of the community climate system model, J. Climate 19 (11) (2006) 2302–2324.

[2] G.D. Reeves, Data assimilation and data synthesis in radiation belt modeling, in: National Space Weather Workshop: Research to Applications, 1998.

[3] H. Carrão, P. Gonçalves, M. Caetano, A nonlinear harmonic model for fitting satellite image time series: analysis and prediction of land cover dynamics, IEEE Trans. Geosci. Remote Sens. 48 (4) (2010) 1919–1930.

[4] E.B. Brooks, V.A. Thomas, R.H. Wynne, J.W. Coulston, Fitting the multitemporal curve: a Fourier series approach to the missing data problem in remote sensing analysis, IEEE Trans. Geosci. Remote Sens. 50 (9) (2012) 3340–3353.

[5] F. Gao, J.T. Morisette, R.E. Wolfe, G. Ederer, J. Pedelty, E. Masuoka, R. Myneni, B. Tan, J. Nightingale, An algorithm to produce temporally and spatially continuous MODIS-LAI time series, Geosci. Remote Sens. Lett. 5 (1) (2008) 60–64.

[6] B.P. Salmon, J.C. Olivier, K.J. Wessels, W. Kleynhans, F. Van den Bergh, K.C. Steenkamp, Unsupervised land cover change detection: meaningful sequential time series analysis, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 4 (2) (2011) 327–335.

[7] P. Jonsson, L. Eklundh, Seasonality extraction by function fitting to time-series of satellite sensor data, IEEE Trans. Geosci. Remote Sens. 40 (8) (2002) 1824–1832.

[8] S.J. Wenger, N.A. Som, D.C. Dauwalter, D.J. Isaak, H.M. Neville, C.H. Luce, J.B. Dunham, M.K. Young, K.D. Fausch, B.E. Rieman, Probabilistic accounting of uncertainty in forecasts of species distributions under climate change, Global Change Biol. 19 (11) (2013) 3343–3354.

[9] M.P. Friedlander, M. Schmidt, Hybrid deterministic-stochastic methods for data fitting, SIAM J. Sci. Comput. 34 (3) (2012) A1380–A1405.

[10] R.H. Reichle, Data assimilation methods in the earth sciences, Adv. Water Resour. 31 (11) (2008) 1411–1418.


[11] Y. Spitz, J. Moisan, M. Abbott, Configuring an ecosystem model using data from the Bermuda Atlantic time series (BATS), Deep Sea Res. II: Top. Stud. Oceanogr. 48 (8) (2001) 1733–1768.

[12] M. Williams, P.A. Schwarz, B.E. Law, J. Irvine, M.R. Kurpius, An improved analysis of forest carbon dynamics using data assimilation, Global Change Biol. 11 (1) (2005) 89–105.

[13] R.R. Colditz, C. Conrad, S.W. Dech, Stepwise automated pixel-based generation of time series using ranked data quality indicators, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 4 (2) (2011) 272–280.

[14] K.W. Oleson, D.M. Lawrence, B. Gordon, M.G. Flanner, E. Kluzek, J. Peter, S. Levis, S.C. Swenson, E. Thornton, J. Feddema, et al., Technical description of version 4.0 of the Community Land Model (CLM), Tech. rep., NCAR Technical Note, 2013.

[15] A.N. Akansu, R.A. Haddad, Multiresolution Signal Decomposition: Transforms, Subbands, and Wavelets, Academic Press, 2001.

[16] B.J. Frey, D. Dueck, Clustering by passing messages between data points, Science 315 (5814) (2007) 972–976.

[17] T.M. Cover, P.E. Hart, Nearest neighbor pattern classification, IEEE Trans. Inf. Theory 13 (1) (1967) 21–27.

[18] R.J. Hyndman, G. Athanasopoulos, Forecasting: Principles and Practice, OTexts, 2014.

[19] W.H. Press, Numerical Recipes 3rd Edition: The Art of Scientific Computing, Cambridge University Press, 2007.

[20] D.A. Pierce, Least squares estimation in the regression model with autoregressive-moving average errors, Biometrika 58 (2) (1971) 299–312.

[21] R. Harris, R. Sollis, Applied Time Series Modelling and Forecasting, Wiley, 2003.

[22] A. Pankratz, Forecasting with Dynamic Regression Models, vol. 935, John Wiley & Sons, 2012.

[23] D. Wang, W. Wu, T. Janjusic, Y. Xu, C. Iversen, P. Thornton, M. Krassovisk, Scientific functional testing platform for environmental models: an application to community land model, in: International Workshop on Software Engineering for High Performance Computing in Science, 37th International Conference on Software Engineering, 2015.

Hongsheng He received the PhD degree in electrical and computer engineering from the National University of Singapore in 2012. He is currently a Lecturer at the Department of Mechanical, Aerospace and Biomedical Engineering, The University of Tennessee, USA. His research interests include machine learning, intelligent robotics and computer vision.

Dali Wang is a computational environmental scientist. His primary research interests include climate and environmental modeling, environmental data sciences and systems, high performance computing and geographic information systems, and large-scale environmental system simulation and integration. He is a staff member of the Environmental Sciences Division (ESD) and Climate Change Sciences Institute (CCSI) at Oak Ridge National Laboratory.

Yang Xu received the Ph.D. degree in the Department of Geography from the University of Tennessee, Knoxville, in 2015. His research interests include space-time GIS, social media, and GIS for transportation.

Jindong Tan received the Ph.D. degree in electrical and computer engineering from Michigan State University, East Lansing, in 2002. He is currently an Associate Professor in the Department of Mechanical, Aerospace and Biomedical Engineering, The University of Tennessee, Knoxville. He has been an Assistant/Associate Professor in the Department of Electrical and Computer Engineering at Michigan Technological University, Houghton. His research interests include mobile sensor networks, augmented reality and biomedical imaging, dietary assessment, and mobile manipulation. Dr. Tan is a member of the ACM and Sigma Xi.