
Granger Causality Analysis in Irregular Time Series

Mohammad Taha Bahadori∗ Yan Liu∗†

Abstract

Learning temporal causal structures between time series is one of the key tools for analyzing time series data. In many real-world applications, we are confronted with Irregular Time Series, whose observations are not sampled at equally-spaced time stamps. The irregularity in sampling intervals violates the basic assumptions behind many models for structure learning. In this paper, we propose a nonparametric generalization of the Granger graphical models called Generalized Lasso Granger (GLG) to uncover the temporal dependencies from irregular time series. Via theoretical analysis and extensive experiments, we verify the effectiveness of our model. Furthermore, we apply GLG to a dataset of δ18O (oxygen isotope) records in Asia and achieve promising results in discovering the moisture transportation patterns over an 800-year period.

1 Introduction

In the era of the data tsunami, we are confronted with large-scale time series data, i.e., sequences of observations of the variables of interest over a period of time. For example, terabytes of time series microarray data are produced to record gene expression levels under different treatments over time; petabytes of climate and meteorological data, such as temperature, solar radiation, and carbon-dioxide concentration, are collected over the years; and exabytes of social media content are generated over time on the Internet. Consequently, developing effective and scalable machine learning algorithms that uncover temporal dependency structures between time series and reveal insights from data has become a key problem in machine learning and data mining.

Up to now, many different approaches have been developed to solve this problem, such as autocorrelation, cross-correlation [8], transfer entropy [4], randomization tests [17], phase slope index [20], and Granger causality [13], and they have achieved success in many applications. However, all the existing approaches assume that the time series observations are obtained at equally spaced time stamps and fail to analyze irregular time series. Irregularly sampled time series are those with samples missing at blocks of sampling points or collected at non-uniformly spaced time points. They are a common challenge in practice due to natural constraints or human factors. For example, in biology, it is very difficult to obtain blood samples from human subjects at regular time intervals over a long period of time; in climate datasets, many climate parameters (e.g., temperature and CO2 concentration) are measured by different equipment with varying frequencies.

∗Computer Science Department, Viterbi School of Engineering, University of Southern California, USA. †Corresponding author. Email: [email protected]

Existing methods for analyzing irregular time series fall into three directions: (i) the repair approach, which recovers the missing observations via smoothing or interpolation [25, 14, 16, 10, 22]; (ii) generalizations of spectral analysis tools, such as the Lomb-Scargle Periodogram (LSP) [24] or wavelets [12, 28, 19]; and (iii) kernel methods [22]. While the first two approaches provide sufficient flexibility in analyzing irregular time series, they have been shown to magnify smoothing effects and to incur huge errors for time series with large gaps [25].

In this paper, we propose the Generalized Lasso-Granger (GLG) framework for causality analysis of irregular time series. It defines a generalization of the inner product for irregular time series based on non-parametric kernel functions. As a result, the GLG optimization problem takes the form of a Lasso problem and enjoys computational scalability for large-scale problems. We also investigate a weighted version of GLG (W-GLG), which aims to improve the performance of GLG by giving higher weights to important observations. As a theoretical contribution, we establish a sufficient condition for the asymptotic consistency of our GLG method. In particular, we show that, compared with the popular locally weighted regression (LWR), GLG has the same asymptotic consistency behavior but achieves lower absolute error.

In order to demonstrate the effectiveness of GLG, we conduct experiments on four synthetic datasets with different sampling patterns and objective functions, so that we can test the asymptotic convergence behavior, the effect of different kernels, and the impact of different irregularity patterns, respectively. GLG outperforms all state-of-the-art algorithms and improves the accuracy of the inferred temporal causal graph by up to 15%. In addition, we apply GLG to uncover the patterns of moisture transfer across Asia during 800 years by analyzing the temporal causal relationships between densities of δ18O (a stable isotope of oxygen) recorded in four caves in India and China. Besides identifying the moisture transfer patterns, the results confirm that during the warm periods the air masses had more energy to travel across the continent.

The rest of the paper is organized as follows: we first formally define the problem of learning temporal structures in irregular time series in Section 2 and review related work in Section 3. Then we describe the proposed GLG framework and provide theoretical insights in Section 4. In Section 5, we demonstrate the superior performance of GLG through extensive experiments. After a summary of the paper and hints on future work, we provide the proof of the theorem in the Appendix.

2 Problem Definitions and Notation

Irregular time series are time series whose observations are not sampled at equally-spaced time stamps. They appear in many applications due to various factors. In summary, there are three types of irregular time series: Gappy Time Series, Nonuniformly Sampled Time Series, and Time Series with Missing Data. Gappy Time Series refer to those with a regular sampling rate but with finitely many blocks of data missing. For example, in astronomy, a telescope's sight can be blocked by a cloud or celestial obstacles for a certain period of time, which makes the recorded samples unavailable during that time [10]. Nonuniformly Sampled Time Series refer to those with observations at non-uniform time points. For example, in health care applications, patients usually have difficulties in rigorously recording daily (or weekly) health conditions or pill-taking logs for a long period of time [16]. Time Series with Missing Data appear commonly in applications such as sensor networks, where some data points are missing or corrupted due to sensor failure. In this paper, unless otherwise stated, the term irregular time series refers to all three groups or any combination of them.

While regular time series need only one vector of values on uniformly spaced time stamps, irregular time series need a second vector specifying the time stamps at which the data are collected. Following the notation of [22], we represent an irregular time series as a tuple of (time stamp, value) pairs. Formally, we have the following definition:

Definition 2.1. An irregular time series x of length N is denoted by x = \{(t_n, x_n)\}_{n=1}^{N}, where the time-stamp sequence \{t_n\} is strictly increasing, i.e., t_1 < t_2 < \ldots < t_N, and x_n is the value of the time series at the corresponding time stamp.
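Definition 2.1 translates directly into a data structure holding the two vectors and enforcing the strictly increasing time stamps; a minimal sketch in Python (the class name and fields are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class IrregularTimeSeries:
    """An irregular time series x = {(t_n, x_n)}, n = 1..N."""
    t: list  # time stamps, required to satisfy t_1 < t_2 < ... < t_N
    x: list  # values at the corresponding time stamps

    def __post_init__(self):
        assert len(self.t) == len(self.x), "each time stamp needs a value"
        assert all(a < b for a, b in zip(self.t, self.t[1:])), \
            "time stamps must be strictly increasing"

# A gappy series: regular unit sampling with a block of data missing.
ts = IrregularTimeSeries(t=[1, 2, 3, 7, 8], x=[0.1, 0.4, 0.2, 0.9, 0.8])
```

The same container covers all three irregularity types from Section 2; only the pattern of the time stamps differs.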

The central task of this paper is as follows: given P irregular time series \{x^{(1)}, \ldots, x^{(P)}\}, we are interested in developing efficient learning algorithms that uncover the temporal causal network revealing the temporal dependencies between these time series.

3 Related Work

In this section, we review existing work on learning temporal causal networks and irregular time series analysis.

3.1 Temporal Causal Networks Learning temporal causal networks for regular time series has become one of the most important problems in time series analysis. It has broad applications in biology, climate science, social science, and so on. Many approaches have been developed to solve the problem, such as autocorrelation, cross-correlation [8], randomization tests [17], phase slope index [20], and so on. Among them, a regression-based method called "Granger Causality" was introduced in the area of econometrics for time series analysis [13]. It states that a variable x is a cause of another variable y if the past values of x are helpful in predicting the future values of y. In other words, consider the following two regressions:

(3.1)  x_t = \sum_{l=1}^{L} a_l x_{t-l}

(3.2)  x_t = \sum_{l=1}^{L} a_l x_{t-l} + \sum_{l=1}^{L} b_l y_{t-l},

where L is the maximal time lag. If Equation (3.2) is a significantly better model than Equation (3.1), we determine that time series y Granger causes time series x. Granger causality has gained tremendous success across many domains due to its simplicity, robustness, and extendability [21, 9].
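For regular time series, the comparison between models (3.1) and (3.2) amounts to fitting both regressions and checking whether adding the lags of y reduces the prediction error; a minimal least-squares sketch (function and variable names are illustrative; a formal test would use an F-statistic rather than the raw residual sums of squares):

```python
import numpy as np

def granger_improvement(x, y, L):
    """Residual sum of squares of model (3.1) vs. model (3.2).

    Model (3.1) regresses x_t on its own L lags; model (3.2) adds
    the L lags of y. A markedly lower RSS for (3.2) suggests that
    y Granger-causes x.
    """
    T = len(x)
    # Design matrices of lagged observations for t = L, ..., T-1.
    X_own = np.column_stack([x[L - l:T - l] for l in range(1, L + 1)])
    X_both = np.column_stack(
        [x[L - l:T - l] for l in range(1, L + 1)] +
        [y[L - l:T - l] for l in range(1, L + 1)])
    target = x[L:]
    rss = []
    for X in (X_own, X_both):
        coef, *_ = np.linalg.lstsq(X, target, rcond=None)
        rss.append(np.sum((target - X @ coef) ** 2))
    return rss[0], rss[1]
```

On data where x is driven by the first lag of y, the second RSS is far below the first, and the ratio of the two is what a significance test would formalize.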

Most existing algorithms for detecting Grangercausality are based on a statistical significance test.


It is extremely time-consuming and very sensitive to the number of observations in the autoregression. In [2], the Lasso-Granger method was proposed to solve the problem and shown to have superior performance in terms of accuracy and scalability. Suppose we have P time series x^{(1)}, \ldots, x^{(P)} with observations at equally spaced time stamps t = 1, \ldots, T. The basic idea of the Lasso-Granger method is to utilize the L1 penalty to perform variable selection, which corresponds to neighborhood selection in learning the temporal causal network [2]. Specifically, for each time series x^{(i)}, we can obtain a sparse solution of the coefficients a_i by solving the following Lasso problem:

(3.3)  \min_{a_i} \sum_{t=L+1}^{T} \left\| x^{(i)}_t - \sum_{j=1}^{P} a_{i,j}^\top x^{(j)}_{t,\mathrm{Lagged}} \right\|_2^2 + \lambda \|a_i\|_1,

where x^{(j)}_{t,\mathrm{Lagged}} is the concatenated vector of lagged observations, i.e., x^{(j)}_{t,\mathrm{Lagged}} = [x^{(j)}_{t-L}, \ldots, x^{(j)}_{t-1}]; a_{i,j} is the j-th coefficient vector of a_i, modeling the effect of time series j on time series i; and \lambda is the penalty parameter, which determines the sparseness of a_i. The resulting optimization problem can be solved efficiently by the sub-gradient method [7], LARS [29], and so on. We then determine that there is an edge from time series j to time series i if and only if a_{i,j} is a non-zero vector.
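The Lasso problem (3.3) can be solved by any proximal-gradient solver; a minimal sketch using ISTA on the stacked lagged design matrix (all names are illustrative, and the soft-thresholding step corresponds to the λ‖·‖₁ penalty):

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_granger(series, i, L, lam, n_iter=1000):
    """Lasso-Granger (3.3) for target series i via proximal gradient.

    series: array of shape (P, T) of regularly sampled time series.
    Returns coefficients reshaped to (P, L); an edge j -> i is
    declared iff the block a_{i,j} is a non-zero vector.
    """
    P, T = series.shape
    # Row t of the design matrix concatenates the lagged vectors
    # x^{(j)}_{t,Lagged} = [x^{(j)}_{t-L}, ..., x^{(j)}_{t-1}], j = 1..P.
    X = np.array([np.concatenate([series[j, t - L:t] for j in range(P)])
                  for t in range(L, T)])
    y = series[i, L:]
    a = np.zeros(P * L)
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = X.T @ (X @ a - y)            # gradient of the squared loss
        a = soft_threshold(a - step * grad, step * lam)
    return a.reshape(P, L)
```

On synthetic data where series 0 depends only on one lag of series 1, the coefficient block for an unrelated series shrinks to exactly zero while the block for series 1 stays clearly nonzero, which is how the edges of the causal graph are read off.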

The Lasso-Granger method not only significantly reduces the computational complexity compared with pairwise significance tests, but also enjoys nice theoretical consistency properties [2]. It has been widely applied, with strong empirical results, to biology [30], climate analysis [11], fMRI data analysis [23], and so on.

3.2 Irregular Time Series Analysis Existing methods for analyzing irregular time series can be summarized into three categories: (i) repairing the irregularities of the time series by either filling the gaps or resampling the series at uniform intervals; (ii) generalizations of spectral analysis tools to irregular time series; and (iii) applications of kernel techniques to time series data.

Repair Methods The basic idea of repair methods is to interpolate the given time series at regularly spaced time stamps. The produced time series can then be used in the temporal causal analysis for regular time series. For Gappy Time Series or Time Series with Missing Data, we can use a regression algorithm to recover the time series by filling in the blank time stamps, which is also known as Surrogate Data Modeling [16]. In the case of non-uniformly sampled time series, the common practice is to find the value of the time series at regularly-spaced time stamps via a regression algorithm [25, 16, 22]. In some applications, a transient time model that describes the behavior of the system (e.g., by differential equations) is available and can be used to recover the missing data [14, 10]. However, this approach strongly depends on the model accuracy, which prevents it from being applicable to datasets governed by complex natural processes (e.g., the climate system) [22]. One major issue with all repair methods is that the interpolation error propagates through all subsequent processing steps. As a result, quantifying the effect of the interpolation on any resulting statistical inference becomes a challenging task [19].

Generalization of the Spectral Analysis Tools The basic idea of this approach is to find the spectrum of the irregular time series by generalizing the Fourier transform. The Lomb-Scargle Periodogram (LSP) [24] is one classical example of this approach. In LSP, we first fit the time series to a sine curve in order to obtain its spectrum, which can be used to calculate the power spectral density. Then the auto-correlation and cross-correlation functions can be found by taking the inverse Fourier transform (using the iFFT) of the corresponding power spectral density and cross power spectral density functions [3, 22, 26]. The versatility of wavelet transforms and their numerous applications in time series analysis have motivated researchers to extend the transform to irregular time series [28, 12, 19]. LSP-based algorithms are specially designed for spectral analysis of periodic signals; they do not perform well for non-periodic signals [27] and are not robust to outliers in the data [22]. Furthermore, computing the entire cross-correlation function is a much more challenging task than our task of learning temporal dependencies, and the produced correlation function can be a complex function, which is difficult to interpret and to analyze theoretically.

Kernel-based Methods The basic idea of kernel-based methods is to generalize the regular correlation operator via kernels, without the complex computation of the correlation function. One classical example of this approach is the slotting technique [22], which computes a weighted (kernelized) correlation of two irregularly sampled time series. For two irregular time series \{(t^x_n, x_n)\}_{n=1}^{N_x} and \{(t^y_m, y_m)\}_{m=1}^{N_y}, the correlation is computed as follows:

(3.4)  \rho(l, \Delta t) = \frac{\sum_{n=1}^{N_x} \sum_{m=1}^{N_y} x_n y_m\, w_{l,\Delta t}(t^x_n, t^y_m)}{\sum_{n=1}^{N_x} \sum_{m=1}^{N_y} w_{l,\Delta t}(t^x_n, t^y_m)},

where w is the kernel function, l is the lag number, and \Delta t is the average sampling interval length. For example, w can be the Gaussian kernel:

(3.5)  w_{l,\Delta t}(t_1, t_2) = \exp\left( -\frac{(t_2 - t_1 - l\Delta t)^2}{\sigma^2} \right),

where \sigma is the kernel bandwidth. The main advantage of kernel-based methods is that they are nonparametric and can easily be extended to any regression-based method. Furthermore, they do not make specific assumptions about the data, e.g., the model assumption in the repair methods or the periodicity assumption in LSP-based methods.
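The slotted correlation (3.4) with the Gaussian kernel (3.5) can be written in a few lines; a sketch assuming the kernel-mass normalization shown above (function names, σ, and the inputs are illustrative):

```python
import numpy as np

def gaussian_kernel(t1, t2, lag, dt, sigma):
    # Equation (3.5): the weight peaks when t2 - t1 is near lag * dt.
    return np.exp(-((t2 - t1 - lag * dt) ** 2) / sigma ** 2)

def slotted_correlation(tx, x, ty, y, lag, dt, sigma=0.5):
    """Kernel-weighted correlation of two irregular series at a given lag."""
    # Pairwise weights w(t^x_n, t^y_m); rows index x, columns index y.
    W = gaussian_kernel(tx[:, None], ty[None, :], lag, dt, sigma)
    return np.sum(W * x[:, None] * y[None, :]) / np.sum(W)
```

For a sinusoid and a delayed copy of it sampled at unrelated irregular times, the slotted correlation is largest at the lag matching the delay, which is exactly the effect the slotting technique exploits.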

4 Methodology

The major challenge in uncovering temporal causal networks for irregular time series is how to effectively capture the temporal dependencies without directly estimating the values of missing data or making restrictive assumptions about the generation process of the time series. In this paper, we propose a simple yet extremely effective approach that generalizes the inner product operator via kernels so that regression-based temporal causal models become applicable to irregular time series.

In this section, we describe our proposed model in detail, analyze its theoretical properties, and discuss one extension that takes the irregularity patterns of the input time series into account for accurate prediction.

4.1 Generalized Lasso Granger (GLG) The key idea of our model is as follows: if we treat a_{i,j} in Equation (3.3) as a time series, a_{i,j}^\top x^{(j)} can be considered as its inner product with another time series x^{(j)}. If we are able to generalize the inner product operator to irregular time series, the temporal causal models for regular time series can easily be extended to handle irregular ones.

Let us denote the generalization of the dot product between two irregular time series x and y by x \odot y, which can be interpreted as a (possibly non-linear) function that measures the unnormalized similarity between them. Depending on the target application, one can define different similarity measures, and thus different inner product definitions. For example, we can define the inner product as a function that is linear with respect to the components of the first time series as follows:

(4.6)  x \odot y = \sum_{n=1}^{N_x} \frac{\sum_{m=1}^{N_y} x_n y_m\, w(t^x_n, t^y_m)}{\sum_{m=1}^{N_y} w(t^x_n, t^y_m)},

where w is the kernel function; for example, w can be the Gaussian kernel defined in Equation (3.5). Many other similarity measures for time series have been developed for classification and clustering tasks [32, 34] and can be used in our dot product definition.
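With the Gaussian kernel, the linear inner product (4.6) amounts to forming a kernel-smoothed estimate of y at each time stamp of x and summing x_n times that estimate; a minimal sketch (the function name and σ are illustrative):

```python
import numpy as np

def glg_inner_product(tx, x, ty, y, sigma=1.0):
    """Generalized inner product (4.6) between irregular series x and y.

    For each observation (t^x_n, x_n), the values of y are averaged with
    Gaussian weights centred at t^x_n, then multiplied by x_n and summed.
    """
    W = np.exp(-((tx[:, None] - ty[None, :]) ** 2) / sigma ** 2)
    y_at_tx = (W @ y) / W.sum(axis=1)   # kernel-weighted estimate of y
    return float(np.sum(x * y_at_tx))
```

When both series share the same regular time stamps and σ is small, the weights concentrate on the matching stamp and the expression reduces to the ordinary dot product xᵀy, which is the sanity check one would expect of a generalization.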

Given the generalized inner product operator, we can now extend the regression in Equation (3.3) to obtain the desired optimization problem for irregular time series. Formally, suppose P irregular time series x^{(1)}, \ldots, x^{(P)} are given. Let \Delta t denote the average length of the sampling intervals of the target time series (e.g., x^{(i)}), and let a'_{i,j}(t) be a pseudo time series, i.e.,

a'_{i,j}(t) = \{(t_l, a_{i,j,l}) \mid l = 1, \ldots, L,\ t_l = t - l\Delta t\},

which means that for different values of t, the a'_{i,j}(t) share the same observation vector (i.e., \{a_{i,j}\}), but the time-stamp vectors vary according to the value of t. We can perform the causality analysis by the generalized Lasso Granger (GLG) method, which solves the following optimization problem:

(4.7)  \min_{\{a_{i,j}\}} \sum_{n=\ell_0}^{N_i} \left\| x^{(i)}_n - \sum_{j=1}^{P} a'_{i,j}(t^{(i)}_n) \odot x^{(j)} \right\|_2^2 + \lambda \|a_i\|_1,

where \ell_0 is the smallest value of n that satisfies t^{(i)}_n \ge L\Delta t.

The above optimization problem is not convex in general, and convex optimization algorithms can only find a local minimum. However, if the generalized inner product is defined so that Problem (4.7) is convex, there are efficient algorithms such as FISTA [5] for solving optimization problems of the form f(\theta) + \|\theta\|_1 where f(\theta) is convex. In this paper, we use the linear generalization of the inner product given by Equation (4.6), with which Problem (4.7) can be reformulated as linear prediction of x^{(i)}_n using the parameters a'_{i,j}(t^{(i)}_n), subject to an \ell_1-norm penalty on the parameters. Thus, the problem is a Lasso problem and can be solved more efficiently by optimized Lasso solvers such as [29].
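Concretely, with the linear inner product (4.6), each residual in (4.7) is linear in the unknown coefficients a_{i,j,l}, so GLG reduces to building a design matrix whose entries are kernel-smoothed lagged values and handing it to any Lasso solver; a sketch of one such row (all names are illustrative):

```python
import numpy as np

def glg_design_row(t_target, series, L, dt, sigma=1.0):
    """One row of the GLG design matrix for a target time stamp.

    For each source series j and lag l, the entry is the kernel-weighted
    average of x^(j) around the pseudo time stamp t_target - l*dt, so the
    prediction is linear in the coefficients a_{i,j,l}.
    series: list of (timestamps, values) pairs, one per source series.
    """
    row = []
    for t_src, x_src in series:
        for l in range(1, L + 1):
            w = np.exp(-((t_target - l * dt - t_src) ** 2) / sigma ** 2)
            row.append(np.dot(w, x_src) / np.sum(w))
    return np.array(row)
```

Stacking these rows over all valid target stamps t^{(i)}_n ≥ LΔt and applying an ℓ1-penalized least-squares solver to the resulting matrix is exactly the Lasso reformulation of Problem (4.7).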

4.2 Extension of GLG Method Notice that every data point x^{(i)}_n in the GLG method has equal impact in the regression. We make the following two


Figure 1: Time Series #1 is the target time series in this figure. Prediction of x^{(1)}_{n_1} should receive a higher weight than x^{(1)}_{n_2} in the depicted scenario because it can be predicted more accurately.

Figure 2: Time Series #1 is the target time series in this figure; the other time series are not shown. Prediction of x^{(1)}_{n_2} should have a higher weight in the causality inference than x^{(1)}_{n_1} because x^{(1)}_{n_1} is in a denser region of the time series.

observations, which could help improve the performance.

The first observation, depicted in Figure 1, states that samples with more data points available for their prediction will be predicted more accurately; hence they should contribute more to learning the causality graph. We therefore define the following weight for the subproblem of predicting x^{(i)}_n:

v_1(x^{(i)}_n) = \sum_{j=1}^{P} \sum_{\ell=1}^{L} \sum_{m=1}^{N_j} w(t^{(i)}_n - \ell\Delta t,\ t^{(j)}_m).

The second observation, depicted in Figure 2, states that the learning should be distributed uniformly in time. In other words, samples in denser regions should contribute less than those in sparse regions. Thus we define the following weights:

v_2(x^{(i)}_n) = \frac{w(t^{(i)}_n, t^{(i)}_n)}{\sum_{m=1}^{N_i} w(t^{(i)}_n, t^{(i)}_m)}.

Using the introduced weights, we define the Weighted Generalized Lasso Granger (W-GLG) as follows:

Figure 3: The sources of repair errors in GLG when x^{(1)}_n is being predicted. In order to predict the data point x^{(1)}_n, GLG repairs the time series at the L points before the time t^{(1)}_n. At each point of time a repair error z^{(i)}_{t-\ell\Delta t} is produced.

(4.8)  \min_{\{a_{i,j}\}} \sum_{n=\ell_0}^{N_i} v_1(t^{(i)}_n)\, v_2(t^{(i)}_n) \left\| x^{(i)}_n - \sum_{j=1}^{P} a'_{i,j}(t^{(i)}_n) \odot x^{(j)} \right\|_2^2 + \lambda \|a_i\|_1.
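Both weights can be computed directly from the kernel and the time stamps; a minimal sketch with a Gaussian kernel (function and argument names are illustrative, not from the paper):

```python
import numpy as np

def gaussian(t1, t2, sigma=1.0):
    return np.exp(-((t2 - t1) ** 2) / sigma ** 2)

def w_glg_weights(t_n, all_stamps, own_stamps, L, dt, sigma=1.0):
    """Weights v1 and v2 for the sample at time stamp t_n.

    v1 grows with the kernel mass available around the L pseudo time
    stamps t_n - l*dt (more nearby data -> more reliable prediction);
    v2 is inversely proportional to the local sampling density of the
    target series, down-weighting samples in densely sampled regions.
    all_stamps: list of time-stamp arrays, one per series j = 1..P.
    own_stamps: time-stamp array of the target series i.
    """
    v1 = sum(gaussian(t_n - l * dt, t_j, sigma).sum()
             for t_j in all_stamps for l in range(1, L + 1))
    v2 = gaussian(t_n, t_n, sigma) / gaussian(t_n, own_stamps, sigma).sum()
    return v1, v2
```

Multiplying each squared residual in the GLG objective by the product v1·v2, as in (4.8), leaves the problem a weighted Lasso, so the same solvers apply unchanged.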

4.3 Asymptotic Consistency of GLG We follow the procedure in [18] to study the consistency of our generalized Lasso Granger method. Suppose there are two instantiations of the set of random processes x^{(i)}: one with a regular sampling frequency and the other with an irregular sampling frequency.

Figure 3 shows the sources of the errors when x^{(1)}_n is predicted using GLG. As the figure shows, GLG implicitly interpolates the values of the time series at the regular steps [t_n - L\Delta t, \ldots, t_n - \Delta t] and uses them for the prediction of x^{(1)}_n. We model the error induced at time t_n - \ell\Delta t in the j-th time series by z^{(j)}_{t_n - \ell\Delta t} and define the repaired regular time series \tilde{x}^{(i)}_t = x^{(i)}_t + z^{(i)}_t for i = 1, \ldots, P and t = t_n - L\Delta t, \ldots, t_n - \Delta t. Now, it suffices to show that the graph inferred using the repaired regular time series \tilde{x}^{(i)}_t is the same as the graph that would be inferred if the original regular time series x^{(i)}_t were available.

In order to proceed with the proof, we assume that the x^{(i)}_t are zero-mean Gaussian random variables. Similar to the regular case, let X^{(i)} be the vector of the n samples of \tilde{x}^{(i)}_t. We can write GLG as

(4.9)  \hat{a}^{\lambda}_{i,j} = \arg\min_{a_{i,j}} \left\{ n^{-1} \left\| X^{(i)} - \sum_j X^{(j)} a_{i,j} \right\|_2^2 + \lambda \|a_{i,j}\|_1 \right\}.

Let \Gamma denote the set of all the time series, and suppose first that we have the regular time series. The neighborhood ne(i) of a time series x^{(i)} is the smallest subset of \Gamma \setminus \{x^{(i)}\} such that, given this subset, x^{(i)} is conditionally independent of all the remaining time series. Now, define the neighborhood of x^{(i)} estimated from the irregular time series as ne_\lambda(i) = \{x^{(j)} \mid \hat{a}^{\lambda}_{i,j} \ne 0\} in the solution of the above problem.

Theorem 4.1. Let Assumptions 1-6 in [18] hold for the solution of Problem (4.9).

(a) Suppose E[z^{(i)}_p x^{(i)}_q] = 0 for all i = 1, \ldots, N and p, q = t_n, \ldots, t_n - L\Delta t, p \ne q. Let the penalty parameter satisfy \lambda \sim d n^{-(1-\varepsilon)/2} with \kappa < \varepsilon < \xi and d > 0. Then there exists some c > 0 such that, for all x^{(i)} \in \Gamma,

P(ne_\lambda(i) \subseteq ne(i)) = 1 - O(\exp(-c n^{\varepsilon})), for n \to \infty,
P(ne(i) \subseteq ne_\lambda(i)) = 1 - O(\exp(-c n^{\varepsilon})), for n \to \infty.

(b) The above results can be violated if E[z^{(i)}_p x^{(i)}_q] \ne 0 for some i = 1, \ldots, N and p, q = t_n, \ldots, t_n - L\Delta t, p \ne q.

Proof. A proof is given in the Appendix.

The theorem specifies a sufficient condition under which the temporal causal graph inferred from the irregular time series is asymptotically equal to the graph that could be inferred if the regular time series were available. The condition states that if the repair error incurred during GLG's operation is orthogonal to the repaired data points, the inferred temporal causal graph will be the same as the actual one with probability approaching one as the length of the time series grows.

Choice of Kernel in GLG Method Consider a non-uniformly sampled time series, and suppose the non-uniformity is due to clock jitter, modeled by i.i.d. zero-mean Gaussian variables at the different sampling times. If the kernel function w satisfies the following equation, the neighborhood learned by our algorithm is guaranteed to be the same as in the regular time series case:

(4.10)  \sum_{\ell=1}^{N} \left[ K(\ell + a_\ell, q) - K(p, q) \right] w(\ell + a_\ell, p) = 0,

for all p, q = 1, \ldots, N. In this equation, K(t, t') is the covariance function of the time series and a_\ell is the random variable modeling the clock jitter at the \ell-th time stamp.

Proof. By Theorem 4.1, it is sufficient to show that the following equation holds for all p, q = 1, \ldots, N, which means that the error is orthogonal to all the observations used in Lasso-Granger:

(4.11)  E_x\left[ \frac{\sum_{\ell=1}^{N} \left( x^{(i)}_{\ell + a_\ell} - x^{(i)}_p \right) w(\ell + a_\ell, p)}{\sum_{\ell=1}^{N} w(\ell + a_\ell, p)}\, x^{(i)}_q \right] = 0.

The expectation is with respect to x; thus

(4.12)  E_x\left[ \left( \sum_{\ell=1}^{N} \left( x^{(i)}_{\ell + a_\ell} - x^{(i)}_p \right) w(\ell + a_\ell, p) \right) x^{(i)}_q \right] = 0.

Taking the expectation and using the definition of the covariance function K(t, t') = E_x[x^{(i)}_t x^{(i)}_{t'}] yields the result. □

A quick inspection of Equation (4.10) shows that the Gaussian kernel does not satisfy it: setting q = p and noting that the covariance function always satisfies K(\ell + a_\ell, p) \le K(p, p), we see that no non-negative kernel, including the Gaussian kernel, can satisfy Equation (4.10). However, if the time series is smooth enough that K(t, p) \approx K(p, p) for t in a neighborhood of p, the Gaussian kernel attenuates the effect of K(t, p) - K(p, p) for values of t outside the neighborhood, and the left side of Equation (4.10) can become close to zero. In conclusion, while the Gaussian kernel is suboptimal for causality analysis purposes, for smooth time series it approximately satisfies Equation (4.10) and is expected to perform well.

Comparison of GLG Method with Time Decay Kernels and Locally Weighted Regression The consistency of repair methods such as the kernel weighted regression algorithm can be analyzed similarly to that of the GLG method by introducing the repair error variables. However, GLG is expected to have lower absolute error because it predicts the actual observations x^{(i)}_n without additional repair error. In contrast, repair methods such as LWR first interpolate the time series at regular time stamps and then predict the repaired samples, which carry repair errors with them.

As alluded to in Section 3, the repair methods introduce a huge amount of error when reconstructing an irregular time series across large gaps; see Figure 4. In contrast, since there are no samples within the gaps, GLG does not attempt to predict the value of the time series there and thus avoids these errors.

5 Experimental Results

In order to examine the performance of GLG, we perform two series of experiments, one on the synthetic


Figure 4: The black circles show the time stamps of the given irregular time series; the crosses are the time stamps used for the repair of the time series. The red time stamp marks a moment at which the repair methods produce large errors and then propagate them by predicting the erroneous repaired sample. GLG skips these intervals because it predicts only the observed samples.

datasets, and the other one on the Paleo dataset foridentifying the monsoon climate patterns in Asia.

5.1 Synthetic Datasets Design of synthetic datasets with irregular sampling times has been discussed extensively in the literature. We follow the methodology in [22] to create four different synthetic datasets, constructed to emulate different sampling patterns and functional behaviors.

Mixture of Sinusoids with Missing Data (MS-M): To create the MS-M dataset, we generate $P - 1$ time series according to a mixture of three sinusoidal functions:

(5.13)  $x^{(i)}_t = \sum_{j=1}^{3} A^{(i)}_j \cos\big(2\pi f^{(i)}_j t + \varphi^{(i)}_j\big) + \varepsilon^{(i)}, \quad \text{for } i = 2, \ldots, P,$

where $f^{(i)}_j \sim \mathrm{Unif}(f_{\min}, f_{\max})$, $\varphi^{(i)}_j \sim \mathrm{Unif}(0, 2\pi)$, and the vector of amplitudes $A^{(i)}$ is distributed as $\mathrm{Dirichlet}(\mathbf{1}_{3\times 1})$ so that every time series has at least a minimal amount of power. The range of frequencies is selected so that only a few periods of each sinusoid repeat within the dataset. The noise term $\varepsilon^{(i)}$ is zero-mean Gaussian. We create the target time series according to the linear model

$x^{(1)}_t = \sum_{i=2}^{P} \sum_{\ell=1}^{L} \alpha^{(i)}_\ell x^{(i)}_{t-\ell} + \varepsilon,$

where we set the coefficients $\alpha^{(i)}_\ell$ to be sparse in order to model sparse graphs.
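Under our reading of the construction above, the MS-M generative model can be sketched end to end; the parameter values, noise level, and sparsity pattern here are illustrative assumptions, not the paper's exact settings, and the last step applies the independent sample dropping with probability $p_m$ that completes the dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
P, N, L = 5, 200, 3          # number of series, length, max lag (illustrative)
f_min, f_max = 0.01, 0.05    # chosen so only a few periods repeat
p_m = 0.2                    # missing probability

t = np.arange(N, dtype=float)
X = np.zeros((P, N))

# Series 2..P: mixtures of three sinusoids, Eq. (5.13).
for i in range(1, P):
    f = rng.uniform(f_min, f_max, size=3)
    phi = rng.uniform(0, 2 * np.pi, size=3)
    A = rng.dirichlet(np.ones(3))          # every series keeps some power
    X[i] = sum(A[j] * np.cos(2 * np.pi * f[j] * t + phi[j]) for j in range(3))
    X[i] += rng.normal(0, 0.05, size=N)

# Target series: sparse linear combination of lagged values.
alpha = rng.normal(0, 1, size=(P, L))
alpha[rng.random((P, L)) > 0.3] = 0.0      # sparsity -> sparse causal graph
for n in range(L, N):
    X[0, n] = sum(alpha[i, l] * X[i, n - l - 1]
                  for i in range(1, P) for l in range(L))
X[0, L:] += rng.normal(0, 0.05, size=N - L)

# Drop samples independently with probability p_m to make it irregular.
masks = [rng.random(N) > p_m for _ in range(P)]
irregular = [(t[m], X[i][m]) for i, m in enumerate(masks)]
```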

Finally, we drop samples from all the time series independently with probability $p_m$. We create 20 instances of MS-M random datasets and report the average performance on them.

Mixture of Sinusoids with Jittery Clock (MS-J): To create this dataset, we first create the sampling time stamps for the target time series: $t_1 = [1, \ldots, N] + e_1$, where the random vector $e_1 = [e_{11}, \ldots, e_{1N}]$ is a zero-mean Gaussian random vector with covariance matrix $\gamma I$, and $\gamma$ is called the jitter variance. We select the parameters of the other time series and use them to calculate the value of the target time series at the given time stamps:

$x^{(1)}_n = \sum_{i=2}^{P} \sum_{\ell=1}^{L} \alpha^{(i)}_\ell \sum_{j=1}^{3} A^{(i)}_j \cos\big(2\pi f^{(i)}_j (t_n - \ell) + \varphi^{(i)}_j\big) + \varepsilon.$

Then we produce the sampling times for the other time series similarly, as a Gaussian random vector with mean $[1, \ldots, N]$ and covariance matrix $\gamma I$, and use Equation (5.13) to produce the time series. We create 50 instances of MS-J random datasets and report the average performance on them.

Auto-Regressive Time Series with Missing Data (AR-M): The procedure for creating the AR-M dataset is similar to the one described for MS-M, with a single difference: time series 2 to P are produced according to AR processes:

$x^{(i)}_n = \sum_{\ell=1}^{3} \beta_{i,\ell}\, x^{(i)}_{t_{n-\ell}} + \varepsilon, \quad \text{for } i = 2, \ldots, P,$

where the $\beta_{i,\ell}$ are chosen randomly while keeping the AR time series stable. We create 50 instances of AR-M random datasets and report the average performance on them.

Mixture of Sinusoids with Poisson Process Sampling Times (MS-P): The procedure for creating the MS-P dataset is similar to the one described for MS-J, with the difference that the sampling times are produced according to a Poisson point process; i.e., the inter-sampling times are distributed according to Exp(1). We create 50 instances of MS-P random datasets and report the average performance on them.

Baselines: We compare the performance of our algorithm with four state-of-the-art algorithms. We use Locally Weighted Regression (LWR) and Gaussian Process (GP) regression to repair the time series and then perform the regular Lasso-Granger. The Slotting technique [22] and the LSP method are the two algorithms used for finding mutual correlation. Since the performance of W-GLG is close to that of GLG, W-GLG is compared in only one figure to avoid cluttered plots.

Performance Measures: To report the accuracy of the inferred graph, we use the Area Under the Curve (AUC) score. The AUC is the probability that the algorithm assigns a higher value to a randomly chosen positive (existing) edge than to a randomly chosen negative (non-existing) edge in the graph.

Figure 5: Convergence behavior of the algorithms on (left) the Mixture of Sinusoids with Missing data points dataset, (middle) the Jittery Clock dataset, and (right) the Autoregressive Time Series with Missing data points dataset. All panels plot the AUC score against the length of the time series for GLG, LWR, GP, Slotting, and LSP.

Parameter Tuning: Unless otherwise stated, in all of the kernel-based methods (GLG, LWR, and the Slotting technique) we use a Gaussian kernel with bandwidth $\sigma = \Delta t / 2$. We select the value of $\lambda$ in GLG by 5-fold cross-validation.
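The AUC score as defined above can be computed directly from edge scores; a small self-contained sketch (function and variable names are ours):

```python
import itertools

def edge_auc(scores, truth):
    """AUC = P(score of a random true edge > score of a random non-edge),
    counting ties as 1/2; `scores` and `truth` are dicts keyed by edge."""
    pos = [scores[e] for e in truth if truth[e]]
    neg = [scores[e] for e in truth if not truth[e]]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in itertools.product(pos, neg))
    return wins / (len(pos) * len(neg))

# Toy graph: two true edges, two absent edges, scored by some algorithm.
scores = {("a", "b"): 0.9, ("b", "c"): 0.4, ("a", "c"): 0.3, ("c", "a"): 0.4}
truth = {("a", "b"): True, ("b", "c"): True, ("a", "c"): False, ("c", "a"): False}
auc = edge_auc(scores, truth)
```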

5.2 Results on the Synthetic Datasets
Experiment #1: Convergence Behavior of the Algorithms In Figure 5, we increase the length of the time series in datasets MS-M and MS-J to study the convergence behavior of GLG. Both figures demonstrate the excellent convergence behavior of GLG and show that GLG consistently outperforms the other algorithms by a large margin. Note that the convergence behavior of LWR is very similar to that of GLG; but, as alluded to in the theoretical analysis, GLG has lower absolute error. The Slotting technique and the LSP method are designed for cross-correlation analysis of time series and, as expected, do not perform well in our setting. The poor performance of LSP can be linked to the assumption of periodic signals in the LSP analysis method, which is violated in our datasets.

Experiment #2: Comparison of Different Kernels To study the effect of different kernels on the performance of GLG, we test its convergence behavior with three different kernels: the Gaussian kernel $w(t_1, t_2) = \exp(-(t_1 - t_2)^2/\sigma)$, the Sinc kernel $w(t_1, t_2) = \sigma \sin((t_1 - t_2)/\sigma)/(t_1 - t_2)$, and the inverse-distance kernel $w(t_1, t_2) = (t_1 - t_2)^{-2}$. As shown in Figure 6, and as pointed out in the analysis section, the Gaussian kernel shows acceptable convergence behavior. The performance of the inverse-distance kernel suggests asymptotic convergence but with much higher absolute error. The Sinc kernel does not converge properly.
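For reference, the three kernels of Experiment #2 can be written down directly; the guard for the Sinc kernel's removable singularity (via NumPy's pi-normalized `np.sinc`) is our addition:

```python
import numpy as np

def gaussian(t1, t2, sigma=1.0):
    # w(t1, t2) = exp(-(t1 - t2)^2 / sigma)
    return np.exp(-((t1 - t2) ** 2) / sigma)

def sinc(t1, t2, sigma=1.0):
    # w(t1, t2) = sigma*sin((t1 - t2)/sigma)/(t1 - t2) = sin(d/sigma)/(d/sigma);
    # np.sinc is pi-normalized and handles the removable singularity at d = 0.
    return np.sinc((t1 - t2) / (np.pi * sigma))

def inverse_distance(t1, t2):
    # w(t1, t2) = 1/(t1 - t2)^2; diverges as t1 -> t2, so coincident
    # time stamps must be excluded in practice.
    return 1.0 / (t1 - t2) ** 2
```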

Figure 6: Convergence of different kernels on the Mixture of Sinusoids with Missing Data dataset (AUC score vs. length of the time series for the Gaussian, Sinc, and Inverse Distance kernels).

Figure 7: The effect of missing data on the Mixture of Sinusoids with Missing Data points dataset (AUC score vs. missing probability for GLG, LWR, GP, Slotting, and LSP).

Figure 8: The effect of clock jitter ($\gamma$) on the Mixture of Sinusoids with Jittery Clock dataset (AUC score vs. jitter for GLG, LWR, GP, Slotting, and LSP).


Figure 9: Performance comparison of the algorithms (AUC score of GLG, LWR, GP, Slotting, and LSP) on the Mixture of Sinusoids with Poisson process sampling times.

Experiment #3: The Impact of Missing Rate The effect of missing data is examined in Figure 7. As the probability of missing a data point decreases, GLG becomes more accurate. Note (i) the superior performance of GLG compared to the others, and (ii) that when only 10% of the data points are missing, GLG perfectly uncovers the correct causality relationships between the variables.

Experiment #4: Clock Jitter Impact The effect of clock jitter ($\gamma$) is examined in Figure 8. GLG is robust with respect to the amount of clock jitter. This is because (i) the time series are smooth, and (ii) we are testing in a reasonable range of jitter.

Experiment #5: Performance of the Algorithms in Extremely Irregular Datasets Due to limited space, we report only the result of one experiment, on the Mixture of Sinusoids with Poisson process sampling times; see Figure 9. On this extremely irregularly sampled dataset, our algorithm outperforms the other algorithms by a large margin. This is due to the high probability of large gaps in this dataset and the fact that GLG avoids the huge errors caused by interpolation in the gaps.

Experiment #6: Comparison of W-GLG and GLG Figure 10 compares the performance of the weighted version of our algorithm (W-GLG) with the non-weighted version (GLG) on all four synthetic datasets. W-GLG performs only marginally better than GLG on all the datasets except the one with Poisson sampling times. This is because only this extremely irregular dataset provides the situation for the weights to show their advantage.

Figure 10: Comparison of the performance of W-GLG vs. GLG on all four datasets.

5.3 Paleo Dataset Now we apply our method to a climate dataset to discover weather movement patterns. Climate scientists usually rely on models with an enormous number of parameters that need to be measured. The alternative is the data-centric approach, which attempts to find patterns directly in the observations.

The Paleo dataset studied in this paper is the collection of densities of $\delta^{18}\mathrm{O}$, an isotope of Oxygen, in four caves across China and India. Geologists were able to estimate $\delta^{18}\mathrm{O}$ in ancient ages by drilling deeper into the walls of the caves. The data are collected from the Dandak [6], Dongge [31], Heshang [15], and Wanxiang [33] caves (see Figure 11), with the irregular sampling patterns described in Table 1. The mean inter-sampling time varies from a high resolution of 0.50 ± 0.35 years to a low resolution of 7.79 ± 9.78 years; however, there are no large gaps between the measurement times.

Figure 11: Map of the locations (Wanxiang, Heshang, Dongge, Dandak) and the monsoon systems in Asia.

The density of $\delta^{18}\mathrm{O}$ in all the datasets is linked to the amount of precipitation, which was governed by the Asian monsoon system during the measurement period. The Asian monsoon system, depicted in Figure 11, affects a large share of the world's population by transporting moisture across the continent. The movement of monsoonal air masses can be discovered by analyzing their $\delta^{18}\mathrm{O}$ trace. Since the datasets are collected from locations across Asia, we are able to



Figure 12: Comparison of the results on the Paleo dataset; each panel shows the causality graph over the four locations (Wanxiang, Heshang, Dongge, Dandak): (a) GLG in the period 850AD–1563AD. (b) GLG in the period 1250AD–1564AD. (c) GLG in the period 850AD–1250AD. (d) Slotting technique in the period 850AD–1563AD. (e) Slotting technique in the period 1250AD–1564AD. (f) Slotting technique in the period 850AD–1250AD.

analyze the spatial variability of the Asian monsoon system.

In order to analyze the spatial transportation of moisture, we normalize all the datasets by subtracting the mean and dividing by the standard deviation. We use GLG with a Gaussian kernel of bandwidth 0.5 (y) and a maximum lag of 25 (y), i.e., $L = 50$. In order to compare our results with those produced by the slotting method in [22], we analyze the spatial relationships among the locations in three age intervals: (i) the entire overlapping age interval 850AD–1563AD, (ii) the warm phase 850AD–1250AD, and (iii) the cold period 1250AD–1563AD. Figure 12 compares the graphs produced by GLG with the ones reported in [22].
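As a rough illustration of this pipeline (not the authors' code: the toy records, the lag and bandwidth choices, and the plain ISTA solver standing in for the FISTA solver the paper cites [5] are all our assumptions), the normalization and penalized lagged regression could look like:

```python
import numpy as np

def zscore(x):
    """Normalize a record: subtract the mean, divide by the std."""
    return (x - x.mean()) / x.std()

def kernel_value(ts, xs, t, sigma=0.5):
    """Gaussian-kernel estimate of x(t) from irregular samples (ts, xs)."""
    w = np.exp(-((ts - t) ** 2) / (2 * sigma ** 2))
    return np.dot(w, xs) / w.sum()

def lagged_design(records, target, lags, dt=0.5, sigma=0.5):
    """One feature row per observed target stamp t: kernel-estimated
    values of every series at t - 1*dt, ..., t - lags*dt."""
    t_tgt, x_tgt = records[target]
    X = np.array([[kernel_value(ts, xs, t - l * dt, sigma)
                   for ts, xs in records.values()
                   for l in range(1, lags + 1)]
                  for t in t_tgt])
    return X, x_tgt

def lasso_ista(X, y, lam, iters=500):
    """Plain ISTA for 0.5*||Xw - y||^2 + lam*||w||_1."""
    w = np.zeros(X.shape[1])
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    for _ in range(iters):
        w -= X.T @ (X @ w - y) / L         # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)  # soft-threshold
    return w

# Toy irregular records standing in for two cave series; series "A" is a
# noisy, delayed copy of "B", so edges into "A" should be selected.
rng = np.random.default_rng(1)
t_a = np.sort(rng.uniform(0, 50, 120))
t_b = np.sort(rng.uniform(0, 50, 120))
x_b = np.sin(0.4 * t_b)
x_a = np.sin(0.4 * (t_a - 2.0)) + 0.05 * rng.normal(size=t_a.size)
records = {"A": (t_a, zscore(x_a)), "B": (t_b, zscore(x_b))}

X, y = lagged_design(records, "A", lags=4)
coef = lasso_ista(X, y, lam=1.0)           # nonzero entries = candidate edges
```

Nonzero entries of `coef`, grouped by source series and lag, mark candidate incoming edges for series "A".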

Table 1: Description of the Paleo Dataset.

Location   Measurement Period   Mean ∆t    Std(t_n − t_{n−1})
Dandak     624AD–1562AD         0.50 (y)   0.35 (y)
Dongge     6930BC–2000AD        4.21 (y)   1.63 (y)
Heshang    7520BC–2002AD        7.79 (y)   9.78 (y)
Wanxiang   192AD–2003AD         2.52 (y)   1.19 (y)

5.4 Results on the Paleo Dataset Figure 12, parts (a) and (d), shows the results of causality analysis with GLG and the Slotting technique, respectively. Our results identify two main transportation patterns. First, the edges from Dongge to the other locations, which can be interpreted as the effect of the movement of air masses from southern China to other regions via the East Asian Monsoon System (EAMS). Second, an edge from Dandak to Dongge, which shows that the Indian Monsoon System (IMS) significantly affects Dongge in southern China. The graph for the period 1250AD–1563AD is sparser than the graph for 850AD–1250AD, which can be explained by the fact that the former period is a cold period, in which air masses do not have enough energy to move from India to China, while the latter is a warm phase, in which air masses initiated in India impact the regions of southern China. During the warm period we can also see that other branches of the EAMS are more active, resulting in a denser graph.

The differences between our results and those of the Slotting technique may stem from the fact that the Slotting technique positively identifies an edge even if two time series have significant correlation at zero lag. However, by the definition of Granger causality, only the past values of one time series should help predict the other in order for it to be considered a cause. Significant correlation at zero lag can be due either to fast movement of the air masses or to production of the $\delta^{18}\mathrm{O}$ by an external source, such as changes in the strength of the sun's radiation, that impacts all the locations equally. Thus, the correlation value at zero lag is not a reliable sign for inference about the movement of the air masses.

The sparsity of the identified causality graphs can have several causes. While we cannot rule out the possibility of non-linear causal relationships between the locations, as noted in [22], the sparsity can also arise because the relationships exist either at large, millennial scales or at short, annual scales. In the former case, the relationship cannot be captured by analyzing periods that are only several centuries long. In the latter case, the resolution of the dataset is on the order of 3–4 years, which does not capture annual or biennial links.

6 Conclusion and Future Work

In this paper, we propose a nonparametric generalization of the Granger graphical models (GLG) to uncover the temporal causal structures from irregular time series. We provide a theoretical proof of the consistency of the proposed model and demonstrate its effectiveness on four simulation datasets and one real-application dataset. For future work, we are interested in investigating how to design effective kernels for the GLG model and in extending the algorithm to large-scale data analysis.


Acknowledgement

We thank Umaa Rebbapragada from JPL for discussing the problem with us, Kira Rehfeld from PIK for sharing the Paleo dataset with us, and the anonymous reviewers for their valuable comments. This research was supported by NSF research grant IIS-1134990.

References

[1] H. Akaike. A new look at the statistical modelidentification. IEEE Transactions on AutomaticControl, 19(6):716–723, Dec. 1974.

[2] A. Arnold, Y. Liu, and N. Abe. Temporal causalmodeling with graphical granger methods. In KDD’07, page 66, New York, New York, USA, Aug. 2007.ACM Press.

[3] P. Babu and P. Stoica. Spectral analysis of nonuni-formly sampled data - a review. Digital Signal Pro-cessing, 20(2):359–378, Mar. 2010.

[4] L. Barnett, A. B. Barrett, and A. K. Seth. Grangercausality and transfer entropy are equivalent forGaussian variables. Oct. 2009.

[5] A. Beck and M. Teboulle. A Fast IterativeShrinkage-Thresholding Algorithm for Linear In-verse Problems. SIAM Journal on Imaging Sciences,2(1):183, Jan. 2009.

[6] M. Berkelhammer, A. Sinha, M. Mudelsee,H. Cheng, R. L. Edwards, and K. Cannariato. Per-sistent multidecadal power of the Indian SummerMonsoon. Earth and Planetary Science Letters,290(1-2):166–172, 2010.

[7] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition, Sept. 1999.

[8] G. E. P. Box and G. M. Jenkins. Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco, 1970.

[9] A. Brovelli, M. Ding, A. Ledberg, Y. Chen, R. Naka-mura, and S. L. Bressler. Beta oscillations in a large-scale sensorimotor cortical network: directional in-fluences revealed by Granger causality. Proceedingsof the National Academy of Sciences of the UnitedStates of America, 101(26):9849–54, June 2004.

[10] J. C. Cuevas-Tello, P. Tino, S. Raychaudhury,X. Yao, and M. Harva. Uncovering delayed patternsin noisy and irregularly sampled time series: an as-tronomy application. Pattern Recognition, 43(3):36,Aug. 2009.

[11] J. B. Elsner. Granger causality and Atlantic hurri-canes. Tellus - Series A: Dynamic Meteorology andOceanography, 59(4):476–485, 2007.

[12] G. Foster. Wavelets for period analysis of unevenlysampled time series. The Astronomical Journal,112:1709, Oct. 1996.

[13] C. W. J. Granger. Investigating causal relationsby econometric models and cross-spectral methods.Econometrica, 37(3):pp. 424–438, 1969.

[14] W. K. Harteveld, R. F. Mudde, and H. E. A. VanDen Akker. Estimation of turbulence power spectrafor bubbly flows from Laser Doppler Anemometrysignals. Chemical Engineering Science, 60(22):6160–6168, 2005.

[15] C. Hu, G. M. Henderson, J. Huang, S. Xie, Y. Sun,and K. R. Johnson. Quantification of HoloceneAsian monsoon rainfall from spatially separated caverecords. Earth and Planetary Science Letters, 266(3-4):221–232, Feb. 2008.

[16] D. M. Kreindler and C. J. Lumsden. The effects ofthe irregular sample and missing data in time seriesanalysis. Nonlinear dynamics, psychology, and lifesciences, 10(2):187–214, Apr. 2006.

[17] T. La Fond and J. Neville. Randomization tests fordistinguishing social influence and homophily effects.In WWW ’10, pages 601–610, New York, NY, USA,2010. ACM.

[18] N. Meinshausen and P. Buhlmann. High-Dimensional Graphs and Variable Selection with theLasso. The Annals of Statistics, 34(3):1436–1462,June 2006.

[19] D. Mondal and D. B. Percival. Wavelet varianceanalysis for gappy time series. Annals of the Insti-tute of Statistical Mathematics, 62(5):943–966, Sept.2008.

[20] G. Nolte, A. Ziehe, V. V. Nikulin, A. Schlogl,N. Kramer, T. Brismar, and K.-R. Muller. Ro-bustly estimating the flow direction of informationin complex physical systems. Physical review letters,100(23):234101, June 2008.

[21] C. Diks and V. Panchenko. Modified Hiemstra-Jones test for Granger non-causality. Computing in Economics and Finance 2004 192, Society for Computational Economics, 2004.

[22] K. Rehfeld, N. Marwan, J. Heitzig, and J. Kurths.Comparison of correlation analysis techniques forirregularly sampled time series. Nonlinear Processesin Geophysics, 18(3):389–404, 2011.

[23] S. Ryali, K. Supekar, T. Chen, and V. Menon. Mul-tivariate dynamical systems models for estimatingcausal interactions in fMRI. NeuroImage, 54(2):807–23, Jan. 2011.

[24] J. D. Scargle. Studies in astronomical time seriesanalysis. I - Modeling random processes in the timedomain. The Astrophysical Journal SupplementSeries, 45(Jan):1–71, 1981.

[25] M. Schulz and K. Stattegger. Spectrum: spectralanalysis of unevenly spaced paleoclimatic time series.Computers & Geosciences, 23(9):929–945, 1997.

[26] P. Stoica, P. Babu, and J. Li. New method of sparse parameter estimation in separable models and its use for spectral analysis of irregularly sampled data. IEEE Transactions on Signal Processing, 59(1):35–47, Jan. 2011.

[27] P. Stoica, J. Li, and H. He. Spectral analysis of nonuniformly sampled data: A new approach versus the periodogram. IEEE Transactions on Signal Processing, 57(3):843–858, Mar. 2009.

[28] W. Sweldens. The Lifting Scheme: A Constructionof Second Generation Wavelets. SIAM Journal onMathematical Analysis, 29(2):511, Mar. 1998.

[29] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, Apr. 2004.

[30] P. A. Valdes-Sosa, J. M. Sanchez-Bornot, A. Lage-Castellanos, M. Vega-Hernandez, J. Bosch-Bayard,L. Melie-Garcıa, and E. Canales-Rodrıguez. Es-timating brain functional connectivity with sparsemultivariate autoregression. Philosophical transac-tions of the Royal Society of London. Series B, Bio-logical sciences, 360(1457):969–81, May 2005.

[31] Y. Wang, H. Cheng, R. L. Edwards, Y. He, X. Kong,Z. An, J. Wu, M. J. Kelly, C. A. Dykoski, and X. Li.The Holocene Asian monsoon: links to solar changesand North Atlantic climate. Science (New York,N.Y.), 308(5723):854–7, May 2005.

[32] L. Wu, C. Faloutsos, K. P. Sycara, and T. R. Payne.Falcon: Feedback adaptive loop for content-basedretrieval. In VLDB ’00, VLDB ’00, pages 297–306,San Francisco, CA, USA, 2000. Morgan KaufmannPublishers Inc.

[33] P. Zhang, H. Cheng, R. L. Edwards, F. Chen,Y. Wang, X. Yang, J. Liu, M. Tan, X. Wang, J. Liu,C. An, Z. Dai, J. Zhou, D. Zhang, J. Jia, L. Jin, andK. R. Johnson. A test of climate, sun, and culturerelationships from an 1810-year Chinese cave record.Science (New York, N.Y.), 322(5903):940–2, Nov.2008.

[34] Y. Zhu and D. Shasha. Warping indexes withenvelope transforms for query by humming. InSIGMOD ’03, SIGMOD ’03, pages 181–192, NewYork, NY, USA, 2003. ACM.

7 Appendix

Proof of Theorem 4.1. The error variables $z^{(i)}_\ell$ are zero-mean Gaussian due to the zero-mean Gaussian assumption on the distributions of $X^{(i)}$ and $\tilde{X}^{(i)}$. Following the proof in [18], construct the following correlation functions:

(7.14)  $G_{k,\ell}(a) = -2n^{-1} \Big\langle X^{(i)}_t - \sum_j X^{(j)} a_{i,j},\; X^{(k)}_\ell \Big\rangle$

and

(7.15)  $\tilde{G}_{k,\ell}(a) = -2n^{-1} \Big\langle \tilde{X}^{(i)}_t - \sum_j \tilde{X}^{(j)} a_{i,j},\; \tilde{X}^{(k)}_\ell \Big\rangle.$

(a) If $\mathbb{E}[z^{(i)}_p x^{(i)}_q] = 0$ for all $i = 1, \ldots, N$ and $p, q = t_n, \ldots, t_n - L\Delta t$ with $p \neq q$, the inner products in Equations (7.14) and (7.15) are the same, and the sufficiency part of Lemma A.1 in [18] guarantees that $a_{k,\ell} = \tilde{a}_{k,\ell}$. This proves that the neighborhoods inferred by analysis of the repaired time series are equal to the ones obtained by analysis of the regular time series. Now, using Assumptions 1–6 and an application of Theorems 1 and 2 in [18], the proof is concluded. □

(b) Let, w.l.o.g., $\tilde{a}_{k,\ell} > 0$ for some $(k, \ell)$. The necessity part of Lemma A.1 in [18] guarantees that $\tilde{G}_{k,\ell}(\tilde{a}) = -\lambda$. It is clear that, by taking the error variables $Z$ in Equation (7.15) into account, we can have $G_{k,\ell}(\tilde{a}) > -\lambda$. (Note that we cannot show that $G_{k,\ell}(\tilde{a}) > -\lambda$ happens only with diminishing probability.) Now the sufficiency condition is not satisfied, and we are no longer guaranteed to have $a_{k,\ell} = \tilde{a}_{k,\ell}$. Moreover, if the solution of Problem (7.15) is not unique and $-\lambda < G_{k,\ell}(\tilde{a}) < \lambda$, then $a_{k,\ell} = 0$ and the solutions of Problems (7.14) and (7.15) are indeed different. □