Faculté des Sciences
Département de Mathématique
Inference for stationary functional time series:
dimension reduction and regression
Łukasz KIDZIŃSKI
Thesis presented in fulfilment of the requirements for the degree of Docteur en Sciences,
orientation: Statistics
Supervisor: Siegfried Hörmann
Jury: Maarten Jansen, Davy Paindaveine, Thomas Verdebout, Laurent Delsol, Piotr Kokoszka
September 2014
“Simplicity is the final achievement.
After one has played a vast quantity of notes and more
notes, it is simplicity that emerges as the crowning reward
of art.”
Fryderyk Chopin
in If Not God, Then What? by Joshua Fost.
Acknowledgements
First and foremost, my heartfelt thanks to Professor Siegfried Hörmann who, throughout three
years, taught me how to be a great scientist and a better person. I would like to thank him for all
these long hours spent in front of the blackboard, for the passage from hard theoretical problems to
neat and valuable solutions, for his incredible precision and attention to detail which finally drove
me to be more careful, for his constantly positive attitude, charisma and expertise which will always
be a unique example for me, and for the trust he gave me by letting me follow my own paths. It is a
great honour and privilege to be his first PhD student.
My gratitude is also extended to Piotr Kokoszka, for the support he showed and keeps showing
me from the moment we met, for his hospitality and the priceless opportunity to work together at
Colorado State University.
My thanks go to David Brillinger as well, who found time to share his experience with me at
UC Berkeley, despite the many obstacles.
My sincere thanks also go out to Cheng Soon Ong for accommodating me in the challenging
environment of the ETH Zurich.
My thanks also go to my thesis committee for their guidance and our yearly recaps: to Davy
Paindaveine for his sharp remarks and exceptional humor, to Maarten Jansen for giving a great
example of scientific commitment, to Pierre Patie for valuable remarks at the beginning of my work,
and to Thomas Bruss for sharing his experience through countless stories and digressions during
lunches and coffee breaks. Likewise, my thanks to the other members of my jury, Thomas Verdebout
and Laurent Delsol, for accepting my request and for their time.
Next, I would like to acknowledge the Communauté française de Belgique, for the grant within the
Actions de Recherche Concertées (2010–2015), and the Belgian Science Policy Office, for the grant
within the Interuniversity Attraction Poles (2012–2017). Thanks for the indispensable means which
allowed me to spend three years on my project.
Furthermore, I am aware that a scientific journey starts much earlier than in a doctoral school.
I would not be who I am without all the support from teachers starting from my childhood up till
now – I know that this thesis is not just mine, but their success too. In particular, I would like to
thank my primary school teacher Krzysztof Lukasiewicz and my high school teacher Jerzy Konarski
who taught me to enjoy mathematics.
Many fellow students and faculty members also supported me at the Université Libre de Bruxelles.
A great thank you to my office colleague Rémi for all the necessary breaks for random mathematical
problems, for refreshing algorithmic competitions, for chess games and simple discussions about the
essence of the universe. Thanks to my second office colleague Rabih, and my neighbours Sarah, Carine,
Stavros, Dominik, Germin and Christophe, fellow students Robson and Isabel and many others, for
teaching me French and for maintaining my sanity through chats, dinners, jogging and more.
Thanks to the whole Gauss’oh Fast team, for the taste of victory and to the BSSM co-organisers,
Julien, Julie, Patrick, Yves, Thomas, Nicolas and others for quite the same reason.
I am also honoured by the support from outside of the university. Thanks to Daniel, Bella,
Felipe, Astrid, Senna, Thiago, Wolney, Anna, Omid, Maryam, and Sarah for enriching discussions
about science, politics, economics and any sort of regular gossip during Friday’s dinners. Thanks to
Jan, Dominika and Michał for being there whenever I needed help. Thanks to my fantastic Polish
friends, Sebastian for his persistence, Karol for finding time for me no matter what, Natalia who
makes me remember I can achieve everything and Kinga for her exceptional life attitude. Thanks
to Leo for his constant positive thinking.
Thanks to my family, to my mother and sister who taught me the value of time, who always
believed in me and who will always protect me, to my father who was always motivating me to reach
for more.
Last, but certainly not least, I must acknowledge with tremendous and deep gratitude my lovely
Magda, for her limitless smiles, trust and support for all my ideas and decisions no matter how
crazy they seem. Together we are a team and for such a team every challenge is feasible.
Contents
Acknowledgements i
Table of contents 1
Introduction 2
1 Functional data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Brief overview of functional data research . . . . . . . . . . . . . . . . . . . . 4
1.3 Hilbert spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Representation and fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6 Dimension reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Functional Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Stationarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Model approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Lp-m-approximability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Mixing conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5 Cumulant condition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Linear models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3 Frequency domain methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4 Objectives and structure of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 13
I A Note on Estimation in Hilbertian Linear Models 15
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2 Estimation of Ψ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 The estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Consistency results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Applications to functional time series . . . . . . . . . . . . . . . . . . . . . . 23
3 Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.1 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.2 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6 Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
II Estimation in functional lagged regression 38
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2 Model specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3 Estimation of the impulse response operators . . . . . . . . . . . . . . . . . . . . . . 43
4 Consistency of the estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5 Assessment of the performance in finite samples . . . . . . . . . . . . . . . . . . . . . 47
5.1 Data generating processes and numerical implementation of the estimators . 47
5.2 Simulation settings and results . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.1 Auxiliary lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.2 Proofs of Lemma 1 and Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . 51
1 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
1.1 Relation to ordinary functional regression . . . . . . . . . . . . . . . . . . . . 55
1.2 Description of the FPE approach . . . . . . . . . . . . . . . . . . . . . . . . . 56
1.3 Proofs of Lemma 6 and Proposition 1 . . . . . . . . . . . . . . . . . . . . . . 57
A Dynamic Functional Principal Components 62
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2 Illustration of the method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3 Methodology for L2 curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.1 Notation and setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.2 The spectral density operator . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.3 Dynamic FPCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.4 Estimation and asymptotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4 Practical implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5 A real-life illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6 Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Appendices 85
A General methodology and proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
A.1 Fourier series in Hilbert spaces. . . . . . . . . . . . . . . . . . . . . . . . . . . 86
A.2 The spectral density operator . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
A.3 Functional filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
A.4 Proofs for Section 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
B Large sample properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
C Technical results and background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
C.1 Linear operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
C.2 Random sequences in Hilbert spaces . . . . . . . . . . . . . . . . . . . . . . . 94
C.3 Proofs for Appendix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
General Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Introduction
Continuous advances in data collection and storage techniques allow us to observe and record
real-life processes in great detail. Examples include financial transaction data, fMRI images, satellite
photos, the distribution of the Earth's pollution over time, etc. Due to the high dimensionality of such data,
classical statistical tools become inadequate and inefficient. The need for new methods emerges, and
one of the most prominent techniques in this context is functional data analysis (FDA).
The main objective of this work is to analyze temporal dependence in FDA. Such dependence
occurs, for example, if the data consist of a continuous time process which has been cut into segments,
days for instance. We are then in the context of so-called functional time series.
Many classical time series problems arise in this new setup, like modeling or prediction. In
this work we will be concerned mainly with regression and dimension reduction, comparing
time–domain methods with frequency–domain methods.
In this chapter, we further discuss the motivational examples and introduce the articles upon which
this thesis is based.
1 Functional data analysis
1.1 Motivation
The main concern of statistics is to obtain essential information from a sample of observations
X1, X2, ..., XN. We are given a finite sample of size N ∈ N, where the observations Xi can be scalars, vectors
or more complex objects, like genotypes, fMRI scans or images.
Functional data analysis deals with observations which can be naturally expressed as functions.
Figures 1, 2 and 3 present several cases from different areas of science which fit into the framework
of functional data analysis.
When we deal with a physical process, it is often natural to assume that it behaves in a
continuous manner and that the observations do not oscillate significantly between the measurements.
Although, in the digital era, we rarely record analog processes continuously, we often have enough
data points that interpolation does not cause a significant measurement error. Models incorporating
this additional structure can lead to more precise and meaningful findings. In this context, FDA
can be seen as a tool which embeds the continuity feature into the model.
On the other hand, besides providing a good approximation of a continuous process, FDA can also
prove useful in a noisy, discontinuous case. Then FDA can serve as a tool for denoising and
smoothing the data, and is beneficial whenever the underlying process is the main concern.
From a pragmatic perspective, functional data can be seen simply as infinite-dimensional
vectors, with an extended notion of variance and mean, and thus we may be tempted to employ
classical multivariate techniques. However, there are many practical and theoretical problems that
need to be addressed. For example, in the context of linear models, the inversion of the (infinite-
dimensional) covariance operator is not straightforward and needs to be treated carefully, from
both the theoretical and the practical perspective. This issue, together with our novel approach to the
Figure 1: Berkeley Growth Data: heights of 20 girls taken from ages 0 through 18 (left). The growth process is easier to visualize in terms of acceleration (right). Tuddenham and Snyder [49] and Ramsay and Silverman [43]
Figure 2: Lower lip movement (top), acceleration (middle) and EMG of a facial muscle (bottom) of a speaker pronouncing the syllable “bob” for 32 replications. Malfait, Ramsay, and Froda [32]
Figure 3: Projections of DNA minicircles on the planes given by the principal axes of inertia (three panels on the left side: TATA curves; right: CAP curves). Mean curves are plotted in white. Panaretos, Kraus and Maddocks [36]
classical functional regression problem, is the topic of Chapter 1.
The FDA approach is also useful for a parsimonious representation of the data, taking advantage
of their smoothness. Instead of looking at a function as a dense vector of values, we can often
represent it as a linear combination of a handful of (well-chosen) basis functions.
Finally, there are also advantages in the FDA approach which stem from the structure of the
data. For example, one of the drawbacks of the acclaimed multivariate Principal Component Anal-
ysis (PCA) is its scale dependence. It makes no sense to rescale a function componentwise (with
different scaling factors at different arguments), and hence for the functional counterpart of PCA,
Functional Principal Component Analysis (FPCA), the lack of scale-invariance is not an issue.
A detailed introduction to functional principal components is given in Section 1.6. In Chapter 3
we describe an extension of the technique benefiting from the time–dependent framework.
1.2 Brief overview of functional data research
One of the most influential works in the field of FDA is the seminal book by Ramsay and
Silverman [43]. Together with the accompanying R and Matlab libraries, which significantly facilitate
both research and practice in the area, it is a main reference in the field. Many important results were mapped
from the multivariate case, often taking advantage of the unique features of functional objects,
whereas others, like the analysis of derivatives, were derived uniquely in this setting.
As a running example, Ramsay and Silverman [43] consider growth curves of 10 girls measured
at a set of 31 ages. They argue that statistics obtained on derivatives can be more informative than
the classical analysis of the curves themselves, performed earlier by Tuddenham and Snyder [49].
Practical applications of functional data analysis are spread across many areas of science and
engineering. Panaretos et al. [36] use closed curves [0, 1] → R3 to analyze the behavior of DNA
minicircles, providing a testing methodology for the comparison of two classes of curves. Aston and
Kirch [2] analyze stationarity and change point detection for functional time series, with appli-
cations to fMRI data. Hadjipantelis et al. [18] analyze the Mandarin language using functional principal
components. Functional time series also naturally emerge in financial applications – Kokoszka and
Reimherr [29] analyze the predictability of the shape of intraday price curves.
From the theoretical perspective, Berkes et al. [5] extensively studied the problem of change
points within a set of functional observations, whereas Horváth et al. [26] recently investigated
testing for stationarity. Many multivariate techniques were extended to the infinite-dimensional
setup, like functional dynamic factor models [20] or functional depth [31].
These works are only a fraction of the ongoing research, and for a more complete survey on
applications and theory we refer to the books [43], [16], [25] and [6].
1.3 Hilbert spaces
For most of the results presented in this work we only require that the functional space is a separable
Hilbert space, i.e. a complete inner product space with a countable basis. This allows us to state
more general results, so that the space of square integrable functions L2([a, b]), a < b, is a special
case.
Although most of our examples concern real-valued functions defined on a finite interval,
one should keep in mind other possible applications, including, for example, multivariate functions,
images or audio files, as described in Section 1.2.
1.4 Notation
Let H1, H2 be two (not necessarily distinct) separable Hilbert spaces. We denote by L(Hi, Hj),
(i, j ∈ {1, 2}), the space of bounded linear operators from Hi to Hj. Further we write 〈·, ·〉H
for the inner product on a Hilbert space H and ‖x‖H = √〈x, x〉H for the corresponding norm.
For Φ ∈ L(Hi, Hj) we denote by ‖Φ‖L(Hi,Hj) = sup_{‖x‖Hi ≤ 1} ‖Φ(x)‖Hj the operator norm and by

‖Φ‖S(Hi,Hj) = ( ∑_{k=1}^{∞} ‖Φ(ek)‖²Hj )^{1/2},

where e1, e2, ... ∈ Hi is any orthonormal basis (ONB) of Hi, the Hilbert–Schmidt norm of Φ. It is
well known that this norm is independent of the choice of the basis. Furthermore, with the inner
product 〈Φ, Θ〉S(H1,H2) = ∑_{k≥1} 〈Φ(ek), Θ(ek)〉H2, the space S(H1, H2) is again a separable
Hilbert space. To simplify the notation we use Lij instead of L(Hi, Hj) and, in the same spirit,
Sij, ‖·‖Lij, ‖·‖Sij and 〈·, ·〉Sij.
All random variables appearing in this work will be assumed to be defined on some common
probability space (Ω,A, P ). A random element X with values in H is said to be in LpH if νp,H(X) :=
(E‖X‖pH)1/p < ∞. More conveniently we shall say that X has p moments. If X possesses a first
moment, then X possesses a mean µ, determined as the unique element for which E〈X,x〉H =
〈µ, x〉H , ∀x ∈ H. For x ∈ Hi and y ∈ Hj let x⊗ y : Hi → Hj be an operator defined as x⊗ y(v) =
〈x, v〉y. If X ∈ L2H , then it possesses a covariance operator C, given by C = E[(X − µ)⊗ (X − µ)].
It can be easily seen that C is a Hilbert-Schmidt operator. Assume X,Y ∈ L2H . Following Bosq [6],
we say that X and Y are orthogonal (X ⊥ Y ) if EX ⊗ Y = 0. A sequence of orthogonal elements
in H with a constant mean and constant covariance operator is called H–white noise.
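To make the tensor notation concrete, here is a minimal sketch (our own illustration, not part of the thesis): after discretising Hi and Hj to finite grids, the operator x ⊗ y becomes the outer product of y and x, and its action v ↦ 〈x, v〉y is an ordinary matrix–vector product.

```python
import numpy as np

# Discretised elements x ∈ H_i, y ∈ H_j and a test vector v ∈ H_i.
x = np.array([1.0, 2.0, 0.5])
y = np.array([0.0, 1.0, -1.0])
v = np.array([2.0, -1.0, 4.0])

# Matrix representation of x ⊗ y: (x ⊗ y)(v) = <x, v> y.
T_xy = np.outer(y, x)

assert T_xy.shape == (3, 3)
assert np.allclose(T_xy @ v, np.dot(x, v) * y)
```

In the same spirit, the covariance operator C = E[(X − µ) ⊗ (X − µ)] becomes, on a grid, the ordinary covariance matrix of the discretised curves.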
1.5 Representation and fit
Since we are dealing with infinite dimensional objects we need to represent and approximate them
in a convenient way. This is important from the practical as the theoretical perspective. From the
practical point, due to the limited computer memory, we will always work with approximations. We
want to use low dimensional approximations for computational reasons.
One possibility to represent a curve is to select a sufficiently fine grid and process the
vector of values of the function on the intervals induced by the gridpoints. This approach, often
used in practice, does not benefit from the continuity of the functions.
In this work, we follow the ideas popularized by Ramsay and Silverman [43], based on basis
function expansions, most prominently the Karhunen–Loève or Fourier expansion. Let (ei)i≥1 be
an orthonormal basis of a separable Hilbert space H. Then, any element x ∈ H can be uniquely
represented as

x = ∑_{i=1}^{∞} 〈x, ei〉 ei.

Note that, by Parseval's formula,

‖x‖² = ∑_{i=1}^{∞} |〈x, ei〉|².

Since this sum is finite, for any ε > 0 there exists a d such that

∑_{i=d}^{∞} |〈x, ei〉|² < ε.

We can therefore approximate the function with arbitrary precision using only the first d
basis elements. This is consistent with intuition: if we use, for example, the Fourier
basis, then the high-frequency components are expected to be negligible.
Although the fitting and representation of functional data is an important and intensively studied
topic on its own, in this work we assume that the data are fully observed, i.e. we are given the
actual curves. For more information on fitting we refer to [43].
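The truncation argument above can be verified numerically. The following sketch (our own toy example; the test function and grid quadrature are assumptions of the sketch, not taken from the thesis) expands a smooth curve in the orthonormal Fourier basis of L²([0, 1]) and checks that the truncation error shrinks as the number of basis elements d grows.

```python
import numpy as np

def fourier_basis(d, t):
    """First d functions of the orthonormal Fourier basis of L^2([0, 1])."""
    basis = [np.ones_like(t)]                         # e_1 = 1
    k = 1
    while len(basis) < d:
        basis.append(np.sqrt(2.0) * np.sin(2.0 * np.pi * k * t))
        if len(basis) < d:
            basis.append(np.sqrt(2.0) * np.cos(2.0 * np.pi * k * t))
        k += 1
    return np.array(basis)                            # shape (d, len(t))

t = np.linspace(0.0, 1.0, 1001)
dt = t[1] - t[0]
x = np.exp(-t) * np.sin(4.0 * np.pi * t)              # a smooth "observation"

def l2_truncation_error(d):
    e = fourier_basis(d, t)
    scores = (e * x).sum(axis=1) * dt                 # quadrature for <x, e_i>
    x_d = scores @ e                                  # d-term approximation of x
    return np.sqrt(((x - x_d) ** 2).sum() * dt)       # L^2 norm of the remainder

errors = [l2_truncation_error(d) for d in (3, 9, 21)]
assert errors[0] > errors[1] > errors[2]              # tail energy shrinks with d
```

The quadrature on a fine grid stands in for the inner products 〈x, ei〉; any other orthonormal basis (e.g. estimated principal components, see Section 1.6) could be plugged in the same way.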
1.6 Dimension reduction
From a theoretical perspective a curve observation X is an intrinsically infinite–dimensional object.
Besides the choice of an appropriate basis, there is also the need for dimension reduction.
Arguably, functional principal component analysis (FPCA) is the key technique for this problem.
Like its multivariate equivalent, FPCA is based on the analysis of the covariance operator and is
concerned with finding the directions which contribute most to the variability of the observations.
Let X be a functional random variable taking values in some Hilbert space H and let C = E[X ⊗ X]
be its covariance operator. (Without loss of generality we assume, here and in many places, that
EX = 0.) For C to exist, we assume that E‖X‖² < ∞. One can show that C is a symmetric,
positive definite Hilbert-Schmidt operator and can hence, by the spectral theorem, be decomposed
as

C = ∑_{i=1}^{∞} λi ei ⊗ ei, (1)

where λ1 ≥ λ2 ≥ ... ≥ 0 are the eigenvalues of C and (ei)i∈N are the corresponding
eigenfunctions, forming an orthonormal basis of the underlying Hilbert space H.
If we pick the first d basis elements (ei)_{i=1}^{d} and project the observation X onto the space spanned
by them, we obtain the optimal d-dimensional approximation in terms of the mean square error, i.e.

E‖X − ∑_{i=1}^{d} 〈X, ei〉 ei‖² ≤ E‖X − ∑_{i=1}^{d} 〈X, e′i〉 e′i‖²,
Figure 4: Horizontal component of the magnetic field measured in one-minute resolution at the Honolulu magnetic observatory from 1/1/2001 00:00 UT to 1/7/2001 24:00 UT; 1440 measurements per day.
for any other orthonormal collection (e′i)_{1≤i≤d}. The directions ei are called the principal components
of X and the coefficients 〈X, ei〉 are called PC scores. A simple computation shows that the PC scores
are uncorrelated, which is another key feature.
We remark again that a main advantage of FPCA over the multivariate version is that scale-
invariance is not relevant. Consequently, it is much easier to interpret functional PCs and linear
combinations thereof. For a detailed theory of multivariate principal components we refer to [28], and
to [45] for the functional setup.
FPCA gained popularity in both the iid and the time-dependent setup. However, in Chapter 3 we argue
that this technique is no longer optimal for time series and may lead to misconceptions when not used
carefully. We then propose an extension of FPCA which benefits from the temporal dependence
structure.
2 Functional Time Series
In many practical situations functions are naturally ordered in time, for example when we deal
with daily observations of the stock market or with sequences of tumor scans. We are then in the
context of a so-called functional time series (FTS).
As a motivating example consider Figure 4. Here, the assumption of independence can be too
strong – values at the beginning of each day are highly correlated with those at the end of the
preceding day. Moreover, we see that big jumps are often followed by significant drops.
These and similar features may indicate significant temporal dependence, not just within a
subject, but also between different subjects (e.g. days). In this section we discuss possible frameworks
which allow us to quantify, test and use this additional information.
2.1 Stationarity
Many physical processes are known to have a time-invariant distribution. This motivates the
frequentist approach to time series, where we assume that the structure does not change in time and
we infer from estimated covariances. A functional test for stationarity was recently introduced
by Horváth et al. [26]. Non-stationary time series are also extensively studied; however, they are
beyond the scope of this work.
Let (Xt) be a series of random functions. We say that (Xt) is stationary in the strong sense if for
any h ∈ Z, k ∈ N and any sequence of indices t1, t2, ..., tk, the vectors (Xt1, ..., Xtk) and (Xh+t1, ..., Xh+tk)
are identically distributed.

We also define weak stationarity by looking only at the second-order structure of the series. We
say that (Xt) is weakly stationary if E‖Xt‖² < ∞ and

1. EXt = EX0 for each t ∈ Z, and

2. E[Xt ⊗ Xs] = E[Xt−s ⊗ X0] for each t, s ∈ Z.
2.2 Model approach
Arguably, one of the most popular models of temporal dependence is the functional autoregressive
model (FAR(p)), studied in great detail by Bosq [6]. In this model we assume that the state at time
t is a linear function of the p previous states plus an independent white noise innovation. The main
concern is the estimation of the p linear operators involved. Once the AR structure is identified, we
can profit from the explicit probabilistic structure and dynamics of the time series. We describe this
model in detail in Chapter 1.
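A minimal finite-dimensional sketch of a FAR(1) fit (our own toy illustration under a method-of-moments estimator; the operator, sample sizes and names are assumptions of the sketch, not the estimator studied in Chapter 1) looks as follows. On a grid the lag-0 covariance is an invertible matrix; in the genuinely functional case its inverse is unbounded and must be regularised, which is precisely the difficulty addressed in Chapter 1.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, burn = 10, 5000, 200                        # grid size, sample size, burn-in

# Hypothetical autoregression operator with norm < 1, so the FAR(1) is stationary.
Psi = rng.normal(size=(T, T))
Psi *= 0.6 / np.linalg.svd(Psi, compute_uv=False)[0]

X = np.zeros((N + burn, T))
for n in range(1, N + burn):
    X[n] = Psi @ X[n - 1] + rng.normal(size=T)    # X_n = Psi(X_{n-1}) + eps_n
X = X[burn:]

# Moment estimator Psi_hat = C_1 C_0^{-1}, with C_0 the lag-0 covariance
# and C_1 the lag-1 cross-covariance of the sample.
C0 = X[:-1].T @ X[:-1] / (N - 1)
C1 = X[1:].T @ X[:-1] / (N - 1)
Psi_hat = C1 @ np.linalg.inv(C0)

rel_err = np.linalg.norm(Psi_hat - Psi) / np.linalg.norm(Psi)
assert rel_err < 0.5                              # Psi is recovered reasonably well
```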
However, many time series cannot be approximated by an FAR(p) process, and the need for
more complex models arises. ARMA or GARCH-type models (cf. [21]) could serve as alternatives,
but the required theoretical foundations beyond the relatively simple autoregressions are still sparse.
Furthermore, for many time series it is not clear which model they follow. Nonetheless, time-series
procedures may still apply. It is then preferable to only impose a certain dependence structure,
rather than requiring a particular model. In the next sections we introduce three popular notions
of dependence and justify the choice of the framework employed throughout this work.
2.3 Lp-m-approximability
In this framework, weak dependence is defined by a “small” Lp distance between the process and
its approximation based on only the last m innovations. This idea is made precise in the following two
definitions.
Definition 1. Suppose (Xn)n≥1 is a random process with values in H, and let F−n = σ(..., Xn−2, Xn−1, Xn)
and F+n = σ(Xn, Xn+1, Xn+2, ...) be the σ-algebras generated by the terms up to time n and from time
n onwards, respectively. The process (Xn) is said to be m-dependent if F−n and F+(n+m) are independent.
In practice, processes usually do not have the property from Definition 1; however, they can
often be approximated by such series. This motivates the following approach to weak dependence.
Definition 2 (Hörmann and Kokoszka [23]). A random sequence (Xn)n≥1 with values in H is called
Lp–m–approximable if it can be represented as

Xn = f(δn, δn−1, δn−2, ...),

where the δi are iid elements taking values in a measurable space S and f is a measurable function
f : S∞ → H. Moreover, if δ′i are independent copies of the δi defined on the same probability space,
then for

X(m)n = f(δn, δn−1, ..., δn−m+1, δ′n−m, δ′n−m−1, ...) (2)

we have

∑_{m=1}^{∞} (E‖Xm − X(m)m‖p)^{1/p} < ∞.
Note that the independent copies in (2) are used for simplicity of the proofs; the representation
can be stated more intuitively as

X(m)n = f(δn, δn−1, ..., δn−m+1, 0, 0, ...),

leading to analogous results. Let us also stress that representation (2) is rather general and
incorporates most time series models encountered in practice. Furthermore, checking the validity of
the dependence condition reduces to p-th order moments, which is typically much simpler
than establishing the classical mixing conditions explained next.
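To see the coupling construction in action, here is a simulated sketch (a linear process of our own choosing; nothing here is an example from the thesis). It estimates the coupling error E‖Xm − X(m)m‖² by Monte Carlo for a discretised linear process Xn = ∑_{j≥0} A^j δn−j and checks that the error decays in m, as the definition requires.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20                                             # grid points per curve
A = rng.normal(size=(T, T))
A *= 0.7 / np.linalg.svd(A, compute_uv=False)[0]   # contraction, so the series converges

def coupled_pair(m, lags=60):
    """One draw of (X_n, X_n^(m)) for X_n = sum_j A^j delta_{n-j}, where
    X_n^(m) replaces innovations older than m by independent copies."""
    d = rng.normal(size=(lags, T))                 # d[j] plays the role of delta_{n-j}
    d_prime = rng.normal(size=(lags, T))           # independent copies delta'_{n-j}
    x, x_m, Aj = np.zeros(T), np.zeros(T), np.eye(T)
    for j in range(lags):
        x += Aj @ d[j]
        x_m += Aj @ (d[j] if j < m else d_prime[j])
        Aj = A @ Aj
    return x, x_m

def coupling_error(m, reps=300):
    """Monte Carlo estimate of E || X_m - X_m^(m) ||^2."""
    total = 0.0
    for _ in range(reps):
        x, x_m = coupled_pair(m)
        total += np.linalg.norm(x - x_m) ** 2
    return total / reps

e2, e5, e8 = (coupling_error(m) for m in (2, 5, 8))
assert e2 > e5 > e8                                # the coupling error decays in m
```

Since X and X(m) share the last m innovations, their difference involves only the terms A^j(δ − δ′) with j ≥ m, so the error shrinks geometrically here and the summability condition of Definition 2 holds.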
2.4 Mixing conditions
There exist numerous variants of mixing conditions. We introduce the strong mixing (or α-mixing)
condition, which is one of the most prominent ones. In the functional context it has been used, e.g.,
by Aston and Kirch [1]. For an extensive introduction to mixing we refer to Bradley [8].
In this approach we quantify and bound the dependence between the σ-fields generated by the
variables X0, X−1, ... and Xm, Xm+1, ..., for a given m ∈ N.
Definition 3. A strictly stationary process (Xj : j ∈ Z) is called strong mixing with mixing
rate rm if

sup_{A,B} |P(A ∩ B) − P(A)P(B)| = O(rm), rm → 0,

where the supremum is taken over all A ∈ σ(..., X−1, X0) and B ∈ σ(Xm, Xm+1, ...).
2.5 Cumulant condition
Another approach to quantifying weak dependence is based on so-called cumulants, which express
the higher-order cross-moment structure. In the finite-dimensional case it was popularized by Brillinger
[9]; in the context of functional time series it was recently introduced by Panaretos and Tavakoli
[37]. The k-th order cumulant kernel is given by
cum(Xt1(τ1), ..., Xtk(τk)) = ∑_{v=(v1,...,vp)} (−1)^{p−1} (p−1)! ∏_{l=1}^{p} E[ ∏_{j∈vl} Xtj(τj) ],

where the sum extends over all unordered partitions v = (v1, ..., vp) of {1, ..., k}. If we assume that
E‖X0‖2^l < ∞ for l ≥ 1, then the cumulant kernels are well defined in L2. For a given cumulant kernel
of order 2k one can define a 2k-th order cumulant operator Rt1,...,t2k−1 : L2([0, 1]^k, R) → L2([0, 1]^k, R),
defined by

Rt1,...,t2k−1 h(τ1, ..., τk) = ∫_{[0,1]^k} cum(Xt1(τ1), ..., Xt2k−1(τ2k−1), X0(τ2k)) h(τk+1, ..., τ2k) dτk+1 ... dτ2k.

We say that a time series satisfies the cumulant condition if and only if

1. E‖X0‖2^k < ∞ and ∑_{t1,...,tk−1 = −∞}^{∞} ‖cum(Xt1, ..., Xtk−1, X0)‖2 < ∞, for all k ≥ 2;

2. ‖Rt1,...,t2k−1‖1 < ∞, where ‖·‖1 is the nuclear norm (Schatten 1-norm).
2.6 Discussion
Many results for stationary processes were obtained under strong mixing conditions. However,
these conditions are hard to check in practice and they exclude some important statistical models, like, for
example, AR(1) time series with discrete innovations.
Although many concepts of dependence were developed in recent years, there is no dominant
framework. Therefore, researchers try to state results in a general setting, using only basic
results from the dependence framework wherever possible. This approach allows future scientists
to use different dependence concepts as long as these basic results hold. An example can be found in
the work of Aston and Kirch [1], where they obtain their results under both mixing conditions
and Lp-m-approximability.
In this work we take a similar approach, restricting ourselves to convergence results for eigenvectors and eigenvalues of covariance operators. We choose the Lp–m–approximability dependence structure, since all the results we require are established in Hormann and Kokoszka [23].
3 Linear models
As we have already pointed out in previous sections, many multivariate techniques have a natural
analogue in the functional data setup. In the sequel, we are concerned with linear models. They
constitute the fundamental framework in many areas of statistics, thus they are naturally of great
interest in functional data analysis.
We start by introducing the classical linear regression, allowing the variables to be dependent
in time. Next, we discuss time series models in the functional setup. Finally, we briefly discuss
advantages of frequency domain methods, which are used in Chapters 2 and 3 for exploiting the
temporal dependence structure.
3.1 Linear regression
One of the most popular frameworks in classical statistics is linear regression, where we quantify the linear dependence between two (possibly multivariate) variables X and Y. The problem of finding a relation of this type can also be addressed in FDA.
Assume the model
Yt = β(Xt) + εt, t ≥ 1 (3)
where β is a linear Hilbert-Schmidt operator from H1 to H2 and (εt) is a strong white noise sequence, independent of (Xt).
As we are concerned with functional time series, we will assume that Xt and Yt are stationary and weakly dependent. The classical case of i.i.d. Xt is also of great scientific interest; the interested reader is referred to [43] and [52].
Although functional linear regression shares many properties with its multivariate counterpart, there are again several important differences. In particular, the linear operator β : H1 → H2 is infinite dimensional, which considerably complicates the estimation. If we approach the problem
in the classical way by multiplying both sides of (3) by Xt and taking the expectation we get
CXY = βCX , (4)
where CXY is the cross-covariance operator of X and Y and CX is the covariance of X. Now, the
natural way to obtain β is by applying the inverse of CX to both sides of the equation (4), which
yields
β = CXY (CX)−1.
The main problem is that the operator (CX)−1 is no longer bounded. Indeed, the domain of (CX)−1 is only a subset D, say, of H1. To see this, note that formally

CX^{−1}(x) = Σ_{k≥1} λk^{−1} ⟨ek, x⟩ ek,

where λk and ek are the eigenvalues (tending to zero) and eigenfunctions of CX. Hence, D = {x ∈ H1 : Σ_{k≥1} ⟨x, ek⟩² λk^{−2} < ∞}. The problem can be approached by regularization. For example, one may replace CX^{−1} by a finite-dimensional approximation of the form Σ_{k≤K} λk^{−1} ek ⊗ ek, where K is a tuning parameter. This is still quite delicate when applied to the sample version: for large values of K, if we underestimate one of the small eigenvalues, its reciprocal explodes and leads to very unstable estimators. On the other hand, for small K we may get a very poor approximation of β.
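The instability just described is easy to reproduce numerically. The following sketch (our own finite-dimensional stand-in, with a diagonal covariance playing the role of CX) compares the truncated regularization Σ_{k≤K} λ̂k^{−1} êk ⊗ êk with the full empirical inverse:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 20, 200
lam = 0.5 ** np.arange(d)                   # eigenvalues tending to zero
X = rng.normal(size=(n, d)) * np.sqrt(lam)  # n observations, coordinates in the eigenbasis

C_hat = X.T @ X / n                         # empirical covariance operator
w, V = np.linalg.eigh(C_hat)
w, V = w[::-1], V[:, ::-1]                  # empirical eigenvalues, decreasing order

def truncated_inverse(K):
    """Regularized inverse built from the K largest empirical eigenpairs."""
    return (V[:, :K] / w[:K]) @ V[:, :K].T

# The truncated inverse stays tame, while the full inverse explodes because
# the reciprocals of the tiny (badly estimated) eigenvalues blow up.
norm_small_K = np.linalg.norm(truncated_inverse(3), 2)
norm_full = np.linalg.norm(truncated_inverse(d), 2)
```

Here K = 3 keeps the estimator on the scale of 1/λ3, while inverting all d directions multiplies any estimation error of the smallest eigenvalues by a factor of order 1/λd.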
This difficulty was addressed by Bosq [6], who gives an extensive survey of the problem. However, the proposed results are based on strong assumptions on the rate of convergence of the eigenvalues, which are impossible to check in practice. In Chapter 1 we present an alternative, data-driven approach.
Finally, note that exactly the same technique can be used for lagged linear regression. Consider

Yt = Σ_{k=0}^{m} βk(Xt−k) + εt, (5)

where m ∈ N. Let us introduce Zt = (Xt, Xt−1, ..., Xt−m) ∈ H1^m. Then the model can be written as

Yt = B(Zt) + εt, (6)

where B : H1^m → H2 is the linear operator defined by B(Zt) = Σ_{k=0}^{m} βk(Zt^{(k)}).
Now, for estimating B in (6), we can apply the same estimation procedures as for (3). This method of estimation in lagged regression models is efficient only for small dimensions and small m, as opposed to the technique discussed in Chapter 2, which gives estimates at any lag.
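In finite-dimensional coordinates the stacking argument is short: the lagged model (5) becomes an ordinary linear model in the stacked regressor Zt, and a single least-squares fit recovers all βk at once. A minimal sketch (our own illustration, with matrices standing in for the operators βk):

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, n = 4, 2, 4000                       # coordinate dimension, max lag, sample size
betas = [rng.normal(scale=0.5, size=(d, d)) for _ in range(m + 1)]

X = rng.normal(size=(n, d))
eps = 0.1 * rng.normal(size=(n, d))
# Y_t = sum_{k=0}^m beta_k(X_{t-k}) + eps_t, defined for t >= m
Y = np.stack([sum(betas[k] @ X[t - k] for k in range(m + 1)) + eps[t]
              for t in range(m, n)])

# Stacked regressor Z_t = (X_t, X_{t-1}, ..., X_{t-m})
Z = np.hstack([X[m - k:n - k] for k in range(m + 1)])

# One least-squares fit estimates B = (beta_0, ..., beta_m) jointly
B_hat = np.linalg.lstsq(Z, Y, rcond=None)[0].T
beta_hats = [B_hat[:, k * d:(k + 1) * d] for k in range(m + 1)]
```

The inefficiency noted above is visible here as well: the stacked regressor has dimension (m + 1)d, so the covariance to be inverted grows quickly with m.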
3.2 Filtering
The linear models that we consider in Chapters 2 and 3 are based on the concept of linear filtering, popular in multivariate time series analysis as well as in signal processing. For the theory and a survey of applications in this context we refer to the classical book of Oppenheim and Schafer [35].
Definition 4. We say that A = {Ak}k∈Z is a linear filter if for each k ∈ Z, Ak : H1 → H2 is a linear operator and Σ_{k∈Z} ‖Ak‖² < ∞.
Now we can extend the model (3) so that it includes the possibility of so-called lagged dependence, i.e.

Yt = Σ_{k∈Z} Ak(Xt−k) + εt. (7)
In Chapter 2 we are concerned with the estimation of the Ak and with testing the significance of these operators. Next, in Chapter 3 we consider low-dimensional filters, i.e. filters such that dim(Im(Ak)) = d, and try to find the filter which incurs the smallest information loss in terms of the mean squared error. In both cases, the results are based on Fourier analysis and the seminal work of Brillinger [9].
3.3 Frequency domain methods
In time series analysis we often deal with periodic data. This feature motivates Fourier-based methods, which allow us to discover seasonal patterns.
Suppose we are given daily data from a univariate signal-plus-noise time series, where the signal comes from a sinusoidal curve with weekly periodicity. The periodogram, a key tool in the frequency domain analysis of time series, is a simple device which allows us to detect such a seasonal pattern. In the given example it will show a spike at the frequency corresponding to the weekly period.
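As a quick numerical check of this claim, the following sketch (our own) simulates 100 weeks of daily observations and computes the periodogram with a plain FFT; the maximum lands at the weekly frequency 1/7 cycles per day:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 700                                             # 100 weeks of daily data
t = np.arange(n)
x = np.sin(2 * np.pi * t / 7) + rng.normal(size=n)  # weekly signal plus noise

# Periodogram I(f_j) = |sum_t x_t e^{-2 pi i f_j t}|^2 / n at the Fourier frequencies
fft = np.fft.rfft(x - x.mean())
periodogram = np.abs(fft) ** 2 / n
freqs = np.fft.rfftfreq(n, d=1.0)                   # in cycles per day

peak_freq = freqs[np.argmax(periodogram)]           # location of the spike
```

With unit signal amplitude the periodogram ordinate at the signal frequency is of order n/4, far above the noise ordinates, so the detection is unambiguous even at moderate sample sizes.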
The Fourier transform has two important properties which simplify the analysis of process (7). First, multiplication in the frequency domain corresponds to convolution in the time domain. Second, the Fourier transform is a bijection, so results in the frequency domain are equivalent to those in the time domain.
To illustrate the use of these features, let us multiply equation (7) by Xs for some s ∈ Z and take the expectation. By linearity we have

E[Yt Xs] = Σ_{k∈Z} Ak E[Xt−k Xs],

and by stationarity

E[Yu X0] = Σ_{k∈Z} Ak E[Xu−k X0],

where u = t − s. Now, noting that on the left we have C_u^{YX} and on the right we have the convolution of (Ak) and (C_k^X), taking the Fourier transform of both sides yields the cross-spectral operator between Yt and Xt:

F_θ^{YX} = A(θ) F_θ^X, (8)

where A(θ) = Σ_{k∈Z} Ak e^{−ikθ} is the frequency response function of the series {Ak}k∈Z and F_θ^X = (1/2π) Σ_{k∈Z} C_k^X e^{−ikθ} is the spectral density operator of Xt.
Relation (8) is fundamental for this work. In Chapter 2 we use it for the estimation of A,
from which, by taking the inverse Fourier transform, we obtain estimates for operators in (7). In
Chapter 3 we argue that A(θ) built from principal components of FXθ minimizes the information
loss among all linear filters applied to Y .
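Since the Fourier transform is a bijection, the operators Ak can be read back off the frequency response function by an inverse transform. A small numerical sketch of this step (our own; 3×3 matrices stand in for the operators, and we adopt the convention A(θ) = Σ_k Ak e^{−ikθ}, conventions in the literature differing by the sign of the exponent):

```python
import numpy as np

rng = np.random.default_rng(4)
# A short filter with matrix "operators" A_{-1}, A_0, A_1
A = {k: rng.normal(size=(3, 3)) for k in (-1, 0, 1)}

def frequency_response(theta):
    """A(theta) = sum_k A_k e^{-ik theta}."""
    return sum(A[k] * np.exp(-1j * k * theta) for k in A)

# Inverse Fourier transform: A_k = (1/2pi) int_{-pi}^{pi} A(theta) e^{ik theta} dtheta,
# approximated by a Riemann sum over an equispaced frequency grid.
thetas = np.linspace(-np.pi, np.pi, 2048, endpoint=False)
vals = np.stack([frequency_response(th) for th in thetas])

def recover(k):
    """Recover A_k from the frequency response."""
    return (vals * np.exp(1j * k * thetas)[:, None, None]).mean(axis=0).real
```

By the discrete orthogonality of the exponentials, the Riemann sum recovers each Ak essentially exactly when the filter is much shorter than the grid, which is the mechanism used in Chapter 2: estimate A(θ), then invert the transform to estimate the lag operators in (7).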
4 Objectives and structure of the thesis
This work is organized in three chapters. Each chapter constitutes a reprint of a paper, published
or submitted for publication.
The first chapter proposes a data-driven technique for the estimation of the dimension in functional AR(1) models. Although the regression problem was studied in great detail by Bosq [6], all techniques were built on very strong assumptions, impossible to check in practice. In our work we not only provide an alternative, data-driven technique but also prove its consistency without any assumptions on the convergence rates of the spectrum. We support our technique by an extensive simulation study, which reveals performance close to optimal. This chapter has been published in the Scandinavian Journal of Statistics [22].
In the second chapter we discuss the estimation of functional lagged regression models. As discussed in Section 3.1, the method described in Chapter 1 can be successfully adapted to the lagged regression problem. However, in practice the dimension of the problem can outgrow the number of observations, and this may lead to misleading results. We investigate a frequency domain method which gives consistent estimators at any arbitrarily chosen lag. Moreover, we provide a testing methodology addressing the significance of given lagged regression operators.
The third chapter extends functional principal components to the time-dependent setup. In our work we concentrate on the diagonality of the covariance matrix, one of the most important features of principal component analysis. In the time-dependent setup, lagged covariances of the principal components may fail to be diagonal, which hampers a component-wise analysis. We relax the setup from orthogonal projection to convolution and, using the frequency domain approach, we find a time-invariant linear mapping which gives a multivariate series with uncorrelated components at all leads and lags. Moreover, the resulting vector sequences explain more variance than classical PCA with the same number of components. This chapter has been published in the Journal of the Royal Statistical Society: Series B.
Chapter I
A Note on Estimation in Hilbertian Linear Models
A note on estimation in Hilbertian linear models∗
Siegfried Hormann, Lukasz Kidzinski
Department de Mathematique, Universite libre de Bruxelles (ULB), Belgium
Abstract
We study estimation and prediction in linear models where the response and the regressor variable both take values in some Hilbert space. Our main objective is to obtain consistency of a principal-components-based estimator for the regression operator under minimal assumptions. In particular, we avoid some inconvenient technical restrictions that have been used throughout the literature. We develop our theory in a time-dependent setup which comprises, as an important special case, the autoregressive Hilbertian model.
Keywords: adaptive estimation, consistency, dependence, functional regression, Hilbert spaces,
infinite-dimensional data, prediction.
1 Introduction
In this paper we are concerned with a regression problem of the form
Yk = Ψ(Xk) + εk, k ≥ 1, (I.1)
where Ψ is a bounded linear operator mapping from space H1 to H2. This model is fairly general
and many special cases have been intensively studied in the literature. Our main objective is the
study of this model when the regressor space H1 is infinite dimensional. Then model (I.1) can be
seen as a general formulation of a functional linear model, which is an integral part of functional
data literature. Its various forms are introduced in Chapters 12–17 of Ramsay and Silverman [25].
A few recent references are Cuevas et al. [11], Malfait and Ramsay [23], Cardot et al. [6], Chiou
et al. [8], Muller and Stadtmuller [24], Yao et al. [28], Cai and Hall [3], Li and Hsing [22], Hall
and Horowitz [15], Reiss and Ogden [26], Febrero-Bande et al. [13], Crambes et al. [10], Yuan and
Cai [29], Ferraty et al. [14], Crambes and Mas [9].
From an inferential point of view, a natural problem is the estimation of the ‘regression operator’ Ψ. Once an estimator Ψ̂ is obtained, we can use it in an obvious way for prediction of the responses Y. Both the estimation and the prediction problems are addressed in this paper. In existing literature,
these problems have been discussed from several angles. For example, there is the distinction between
the ‘functional regressors and responses’ model (e.g., Cuevas et al. [11]) or the perhaps more widely
studied ‘functional regressor and scalar response model’ (e.g., Cardot et al. [5]). Other papers deal
with the effect when random functions are not fully observed but are obtained from sparse, irregular
data measured with error (e.g., Yao et al. [28]). More recently, the focus was on establishing rates
of consistency (e.g., Cai and Hall [3], Cardot and Johannes [7]). The two most popular methods
∗Manuscript has been accepted for publication in Scandinavian Journal of Statistics
of estimation are based on principal component analysis (e.g., Bosq [1], Cardot et al. [5], Hall and Horowitz [15]) or spline smoothing estimators (e.g., Hastie and Mallows [16], Marx and Eilers [12], Crambes et al. [10]).
In this paper we address the estimation and prediction problem for this model when the data are fully observed, using the principal component (PC) approach. Let us explain what the new contribution is and what distinguishes our paper from previous work.
(i) The crucial difficulty for this type of problem is that the infinite-dimensional operator Ψ needs to be approximated by a sample version Ψ̂K of finite dimension K, say. Clearly, K = Kn needs to depend on the sample size and tend to ∞ in order to obtain an asymptotically unbiased estimator. In existing papers, the determination of K and the proof of consistency require, among other things, unnecessary moment assumptions and artificial restrictions concerning the spectrum of the covariance operator of the regressor variables Xk. As our main result, we complement the current literature by showing that the PC estimator remains consistent without such technical constraints. We provide a data-driven procedure for the choice of K, which may even be used as a practical alternative to cross-validation.
(ii) We allow the regressors Xk to be dependent. This is important for two reasons. First, many
examples in FDA literature exhibit dependencies as the data stem from a continuous time process,
which is then segmented into a sequence of curves, e.g., by considering daily data. Examples of this
kind include intra-day patterns of pollution records, meteorological data, financial transaction data
or sequential fMRI recordings. See, e.g., Horvath and Kokoszka [20].
Second, our framework detailed below will include the important special case of a functional
autoregressive model which has been intensively investigated in the functional literature and is often
used to model autoregressive dynamics of a functional time series. This model is analyzed in detail
in Bosq [2]. We can not only greatly simplify the assumptions needed for consistent estimation,
but also allow for a more general setup. E.g., in our Theorem 2 we show that it is not necessary
to assume that Ψ is a Hilbert-Schmidt operator if our intention is prediction. This quite restrictive
assumption is standard in existing literature, though it even excludes the identity operator.
(iii) As we already mentioned before, the literature considers different forms of functional linear
models. Arguably the most common are the scalar response and functional regressor and the func-
tional response and functional regressor case. We will not distinguish between these cases, but work
with a linear model between two general Hilbert spaces.
In the next section we will introduce notation, assumptions, the estimator and our main results.
In Section 3 we provide a small simulation study which compares our data driven choice of K with
cross-validation (CV). As we will see, this procedure is quite competitive with CV in terms of mean
squared prediction error, while it is clearly favorable to the latter in terms of computational costs.
Finally, in Section 6, we give the proofs.
2 Estimation of Ψ
2.1 Notation
Let H1, H2 be two (not necessarily distinct) separable Hilbert spaces. We denote by L(Hi, Hj), i, j ∈ {1, 2}, the space of bounded linear operators from Hi to Hj. Further, we write ⟨·, ·⟩H for the inner product on a Hilbert space H and ‖x‖²_H = ⟨x, x⟩H for the corresponding norm. For Φ ∈ L(Hi, Hj) we denote by ‖Φ‖_{L(Hi,Hj)} = sup_{‖x‖Hi ≤ 1} ‖Φ(x)‖Hj the operator norm, and by ‖Φ‖²_{S(Hi,Hj)} = Σ_{k=1}^{∞} ‖Φ(ek)‖²_{Hj}, where e1, e2, ... ∈ Hi is any orthonormal basis (ONB) of Hi, the Hilbert-Schmidt norm of Φ. It is well known that this norm is independent of the choice of the basis. Furthermore, with the inner product ⟨Φ, Θ⟩_{S(H1,H2)} = Σ_{k≥1} ⟨Φ(ek), Θ(ek)⟩H2, the space S(H1, H2) is again a separable Hilbert space. To simplify the notation we use Lij instead of L(Hi, Hj) and, in the same spirit, Sij, ‖·‖Lij, ‖·‖Sij and ⟨·, ·⟩Sij.
All random variables appearing in this paper will be assumed to be defined on some common probability space (Ω, A, P). A random element X with values in H is said to be in L^p_H if ν_{p,H}(X) := (E‖X‖^p_H)^{1/p} < ∞. More conveniently, we shall say that X has p moments. If X possesses a first moment, then X possesses a mean µ, determined as the unique element for which E⟨X, x⟩H = ⟨µ, x⟩H for all x ∈ H. For x ∈ Hi and y ∈ Hj, let x ⊗ y : Hi → Hj be the operator defined by x ⊗ y(v) = ⟨x, v⟩y. If X ∈ L²_H, then it possesses a covariance operator C, given by C = E[(X − µ) ⊗ (X − µ)]. It is easily seen that C is a Hilbert-Schmidt operator. Assume X, Y ∈ L²_H. Following Bosq [2], we say that X and Y are orthogonal (X ⊥ Y) if E[X ⊗ Y] = 0. A sequence of orthogonal elements in H with a constant mean and a constant covariance operator is called H–white noise.
2.2 Setup
We consider the general regression problem (I.1) for fully observed data. Let us collect our main
assumptions.
(A): We have Ψ ∈ L12. Further, (εk) and (Xk) are zero-mean sequences which are assumed to be L4–m–approximable in the sense of Hormann and Kokoszka [18] (see below). In addition, (εk) is H2–white noise. For any k ≥ 1 we have Xk ⊥ εk.
Here is the weak dependence concept that we impose.
Definition 5 (Hormann and Kokoszka [18]). A random sequence {Xn}n≥1 with values in H is called Lp–m–approximable if it can be represented as

Xn = f(δn, δn−1, δn−2, ...),

where the δi are i.i.d. elements taking values in a measurable space S and f is a measurable function f : S^∞ → H. Moreover, if δ′i are independent copies of the δi defined on the same probability space, then for

Xn^{(m)} = f(δn, δn−1, ..., δn−m+1, δ′n−m, δ′n−m−1, ...)

we have

Σ_{m=1}^{∞} ν_{p,H}(Xm − Xm^{(m)}) < ∞.
Evidently, i.i.d. sequences with finite p-th moments are Lp–m–approximable. This leads to the classical functional linear model. But it is also easily checked that functional linear processes fit into this framework. More precisely, if Xn is of the form

Xn = Σ_{k≥0} bk(δn−k),

where the bk : H0 → H1 are bounded linear operators such that Σ_{m≥1} Σ_{k≥m} ‖bk‖L01 < ∞ and (δn) is i.i.d. noise with ν_{p,H0}(δ0) < ∞, then Xn is Lp–m–approximable. Other (also non-linear) examples of functional time series covered by Lp–m–approximability can be found in [18].
A very important example included in our framework is the autoregressive Hilbertian model of order 1 (ARH(1)), given by the recursion Xk+1 = Ψ(Xk) + εk+1. It will be treated in more detail in Section 2.4.
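A scalar Monte Carlo sketch of Definition 5 (our own illustration): for the linear process with bk = ρ^k, the coupled copy Xm^{(m)} differs from Xm only in the innovations at lags ≥ m, so ν4(Xm − Xm^{(m)}) decays geometrically in m and is summable.

```python
import numpy as np

rng = np.random.default_rng(5)
rho, p = 0.6, 4
n_mc, K = 20000, 60          # Monte Carlo replications; truncation of the MA expansion

# X_m - X_m^(m) = sum_{k>=m} rho^k (delta_{m-k} - delta'_{m-k}): only the
# innovations at lags >= m are swapped for the independent copy.
deltas = rng.normal(size=(n_mc, K))
deltas_prime = rng.normal(size=(n_mc, K))

def nu_p(m):
    """Monte Carlo estimate of nu_p(X_m - X_m^(m))."""
    weights = rho ** np.arange(m, K)
    diff = (weights * (deltas[:, m:] - deltas_prime[:, m:])).sum(axis=1)
    return (np.abs(diff) ** p).mean() ** (1 / p)

nus = [nu_p(m) for m in range(1, 20)]   # decays like rho^m, hence summable
```

The same computation done exactly (the difference is Gaussian with variance 2ρ^{2m}/(1 − ρ²)) confirms the geometric rate; the Monte Carlo version is only meant to mirror the coupling construction in the definition.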
The notion of L4–m–approximability implies that the process is stationary and ergodic and that it has finite fourth moments. The latter is in line with existing literature: we are not aware of any article that works with fewer than 4 moments. In contrast, several consistency results assume finite moments of all orders (or even bounded random variables). Since our estimator below is a moment estimator, based on second-order moments, one could be tempted to believe that some of our results may be deduced directly from the ergodic theorem under finite second moment assumptions. We will explain in the next section, after introducing the estimator, why this line of argumentation does not work.
Our weak dependence assumption implies that a possible non-zero mean of the Xk can be estimated consistently by the sample mean. Moreover, we have (see [19])

√n ‖X̄n − µ‖H1 = OP(1).

We conclude that the mean can be accurately removed in a preprocessing step and that EXk = 0 is not a stringent assumption. Since, by Lemma 2.1 in [18], the Yk will also be L4–m–approximable, the same argument justifies studying a linear model without intercept.
2.3 The estimator
The PC-based estimator for Ψ described below was first studied by Bosq [1] and is based on a finite basis approximation. To achieve an optimal approximation in finite dimension, one chooses the eigenfunctions of the covariance operator C = E[X1 ⊗ X1] as a basis. Let ∆ = E[X1 ⊗ Y1]. By Assumption (A), both ∆ and C are Hilbert-Schmidt operators. Let (λi, vi)i≥1 be the eigenvalues and corresponding eigenfunctions of the operator C, such that λ1 ≥ λ2 ≥ .... The eigenfunctions are orthonormal, and those belonging to a non-zero eigenvalue form an orthonormal basis of the closure of Im(C), the image of C. Note that, with probability one, X lies in this closure. Since it is again a Hilbert space, we can assume that H1 equals the closure of Im(C), i.e. that the operator is of full rank. In this case all eigenvalues are strictly positive. Using the linearity of Ψ and the requirement Xk ⊥ εk from (A), we obtain

∆(vj) = E[⟨X1, vj⟩H1 Y1] = E[⟨X1, vj⟩H1 Ψ(X1)] + E[⟨X1, vj⟩H1 ε1]
      = Ψ(E[⟨X1, vj⟩H1 X1]) = Ψ(C(vj)) = λj Ψ(vj).
Then, for any x ∈ H1, the derived equation leads to the representation

Ψ(x) = Ψ( Σ_{j=1}^{∞} ⟨vj, x⟩ vj ) = Σ_{j=1}^{∞} (⟨vj, x⟩ / λj) ∆(vj). (I.2)

Here we assume implicitly that dim(H1) = ∞. If dim(H1) = M < ∞, then (I.2) still holds with ∞ replaced by M. This case is well understood and will therefore be excluded.
Equation (I.2) gives the core idea for the estimation of Ψ. We will estimate ∆, vj and λj from our sample X1, ..., Xn, Y1, ..., Yn and substitute the estimators into formula (I.2). The estimated eigenelements (λ̂j,n, v̂j,n; 1 ≤ j ≤ n) will be obtained from the empirical covariance operator

Ĉn = (1/n) Σ_{k=1}^{n} Xk ⊗ Xk.

In a similar straightforward manner we set

∆̂n = (1/n) Σ_{k=1}^{n} Xk ⊗ Yk.

For ease of notation, we will suppress in the sequel the dependence of these estimators on the sample size n.
Apparently, from a finite sample we cannot estimate the entire sequence (λj, vj); rather, we have to work with a truncated version. This leads to

Ψ̂K(x) = Σ_{j=1}^{K} (⟨v̂j, x⟩ / λ̂j) ∆̂(v̂j), (I.3)

where the choice of K = Kn is crucial. Since we want our estimator to be consistent, Kn has to grow to infinity with the sample size. On the other hand, we know that λj → 0. Hence, it will be a delicate issue to control the behavior of 1/λ̂j. A small error in the estimation of λj can have an enormous impact on (I.3).
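In finite-dimensional coordinates, estimator (I.3) is a few lines of linear algebra. The sketch below (our own code, with matrices standing in for the operators and hypothetical names) computes Ψ̂K from simulated regression pairs and can be checked against the true operator:

```python
import numpy as np

def pc_estimator(X, Y, K):
    """PC-based estimator (I.3) in coordinates: rows of X and Y are observations,
    Psi_hat_K = sum_{j<=K} lambda_hat_j^{-1} <v_hat_j, .> Delta_hat(v_hat_j)."""
    n = X.shape[0]
    C_hat = X.T @ X / n                 # empirical covariance operator
    Delta_hat = Y.T @ X / n             # empirical cross-covariance E[X (x) Y]
    lam, V = np.linalg.eigh(C_hat)
    lam, V = lam[::-1], V[:, ::-1]      # eigenvalues in decreasing order
    VK = V[:, :K]
    return Delta_hat @ (VK / lam[:K]) @ VK.T

# Sanity check on simulated data with decaying PC variances
rng = np.random.default_rng(6)
d, n = 8, 5000
Psi = rng.normal(scale=0.3, size=(d, d))
X = rng.normal(size=(n, d)) * (0.7 ** np.arange(d))
Y = X @ Psi.T + 0.1 * rng.normal(size=(n, d))
Psi_hat = pc_estimator(X, Y, K=d)
```

Note that the ⊗ convention x ⊗ y(v) = ⟨x, v⟩y makes ∆̂ the matrix (1/n) Σ Yk Xkᵀ in coordinates; here the dimension is small and all eigenvalues are well separated, so even K = d is harmless, unlike in the genuinely infinite-dimensional case discussed above.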
Define ΨK(x) = Σ_{j=1}^{K} (⟨vj, x⟩ / λj) ∆(vj). Via the ergodic theorem one can show that the individual terms λ̂j, v̂j and ∆̂ in (I.3) converge to their population counterparts. It follows that ‖Ψ̂K − ΨK‖L12 → 0 a.s., as long as K is fixed. In fact, this holds true under finite second moments. However, as is well known, the ergodic theorem does not provide rates of convergence. Even if the underlying random variables were bounded, convergence could be arbitrarily slow. Consequently, we cannot let K grow with the sample size in this approach. We need to impose further structure on the dynamics of the process and the existence of higher-order moments. Both are combined in the concept of L4–m–approximability.
In most existing papers the determination of Kn is related to the decay rate of the λj. For example, Cardot et al. [5] assume that n λ_{Kn}^4 → ∞ and n λ_{Kn}^2 / (Σ_{j=1}^{Kn} 1/αj)² → ∞, where

α1 = λ1 − λ2 and αj = min{λj−1 − λj, λj − λj+1}, j > 1. (I.4)

Similar requirements are used in Bosq [2] (Theorem 8.7) and Yao et al. [28] (Assumption (B.5)). Hall and Horowitz [15] assume in the scalar response model that αj ≥ C^{−1} j^{−α−1} and |∆(vj) λj^{−1}| ≤ C j^{−β} for some α > 1 and α/2 + 1 < β. Here C is a constant arising from the additional assumption E⟨X1, vj⟩⁴ ≤ C λj². They emphasize the importance of a sufficient separation of the eigenvalues for their result. Then, within this setup, optimal minimax bounds are proven to hold for K = n^{1/(α+2β)}. Of course, in practice this choice of K is only possible under the unrealistic assumption that we know α and β. Cai and Zhou [4] modify the approach of Hall and Horowitz [15] by proposing an adaptive choice of K which is based on a block thresholding technique. They recover the optimal rates of Hall and Horowitz [15], but need to impose further technical assumptions. Among others, the assumptions in [15] are strengthened to E‖Xk‖^p < ∞ for all p > 0, j^{−α} ≲ λj ≲ j^{−α}, and αj ≳ j^{−α−1}. Here an ≲ bn (equivalently bn ≳ an) means that lim sup_n |an/bn| < ∞. Rates of convergence are also obtained in Cardot and Johannes [7]. They propose a new class of estimators which are based on projecting on some fixed orthonormal basis instead of on the empirical eigenfunctions. Again, the accuracy of the estimator relies on a thresholding technique and, similarly to the aforementioned papers, the very strong results come at the price of several technical constraints.
2.4 Consistency results
The papers cited in the previous paragraph are focus on rates of consistency for the estimator ψK .
These important and interesting need to impose technical assumptions on the operator Ψ and the
spectrum of C. In practice, such technical conditions cannot be checked and may be violated.
Furthermore, since we have no knowledge of αj and λj , j ≥ 1, determination of K has to be done
heuristically. It then remains open if the widely used PC based estimation methods stay consistent
in the case where some of these conditions are violated. Our theorems below show that the answer
to this question is affirmative, even if data are dependent. We propose a selection of Kn which is
data driven and can thus be practically implemented. The Kn we use in first result, Theorem 1
below, is given as follows:
(K): Let mn → ∞ such that m6n = o(n). Then we define Kn = min(Bn, En,mn) where Bn =
arg maxj ≥ 1|λ−1j ≤ mn and En = arg maxk ≥ 1|max1≤j≤k α
−1j ≤ mn. Here λj and αj are the
estimates for λj and αj (given in (I.4)), respectively, obtained from C.
A discussion of the tuning parameter mn is given at the end of this section. The choice of Kn is motivated by a ‘bias-variance trade-off’ argument. If an eigenvalue is very small (in our case < 1/mn), the direction it explains has only a small influence on the representation of Xk. Therefore, excluding it from the representation of Ψ will not cause a big bias, whereas it will considerably reduce the variance. It will only be included if the sample size is big enough, in which case we can hope for a reasonable accuracy of λ̂j. In practice it is recommended to replace 1/λ̂j in the definition of Bn by λ̂1/λ̂j and 1/α̂j in the definition of En by λ̂1/α̂j, to adapt for scaling. For the asymptotics such a modification has no influence.
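Rule (K), with the recommended scale-adapted thresholds λ̂1/λ̂j ≤ mn and λ̂1/α̂j ≤ mn, takes only a few lines to implement. A sketch under our own naming, whose input is the decreasing sequence of empirical eigenvalues (assumed distinct, as in Theorem 1):

```python
import numpy as np

def select_K(lam_hat, m_n):
    """Data-driven truncation level K_n = min(B_n, E_n, m_n) of rule (K),
    using the scale-adapted ratios lam_1/lam_j and lam_1/alpha_j."""
    lam = np.asarray(lam_hat, dtype=float)   # sorted decreasingly, distinct
    J = len(lam)
    # spacings (I.4): alpha_1 = lam_1 - lam_2, alpha_j = min of neighbouring gaps
    alpha = np.empty(J - 1)
    alpha[0] = lam[0] - lam[1]
    for j in range(1, J - 1):
        alpha[j] = min(lam[j - 1] - lam[j], lam[j] - lam[j + 1])
    ok_B = lam[0] / lam <= m_n               # prefix of Trues, since lam decreases
    ok_E = lam[0] / alpha <= m_n
    B = int(ok_B.sum())
    # E_n: largest k such that the spacing condition holds for all j <= k
    E = int(np.argmax(~ok_E)) if (~ok_E).any() else len(ok_E)
    return max(1, min(B, E, int(m_n)))
```

For geometrically decaying empirical eigenvalues λ̂j = 2^{1−j} and mn = 10, for instance, the ratios give B = 4 and E = 3, so K = 3.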
Theorem 1. Consider the linear Hilbertian model (I.1) and assume that Assumptions (A) and (K) hold. Suppose further that the eigenvalues λj are mutually distinct and that Ψ is a Hilbert-Schmidt operator. Then the estimator described in Section 2.3 is weakly consistent, i.e. ‖Ψ̂Kn − Ψ‖L12 → 0 in probability as n → ∞.
It is not hard to see that consistent estimation of Ψ via the PCA approach requires compactness of the operator. As a simple example, suppose that Ψ is the identity operator, which is not Hilbert-Schmidt. Then for any ONB {vi} we have Ψ = Σ_{i≥1} vi ⊗ vi. Even if our estimators for v1, ..., vK computed from the finite sample were perfect (v̂i = vi), we would have ‖Ψ − Ψ̂K‖L12 = 1 for any K ≥ 1. This is easily seen by evaluating Ψ and Ψ̂K at vK+1.
In our next theorem we show that if our target is prediction, then we can further simplify the assumptions. In this case we will be satisfied if ‖Ψ(Xn) − Ψ̂(Xn)‖H2 is small. E.g., if ⟨Xn, v⟩ = 0 with probability one, then the direction v plays no role in describing Xn, and a larger value of ‖Ψ(v) − Ψ̂(v)‖H2 is not relevant.

Theorem 2. Let Assumption (A) hold and define the estimator Ψ̂Kn as in Section 2.3 with Kn = arg max{j ≥ 1 : λ̂1/λ̂j ≤ mn}, where mn → ∞ and mn = o(√n). Then ‖Ψ(Xn) − Ψ̂Kn(Xn)‖H2 → 0 in probability.
Remark 1. For our proof it is not important that we evaluate Ψ and Ψ̂ at Xn. We could equally well use X1, or Xn+1, or some arbitrary variable X equal in distribution to X1.

Theorem 2 should be compared to Theorem 3 in Crambes and Mas [9], where an asymptotic expansion of E‖Ψ(Xn+1) − Ψ̂k(Xn+1)‖²H2 is obtained (for fixed k). Their result implies consistency, but again requires assumptions on the decay rate of the λi, an operator Ψ that is Hilbert-Schmidt, and E‖Xk‖^p < ∞ for all p > 0. In our theorem we need no assumptions on the eigenvalues anymore, not even that they are distinct.
In the last theorem we saw that convergence holds whenever mn → ∞ and mn = o(√n). This leaves open what a good choice of the tuning parameter mn is. From a practical perspective, we believe that the importance of this question should not be overrated. Most applied researchers will use CV or some comparable method, which usually gives a K^alt_n that is presumably close to optimal. Hence, if we suppose that

E‖Ψ(Xn) − Ψ̂_{K^alt_n}(Xn)‖H2 ≲ E‖Ψ(Xn) − Ψ̂_{Kn}(Xn)‖H2 (n → ∞),

the practitioner can be sure that this approach leads to a consistent estimator under very general assumptions. In Section 3 we use mn = √n / log n for the simulations. The performance of this estimator is comparable to CV in all tested setups.
Addressing the optimality issue from a theoretical point of view seems to be very difficult and depends on the final objective: prediction or estimation. In both cases we believe that results in this direction can only realistically be obtained under regularity assumptions similar to those in the articles cited above.
2.5 Applications to functional time series
Functional time series analysis has seen an upsurge in the FDA literature, in particular forecasting in a functional setup (see, e.g., Hyndman and Shang [21] or Sen and Kluppelberg [27]). We sketch here two possible applications in this context.
FAR(1)
Of particular importance in functional time series is the ARH(1) model of Bosq [2]. We now show that our framework covers this model. With i.i.d. innovations δk ∈ L⁴_H, the process (Xk) defined via Xk+1 = Ψ(Xk) + δk+1 is L4–m–approximable if Ψ ∈ L(H, H) is such that ‖Ψ‖L(H,H) < 1; see [18]. The stationary solution for Xk has the form

Xk = Σ_{j≥0} Ψ^j(δk−j).
Setting εk = δk+1 and Yk = Xk+1, we obtain the linear model (I.1). Independence of the δk implies that Xk ⊥ εk, and hence Assumption (A) holds. Bosq [2] obtained a (strongly) consistent estimator of Ψ, assuming that Ψ is Hilbert-Schmidt and, again, imposing assumptions on the spectrum of C. In our approach we do not even need the innovations δk to be i.i.d. As long as we can assure that (δk) and (Xk) are L4–m–approximable, we only need that (δk) is H-white noise. Indeed, denoting by A* the adjoint of an operator A, we have for any x ∈ H1 and y ∈ H2 that

E[⟨Xk, x⟩H1 ⟨εk, y⟩H2] = Σ_{j≥0} E[⟨Ψ^j(δk−j), x⟩H1 ⟨δk+1, y⟩H2]
                       = Σ_{j≥0} E[⟨δk−j, (Ψ^j)*(x)⟩H1 ⟨δk+1, y⟩H2] = 0.

This shows Xk ⊥ εk, and Assumption (A) follows.
We obtain the following:

Corollary 1. Let {Xn}n≥1 be an ARH(1) process given by the recurrence equation Xn+1 = Ψ(Xn) + εn+1. Assume ‖Ψ‖L12 < 1. If {εi} is H-white noise and Assumption (A) holds, then for the estimator Ψ̂K given in Theorem 2 we have ‖Ψ(Xn) − Ψ̂K(Xn)‖H2 → 0 in probability. In particular, if {εi} is i.i.d. in L⁴_H, Assumption (A) will hold.

Corollary 2. Let {Xn}n≥1 be an ARH(1) process given by the recurrence equation Xn+1 = Ψ(Xn) + εn+1. Assume ‖Ψ‖S12 < 1 and that the covariance operator of X1 has distinct eigenvalues. If {εi} is H-white noise and (A) and (K) hold, then the estimator Ψ̂K is consistent.
We remark that, employing the usual state-space representation for FAR(p) processes, these results are easily generalized to higher-order FAR models.
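To see Corollary 1 at work numerically, the following sketch (our own, finite-dimensional) simulates an FAR(1) recursion with ‖Ψ‖ < 1, forms the regression pairs (Xk, Xk+1) and applies the PC estimator of Section 2.3 with all components retained:

```python
import numpy as np

rng = np.random.default_rng(7)
d, n = 6, 4000
Psi = rng.normal(size=(d, d))
Psi *= 0.8 / np.linalg.norm(Psi, 2)          # enforce ||Psi|| < 1 (stationarity)

noise_scale = 0.8 ** np.arange(d)            # innovations with decaying variances
X = np.zeros((n + 1, d))
for k in range(n):
    X[k + 1] = Psi @ X[k] + noise_scale * rng.normal(size=d)

# Regression pairs (X_k, Y_k) = (X_k, X_{k+1}); PC estimator with K = d
Xr, Yr = X[:-1], X[1:]
C_hat = Xr.T @ Xr / n
Delta_hat = Yr.T @ Xr / n
lam, V = np.linalg.eigh(C_hat)
lam, V = lam[::-1], V[:, ::-1]
Psi_hat = Delta_hat @ (V / lam) @ V.T

one_step_pred = Psi_hat @ X[-1]              # plug-in forecast of X_{n+1}
```

In this toy dimension the estimate is close to the true operator; the point of Theorems 1 and 2 is that the same recipe, with a data-driven truncation, remains consistent when the eigenvalues decay to zero.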
FARCH(1)
Another possible application of our results refers to a recently introduced functional version of the celebrated ARCH model (Hormann et al. [17]), which plays a fundamental role in financial econometrics. It is given by the two equations

yk(t) = εk(t) σk(t), t ∈ [0, 1], k ∈ Z,

and

σ²k(t) = δ(t) + ∫₀¹ β(t, s) y²k−1(s) ds, t ∈ [0, 1], k ∈ Z.
Without going into details, let us just mention that one can write the squared observations of a
functional ARCH model as an autoregressive process with innovations νk(t) = y2k(t) − σ2
k(t). The
new noise νk is no longer independent and hence the results of [2] are not applicable to prove
consistency of the involved estimator for the operator β. But it is shown in [17] that the innovations
of this new process form Hilbertian white noise and that the new process is L4–m–approximable.
This allows us to obtain a consistent estimator for β.
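To illustrate the model, here is a minimal simulation sketch of a functional ARCH(1) on an equidistant grid; the intercept `delta` and kernel `beta` below are arbitrary illustrative choices, not those studied in [17]:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50                                   # grid size for [0, 1]
t = np.linspace(0, 1, T)
delta = 0.1 + 0.1 * t                    # positive intercept function delta(t)
beta = 0.5 * np.outer(t, 1 - t)          # non-negative kernel beta(t, s)

n = 200
y = np.zeros((n, T))
sigma2 = np.full(T, delta.mean())        # arbitrary starting value for sigma_0^2
for k in range(n):
    eps = rng.standard_normal(T)         # error curve (white in t, for simplicity)
    y[k] = eps * np.sqrt(sigma2)
    # sigma_{k+1}^2(t) = delta(t) + int_0^1 beta(t, s) y_k^2(s) ds  (Riemann sum)
    sigma2 = delta + beta @ (y[k] ** 2) / T
```

The squared curves $y_k^2$ then follow the autoregressive recursion with innovations $\nu_k = y_k^2 - \sigma_k^2$.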
3 Simulation study
We consider a linear model of the form $Y_n = \Psi(X_n) + \varepsilon_n$, where $X_1, \varepsilon_1, X_2, \varepsilon_2, \ldots$ are mutually independent. We test the performance of the estimator in the context of prediction, i.e. we work under the setting of Theorem 2. For the simulation study we obviously have to work with finite dimensional spaces $H_1$ and $H_2$. However, because of the asymptotic nature of our results, we set the dimension relatively high and define $H_1 = H_2 = \mathrm{span}\{f_j : 0 \le j \le 34\}$, where $f_0(t) = 1$, $f_{2k-1}(t) = \sin(2\pi k t)$ and $f_{2k}(t) = \cos(2\pi k t)$ are the first 35 elements of the Fourier basis on $[0,1]$. We work with Gaussian curves $X_i(t)$ by setting
\[
X_i(t) = \sum_{j=0}^{34} A_i^{(j)} f_j(t), \tag{I.5}
\]
where $(A_i^{(0)}, A_i^{(1)}, \ldots, A_i^{(34)})'$ are independent Gaussian random vectors with mean zero and covariance $\Sigma$. This setup allows us to easily manipulate the eigenvalues $\lambda_k$ of the covariance operator $C^X = E[X \otimes X]$. Indeed, if we define $\Sigma = \mathrm{diag}(a_1, \ldots, a_{35})$ with $a_1 \ge a_2 \ge \cdots \ge a_{35}$, then $\lambda_k = a_k$ and $v_k = f_{k-1}$ is the corresponding eigenfunction. We test three sets of eigenvalues $(\lambda_k)_{1\le k\le 35}$:
• Λ1: $\lambda_k = c_1 \rho^{k-1}$ with $\rho = 1/2$ [geometric decay];
• Λ2: $\lambda_k = c_2 / k^2$ [fast polynomial decay];
• Λ3: $\lambda_k = c_3 / k^{1.1}$ [slow polynomial decay].
To bring our data to the same scale and make results under different settings comparable, we choose $c_1$, $c_2$ and $c_3$ such that $\sum_{k=1}^{35} \lambda_k = 1$. This implies $E\|X_i\|^2 = 1$ in all settings. The noise $\varepsilon_k$ is also assumed to be of the form (I.5), but now with $E\|\varepsilon_i\|^2 = \sigma^2 \in \{0.25, 1, 2.25, 4\}$.
We test three operators, all of the form $\Psi(x) = \sum_{i=1}^{35} \sum_{j=1}^{35} \psi_{ij} \langle x, v_i\rangle v_j$:
• Ψ1: for $1 \le i, j \le 35$ we set $\psi_{ii} = 1$ and $\psi_{ij} = 0$ when $i \neq j$;
• Ψ2: the coefficients $\psi_{ij}$ are generated as i.i.d. standard normal random variables;
• Ψ3: for $1 \le i, j \le 35$ we set $\psi_{ij} = \frac{1}{ij}$.
We standardize the operators such that the operator norm equals one. The operators Ψ2 are generated once and then fixed for the entire simulation. We generate samples of size $n + 1 = 80 \cdot 4^\ell + 1$, $\ell = 0, \ldots, 4$. Estimation is based on the first $n$ observations. We run 200 simulations for each setup
(Λ,Ψ, σ, n). As a performance measure for our procedure the mean squared error on the (n+ 1)-st
observation,
\[
\mathrm{MSE} = \frac{1}{200} \sum_{k=1}^{200} \big\| \Psi(X_{n+1}^{(k)}) - \hat\Psi(X_{n+1}^{(k)}) \big\|_{H_2}^2, \tag{I.6}
\]
is used. Here $X_i^{(k)}$ is the $i$-th observation of the $k$-th simulation run.
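The data-generating mechanism (I.5) under setting Λ1 can be sketched as follows; the evaluation grid and the sample size are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 35
t = np.linspace(0, 1, 100)

# Fourier system f_0(t) = 1, f_{2k-1}(t) = sin(2*pi*k*t), f_{2k}(t) = cos(2*pi*k*t)
F = np.ones((d, t.size))
for k in range(1, d // 2 + 1):
    F[2 * k - 1] = np.sin(2 * np.pi * k * t)
    F[2 * k] = np.cos(2 * np.pi * k * t)

# setting Lambda_1: lambda_k = c1 * (1/2)^(k-1), normalized so that sum(lam) == 1
lam = 0.5 ** np.arange(d)
lam /= lam.sum()

n = 320
A = rng.standard_normal((n, d)) * np.sqrt(lam)    # mean zero, Cov = diag(lam)
X = A @ F                                         # curves X_i(t) as in (I.5)
```

Replacing `lam` by the renormalized sequences $c_2/k^2$ or $c_3/k^{1.1}$ gives settings Λ2 and Λ3.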
Now we compute the median truncation level $K$ obtained from our data-driven procedure described in Theorem 2 with $m_n = n^{1/2}/\log n$. We compare it to the median truncation level obtained by cross-validation ($K_{CV}$) on the same data. To this end, we divide the sample into training and test sets in proportion $(n - n_{\mathrm{test}}) : n_{\mathrm{test}}$, where $n_{\mathrm{test}} = \max\{n/10, 100\}$. The estimator is obtained from the training set for different truncation levels $k = 1, 2, \ldots, 35$. Then, from the test set we determine
\[
K_{CV} = \operatorname*{arg\,min}_{k \in \{1,\ldots,35\}} \sum_{\ell=n-n_{\mathrm{test}}}^{n} \big\| Y_{\ell+1} - \hat\Psi_k(X_\ell) \big\|_{H_2}^2.
\]
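The two truncation rules can be sketched on a finite dimensional example as follows. The helper `fpca_estimator` and the selection of `K` below are simplified illustrations: `K` implements only the eigenvalue threshold $\hat\lambda_k \ge 1/m_n$ and ignores the eigenvalue-gap condition entering (K):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, n_test = 10, 500, 50
lam = 0.5 ** np.arange(d)
lam /= lam.sum()
Psi = rng.standard_normal((d, d))
Psi /= np.linalg.norm(Psi, 2)                       # operator norm one, as for Psi_2
X = rng.standard_normal((n, d)) * np.sqrt(lam)      # PC scores of the regressors
Y = X @ Psi.T + 0.5 * rng.standard_normal((n, d))   # Y_i = Psi(X_i) + eps_i

def fpca_estimator(X, Y, k):
    """Truncated estimator Psi_hat_k = sum_{j<=k} Delta_hat(v_j) <v_j, .> / lam_j."""
    m = X.shape[0]
    C = X.T @ X / m                                 # empirical covariance operator
    Delta = Y.T @ X / m                             # empirical cross-covariance
    w, V = np.linalg.eigh(C)
    w, V = w[::-1], V[:, ::-1]                      # eigenvalues in decreasing order
    return (Delta @ V[:, :k] / w[:k]) @ V[:, :k].T

# data-driven truncation (simplified): largest k with lam_hat_k >= 1/m_n
m_n = np.sqrt(n) / np.log(n)
lam_hat = np.sort(np.linalg.eigvalsh(X.T @ X / n))[::-1]
K = int(np.sum(lam_hat >= 1.0 / m_n))

# cross-validated truncation on a held-out test block
Xtr, Ytr, Xte, Yte = X[:-n_test], Y[:-n_test], X[-n_test:], Y[-n_test:]
cv_err = [np.sum((Yte - Xte @ fpca_estimator(Xtr, Ytr, k).T) ** 2)
          for k in range(1, d + 1)]
K_cv = 1 + int(np.argmin(cv_err))
```

Working directly with coefficient vectors mirrors the simulation design above, where the curves are finite Fourier expansions.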
The MSE and the size of K and KCV are shown for different constellations in Table 1. We display
the results only for σ = 1. Not surprisingly, the larger the variance of the noise, the larger the MSE; otherwise our findings were the same across all constellations of σ. The table shows that the
choice of K proposed by our method results in an MSE which is competitive with CV. We also see
that an optimal choice of K cannot be solely based on the decay of the eigenvalues as it is the case
in our approach. It clearly also depends on the unknown operator itself. Not surprisingly, the best
results are obtained under settings Λ1 (exponentially fast decay of eigenvalues) and Ψ3 (which is
the smoothest among the three operators).
4 Conclusion
Estimation of the regression operator in functional linear models has attracted much interest over recent years. Our objective in this paper was to show that one of the most widely applied estimators in this context remains consistent even if several of the synthetic assumptions used in previous papers are removed. If our intention is prediction, we can further simplify the technical requirements. Our approach comes with a data-driven choice of the parameter which determines the dimension of the estimator. While our main intention is to show that this choice leads to a consistent estimator, we have seen in simulations that our method performs remarkably well when compared to cross-validation.
Table 1: Truncation levels obtained by Theorem 2 (K) and by cross-validation (KCV) and corresponding MSE. For each constellation we present med(K) of 200 runs.

                 Ψ1                  Ψ2                  Ψ3
         n   KCV  MSE    K  MSE   KCV  MSE    K  MSE   KCV  MSE    K  MSE
 Λ1     80     1 1.10    2 0.96     1 0.68    2 0.69     1 0.64    2 0.66
       320     3 0.48    2 0.43     1 0.32    2 0.28     1 0.21    2 0.24
      1280     4 0.21    3 0.21     3 0.14    3 0.12     2 0.09    3 0.09
      5120     7 0.08    4 0.10     5 0.07    4 0.05     3 0.05    4 0.03
     20480     9 0.03    4 0.06     8 0.03    4 0.02     5 0.02    4 0.01
 Λ2     80     1 1.00    1 0.85     1 0.82    1 0.58     1 0.56    1 0.43
       320     2 0.56    1 0.54     1 0.26    1 0.22     1 0.20    1 0.15
      1280     5 0.26    2 0.28     2 0.14    2 0.12     1 0.07    2 0.06
      5120     9 0.13    2 0.24     5 0.08    2 0.08     3 0.04    2 0.02
     20480    17 0.06    3 0.16    10 0.04    3 0.04   5.5 0.02    3 0.01
 Λ3     80     1 1.60    2 1.30     1 0.78    1 0.73     1 0.71    1 0.57
       320     2 0.85    2 0.78     1 0.35    2 0.40     1 0.22    2 0.28
      1280     8 0.55    4 0.55     2 0.22    4 0.22     2 0.08    4 0.12
      5120    24 0.25    6 0.38     9 0.16    6 0.14     3 0.04    6 0.04
     20480    33 0.08   11 0.25    23 0.07   11 0.08     5 0.02   11 0.02
5 Proofs
Throughout this entire section we assume the setup and notation of Section 2.2.
5.1 Proof of Theorem 1
We work under Assumptions (A) and (K) and assume distinct eigenvalues of the covariance operator
C and that Ψ is Hilbert-Schmidt. The first important lemma which we use in the proof of Theorem 1
is an error bound for the estimators of the operators ∆ and C. Below we extend results in [18].
Lemma 1. There is a constant U, depending only on the law of $(X_k, Y_k)$, such that
\[
n \max\big\{ E\|\Delta - \hat\Delta_n\|_{\mathcal{S}_{12}}^2,\ E\|C - \hat C_n\|_{\mathcal{S}_{11}}^2 \big\} < U.
\]
Proof of Lemma 1. We only prove the bound for Δ; the one for C is similar. First note that, by Lemma 2.1 in [18] and Assumption (A), $(Y_k)$ is also $L^4$–$m$–approximable. Next we observe that
\[
nE\big\| \Delta - \hat\Delta_n \big\|_{\mathcal{S}_{12}}^2 = nE \bigg\| \frac{1}{n} \sum_{k=1}^n Z_k \bigg\|_{\mathcal{S}_{12}}^2,
\]
where $Z_k = X_k \otimes Y_k - \Delta$. Set $Z_k^{(r)} = X_k^{(r)} \otimes Y_k^{(r)} - \Delta$. Using the stationarity of the sequence $(Z_k)$ we obtain
\[
nE \bigg\| \frac{1}{n} \sum_{k=1}^n Z_k \bigg\|_{\mathcal{S}_{12}}^2
= \sum_{|r|<n} \Big( 1 - \frac{|r|}{n} \Big) E\langle Z_0, Z_r\rangle_{\mathcal{S}_{12}}
\le E\|Z_0\|_{\mathcal{S}_{12}}^2 + 2 \sum_{r=1}^\infty \big| E\langle Z_0, Z_r\rangle_{\mathcal{S}_{12}} \big|. \tag{I.7}
\]
By the Cauchy-Schwarz inequality and the independence of $Z_r^{(r-1)}$ and $Z_0$ we derive
\[
\big| E\langle Z_0, Z_r\rangle_{\mathcal{S}_{12}} \big|
= \big| E\langle Z_0, Z_r - Z_r^{(r-1)}\rangle_{\mathcal{S}_{12}} \big|
\le \big( E\|Z_0\|_{\mathcal{S}_{12}}^2 \big)^{1/2} \big( E\|Z_r - Z_r^{(r-1)}\|_{\mathcal{S}_{12}}^2 \big)^{1/2}.
\]
Using $\|X_0 \otimes Y_0\|_{\mathcal{S}_{12}} = \|X_0\|_{H_1} \|Y_0\|_{H_2}$ and again the Cauchy-Schwarz inequality we get
\[
E\|Z_0\|_{\mathcal{S}_{12}}^2 \le E\|X_0\|_{H_1}^2 \|Y_0\|_{H_2}^2 \le \nu_{4,H_1}^2(X_0)\, \nu_{4,H_2}^2(Y_0) < \infty.
\]
To finish the proof, we show that $\sum_{r=1}^\infty \big( E\|Z_r - Z_r^{(r-1)}\|_{\mathcal{S}_{12}}^2 \big)^{1/2} < \infty$. Using an inequality of the type $|ab - cd|^2 \le 2|a|^2 |b-d|^2 + 2|d|^2 |a-c|^2$, we obtain
\[
E\|Z_r - Z_r^{(r-1)}\|_{\mathcal{S}_{12}}^2
= E\|X_r \otimes Y_r - X_r^{(r-1)} \otimes Y_r^{(r-1)}\|_{\mathcal{S}_{12}}^2
\le 2 E\|X_r\|_{H_1}^2 \|Y_r - Y_r^{(r-1)}\|_{H_2}^2 + 2 E\|Y_r^{(r-1)}\|_{H_2}^2 \|X_r - X_r^{(r-1)}\|_{H_1}^2
\le 2 \nu_{4,H_1}^2(X_r)\, \nu_{4,H_2}^2(Y_r - Y_r^{(r-1)}) + 2 \nu_{4,H_2}^2(Y_r^{(r-1)})\, \nu_{4,H_1}^2(X_r - X_r^{(r-1)}).
\]
The convergence of the series in (I.7) now follows directly from $L^4$–$m$–approximability.
This lemma also yields bounds for the estimators of the eigenvalues and eigenfunctions of C, via the following two lemmas (see [18]).
Lemma 2. Suppose $\lambda_i$, $\hat\lambda_i$ are the eigenvalues of C and $\hat C$, respectively, listed in decreasing order. Let $v_i$, $\hat v_i$ be the corresponding eigenvectors and let $\hat c_i = \operatorname{sign}\langle v_i, \hat v_i\rangle$. Then for each $j \ge 1$,
\[
\alpha_j \|v_j - \hat c_j \hat v_j\|_{H_1} \le 2\sqrt 2\, \|C - \hat C\|_{\mathcal{L}_{11}},
\]
where $\alpha_1 = \lambda_1 - \lambda_2$ and $\alpha_j = \min\{\lambda_{j-1} - \lambda_j,\ \lambda_j - \lambda_{j+1}\}$ for $j \ge 2$.
Lemma 3. Let $\lambda_j$, $\hat\lambda_j$ be defined as in Lemma 2. Then for each $j \ge 1$,
\[
|\lambda_j - \hat\lambda_j| \le \|C - \hat C\|_{\mathcal{L}_{11}}.
\]
In the following calculations we work with finite sums of the representation in (I.2):
\[
\hat\Psi_K(x) = \sum_{j=1}^K \frac{\hat\Delta(\hat v_j)}{\hat\lambda_j} \langle \hat v_j, x\rangle. \tag{I.8}
\]
In order to prove the main result we consider the term $\|\Psi - \hat\Psi_K\|_{\mathcal{L}_{12}}$ and decompose it, using the triangle inequality, into four terms:
\[
\|\Psi - \hat\Psi_K\|_{\mathcal{L}_{12}} \le \sum_{i=1}^4 \|S_i(K)\|_{\mathcal{L}_{12}},
\]
where
\[
S_1(K) = \sum_{j=1}^K \Big( \hat c_j \hat v_j \otimes \frac{\hat\Delta(\hat c_j \hat v_j)}{\hat\lambda_j} - \hat c_j \hat v_j \otimes \frac{\Delta(\hat c_j \hat v_j)}{\hat\lambda_j} \Big), \tag{I.9}
\]
\[
S_2(K) = \sum_{j=1}^K \Big( \hat c_j \hat v_j \otimes \frac{\Delta(\hat c_j \hat v_j)}{\hat\lambda_j} - \hat c_j \hat v_j \otimes \frac{\Delta(\hat c_j \hat v_j)}{\lambda_j} \Big), \tag{I.10}
\]
\[
S_3(K) = \sum_{j=1}^K \Big( \hat c_j \hat v_j \otimes \frac{\Delta(\hat c_j \hat v_j)}{\lambda_j} - v_j \otimes \frac{\Delta(v_j)}{\lambda_j} \Big), \tag{I.11}
\]
\[
S_4(K) = \Psi - \Psi_K. \tag{I.12}
\]
The following simple lemma gives convergence of $S_4(K_n)$, provided $K_n \stackrel{P}{\to} \infty$.

Lemma 4. Let $(K_n)_{n\ge1}$ be a random sequence taking values in $\mathbb{N}$ such that $K_n \stackrel{P}{\to} \infty$ as $n \to \infty$. Then $\Psi_{K_n}$, defined by equation (I.8), converges to Ψ in probability.
Proof. Notice that since $\|\Psi\|_{\mathcal{S}_{12}}^2 = \sum_{j=1}^\infty \|\Psi(v_j)\|_{H_2}^2 < \infty$ for some orthonormal basis $(v_j)$, we can find $m_\varepsilon \in \mathbb{N}$ such that $\|\Psi - \Psi_m\|_{\mathcal{S}_{12}}^2 = \sum_{j>m} \|\Psi(v_j)\|_{H_2}^2 \le \varepsilon$ whenever $m > m_\varepsilon$. Hence
\[
P(\|\Psi - \Psi_{K_n}\|_{\mathcal{S}_{12}}^2 > \varepsilon)
= \sum_{m=1}^\infty P\big( \{\|\Psi - \Psi_m\|_{\mathcal{S}_{12}}^2 > \varepsilon\} \cap \{K_n = m\} \big)
\le P(K_n \le m_\varepsilon).
\]
The next three lemmas deal with terms (I.9)–(I.11).
Lemma 5. Let $S_1(K)$ be defined by equation (I.9) and U the constant derived in Lemma 1. Then
\[
P(\|S_1(K_n)\|_{\mathcal{L}_{12}} > \varepsilon) \le \frac{U m_n^2}{\varepsilon^2 n}.
\]
Proof. Note that for an orthonormal system $\{e_i \in H_1 : i \ge 1\}$ and any sequence $\{x_i \in H_2 : i \ge 1\}$ the following identity holds:
\[
\bigg\| \sum_{i=1}^K e_i \otimes x_i \bigg\|_{\mathcal{S}_{12}}^2
= \sum_{j=1}^\infty \bigg\| \sum_{i=1}^K \langle e_i, e_j\rangle x_i \bigg\|_{H_2}^2
= \sum_{j=1}^K \|x_j\|_{H_2}^2. \tag{I.13}
\]
Using this and the fact that the Hilbert-Schmidt norm bounds the operator norm, we derive
\[
P(\|S_1(K_n)\|_{\mathcal{L}_{12}}^2 > \varepsilon)
\le P\Bigg( \bigg\| \sum_{j=1}^{K_n} \hat c_j \hat v_j \otimes \frac{1}{\hat\lambda_j} (\hat\Delta - \Delta)(\hat c_j \hat v_j) \bigg\|_{\mathcal{S}_{12}}^2 > \varepsilon \Bigg)
\le P\Bigg( \frac{1}{\hat\lambda_{K_n}^2} \sum_{j=1}^{K_n} \|(\hat\Delta - \Delta)(\hat c_j \hat v_j)\|_{H_2}^2 > \varepsilon \Bigg)
\le P\big( m_n^2 \|\hat\Delta - \Delta\|_{\mathcal{S}_{12}}^2 > \varepsilon \big).
\]
By the Markov inequality,
\[
P(\|S_1(K_n)\|_{\mathcal{L}_{12}}^2 > \varepsilon) \le \frac{E\|\hat\Delta - \Delta\|_{\mathcal{S}_{12}}^2\, m_n^2}{\varepsilon} \le \frac{U m_n^2}{\varepsilon n},
\]
where the last inequality is obtained from Lemma 1.
Lemma 6. Let $S_2(K)$ be defined by equation (I.10) and U the constant from Lemma 5. Then
\[
P(\|S_2(K_n)\|_{\mathcal{L}_{12}} > \varepsilon) \le \frac{4U \|\Delta\|_{\mathcal{S}_{12}}^2 m_n^4}{\varepsilon^2 n}.
\]
Proof. The assumption $K_n \le B_n$ and identity (I.13) imply
\[
P(\|S_2(K_n)\|_{\mathcal{L}_{12}}^2 > \varepsilon)
= P\Bigg( \bigg\| \sum_{j=1}^{K_n} \Big( \frac{1}{\hat\lambda_j} - \frac{1}{\lambda_j} \Big) \hat c_j \hat v_j \otimes \Delta(\hat c_j \hat v_j) \bigg\|_{\mathcal{L}_{12}}^2 > \varepsilon \Bigg)
\le P\Bigg( \max_{1\le j\le K_n} \bigg( \frac{\lambda_j - \hat\lambda_j}{\lambda_j \hat\lambda_j} \bigg)^2 \sum_{j=1}^{K_n} \|\Delta(\hat c_j \hat v_j)\|_{H_2}^2 > \varepsilon \Bigg)
\le P\Bigg( \max_{1\le j\le K_n} \bigg( \frac{\lambda_j - \hat\lambda_j}{\lambda_j} \bigg)^2 > \frac{\varepsilon}{m_n^2 \|\Delta\|_{\mathcal{S}_{12}}^2} \Bigg).
\]
To simplify the notation, set $b^2 = \frac{\varepsilon}{m_n^2 \|\Delta\|_{\mathcal{S}_{12}}^2}$. Then
\[
P(\|S_2(K_n)\|_{\mathcal{L}_{12}}^2 > \varepsilon)
\le P\Bigg( \max_{1\le j\le K_n} \bigg| \frac{\lambda_j - \hat\lambda_j}{\lambda_j} \bigg| > b \Bigg)
\le P\Bigg( \bigg\{ \frac{1}{\lambda_{K_n}} \max_{1\le j\le K_n} |\lambda_j - \hat\lambda_j| > b \bigg\} \cap \bigg\{ \max_{1\le j\le K_n} |\lambda_j - \hat\lambda_j| \le \frac{b}{2m_n} \bigg\} \Bigg)
+ P\Bigg( \max_{1\le j\le K_n} |\lambda_j - \hat\lambda_j| > \frac{b}{2m_n} \Bigg).
\]
The first summand vanishes because
\[
P\Bigg( \bigg\{ \frac{1}{\lambda_{K_n}} \max_{1\le j\le K_n} |\lambda_j - \hat\lambda_j| > b \bigg\} \cap \bigg\{ \max_{1\le j\le K_n} |\lambda_j - \hat\lambda_j| \le \frac{b}{2m_n} \bigg\} \Bigg)
\le P\bigg( \Big\{ \frac{b}{2\lambda_{K_n} m_n} > b \Big\} \cap \Big\{ |\lambda_{K_n} - \hat\lambda_{K_n}| \le \frac{b}{2m_n} \Big\} \bigg)
\le P\bigg( \Big\{ \frac{1}{2m_n} > \lambda_{K_n} \Big\} \cap \Big\{ |\lambda_{K_n} - \hat\lambda_{K_n}| \le \frac{\sqrt\varepsilon}{2 m_n^2 \|\Delta\|_{\mathcal{S}_{12}}} \Big\} \bigg),
\]
which is equal to 0 for n large enough, since $\hat\lambda_{K_n} \ge \frac{1}{m_n}$ and the distance between $\lambda_{K_n}$ and $\hat\lambda_{K_n}$ shrinks faster than $\frac{1}{2m_n}$. For the second term we use Lemma 3 and the Markov inequality:
\[
P(\|S_2(K_n)\|_{\mathcal{L}_{12}}^2 > \varepsilon)
\le P\Bigg( \max_{1\le j\le K_n} |\lambda_j - \hat\lambda_j| > \frac{b}{2m_n} \Bigg)
\le P\bigg( \|C - \hat C\|_{\mathcal{L}_{11}} > \frac{b}{2m_n} \bigg)
\le \frac{4 m_n^2}{b^2}\, E\|C - \hat C\|_{\mathcal{L}_{11}}^2
\le \frac{4U \|\Delta\|_{\mathcal{S}_{12}}^2 m_n^4}{\varepsilon n}.
\]
Lemma 7. Let $S_3(K)$ be defined by (I.11) and U the constant from Lemma 5. Then
\[
P(\|S_3(K_n)\|_{\mathcal{L}_{12}} > \varepsilon) \le \frac{U\big( 128\|\Delta\|_{\mathcal{L}_{12}}^2 + 4\varepsilon^2 \big) m_n^6}{\varepsilon^2 n}.
\]
Proof. By adding and subtracting the term $\hat c_j \hat v_j \otimes \Delta(v_j)$ and using the triangle inequality we derive
\[
P(\|S_3(K_n)\|_{\mathcal{L}_{12}} > \varepsilon)
= P\Bigg( \bigg\| \sum_{j=1}^{K_n} \frac{1}{\lambda_j} \big( \hat c_j \hat v_j \otimes \Delta(\hat c_j \hat v_j) - v_j \otimes \Delta(v_j) \big) \bigg\|_{\mathcal{L}_{12}} > \varepsilon \Bigg)
\le P\Bigg( \sum_{j=1}^{K_n} \frac{1}{\lambda_j} \big\| \hat c_j \hat v_j \otimes \Delta(\hat c_j \hat v_j - v_j) + (\hat c_j \hat v_j - v_j) \otimes \Delta(v_j) \big\|_{\mathcal{L}_{12}} > \varepsilon \Bigg)
\le P\Bigg( \sum_{j=1}^{K_n} \frac{1}{\lambda_j} \big( \|\Delta\|_{\mathcal{L}_{12}} \|\hat c_j \hat v_j - v_j\|_{H_1} + \|\hat c_j \hat v_j - v_j\|_{H_1} \|\Delta\|_{\mathcal{L}_{12}} \big) > \varepsilon \Bigg).
\]
Now we split $\Omega = A \cup A^c$, where $A = \big\{ \frac{1}{\lambda_{K_n}} > 2m_n \big\}$, and get
\[
P(\|S_3(K_n)\|_{\mathcal{L}_{12}} > \varepsilon)
\le P\Bigg( \frac{1}{\lambda_{K_n}} \sum_{j=1}^{K_n} \|\hat c_j \hat v_j - v_j\|_{H_1} > \frac{\varepsilon}{2\|\Delta\|_{\mathcal{L}_{12}}} \Bigg)
\le P\Bigg( \sum_{j=1}^{K_n} \|\hat c_j \hat v_j - v_j\|_{H_1} > \frac{\varepsilon}{4 m_n \|\Delta\|_{\mathcal{L}_{12}}} \Bigg) + P\bigg( \frac{1}{\lambda_{K_n}} > 2m_n \bigg). \tag{I.14}
\]
For the first term in inequality (I.14) we get, by Lemma 2, the definition of $E_n$ and the Markov inequality,
\[
P\Bigg( \sum_{j=1}^{K_n} \|\hat c_j \hat v_j - v_j\|_{H_1} > \frac{\varepsilon}{4 m_n \|\Delta\|_{\mathcal{L}_{12}}} \Bigg)
\le P\Bigg( m_n \max_{1\le j\le E_n} \|\hat c_j \hat v_j - v_j\|_{H_1} > \frac{\varepsilon}{4 m_n \|\Delta\|_{\mathcal{L}_{12}}} \Bigg)
\le P\Bigg( \max_{1\le j\le E_n} \frac{2\sqrt 2}{\alpha_j} \|C - \hat C\|_{\mathcal{L}_{11}} > \frac{\varepsilon}{4 m_n^2 \|\Delta\|_{\mathcal{L}_{12}}} \Bigg)
\le P\bigg( \|C - \hat C\|_{\mathcal{L}_{11}} > \frac{\varepsilon}{8\sqrt 2\, m_n^3 \|\Delta\|_{\mathcal{L}_{12}}} \bigg)
\le \frac{128 \|\Delta\|_{\mathcal{L}_{12}}^2 m_n^6}{\varepsilon^2}\, E\|C - \hat C\|_{\mathcal{L}_{11}}^2
\le \frac{128\, U \|\Delta\|_{\mathcal{L}_{12}}^2 m_n^6}{\varepsilon^2 n}.
\]
Since $\hat\lambda_{K_n} \ge \frac{1}{m_n}$, the second term in inequality (I.14) is bounded by
\[
P\Big( \lambda_{K_n} < \frac{1}{2m_n} \Big)
\le P\Big( \Big\{ \lambda_{K_n} < \frac{1}{2m_n} \Big\} \cap \Big\{ |\lambda_{K_n} - \hat\lambda_{K_n}| \le \frac{1}{2m_n} \Big\} \Big)
+ P\Big( |\lambda_{K_n} - \hat\lambda_{K_n}| > \frac{1}{2m_n} \Big)
\le P\Big( \|C - \hat C\|_{\mathcal{L}_{11}} > \frac{1}{2m_n} \Big)
\le 4 m_n^2\, E\|C - \hat C\|_{\mathcal{L}_{11}}^2 \le \frac{4U m_n^2}{n}.
\]
Thus we derive
\[
P(\|S_3(K_n)\|_{\mathcal{L}_{12}} > \varepsilon)
\le \frac{128\, U \|\Delta\|_{\mathcal{L}_{12}}^2 m_n^6}{\varepsilon^2 n} + \frac{4U m_n^2}{n}
\le \frac{U\big( 128 \|\Delta\|_{\mathcal{L}_{12}}^2 + 4\varepsilon^2 \big) m_n^6}{\varepsilon^2 n}.
\]
Finally, we need a lemma which assures that $K_n$ tends to infinity.

Lemma 8. Let $K_n$ be defined as in (K). Then $K_n \stackrel{P}{\to} \infty$.

Proof. We have to show that $P(\min\{B_n, E_n\} < p) \to 0$ for any $p \in \mathbb{N}$. Since $\frac{1}{m_n} \searrow 0$, for n large enough we have, by combining Lemmas 1 and 3, that
\[
P(B_n < p) = P\Big( \hat\lambda_p < \frac{1}{m_n} \Big)
= P\Big( \lambda_p - \hat\lambda_p > \lambda_p - \frac{1}{m_n} \Big)
\le P\Big( |\lambda_p - \hat\lambda_p| > \lambda_p - \frac{1}{m_n} \Big) \to 0.
\]
Now we are ready to prove the main result.

Proof of Theorem 1. First, by the triangle inequality we get
\[
\|\Psi - \hat\Psi_{K_n}\|_{\mathcal{L}_{12}}
\le \|\Psi_{K_n} - \hat\Psi_{K_n}\|_{\mathcal{L}_{12}} + \|\Psi - \Psi_{K_n}\|_{\mathcal{L}_{12}}
\le \|S_1(K_n)\|_{\mathcal{L}_{12}} + \|S_2(K_n)\|_{\mathcal{L}_{12}} + \|S_3(K_n)\|_{\mathcal{L}_{12}} + \|\Psi - \Psi_{K_n}\|_{\mathcal{L}_{12}}.
\]
By Lemmas 4, 5, 6, 7, applied with $\varepsilon/4$ in place of $\varepsilon$, and the assumption $m_n^6 = o(n)$, we finally obtain for large enough n that
\[
P(\|\Psi - \hat\Psi_{K_n}\|_{\mathcal{L}_{12}} > \varepsilon)
\le 4^2 U \frac{m_n^2}{\varepsilon^2 n}
+ 4^3 U \|\Delta\|_{\mathcal{S}_{12}}^2 \frac{m_n^4}{\varepsilon^2 n}
+ 4^2 U \big( 128\|\Delta\|_{\mathcal{L}_{12}}^2 + \varepsilon^2/4 \big) \frac{m_n^6}{\varepsilon^2 n}
+ P(\|\Psi - \Psi_{K_n}\|_{\mathcal{L}_{12}} > \varepsilon/4)
\xrightarrow{\,n\to\infty\,} 0.
\]
5.2 Proof of Theorem 2
In order to simplify the notation we write $K = K_n$. This time, as a starting point, we take a representation of Ψ in the basis $\hat v_1, \hat v_2, \ldots$. Let $\mathcal{M}_m = \mathrm{sp}\{v_1, \ldots, v_m\}$ and $\hat{\mathcal{M}}_m = \mathrm{sp}\{\hat v_1, \ldots, \hat v_m\}$, where $\mathrm{sp}\{x_i,\ i \in I\}$ denotes the closed span of the elements $x_i$, $i \in I$. If $\mathrm{rank}(\hat C) = \ell$, then $\hat v_i$, $i > \ell$, can be any ONB of $\hat{\mathcal{M}}_\ell^\perp$. We write $P_A$ for the projection operator onto a closed linear space A; as usual, $A^\perp$ denotes the orthogonal complement of A. Since for any $m \ge 1$ we can write $x = P_{\hat{\mathcal{M}}_m}(x) + P_{\hat{\mathcal{M}}_m^\perp}(x)$, the linearity of Ψ and of the projection operator gives
\[
\Psi(x) = \Psi(P_{\hat{\mathcal{M}}_m}(x)) + \Psi(P_{\hat{\mathcal{M}}_m^\perp}(x))
= \sum_{j=1}^m \langle \hat v_j, x\rangle_{H_1} \Psi(\hat v_j) + \Psi(P_{\hat{\mathcal{M}}_m^\perp}(x)).
\]
Now we evaluate Ψ at some $\hat v_j$ which is not in the kernel of $\hat C$. By the definitions of $\hat C$ and $\hat\Delta$ and again by linearity of the involved operators,
\[
\Psi(\hat v_j) = \frac{1}{\hat\lambda_j} \Psi(\hat C(\hat v_j))
= \frac{1}{\hat\lambda_j} \frac{1}{n} \sum_{i=1}^n \langle X_i, \hat v_j\rangle_{H_1} \Psi(X_i)
= \frac{1}{\hat\lambda_j} \frac{1}{n} \sum_{i=1}^n \langle X_i, \hat v_j\rangle_{H_1} (Y_i - \varepsilon_i)
= \frac{1}{\hat\lambda_j} \big( \hat\Delta(\hat v_j) + \Lambda(\hat v_j) \big),
\]
where $\Lambda = -\frac{1}{n}\sum_{i=1}^n X_i \otimes \varepsilon_i$. Hence, if m is such that $\hat\lambda_m > 0$ (which will from now on be implicitly assumed), Ψ can be expressed as
\[
\Psi(x) = \sum_{j=1}^m \langle \hat v_j, x\rangle_{H_1} \frac{1}{\hat\lambda_j} \hat\Delta(\hat v_j)
+ \sum_{j=1}^m \langle \hat v_j, x\rangle_{H_1} \frac{1}{\hat\lambda_j} \Lambda(\hat v_j)
+ \Psi(P_{\hat{\mathcal{M}}_m^\perp}(x)).
\]
Note that the first term on the right-hand side is just $\hat\Psi_m(x)$. Therefore, for any x, the distance between $\Psi(x)$ and $\hat\Psi_m(x)$ takes the following form:
\[
\|\Psi(x) - \hat\Psi_m(x)\|_{H_2}
= \bigg\| \sum_{j=1}^m \langle \hat v_j, x\rangle_{H_1} \frac{1}{\hat\lambda_j} \Lambda(\hat v_j) + \Psi(P_{\hat{\mathcal{M}}_m^\perp}(x)) \bigg\|_{H_2}. \tag{I.15}
\]
To assess (I.15) we need the following four lemmas.
Lemma 9. Let $(\lambda_i, v_i)_{i\ge1}$ and $(\hat\lambda_i, \hat v_i)_{i\ge1}$ be the eigenvalues and eigenfunctions of C and $\hat C$, respectively, and let $j, m \in \mathbb{N}$ with $j \le m \le n$. Then
\[
\|v_j - P_{\hat{\mathcal{M}}_m}(v_j)\|_{H_1}^2 \le \frac{4 \|C - \hat C\|_{\mathcal{L}_{11}}^2}{(\hat\lambda_{m+1} - \hat\lambda_j)^2}.
\]
Proof. Note that by using Parseval's identity we get
\[
\|v_j - P_{\hat{\mathcal{M}}_m}(v_j)\|_{H_1}^2
= \sum_{k=1}^\infty \langle v_j - P_{\hat{\mathcal{M}}_m}(v_j), \hat v_k\rangle_{H_1}^2
= \sum_{k>m} \langle v_j, \hat v_k\rangle_{H_1}^2.
\]
Now
\[
(\hat\lambda_{m+1} - \hat\lambda_j)^2 \sum_{k>m} \langle v_j, \hat v_k\rangle_{H_1}^2
\le \sum_{k>m} \big( \hat\lambda_k \langle v_j, \hat v_k\rangle_{H_1} - \hat\lambda_j \langle v_j, \hat v_k\rangle_{H_1} \big)^2
= \sum_{k>m} \big( \langle v_j, \hat C(\hat v_k)\rangle_{H_1} - \hat\lambda_j \langle v_j, \hat v_k\rangle_{H_1} \big)^2.
\]
Since $\hat C$ is a self-adjoint operator, simple algebraic transformations yield
\[
(\hat\lambda_{m+1} - \hat\lambda_j)^2 \sum_{k>m} \langle v_j, \hat v_k\rangle_{H_1}^2
\le \sum_{k>m} \big( \langle \hat C(v_j), \hat v_k\rangle_{H_1} - \hat\lambda_j \langle v_j, \hat v_k\rangle_{H_1} \big)^2
= \sum_{k>m} \big( \langle (\hat C - C)(v_j), \hat v_k\rangle_{H_1} - (\hat\lambda_j - \lambda_j) \langle v_j, \hat v_k\rangle_{H_1} \big)^2
\le 2 \sum_{k>m} \big| \langle (\hat C - C)(v_j), \hat v_k\rangle_{H_1} \big|^2 + 2 \sum_{k>m} \big( (\hat\lambda_j - \lambda_j) \langle v_j, \hat v_k\rangle_{H_1} \big)^2.
\]
By Parseval's inequality and Lemma 3,
\[
(\hat\lambda_{m+1} - \hat\lambda_j)^2 \sum_{k>m} \langle v_j, \hat v_k\rangle_{H_1}^2
\le 2 \|(\hat C - C)(v_j)\|_{H_1}^2 + 2 |\hat\lambda_j - \lambda_j|^2
\le 4 \|C - \hat C\|_{\mathcal{L}_{11}}^2.
\]
Lemma 10. Let $\hat v_i$ be defined as in Lemma 2 and $K = K_n \stackrel{P}{\to} \infty$. Then $\|P_{\hat{\mathcal{M}}_K^\perp}(X_n)\|_{H_1} \stackrel{P}{\to} 0$.
Proof. Here and in the sequel we write $X = X_n$. We first remark that for any $\varepsilon > 0$,
\[
P(\|P_{\hat{\mathcal{M}}_K^\perp}(X)\|_{H_1}^2 > \varepsilon)
= P\Bigg( \sum_{i=K+1}^\infty |\langle \hat v_i, X\rangle_{H_1}|^2 > \varepsilon \Bigg).
\]
Since $\sum_{i=1}^\infty |\langle \hat v_i, X\rangle_{H_1}|^2 = \|X\|_{H_1}^2$, there exists a random variable $J_\varepsilon$ such that $\sum_{i=J_\varepsilon}^\infty |\langle \hat v_i, X\rangle_{H_1}|^2 < \varepsilon$. Since by assumption $E\|X\|_{H_1}^2 < \infty$, we conclude that $J_\varepsilon$ is bounded in probability. Hence we obtain
\[
P(\|P_{\hat{\mathcal{M}}_K^\perp}(X)\|_{H_1}^2 > \varepsilon)
\le P\Bigg( \Big\{ \sum_{i=K+1}^\infty |\langle \hat v_i, X\rangle_{H_1}|^2 > \varepsilon \Big\} \cap \{K > J_\varepsilon\} \Bigg) + P(K \le J_\varepsilon)
= P(K \le J_\varepsilon),
\]
where the last term converges to zero as $n \to \infty$.
Lemma 11. Let $L_n = \arg\max\{ r \le K : \sum_{i=1}^r (\hat\lambda_{K+1} - \hat\lambda_i)^{-2} \le \xi_n \}$, where $K = K_n$ is given as in Theorem 2 and $\xi_n \to \infty$. Then $L_n \stackrel{P}{\to} \infty$.

Proof. Let $r \in \mathbb{N}$ be such that $\lambda_{r+1} \neq \lambda_i$ for all $1 \le i \le r$. Note that $E\|X\|_{H_1}^2 < \infty$ implies $\lambda_i \to 0$, and since $\lambda_i > 0$ we can find infinitely many r satisfying this condition. We choose such an r and obtain
\[
P(L_n < r) \le P\Bigg( \Big\{ \sum_{i=1}^r \frac{1}{(\hat\lambda_{K+1} - \hat\lambda_i)^2} > \xi_n \Big\} \cap \{K \ge r\} \Bigg) + P(K < r).
\]
Lemma 8 implies that $P(K < r) \to 0$. The first term is bounded by $P\Big( \sum_{i=1}^r \frac{1}{(\hat\lambda_{r+1} - \hat\lambda_i)^2} > \xi_n \Big)$. Since $\hat\lambda_i \stackrel{P}{\to} \lambda_i$ and r is fixed while $\xi_n \to \infty$, it follows that $P(L_n < r) \to 0$ as $n \to \infty$. Since r can be chosen arbitrarily large, the proof is finished.
Lemma 12. Let $\hat v_i$ be defined as in Lemma 2. Then $\|P_{\hat{\mathcal{M}}_K}(X) - P_{\mathcal{M}_K}(X)\|_{H_1} \stackrel{P}{\to} 0$.

Proof. Define $X^{(1)} = \sum_{i=1}^L \langle X, v_i\rangle_{H_1} v_i$ and $X^{(2)} = \sum_{i=L+1}^\infty \langle X, v_i\rangle_{H_1} v_i$, with L as in Lemma 11. Again, to simplify the notation, we write L instead of $L_n$. Since $X = X^{(1)} + X^{(2)}$ we derive
\[
\|P_{\hat{\mathcal{M}}_K}(X) - P_{\mathcal{M}_K}(X)\|_{H_1}
\le \|P_{\hat{\mathcal{M}}_K}(X^{(1)}) - P_{\mathcal{M}_K}(X^{(1)})\|_{H_1}
+ \|P_{\hat{\mathcal{M}}_K}(X^{(2)})\|_{H_1} + \|P_{\mathcal{M}_K}(X^{(2)})\|_{H_1}. \tag{I.16}
\]
The last two terms are bounded by $2\|X^{(2)}\|_{H_1}$. For the first summand in (I.16) we get
\[
\|P_{\hat{\mathcal{M}}_K}(X^{(1)}) - P_{\mathcal{M}_K}(X^{(1)})\|_{H_1}
= \bigg\| \sum_{i=1}^L \langle X, v_i\rangle_{H_1} \big( v_i - P_{\hat{\mathcal{M}}_K}(v_i) \big) \bigg\|_{H_1}.
\]
Let us choose $\xi_n = o(n)$ in Lemma 11. The triangle inequality, the Cauchy-Schwarz inequality, Lemma 9 and the definition of L entail
\[
\|P_{\hat{\mathcal{M}}_K}(X^{(1)}) - P_{\mathcal{M}_K}(X^{(1)})\|_{H_1}
\le \sum_{i=1}^L |\langle X, v_i\rangle_{H_1}|\, \|v_i - P_{\hat{\mathcal{M}}_K}(v_i)\|_{H_1}
\le \bigg( \sum_{i=1}^L |\langle X, v_i\rangle_{H_1}|^2 \bigg)^{1/2} \bigg( \sum_{i=1}^L \|v_i - P_{\hat{\mathcal{M}}_K}(v_i)\|_{H_1}^2 \bigg)^{1/2}
\le \|X\|_{H_1} \bigg( \sum_{i=1}^L \|v_i - P_{\hat{\mathcal{M}}_K}(v_i)\|_{H_1}^2 \bigg)^{1/2}
\le 2 \|X\|_{H_1} \|C - \hat C\|_{\mathcal{L}_{11}} \bigg( \sum_{i=1}^L \frac{1}{(\hat\lambda_{K+1} - \hat\lambda_i)^2} \bigg)^{1/2}
\le 2 \|X\|_{H_1} \|C - \hat C\|_{\mathcal{L}_{11}} \sqrt{\xi_n}.
\]
This implies the inequality
\[
\|P_{\hat{\mathcal{M}}_K}(X) - P_{\mathcal{M}_K}(X)\|_{H_1}
\le 2\|X\|_{H_1} \|C - \hat C\|_{\mathcal{L}_{11}} \sqrt{\xi_n} + 2\|X^{(2)}\|_{H_1}. \tag{I.17}
\]
Hence, by Lemma 1, $2\|X\|_{H_1}\|C - \hat C\|_{\mathcal{L}_{11}}\sqrt{\xi_n} = o_P(1)$. Furthermore, $\|X^{(2)}\|_{H_1} = \big( \sum_{j>L} |\langle X, v_j\rangle|^2 \big)^{1/2} \stackrel{P}{\to} 0$; this follows as in the proof of Lemma 10.
Lemma 13. Let $\hat v_i$ be defined as in Lemma 2. Then $\|\Psi(P_{\hat{\mathcal{M}}_K^\perp}(X))\|_{H_2} \stackrel{P}{\to} 0$.

Proof. Some simple manipulations show
\[
\|\Psi(P_{\hat{\mathcal{M}}_K^\perp}(X))\|_{H_2}
= \|\Psi(X - P_{\hat{\mathcal{M}}_K}(X))\|_{H_2}
= \|\Psi( P_{\mathcal{M}_K}(X) + P_{\mathcal{M}_K^\perp}(X) - P_{\hat{\mathcal{M}}_K}(X) )\|_{H_2}
\le \|\Psi(P_{\mathcal{M}_K}(X)) - \Psi(P_{\hat{\mathcal{M}}_K}(X))\|_{H_2} + \|\Psi(P_{\mathcal{M}_K^\perp}(X))\|_{H_2}
\le \|\Psi\|_{\mathcal{L}_{12}} \big( \|P_{\mathcal{M}_K}(X) - P_{\hat{\mathcal{M}}_K}(X)\|_{H_1} + \|P_{\mathcal{M}_K^\perp}(X)\|_{H_1} \big).
\]
Direct applications of Lemma 10 and Lemma 12 finish the proof.
Proof of Theorem 2. Set
\[
\Theta_n(x) = \sum_{j=1}^{K_n} \frac{\Lambda(\hat v_j)}{\hat\lambda_j} \langle \hat v_j, x\rangle_{H_1}.
\]
By the representation (I.15) and the triangle inequality,
\[
\|\Psi(X) - \hat\Psi(X)\|_{H_2} \le \|\Theta_n(X)\|_{H_2} + \|\Psi(P_{\hat{\mathcal{M}}_{K_n}^\perp}(X))\|_{H_2}.
\]
Lemma 13 shows that the second term tends to zero in probability. If in Lemma 1 we set $\Psi \equiv 0$, then $-\Lambda$ takes the role of $\hat\Delta$ and, by the independence of $\varepsilon_k$ and $X_k$, the corresponding Δ equals 0. By the arguments of Lemma 5 we infer $P(\|\Theta_n\|_{\mathcal{L}_{12}} > \varepsilon) \le U m_n^2/(\varepsilon^2 n)$, which implies that $\|\Theta_n(X)\|_{H_2} \stackrel{P}{\to} 0$.
6 Acknowledgement
This research was supported by the Communaute francaise de Belgique—Actions de Recherche Concertees (2010–2015) and Interuniversity Attraction Poles Programme (IAP-network P7/06), Belgian
Science Policy Office.
Bibliography
[1] Bosq, D. (1991). Modelization, nonparametric estimation and prediction for continuous time
processes. In Nonparametric functional estimation and related topics. NATO Adv. Sci. Inst.
Ser. C Math. Phys. Sci., 335, 509–529, Kluwer Acad. Publ.
[2] Bosq, D. (2000). Linear Processes in Function Spaces., Springer, New York.
[3] Cai, T. & Hall, P. (2006). Prediction in functional linear regression. Ann. Statist. 34, 2159–2179.
[4] Cai, T. & Zhou, H. (2008). Adaptive functional linear regression. technical report.
[5] Cardot, H., Ferraty, F. & Sarda, P. (1999). Functional linear model. Statist. Probab. Lett. 45,
11–22.
[6] Cardot, H., Ferraty, F. & Sarda, P. (2003). Spline estimators for the functional linear model.
Statist. Sinica 13, 571–591.
[7] Cardot, H. & Johannes, J. (2010). Thresholding projection estimators in functional linear mod-
els. J. Multivariate Anal. 101, 395–408.
[8] Chiou, J.-M., Muller, H.-G. & Wang, J.-L. (2004). Functional response models. Statist. Sinica
14, 675–693.
[9] Crambes, C. & Mas, A. (2013). Asymptotics of prediction in functional linear regression with
functional outputs. Bernoulli 19, 2153–2779.
[10] Crambes, C., Kneip, A. & Sarda, P. (2009). Smoothing splines estimators for functional linear
regression. Ann. Statist. 37, 35–72.
[11] Cuevas, A., Febrero, M. & Fraiman, R. (2002). Linear functional regression: the case of fixed
design and functional response. Canadian J. Statist. 30, 285–300.
[12] Eilers, P. & Marx, B. (1996). Flexible Smoothing with B-splines and Penalties. Statist. Sciences
11, 89–121.
[13] Febrero-Bande, M., Galeano, P. & Gonzalez-Manteiga, W. (2010). Measures of influence for the
functional linear model with scalar response. J. Multivariate Anal. 101, 327–339.
[14] Ferraty, F., Laksaci, A., Tadj, A. & Vieu, P. (2011). Kernel regression with functional response.
Electron. J. Statist. 5, 159–171.
[15] Hall, P. & Horowitz, J. (2007). Methodology and convergence rates for functional linear regression. Ann. Statist. 35, 70–91.
[16] Hastie, T. & Mallows, C. (1993). A discussion of “A statistical view of some chemometrics regression tools” by I.E. Frank and J.H. Friedman. Technometrics 35, 140–143.
[17] Hormann, S., Horvath, L. & Reeder, R. (2012). A Functional Version of the ARCH Model.
Econometric Theory 29, 267–288.
[18] Hormann, S. & Kokoszka, P. (2010). Weakly dependent functional data. Ann. Statist. 38,
1845–1884.
[19] Hormann, S. & Kokoszka, P. (2012). Functional Time Series. Handbook of Statistics 30, 157–
185.
[20] Horvath, L. & Kokoszka, P. (2012). Inference for Functional Data with Applications, Springer.
[21] Hyndman, R. J. & Shang, H. L. (2009). Forecasting functional time series. J. Korean Statist. Soc. 38, 199–211.
[22] Li, Y. & Hsing, T. (2007). On rates of convergence in functional linear regression. J. Multivariate
Anal. 98, 1782–1804.
[23] Malfait, N. & Ramsay J. O. (2003). The historical functional linear model. Canad. J. Statist.
31, 115–128.
[24] Muller, H.-G. & Stadtmuller, U. (2005). Generalized functional linear models. Ann. Statist. 33,
774–805.
[25] Ramsay, J. O. & Silverman, B. (2005). Functional Data Analysis (2nd ed.), Springer, New York.
[26] Reiss, T. P. & Ogden, T. R. (2007). Functional principal component regression and functional
partial least squares. J. Amer. Statist. Assoc. 102, 984–996.
[27] Sen, R. & Kluppelberg, S. (2010). Time series of functional data. technical report.
[28] Yao, F., Muller, H.-G. & Wang, J.-L. (2005). Functional linear regression analysis for longitudinal data. Ann. Statist. 33, 2873–2903.
[29] Yuan, M. & Cai, T. (2011). A reproducing kernel Hilbert space approach to functional linear
regression. Ann. Statist. 38, 3412–3444.
Chapter II
Estimation in functional lagged regression
Estimation in functional lagged regression∗
Siegfried Hormann1, Lukasz Kidzinski1, Piotr Kokoszka2
1 Departement de Mathematique, Universite libre de Bruxelles (ULB), Boulevard du Triomphe, B-1050
Bruxelles, Belgium
2 Department of Statistics, Colorado State University, Fort Collins, CO 80523, USA
Abstract
The paper introduces a functional time series (lagged) regression model. The impulse response coefficients in such a model are operators acting on a separable Hilbert space, which is the function space L2 in applications. A spectral approach to the estimation of these coefficients is proposed and asymptotically justified under a general nonparametric condition on the temporal dependence of the input series. Since the data are infinite dimensional, the estimation involves a spectral domain dimension reduction technique. Consistency of the estimators is established under general data dependent assumptions on the rate of the dimension reduction parameter. Their finite sample performance is evaluated by a simulation study which compares two ad hoc approaches to dimension reduction with a new asymptotically justified method. The new method is superior when the MSE of the in-sample prediction error is used as a criterion.
1 Introduction
This paper is concerned with the estimation of impulse response operators in functional lagged
regression. Time series (or lagged) regression goes back to the origins of modern time series analysis, Kolmogorov [20], Wiener [30]. Accounts are given in many monographs and textbooks, e.g.
Brillinger [4], Priestley [26], Shumway and Stoffer [29]. It forms the most important input–output
paradigm in modeling engineering, geophysical and economic systems. In an abstract form, the
lagged regression model is
Y` = a+∑k∈Z
bk(X`−k) + ε`. (1.1)
The regressors Xk are elements of a Hilbert space H, the responses Y` and the errors ε` belong to
a possibly different Hilbert space H ′, and bk : H → H ′ are linear operators. In the most common
setting, all quantities are scalars, but applications with several scalar input series are not uncommon.
While model (1.1) can be formulated in abstract Hilbert spaces, the existing statistical theory relies
on the assumption that all spaces are finite dimensional. This is because solutions to problems of
estimation, prediction and interpolation require inverting various matrices, and these inverses do
not exist (as bounded operators) in infinite dimensional spaces. A dimension reduction methodology
with a requisite theory must be developed.
Such issues have been extensively investigated in the field of Functional Data Analysis, with
the most relevant research relating to the functional linear model, e.g. Ramsay and Silverman [27],
Horvath and Kokoszka [17]. There are many types of functional linear models; those most relevant
∗Manuscript submitted for publication
to this paper are known as the scalar response model and the fully functional model. The scalar response model has the form $Y_k = a + \int_T b(u) X_k(u)\, du + \varepsilon_k$, where the $X_k$ are functions and the responses $Y_k$ are scalars. This model has been investigated from many angles; to give a selection of references, we cite Cardot et al. [7], Muller and Stadtmuller [23], Cai and Hall [5], Li and Hsing [21], Crambes et al. [10], James et al. [18], McKeague and Sen [22] and Comte and Johannes [9].
The fully functional model is defined by
\[
Y_k(t) = a(t) + \int_T b(t, u)\, X_k(u)\, du + \varepsilon_k(t),
\]
where now the responses Yk and the errors εk are also functions. This model is more complex, and
has not been so thoroughly investigated as the scalar response model, but it is safe to say that it
is presently well understood: Yao et al. [32], Chiou and Muller [8], Hormann and Kokoszka [13], Gabrys et al. [11] and Hormann and Kidzinski [12] are just a few examples of recent work.
As in the usual linear regression, the assumption imposed on the above models is that the
pairs (Xk, Yk) are independent and identically distributed, and these models do not involve lagged
values of the input series. However, many problems in science, economics and engineering can be
formulated in terms of statistical inference for functional time series (FTS), which are defined as $X_n(t)$, $t \in T$, $n = 1, 2, \ldots$, where n is the index referring to a day, week, year or a similar unit of time and plays the role of the time index in time series analysis. The random elements $X_n$ are
functions defined on a common domain T , typically an interval. This concept has been applied over
the last two decades in many settings, and a fairly complete theory for estimation, prediction and
testing for a single FTS exists, both in time and spectral domains: Bosq [2], Horvath and Kokoszka
[17], Hormann and Kokoszka [14], to name a few accounts.
The objective of this paper is to advance the existing framework by considering the input–output
paradigm for two FTS in the context of model (1.1). There are two broad approaches to inference
and prediction in the lagged regression model: 1) time domain approach based on ARMA modeling
of the series (X`) and response function modeling of the coefficients bk, Box et al. [3]; 2) spectral
domain approach based on coherency analysis, Brillinger [4]. While the Box–Jenkins approach has
an appealing heuristic justification, the coherency approach is viewed as a more principled one, and
has been extensively used in geosciences and engineering. Recent advances in the spectral theory
for functional data, Panaretos and Tavakoli (Panaretos and Tavakoli [25], Panaretos and Tavakoli
[24]), Hormann et al. [16], have opened up a prospect of developing a usable and asymptotically
supported methodology for model (1.1). As with most functional procedures, the main challenge is a
suitable dimension reduction technique and the need to deal with unbounded operators, difficulties
not encountered in the scalar and vector theory; details are explained in Section 3.
The remainder of this paper is organized as follows. Section 2 introduces model (1.1) in greater
detail by specifying the assumptions on its parameters and dependence structure. Estimation
methodology is explained in Section 3 and asymptotically justified in Section 4. Its finite sample performance is evaluated in Section 5. All proofs are collected in Section 6. In the Appendix,
we describe a new method for selecting an important dimension reduction parameter.
2 Model specification
We consider model (1.1) with a strictly stationary sequence $(X_k)$ and an i.i.d. sequence $(\varepsilon_k)$ independent of it, with realizations in separable Hilbert spaces H and H′, respectively. These spaces are equipped with the norms $\|f\| = \sqrt{\langle f, f\rangle}$, where $\langle\cdot,\cdot\rangle$ is the corresponding inner product; the inner products in H and H′ are denoted in the same way. Even though we consider only real-valued observations, it is convenient to assume that H and H′ are Hilbert spaces over the complex field $\mathbb{C}$, so that $\langle f, g\rangle = \overline{\langle g, f\rangle}$, where $\bar z$ denotes the complex conjugate of z.
Throughout we suppose that $E\|X_k\|^2 < \infty$, $E\|\varepsilon_k\|^2 < \infty$ and $E\varepsilon_k = 0$. A sufficient condition for the convergence of (1.1) is $\sum_{k\in\mathbb{Z}} \|b_k\|_{\mathcal{L}} < \infty$, where $\|b\|_{\mathcal{L}} = \sup_{f:\,\|f\|=1} \|b(f)\|$ denotes the usual operator norm. A slightly stronger, but more convenient, assumption is
\[
\sum_{k\in\mathbb{Z}} \|b_k\|_{\mathcal{S}} < \infty, \tag{2.1}
\]
where $\|\cdot\|_{\mathcal{S}}$ is the Hilbert–Schmidt norm. Recall that $\|\Psi\|_{\mathcal{S}}^2 = \sum_{j\ge1} \|\Psi(e_j)\|^2 = \sum_{j\ge1} s_j^2$, where $(e_j)$ is any orthonormal basis in H and the $s_j$ are the singular values of Ψ. Recall also that $\|\Psi\|_{\mathcal{L}} \le \|\Psi\|_{\mathcal{S}}$. Our assumptions imply that $(Y_\ell)$ is also strictly stationary and $E\|Y_\ell\|^2 < \infty$.
The means $\mu_X = EX_\ell$ and $\mu_Y = EY_\ell$ are estimated by sample averages which, under quite general dependence assumptions, are $\sqrt n$-consistent; see Section 2.4 of Bosq [2] for general results in Banach spaces, and Theorem 16.3 of Horvath and Kokoszka [17] for the form of dependence used in this paper. Since $\mu_Y = a + \sum_{k\in\mathbb{Z}} b_k(\mu_X)$, once the $b_k$ have been estimated, an estimator for the intercept a can easily be obtained. We therefore consider from now on the model
\[
Y_\ell = \sum_{k\in\mathbb{Z}} b_k(X_{\ell-k}) + \varepsilon_\ell, \qquad (EY_\ell = 0,\ EX_\ell = 0). \tag{2.2}
\]
Since the process $(Y_\ell, X_\ell)$ is strictly stationary and has second order moments, the operators
\[
C^X_h := \mathrm{Cov}(X_h, X_0) \quad\text{and}\quad C^{YX}_h := \mathrm{Cov}(Y_h, X_0),
\]
defined by the relation
\[
\mathrm{Cov}(X, Y)(f) = E\big[ (X - EX) \langle f, Y - EY\rangle \big],
\]
exist as elements of the space of Hilbert–Schmidt operators. The autocovariances of the input series are assumed to be summable:
\[
\sum_{h\in\mathbb{Z}} \|C^X_h\|_{\mathcal{S}} < \infty. \tag{2.3}
\]
For ease of reference, we collect the time domain conditions imposed so far in the following assumption.
Assumption 1. All random elements are square integrable. Model (2.2) and conditions (2.1) and
(2.3) hold. The sequences (Xk) and (εk) are strictly stationary and independent of each other. The
errors εk are independent.
Setting $i = \sqrt{-1}$, we introduce the spectral density operator
\[
\mathcal{F}^X_\theta := \frac{1}{2\pi} \sum_{h\in\mathbb{Z}} C^X_h e^{-ih\theta}, \quad \theta \in [-\pi, \pi],
\]
and the cross-spectral density operator
\[
\mathcal{F}^{YX}_\theta := \frac{1}{2\pi} \sum_{h\in\mathbb{Z}} C^{YX}_h e^{-ih\theta}, \quad \theta \in [-\pi, \pi].
\]
By (2.3) and Lemma 3 these two series are absolutely convergent in the Hilbert–Schmidt norm.
We will use the following assumption.
Assumption 2. For any $\theta \in [-\pi, \pi]$ the operator $\mathcal{F}^X_\theta : H \to H$ has full rank, that is, $\ker\big(\mathcal{F}^X_\theta\big) = \{0\}$.
For a scalar time series, Assumption 2 is equivalent to requiring that the spectral density of $(X_\ell)$ be positive over $[-\pi, \pi]$. An analogous nonsingularity condition must be imposed for vector-valued time series, Theorem 8.3.1 of Brillinger [4].
Next, we introduce the frequency response operator
\[
\mathcal{B}_\theta := \sum_{h\in\mathbb{Z}} b_h e^{-ih\theta}, \quad \theta \in [-\pi, \pi].
\]
The mapping $\theta \mapsto \frac{1}{2\pi}\mathcal{B}_\theta$ is the Fourier transform (FT) of the sequence $(b_k)$, from which we may recover $b_h$ by
\[
b_h = \frac{1}{2\pi} \int_{-\pi}^{\pi} \mathcal{B}_\theta\, e^{ih\theta}\, d\theta.
\]
In the case of the general model (1.1), the $b_h$ are operators. We refer to Hormann et al. [16] for details on how this type of FT is rigorously defined. If the $Y_k$ are scalars and the $b_k$ functions on an interval, then $\mathcal{B}_\theta = B_\theta(u) = \sum_{h\in\mathbb{Z}} b_h(u) e^{-ih\theta}$ reduces to a pointwise FT.
We conclude this section by specifying the assumptions on the dependence structure of the process (X_k). We use the concept of L^p-m-approximability introduced by Hörmann and Kokoszka [13]. This moment-based notion of dependence is convenient to apply and has been verified to hold for several popular functional time series models, including functional linear processes. We conjecture that our results could also be established under the cumulant type assumptions used by Panaretos and Tavakoli [24], but the latter framework seems to be more restrictive than ours. We write X ∈ L^p_H if X takes values in the Hilbert space H and

    ν_p(X) := (E‖X‖^p)^{1/p} < ∞.
Definition 1. A sequence (X_n) ∈ L^p_H is called L^p-m-approximable if each X_n admits the representation

    X_n = f(u_n, u_{n−1}, . . .),   (2.4)

where the u_i are i.i.d. elements taking values in a measurable space S, and f is a measurable function f : S^∞ → H. Moreover, we assume that if (u′_i) is an independent copy of (u_i) defined on the same probability space, then, letting

    X^{(m)}_n = f(u_n, u_{n−1}, . . . , u_{n−m+1}, u′_{n−m}, u′_{n−m−1}, . . .),   (2.5)

we have

    ∑_{m=1}^{∞} ν_p( X_0 − X^{(m)}_0 ) < ∞.   (2.6)
Notice that by construction X^{(m)}_n =_d X_0 (equality in law), and that X^{(m)}_n is independent of (X_{n−k} : k ≥ m). Representation (2.4) implies that the X_k form a stationary and ergodic sequence in L^2. Similar assumptions have been used extensively in recent theoretical work, as all stationary time series models in practical use can be represented as Bernoulli shifts (2.4); see Wu [31], Shao and Wu [28], Aue et al. [1], Hörmann and Kokoszka [13], Hörmann et al. [15] and Kokoszka and Reimherr [19].
Assumption 3. The input sequence (X_k) is L^4-m-approximable.
3 Estimation of the impulse response operators
In a scalar lagged regression model y_ℓ = ∑_k b_k x_{ℓ−k} + ε_ℓ, the frequency response function is estimated by B̂_θ = f̂_{yx}(θ)/f̂_{xx}(θ), where f̂_{yx}(θ) and f̂_{xx}(θ) are, respectively, estimates of the cross–spectrum of (y_k) and (x_k) and of the spectral density of (x_k). The response coefficients b_h are then estimated by the inverse FT of B̂_θ. To develop a similar procedure for functional data, we begin with the relation

    F^{YX}_θ = B_θ F^X_θ,   (3.1)

which follows by changing the order of summation (cf. Lemma 3):

    F^{YX}_θ(f) = ∑_{k∈Z} b_k ( (1/2π) ∑_{h∈Z} C^X_{h−k}(f) e^{−i(h−k)θ} ) e^{−ikθ} = B_θ F^X_θ(f).

Heuristically, (3.1) yields the relation B_θ = F^{YX}_θ (F^X_θ)^{−1}. This relation is heuristic only because (F^X_θ)^{−1} is not a bounded operator; see Lemma 4. We now explain how to overcome this problem and construct consistent estimators of the impulse response operators b_k.
For any θ ∈ [−π, π], the operator F^X_θ is Hilbert–Schmidt, symmetric and non-negative definite. The verification is not difficult; see Hörmann et al. [16]. Assumption 2 implies that its eigenvalues λ_m(θ) are positive and that the eigenfunctions (ϕ_m(θ) : m ≥ 1) form an orthonormal basis of H. Hence, by the spectral theorem, it can be decomposed as

    F^X_θ(f) = ∑_{m≥1} λ_m(θ) ⟨f, ϕ_m(θ)⟩ ϕ_m(θ),   f = ∑_{m≥1} ⟨f, ϕ_m(θ)⟩ ϕ_m(θ),   (3.2)

where the λ_m(θ) are arranged in descending order and the corresponding eigenfunctions are normalized to unit length. Relations (3.1) and (3.2) imply

    F^{YX}_θ(ϕ_m(θ)) = λ_m(θ) B_θ(ϕ_m(θ)),   m ≥ 1.
Therefore

    B_θ(f) = ∑_{m≥1} ( ⟨f, ϕ_m(θ)⟩ / λ_m(θ) ) F^{YX}_θ(ϕ_m(θ)).   (3.3)

The latter sum converges for any f ∈ H, and relation (3.3) forms the starting point of our estimation approach.
Consider a sample (Y_1, X_1), . . . , (Y_n, X_n) and the sample cross–covariance operators

    Ĉ^{YX}_h(f) = (1/n) ∑_{k=1}^{n−h} Y_{k+h} ⟨f, X_k⟩,   h = 0, . . . , n − 1;
    Ĉ^{YX}_h(f) = (1/n) ∑_{k=1−h}^{n} Y_{k+h} ⟨f, X_k⟩,   h = −n + 1, . . . , −1;
    Ĉ^{YX}_h(f) = 0,   |h| ≥ n.

The estimators Ĉ^X_h for the autocovariances of (X_ℓ) are defined analogously. Now we set

    F̂^{YX}_{θ|q} = (1/2π) ∑_{|h|≤q} ω_q(h) Ĉ^{YX}_h e^{−ihθ}.

This is the functional version of the smoothed periodogram. A popular choice is the Bartlett weights ω_q(h) = 1 − |h|/(q + 1), but any weights satisfying the following assumption can be used.
Assumption 4. The weights ω_q(h) satisfy ω_q(−h) = ω_q(h), |ω_q(h)| ≤ ω⋆ for some ω⋆ independent of q and h, and lim_{q→∞} ω_q(h) = 1 for every fixed h.
All kernels used in practice lead to weights which satisfy Assumption 4.
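To make the construction concrete, the following is a minimal numerical sketch (all function and variable names are our own, and curves are assumed to be represented by their coefficients in a fixed basis) of the sample cross-covariances and the Bartlett-weighted smoothed periodogram; it illustrates the formulas above and is not the implementation used in Section 5.

```python
import numpy as np

def cross_cov(Y, X, h):
    """Lag-h sample cross-covariance matrix (1/n) * sum_k Y_{k+h} X_k'.

    Y and X hold the basis coefficients of the (centred) curves, one row per
    time point; h may be negative, and lags |h| >= n give the zero matrix.
    """
    n = X.shape[0]
    if abs(h) >= n:
        return np.zeros((Y.shape[1], X.shape[1]))
    if h >= 0:
        return Y[h:].T @ X[:n - h] / n
    return Y[:n + h].T @ X[-h:] / n

def smoothed_periodogram(Y, X, theta, q):
    """Bartlett-weighted estimate of the (cross-)spectral density at frequency theta."""
    F = np.zeros((Y.shape[1], X.shape[1]), dtype=complex)
    for h in range(-q, q + 1):
        w = 1.0 - abs(h) / (q + 1.0)            # Bartlett weights omega_q(h)
        F += w * cross_cov(Y, X, h) * np.exp(-1j * h * theta)
    return F / (2.0 * np.pi)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
F0 = smoothed_periodogram(X, X, 0.7, q=5)       # estimate of F^X_theta; Hermitian
```

As a sanity check, the estimate of F^X_θ is Hermitian, mirroring the symmetry of the population operator.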
In an analogous way we define F̂^X_{θ|q}. The latter operator is non-negative definite, symmetric and Hilbert–Schmidt for any frequency θ ∈ [−π, π]. Thus, the spectral theorem applies and we can use its eigenfunctions ϕ̂_m(θ) = ϕ̂_{m|q}(θ) and eigenvalues λ̂_m(θ) = λ̂_{m|q}(θ) as estimators for the population spectrum of F^X_θ. Clearly, ϕ̂_m(θ) and ϕ_m(θ) can only be close if they have the same direction. In other words, the best we can hope for is that ‖ϕ̂_m(θ) − ĉ_m(θ)ϕ_m(θ)‖ is small, where ĉ_m(θ) = ⟨ϕ̂_m(θ), ϕ_m(θ)⟩/|⟨ϕ̂_m(θ), ϕ_m(θ)⟩|. This implies that all formulas defining estimators and test statistics must be invariant with respect to ĉ_m(θ).
We are now ready to define

    b̂_h = (1/2π) ∫_{−π}^{π} B̂_θ e^{ihθ} dθ,

where

    B̂_θ(f) = B̂_{θ|p,q,K}(f) = ∑_{m=1}^{K} ( ⟨f, ϕ̂_{m|q}(θ)⟩ / λ̂_{m|q}(θ) ) F̂^{YX}_{θ|p}(ϕ̂_{m|q}(θ)).   (3.4)

The estimator B̂_θ involves three tuning parameters: p, q and K, which in principle may each depend on θ. For the sake of readability, we shall in the sequel often suppress the dependence on these parameters in the notation. The selection of these parameters is discussed in the following sections.
Notice that (3.4) is invariant with respect to rotations of ϕ̂_m(θ): if c is a number on the complex unit circle, then we can replace ϕ̂_m(θ) by cϕ̂_m(θ) without changing the estimator. This follows from

    ⟨f, cϕ̂_m(θ)⟩ F̂^{YX}_θ(cϕ̂_m(θ)) = c c̄ ⟨f, ϕ̂_m(θ)⟩ F̂^{YX}_θ(ϕ̂_m(θ))

and c c̄ = 1. Hence, in theoretical arguments, we can replace ϕ_m(θ) in (3.4) by ĉ_m(θ)ϕ_m(θ).
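In finite dimensions, the truncated regularised inverse in (3.4) can be sketched as follows (an illustrative matrix version under the assumption that curves have already been reduced to basis coefficients; names are ours). Note that the product ϕ_m ϕ_m^* makes the result invariant under the phase rotations just discussed.

```python
import numpy as np

def B_hat(F_yx, F_x, K):
    """Truncated plug-in estimator: apply F_yx to the rank-K regularised inverse of F_x.

    F_x: Hermitian spectral density matrix at a fixed frequency theta;
    F_yx: cross-spectral density matrix; K: truncation level.
    """
    lam, phi = np.linalg.eigh(F_x)              # eigenvalues in ascending order
    lam, phi = lam[::-1], phi[:, ::-1]          # reorder to descending, as in (3.2)
    # sum_{m<=K} (1/lambda_m) phi_m phi_m^*  -- invariant under phi_m -> c phi_m
    Finv_K = (phi[:, :K] / lam[:K]) @ phi[:, :K].conj().T
    return F_yx @ Finv_K

rng = np.random.default_rng(1)
d = 4
A = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
F_x = A @ A.conj().T + np.eye(d)                # Hermitian, positive definite
F_yx = rng.standard_normal((2, d)) + 1j * rng.standard_normal((2, d))
B_full = B_hat(F_yx, F_x, K=d)                  # K = d reproduces F_yx F_x^{-1}
```

With K equal to the full dimension the sketch reproduces the unregularised relation B_θ = F^{YX}_θ (F^X_θ)^{−1}; smaller K discards the unstable small-eigenvalue directions.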
4 Consistency of the estimators
Asymptotic assumptions commonly used in the context of functional regression models are formulated in terms of decay rates of the eigenvalues of the covariance operator C^X_0 and of conditions on the gaps between these eigenvalues. Under such assumptions, convergence rates for the estimators can be obtained. In our spectral context, taking this route would require translating such conditions on the eigenvalues of C^X_0 into conditions on the eigenvalues of F^X_θ, θ ∈ [−π, π], which does not yield clean conditions. It is much more natural to base the rates of convergence of the b̂_h directly on the rates of convergence of the spectral density operators stated in the following lemma, which is proven in Section 6.
Lemma 1. Suppose Assumptions 1, 2, 3 and 4 hold. If q = q_n → ∞ such that q^2 = o(n), then there exist null sequences (ψ^X_n) and (ψ^{YX}_n) such that

    sup_{θ∈[−π,π]} ‖F̂^X_θ − F^X_θ‖_L = o_P(ψ^X_n)   and   sup_{θ∈[−π,π]} ‖F̂^{YX}_θ − F^{YX}_θ‖_L = o_P(ψ^{YX}_n).
Such an approach allows us to specify the value of K (the truncation level in (3.4)) which implies the consistency of the b̂_h directly in terms of the sequences (ψ^X_n) and (ψ^{YX}_n).
To establish the consistency of the b̂_h, we need technical assumptions which ensure the identifiability of the eigenfunctions of the spectral density estimators. These assumptions do not enter into the convergence rates or the selection of K. Introduce the following functions, which measure the size of the spectral gaps:

    α_1(θ) := λ_1(θ) − λ_2(θ);
    α_m(θ) := ( λ_m(θ) − λ_{m+1}(θ) ) ∧ ( λ_{m−1}(θ) − λ_m(θ) ),   m ≥ 2.

If α_m(θ) ≠ 0, the eigenspace corresponding to λ_m(θ) is one-dimensional, and ϕ_m(θ) is unique up to multiplication by a number on the complex unit circle. If α_m(θ) = 0 for some θ, the eigenspace corresponding to λ_m(θ) has dimension greater than one, and then ϕ_m(θ) cannot be identified. We shall thus impose the following assumption.
Assumption 5. It holds that inf_θ α_k(θ) > 0 for all k ≥ 1.
To formulate our consistency result, we need the following random variables. We define K̂ = min{K̂^{(i)} : 1 ≤ i ≤ 4}, with

    K̂^{(1)} = max{ k ≥ 1 : inf_θ λ̂_k(θ) ≥ 2ψ^X_n },
    K̂^{(2)} = max{ k ≥ 1 : ψ^{YX}_n ∫_{−π}^{π} W^k_λ(θ) dθ ≤ 1 },
    K̂^{(3)} = max{ k ≥ 1 : ∫_{−π}^{π} ( W^k_λ(θ) )^2 dθ ≤ (ψ^X_n)^{−1/2} },
    K̂^{(4)} = max{ k ≥ 1 : ∫_{−π}^{π} ( Ŵ^k_α(θ) )^2 dθ ≤ (ψ^X_n)^{−1/2} },

where

    W^k_λ(θ) = ( ∑_{m=1}^{k} 1/λ^2_m(θ) )^{1/2}   and   W^k_α(θ) = ( ∑_{m=1}^{k} 1/α^2_m(θ) )^{1/2},

and Ŵ^k_α(θ) is defined as W^k_α(θ), with the spectral gaps α_m(θ) replaced by their empirical counterparts α̂_m(θ) computed from the λ̂_m(θ).
By convention, the maximum over the empty set is equal to zero.
Theorem 1. Suppose that Assumptions 1, 2, 3, 4 and 5 hold. For any null sequences (ψ^X_n) and (ψ^{YX}_n) as in Lemma 1, define K̂ = min{K̂^{(i)} : 1 ≤ i ≤ 4}. If q, p → ∞ such that q + p = o(n^{1/2}), then

    max_{h∈Z} ‖b̂_h − b_h‖_S →_P 0.

Theorem 1 is proven in Section 6.
Theorem 1 provides general conditions on the dimension parameter K to ensure that the estimator b̂_h is consistent. We now propose three specific rules whose finite sample performance will be compared in the next section.
Cross–validation (CV). We divide {1, . . . , n} into a training set S_tr = {1, . . . , m} and a test set S_test = {m + 1, . . . , n}, with m = ⌊αn⌋ and α ∈ (0, 1). With the variables {X_j : j ∈ S_tr} we estimate the operators b̂_{h|k} for h ∈ {−H, . . . , H}, H ≥ 0, with a fixed dimension K_θ = k for all θ. Then we compute

    V^2_k := ∑_{j∈S_test} ‖ Y_j − ∑_{|h|≤H} b̂_{h|k}( X_{j−h} I{j − h ∈ S_test} ) ‖^2,   k ≥ 1.

We set K = argmin_{k≥1} V_k. In Section 5, we use α = 0.8 and H = 3.
A potential disadvantage of this method is that K is fixed for all frequencies θ. In principle,
one could vary K over a partition of [−π, π]. However, such a method is numerically unstable and
increases computational costs.
Eigenvalue thresholding (ET). A major source of variability of the estimator (3.4) is small eigenvalues λ̂_{m|q}(θ) in the denominator. Hence, another natural tuning approach consists in truncating the sum in (3.4) as soon as λ̂_{m|q}(θ) falls below a certain threshold ε = ε_n. Hence, we choose

    K_θ = max{ m ≥ 1 : λ̂_{m|q}(θ) > ε_n }.

In Section 5, we use ε_n = n^{−1/2}.
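With eigenvalues sorted in descending order, the ET rule amounts to counting how many exceed the threshold; a one-line sketch (names are ours, illustrative only):

```python
import numpy as np

def K_threshold(eigvals, eps):
    """ET rule: truncate (3.4) at the last eigenvalue still above the threshold.

    eigvals are assumed sorted in descending order, as delivered by the
    spectral decomposition of the smoothed periodogram.
    """
    eigvals = np.asarray(eigvals)
    return int(np.sum(eigvals > eps))

n = 400
K = K_threshold([0.90, 0.30, 0.04, 0.002], eps=n ** -0.5)  # eps_n = 1/sqrt(400) = 0.05
```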
Final prediction error (FPE). This method is more complex and is of independent interest since
it is applicable to the static functional linear model as well. The new data driven approach is
explained in Appendix 1, which also contains its theoretical justification.
5 Assessment of the performance in finite samples
5.1 Data generating processes and numerical implementation of the estimators
We work with a scalar response model of the form

    Y_t = ∑_{|h|≤H} ⟨X_{t−h}, b_h⟩ + e_t,   H ≥ 0.

In order to generate it, we assume a finite dimensional specification b_h = ∑_{k=1}^{d} b_{h;k} f_k, where the functions f_k form an orthonormal system in H. If we expand the curves X_t along the basis (f_k), we have X_t = ∑_{k≥1} ⟨X_t, f_k⟩ f_k, and thus

    Y_t = ∑_{|h|≤H} ∑_{k=1}^{d} b_{h;k} ⟨X_{t−h}, f_k⟩ + e_t.

Hence, we can rearrange this functional lagged regression model in vector form as

    Y_t = ∑_{|h|≤H} X′_{t−h} b_h + e_t,

where b_h = (b_{h;1}, . . . , b_{h;d})′ and X_t := (⟨X_t, f_1⟩, . . . , ⟨X_t, f_d⟩)′.
A similar discretization can be done for the operator B_θ and the covariance operators C^X_h and C^{YX}_h. More specifically, we can write

    B_θ(x) = ( ∑_{|h|≤H} b′_h exp(−ihθ) ) x =: B_θ x,

and

    C^X_h(x) = f′_d ( E X_{t+h} X′_t ) x =: f′_d C^X_h x   and   C^{YX}_h(x) = ( E Y_{t+h} X′_t ) x =: C^{YX}_h x,

where x = (⟨x, f_1⟩, . . . , ⟨x, f_d⟩)′ and f′_d = (f_1, . . . , f_d). In other words, all involved operators have corresponding matrices, which act on the coefficients of x projected onto span(f_1, . . . , f_d) instead of acting on x itself. Arguing along these lines, it follows that

    B_θ(x) = F^{YX}_θ ( F^X_θ )^{−1} x,

where F^X_θ and F^{YX}_θ are the spectral density matrices related to (C^X_h) and (C^{YX}_h), respectively.
Furthermore, it can be readily shown that

    B̂_{θ|p,q,k}(x) = B̂_{θ|p,q,k} x := F̂^{YX}_{θ|p} ( ∑_{m=1}^{k} (1/λ̂_{m|q}(θ)) ϕ̂_{m|q}(θ) ϕ̂^∗_{m|q}(θ) ) x,

where λ̂_{m|q}(θ) and ϕ̂_{m|q}(θ) are the eigenvalues and eigenvectors of

    F̂^X_{θ|q} = (1/2π) ∑_{|h|≤q} w_q(h) Ĉ^X_h e^{−ihθ},

and where

    F̂^{YX}_{θ|p} = (1/2π) ∑_{|h|≤p} w_p(h) Ĉ^{YX}_h e^{−ihθ}.

Here Ĉ^X_h and Ĉ^{YX}_h are the usual empirical covariance matrices related to the sequence ((Y_t, X_t) : 1 ≤ t ≤ n), and w_q(h) are the Bartlett weights.
Finally, b̂_ℓ = (1/2π) ∫_{−π}^{π} B̂_{θ|p,q,k} e^{iℓθ} dθ. (Note that we do allow p, q and k to depend on θ.) Since this term cannot be computed explicitly, we use the numerical approximation

    b̂_ℓ(x) = f′_d ( (1/|Θ|) ∑_{θ∈Θ} B̂_{θ|p,q,k} e^{iℓθ} ) x =: f′_d b̂_ℓ x,   (5.1)

where Θ is a fine mesh on [−π, π].
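The mesh average in (5.1) can be sketched as follows; the toy one-dimensional frequency response used below is a hypothetical example of our own making, chosen so that the inverse transform is easy to verify by hand.

```python
import numpy as np

def impulse_response(B_of_theta, lag, mesh):
    """Approximate b_lag = (1/2pi) * integral of B_theta e^{i*lag*theta} dtheta
    by the mesh average (1/|Theta|) * sum_theta B_theta e^{i*lag*theta}, as in (5.1);
    with a uniform mesh of width 2pi/|Theta| the factors of 2pi cancel."""
    vals = [B_of_theta(th) * np.exp(1j * lag * th) for th in mesh]
    return np.mean(vals, axis=0)

mesh = np.linspace(-np.pi, np.pi, 256, endpoint=False)
# toy frequency response with a single unit coefficient at lag 2
B = lambda th: np.array([[np.exp(-2j * th)]])
b2 = impulse_response(B, 2, mesh)   # recovers the coefficient at lag 2
b0 = impulse_response(B, 0, mesh)   # vanishes at the other lags
```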
5.2 Simulation settings and results
For the simulation study we have chosen the following settings.

• We set b_h = 0 if h ∉ {0, ℓ}, where ℓ ∈ {1, 3}. Furthermore, b_{0;k} = α_0(d − k + 1) and b_{ℓ;k} = α_1(d − k + 1)^2, with α_0 and α_1 such that ‖b_0‖ = β_0 and ‖b_ℓ‖ = β_ℓ, and d = 15. We choose β_0 ∈ {0.5, 1} and β_ℓ ∈ {0.5, 1}. The curves f_1, . . . , f_d are the first d Fourier basis functions.

• We assume that the (e_t) are i.i.d. N(0, σ^2), with σ^2 ∈ {0.1, 0.5}, and suppose that X_t = Ψ X_{t−1} + ε_t, where the (ε_t) are i.i.d. N_d(0, Σ_d), and Ψ = (ψ_{ij} : 1 ≤ i, j ≤ d) satisfies ψ_{ij} = c/(ij), with ‖Ψ‖ ∈ {0, 0.7}. Obviously, ‖Ψ‖ = 0 amounts to the i.i.d. setting. Since

    E‖X_t‖^2 = ∑_{k≥1} Var(⟨X_t, f_k⟩) < ∞,

we assume that the elements of diag(Σ_d) are decaying. More precisely, we set diag(Σ_d) = (1, 1/2^i, 1/3^i, . . . , 1/d^i), with i ∈ {2, 4}.

• The sample size is n ∈ {250, 500}. The parameters p and q are set equal to 10; varying these parameters did not change the output much. Under each setting specified above, we make 100 simulation runs and use the three methods described in Section 4 for tuning the dimension parameter K. For the ET criterion we chose ε_n = 1/√n.
• We compare two measures of fit. The first is the relative absolute error of the estimators at the two non-zero lags:

    δ_err = (1/2) ( ‖b̂_0 − b_0‖/β_0 + ‖b̂_ℓ − b_ℓ‖/β_ℓ ).

The second is the mean square criterion:

    δ_MSE = (1/n) ∑_{t=1}^{n} ( Y_t − ∑_{|h|≤3} b̂_h( X_{t−h} I{1 ≤ t − h ≤ n} ) )^2.

Each setting gives a sample δ_1, . . . , δ_100, for which we compute the mean and the standard deviation.

Table 1: Mean and standard deviation of δ_err for the three methods under β_0 = β_ℓ = 1, σ^2 = 0.5 and i = 2.

    ‖Ψ‖    n   ℓ    CV mean   CV sd    FPE mean   FPE sd    TH mean   TH sd
    0    250   1      0.687   0.154       0.621    0.090      0.629   0.045
    0    250   3      0.669   0.107       0.660    0.103      0.653   0.047
    0    500   1      0.564   0.117       0.514    0.080      0.565   0.034
    0    500   3      0.608   0.113       0.593    0.070      0.596   0.038
    0.7  250   1      0.526   0.242       0.512    0.144      0.323   0.052
    0.7  250   3      0.674   0.219       0.631    0.091      0.513   0.045
    0.7  500   1      0.531   0.292       0.571    0.190      0.339   0.063
    0.7  500   3      0.592   0.144       0.552    0.099      0.483   0.039

Table 2: Mean and standard deviation of δ_MSE for the three methods under β_0 = β_ℓ = 1, σ^2 = 0.5 and i = 2.

    ‖Ψ‖    n   ℓ    CV mean   CV sd    FPE mean   FPE sd    TH mean   TH sd
    0    250   1      0.479   0.081       0.474    0.066      0.481   0.045
    0    250   3      0.507   0.065       0.514    0.069      0.512   0.045
    0    500   1      0.480   0.045       0.486    0.039      0.490   0.031
    0    500   3      0.519   0.046       0.528    0.042      0.524   0.029
    0.7  250   1      0.485   0.064       0.453    0.050      0.476   0.047
    0.7  250   3      0.541   0.096       0.518    0.065      0.531   0.050
    0.7  500   1      0.505   0.042       0.479    0.036      0.497   0.032
    0.7  500   3      0.549   0.044       0.533    0.039      0.547   0.032
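For concreteness, one representative setting of this design (ℓ = 1, β_0 = β_ℓ = 1, σ^2 = 0.5, ‖Ψ‖ = 0.7, i = 2, n = 250) could be simulated as in the following sketch; the random seed and all variable names are our own choices.

```python
import numpy as np

rng = np.random.default_rng(7)
d, n, ell = 15, 250, 1

# coefficient vectors b_{0;k} ~ (d-k+1) and b_{l;k} ~ (d-k+1)^2, normalised
# so that ||b_0|| = beta_0 = 1 and ||b_l|| = beta_l = 1
b0 = np.arange(d, 0, -1.0)
b0 /= np.linalg.norm(b0)
bl = np.arange(d, 0, -1.0) ** 2
bl /= np.linalg.norm(bl)

# VAR(1) scores: X_t = Psi X_{t-1} + eps_t, psi_ij = c/(ij) scaled so ||Psi|| = 0.7
Psi = np.fromfunction(lambda i, j: 1.0 / ((i + 1.0) * (j + 1.0)), (d, d))
Psi *= 0.7 / np.linalg.norm(Psi, 2)
Sigma = np.diag(1.0 / np.arange(1, d + 1) ** 2.0)   # diag(Sigma_d), case i = 2

X = np.zeros((n, d))
for t in range(1, n):
    X[t] = Psi @ X[t - 1] + rng.multivariate_normal(np.zeros(d), Sigma)

# scalar responses Y_t = <X_t, b_0> + <X_{t-l}, b_l> + e_t, with sigma^2 = 0.5
e = rng.normal(0.0, np.sqrt(0.5), n)
X_lag = np.vstack([np.zeros((ell, d)), X[:-ell]])
Y = X @ b0 + X_lag @ bl + e
```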
Discussion of results. Due to the large number of settings, we do not show all our results; rather, we display in Tables 1 and 2 a few selected and representative settings. Overall, we found that the relative absolute error δ_err is typically smallest when we use method TH. In particular, for the dependent setting with fast decay of the eigenvalues, this method clearly outperforms CV and FPE. Still, FPE performs best when the (X_t) are independent and when i = 2 (slower decaying eigenvalues).
The situation is quite different if we look at the model fit via δ_MSE. Then, overall, the method FPE performs best, especially under dependence, while TH throughout gives the largest errors among the three approaches. CV can slightly outperform FPE when the (X_t) are independent and when i = 2.
In conclusion, we recommend the method TH if the target is estimation. Possibly this method could be further improved by tuning the selection of the threshold ε_n. If the target is to use the model for prediction, then FPE is preferable, though it comes with larger numerical costs than TH. Method CV cannot be recommended: it is numerically expensive and was not a clear winner in any of the settings tested.
6 Proofs
It is assumed that all random elements in the sequel are defined on a common probability space (Ω, A, P). Recall that the vector space of all Hilbert–Schmidt operators acting on a Hilbert space H is itself a Hilbert space with the inner product ⟨K, L⟩_S = ∑_{m≥1} ⟨K(e_m), L(e_m)⟩, where (e_m) is any orthonormal basis of H. The tensor product x ⊗ y of x, y ∈ H is the Hilbert–Schmidt operator defined by (x ⊗ y)(z) = x⟨z, y⟩, whose norm is ‖x ⊗ y‖_S = ‖x‖‖y‖.
6.1 Auxiliary lemmas
We collect in this section several simple lemmas referred to in Sections 2 and 3, and used in the
arguments that follow.
Lemma 2. If X, Z ∈ H are square integrable and Ψ is a Hilbert–Schmidt operator, then

    ‖Cov(Ψ(X), Z)‖_S ≤ ‖Ψ‖_S ‖Cov(X, Z)‖_S.

Proof. To lighten the notation, assume EX = 0 and EZ = 0. Then, for any orthonormal basis (e_j : j ≥ 1),

    ‖Cov(Ψ(X), Z)‖^2_S = ∑_{j=1}^{∞} ‖E[Ψ(X)⟨e_j, Z⟩]‖^2 = ∑_{j=1}^{∞} ‖Ψ(E[⟨e_j, Z⟩X])‖^2,

where we used the fact that expectation commutes with any bounded operator. It follows that

    ‖Cov(Ψ(X), Z)‖^2_S ≤ ∑_{j=1}^{∞} ‖Ψ‖^2_L ‖Cov(X, Z)(e_j)‖^2 ≤ ‖Ψ‖^2_L ‖Cov(X, Z)‖^2_S.

The claim then follows because ‖Ψ‖_L ≤ ‖Ψ‖_S.
Lemma 3. Under Assumption 1, ∑_{h∈Z} ‖C^{YX}_h‖_S < ∞.

Proof. Since Cov(ε_ℓ, X_k) = 0,

    ‖C^{YX}_h‖_S = ‖ ∑_k Cov(b_k(X_{h−k}), X_0) ‖_S ≤ ∑_k ‖Cov(b_k(X_{h−k}), X_0)‖_S.

Therefore, by Lemma 2,

    ∑_h ‖C^{YX}_h‖_S ≤ ∑_h ∑_k ‖b_k‖_S ‖Cov(X_{h−k}, X_0)‖_S = ∑_k ‖b_k‖_S ∑_h ‖C^X_h‖_S,

so the claim follows from (2.3).
Lemma 4. Suppose Assumptions 1 and 2 hold. Then, for any θ ∈ [−π, π], the inverse of the operator F^X_θ is unbounded. It is defined on

    D_θ = { f ∈ H : ∑_{m≥1} |⟨f, ϕ_m(θ)⟩|^2 λ^{−2}_m(θ) < ∞ }.

Proof. To show that the inverse of F^X_θ does not exist as a bounded operator, we must find a sequence f_n → 0 such that lim inf_{n→∞} ‖(F^X_θ)^{−1}(f_n)‖ > 0. As noted in the discussion leading to (3.2), for any θ ∈ [−π, π], the operator F^X_θ is Hilbert–Schmidt, symmetric and non-negative definite. Since F^X_θ is Hilbert–Schmidt, ∑_{m≥1} λ^2_m(θ) < ∞, and thus λ_m(θ) → 0 as m → ∞. Since all eigenvalues λ_m(θ) are positive and (ϕ_m(θ) : m ≥ 1) is an orthonormal basis of H,

    (F^X_θ)^{−1}(f) = ∑_{m≥1} (1/λ_m(θ)) ⟨f, ϕ_m(θ)⟩ ϕ_m(θ),

provided the series converges, i.e. provided f ∈ D_θ. Setting f_n = λ_n(θ)ϕ_n(θ), we see that f_n → 0 and ‖(F^X_θ)^{−1}(f_n)‖ = ‖ϕ_n(θ)‖ = 1.
6.2 Proofs of Lemma 1 and Theorem 1
We begin with a lemma which allows us to bound the difference between sample and population auto– and cross–covariance operators. It extends the fundamental result that the difference between the sample and population covariance operators (h = 0, X = Y) is of order n^{−1/2}; see Bosq [2] and Horváth and Kokoszka [17]. It is a result likely to find applications in many asymptotic arguments in the context of functional time series.
Lemma 5. Suppose Assumption 3 holds. Then there is a constant κ, independent of n and h, such that E‖Ĉ^X_h − C^X_h‖_S ≤ κn^{−1/2}. If, in addition, Assumption 1 holds, then E‖Ĉ^{YX}_h − C^{YX}_h‖_S ≤ κn^{−1/2}.
Proof. We present the argument for Ĉ^X_h and h ≥ 0, which contains the key points. The result for the cross–covariance operators is established in a similar way using the lemmas of Section 6.1. We will repeatedly use the following simple relation: |⟨x_1 ⊗ y_1, x_2 ⊗ y_2⟩_S| = |⟨x_1, x_2⟩⟨y_1, y_2⟩| ≤ ‖x_1‖‖x_2‖‖y_1‖‖y_2‖.

Using stationarity and diagonal summation, we obtain

    n E‖Ĉ^X_h − C^X_h‖^2_S = ∑_{|r|<n} (1 − |r|/n) E⟨X_{r+h} ⊗ X_r − C^X_h, X_h ⊗ X_0 − C^X_h⟩_S.

By Definition 1, if r ∈ {0, . . . , h − 1}, then X^{(r)}_{r+h} is independent of X_r, X_h and X_0. It follows easily that E⟨X^{(r)}_{r+h} ⊗ X_r, X_h ⊗ X_0⟩_S = 0. Hence the summands above are bounded by

    |E⟨X_{r+h} ⊗ X_r, X_h ⊗ X_0⟩_S| = |E⟨(X_{r+h} − X^{(r)}_{r+h}) ⊗ X_r, X_h ⊗ X_0⟩_S| ≤ ν^3_4(X_0) ν_4(X_0 − X^{(r)}_0).

When r ≥ h we get

    |E⟨X_{r+h} ⊗ X_r − C^X_h, X_h ⊗ X_0 − C^X_h⟩_S|
      = |E⟨X_{r+h} ⊗ X_r − [X_{r+h} ⊗ X_r]^{(r)}, X_h ⊗ X_0 − C^X_h⟩_S|
      ≤ E[ ‖X_{r+h} ⊗ X_r − X^{(r)}_{r+h} ⊗ X^{(r−h)}_r‖_S ( ‖X_h ⊗ X_0‖_S + ‖C^X_h‖_S ) ]
      ≤ 2 [ E‖X_{r+h} ⊗ X_r − X^{(r)}_{r+h} ⊗ X^{(r−h)}_r‖^2_S ]^{1/2} ν^2_4(X_0),

where [X_{r+h} ⊗ X_r]^{(r)} := X^{(r)}_{r+h} ⊗ X^{(r−h)}_r, and where we used

    E( ‖X_h ⊗ X_0‖_S + ‖C^X_h‖_S )^2 ≤ 2E‖X_h ⊗ X_0‖^2_S + 2‖C^X_h‖^2_S ≤ 2E‖X_h‖^2‖X_0‖^2 + 2(E‖X_h‖‖X_0‖)^2
      ≤ 4E‖X_h‖^2‖X_0‖^2 ≤ 4(E‖X_h‖^4)^{1/2}(E‖X_0‖^4)^{1/2} = 4ν^4_4(X_0).

Some further basic estimates show that

    [ E‖X_{r+h} ⊗ X_r − X^{(r)}_{r+h} ⊗ X^{(r−h)}_r‖^2_S ]^{1/2} ≤ √2 ν_4(X_0) [ ν_4(X_0 − X^{(r−h)}_0) + ν_4(X_0 − X^{(r)}_0) ].

Similar estimates can be obtained when r < 0, and the result follows from (2.6).
It is convenient to introduce the following remainder terms:

    τ^X(h) = ∑_{|k|≥h} ‖C^X_k‖_S;   τ^{YX}(h) = ∑_{|k|≥h} ‖C^{YX}_k‖_S;   τ^b(h) = ∑_{|k|≥h} ‖b_k‖_S.
Proof of Lemma 1. By repeated application of the triangle inequality, we obtain

    sup_{θ∈[−π,π]} ‖F̂^X_θ − F^X_θ‖_L ≤ (1/2π) ( ∑_{|h|≤q} ‖Ĉ^X_h − C^X_h‖_L + ∑_{|h|≤q} |1 − ω_q(h)| ‖C^X_h‖_L + τ^X(q) ).

Since, by Lemma 5, ∑_{|h|≤q} E‖Ĉ^X_h − C^X_h‖_L = O(qn^{−1/2}), the first term tends to zero. The term ∑_{|h|≤q} |1 − ω_q(h)| ‖C^X_h‖_L tends to zero by (2.3), Assumption 4 and dominated convergence. Again by (2.3), it follows that τ^X(q) → 0. For example, one may then choose

    ψ^X_n = ( qn^{−1/2} + ∑_{|h|≤q} |1 − ω_q(h)| ‖C^X_h‖_L + τ^X(q) )^{1−γ},   γ ∈ (0, 1).

The same arguments apply to the spectral cross–density operators.
Proof of Theorem 1. Since

    max_{h∈Z} ‖b̂_h − b_h‖_S ≤ (1/2π) ∫_{−π}^{π} ‖B̂_θ − B_θ‖_S dθ,

we focus on the estimation of the frequency response operator B_θ. Define

    B̃_θ = B̃_θ(K̂) = ∑_{m≤K̂} F^{YX}_θ ( (1/λ_m(θ)) ϕ_m(θ) ⊗ ϕ_m(θ) ).

Then

    (1/2) ‖B̂_θ − B_θ‖^2_S ≤ ‖B̂_θ − B̃_θ‖^2_S + ‖B̃_θ − B_θ‖^2_S.

Since, by (3.3),

    ∑_{ℓ≥1} (1/λ^2_ℓ(θ)) ‖F^{YX}_θ(ϕ_ℓ(θ))‖^2 = ‖B_θ‖^2_S ≤ ( ∑_{k∈Z} ‖b_k‖_S )^2 < ∞,

we see that

    ‖B̃_θ − B_θ‖^2_S = ∑_{ℓ>K̂} (1/λ^2_ℓ(θ)) ‖F^{YX}_θ(ϕ_ℓ(θ))‖^2 → 0   as K̂ → ∞.

Thus, it remains to prove that

    (1/2π) ∫_{−π}^{π} ‖B̂_θ − B̃_θ‖_S dθ →_P 0   (6.1)

and that K̂ →_P ∞. Condition (6.1) can be replaced by

    ∫_{−π}^{π} ‖B̂_θ − B̃_θ‖_S dθ × I{A_n} →_P 0,   (6.2)

where the event A_n ∈ A is defined as

    A_n := { sup_θ ‖F̂^X_θ − F^X_θ‖_L ≤ ψ^X_n } ∩ { sup_θ ‖F̂^{YX}_θ − F^{YX}_θ‖_L ≤ ψ^{YX}_n }.

This is because, by Lemma 1, we have P(A_n) → 1.
We have

    B̂_θ − B̃_θ = ∑_{m=1}^{K̂} [ F̂^{YX}_θ ( (1/λ̂_m(θ)) ϕ̂_m(θ) ⊗ ϕ̂_m(θ) ) − F^{YX}_θ ( (1/λ_m(θ)) ϕ_m(θ) ⊗ ϕ_m(θ) ) ]
      = ∑_{m=1}^{K̂} F̂^{YX}_θ ( (1/λ̂_m(θ)) ϕ̂_m(θ) ⊗ ϕ̂_m(θ) − (1/λ_m(θ)) ϕ_m(θ) ⊗ ϕ_m(θ) )
        + ∑_{m=1}^{K̂} ( F̂^{YX}_θ − F^{YX}_θ ) ( (1/λ_m(θ)) ϕ_m(θ) ⊗ ϕ_m(θ) ).

Thus, using ‖F G‖_S ≤ ‖F‖_L ‖G‖_S, we get

    ‖B̂_θ − B̃_θ‖_S ≤ ‖F̂^{YX}_θ‖_L ‖ ∑_{m=1}^{K̂} ( (1/λ̂_m(θ)) ϕ̂_m(θ) ⊗ ϕ̂_m(θ) − (1/λ_m(θ)) ϕ_m(θ) ⊗ ϕ_m(θ) ) ‖_S
      + ‖F̂^{YX}_θ − F^{YX}_θ‖_L ( ∑_{m=1}^{K̂} 1/λ^2_m(θ) )^{1/2}.

Since we have sup_{θ∈[−π,π]} ‖F̂^{YX}_θ‖_L ≤ (1/π) τ̂^{YX}(0), relation (6.2) follows from

    ∫_{−π}^{π} ‖ ∑_{m=1}^{K̂} ( (1/λ̂_m(θ)) ϕ̂_m(θ) ⊗ ϕ̂_m(θ) − (1/λ_m(θ)) ϕ_m(θ) ⊗ ϕ_m(θ) ) ‖_S dθ × I{A_n} = o_P(1)   (6.3)

and

    ψ^{YX}_n ∫_{−π}^{π} W^{K̂}_λ(θ) dθ = O_P(1).   (6.4)

Relation (6.4) is immediate from the condition K̂ ≤ K̂^{(2)}.
Some routine estimates show that the integrand in (6.3) is bounded by

    ( 2 ∑_{m=1}^{K̂} (1/λ_m(θ)) ‖ϕ̂_m(θ) − ĉ_m(θ)ϕ_m(θ)‖ + ∑_{m=1}^{K̂} |λ̂_m(θ) − λ_m(θ)| / (λ̂_m(θ)λ_m(θ)) ) × I{A_n},   (6.5)

where ĉ_m(θ) is given as in Section 3. By Lemma 3.2 in Hörmann and Kokoszka [13] we have

    ‖ϕ̂_m(θ) − ĉ_m(θ)ϕ_m(θ)‖ ≤ ( 2√2 / α_m(θ) ) sup_{θ∈[−π,π]} ‖F̂^X_θ − F^X_θ‖_L

and

    sup_{θ∈[−π,π]} sup_{m≥1} |λ̂_m(θ) − λ_m(θ)| ≤ sup_{θ∈[−π,π]} ‖F̂^X_θ − F^X_θ‖_L.   (6.6)
Thus we obtain the bound

    4√2 ∑_{m=1}^{K̂} ( ψ^X_n / λ_m(θ) ) [ 1/α_m(θ) + 1/λ̂_m(θ) ] × I{A_n}   (6.7)

for (6.5). We further remark that on A_n we have λ̂_m(θ) ≥ λ_m(θ) − |λ̂_m(θ) − λ_m(θ)| ≥ λ_m(θ) − ψ^X_n. Therefore, since K̂ ≤ K̂^{(1)} guarantees λ̂_m(θ) ≥ 2ψ^X_n, we have 1/λ̂_m(θ) ≤ 2/λ_m(θ), so that (6.7) is bounded by

    4√2 ∑_{m=1}^{K̂} ( ψ^X_n / λ_m(θ) ) [ 1/α_m(θ) + 2/λ_m(θ) ] ≤ 4√2 ψ^X_n ( W^{K̂}_λ(θ) W^{K̂}_α(θ) + 2 ( W^{K̂}_λ(θ) )^2 ),   (6.8)

where we have made use of the Cauchy–Schwarz inequality in the last step. Using K̂ ≤ K̂^{(3)} and K̂ ≤ K̂^{(4)}, it is now easy to infer that (6.2) holds.
It remains to show that K̂ →_P ∞, i.e. that K̂^{(i)} →_P ∞ for 1 ≤ i ≤ 4.

Fix a large k and observe that P(K̂^{(1)} ≥ k) = P(inf_θ λ̂_k(θ) ≥ 2ψ^X_n). Now define B_{k;n} := { sup_θ |λ̂_k(θ) − λ_k(θ)| ≤ δ_k/2 }, where δ_k := inf_θ λ_k(θ). From Assumption 2 it follows that δ_k > 0. Furthermore, it follows from Lemma 1 and (6.6) that P(B_{k;n}) → 1 as n → ∞. On the other hand, inf_θ λ̂_k(θ) ≥ inf_θ λ_k(θ) − sup_θ |λ̂_k(θ) − λ_k(θ)|, so that on B_{k;n} we have inf_θ λ̂_k(θ) ≥ δ_k/2. Hence, for n large enough, we have inf_θ λ̂_k(θ) ≥ 2ψ^X_n on B_{k;n}. Consequently, P(K̂^{(1)} ≥ k) → 1 as n → ∞, irrespective of how large k was chosen.

Now we prove K̂^{(4)} →_P ∞. Fix again a large k and notice that it suffices to show that P( ∫_{−π}^{π} ( min_{1≤m≤k} α̂_m(θ) )^{−2} dθ > x_n ) → 0 for any x_n → ∞. Define B′_{k;n} := { sup_θ |α̂_k(θ) − α_k(θ)| ≤ δ′_k/2 }, where δ′_k := inf_θ α_k(θ), and set A_{k;n} = ∩_{m=1}^{k} B′_{m;n}. Then for any fixed k we have P(A_{k;n}) → 1, and on A_{k;n} it holds that min_{1≤m≤k} α̂_m(θ) ≥ min_{1≤m≤k} δ′_m/2 =: r_k. By Assumption 5, r_k > 0 for any k. Hence, on A_{k;n} the integral ∫_{−π}^{π} ( min_{1≤m≤k} α̂_m(θ) )^{−2} dθ is bounded by 2π/r_k^2, and this is smaller than x_n when n is big enough. This proves K̂^{(4)} →_P ∞.

The proofs of K̂^{(2)} →_P ∞ and K̂^{(3)} →_P ∞ are similar and therefore omitted.
1 Appendix
In this appendix, we derive the FPE method for selecting the dimension parameter K used in Sections 3 and 4. In Section 1.1, we discuss the relation of our spectral approach to time domain estimation in functional regression. This motivates the derivation of the FPE method in Section 1.2. Section 1.3 contains the proofs of two results stated in Sections 1.1 and 1.2.
1.1 Relation to ordinary functional regression
As before we consider complex Hilbert spaces H and H ′ and define for elements (a, f), (b, g) ∈ H ′×Hdefine [(a, f), (b, g)] = 〈a, b〉+ 〈f, g〉. This defines an inner product on H ′×H, and with it the space
becomes a Hilbert space. Let us fix a frequency θ ∈ [−π, π], and define a zero mean complex random
55
element ∆ = (Υ,Ξ) ∈ L2H′×H such that
C∆ = E∆⊗∆ =
(CΥ CΥΞ
CΞΥ CΞ
)=
(FYθ FY XθFXYθ FXθ
). (1.1)
Now we regress Υ on Ξ, i.e. we seek h_0 ∈ L(H, H′) (the space of bounded linear operators from H to H′) which satisfies

    h_0 = argmin_{h∈L(H,H′)} E‖Υ − h(Ξ)‖^2.

Then, by the usual projection arguments, h_0 solves the equation C^{ΥΞ} = h_0 C^Ξ. By the definition of C^{ΥΞ} and C^Ξ, it follows that h_0 is also the solution to (3.1) and hence, by Assumption 2, is equal to B_θ. Consequently, h_0, or equivalently B_θ, can also be estimated from a random sample ((Υ_k, Ξ_k) : 1 ≤ k ≤ L) by standard methods known from functional linear models. A typical estimator (see e.g. Cardot et al. [6]) is

    ĥ_{0;d}(f) = ∑_{ℓ=1}^{d} ( Ĉ^{ΥΞ}(v̂_ℓ) / γ̂_ℓ ) ⟨f, v̂_ℓ⟩ =: ∑_{ℓ=1}^{d} b̂_ℓ ⟨f, v̂_ℓ⟩,   (1.2)

where Ĉ^{ΥΞ}(f) := (1/L) ∑_{k=1}^{L} Υ_k⟨f, Ξ_k⟩, and γ̂_ℓ and v̂_ℓ are the eigenvalues and eigenvectors of Ĉ^Ξ(f) := (1/L) ∑_{k=1}^{L} Ξ_k⟨f, Ξ_k⟩.
In practice we do not know C^∆ but, as we will see in Lemma 6 below, it can be consistently estimated from the data, which in turn allows us to generate a random sample ((Υ_i, Ξ_i) : 1 ≤ i ≤ L) with a covariance which is asymptotically equal to C^∆. A more direct approach is to define the functional discrete Fourier transforms

    Υ_{k|p} = (1/√(2πp)) ∑_{t=p(k−1)+1}^{pk} Y_t e^{−i(t−p(k−1))θ}   and   Ξ_{k|p} = (1/√(2πp)) ∑_{t=p(k−1)+1}^{pk} X_t e^{−i(t−p(k−1))θ}.

If we denote by Ĉ^{ΥΞ}_p and Ĉ^Ξ_p the cross-covariance and covariance operators related to the sequence ((Υ_{k|p}, Ξ_{k|p}) : 1 ≤ k ≤ L), the following lemma holds.
Lemma 6. Consider the estimator F̂^X_{θ|p} with the Bartlett weights w_p(h) = 1 − |h|/p. Under Assumption 3 we have ‖F̂^X_{θ|p} − Ĉ^Ξ_p‖^2_S = O_P(p^3/n). Under the same conditions we have ‖F̂^{YX}_{θ|p} − Ĉ^{ΥΞ}_p‖^2_S = O_P(p^3/n).
The lemma, which we prove in Section 1.3, confirms that computing (1.2) from the variables (Υ_{k|p}, Ξ_{k|p}), which serve as an approximation to a random sample (Υ_k, Ξ_k), yields an estimator which closely resembles B̂_{θ|p,p,d} in (3.4).
1.2 Description of the FPE approach
In order to keep this discussion short, we only consider the scalar response case, in line with our simulation study. Our starting point is the alternative interpretation of B_θ discussed in Section 1.1. Suppose we have an estimator ĥ_{0;d} of B_θ from a sample ((Υ_k, Ξ_k) : 1 ≤ k ≤ L). Now we pick (Υ, Ξ) independent of this sample and set K = K_θ = argmin_{d≥0} E|Υ − ĥ_{0;d}(Ξ)|^2. Note that here, by the Riesz representation theorem, ĥ_{0;d}(Ξ) is of the form ⟨Ξ, ĥ_{0;d}⟩. With d = K in (1.2) we minimize the mean squared prediction error in this functional regression. The related model selection criterion is commonly known as the final prediction error (FPE) criterion. Of course, computing K explicitly is mathematically infeasible, and therefore we resort to an approximation. For this purpose, we first note that the coefficients b̂_ℓ in (1.2) satisfy

    b̂_d := (b̂_1, . . . , b̂_d)′ = argmin_{(b_1,...,b_d)∈C^d} ∑_{i=1}^{L} | Υ_i − ∑_{ℓ=1}^{d} b_ℓ ⟨Ξ_i, v̂_ℓ⟩ |^2.

Our problem is greatly simplified if we replace the empirical principal component scores by the population ones and set

    b̃_d := (b̃_1, . . . , b̃_d)′ = argmin_{(b_1,...,b_d)∈C^d} ∑_{i=1}^{L} | Υ_i − ∑_{ℓ=1}^{d} b_ℓ ⟨Ξ_i, v_ℓ⟩ |^2,

and then define h̃_{0;d}(Ξ) = ∑_{ℓ=1}^{d} b̃_ℓ ⟨Ξ, v_ℓ⟩ and K̃ = argmin_{d≥0} E|Υ − h̃_{0;d}(Ξ)|^2.
Proposition 1. Suppose that ((Υ_i, Ξ_i) : 1 ≤ i ≤ L) constitute a Gaussian random sample with circularly-symmetric observations, i.e. E∆[∆, (a, f)] = 0 for any (a, f) ∈ H′ × H. Then for L > d we have

    E|Υ − h̃_{0;d}(Ξ)|^2 = σ^2_d × L/(L − d),

where σ^2_d = (1/(L − d)) E(Υ − X b̃_d)^∗(Υ − X b̃_d), with X = (⟨Ξ_i, v_ℓ⟩ : 1 ≤ i ≤ L; 1 ≤ ℓ ≤ d) and Υ = (Υ_1, . . . , Υ_L)′.
The proof of this proposition is given in Section 1.3. Assuming Gaussianity is not a restriction, since our estimator only relies on the second order structure of the data. Furthermore, by Panaretos and Tavakoli [24] we know that under general dependence assumptions the discrete Fourier transforms Υ_{i|p} and Ξ_{i|p} are asymptotically (p → ∞) complex normal random elements.
The proposition then suggests choosing d such that σ^2_d × L/(L − d) is minimized. An unbiased estimate of the unknown σ^2_d is

    (1/(L − d)) (Υ − X b̃_d)^∗(Υ − X b̃_d).

Finally, replacing the theoretical scores by their empirical counterparts leads to the following dimension selection:

    K̂ = argmin_{0≤d<L} ( L/(L − d)^2 ) (Υ̂ − X̂ b̂_d)^∗(Υ̂ − X̂ b̂_d),   (1.3)

where X̂ = (⟨Ξ_{i|p}, v̂_ℓ⟩ : 1 ≤ i ≤ L; 1 ≤ ℓ ≤ d) and Υ̂ = (Υ_{1|p}, . . . , Υ_{L|p})′.
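A minimal numerical sketch of the selection rule (1.3), with the score matrix taken as given and all names our own, might look as follows:

```python
import numpy as np

def fpe_select(Upsilon, scores):
    """FPE dimension choice (1.3): minimise L/(L-d)^2 times the residual sum of
    squares of the least-squares fit of Upsilon on the first d score columns."""
    L, dmax = scores.shape
    best_d, best_val = 0, np.inf
    for d in range(0, min(dmax, L - 1) + 1):
        if d == 0:
            resid = Upsilon
        else:
            b, *_ = np.linalg.lstsq(scores[:, :d], Upsilon, rcond=None)
            resid = Upsilon - scores[:, :d] @ b
        val = L / (L - d) ** 2 * np.real(resid.conj() @ resid)
        if val < best_val:
            best_d, best_val = d, val
    return best_d

rng = np.random.default_rng(3)
S = rng.standard_normal((100, 5)) + 1j * rng.standard_normal((100, 5))
```

The penalty L/(L − d)^2 grows with d, so adding score directions that do not reduce the residual sum of squares is discouraged.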
1.3 Proofs of Lemma 6 and Proposition 1
Proof of Lemma 6. We define the block-based estimators

    C̃^X_h = (1/(Lp)) ∑_{k=0}^{L−1} ( ∑_{t=1}^{p−h} X_{t+h+kp} ⊗ X_{t+kp} )   for 0 ≤ h < p,

and

    C̃^X_h = (1/(Lp)) ∑_{k=0}^{L−1} ( ∑_{t=|h|+1}^{p} X_{t−|h|+kp} ⊗ X_{t+kp} )   for −p < h < 0.

Direct verification shows that

    Ĉ^Ξ_p = (1/2π) ∑_{|h|<p} C̃^X_h e^{−ihθ}.

For two random operators A_n and B_n we write A_n = B_n + O_P(m_n) if ‖A_n − B_n‖_S = O_P(m_n). Then, for p > h ≥ 0, we deduce with the help of Lemma 5 that

    n Ĉ^X_h − Lp C̃^X_h = ∑_{k=0}^{L−1} ( ∑_{t=p−h+1}^{p} X_{t+h+kp} ⊗ X_{t+kp} ) + ∑_{t=Lp+1}^{n−h} X_{t+h} ⊗ X_t = Lh ( C^X_h + O_P(L^{−1/2}) ).

The same bound can be derived for h < 0. Thus,

    C̃^X_h = (1 − |h|/p) C^X_h + ( n/(Lp) − 1 ) C^X_h + O_P(L^{−1/2}),

and since n/(Lp) − 1 ≤ p/(n − p), we have that

    C̃^X_h = (1 − |h|/p) C^X_h + O_P( (p/n)^{1/2} ).

Since, by Lemma 5, also (1 − |h|/p) Ĉ^X_h = (1 − |h|/p) C^X_h + O_P(n^{−1/2}), summing over the 2p − 1 lags |h| < p, we conclude that ‖F̂^X_{θ|p} − Ĉ^Ξ_p‖^2_S = O_P(p^3/n). A similar bound can be obtained for F̂^{YX}_{θ|p} − Ĉ^{ΥΞ}_p. This proves Lemma 6.
Proof of Proposition 1. We have

    E|Υ − h̃_{0;d}(Ξ)|^2 = E| Υ − ∑_{ℓ=1}^{d} b̃_ℓ⟨Ξ, v_ℓ⟩ |^2 = E| ∑_{ℓ=1}^{d} (b_ℓ − b̃_ℓ)⟨Ξ, v_ℓ⟩ + Z |^2,   (1.4)

where Z = (Υ − ⟨Ξ, h_0⟩) + ∑_{ℓ>d} b_ℓ⟨Ξ, v_ℓ⟩. We set ε = Υ − ⟨Ξ, h_0⟩. By the projection theorem it follows that Cov(ε, Ξ) = 0. Furthermore, since the b̃_ℓ are independent of Ξ, and since principal component scores are orthogonal, it follows that (1.4) equals

    E| ∑_{ℓ=1}^{d} (b_ℓ − b̃_ℓ)⟨Ξ, v_ℓ⟩ |^2 + E|Z|^2 = ∑_{ℓ=1}^{d} E|b_ℓ − b̃_ℓ|^2 γ_ℓ + E|Z|^2.

With Γ = diag(γ_1, . . . , γ_d), Z = (Z_1, . . . , Z_L)′ and Z_i = ε_i + ∑_{ℓ>d} b_ℓ⟨Ξ_i, v_ℓ⟩, we get

    ∑_{ℓ=1}^{d} E|b_ℓ − b̃_ℓ|^2 γ_ℓ = E[ (b̃_d − b_d)^∗ Γ (b̃_d − b_d) ]
      = E[ Z^∗ X (X^∗X)^{−1} Γ (X^∗X)^{−1} X^∗ Z ]
      = tr( Γ E[ (X^∗X)^{−1} X^∗ Z Z^∗ X (X^∗X)^{−1} ] ).   (1.5)

We have E[Z Z^∗] = E|Z|^2 I_L. The imposed circular symmetry implies that

    EΥ⟨Ξ, f⟩ = 0   and   E⟨Ξ, f⟩⟨Ξ, g⟩ = 0   for all f, g ∈ H.   (1.6)

Consequently, by Gaussianity it follows that Z and X are independent. (Note that two complex Gaussian random variables U_1 and U_2, say, are independent if and only if Cov(U_1, U_2) = Cov(U_1, Ū_2) = 0.) We can therefore conclude by a simple conditioning argument that (1.5) simplifies to

    E|Z|^2 tr( E[ ( (XΓ^{−1/2})^∗(XΓ^{−1/2}) )^{−1} ] ) =: E|Z|^2 tr(E W^{−1}).

The matrix W^{−1} is an inverse complex Wishart matrix with expectation E W^{−1} = I_d/(L − d). Thus

    E|Υ − h̃_{0;d}(Ξ)|^2 = E|Z|^2 × L/(L − d).
Bibliography
[1] A. Aue, S. Hormann, L. Horvath, and M. Reimherr. Break detection in the covariance structure
of multivariate time series models. The Annals of Statistics, 37:4046–4087, 2009.
[2] D. Bosq. Linear Processes in Function Spaces. Springer, 2000.
[3] G. E. P. Box, G. M. Jenkins, and G. C. Reinsel. Time Series Analysis: Forecasting and Control.
Prentice Hall, Englewood Cliffs, third edition, 1994.
[4] D. R. Brillinger. Time Series: Data Analysis and Theory. Holt, New York, 1975.
[5] T. Cai and P. Hall. Prediction in functional linear regression. The Annals of Statistics, 34:
2159–2179, 2006.
[6] H. Cardot, F. Ferraty, and P. Sarda. Functional linear model. Statistics and Probability Letters,
45:11–22, 1999.
[7] H. Cardot, F. Ferraty, A. Mas, and P. Sarda. Testing hypothesis in the functional linear model.
Scandinavian Journal of Statistics, 30:241–255, 2003.
[8] J-M. Chiou and H-G. Muller. Diagnostics for functional regression via residual processes.
Computational Statistics and Data Analysis, 15:4849–4863, 2007.
[9] F. Comte and J. Johannes. Adaptive functional linear regression. The Annals of Statistics, 40:
2765–2797, 2012.
59
[10] C. Crambes, A. Kneip, and P. Sarda. Smoothing splines estimators for functional linear regres-
sion. The Annals of Statistics, 37:35–72, 2009.
[11] R. Gabrys, L. Horvath, and P. Kokoszka. Tests for error correlation in the functional linear
model. Journal of the American Statistical Association, 105:1113–1125, 2010.
[12] S. Hormann and L. Kidzinski. A note on estimation in Hilbertian linear models. Scandinavian
Journal of Statistics, 2014. Forthcoming.
[13] S. Hormann and P. Kokoszka. Weakly dependent functional data. The Annals of Statistics, 38:
1845–1884, 2010.
[14] S. Hormann and P. Kokoszka. Functional time series. In C. R. Rao and T. Subba Rao, editors,
Time Series, volume 30 of Handbook of Statistics. Elsevier, 2012.
[15] S. Hormann, L. Horvath, and R. Reeder. A functional version of the ARCH model. Econometric
Theory, 29:267–288, 2013.
[16] S. Hormann, L. Kidzinski, and M. Hallin. Dynamic functional principal components. Journal
of the Royal Statistical Society: Series B, 2014. Forthcoming.
[17] L. Horvath and P. Kokoszka. Inference for Functional Data with Applications. Springer, 2012.
[18] G. M. James, J. Wang, and J. Zhu. Functional linear regression that’s interpretable. The
Annals of Statistics, 37:2083–2108, 2009.
[19] P. Kokoszka and M. Reimherr. Predictability of shapes of intraday price curves. The Econo-
metrics Journal, 16:285–308, 2013.
[20] A. N. Kolmogorov. Interpolation und Extrapolation von stationären zufälligen Folgen. Bull.
Acad. Sci. U.S.S.R., 5:3–14, 1941.
[21] Y. Li and T. Hsing. On rates of convergence in functional linear regression. Journal of Multi-
variate Analysis, 98:1782–1804, 2007.
[22] I. McKeague and B. Sen. Fractals with point impacts in functional linear regression. The
Annals of Statistics, 38:2559–2586, 2010.
[23] H.-G. Müller and U. Stadtmüller. Generalized functional linear models. The Annals of Statistics,
33:774–805, 2005.
[24] V. M. Panaretos and S. Tavakoli. Fourier analysis of stationary time series in function space.
The Annals of Statistics, 41:568–603, 2013.
[25] V. M. Panaretos and S. Tavakoli. Cramér–Karhunen–Loève representation and harmonic principal
component analysis of functional time series. Stochastic Processes and their Applications,
123:2779–2807, 2013.
[26] M. B. Priestley. Spectral Analysis and Time Series. Academic Press, 1981.
[27] J. O. Ramsay and B. W. Silverman. Functional Data Analysis. Springer, 2005.
[28] X. Shao and W. B. Wu. Asymptotic spectral theory for nonlinear time series. The Annals of
Statistics, 35:1773–1801, 2007.
[29] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications with R Examples.
Springer, 2011.
[30] N. Wiener. The Extrapolation, Interpolation and Smoothing of Stationary Time Series with
Engineering Applications. Wiley, 1949.
[31] W. B. Wu. Nonlinear system theory: another look at dependence. Proceedings of the National
Academy of Sciences of the United States of America, 102, 2005.
[32] F. Yao, H.-G. Müller, and J.-L. Wang. Functional linear regression analysis for longitudinal data.
The Annals of Statistics, 33:2873–2903, 2005.
Appendix A
Dynamic Functional Principal Components
Dynamic Functional Principal Components∗
Siegfried Hörmann1, Łukasz Kidziński1, Marc Hallin2,3
1 Department of Mathematics, Université libre de Bruxelles (ULB), CP 210, Bd. du Triomphe, B-1050
Brussels, Belgium.
2 ECARES, Université libre de Bruxelles (ULB), CP 114/04, 50 avenue F.D. Roosevelt, B-1050 Brussels,
Belgium.
3 ORFE, Princeton University, Sherrerd Hall, Princeton, NJ 08540, USA.
Abstract
In this paper, we address the problem of dimension reduction for time series of functional data
(Xt : t ∈ Z). Such functional time series frequently arise, e.g., when a continuous-time process is
segmented into some smaller natural units, such as days. Then each Xt represents one intraday curve. We argue that
functional principal component analysis (FPCA), though a key technique in the field and a benchmark for
any competitor, does not provide an adequate dimension reduction in a time-series setting. FPCA indeed
is a static procedure which ignores the essential information provided by the serial dependence structure of
the functional data under study. Therefore, inspired by Brillinger’s theory of dynamic principal components,
we propose a dynamic version of FPCA, which is based on a frequency-domain approach. By means of a
simulation study and an empirical illustration, we show the considerable improvement the dynamic approach
entails when compared to the usual static procedure.
Keywords. Dimension reduction, frequency domain analysis, functional data analysis, functional
time series, functional spectral analysis, principal components, Karhunen-Loève expansion.
1 Introduction
The tremendous technical improvements in data collection and storage allow us to obtain an increasingly
complete picture of many common phenomena. In principle, most processes in real life are continuous
in time and, with improved data acquisition techniques, they can be recorded at arbitrarily high
frequency. To benefit from this increasing amount of information, we need appropriate statistical tools
that help extract the most important characteristics of some possibly high-dimensional specifications.
Functional data analysis (FDA), in recent years, has proven to be an appropriate tool in many
such cases and has consequently evolved into a very important field of research in the statistical
community.
Typically, functional data are considered as realizations of (smooth) random curves. Then every
observation X is a curve (X(u) : u ∈ U). One generally assumes, for simplicity, that U = [0, 1], but
U could be a more complex domain like a cube or the surface of a sphere. Since observations are
functions, we are dealing with high-dimensional – in fact intrinsically infinite-dimensional – objects.
So, not surprisingly, there is a demand for efficient data-reduction techniques. As such, functional
∗Manuscript has been accepted for publication in the Journal of the Royal Statistical Society: Series B
principal component analysis (FPCA) has taken a leading role in FDA, and functional principal
components (FPC) arguably can be seen as the key technique in the field.
In analogy to classical multivariate PCA (see Jolliffe [22]), functional PCA relies on an
eigendecomposition of the underlying covariance function. The mathematical foundations for this
were laid several decades ago in the pioneering papers by Karhunen [23] and Loève [26], but it took
a while until the method was popularized in the statistical community. Some earlier contributions
are Besse and Ramsay [5], Ramsay and Dalzell [30] and, later, the influential books by Ramsay and
Silverman [31], [32] and Ferraty and Vieu [11]. Statisticians have been working on problems related
to estimation and inference (Kneip and Utikal [24], Benko et al. [3]), asymptotics (Dauxois et al. [10]
and Hall and Hosseini-Nasab [15]), smoothing techniques (Silverman [34]), sparse data (James et
al. [21], Hall et al. [16]), and robustness issues (Locantore et al. [25], Gervini [12]), to name just a
few. Important applications include FPC-based estimation of functional linear models (Cardot et
al. [9], Reiss and Ogden [33]) or forecasting (Hyndman and Ullah [20], Aue et al. [1]). The usefulness
of functional PCA has also been recognized in other scientific disciplines, like chemical engineering
(Gokulakrishnan et al. [14]) or functional magnetic resonance imaging (Aston and Kirch [2], Viviani
et al. [37]). Many more references can be found in the above-cited papers and in Sections 8–10 of
Ramsay and Silverman [32], to which we refer for background reading.
Most existing concepts and methods in FDA, even though they may tolerate some amount of
serial dependence, have been developed for independent observations. This is a serious weakness, as
in numerous applications the functional data under study are obviously dependent, either in time or
in space. Examples include daily curves of financial transactions, daily patterns of geophysical and
environmental data, annual temperatures measured on the surface of the earth, etc. In such cases,
we should view the data as the realization of a functional time series (Xt(u) : t ∈ Z), where the time
parameter t is discrete and the parameter u is continuous. For example, in case of daily observations,
the curve Xt(u) may be viewed as the observation on day t with intraday time parameter u. A key
reference on functional time series techniques is Bosq [8], who studied functional versions of AR
processes. We also refer to Hörmann and Kokoszka [19] for a survey.
Ignoring serial dependence in this time-series context may result in misleading conclusions and
inefficient procedures. Hörmann and Kokoszka [18] investigate the robustness properties of some
classical FDA methods in the presence of serial dependence. Among others, they show that the usual
FPCs can still be consistently estimated within a quite general dependence framework. Then the
basic problem, however, is not about consistently estimating traditional FPCs: the problem is that,
in a time-series context, traditional FPCs are not the adequate concept of dimension reduction
anymore – a fact which, since the seminal work of Brillinger [6], is well recognized in the usual
vector time-series setting. FPCA indeed operates in a static way: when applied to serially dependent
curves, it fails to take into account the potentially very valuable information carried by the past
values of the functional observations under study. In particular, a static FPC with small eigenvalue,
hence negligible instantaneous impact on Xt, may have a major impact on Xt+1, and high predictive
value.
Besides their failure to produce optimal dimension reduction, static FPCs, while cross-sectionally
uncorrelated at fixed time t, typically still exhibit lagged cross-correlations. Therefore the resulting
FPC scores cannot be analyzed componentwise as in the i.i.d. case, but need to be considered as
vector time series which are less easy to handle and interpret.
These major shortcomings are motivating the present development of dynamic functional prin-
cipal components (dynamic FPCs). The idea is to transform the functional time series into a vector
time series (of low dimension, ≤ 4, say), where the individual component processes are mutually
uncorrelated (at all leads and lags; autocorrelation is allowed, though), and account for most of
the dynamics and variability of the original process. The analysis of the functional time series can
then be performed on those dynamic FPCs; thanks to their mutual orthogonality, dynamic FPCs
moreover can be analyzed componentwise. In analogy to static FPCA, the curves can be optimally
reconstructed/approximated from the low-dimensional dynamic FPCs via a dynamic version of the
celebrated Karhunen-Loève expansion.
Dynamic principal components first have been suggested by Brillinger [6] for vector time series.
The purpose of this article is to develop and study a similar approach in a functional setup. The
methodology relies on a frequency-domain analysis for functional data, a topic which is still in its
infancy (see, for instance, Panaretos and Tavakoli [27]).
The rest of the paper is organized as follows. In Section 2 we give a first illustration of the
procedure and sketch two typical applications. In Section 3, we describe our approach and state
a number of relevant propositions. We also provide some asymptotic features. In Section 4, we
discuss its computational implementation. After an illustration of the methodology by a real data
example on pollution curves in Section 5, we evaluate our approach in a simulation study (Section 6).
Appendices A and B detail the mathematical framework and contain the proofs. Some of the more
technical results and proofs are provided in a supplementary document.
After the present paper (which has been available on arXiv since October 2012) was submitted,
another paper by Panaretos and Tavakoli [28] was published, where similar ideas are proposed. While
both papers aim at the same objective of a functional extension of Brillinger’s concept, there are
essential differences between the solutions developed. The main result in Panaretos and Tavakoli [28]
is the existence of a functional process (X∗t ) of rank q which serves as an “optimal approximation”
to the process (Xt) under study. The construction of (X∗t ), which is mathematically quite elegant,
is based on stochastic integration with respect to some orthogonal-increment (functional) stochas-
tic process (Zω). The disadvantage, from a statistical perspective, is that this construction is not
explicit, and that no finite-sample version of the concept is provided – only the limiting behavior
of the empirical spectral density operator and its eigenfunctions is obtained. Quite on the contrary,
our Theorem 4 establishes the consistency of an empirical, explicitly constructed and easily imple-
mentable version of the dynamic scores – which is what a statistician will be interested in. We also
remark that we are working under milder technical conditions.
2 Illustration of the method
An impression of how well the proposed method works can be obtained from Figure 1. Its left panel
shows ten consecutive intraday curves of some pollutant level (a detailed description of the underlying
data is given in Section 5). The two panels to the right show one-dimensional reconstructions of
these curves. We used static FPCA in the central panel and dynamic FPCA in the right panel.

[Figure 1: three panels plotting Sqrt(PM10) against intraday time on [0, 1].]

Figure 1: Ten successive daily observations (left panel), the corresponding static Karhunen-Loève
expansion based on one (static) principal component (middle panel), and the dynamic Karhunen-Loève
expansion with one dynamic component (right panel). Colors provide the matching between the actual
observations and their Karhunen-Loève approximations.

The difference is notable. The static method merely provides an average level, exhibiting a completely
spurious and highly misleading intraday symmetry. In addition to daily average levels, the dynamic
approximation, to a large extent, also catches the intraday evolution of the curves. In particular,
it retrieves the intraday trend of pollution levels, and the location of their daily spikes and troughs
(which varies considerably from one curve to the other). For this illustrative example we chose
one-dimensional reconstructions, based on a single FPC; needless to say, increasing the number
of FPCs yields much better approximations – see Section 4 for details.
Applications of dynamic PCA in a time series analysis are the same as those of static PCA in the
context of independent (or uncorrelated) observations. This is why obtaining mutually orthogonal
principal components – in the sense of mutually orthogonal processes – is a major issue here. This
orthogonality, at all leads and lags, of dynamic principal components, indeed, implies that any
second-order based method (which is the most common approach in time series) can be carried out
componentwise, i.e. via scalar methods. In contrast, static principal components still have to be
treated as a multivariate time series.
Let us illustrate this superiority of mutually orthogonal dynamic components over the auto- and
cross-correlated static ones by means of two examples.
Change point analysis: Suppose that we wish to find a structural break (change point) in a sequence
of functional observations X1, . . . , Xn. For example, Berkes et al. [4] consider the problem of
detecting a change in the mean function of a sequence of independent functional data. They propose to
first project the data onto the p leading principal components and argue that a change in the mean will
show in the score vectors, provided that the proportion of variance they account for is large
enough. Then a CUSUM procedure is utilized. The test statistic is based on the functional
\[
T_n(x) = \frac{1}{n} \sum_{m=1}^{p} \hat\lambda_m^{-1} \left( \sum_{1 \le k \le \lfloor nx \rfloor} \hat Y^{\mathrm{stat}}_{mk} - x \sum_{1 \le k \le n} \hat Y^{\mathrm{stat}}_{mk} \right)^{2}, \qquad 0 \le x \le 1.
\]
Here \(\hat Y^{\mathrm{stat}}_{mk}\) is the m-th empirical PC score of Xk, and \(\hat\lambda_m\) is the m-th largest eigenvalue of the empirical
covariance operator related to the functional sample. The assumption of independence implies that
Tn(x) converges, under the no-change hypothesis, to the sum of p squared independent Brownian
bridges. Roughly speaking, this is due to the fact that the partial sums of score vectors (used in the
CUSUM statistic) converge in distribution to a multivariate normal with diagonal covariance. That
is, the partial sums of the individual scores become asymptotically independent, and we just obtain
p independent CUSUM test statistics – a separate one for each score sequence. The independent
test statistics are then aggregated.
This simple structure is lost when the data are serially dependent. Then, if a CLT holds,
\(n^{-1/2}\bigl(\sum_{1\le k\le n} \hat Y^{\mathrm{stat}}_{mk} : m = 1, \dots, p\bigr)'\)
converges to a normal vector; the covariance appearing in the limit (diagonal in the independent case) must then be
replaced by the long-run covariance of the score vectors, which is typically non-diagonal.
In contrast, using dynamic principal components, the long-run covariance of the score vectors
remains diagonal; see Proposition 4. Let diag(λ̂1(0), . . . , λ̂p(0)) be a consistent estimator of this
long-run variance (up to the factor 2π) and let \(\hat Y^{\mathrm{dyn}}_{mk}\) be the dynamic scores. Then, replacing the test functional Tn(x) by
\[
T^{\mathrm{dyn}}_n(x) = \frac{2\pi}{n} \sum_{m=1}^{p} \hat\lambda_m^{-1}(0) \left( \sum_{1 \le k \le \lfloor nx \rfloor} \hat Y^{\mathrm{dyn}}_{mk} - x \sum_{1 \le k \le n} \hat Y^{\mathrm{dyn}}_{mk} \right)^{2}, \qquad 0 \le x \le 1,
\]
we get that (under appropriate technical assumptions ensuring a functional CLT) the same asymp-
totic behavior holds as for Tn(x), so that again p independent CUSUM test statistics can be aggre-
gated.
Dynamic principal components, thus, and not the static ones, provide a feasible extension of the
Berkes et al. [4] method to the time series context.
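The componentwise aggregation just described is easy to sketch numerically. The following Python fragment is an illustration only (the function name and the simulated white-noise scores are ours, not part of the method): it computes the aggregated CUSUM statistic from p score sequences and their normalizing variances. For the dynamic version, one would pass the long-run normalizations λ̂m(0)/(2π) from T_n^dyn in place of the static eigenvalues.

```python
import numpy as np

def cusum_statistic(scores, lam):
    """Aggregated CUSUM statistic, evaluated on the grid x = k/n.

    scores : (p, n) array of principal-component score sequences.
    lam    : length-p array of normalizing variances (static eigenvalues,
             or estimated long-run variances in the dynamic case).
    """
    p, n = scores.shape
    total = scores.sum(axis=1, keepdims=True)         # (p, 1): full sums
    partial = np.cumsum(scores, axis=1)               # partial sums up to k
    x = np.arange(1, n + 1) / n
    bridge = partial - x * total                      # CUSUM process per component
    T = (bridge ** 2 / lam[:, None]).sum(axis=0) / n  # aggregate the p components
    return T.max()                                    # sup over the grid

rng = np.random.default_rng(0)
scores = rng.standard_normal((2, 500))                # two uncorrelated score series
stat = cusum_statistic(scores, np.ones(2))
```

Under the no-change hypothesis, the resulting statistic is compared with quantiles of the supremum of a sum of p squared Brownian bridges.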
Lagged regression: A lagged regression model is a linear model in which the response Wt ∈ Rq, say,
is allowed to depend on an unspecified number of lagged values of a series of regressor variables
Xt ∈ Rp. More specifically, the model equation is
\[
W_t = a + \sum_{k \in \mathbb{Z}} b_k X_{t-k} + \varepsilon_t, \tag{2.1}
\]
with some i.i.d. noise (εt) which is independent of the regressor series. The intercept a ∈ Rq and
the matrices bk ∈ Rq×p are unknown. In time series analysis, the lagged regression is the natural
extension of the traditional linear model for independent data.
The main problem in this context, which can be tackled by a frequency domain approach, is
estimation of the parameters. See, for example, Shumway and Stoffer [35] for an introduction. Once
the parameters are known, the model can, e.g., be used for prediction.
Suppose now that Wt is a scalar response and that (Xk) constitutes a functional time series.
The corresponding lagged regression model can be formulated in analogy, but involves estimation
of an unspecified number of operators, which is quite delicate. A pragmatic way to proceed is
to have Xk in (2.1) replaced by the vector of the first p dynamic functional principal component
scores Yk = (Y1k, . . . , Ypk)′, say. The general theory implies that, under mild assumptions (basically
guaranteeing convergence of the involved series),
\[
b_k = \frac{1}{2\pi} \int_{-\pi}^{\pi} B_\theta e^{ik\theta}\, d\theta, \quad \text{where} \quad B_\theta = F^{WY}_\theta \bigl(F^{Y}_\theta\bigr)^{-1},
\]
and
\[
F^{Y}_\theta = \frac{1}{2\pi} \sum_{h \in \mathbb{Z}} \operatorname{cov}(Y_{t+h}, Y_t)\, e^{-ih\theta} \quad \text{and} \quad F^{WY}_\theta = \frac{1}{2\pi} \sum_{h \in \mathbb{Z}} \operatorname{cov}(W_{t+h}, Y_t)\, e^{-ih\theta}
\]
are the spectral density matrix of the score sequence and the cross-spectrum between (Wt) and
(Yt), respectively. In the present setting the structure greatly simplifies. Our theory will reveal (see
Proposition 9) that F^Y_θ is diagonal at all frequencies and that
\[
B_\theta = \left( \frac{f^{WY_1}_\theta}{\lambda_1(\theta)}, \dots, \frac{f^{WY_p}_\theta}{\lambda_p(\theta)} \right),
\]
with f^{WY_m}_θ being the co-spectrum between (Wt) and (Ymt), and λm(θ) the m-th dynamic eigenvalue
of the spectral density operator of the series (Xk) (see Section 3.2). As a consequence, the influence
of each score sequence on the response can be assessed individually.
Of course, in applications, these population quantities are replaced by their empirical versions,
and one may use some testing procedure for the null hypothesis H0 : f^{WY_p}_θ = 0 for all θ, in order to
justify the choice of the dimension of the dynamic score vectors and to retain only those components
which have a significant impact on Wt.
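The formula for bk can be checked numerically in the simplest scalar setting. In the sketch below, population quantities are plugged in directly, so it illustrates the identity rather than an estimation procedure, and all model values are our own choices: the score series is white noise with variance σ², the response is Wt = 1.5 Yt − 0.75 Yt−1 + εt, and numerical Fourier inversion of Bθ recovers the lag coefficients.

```python
import numpy as np

# Lagged regression W_t = 1.5*Y_t - 0.75*Y_{t-1} + eps_t with white-noise scores
b_true = {0: 1.5, 1: -0.75}
sigma2 = 2.0                                               # Var(Y_t)

theta = np.linspace(-np.pi, np.pi, 4096, endpoint=False)   # frequency grid
f_Y = np.full(theta.shape, sigma2 / (2 * np.pi))           # spectral density of (Y_t)
f_WY = (b_true[0] + b_true[1] * np.exp(-1j * theta)) * sigma2 / (2 * np.pi)
B = f_WY / f_Y                                             # transfer function B_theta

def coeff(k):
    # b_k = (1/2pi) * integral of B_theta e^{i k theta}; the rectangle rule is
    # exact here because B_theta is a low-degree trigonometric polynomial
    return (B * np.exp(1j * k * theta)).mean().real

b0, b1, b2 = coeff(0), coeff(1), coeff(2)                  # -> 1.5, -0.75, 0.0
```

In practice F^Y and F^WY would of course be replaced by smoothed periodogram estimates, but the inversion step is identical.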
3 Methodology for L2 curves
In this section, we introduce some necessary notation and tools. Most of the discussion on technical
details is postponed to the Appendices A, B and the supplementary document C. For simplicity, we
are focusing here on L2([0, 1])-valued processes, i.e. on square-integrable functions defined on the
unit interval; in the appendices, however, the theory is developed within a more general framework.
3.1 Notation and setup
Throughout this section, we consider a functional time series (Xt : t ∈ Z), where Xt takes values in
the space H := L2([0, 1]) of complex-valued square-integrable functions on [0, 1]. This means that
Xt = (Xt(u) : u ∈ [0, 1]), with
\[
\int_0^1 |X_t(u)|^2\, du < \infty
\]
(here \(|z| := \sqrt{z\bar z}\), where \(\bar z\) denotes the complex conjugate of z, stands for the modulus of z ∈ C). In most
applications, observations are real but, since we will use spectral methods, a complex vector space
framework will prove useful.
The space H then is a Hilbert space, equipped with the inner product \(\langle x, y\rangle := \int_0^1 x(u)\overline{y(u)}\, du\),
so that ‖x‖ := ⟨x, x⟩^{1/2} defines a norm. The notation X ∈ L^p_H is used to indicate that, for some
p > 0, E[‖X‖^p] < ∞. Any X ∈ L^1_H then possesses a mean curve µ = (E[X(u)] : u ∈ [0, 1]), and any
X ∈ L^2_H a covariance operator C, defined by C(x) := E[(X − µ)⟨x, X − µ⟩]. The operator C is a
kernel operator given by
\[
C(x)(u) = \int_0^1 c(u, v)\, x(v)\, dv, \quad \text{with} \quad c(u, v) := \operatorname{cov}(X(u), X(v)), \quad u, v \in [0, 1],
\]
where \(\operatorname{cov}(X, Y) := E\bigl[(X - EX)\overline{(Y - EY)}\bigr]\). The process (Xt : t ∈ Z) is called weakly stationary if, for
all t, (i) Xt ∈ L2H , (ii) EXt = EX0, and (iii) for all h ∈ Z and u, v ∈ [0, 1],
cov(Xt+h(u), Xt(v)) = cov(Xh(u), X0(v)) =: ch(u, v).
Denote by Ch, h ∈ Z, the operator corresponding to the autocovariance kernels ch. Clearly, C0 = C.
It is well known that, under quite general dependence assumptions, the mean of a stationary
functional sequence can be consistently estimated by the sample mean, with the usual \(\sqrt{n}\) convergence
rate. Since, for our problem, the mean is not really relevant, we suppose throughout that the
data have been centered in some preprocessing step. For the rest of the paper, it is tacitly as-
sumed that (Xt : t ∈ Z) is a weakly stationary, zero mean process defined on some probability space
(Ω,A, P ).
As in the multivariate case, the covariance operator C of a random element X ∈ L^2_H admits an
eigendecomposition (see, e.g., p. 178, Theorem 5.1 in [13])
\[
C(x) = \sum_{\ell=1}^{\infty} \lambda_\ell \langle x, v_\ell \rangle v_\ell, \tag{3.1}
\]
where (λℓ : ℓ ≥ 1) are C's eigenvalues (in descending order) and (vℓ : ℓ ≥ 1) the corresponding
normalized eigenfunctions, so that C(vℓ) = λℓvℓ and ‖vℓ‖ = 1. If C has full rank, then the sequence
(vℓ : ℓ ≥ 1) forms an orthonormal basis of L2([0, 1]). Hence, X admits the representation
\[
X = \sum_{\ell=1}^{\infty} \langle X, v_\ell \rangle v_\ell, \tag{3.2}
\]
which is called the static Karhunen-Loève expansion of X. The eigenfunctions vℓ are called the
(static) functional principal components (FPCs) and the coefficients ⟨X, vℓ⟩ are called the (static)
FPC scores or loadings. It is well known that the basis (vℓ : ℓ ≥ 1) is optimal in representing X in
the following sense: if (wℓ : ℓ ≥ 1) is any other orthonormal basis of H, then
\[
E\Bigl\| X - \sum_{\ell=1}^{p} \langle X, v_\ell \rangle v_\ell \Bigr\|^2 \le E\Bigl\| X - \sum_{\ell=1}^{p} \langle X, w_\ell \rangle w_\ell \Bigr\|^2 \quad \forall p \ge 1. \tag{3.3}
\]
Property (3.3) shows that a finite number of FPCs can be used to approximate the function X
by a vector of given dimension p with a minimum loss of “instantaneous” information. It should
be stressed, though, that this approximation is of a static nature, meaning that it is performed
observation by observation, and does not take into account the possible serial dependence of the
Xt’s, which is likely to exist in a time-series context. Globally speaking, we should be looking for an
approximation which also involves lagged observations, and is based on the whole family (Ch : h ∈ Z)
rather than on C0 only. To achieve this goal, we introduce below the spectral density operator, which
contains the full information on the family of operators (Ch : h ∈ Z).
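For curves observed on a grid, the eigendecomposition (3.1) and the expansion (3.2) reduce to a matrix eigenproblem, with inner products approximated by Riemann sums. The following sketch (simulated two-component curves of our own; the grid-step normalizations are the standard discretization conventions, not part of the theory above) computes static FPCs, their scores, and the proportion of variance explained.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 300, 101                          # n curves on a grid of d points in [0, 1]
u = np.linspace(0, 1, d)
h = 1.0 / (d - 1)                        # grid step: inner products ~ Riemann sums

# Simulate mean-zero curves spanned by two orthonormal basis functions
scores = rng.standard_normal((n, 2)) * np.array([2.0, 0.5])
basis = np.vstack([np.sqrt(2) * np.sin(np.pi * u),
                   np.sqrt(2) * np.sin(2 * np.pi * u)])
X = scores @ basis                       # (n, d): row t holds the curve X_t

C = X.T @ X / n                          # empirical covariance kernel c(u_i, u_j)
evals, evecs = np.linalg.eigh(C)         # ascending eigenvalues
evals, evecs = evals[::-1], evecs[:, ::-1]

v = evecs / np.sqrt(h)                   # eigenfunctions with L2 norm 1
lam = evals * h                          # eigenvalues of the covariance *operator*

Y1 = X @ v[:, 0] * h                     # first static FPC scores <X_t, v_1>
X_approx = np.outer(Y1, v[:, 0])         # one-term Karhunen-Loeve approximation
explained = lam[0] / lam.sum()           # proportion of variance explained
```

With the score variances 4 and 0.25 used here, the first FPC explains roughly 94% of the variance, and by (3.3) no other unit function can do better.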
3.2 The spectral density operator
In analogy to the classical concept of a spectral density matrix, we define the spectral density
operator.
Definition 2. Let (Xt) be a stationary process. The operator F^X_θ whose kernel is
\[
f^X_\theta(u, v) := \frac{1}{2\pi} \sum_{h \in \mathbb{Z}} c_h(u, v)\, e^{-ih\theta}, \qquad \theta \in [-\pi, \pi],
\]
where i denotes the imaginary unit, is called the spectral density operator of (Xt) at frequency θ.
To ensure convergence (in an appropriate sense) of the series defining f^X_θ(u, v) (see Appendix A.2),
we impose the following summability condition on the autocovariances:
\[
\sum_{h \in \mathbb{Z}} \Bigl( \int_0^1 \int_0^1 |c_h(u, v)|^2\, du\, dv \Bigr)^{1/2} < \infty. \tag{3.4}
\]
The same condition is more conveniently expressed as
\[
\sum_{h \in \mathbb{Z}} \| C_h \|_{\mathcal S} < \infty, \tag{3.5}
\]
where ‖ · ‖_S denotes the Hilbert-Schmidt norm (see Section C.1 in the supplementary document).
A simple sufficient condition for (3.5) to hold will be provided in Proposition 7.
This concept of a spectral density operator has been introduced by Panaretos and Tavakoli [27].
In our context, this operator is used to create particular functional filters (see Sections 3.3 and A.3),
which are the building blocks for the construction of dynamic FPCs. A functional filter is defined via
a sequence Φ = (Φℓ : ℓ ∈ Z) of linear operators between the spaces H = L2([0, 1]) and H′ = R^p. The
filtered variables Yt take the form \(Y_t = \sum_{\ell \in \mathbb{Z}} \Phi_\ell(X_{t-\ell})\) and, by the Riesz representation theorem,
the linear operators Φℓ are given as
\[
x \mapsto \Phi_\ell(x) = \bigl( \langle x, \phi_{1\ell} \rangle, \dots, \langle x, \phi_{p\ell} \rangle \bigr)', \quad \text{with } \phi_{1\ell}, \dots, \phi_{p\ell} \in H.
\]
We shall consider filters Φ for which the sequences \(\bigl( \sum_{\ell=-N}^{N} \phi_{m\ell}(u) e^{i\ell\theta} : N \ge 1 \bigr)\), 1 ≤ m ≤ p,
converge in L2([0, 1] × [−π, π]). Hence, we assume existence of a square-integrable function φ*_m(u|θ)
such that
\[
\lim_{N \to \infty} \int_{-\pi}^{\pi} \int_0^1 \Bigl| \sum_{\ell=-N}^{N} \phi_{m\ell}(u) e^{i\ell\theta} - \phi^*_m(u|\theta) \Bigr|^2 du\, d\theta = 0. \tag{3.6}
\]
In addition, we suppose that
\[
\sup_{\theta \in [-\pi, \pi]} \int_0^1 |\phi^*_m(u|\theta)|^2\, du < \infty. \tag{3.7}
\]
Then, we write \(\phi^*_m(\theta) := \sum_{\ell \in \mathbb{Z}} \phi_{m\ell} e^{i\ell\theta}\) or, in order to emphasize its functional nature,
\(\phi^*_m(u|\theta) := \sum_{\ell \in \mathbb{Z}} \phi_{m\ell}(u) e^{i\ell\theta}\). We denote by C the family of filters Φ which satisfy (3.6) and (3.7). For example,
if Φ is such that \(\sum_{\ell} \|\phi_{m\ell}\| < \infty\), then Φ ∈ C.
The following proposition relates the spectral density operator of (Xt) to the spectral density
matrix of the filtered sequence \(Y_t = \sum_{\ell \in \mathbb{Z}} \Phi_\ell(X_{t-\ell})\). This simple result plays a crucial role in our
construction.
Proposition 2. Assume that Φ ∈ C and let φ*_m(θ) be given as above. Then the series \(\sum_{\ell\in\mathbb{Z}} \Phi_\ell(X_{t-\ell})\)
converges in mean square to a limit Yt. The p-dimensional vector process (Yt) is stationary, with
spectral density matrix
\[
F^Y_\theta =
\begin{pmatrix}
\langle F^X_\theta(\phi^*_1(\theta)), \phi^*_1(\theta) \rangle & \cdots & \langle F^X_\theta(\phi^*_p(\theta)), \phi^*_1(\theta) \rangle \\
\vdots & \ddots & \vdots \\
\langle F^X_\theta(\phi^*_1(\theta)), \phi^*_p(\theta) \rangle & \cdots & \langle F^X_\theta(\phi^*_p(\theta)), \phi^*_p(\theta) \rangle
\end{pmatrix}.
\]
Since we do not want to assume a priori absolute summability of the filter coefficients Φℓ, the
series \(F^Y_\theta = (2\pi)^{-1} \sum_{h \in \mathbb{Z}} C^Y_h e^{-ih\theta}\), where \(C^Y_h = \operatorname{cov}(Y_h, Y_0)\), may not converge absolutely, and hence
not pointwise in θ. As our general theory will show, the operator F^Y_θ can be considered as an
element of the space \(L^2_{\mathbb{C}^{p\times p}}([-\pi, \pi])\), i.e. the collection of measurable mappings f : [−π, π] → C^{p×p}
for which \(\int_{-\pi}^{\pi} \|f(\theta)\|_F^2\, d\theta < \infty\), where ‖ · ‖_F denotes the Frobenius norm. Equality of f and g is
thus understood as \(\int_{-\pi}^{\pi} \|f(\theta) - g(\theta)\|_F^2\, d\theta = 0\). In particular, it implies that f(θ) = g(θ) for almost
all θ.
To explain the important consequences of Proposition 2, first observe that under (3.5), for
every frequency θ, the operator F^X_θ is a non-negative, self-adjoint Hilbert-Schmidt operator (see
Section C.1 of the supplementary file). Hence, in analogy to (3.1), F^X_θ admits, for all θ, the spectral
representation
\[
F^X_\theta(x) = \sum_{m \ge 1} \lambda_m(\theta) \langle x, \varphi_m(\theta) \rangle \varphi_m(\theta),
\]
where λm(θ) and ϕm(θ) denote the dynamic eigenvalues and eigenfunctions. We impose the order
λ1(θ) ≥ λ2(θ) ≥ · · · ≥ 0 for all θ ∈ [−π, π], and require that the eigenfunctions be standardized so
that ‖ϕm(θ)‖ = 1 for all m ≥ 1 and θ ∈ [−π, π].
Assume now that we could choose the functional filters (φmℓ : ℓ ∈ Z) in such a way that
\[
\lim_{N \to \infty} \int_{-\pi}^{\pi} \int_0^1 \Bigl| \sum_{\ell=-N}^{N} \phi_{m\ell}(u) e^{i\ell\theta} - \varphi_m(u|\theta) \Bigr|^2 du\, d\theta = 0. \tag{3.8}
\]
We then have F^Y_θ = diag(λ1(θ), . . . , λp(θ)) for almost all θ, implying that the coordinate processes
of (Yt) are uncorrelated at any lag: cov(Ymt, Ym′s) = 0 for all s, t and m ≠ m′. As discussed in the
Introduction, this is a desirable property which the static FPCs do not possess.
3.3 Dynamic FPCs
Motivated by the discussion above, we wish to define φmℓ in such a way that φ*_m = ϕm (in
L2([0, 1] × [−π, π])). To this end, we suppose that the function ϕm(u|θ) is jointly measurable in u and θ (this
assumption is discussed in Appendix A.1). The fact that eigenfunctions are standardized to unit
length implies \(\int_{-\pi}^{\pi} \int_0^1 |\varphi_m(u|\theta)|^2\, du\, d\theta = 2\pi\). We conclude from Tonelli's theorem that
\(\int_{-\pi}^{\pi} |\varphi_m(u|\theta)|^2\, d\theta < \infty\) for almost all u ∈ [0, 1], i.e. that ϕm(u|·) ∈ L2([−π, π]) for all u ∈ Am ⊂ [0, 1], where Am has
Lebesgue measure one. We now define, for u ∈ Am,
\[
\phi_{m\ell}(u) := \frac{1}{2\pi} \int_{-\pi}^{\pi} \varphi_m(u|s)\, e^{-i\ell s}\, ds; \tag{3.9}
\]
for u ∉ Am, φmℓ(u) is set to zero. Then, it follows from the results in Appendix A.1 that (3.8) holds.
We conclude that the functional filters defined via (φmℓ : ℓ ∈ Z, 1 ≤ m ≤ p) belong to the class C
and that the resulting filtered process has diagonal autocovariances at all lags.
Definition 3 (Dynamic functional principal components). Assume that (Xt : t ∈ Z) is a mean-zero
stationary process with values in L^2_H satisfying assumption (3.5). Let φmℓ be defined as in (3.9).
Then the m-th dynamic functional principal component score of (Xt) is
\[
Y_{mt} := \sum_{\ell \in \mathbb{Z}} \langle X_{t-\ell}, \phi_{m\ell} \rangle, \qquad t \in \mathbb{Z},\ m \ge 1. \tag{3.10}
\]
Call Φm := (φmℓ : ℓ ∈ Z) the m-th dynamic FPC filter coefficients.
Remark 1. If EXt = µ, then the dynamic FPC scores Ymt are defined as in (3.10), with Xs replaced
by Xs − µ.
Remark 2. Note that the dynamic scores (Ymt) in (3.10) are not unique. The filter coefficients φm`
are computed from the eigenfunctions ϕm(θ), which are defined up to some multiplicative factor z
on the complex unit circle. Hence, to be precise, we should speak of a version of (Ymt) and a version
of (φm`). We further discuss this issue after Theorem 2 and in Section 3.4.
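To make the construction concrete, consider the functional moving average Xt = Zt b1 + Zt−1 b2 with scalar white noise (Zt) and orthonormal curves b1, b2. Its spectral density operator has the rank-one kernel (2π)⁻¹(b1 + e^{−iθ}b2) ⊗ (b1 + e^{−iθ}b2)*, with first dynamic eigenfunction ϕ1(θ) = (b1 + e^{−iθ}b2)/√2. The sketch below is a discretized illustration of ours (the phase alignment against b1 is just one convention for fixing the unit-modulus factor of Remark 2): it eigendecomposes the operator on a frequency grid and carries out the inverse Fourier step (3.9) numerically.

```python
import numpy as np

d, L = 50, 3
u = (np.arange(d) + 0.5) / d                  # midpoint grid on [0, 1]
h = 1.0 / d
b1 = np.sqrt(2) * np.sin(np.pi * u)           # orthonormal curves in L2([0, 1])
b2 = np.sqrt(2) * np.sin(2 * np.pi * u)

thetas = np.linspace(-np.pi, np.pi, 512, endpoint=False)
phi = np.zeros((2 * L + 1, d), dtype=complex) # filter coefficients, lags -L..L

for theta in thetas:
    e = b1 + np.exp(-1j * theta) * b2         # F_theta = (2 pi)^-1 e (x) e^*
    F = np.outer(e, e.conj()) * h / (2 * np.pi)
    w, V = np.linalg.eigh(F)                  # Hermitian eigendecomposition
    phi_th = V[:, -1] / np.sqrt(h)            # leading eigenfunction, L2-normalized
    z = h * np.vdot(b1, phi_th)               # fix the arbitrary unit-modulus phase
    phi_th *= np.conj(z) / abs(z)
    for j, ell in enumerate(range(-L, L + 1)):
        # inverse Fourier step (3.9), rectangle rule over the frequency grid
        phi[j] += phi_th * np.exp(-1j * ell * theta) / len(thetas)

phi = phi.real                                # Hermitian symmetry => real filter
```

The recovered filter has φ1,0 = b1/√2 and φ1,−1 = b2/√2, all other lags vanishing; the resulting score series is Yt = √2 Zt, so a single dynamic FPC carries the whole process, whereas C0 = b1 ⊗ b1 + b2 ⊗ b2 has equal eigenvalues and one static FPC explains only half of the variance.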
The rest of this section is devoted to some important properties of dynamic FPCs.
Proposition 3 (Elementary properties). Let (Xt : t ∈ Z) be a real-valued stationary process satis-
fying (3.5), with dynamic FPC scores Ymt. Then,
(a) the eigenfunctions ϕm(θ) are Hermitian, and hence Ymt is real;
(b) if Ch = 0 for h 6= 0, the dynamic FPC scores coincide with the static ones.
Proposition 4 (Second-order properties). Let (Xt : t ∈ Z) be a stationary process satisfying (3.5),
with dynamic FPC scores Ymt. Then,
(a) the series defining Ymt is mean-square convergent, with
\[
EY_{mt} = 0 \quad \text{and} \quad E Y_{mt}^2 = \sum_{\ell \in \mathbb{Z}} \sum_{k \in \mathbb{Z}} \langle C_{\ell-k}(\phi_{m\ell}), \phi_{mk} \rangle;
\]
(b) the dynamic FPC scores Ymt and Ym′s are uncorrelated for all s, t and m ≠ m′. In other words,
if Yt = (Y1t, . . . , Ypt)′ denotes some p-dimensional score vector and C^Y_h its lag-h covariance matrix,
then C^Y_h is diagonal for all h;
(c) the long-run covariance matrix of the dynamic FPC score vector process (Yt) is
\[
\lim_{n \to \infty} \frac{1}{n} \operatorname{Var}(Y_1 + \cdots + Y_n) = 2\pi\, \operatorname{diag}(\lambda_1(0), \dots, \lambda_p(0)).
\]
The next theorem, which tells us how the original process (Xt(u) : t ∈ Z, u ∈ [0, 1]) can be
recovered from (Ymt : t ∈ Z, m ≥ 1), is the dynamic analogue of the static Karhunen-Loève expansion
(3.2) associated with static principal components.
Theorem 2 (Inversion formula). Let Ymt be the dynamic FPC scores related to the process
(Xt(u) : t ∈ Z, u ∈ [0, 1]). Then,
\[
X_t(u) = \sum_{m \ge 1} X_{mt}(u) \quad \text{with} \quad X_{mt}(u) := \sum_{\ell \in \mathbb{Z}} Y_{m, t+\ell}\, \phi_{m\ell}(u) \tag{3.11}
\]
(where convergence is in mean square). Call (3.11) the dynamic Karhunen-Loève expansion of Xt.
We have mentioned in Remark 2 that dynamic FPC scores are not unique. In contrast, our
proofs show that the curves Xmt(u) are unique. To get some intuition, let us draw a simple analogy
to the static case. There, each vℓ in the Karhunen-Loève expansion (3.2) can be replaced by −vℓ,
i.e., the FPCs are defined up to their signs. The ℓ-th score is ⟨X, vℓ⟩ or ⟨X, −vℓ⟩, and thus is not
unique either. However, the curves ⟨X, vℓ⟩vℓ and ⟨X, −vℓ⟩(−vℓ) are identical.
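For a rank-one example, the inversion formula can be verified exactly. Take Xt = Zt b1 + Zt−1 b2 with orthonormal b1, b2; its first dynamic FPC filter has only two nonzero coefficients, φ1,0 = b1/√2 and φ1,−1 = b2/√2, obtained from (3.9) (the model and its discretization below are our own illustration). Computing the scores via (3.10) and expanding via (3.11) reproduces the curves up to rounding error:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 40
u = (np.arange(d) + 0.5) / d                  # midpoint grid on [0, 1]
h = 1.0 / d
b1 = np.sqrt(2) * np.sin(np.pi * u)
b2 = np.sqrt(2) * np.sin(2 * np.pi * u)

n = 200
Z = rng.standard_normal(n + 1)
X = Z[1:, None] * b1 + Z[:-1, None] * b2      # rows: X_t = Z_t b1 + Z_{t-1} b2

phi0, phim1 = b1 / np.sqrt(2), b2 / np.sqrt(2)   # phi_{1,0} and phi_{1,-1}

# Scores (3.10): Y_t = <X_t, phi_{1,0}> + <X_{t+1}, phi_{1,-1}> = sqrt(2) Z_t
Y = h * (X[:-1] @ phi0 + X[1:] @ phim1)

# Expansion (3.11): X_t = Y_t phi_{1,0} + Y_{t-1} phi_{1,-1}
X_rec = Y[1:, None] * phi0 + Y[:-1, None] * phim1
err = np.abs(X_rec - X[1:-1]).max()           # exact recovery (up to rounding)
```

Here the first dynamic component alone reproduces every curve, illustrating that the expansion can be dramatically shorter than its static counterpart when the serial dependence is strong.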
The sums \(\sum_{m=1}^{p} X_{mt}(u)\), p ≥ 1, can be seen as p-dimensional reconstructions of Xt(u), which
only involve the p time series (Ymt : t ∈ Z), 1 ≤ m ≤ p. Competitors to this reconstruction
are obtained by replacing φmℓ in (3.10) and (3.11) with alternative sequences ψmℓ and υmℓ. The
next theorem shows that, among all filters in C, the dynamic Karhunen-Loève expansion (3.11)
approximates Xt(u) in an optimal way.
Theorem 3 (Optimality of Karhunen-Loève expansions). Let Ymt be the dynamic FPC scores
related to the process (Xt : t ∈ Z), and define Xmt as in Theorem 2. Let \(\tilde X_{mt} = \sum_{\ell \in \mathbb{Z}} \tilde Y_{m, t+\ell}\, \upsilon_{m\ell}\),
with \(\tilde Y_{mt} = \sum_{\ell \in \mathbb{Z}} \langle X_{t-\ell}, \psi_{m\ell} \rangle\), where (ψmk : k ∈ Z) and (υmk : k ∈ Z) are sequences in H belonging
to C. Then,
\[
E\Bigl\| X_t - \sum_{m=1}^{p} X_{mt} \Bigr\|^2 = \sum_{m > p} \int_{-\pi}^{\pi} \lambda_m(\theta)\, d\theta \le E\Bigl\| X_t - \sum_{m=1}^{p} \tilde X_{mt} \Bigr\|^2 \quad \forall p \ge 1. \tag{3.12}
\]
Inequality (3.12) can be interpreted as the dynamic version of (3.3). Theorem 3 also suggests the proportion

(∑_{m≤p} ∫_{−π}^{π} λ_m(θ) dθ) / E‖X_1‖²   (3.13)

of variance explained by the first p dynamic FPCs as a natural measure of how well a functional time series can be represented in dimension p.
3.4 Estimation and asymptotics
In practice, dynamic FPC scores need to be calculated from an estimated version of F^X_θ. At the same time, the infinite series defining the scores need to be replaced by finite approximations. Suppose again that (X_t : t ∈ Z) is a weakly stationary zero-mean time series such that (3.5) holds. Then, a natural estimator for Y_mt is

Ŷ_mt := ∑_{ℓ=−L}^{L} ⟨X_{t−ℓ}, φ̂_mℓ⟩,  m = 1, . . . , p and t = L + 1, . . . , n − L,   (3.14)

where L is some integer and φ̂_mℓ is computed from some estimated spectral density operator F̂^X_θ. For the latter, we impose the following preliminary assumption.

Assumption B.1 The estimator F̂^X_θ is consistent in integrated mean square, i.e.,

∫_{−π}^{π} E‖F̂^X_θ − F^X_θ‖²_S dθ → 0 as n → ∞.   (3.15)
Panaretos and Tavakoli [27] propose an estimator F̂^X_θ satisfying (3.15) under certain functional
cumulant conditions. By stating (3.15) as an assumption, we intend to keep the theory more
widely applicable. For example, the following proposition shows that estimators satisfying Assumption B.1 also exist under L^4-m-approximability, a dependence concept for functional data introduced
in Hormann and Kokoszka [18]. Define
F̂^X_θ = ∑_{|h|≤q} (1 − |h|/q) Ĉ^X_h e^{−ihθ},  0 < q < n,   (3.16)

where Ĉ^X_h is the usual empirical autocovariance operator at lag h.
Proposition 5. Let (X_t : t ∈ Z) be L^4-m-approximable, and let q = q(n) → ∞ such that q³ = o(n). Then the estimator F̂^X_θ defined in (3.16) satisfies Assumption B.1. The approximation error is O(α_{q,n}), where

α_{q,n} = q^{3/2}/√n + (1/q) ∑_{|h|≤q} |h| ‖C_h‖_S + ∑_{|h|>q} ‖C_h‖_S.
Corollary 1. Under the assumptions of Proposition 5 and ∑_h |h| ‖C_h‖_S < ∞, the convergence rate of the estimator (3.16) is O(n^{−1/5}).
Since our method requires the estimation of eigenvectors of the spectral density operator, we also need to introduce certain identifiability constraints on eigenvectors. Define α_1(θ) := λ_1(θ) − λ_2(θ) and

α_m(θ) := min{λ_{m−1}(θ) − λ_m(θ), λ_m(θ) − λ_{m+1}(θ)}  for m > 1,

where λ_i(θ) is the i-th largest eigenvalue of the spectral density operator evaluated at θ.
Assumption B.2 For all m, αm(θ) has finitely many zeros.
Assumption B.2 essentially guarantees that the eigenvalues are distinct for all θ. It is a very common assumption
in functional PCA, as it ensures that eigenspaces are one-dimensional, and thus eigenfunctions are
unique up to their signs. To guarantee identifiability, it only remains to provide a rule for choosing
the signs. In our context, the situation is slightly more complicated, since we are working in a
complex setup. The eigenfunction ϕm(θ) is unique up to multiplication by a number on the complex
unit circle. A possible way to fix the direction of the eigenfunctions is to impose a constraint of the
form 〈ϕm(θ), v〉 ∈ (0,∞) for some given function v. In other words, we choose the orientation of
the eigenfunction such that its inner product with some reference curve v is a positive real number.
This rule identifies ϕm(θ), as long as it is not orthogonal to v. The following assumption ensures
that such identification is possible on a large enough set of frequencies θ ∈ [−π, π].
Assumption B.3 Denoting by ϕ_m(θ) the m-th dynamic eigenvector of F^X_θ, there exists v such that ⟨ϕ_m(θ), v⟩ ≠ 0 for almost all θ ∈ [−π, π].
From now on, we tacitly assume that the orientations of ϕ_m(θ) and ϕ̂_m(θ) are chosen so that ⟨ϕ_m(θ), v⟩ and ⟨ϕ̂_m(θ), v⟩ are in [0, ∞) for almost all θ. Then, we have the following result.
Theorem 4 (Consistency). Let Ŷ_mt be the random variable defined by (3.14) and suppose that Assumptions B.1–B.3 hold. Then, for some sequence L = L(n) → ∞, we have Ŷ_mt → Y_mt in probability as n → ∞.
Practical guidelines for the choice of L are given in the next section.
4 Practical implementation
In applications, data can only be recorded discretely. A curve x(u) is observed on grid points
0 ≤ u1 < u2 < · · · < ur ≤ 1. Often, though not necessarily so, r is very large (high frequency data).
The sampling frequency r and the sampling points ui may change from observation to observation.
Also, data may be recorded with or without measurement error, and time warping (registration)
may be required. For deriving limiting results, a common assumption is that r → ∞, while a
possible measurement error tends to zero. All these specifications have been extensively studied
in the literature, and we omit here the technical exercise to cast our theorems and propositions in
one of these setups. Rather, we show how to implement the proposed method, after the necessary
preprocessing steps have been carried out. Typically, data are then represented in terms of a finite (but possibly large) number of basis functions (v_k : 1 ≤ k ≤ d), i.e., x(u) = ∑_{k=1}^d x_k v_k(u). Usually Fourier bases, B-splines or wavelets are used. For an excellent survey on preprocessing the raw data, we refer to Ramsay and Silverman [32, Chapters 3–5].
In the sequel, we write (a_ij : 1 ≤ i, j ≤ d) for a d×d matrix with entry a_ij in row i and column j. Let x belong to the span H_d := sp{v_k : 1 ≤ k ≤ d} of v_1, . . . , v_d. Then x is of the form v′x, where v = (v_1, . . . , v_d)′ and x = (x_1, . . . , x_d)′. We assume that the basis functions v_1, . . . , v_d are linearly independent, but they need not be orthogonal. Any statement about the function x can be expressed as an equivalent statement about its coefficient vector x. In particular, if A : H_d → H_d is a linear operator, then, for x ∈ H_d,

A(x) = ∑_{k=1}^d x_k A(v_k) = ∑_{k=1}^d ∑_{k′=1}^d x_k ⟨A(v_k), v_{k′}⟩ v_{k′} = v′Ax,

where A′ = (⟨A(v_i), v_j⟩ : 1 ≤ i, j ≤ d). Call the matrix A the corresponding matrix of the operator A, and x the corresponding vector of x.
The following simple results are stated without proof.
Lemma 1. Let A, B be linear operators on H_d, with corresponding matrices A and B, respectively. Then,

(i) for any α, β ∈ C, the corresponding matrix of αA + βB is αA + βB;

(ii) A(e) = λe iff Ae = λe, where e = v′e;

(iii) letting A := ∑_{i=1}^d ∑_{j=1}^d g_ij v_i ⊗ v_j, G := (g_ij : 1 ≤ i, j ≤ d), where g_ij ∈ C, and V := (⟨v_i, v_j⟩ : 1 ≤ i, j ≤ d), the corresponding matrix of A is A = GV′.
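Lemma 1 (iii) is easy to verify numerically. The sketch below (our construction, with grid-based trapezoidal integration and the non-orthogonal monomial basis 1, u, u²) checks that applying A = ∑_{ij} g_ij v_i ⊗ v_j directly to a function agrees with multiplying its coefficient vector by A = GV′.

```python
import numpy as np

u = np.linspace(0.0, 1.0, 20001)
du = u[1] - u[0]
v = np.vstack([np.ones_like(u), u, u ** 2])        # non-orthogonal basis 1, u, u^2
d = v.shape[0]

def inner(f, g):                                   # <f, g> = int_0^1 f(u) g(u) du
    fg = f * g                                     # trapezoidal rule on the grid
    return float((fg[0] + fg[-1]) / 2 + fg[1:-1].sum()) * du

V = np.array([[inner(v[i], v[j]) for j in range(d)] for i in range(d)])
G = np.array([[1.0, 0.5, 0.0], [0.0, 2.0, 0.3], [0.2, 0.0, 1.0]])
A = G @ V.T                                        # corresponding matrix, Lemma 1 (iii)

x_coef = np.array([1.0, -2.0, 0.5])
x_fun = x_coef @ v                                 # the function x = v' x_coef
# A(x) applied directly: A(x) = sum_ij g_ij <x, v_j> v_i ...
Ax_direct = sum(G[i, j] * inner(x_fun, v[j]) * v[i] for i in range(d) for j in range(d))
# ... versus the coefficient route: coefficient vector A @ x_coef
Ax_via_matrix = (A @ x_coef) @ v
print(np.max(np.abs(Ax_direct - Ax_via_matrix)))   # agrees up to rounding
```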
To obtain the corresponding matrix of the spectral density operators F^X_θ, first observe that, if X_k = ∑_{i=1}^d X_ki v_i =: v′X_k, then

C^X_h = EX_h ⊗ X_0 = ∑_{i=1}^d ∑_{j=1}^d E[X_hi X_0j] v_i ⊗ v_j.

It follows from Lemma 1 (iii) that C^X_h V′ is the corresponding matrix of the operator C^X_h, where (in matrix notation) C^X_h := E[X_h X_0′]; the linearity property (i) then implies that

F^X_θ = (1/2π) (∑_{h∈Z} C^X_h e^{−ihθ}) V′   (4.1)
is the corresponding matrix of F^X_θ. Assume that λ_m(θ) is the m-th largest eigenvalue of the matrix F^X_θ, with eigenvector ϕ_m(θ). Then λ_m(θ) is also an eigenvalue of the operator F^X_θ, and v′ϕ_m(θ) is the corresponding eigenfunction, from which we can compute, via its Fourier expansion, the dynamic FPCs. In particular, we have

φ_mk = (v′/2π) ∫_{−π}^{π} ϕ_m(s) e^{−iks} ds =: v′φ_mk,

and hence

Y_mt = ∑_{k∈Z} ∫_0^1 X′_{t−k} v(u) v′(u) φ_mk du = ∑_{k∈Z} X′_{t−k} V φ_mk.   (4.2)
In view of (4.1), our task is now to replace the spectral density matrix

F^X_θ = (1/2π) ∑_{h∈Z} C^X_h e^{−ihθ}

of the coefficient sequence (X_k) by some estimate. For this purpose, we can use existing multivariate techniques. Classically, we would put, for |h| < n,

Ĉ^X_h := (1/n) ∑_{k=h+1}^n X_k X′_{k−h},  h ≥ 0,  and  Ĉ^X_h := (Ĉ^X_{−h})′,  h < 0
(recall that we throughout assume that the data are centered) and use, for example, some lag window estimator

F̂^X_θ := (1/2π) ∑_{|h|≤q} w(h/q) Ĉ^X_h e^{−ihθ},   (4.3)

where w is some appropriate weight function, q = q_n → ∞ and q_n/n → 0. For more details concerning common choices of w and the tuning parameter q_n, we refer to Chapters 10–11 in Brockwell and Davis [7] and to Politis [29]. We then set F̂^X_θ := F̂^X_θ V′ (cf. (4.1)) and compute the eigenvalues and eigenfunctions λ̂_m(θ) and ϕ̂_m(θ) thereof, which serve as estimators of λ_m(θ) and ϕ_m(θ), respectively. We estimate the filter coefficients by φ̂_mk = (v′/2π) ∫_{−π}^{π} ϕ̂_m(s) e^{−iks} ds. Usually, no analytic form of ϕ̂_m(s) is available, and one has to perform numerical integration. We take the simplest approach, which is to set

φ̂_mk = (v′/(2N_θ + 1)) ∑_{j=−N_θ}^{N_θ} ϕ̂_m(πj/N_θ) e^{−ikπj/N_θ} =: v′φ̂_mk,   (N_θ ≫ 1).
The larger N_θ, the better the approximation; in practice, the choice mainly depends on the available computing power.
Now, we substitute φ̂_mk into (4.2), replacing the infinite sum with a rolling window

Ŷ_mt = ∑_{k=−L}^{L} X′_{t−k} V φ̂_mk.   (4.4)

This expression can only be computed for t ∈ {L + 1, . . . , n − L}; for 1 ≤ t ≤ L or n − L + 1 ≤ t ≤ n, we set X_{−L+1} = · · · = X_0 = X_{n+1} = · · · = X_{n+L} = EX_1 = 0. This, of course, creates a certain bias at the boundary of the observation period. As for the choice of L, we observe that ∑_{ℓ∈Z} ‖φ_mℓ‖² = 1. It is then natural to choose L such that ∑_{−L≤ℓ≤L} ‖φ̂_mℓ‖² ≥ 1 − ε, for some small threshold ε, e.g., ε = 0.01.
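The steps of this section can be condensed into a short script. The sketch below is our Python code, not the authors' R implementation; it assumes an orthonormal basis (so the Gram matrix V is the identity and the spectral density matrix is Hermitian) and uses the DFT-friendly frequency grid θ_j = 2πj/(2N + 1) rather than πj/N_θ, so that the discrete Parseval identity ∑_ℓ ‖φ̂_1ℓ‖² = 1 holds exactly.

```python
import numpy as np

def autocov(X, h):
    n = X.shape[0]
    return (X[h:].T @ X[:n - h] / n) if h >= 0 else autocov(X, -h).T

def first_dynamic_filter(X, q=10, N=100, eps=0.01):
    """First dynamic FPC filter for a coefficient series X (n x d) with respect
    to an ORTHONORMAL basis (Gram matrix V = I).  Returns (lags, phi, L), where
    phi[i] is the coefficient vector of the filter element at lag lags[i]."""
    n, d = X.shape
    thetas = 2 * np.pi * np.arange(-N, N + 1) / (2 * N + 1)   # DFT grid in (-pi, pi)
    C = {h: autocov(X, h) for h in range(-q + 1, q)}
    eigvec = np.empty((2 * N + 1, d), dtype=complex)
    for j, th in enumerate(thetas):
        # Bartlett lag-window estimate (4.3) of the spectral density matrix
        F = sum((1 - abs(h) / q) * C[h] * np.exp(-1j * h * th)
                for h in range(-q + 1, q)) / (2 * np.pi)
        w, vecs = np.linalg.eigh(F)               # Hermitian eigendecomposition
        v1 = vecs[:, -1]                          # top eigenvector, unit norm
        z = v1[np.argmax(np.abs(v1))]             # fix the complex orientation
        eigvec[j] = v1 * np.conj(z) / np.abs(z)
    lags = np.arange(-N, N + 1)
    # Discretized inverse Fourier transform of the eigenvectors:
    phi = np.exp(-1j * np.outer(lags, thetas)) @ eigvec / (2 * N + 1)
    norms = (np.abs(phi) ** 2).sum(axis=1)        # ||phi_l||^2, summing to 1
    L = next(l for l in range(N + 1) if norms[np.abs(lags) <= l].sum() >= 1 - eps)
    return lags, phi, L

# Toy run on a seeded VAR(1) coefficient series
rng = np.random.default_rng(0)
X = np.zeros((300, 4))
for t in range(1, 300):
    X[t] = 0.5 * X[t - 1] + rng.standard_normal(4)
lags, phi, L = first_dynamic_filter(X, q=8, N=50)
print((np.abs(phi) ** 2).sum(), L)   # total energy 1 (Parseval); eps-truncation lag L
```

The orientation step plays the role of the sign convention behind Assumption B.3; any reference direction not orthogonal to the eigenvector would do equally well.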
Based on this definition of φ̂_mk, we obtain an empirical p-term dynamic Karhunen-Loeve expansion

X̂_t = ∑_{m=1}^p ∑_{k=−L}^{L} Ŷ_{m,t+k} φ̂_mk,  with Ŷ_mt = 0 for t ∈ {−L + 1, . . . , 0} ∪ {n + 1, . . . , n + L}.   (4.5)
Parallel to (3.13), the proportion of variance explained by the first p dynamic FPCs can be estimated through

PV_dyn(p) := (π/N_θ) ∑_{m≤p} ∑_{j=−N_θ}^{N_θ} λ̂_m(πj/N_θ) / ((1/n) ∑_{k=1}^n ‖X_k‖²).
We will use 1 − PV_dyn(p) as a measure of the loss of information incurred by a dimension reduction to dimension p. Alternatively, one can also use the normalized mean squared error

NMSE(p) := ∑_{k=1}^n ‖X_k − X̂_k‖² / ∑_{k=1}^n ‖X_k‖².   (4.6)

Both quantities converge to the same limit.
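Both criteria are one-liners once the curves are stored as coefficient vectors in an orthonormal basis; a minimal sketch (our helper, hypothetical naming):

```python
import numpy as np

def nmse(X, X_hat):
    """Normalized mean squared error (4.6); rows of X are coefficient vectors of
    the observed curves w.r.t. an orthonormal basis, X_hat the reconstruction."""
    return np.sum((X - X_hat) ** 2) / np.sum(X ** 2)

X = np.array([[1.0, 0.0], [0.0, 2.0]])
print(nmse(X, X), nmse(X, 0.5 * X), nmse(X, np.zeros_like(X)))  # 0.0 0.25 1.0
```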
5 A real-life illustration
In this section, we draw a comparison between dynamic and static FPCA on the basis of a real data set. The observations are half-hourly measurements of the concentration (measured in µg/m³) of particulate matter with an aerodynamic diameter of less than 10 µm, abbreviated as PM10, in ambient air, taken in Graz, Austria, from October 1, 2010 through March 31, 2011. Following Stadlober et al. [36] and Aue et al. [1], a square-root transformation was performed in order to stabilize the variance and avoid heavy-tailed observations. We also removed some outliers and a seasonal (weekly) pattern induced by the different traffic intensities on business days and weekends. We then used the software R to transform the raw, discretely recorded data into functional data, as explained in Section 4, using 15 Fourier basis functions. The resulting curves for the 175 daily observations, X_1, . . . , X_175, say, roughly representing one winter season, for which pollution levels are known to be high, are displayed in Figure 2.
Figure 2: A plot of the 175 daily curves x_t(u), 1 ≤ t ≤ 175, where x_t(u) are the square-root transformed and detrended functional observations of PM10, based on 15 Fourier basis functions. The solid black line represents the sample mean curve µ̂(u).
From those data, we computed the (estimated) first dynamic FPC score sequence (Ŷ^dyn_1t : 1 ≤ t ≤ 175). To this end, we centered the data at their empirical mean µ̂(u), then implemented the procedure described in Section 4. We used the traditional Bartlett kernel w(x) = 1 − |x| in (4.3) to obtain an estimator for the spectral density operator, with bandwidth q = ⌊n^{1/2}⌋ = 13. More sophisticated estimation methods, such as those proposed, for example, by Politis [29], can of course be considered; but they also depend on additional tuning parameters, still leaving much of the selection to the practitioner. From F̂^X_θ we obtain the estimated filter elements φ̂_1ℓ. It turns out that they fade away quite rapidly; in particular, ∑_{ℓ=−10}^{10} ‖φ̂_1ℓ‖² ≈ 0.998. Hence, for the calculation of the scores in (4.4), it is justified to choose L = 10. The five central filter elements φ̂_1ℓ(u), ℓ = −2, . . . , 2, are plotted in Figure 3.
Figure 3: The five central filter elements φ̂_{1,−2}(u), . . . , φ̂_{1,2}(u) (from left to right).
Further components could be computed similarly, but for the purpose of demonstration we focus on one component only. In fact, the first dynamic FPC already explains about 80% of the total variance, compared to the 73% explained by the first static FPC. The latter was also computed, resulting in the static FPC score sequence (Y^stat_1t : 1 ≤ t ≤ 175). Both sequences are shown in Figure 4, along with their differences.
Figure 4: First static (left panel) and first dynamic (middle panel) FPC score sequences over the 175 days, and their differences (right panel).
Although based on entirely different ideas, the static and dynamic scores in Figure 4 (which,
of course, are not loading the same functions) appear to be remarkably close to one another. The
reason why the dynamic Karhunen-Loeve expansion accounts for a significantly larger amount of
the total variation is that, contrary to its static counterpart, it does not just involve the present
observation.
To get more statistical insight into those results, let us consider the first static sample FPC, v̂_1(u), say, displayed in Figure 5. We see that v̂_1(u) ≈ 1 for all u ∈ [0, 1], so that the static FPC score Y^stat_1t = ∫_0^1 (X_t(u) − µ̂(u)) v̂_1(u) du roughly coincides with the average deviation of X_t(u) from the sample mean µ̂(u): the effect of a large (small) first score corresponds to a large (small) daily average of √PM10.

Figure 5: First static FPC v̂_1(u) (solid line) and second static FPC v̂_2(u) (dashed line) [left panel]. µ̂(u) ± v̂_1(u) [middle panel] and µ̂(u) ± v̂_2(u) [right panel] describe the effect of the first and second static FPC on the mean curve.

In view of the similarity between Ŷ^dyn_1t and Y^stat_1t, it is possible to attribute the same interpretation to the dynamic FPC scores. However, regarding the dynamic Karhunen-Loeve expansion, dynamic FPC scores should be interpreted sequentially. To this end, let us take advantage of the fact that ∑_{ℓ=−1}^{1} ‖φ̂_1ℓ‖² ≈ 0.92. In the approximation by a single-term dynamic Karhunen-Loeve expansion, we thus roughly have

X_t(u) ≈ µ̂(u) + ∑_{ℓ=−1}^{1} Ŷ^dyn_{1,t+ℓ} φ̂_1ℓ(u).
This suggests studying the impact of triples (Ŷ^dyn_{1,t−1}, Ŷ^dyn_{1,t}, Ŷ^dyn_{1,t+1}) of consecutive scores on the pollution level of day t. We do this by adding the functions

eff(δ_{−1}, δ_0, δ_1) := ∑_{ℓ=−1}^{1} δ_ℓ φ̂_1ℓ(u),  with δ_i = const × (±1),

to the overall mean curve µ̂(u). In Figure 6, we do this with δ_i = ±1. For instance, the upper left panel shows µ̂(u) + eff(−1, −1, −1), corresponding to the impact of three consecutive small dynamic FPC scores. The result is a negative shift of the mean curve. If two small scores are followed by a large one (second panel from the left in the top row), then the PM10 level increases as u approaches 1. Since a large value of Ŷ^dyn_{1,t+1} implies a large average concentration of √PM10 on day t + 1, and since the pollution curves are highly correlated at the transition from day t to day t + 1, this should indeed be reflected by a higher value of √PM10 towards the end of day t. Similar interpretations can be given for the other panels in Figure 6.
It is interesting to observe that, in this example, the first dynamic FPC seems to take over the roles of the first two static FPCs. The second static FPC (see Figure 5) indeed can be interpreted as an intraday trend effect; if the second static score of day t is large (small), then X_t(u) is increasing (decreasing) over u ∈ [0, 1]. Since we are working with sequentially dependent data, we can get information about such a trend from future and past observations, too. Hence, roughly speaking, we have

∑_{ℓ=−1}^{1} Ŷ^dyn_{1,t+ℓ} φ̂_1ℓ(u) ≈ ∑_{m=1}^{2} Y^stat_mt v̂_m(u).

This is exemplified in Figure 1 of Section 1, which shows the ten consecutive curves x_71(u) − µ̂(u), . . . , x_80(u) − µ̂(u) (left panel) and compares them to the single-term static (middle panel) and the single-term dynamic Karhunen-Loeve expansions (right panel).
Figure 6: Mean curve µ̂(u) (solid line) and µ̂(u) + eff(δ_{−1}, δ_0, δ_1), with δ_i = ±1 (dashed); the panel titles indicate the eight sign combinations (δ_{−1}, δ_0, δ_1).
6 Simulation study
In this simulation study, we compare the performance of dynamic FPCA with that of static FPCA
for a variety of data-generating processes. For each simulated functional time series (Xt), where
Xt = Xt(u), u ∈ [0, 1], we compute the static and dynamic scores, and recover the approximating
series (Xstatt (p)) and (Xdyn
t (p)) that result from the static and dynamic Karhunen-Loeve expansions,
respectively, of order p. The performances of these approximations are measured in terms of the
corresponding normalized mean squared errors (NMSE)
n∑t=1
‖Xt − Xstatt (p)‖2
/ n∑t=1
‖Xt‖2 and
n∑t=1
‖Xt − Xdynt (p)‖2
/ n∑t=1
‖Xt‖2.
The smaller these quantities, the better the approximation.
Computations were implemented in R, using the fda package. The data were simulated according to a functional AR(1) model X_{n+1} = Ψ(X_n) + ε_{n+1}. In practice, this simulation has to be performed in finite dimension d, say. To this end, let (v_i : i ∈ N) be the Fourier basis functions on [0, 1]: for large d, due to the linearity of Ψ,

⟨X_{n+1}, v_j⟩ = ⟨Ψ(X_n), v_j⟩ + ⟨ε_{n+1}, v_j⟩ = ⟨Ψ(∑_{i=1}^∞ ⟨X_n, v_i⟩ v_i), v_j⟩ + ⟨ε_{n+1}, v_j⟩ ≈ ∑_{i=1}^d ⟨X_n, v_i⟩⟨Ψ(v_i), v_j⟩ + ⟨ε_{n+1}, v_j⟩.

Hence, letting X_n = (⟨X_n, v_1⟩, . . . , ⟨X_n, v_d⟩)′ and ε_n = (⟨ε_n, v_1⟩, . . . , ⟨ε_n, v_d⟩)′, the first d Fourier coefficients of X_n approximately satisfy the VAR(1) equation X_{n+1} = PX_n + ε_{n+1}, where P = (⟨Ψ(v_i), v_j⟩ : 1 ≤ i, j ≤ d). Based on this observation, we used a VAR(1) model for generating the first d Fourier coefficients of the process (X_n). To obtain P, we generate a matrix G = (G_ij : 1 ≤ i, j ≤ d), where the G_ij's are mutually independent N(0, ψ_ij), and then set P := κG/‖G‖. Different choices of ψ_ij are considered. Since Ψ is bounded, we have P_ij → 0 as i, j → ∞. For the operators Ψ_1, Ψ_2 and Ψ_3, we used ψ_ij = (i² + j²)^{−1/2}, ψ_ij = (i²/2 + j³/2)^{−1}, and ψ_ij = e^{−(i+j)}, respectively. For d and κ, we considered the values d = 15, 31, 51, 101 and κ = 0.1, 0.3, 0.6, 0.9. The noise (ε_t) is chosen as independent Gaussian and obtained as a linear combination of the functions (v_i : 1 ≤ i ≤ d) with independent zero-mean normal coefficients (C_i : 1 ≤ i ≤ d), such that Var(C_i) = exp(−(i − 1)/10). With this approach, we generate n = 400 observations. We then follow the methodology described in Section 4 and use the Bartlett kernel in (4.3) for estimation of the spectral density operator. The tuning parameter q is set equal to √n = 20. A more sophisticated calibration could probably lead to even better results, but we also observed that moderate variations of q do not fundamentally change our findings. The numerical integration for obtaining φ̂_mk is performed on the basis of 1000 equidistant integration points. In (4.4) we chose L = min(L′, 60), where L′ = min{j ≥ 0 : ∑_{−j≤ℓ≤j} ‖φ̂_mℓ‖² ≥ 0.99}. The limitation L ≤ 60 is imposed to keep computation times moderate. Usually, convergence is relatively fast.
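The data-generating mechanism is easy to reproduce. The sketch below is ours (the paper's experiments were run in R); it takes ‖G‖ to be the spectral norm, so that ‖P‖ = κ exactly, uses a burn-in to reach approximate stationarity, and assumes the decaying noise variances Var(C_i) = exp(−(i − 1)/10).

```python
import numpy as np

def simulate_var1_coeffs(n, d, kappa, psi, burn=100, seed=None):
    """Simulate n coefficient vectors of the FAR(1) model of this section:
    X_{t+1} = P X_t + eps_{t+1}, with P = kappa * G / ||G|| and G_ij ~ N(0, psi(i, j))."""
    rng = np.random.default_rng(seed)
    sd_G = np.sqrt([[psi(i, j) for j in range(1, d + 1)] for i in range(1, d + 1)])
    G = rng.standard_normal((d, d)) * sd_G
    P = kappa * G / np.linalg.norm(G, 2)       # spectral norm, so ||P|| = kappa
    noise_sd = np.exp(-np.arange(d) / 20.0)    # Var(C_i) = exp(-(i - 1)/10), our assumption
    X = np.zeros(d)
    out = np.empty((n, d))
    for t in range(n + burn):
        X = P @ X + noise_sd * rng.standard_normal(d)
        if t >= burn:
            out[t - burn] = X
    return out, P

coeffs, P = simulate_var1_coeffs(400, 15, 0.6,
                                 lambda i, j: (i * i + j * j) ** -0.5, seed=1)
print(coeffs.shape, round(np.linalg.norm(P, 2), 6))   # (400, 15) 0.6
```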
For each choice of d and κ, the experiment as described above is repeated 200 times. The mean
and standard deviation of NMSE in different settings and with values p = 1, 2, 3, 6 are reported in
Table 1. Results do not vary much among setups with d ≥ 31, and thus in Table 1 we only present
the cases d = 15 and d = 101.
We see that, in basically all settings, dynamic FPCA significantly outperforms static FPCA in
terms of NMSE. As one would expect, the difference becomes more striking with increasing dependence coefficient κ. It is also interesting to observe that the variation of NMSE among the 200 replications is systematically smaller for the dynamic procedure.
Finally, it should be noted that, in contrast to the static PCA, the empirical version of our
procedure is not “exact”, but is subject to small approximation errors. These approximation errors
can stem from numerical integration (which is required in the calculation of φmk) and are also due
to the truncation of the filters at some finite lag L (see Section 4). Such small deviations do not matter in practice if a component explains a significant proportion of the variance. If, however, the additional contribution of a higher-order component is very small, then it may fail to compensate for the approximation error. This becomes visible in the setting Ψ_3 with 3 or 6 components, where for some constellations the NMSE for dynamic components is slightly larger than for the static ones.
              1 component                  2 components                 3 components                 6 components
  d     κ     static        dynamic       static        dynamic       static        dynamic       static        dynamic
 Ψ1
  15   0.1    0.697 (0.16)  0.637 (0.13)  0.546 (0.15)  0.447 (0.10)  0.443 (0.12)  0.325 (0.08)  0.256 (0.08)  0.138 (0.05)
       0.3    0.696 (0.16)  0.621 (0.14)  0.542 (0.15)  0.434 (0.11)  0.440 (0.13)  0.314 (0.08)  0.253 (0.08)  0.132 (0.05)
       0.6    0.687 (0.32)  0.571 (0.23)  0.526 (0.25)  0.392 (0.15)  0.423 (0.20)  0.283 (0.11)  0.240 (0.11)  0.119 (0.06)
       0.9    0.648 (0.76)  0.479 (0.47)  0.481 (0.56)  0.322 (0.29)  0.377 (0.43)  0.229 (0.20)  0.209 (0.22)  0.096 (0.09)
 101   0.1    0.805 (0.12)  0.740 (0.08)  0.708 (0.11)  0.587 (0.08)  0.642 (0.12)  0.478 (0.07)  0.519 (0.08)  0.274 (0.05)
       0.3    0.802 (0.13)  0.729 (0.11)  0.704 (0.12)  0.577 (0.09)  0.637 (0.11)  0.469 (0.08)  0.515 (0.10)  0.269 (0.05)
       0.6    0.792 (0.22)  0.690 (0.18)  0.689 (0.19)  0.545 (0.12)  0.619 (0.16)  0.441 (0.10)  0.495 (0.13)  0.252 (0.07)
       0.9    0.755 (0.66)  0.616 (0.45)  0.640 (0.50)  0.479 (0.31)  0.568 (0.40)  0.387 (0.23)  0.446 (0.34)  0.220 (0.15)
 Ψ2
  15   0.1    0.524 (0.20)  0.491 (0.17)  0.355 (0.14)  0.306 (0.11)  0.263 (0.10)  0.208 (0.08)  0.129 (0.05)  0.082 (0.03)
       0.3    0.522 (0.21)  0.473 (0.18)  0.351 (0.16)  0.294 (0.12)  0.259 (0.12)  0.200 (0.08)  0.126 (0.06)  0.078 (0.04)
       0.6    0.507 (0.49)  0.413 (0.29)  0.331 (0.29)  0.255 (0.15)  0.240 (0.19)  0.174 (0.10)  0.114 (0.08)  0.068 (0.05)
       0.9    0.458 (1.15)  0.310 (0.59)  0.272 (0.64)  0.187 (0.32)  0.193 (0.41)  0.130 (0.21)  0.088 (0.17)  0.052 (0.09)
 101   0.1    0.585 (0.19)  0.549 (0.17)  0.436 (0.15)  0.378 (0.11)  0.356 (0.13)  0.282 (0.10)  0.240 (0.08)  0.146 (0.05)
       0.3    0.581 (0.21)  0.530 (0.18)  0.436 (0.12)  0.369 (0.11)  0.350 (0.13)  0.274 (0.09)  0.234 (0.10)  0.141 (0.06)
       0.6    0.564 (0.46)  0.469 (0.27)  0.405 (0.33)  0.321 (0.18)  0.323 (0.21)  0.242 (0.13)  0.212 (0.12)  0.125 (0.07)
       0.9    0.495 (1.06)  0.362 (0.59)  0.345 (0.68)  0.250 (0.39)  0.251 (0.58)  0.180 (0.34)  0.168 (0.26)  0.097 (0.14)
 Ψ3
  15   0.1    0.367 (0.20)  0.344 (0.18)  0.134 (0.08)  0.127 (0.07)  0.049 (0.03)  0.054 (0.04)  0.002 (0.00)  0.017 (0.03)
       0.3    0.362 (0.24)  0.322 (0.17)  0.129 (0.09)  0.119 (0.07)  0.048 (0.03)  0.050 (0.04)  0.002 (0.00)  0.015 (0.03)
       0.6    0.334 (0.55)  0.253 (0.24)  0.113 (0.16)  0.097 (0.09)  0.041 (0.05)  0.040 (0.04)  0.002 (0.00)  0.011 (0.02)
       0.9    0.236 (1.12)  0.146 (0.43)  0.074 (0.28)  0.061 (0.16)  0.025 (0.08)  0.027 (0.07)  0.001 (0.00)  0.008 (0.04)
 101   0.1    0.366 (0.19)  0.344 (0.17)  0.134 (0.08)  0.127 (0.07)  0.049 (0.03)  0.054 (0.04)  0.002 (0.00)  0.017 (0.03)
       0.3    0.363 (0.25)  0.322 (0.18)  0.131 (0.10)  0.120 (0.07)  0.047 (0.03)  0.050 (0.04)  0.002 (0.00)  0.015 (0.03)
       0.6    0.325 (0.52)  0.251 (0.24)  0.113 (0.16)  0.098 (0.09)  0.040 (0.05)  0.040 (0.04)  0.002 (0.00)  0.011 (0.02)
       0.9    0.235 (1.05)  0.149 (0.43)  0.074 (0.28)  0.061 (0.16)  0.025 (0.09)  0.026 (0.07)  0.001 (0.00)  0.008 (0.04)

Table 1: Results of the simulations of Section 6. Each entry gives the mean NMSE for the static and dynamic procedures over 200 simulation runs, with standard deviations (multiplied by a factor 10) in brackets. The values κ give the size of ‖Ψ_i‖_L, i = 1, 2, 3. We consider dimensions d = 15 and d = 101 of the underlying models.
7 Conclusion
Functional principal component analysis is taking a leading role in the functional data literature. As
an extremely effective tool for dimension reduction, it is useful for empirical data analysis as well as
for many FDA-related methods, like functional linear models. A frequent situation in practice is that
functional data are observed sequentially over time and exhibit serial dependence. This happens, for
instance, when observations stem from a continuous-time process which is segmented into smaller
units, e.g., days. In such cases, classical static FPCA still may be useful, but, in contrast to the
i.i.d. setup, it does not lead to an optimal dimension-reduction technique.
In this paper, we propose a dynamic version of FPCA which takes advantage of the potential serial
dependencies in the functional observations. In the special case of uncorrelated data, the dynamic
FPC methodology reduces to the usual static one. But, in the presence of serial dependence, static
FPCA is outperformed, quite significantly so if the serial dependence is strong.
This paper also provides (i) guidelines for practical implementation, (ii) a toy example with PM10 air pollution data, and (iii) a simulation study. Our application provides empirical evidence that dynamic FPCs have a clear edge over static FPCs in terms of their ability to represent dependent functional data in small dimension. In the appendices, our results are cast into a rigorous
mathematical framework, and we show that the proposed estimators of dynamic FPC scores are
consistent.
Appendices of Chapter 3
A General methodology and proofs
In this appendix, we give a mathematically rigorous description of the methodology introduced in
Section 3.1. We adopt a more general framework which can be specialized to the functional setup of
Section 3.1. Throughout, H denotes some (complex) separable Hilbert space equipped with norm
‖ · ‖ and inner product 〈·, ·〉. We work in complex spaces, since our theory is based on a frequency
domain analysis. Nevertheless, all our functional time series observations Xt are assumed to be
real-valued functions.
A.1 Fourier series in Hilbert spaces.
For p ≥ 1, consider the space L^p_H([−π, π]), that is, the space of measurable mappings x : [−π, π] → H such that ∫_{−π}^{π} ‖x(θ)‖^p dθ < ∞. Then, ‖x‖_p = ((1/2π) ∫_{−π}^{π} ‖x(θ)‖^p dθ)^{1/p} defines a norm. Equipped with this norm, L^p_H([−π, π]) is a Banach space and, for p = 2, a Hilbert space with inner product

(x, y) := (1/2π) ∫_{−π}^{π} ⟨x(θ), y(θ)⟩ dθ.
One can show (see, e.g., [8, Lemma 1.4]) that, for any x ∈ L¹_H([−π, π]), there exists a unique element I(x) ∈ H which satisfies

∫_{−π}^{π} ⟨x(θ), v⟩ dθ = ⟨I(x), v⟩  ∀v ∈ H.   (A.1)

We define ∫_{−π}^{π} x(θ) dθ := I(x).
For x ∈ L²_H([−π, π]), define the k-th Fourier coefficient as

f_k := (1/2π) ∫_{−π}^{π} x(θ) e^{−ikθ} dθ,  k ∈ Z.   (A.2)

Below, we write e_k for the function θ ↦ e^{ikθ}, θ ∈ [−π, π].

Proposition 6. Suppose x ∈ L²_H([−π, π]) and define f_k by equation (A.2). Then, the sequence S_n := ∑_{k=−n}^{n} f_k e_k has a mean square limit in L²_H([−π, π]). If we denote the limit by S, then x(θ) = S(θ) for almost all θ.
Proof. See supplementary document.
Let us turn to the Fourier expansion of the eigenfunctions ϕ_m(θ) used in the definition of the dynamic FPCs. Eigenvectors are scaled to unit length: ‖ϕ_m(θ)‖ = 1. In order for ϕ_m to belong to L²_H([−π, π]), we additionally need measurability, which cannot be taken for granted. This comes from the fact that ‖zϕ_m(θ)‖ = 1 for all z on the complex unit circle: in principle, we could choose the "signs" z = z(θ) in an extremely erratic way, such that ϕ_m(θ) is no longer measurable. To exclude such pathological choices, we tacitly impose in the sequel that versions of ϕ_m(θ) have been chosen in a "smooth enough way", so as to be measurable.
Now we can expand the eigenfunctions ϕ_m(θ) in a Fourier series in the sense explained above:

ϕ_m = ∑_{ℓ∈Z} φ_mℓ e_ℓ  with  φ_mℓ = (1/2π) ∫_{−π}^{π} ϕ_m(s) e^{−iℓs} ds.
The coefficients φ_mℓ thus defined yield the definition (3.10) of dynamic FPCs. In the special case H = L²([0, 1]), φ_mℓ = φ_mℓ(u) satisfies, by (A.1),

∫_0^1 φ_mℓ(u) v(u) du = (1/2π) ∫_{−π}^{π} (∫_0^1 ϕ_m(u|s) v(u) du) e^{−iℓs} ds = ∫_0^1 ((1/2π) ∫_{−π}^{π} ϕ_m(u|s) e^{−iℓs} ds) v(u) du  ∀v ∈ H.

This implies that φ_mℓ(u) = (1/2π) ∫_{−π}^{π} ϕ_m(u|s) e^{−iℓs} ds for almost all u ∈ [0, 1], which is in line with the definition given in (3.9). Furthermore, (3.8) follows directly from Proposition 6.
A.2 The spectral density operator
Assume that the H-valued process (X_t : t ∈ Z) is stationary with lag-h autocovariance operator C^X_h and spectral density operator

F^X_θ := (1/2π) ∑_{h∈Z} C^X_h e^{−ihθ}.   (A.3)
Let S(H, H′) be the set of Hilbert-Schmidt operators mapping from H to H′ (both assumed to be separable Hilbert spaces). When H = H′ and it is clear which space H is meant, we sometimes simply write S. With the Hilbert-Schmidt norm ‖·‖_{S(H,H′)}, this defines again a separable Hilbert space, and so does L²_{S(H,H′)}([−π, π]). We will impose that the series in (A.3) converges in L²_{S(H,H)}([−π, π]); we then say that (X_t) possesses a spectral density operator.

Remark 3. It follows that the results of the previous section can be applied. In particular, we may deduce that C^X_k = ∫_{−π}^{π} F^X_θ e^{ikθ} dθ.
A sufficient condition for convergence of (A.3) in L2S(H,H)([−π, π]) is assumption (3.5). Then, it
can be easily shown that the operator FXθ is self-adjoint, non-negative definite and Hilbert-Schmidt.
Below, we introduce a weak dependence assumption established in [18], from which we can derive a
sufficient condition for (3.5).
Definition 4 (L^p-m-approximability). A random H-valued sequence (X_n : n ∈ Z) is called L^p-m-approximable if it can be represented as X_n = f(δ_n, δ_{n−1}, δ_{n−2}, . . .), where the δ_i's are i.i.d. elements taking values in some measurable space S and f is a measurable function f : S^∞ → H. Moreover, if δ′_1, δ′_2, . . . are independent copies of δ_1, δ_2, . . . defined on the same measurable space S, then, for

X^(m)_n := f(δ_n, δ_{n−1}, . . . , δ_{n−m+1}, δ′_{n−m}, δ′_{n−m−1}, . . .),

we have

∑_{m=1}^∞ (E‖X_m − X^(m)_m‖^p)^{1/p} < ∞.   (A.4)
Hormann and Kokoszka [18] show that this notion is widely applicable to linear and non-linear
functional time series. One of its main advantages is that it is a purely moment-based dependence
measure that can be easily verified in many special cases.
Proposition 7. Assume that (Xt) is L2–m–approximable. Then (3.5) holds and the operators FXθ ,
θ ∈ [−π, π], are trace-class.
Proof. See supplementary document.
Instead of Assumption (3.5), Panaretos and Tavakoli [27] impose, for the definition of a spectral density operator, summability of C^X_h in Schatten 1-norm, that is, ∑_{h∈Z} ‖C^X_h‖_T < ∞. Under this slightly more stringent assumption, it immediately follows that the resulting spectral operator is trace-class. The verification of convergence may, however, be a bit delicate; at least, we could not find a simple criterion as in Proposition 7.
Proposition 8. Let F^X_θ be the spectral density operator of a stationary sequence (X_t) for which the summability condition (3.5) holds. Let λ_1(θ) ≥ λ_2(θ) ≥ · · · denote its eigenvalues, and let ϕ_m(θ) be the corresponding eigenfunctions. Then, (a) the functions θ ↦ λ_m(θ) are continuous; (b) if we strengthen (3.5) into the more stringent condition ∑_{h∈Z} |h| ‖C^X_h‖_S < ∞, the λ_m(θ)'s are Lipschitz-continuous functions of θ; (c) assuming that (X_t) is real-valued, for each θ ∈ [−π, π], λ_m(θ) = λ_m(−θ) and ϕ_m(θ) = ϕ̄_m(−θ).
Proof. See supplementary document.
Let x̄ be the conjugate element of x, i.e., ⟨x̄, z̄⟩ = ⟨z, x⟩ for all z ∈ H. Then x is real-valued iff x = x̄.

Remark 4. Since ϕ_m(θ) is Hermitian, it immediately follows that φ̄_mℓ = φ_mℓ, implying that the dynamic FPCs are real if the process (X_t) is.
A.3 Functional filters
Computation of the dynamic FPCs requires applying time-invariant functional filters to the process (X_t). Let Ψ = (Ψ_k : k ∈ Z) be a sequence of linear operators mapping the separable Hilbert space H to the separable Hilbert space H′. Let B be the backshift or lag operator, defined by B^k X_t := X_{t−k}, k ∈ Z. Then the functional filter Ψ(B) := ∑_{k∈Z} Ψ_k B^k, when applied to the sequence (X_t), produces an output series (Y_t) in H′ via

Y_t = Ψ(B)X_t = ∑_{k∈Z} Ψ_k(X_{t−k}).   (A.5)

Call Ψ the sequence of filter coefficients and, in the style of the scalar or vector time series terminology, call

Ψ_θ = Ψ(e^{−iθ}) = ∑_{k∈Z} Ψ_k e^{−ikθ}   (A.6)
the frequency response function of the filter Ψ(B). Of course, series (A.5) and (A.6) only have a
meaning if they converge in an appropriate sense. Below we use the following technical lemma.
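To make (A.5) and (A.6) concrete, here is a minimal numerical sketch, not taken from the thesis: curves in $H$ are discretized on a grid of $d$ points, so each filter coefficient $\Psi_k$ becomes a $d \times d$ matrix; the filtered series is a truncated version of (A.5), and the frequency response (A.6) is a finite sum. The toy filter, the grid size, and all function names are our own assumptions.

```python
import numpy as np

# Hypothetical setup: H is discretized to R^d, so filter coefficients are matrices.
rng = np.random.default_rng(0)
d, n = 5, 200
lags = [-1, 0, 1]                                     # a short two-sided filter
Psi = {k: rng.standard_normal((d, d)) / (2 + abs(k)) for k in lags}
X = rng.standard_normal((n, d))                       # discretized input series

def apply_filter(X, Psi):
    """Truncated version of (A.5): Y_t = sum_k Psi_k X_{t-k}; boundary values dropped."""
    kmax = max(abs(k) for k in Psi)
    return np.array([sum(P @ X[t - k] for k, P in Psi.items())
                     for t in range(kmax, X.shape[0] - kmax)])

def frequency_response(Psi, theta):
    """Equation (A.6): Psi(e^{-i theta}) = sum_k Psi_k e^{-i k theta}."""
    return sum(P * np.exp(-1j * k * theta) for k, P in Psi.items())

Y = apply_filter(X, Psi)                              # output series in R^d
```

At $\theta = 0$ the frequency response is simply the sum of the coefficient matrices, i.e. the gain of the filter for a constant input.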
Proposition 9. Suppose that $(X_t)$ is a stationary sequence in $L^2_H$ and possesses a spectral density operator satisfying $\sup_\theta \mathrm{tr}(\mathcal{F}^X_\theta) < \infty$. Consider a filter $(\Psi_k)$ such that $\Psi_\theta$ converges in $L^2_{\mathcal{S}(H,H')}([-\pi,\pi])$, and suppose that $\sup_\theta \|\Psi_\theta\|_{\mathcal{S}(H,H')} < \infty$. Then,

(i) the series $Y_t := \sum_{k \in \mathbb{Z}} \Psi_k(X_{t-k})$ converges in $L^2_{H'}$;

(ii) $(Y_t)$ possesses the spectral density operator $\mathcal{F}^Y_\theta = \Psi_\theta \mathcal{F}^X_\theta (\Psi_\theta)^*$;

(iii) $\sup_\theta \mathrm{tr}(\mathcal{F}^Y_\theta) < \infty$.
Proof. See supplementary document.
In particular, the last proposition allows for iterative application: if $\sup_\theta \mathrm{tr}(\mathcal{F}^X_\theta) < \infty$ and $\Psi_\theta$ satisfies the above properties, then the analogous results apply to the output $(Y_t)$. This is what we use in the proofs of Theorems 1 and 2.
A.4 Proofs for Section 3
To start with, observe that Propositions 2 and 4 follow directly from Proposition 9. Part (a) of Proposition 3 has also been established in the previous section (see Remark 4), and part (b) is immediate. Thus, we can proceed to the proof of Theorems 1 and 2.
Proof of Theorems 2 and 3. Assume we have filter coefficients $\Psi = (\Psi_k : k \in \mathbb{Z})$ and $\Upsilon = (\Upsilon_k : k \in \mathbb{Z})$, where $\Psi_k : H \to \mathbb{C}^p$ and $\Upsilon_k : \mathbb{C}^p \to H$ both belong to the class $\mathcal{C}$. If $(X_t)$ and $(Y_t)$ are $H$-valued and $\mathbb{C}^p$-valued processes, respectively, then there exist elements $\psi_{mk}$ and $\upsilon_{mk}$ in $H$ such that
$$\Psi(B)(X_t) = \sum_{k \in \mathbb{Z}} \big(\langle X_{t-k}, \psi_{1k}\rangle, \ldots, \langle X_{t-k}, \psi_{pk}\rangle\big)'$$
and
$$\Upsilon(B)(Y_t) = \sum_{\ell \in \mathbb{Z}} \sum_{m=1}^{p} Y_{t+\ell,m}\, \upsilon_{m\ell}.$$
Hence, the $p$-dimensional reconstruction of $X_t$ in Theorem 3 is of the form
$$\sum_{m=1}^{p} X_{mt} = \Upsilon(B)[\Psi(B)X_t] =: \Upsilon\Psi(B)X_t.$$
Since $\Psi$ and $\Upsilon$ are required to belong to $\mathcal{C}$, we conclude from Proposition 9 that the processes $Y_t := \Psi(B)X_t$ and $\Upsilon(B)Y_t$ are mean-square convergent and possess spectral density operators. Letting $\psi_m(\theta) = \sum_{k \in \mathbb{Z}} \psi_{mk}\, e^{ik\theta}$ and $\upsilon_m(\theta) = \sum_{\ell \in \mathbb{Z}} \upsilon_{m\ell}\, e^{i\ell\theta}$, we obtain, for $x \in H$ and $y = (y_1, \ldots, y_p)' \in \mathbb{C}^p$, that the frequency response functions $\Psi_\theta$ and $\Upsilon_\theta$ satisfy
$$\Psi_\theta(x) = \sum_{k \in \mathbb{Z}} \big(\langle x, \psi_{1k}\rangle, \ldots, \langle x, \psi_{pk}\rangle\big)'\, e^{-ik\theta} = \big(\langle x, \psi_1(\theta)\rangle, \ldots, \langle x, \psi_p(\theta)\rangle\big)'$$
and
$$\Upsilon_\theta(y) = \sum_{\ell \in \mathbb{Z}} \sum_{m=1}^{p} y_m\, \upsilon_{m\ell}\, e^{-i\ell\theta} = \sum_{m=1}^{p} y_m\, \upsilon_m(-\theta).$$
Consequently,
$$\Upsilon_\theta \Psi_\theta = \sum_{m=1}^{p} \upsilon_m(-\theta) \otimes \psi_m(\theta). \tag{A.7}$$
Now, using Proposition 9, it is readily verified that, for $Z_t := X_t - \Upsilon\Psi(B)X_t$, we obtain the spectral density operator
$$\mathcal{F}^Z_\theta = \big(\mathcal{F}^{X,1/2}_\theta - \Upsilon_\theta\Psi_\theta\, \mathcal{F}^{X,1/2}_\theta\big)\big(\mathcal{F}^{X,1/2}_\theta - \mathcal{F}^{X,1/2}_\theta\, \Psi^*_\theta\Upsilon^*_\theta\big), \tag{A.8}$$
where $\mathcal{F}^{X,1/2}_\theta$ is such that $\mathcal{F}^{X,1/2}_\theta \mathcal{F}^{X,1/2}_\theta = \mathcal{F}^X_\theta$.
Using Lemma 5,
$$E\|X_t - \Upsilon\Psi(B)X_t\|^2 = \int_{-\pi}^{\pi} \mathrm{tr}(\mathcal{F}^Z_\theta)\, d\theta = \int_{-\pi}^{\pi} \big\|\mathcal{F}^{X,1/2}_\theta - \Upsilon_\theta\Psi_\theta\, \mathcal{F}^{X,1/2}_\theta\big\|^2_{\mathcal{S}}\, d\theta. \tag{A.9}$$
Clearly, (A.9) is minimized if we minimize the integrand for every fixed $\theta$, under the constraint that $\Upsilon_\theta\Psi_\theta$ is of the form (A.7). Employing the eigendecomposition $\mathcal{F}^X_\theta = \sum_{m \ge 1} \lambda_m(\theta)\, \varphi_m(\theta) \otimes \varphi_m(\theta)$, we infer that
$$\mathcal{F}^{X,1/2}_\theta = \sum_{m \ge 1} \sqrt{\lambda_m(\theta)}\; \varphi_m(\theta) \otimes \varphi_m(\theta).$$
The best approximating operator of rank $p$ to $\mathcal{F}^{X,1/2}_\theta$ is the operator
$$\mathcal{F}^{X,1/2}_\theta(p) = \sum_{m=1}^{p} \sqrt{\lambda_m(\theta)}\; \varphi_m(\theta) \otimes \varphi_m(\theta),$$
which is obtained if we choose $\Upsilon_\theta\Psi_\theta = \sum_{m=1}^{p} \varphi_m(\theta) \otimes \varphi_m(\theta)$ and hence
$$\psi_m(\theta) = \varphi_m(\theta) \quad\text{and}\quad \upsilon_m(\theta) = \varphi_m(-\theta).$$
Consequently, by Proposition 6, we get
$$\psi_{mk} = \frac{1}{2\pi}\int_{-\pi}^{\pi} \varphi_m(s)\, e^{-iks}\, ds \quad\text{and}\quad \upsilon_{mk} = \frac{1}{2\pi}\int_{-\pi}^{\pi} \varphi_m(-s)\, e^{-iks}\, ds = \psi_{m,-k}.$$
With this choice, it is clear that $\Upsilon\Psi(B)X_t = \sum_{m=1}^{p} X_{mt}$ and
$$E\Big\|X_t - \sum_{m=1}^{p} X_{mt}\Big\|^2 = \int_{-\pi}^{\pi} \big\|\mathcal{F}^{X,1/2}_\theta - \mathcal{F}^{X,1/2}_\theta(p)\big\|^2_{\mathcal{S}}\, d\theta = \int_{-\pi}^{\pi} \sum_{m>p} \lambda_m(\theta)\, d\theta;$$
the proof of Theorem 3 follows.

Turning to Theorem 2, observe that, by the monotone convergence theorem, the last integral tends to zero as $p \to \infty$, which completes the proof of Theorem 2.
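The Fourier-coefficient formulas for $\psi_{mk}$ and $\upsilon_{mk}$ in the proof above can be approximated numerically by replacing the integral with a Riemann sum over an equispaced frequency grid. A minimal sketch, not from the thesis: $H$ is discretized to $\mathbb{R}^d$, the eigenfunction curve $\varphi(\theta)$ is a toy user-supplied function, and `filter_coefficients` is a hypothetical helper of ours.

```python
import numpy as np

thetas = np.linspace(-np.pi, np.pi, 512, endpoint=False)  # frequency grid
dtheta = thetas[1] - thetas[0]

def filter_coefficients(phi, ks):
    """Riemann-sum approximation of psi_k = (2*pi)^{-1} * integral of phi(s) e^{-iks} ds."""
    return {k: sum(phi(s) * np.exp(-1j * k * s) for s in thetas) * dtheta / (2 * np.pi)
            for k in ks}

# Toy eigenfunction, constant in theta: all mass should sit at lag k = 0.
v = np.array([1.0, 2.0, 3.0, 4.0])
coeffs = filter_coefficients(lambda s: v, range(-2, 3))
```

The mirrored coefficients $\upsilon_{mk} = \psi_{m,-k}$ then come for free once the $\psi_{mk}$ are computed.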
B Large sample properties

For the proof of Theorem 4, let us show that $E|\hat{Y}_{mt} - Y_{mt}| \to 0$ as $n \to \infty$. Fixing $L \ge 1$,
$$E|\hat{Y}_{mt} - Y_{mt}| \le E\Big|\sum_{j \in \mathbb{Z}} \langle X_{t-j}, \phi_{mj}\rangle - \sum_{j=-L}^{L} \langle X_{t-j}, \hat{\phi}_{mj}\rangle\Big| \le E\Big|\sum_{j=-L}^{L} \langle X_{t-j}, \hat{\phi}_{mj} - \phi_{mj}\rangle\Big| + E\Big|\sum_{|j|>L} \langle X_{t-j}, \phi_{mj}\rangle\Big|, \tag{B.1}$$
and the result follows if each summand in (B.1) converges to zero, which we prove in the two subsequent lemmas. For notational convenience, we often suppress the dependence on the sample size $n$; all limits below, however, are taken as $n \to \infty$.
Lemma 2. If $L = L(n) \to \infty$ sufficiently slowly, then, under Assumptions B.1–B.3, we have
$$\Big|\sum_{|j| \le L} \langle X_{k-j}, \hat{\phi}_{mj} - \phi_{mj}\rangle\Big| = o_P(1).$$
Proof. The triangle and Cauchy–Schwarz inequalities yield
$$\Big|\sum_{|j| \le L} \langle X_{k-j}, \hat{\phi}_{mj} - \phi_{mj}\rangle\Big| \le \sum_{j=-L}^{L} \|X_{k-j}\|\,\|\hat{\phi}_{mj} - \phi_{mj}\| \le \max_{j \in \mathbb{Z}} \|\hat{\phi}_{mj} - \phi_{mj}\| \sum_{j=-L}^{L} \|X_{k-j}\|.$$
Let $c_m(\theta) := \langle \hat{\varphi}_m(\theta), \varphi_m(\theta)\rangle / |\langle \hat{\varphi}_m(\theta), \varphi_m(\theta)\rangle|$. Jensen's inequality and the triangle inequality imply that, for any $j \in \mathbb{Z}$,
$$2\pi\|\hat{\phi}_{mj} - \phi_{mj}\| = \Big\|\int_{-\pi}^{\pi} \big(\hat{\varphi}_m(\theta) - \varphi_m(\theta)\big)\, e^{ij\theta}\, d\theta\Big\| \le \int_{-\pi}^{\pi} \|\hat{\varphi}_m(\theta) - \varphi_m(\theta)\|\, d\theta \le \int_{-\pi}^{\pi} \|\hat{\varphi}_m(\theta) - c_m(\theta)\varphi_m(\theta)\|\, d\theta + \int_{-\pi}^{\pi} |1 - c_m(\theta)|\, d\theta =: Q_1 + Q_2.$$
By Lemma 3.2 in [18], we have
$$Q_1 \le \int_{-\pi}^{\pi} \Big(\frac{8}{|\alpha_m(\theta)|^2}\, \|\hat{\mathcal{F}}^X_\theta - \mathcal{F}^X_\theta\|_{\mathcal{S}}\Big) \wedge 2\; d\theta.$$
By Assumption B.2, $\alpha_m(\theta)$ has only finitely many zeros, $\theta_1, \ldots, \theta_K$, say. Let $\delta_\varepsilon(\theta) := [\theta - \varepsilon, \theta + \varepsilon]$ and $A(m,\varepsilon) := \bigcup_{i=1}^{K} \delta_\varepsilon(\theta_i)$. By definition, the Lebesgue measure of this set satisfies $|A(m,\varepsilon)| \le 2K\varepsilon$. Define $M_\varepsilon$ by
$$M^{-1}_\varepsilon = \min\big\{\alpha_m(\theta) \mid \theta \in [-\pi,\pi] \setminus A(m,\varepsilon)\big\}.$$
By continuity of $\alpha_m(\theta)$ (see Proposition 8), we have $M_\varepsilon < \infty$, and thus
$$\int_{-\pi}^{\pi} \Big(\frac{8}{|\alpha_m(\theta)|^2}\, \|\hat{\mathcal{F}}^X_\theta - \mathcal{F}^X_\theta\|_{\mathcal{S}}\Big) \wedge 2\; d\theta \le 4K\varepsilon + 8M^2_\varepsilon \int_{-\pi}^{\pi} \|\hat{\mathcal{F}}^X_\theta - \mathcal{F}^X_\theta\|_{\mathcal{S}}\, d\theta =: B_{n,\varepsilon}.$$
By Assumption B.1, there exists a sequence $\varepsilon_n \to 0$ such that $B_{n,\varepsilon_n} \to 0$ in probability, which entails $Q_1 = o_P(1)$. Note that this also implies
$$\int_{-\pi}^{\pi} \big|\langle \hat{\varphi}_m(\theta), v\rangle - c_m(\theta)\langle \varphi_m(\theta), v\rangle\big|\, d\theta = o_P(1). \tag{B.2}$$
Turning to $Q_2$, suppose that $Q_2$ is not $o_P(1)$. Then there exist $\varepsilon > 0$ and $\delta > 0$ such that, for infinitely many $n$, $P(Q_2 \ge \varepsilon) \ge \delta$. Set
$$F = F_n := \Big\{\theta \in [-\pi,\pi] : |c_m(\theta) - 1| \ge \frac{\varepsilon}{4\pi}\Big\}.$$
One can easily show that, on the set $\{Q_2 \ge \varepsilon\}$, we have $\lambda(F) > \varepsilon/4$. Clearly, $|c_m(\theta) - 1| \ge \varepsilon/4\pi$ implies that $c_m(\theta) = e^{iz(\theta)}$ with $z(\theta) \in [-\pi/2, -\varepsilon'] \cup [\varepsilon', \pi/2]$, for some small enough $\varepsilon' > 0$. Then the left-hand side in (B.2) is bounded from below by
$$\int_F \big|\langle \hat{\varphi}_m(\theta), v\rangle - c_m(\theta)\langle \varphi_m(\theta), v\rangle\big|\, d\theta = \int_F \Big(\big(\langle \hat{\varphi}_m(\theta), v\rangle - \cos(z(\theta))\langle \varphi_m(\theta), v\rangle\big)^2 + \sin^2(z(\theta))\,\langle \varphi_m(\theta), v\rangle^2\Big)^{1/2} d\theta. \tag{B.3}$$
Write $F := F' \cup F''$, where
$$F' := F \cap \Big\{\theta : \big|\langle \hat{\varphi}_m(\theta), v\rangle - \cos(z(\theta))\langle \varphi_m(\theta), v\rangle\big| \ge \frac{\langle \varphi_m(\theta), v\rangle}{2}\Big\}$$
and
$$F'' := F \cap \Big\{\theta : \big|\langle \hat{\varphi}_m(\theta), v\rangle - \cos(z(\theta))\langle \varphi_m(\theta), v\rangle\big| < \frac{\langle \varphi_m(\theta), v\rangle}{2}\Big\}.$$
On $F'$, the integrand in (B.3) is greater than or equal to $\langle \varphi_m(\theta), v\rangle/2$. On $F''$, the inequality $\cos(z(\theta))\langle \varphi_m(\theta), v\rangle > \langle \varphi_m(\theta), v\rangle/2$ holds, and consequently
$$\langle \varphi_m(\theta), v\rangle\,|\sin(z(\theta))| \ge \frac{\langle \varphi_m(\theta), v\rangle}{2}\,|\sin(z(\theta))| \ge \frac{\langle \varphi_m(\theta), v\rangle}{\pi}\,|z(\theta)| \ge \frac{\langle \varphi_m(\theta), v\rangle}{\pi}\,\varepsilon'.$$
Altogether, this yields that the integrand in (B.3) is larger than or equal to $\langle \varphi_m(\theta), v\rangle\,\varepsilon'/\pi$. Now, due to Assumption B.3, it is easy to see that (B.2) cannot hold. This leads to a contradiction.
Thus, we can conclude that $\max_{j \in \mathbb{Z}} \|\hat{\phi}_{mj} - \phi_{mj}\| = o_P(1)$, so that, for sufficiently slowly growing $L$, we also have $L \max_{j \in \mathbb{Z}} \|\hat{\phi}_{mj} - \phi_{mj}\| = o_P(1)$. Consequently,
$$\Big|\sum_{|j| \le L} \langle X_{k-j}, \hat{\phi}_{mj} - \phi_{mj}\rangle\Big| = o_P(1) \times \Big(L^{-1}\sum_{j=-L}^{L} \|X_{k-j}\|\Big). \tag{B.4}$$
It remains to show that $L^{-1}\sum_{j=-L}^{L} \|X_{k-j}\| = O_P(1)$. By the weak stationarity assumption, we have $E\|X_k\|^2 = E\|X_1\|^2$, and hence, for any $x > 0$,
$$P\Big(L^{-1}\sum_{j=-L}^{L} \|X_{k-j}\| > x\Big) \le \frac{\sum_{j=-L}^{L} E\|X_{k-j}\|}{Lx} \le \frac{3\sqrt{E\|X_1\|^2}}{x}.$$
Lemma 3. Let $L = L(n) \to \infty$. Then, under condition (3.5), we have
$$\Big|\sum_{|j|>L} \langle X_{k-j}, \phi_{mj}\rangle\Big| = o_P(1).$$
Proof. This is immediate from Proposition 4, part (a).
Turning to the proof of Proposition 5, we first establish the following lemma, which is an extension to lag-$h$ autocovariance operators of a consistency result from [18] on the empirical covariance operator. Define, for $|h| < n$,
$$\hat{C}_h = \frac{1}{n}\sum_{k=1}^{n-h} X_{k+h} \otimes X_k, \quad h \ge 0, \qquad\text{and}\qquad \hat{C}_h = \hat{C}^*_{-h}, \quad h < 0.$$
Lemma 4. Assume that $(X_t : t \in \mathbb{Z})$ is an $L^4$-$m$-approximable series. Then, for all $|h| < n$, $E\|\hat{C}_h - C_h\|_{\mathcal{S}} \le U\sqrt{(|h| \vee 1)/n}$, where the constant $U$ depends neither on $n$ nor on $h$.

Proof. See supplementary document.
Proof of Proposition 5. By the triangle inequality,
$$2\pi\|\hat{\mathcal{F}}^X_\theta - \mathcal{F}^X_\theta\|_{\mathcal{S}} = \Big\|\sum_{h \in \mathbb{Z}} C_h e^{-ih\theta} - \sum_{h=-q}^{q} \Big(1 - \frac{|h|}{q}\Big)\hat{C}_h e^{-ih\theta}\Big\|_{\mathcal{S}}$$
$$\le \Big\|\sum_{h=-q}^{q} \Big(1 - \frac{|h|}{q}\Big)(\hat{C}_h - C_h)e^{-ih\theta}\Big\|_{\mathcal{S}} + \Big\|\frac{1}{q}\sum_{h=-q}^{q} |h|\, C_h e^{-ih\theta}\Big\|_{\mathcal{S}} + \Big\|\sum_{|h|>q} C_h e^{-ih\theta}\Big\|_{\mathcal{S}}$$
$$\le \sum_{h=-q}^{q} \Big(1 - \frac{|h|}{q}\Big)\|\hat{C}_h - C_h\|_{\mathcal{S}} + \frac{1}{q}\sum_{h=-q}^{q} |h|\,\|C_h\|_{\mathcal{S}} + \sum_{|h|>q} \|C_h\|_{\mathcal{S}}.$$
The last two terms tend to 0 by condition (3.5) and Kronecker's lemma. For the first term we may use Lemma 4. Taking expectations, we obtain that, for some constant $U_1$,
$$\sum_{h=-q}^{q} \Big(1 - \frac{|h|}{q}\Big) E\|\hat{C}_h - C_h\|_{\mathcal{S}} \le U_1 \frac{q^{3/2}}{\sqrt{n}}.$$
Note that the bound does not depend on $\theta$; hence $q^3 = o(n)$ and condition (3.5) jointly imply that $\sup_{\theta \in [-\pi,\pi]} E\|\hat{\mathcal{F}}^X_\theta - \mathcal{F}^X_\theta\|_{\mathcal{S}} \to 0$ as $n \to \infty$.
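For illustration, here is a minimal numerical sketch of the estimator analysed in Proposition 5, not code from the thesis: the empirical lag-$h$ autocovariances are combined with Bartlett weights $1 - |h|/q$. The discretization of $H$ to $\mathbb{R}^d$, the white-noise input, and all function names are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, q = 500, 3, 8
X = rng.standard_normal((n, d))
X = X - X.mean(axis=0)                        # center the discretized series

def emp_autocov(X, h):
    """Empirical lag-h autocovariance (1/n) sum_k X_{k+h} X_k'; C_{-h} = C_h'."""
    if h < 0:
        return emp_autocov(X, -h).T
    n = X.shape[0]
    return X[h:].T @ X[:n - h] / n

def bartlett_spectral_density(X, theta, q):
    """Estimator (1/2pi) sum_{|h|<q} (1 - |h|/q) C_h e^{-ih theta}."""
    d = X.shape[1]
    F = np.zeros((d, d), dtype=complex)
    for h in range(-q + 1, q):
        F += (1 - abs(h) / q) * emp_autocov(X, h) * np.exp(-1j * h * theta)
    return F / (2 * np.pi)

F_hat = bartlett_spectral_density(X, 0.7, q)
```

By construction the estimate is Hermitian at every frequency; for white-noise input it should be close to $\hat{C}_0/2\pi$ uniformly in $\theta$.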
C Technical results and background

C.1 Linear operators

Consider the class $\mathcal{L}(H,H')$ of bounded linear operators between two Hilbert spaces $H$ and $H'$. For $\Psi \in \mathcal{L}(H,H')$, the operator norm is defined as $\|\Psi\|_{\mathcal{L}} := \sup_{\|x\| \le 1} \|\Psi(x)\|$. The simplest operators can be defined via a tensor product $v \otimes w$, where $v \otimes w(z) := v\langle z, w\rangle$. Every operator $\Psi \in \mathcal{L}(H,H')$ possesses an adjoint $\Psi^* \in \mathcal{L}(H',H)$, which satisfies $\langle \Psi(x), y\rangle = \langle x, \Psi^*(y)\rangle$ for all $x \in H$ and $y \in H'$. It holds that $\|\Psi^*\|_{\mathcal{L}} = \|\Psi\|_{\mathcal{L}}$. If $H = H'$, then $\Psi$ is called self-adjoint if $\Psi = \Psi^*$, and non-negative definite if $\langle \Psi x, x\rangle \ge 0$ for all $x \in H$.

A linear operator $\Psi \in \mathcal{L}(H,H')$ is said to be Hilbert–Schmidt if, for some orthonormal basis $(v_k : k \ge 1)$ of $H$, we have $\|\Psi\|^2_{\mathcal{S}} := \sum_{k \ge 1} \|\Psi(v_k)\|^2 < \infty$. Then $\|\Psi\|_{\mathcal{S}}$ defines a norm, the so-called Hilbert–Schmidt norm of $\Psi$, which bounds the operator norm, $\|\Psi\|_{\mathcal{L}} \le \|\Psi\|_{\mathcal{S}}$, and can be shown to be independent of the choice of the orthonormal basis. Every Hilbert–Schmidt operator is compact. The class of Hilbert–Schmidt operators between $H$ and $H'$ again defines a separable Hilbert space, with inner product $\langle \Psi, \Theta\rangle_{\mathcal{S}} := \sum_{k \ge 1} \langle \Psi(v_k), \Theta(v_k)\rangle$; denote this class by $\mathcal{S}(H,H')$.

If $\Psi \in \mathcal{L}(H,H')$ and $\Upsilon \in \mathcal{L}(H'',H)$, then $\Psi\Upsilon$ is the operator mapping $x \in H''$ to $\Psi(\Upsilon(x)) \in H'$. Assume that $\Psi$ is a compact operator in $\mathcal{L}(H,H')$ and let $(s^2_j)$ be the eigenvalues of $\Psi^*\Psi$. Then $\Psi$ is said to be trace-class if $\|\Psi\|_{\mathcal{T}} := \sum_{j \ge 1} s_j < \infty$. In this case, $\|\Psi\|_{\mathcal{T}}$ defines a norm, the so-called Schatten 1-norm. We have $\|\Psi\|_{\mathcal{S}} \le \|\Psi\|_{\mathcal{T}}$, and hence any trace-class operator is Hilbert–Schmidt. For self-adjoint non-negative operators, it holds that $\|\Psi\|_{\mathcal{T}} = \mathrm{tr}(\Psi) := \sum_{k \ge 1} \langle \Psi(v_k), v_k\rangle$. If $\Psi^{1/2}\Psi^{1/2} = \Psi$, then $\mathrm{tr}(\Psi) = \|\Psi^{1/2}\|^2_{\mathcal{S}}$.

For further background on the theory of linear operators we refer to [13].
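In finite dimensions these notions reduce to familiar matrix quantities, which gives a quick sanity check of the inequalities above. A sketch with arbitrary test matrices, not from the thesis: the Hilbert–Schmidt norm is the Frobenius norm, the Schatten 1-norm is the sum of singular values, and for a non-negative definite matrix the trace equals the squared Hilbert–Schmidt norm of its square root.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))

op_norm = np.linalg.norm(A, 2)                        # operator norm: largest singular value
hs_norm = np.linalg.norm(A, 'fro')                    # Hilbert-Schmidt (Frobenius) norm
tr_norm = np.linalg.svd(A, compute_uv=False).sum()    # Schatten 1-norm: sum of singular values

P = A @ A.T                                           # self-adjoint, non-negative definite
w, V = np.linalg.eigh(P)
P_root = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T   # P^{1/2} with P^{1/2} P^{1/2} = P
```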
C.2 Random sequences in Hilbert spaces

All random elements that appear in the sequel are assumed to be defined on a common probability space $(\Omega, \mathcal{A}, P)$. We write $X \in L^p_H(\Omega, \mathcal{A}, P)$ (in short, $X \in L^p_H$) if $X$ is an $H$-valued random variable such that $E\|X\|^p < \infty$. Every element $X \in L^1_H$ possesses an expectation, which is the unique $\mu \in H$ satisfying $E\langle X, y\rangle = \langle \mu, y\rangle$ for all $y \in H$. Provided that $X$ and $Y$ are in $L^2_H$, we can define the cross-covariance operator as $C_{XY} := E(X - \mu_X) \otimes (Y - \mu_Y)$, where $\mu_X$ and $\mu_Y$ are the expectations of $X$ and $Y$, respectively. We have $\|C_{XY}\|_{\mathcal{T}} \le E\|(X - \mu_X) \otimes (Y - \mu_Y)\|_{\mathcal{T}} = E\|X - \mu_X\|\,\|Y - \mu_Y\|$, and so these operators are trace-class. An important specific role is played by the covariance operator $C_{XX}$. This operator is non-negative definite and self-adjoint, with $\mathrm{tr}(C_{XX}) = E\|X - \mu_X\|^2$. An $H$-valued process $(X_t)$ is called (weakly) stationary if $(X_t) \subset L^2_H$ and $EX_t$ and $C_{X_{t+h}X_t}$ do not depend on $t$. In this case we write $C^X_h$, or shortly $C_h$, for $C_{X_{t+h}X_t}$ when it is clear to which process it belongs.
Many useful results on random processes in Hilbert spaces or more general Banach spaces are
collected in Chapters 1 and 2 of [8].
C.3 Proofs for Appendix A

Proof of Proposition 6. Letting $0 < m < n$, note that
$$\|S_n - S_m\|^2_2 = \Big\langle \sum_{m < |k| \le n} f_k e_k,\; \sum_{m < |\ell| \le n} f_\ell e_\ell \Big\rangle = \frac{1}{2\pi}\int_{-\pi}^{\pi} \sum_{m < |k| \le n} \sum_{m < |\ell| \le n} \langle f_k, f_\ell\rangle\, e^{i(k-\ell)\theta}\, d\theta = \sum_{m < |k| \le n} \|f_k\|^2.$$
To prove the first statement, we need to show that $(S_n)$ defines a Cauchy sequence in $L^2_H([-\pi,\pi])$, which follows if we show that $\sum_{k \in \mathbb{Z}} \|f_k\|^2 < \infty$. We use the fact that, for any $v \in H$, the function $\langle x(\theta), v\rangle$ belongs to $L^2([-\pi,\pi])$. Then, by Parseval's identity and (A.1), we have, for any $v \in H$,
$$\frac{1}{2\pi}\int_{-\pi}^{\pi} |\langle x(\theta), v\rangle|^2\, d\theta = \sum_{k \in \mathbb{Z}} \Big|\frac{1}{2\pi}\int_{-\pi}^{\pi} \langle x(s), v\rangle\, e^{-iks}\, ds\Big|^2 = \sum_{k \in \mathbb{Z}} |\langle f_k, v\rangle|^2.$$
Let $(v_k : k \ge 1)$ be an orthonormal basis of $H$. Then, by the last result and Parseval's identity again, it follows that
$$\|x\|^2_2 = \frac{1}{2\pi}\int_{-\pi}^{\pi} \sum_{\ell \ge 1} |\langle x(\theta), v_\ell\rangle|^2\, d\theta = \frac{1}{2\pi}\sum_{\ell \ge 1} \int_{-\pi}^{\pi} |\langle x(\theta), v_\ell\rangle|^2\, d\theta = \sum_{\ell \ge 1} \sum_{k \in \mathbb{Z}} |\langle f_k, v_\ell\rangle|^2 = \sum_{k \in \mathbb{Z}} \|f_k\|^2.$$
As for the second statement, we conclude from classical Fourier analysis that, for each $v \in H$,
$$\lim_{n\to\infty} \frac{1}{2\pi}\int_{-\pi}^{\pi} \Big|\langle x(\theta), v\rangle - \sum_{k=-n}^{n} \Big(\frac{1}{2\pi}\int_{-\pi}^{\pi} \langle x(s), v\rangle\, e^{-iks}\, ds\Big) e^{ik\theta}\Big|^2\, d\theta = 0.$$
By the definition of $S_n$, this is equivalent to
$$\lim_{n\to\infty} \frac{1}{2\pi}\int_{-\pi}^{\pi} |\langle x(\theta) - S_n(\theta), v\rangle|^2\, d\theta = 0, \quad \forall v \in H.$$
Combined with the first statement of the proposition and
$$\int_{-\pi}^{\pi} |\langle x(\theta) - S(\theta), v\rangle|^2\, d\theta \le 2\int_{-\pi}^{\pi} |\langle x(\theta) - S_n(\theta), v\rangle|^2\, d\theta + 2\|v\|^2 \int_{-\pi}^{\pi} \|S_n(\theta) - S(\theta)\|^2\, d\theta,$$
this implies that
$$\frac{1}{2\pi}\int_{-\pi}^{\pi} |\langle x(\theta) - S(\theta), v\rangle|^2\, d\theta = 0, \quad \forall v \in H. \tag{C.1}$$
Let $(v_i)$, $i \in \mathbb{N}$, be an orthonormal basis of $H$, and define
$$A_i := \{\theta \in [-\pi,\pi] : \langle x(\theta) - S(\theta), v_i\rangle \neq 0\}.$$
By (C.1), we have $\lambda(A_i) = 0$ ($\lambda$ denotes the Lebesgue measure), and hence $\lambda(A) = 0$ for $A = \bigcup_{i \ge 1} A_i$. Consequently, since the $(v_i)$ form an orthonormal basis, for any $\theta \in [-\pi,\pi] \setminus A$ we have $\langle x(\theta) - S(\theta), v\rangle = 0$ for all $v \in H$, which in turn implies that $x(\theta) - S(\theta) = 0$.
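The Parseval identity used in the proof, $\frac{1}{2\pi}\int_{-\pi}^{\pi} \|x(\theta)\|^2\, d\theta = \sum_k \|f_k\|^2$, can be checked numerically for an $H$-valued trigonometric polynomial. A sketch under our own assumptions: $H$ is discretized to $\mathbb{R}^d$ and the coefficients $f_k$ are arbitrary test vectors.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3
fk = {k: rng.standard_normal(d) for k in range(-2, 3)}    # Fourier coefficients f_k

thetas = np.linspace(-np.pi, np.pi, 1024, endpoint=False)
dtheta = thetas[1] - thetas[0]

# x(theta) = sum_k f_k e^{ik theta}, evaluated on the frequency grid
x_vals = np.array([sum(f * np.exp(1j * k * t) for k, f in fk.items()) for t in thetas])

lhs = (np.linalg.norm(x_vals, axis=1) ** 2).sum() * dtheta / (2 * np.pi)
rhs = sum(np.linalg.norm(f) ** 2 for f in fk.values())
```

Because $x$ is a trigonometric polynomial, the Riemann sum over the equispaced grid reproduces the integral up to floating-point error.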
Proof of Proposition 7. Without loss of generality, we assume that $EX_0 = 0$. Since $X_0$ and $X^{(h)}_h$, $h \ge 1$, are independent,
$$\|C^X_h\|_{\mathcal{S}} = \big\|E X_0 \otimes (X_h - X^{(h)}_h)\big\|_{\mathcal{S}} \le (E\|X_0\|^2)^{1/2}\, \big(E\|X_h - X^{(h)}_h\|^2\big)^{1/2}.$$
The first statement of the proposition follows.

Let $\theta$ be fixed. Since $\mathcal{F}^X_\theta$ is non-negative and self-adjoint, it is trace-class if and only if
$$\mathrm{tr}(\mathcal{F}^X_\theta) = \sum_{m \ge 1} \langle \mathcal{F}^X_\theta(v_m), v_m\rangle < \infty \tag{C.2}$$
for some orthonormal basis $(v_m)$ of $H$. The trace can be shown to be independent of the choice of the basis. Define $V_{n,\theta} = (2\pi n)^{-1/2} \sum_{k=1}^{n} X_k e^{ik\theta}$ and note that, by stationarity,
$$\mathcal{F}^X_{n,\theta} := E\, V_{n,\theta} \otimes V_{n,\theta} = \frac{1}{2\pi} \sum_{|h|<n} \Big(1 - \frac{|h|}{n}\Big)\, E X_0 \otimes X_{-h}\, e^{-ih\theta}.$$
It is easily verified that the operators $\mathcal{F}^X_{n,\theta}$ again are non-negative and self-adjoint. Also note that, by the triangle inequality,
$$\|\mathcal{F}^X_{n,\theta} - \mathcal{F}^X_\theta\|_{\mathcal{S}} \le \sum_{|h|<n} \frac{|h|}{n}\, \|C^X_h\|_{\mathcal{S}} + \sum_{|h|\ge n} \|C^X_h\|_{\mathcal{S}}.$$
By an application of (3.5) and Kronecker's lemma, it easily follows that the latter two terms converge to zero. This implies that $\mathcal{F}^X_{n,\theta}(v)$ converges in norm to $\mathcal{F}^X_\theta(v)$, for any $v \in H$.
Choose $v_m = \varphi_m(\theta)$. Then, by continuity of the inner product and the monotone convergence theorem, we have
$$\sum_{m \ge 1} \langle \mathcal{F}^X_\theta(\varphi_m(\theta)), \varphi_m(\theta)\rangle = \sum_{m \ge 1} \lim_{n\to\infty} \langle \mathcal{F}^X_{n,\theta}(\varphi_m(\theta)), \varphi_m(\theta)\rangle = \lim_{n\to\infty} \sum_{m \ge 1} \langle \mathcal{F}^X_{n,\theta}(\varphi_m(\theta)), \varphi_m(\theta)\rangle.$$
Using the fact that the $\mathcal{F}^X_{n,\theta}$'s are self-adjoint and non-negative, we get
$$\sum_{m \ge 1} \langle \mathcal{F}^X_{n,\theta}(\varphi_m(\theta)), \varphi_m(\theta)\rangle = \mathrm{tr}(\mathcal{F}^X_{n,\theta}) = E\|V_{n,\theta}\|^2 = \frac{1}{2\pi}\sum_{|h|<n} \Big(1 - \frac{|h|}{n}\Big)\, E\langle X_0, X_h\rangle\, e^{-ih\theta}.$$
Since $|E\langle X_0, X_h\rangle| = |E\langle X_0, X_h - X^{(h)}_h\rangle|$, the Cauchy–Schwarz inequality gives
$$\sum_{h \in \mathbb{Z}} |E\langle X_0, X_h\rangle| \le \sum_{h \in \mathbb{Z}} (E\|X_0\|^2)^{1/2}\, \big(E\|X_h - X^{(h)}_h\|^2\big)^{1/2} < \infty,$$
and thus the dominated convergence theorem implies that
$$\mathrm{tr}(\mathcal{F}^X_\theta) = \frac{1}{2\pi}\sum_{h \in \mathbb{Z}} E\langle X_0, X_h\rangle\, e^{-ih\theta} \le \sum_{h \in \mathbb{Z}} |E\langle X_0, X_h\rangle| < \infty, \tag{C.3}$$
which completes the proof.
Proof of Proposition 8. We have (see e.g. [13], p. 186) that the dynamic eigenvalues satisfy $|\lambda_m(\theta) - \lambda_m(\theta')| \le \|\mathcal{F}^X_\theta - \mathcal{F}^X_{\theta'}\|_{\mathcal{S}}$. Now,
$$\|\mathcal{F}^X_\theta - \mathcal{F}^X_{\theta'}\|_{\mathcal{S}} \le \sum_{h \in \mathbb{Z}} \|C^X_h\|_{\mathcal{S}}\, |e^{-ih\theta} - e^{-ih\theta'}|.$$
The summability condition (3.5) implies continuity, hence part (a) of the proposition. The fact that $|e^{-ih\theta} - e^{-ih\theta'}| \le |h|\,|\theta - \theta'|$ yields part (b). To prove (c), observe that
$$\lambda_m(\theta)\varphi_m(\theta) = \mathcal{F}^X_\theta(\varphi_m(\theta)) = \frac{1}{2\pi}\sum_{h \in \mathbb{Z}} E X_h \langle \varphi_m(\theta), X_0\rangle\, e^{-ih\theta}$$
for any $\theta \in [-\pi,\pi]$. Since the eigenvalues $\lambda_m(\theta)$ are real, we obtain, by computing the complex conjugate of the above equalities,
$$\lambda_m(\theta)\,\overline{\varphi_m(\theta)} = \frac{1}{2\pi}\sum_{h \in \mathbb{Z}} E X_h \langle \overline{\varphi_m(\theta)}, X_0\rangle\, e^{ih\theta} = \mathcal{F}^X_{-\theta}\big(\overline{\varphi_m(\theta)}\big).$$
This shows that $\lambda_m(\theta)$ and $\overline{\varphi_m(\theta)}$ are an eigenvalue and eigenfunction of $\mathcal{F}^X_{-\theta}$, and they must correspond to a pair $(\lambda_n(-\theta), \varphi_n(-\theta))$; (c) follows.
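Part (c) can be illustrated numerically: for a spectral density built from real autocovariances, $\mathcal{F}^X_{-\theta}$ is the entrywise conjugate of $\mathcal{F}^X_\theta$, so the eigenvalues at $\theta$ and $-\theta$ coincide. A toy finite-dimensional sketch with made-up covariance matrices, not from the thesis:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 3
C0 = rng.standard_normal((d, d)); C0 = C0 @ C0.T      # C_0: symmetric, PSD
C1 = 0.1 * rng.standard_normal((d, d))                # C_1 arbitrary real; C_{-1} = C_1'

def F(theta):
    """Toy spectral density (1/2pi)(C_0 + C_1 e^{-i theta} + C_1' e^{i theta})."""
    return (C0 + C1 * np.exp(-1j * theta) + C1.T * np.exp(1j * theta)) / (2 * np.pi)

theta = 0.7
lam_plus = np.sort(np.linalg.eigvalsh(F(theta)))      # eigenvalues at +theta
lam_minus = np.sort(np.linalg.eigvalsh(F(-theta)))    # eigenvalues at -theta
```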
Lemma 5. Let $(Z_t)$ be a stationary sequence in $L^2_H$ with spectral density operator $\mathcal{F}^Z_\theta$. Then
$$\int_{-\pi}^{\pi} \mathrm{tr}(\mathcal{F}^Z_\theta)\, d\theta = \mathrm{tr}\Big(\int_{-\pi}^{\pi} \mathcal{F}^Z_\theta\, d\theta\Big) = \mathrm{tr}(C^Z_0) = E\|Z_t\|^2.$$
Proof. Let $\mathcal{S} = \mathcal{S}(H,H)$. Note that $\int_{-\pi}^{\pi} \mathcal{F}^Z_\theta\, d\theta = I_{\mathcal{F}^Z}$ if and only if
$$\langle I_{\mathcal{F}^Z}, V\rangle_{\mathcal{S}} = \int_{-\pi}^{\pi} \langle \mathcal{F}^Z_\theta, V\rangle_{\mathcal{S}}\, d\theta \quad\text{for all } V \in \mathcal{S}. \tag{C.4}$$
For some orthonormal basis $(v_k)$, define $V_N = \sum_{k=1}^{N} v_k \otimes v_k$. Then (C.4) implies that
$$\mathrm{tr}(I_{\mathcal{F}^Z}) = \lim_{N\to\infty} \sum_{k=1}^{N} \langle I_{\mathcal{F}^Z}(v_k), v_k\rangle = \lim_{N\to\infty} \langle I_{\mathcal{F}^Z}, V_N\rangle_{\mathcal{S}} = \lim_{N\to\infty} \int_{-\pi}^{\pi} \langle \mathcal{F}^Z_\theta, V_N\rangle_{\mathcal{S}}\, d\theta = \lim_{N\to\infty} \int_{-\pi}^{\pi} \sum_{k=1}^{N} \langle \mathcal{F}^Z_\theta(v_k), v_k\rangle\, d\theta.$$
Since $\mathcal{F}^Z_\theta$ is non-negative definite for every $\theta$, the monotone convergence theorem allows us to interchange the limit and the integral.
Proof of Proposition 9. (i) Define $Y^{r,s}_t := \sum_{r<|k|\le s} \Psi_k(X_{t-k})$ and the related transfer operator $\Psi^{r,s}_\theta := \sum_{r<|k|\le s} \Psi_k e^{-ik\theta}$. We also use $Y^s_t = \Psi_0(X_t) + Y^{0,s}_t$ and $\Psi^s_\theta = \Psi_0 + \Psi^{0,s}_\theta$. Since $Y^{r,s}_t$ is a finite sum, it is obviously in $L^2_{H'}$. Also, the finite number of filter coefficients makes it easy to check that $(Y^{r,s}_t : t \in \mathbb{Z})$ is stationary and has spectral density operator $\mathcal{F}^{Y^{r,s}}_\theta = \Psi^{r,s}_\theta \mathcal{F}^X_\theta (\Psi^{r,s}_\theta)^*$. By the previous lemma we have
$$E\|Y^{r,s}_t\|^2 = \int_{-\pi}^{\pi} \mathrm{tr}\big(\mathcal{F}^{Y^{r,s}}_\theta\big)\, d\theta = \int_{-\pi}^{\pi} \mathrm{tr}\big(\Psi^{r,s}_\theta \mathcal{F}^X_\theta (\Psi^{r,s}_\theta)^*\big)\, d\theta \le \int_{-\pi}^{\pi} \|\Psi^{r,s}_\theta\|^2_{\mathcal{S}(H,H')}\, \mathrm{tr}(\mathcal{F}^X_\theta)\, d\theta.$$
Now, it follows directly from the assumptions that $(Y^s_t : s \ge 1)$ defines a Cauchy sequence in $L^2_{H'}$. This proves (i).

Next, remark that, by our assumptions, $\Psi_\theta \mathcal{F}^X_\theta (\Psi_\theta)^* \in L^2_{\mathcal{S}(H',H')}([-\pi,\pi])$. Hence, by the results in Appendix A.1,
$$\sum_{|h|\le r} \frac{1}{2\pi} \int_{-\pi}^{\pi} \Psi_u \mathcal{F}^X_u (\Psi_u)^*\, e^{ihu}\, du\;\, e^{-ih\theta} \;\to\; \Psi_\theta \mathcal{F}^X_\theta (\Psi_\theta)^* \quad\text{as } r \to \infty,$$
where convergence is in $L^2_{\mathcal{S}(H',H')}([-\pi,\pi])$. We prove that $\Psi_\theta \mathcal{F}^X_\theta (\Psi_\theta)^*$ is the spectral density operator of $(Y_t)$. This is the case if $\frac{1}{2\pi}\int_{-\pi}^{\pi} \Psi_u \mathcal{F}^X_u (\Psi_u)^*\, e^{ihu}\, du = C^Y_h$. For the approximating sequences $(Y^s_t : t \in \mathbb{Z})$ we know from (i) and Remark 3 that
$$\frac{1}{2\pi}\int_{-\pi}^{\pi} \Psi^s_u \mathcal{F}^X_u (\Psi^s_u)^*\, e^{ihu}\, du = C^{Y^s}_h.$$
Routine arguments show that, under our assumptions, $\|C^{Y^s}_h - C^Y_h\|_{\mathcal{S}(H',H')} \to 0$ and
$$\Big\|\int_{-\pi}^{\pi} \big(\Psi_u \mathcal{F}^X_u (\Psi_u)^* - \Psi^s_u \mathcal{F}^X_u (\Psi^s_u)^*\big)\, e^{ihu}\, du\Big\|_{\mathcal{S}(H',H')} \to 0 \quad (s \to \infty).$$
Part (ii) of the proposition follows, hence also part (iii).
Proof of Lemma 4. Let us only consider the case $h \ge 0$. Define $X^{(r)}_n$ as the $r$-dependent approximation of $X_n$ provided by Definition 4. Observe that
$$nE\big\|\hat{C}_h - C_h\big\|^2_{\mathcal{S}} = nE\Big\|\frac{1}{n}\sum_{k=1}^{n-h} Z_k\Big\|^2_{\mathcal{S}},$$
where $Z_k = X_{k+h} \otimes X_k - C_h$. Set $Z^{(r)}_k = X^{(r)}_{k+h} \otimes X^{(r)}_k - C_h$. Stationarity of $(Z_k)$ implies
$$nE\Big\|\frac{1}{n}\sum_{k=1}^{n-h} Z_k\Big\|^2_{\mathcal{S}} = \sum_{|r|<n-h} \Big(1 - \frac{|r|}{n}\Big)\, E\langle Z_0, Z_r\rangle_{\mathcal{S}} \le \sum_{r=-h}^{h} |E\langle Z_0, Z_r\rangle_{\mathcal{S}}| + 2\sum_{r=h+1}^{\infty} |E\langle Z_0, Z_r\rangle_{\mathcal{S}}|, \tag{C.5}$$
while the Cauchy–Schwarz inequality yields
$$|E\langle Z_0, Z_r\rangle_{\mathcal{S}}| \le E|\langle Z_0, Z_r\rangle_{\mathcal{S}}| \le \sqrt{E\|Z_0\|^2_{\mathcal{S}}\, E\|Z_r\|^2_{\mathcal{S}}} = E\|Z_0\|^2_{\mathcal{S}}.$$
Furthermore, since $\|X_h \otimes X_0\|_{\mathcal{S}} = \|X_h\|\,\|X_0\|$, we deduce
$$E\|Z_0\|^2_{\mathcal{S}} \le E\|X_0\|^2\|X_h\|^2 \le (E\|X_0\|^4)^{1/2}(E\|X_h\|^4)^{1/2} = E\|X_0\|^4 < \infty.$$
Consequently, we can bound the first sum in (C.5) by $(2h+1)\, E\|X_0\|^4$. For the second sum in (C.5), we obtain, by independence of $Z^{(r-h)}_r$ and $Z_0$, that
$$|E\langle Z_0, Z_r\rangle_{\mathcal{S}}| = |E\langle Z_0, Z_r - Z^{(r-h)}_r\rangle_{\mathcal{S}}| \le (E\|Z_0\|^2_{\mathcal{S}})^{1/2}\, \big(E\|Z_r - Z^{(r-h)}_r\|^2_{\mathcal{S}}\big)^{1/2}.$$
To conclude, it suffices to show that $\sum_{r=h+1}^{\infty} \big(E\|Z_r - Z^{(r-h)}_r\|^2_{\mathcal{S}}\big)^{1/2} \le M < \infty$, where the bound $M$ is independent of $h$. Using an inequality of the type $\|ab - cd\|^2 \le 2\|a\|^2\|b - d\|^2 + 2\|d\|^2\|a - c\|^2$, we obtain
$$E\|Z_r - Z^{(r-h)}_r\|^2_{\mathcal{S}} = E\big\|X_r \otimes X_{r+h} - X^{(r-h)}_r \otimes X^{(r-h)}_{r+h}\big\|^2_{\mathcal{S}} \le 2E\|X_r\|^2\|X_{r+h} - X^{(r-h)}_{r+h}\|^2 + 2E\|X^{(r-h)}_{r+h}\|^2\|X_r - X^{(r-h)}_r\|^2$$
$$\le 2(E\|X_r\|^4)^{1/2}\big(E\|X_{r+h} - X^{(r-h)}_{r+h}\|^4\big)^{1/2} + 2(E\|X^{(r-h)}_{r+h}\|^4)^{1/2}\big(E\|X_r - X^{(r-h)}_r\|^4\big)^{1/2}.$$
Note that $E\|X_r\|^4 = E\|X^{(r-h)}_{r+h}\|^4 = E\|X_0\|^4$ and
$$E\|X_{r+h} - X^{(r-h)}_{r+h}\|^4 = E\|X_r - X^{(r-h)}_r\|^4 = E\|X_0 - X^{(r-h)}_0\|^4.$$
Altogether we get
$$E\|Z_r - Z^{(r-h)}_r\|^2_{\mathcal{S}} \le 4(E\|X_0\|^4)^{1/2}\big(E\|X_0 - X^{(r-h)}_0\|^4\big)^{1/2}.$$
Hence, $L^4$-$m$-approximability implies that $\sum_{r=h+1}^{\infty} |E\langle Z_0, Z_r\rangle_{\mathcal{S}}|$ converges and is uniformly bounded over $0 \le h < n$.
Acknowledgement

The research of Siegfried Hörmann and Łukasz Kidziński was supported by the Communauté française de Belgique – Actions de Recherche Concertées (2010–2015) and the Belgian Science Policy Office – Interuniversity Attraction Poles (2012–2017). The research of Marc Hallin was supported by the Sonderforschungsbereich "Statistical modeling of nonlinear dynamic processes" (SFB 823) of the Deutsche Forschungsgemeinschaft and the Belgian Science Policy Office – Interuniversity Attraction Poles (2012–2017).
Bibliography

[1] Aue, A., Dubart Norinho, D. and Hörmann, S. (2014), On the prediction of functional time series, J. Amer. Statist. Assoc. (forthcoming).

[2] Aston, J. A. D. and Kirch, C. (2011), Estimation of the distribution of change-points with application to fMRI data, Technical Report, University of Warwick, Centre for Research in Statistical Methodology.
[3] Benko, M., Härdle, W. and Kneip, A. (2009), Common functional principal components, The Annals of Statistics 37, 1–34.

[4] Berkes, I., Gabrys, R., Horváth, L. and Kokoszka, P. (2009), Detecting changes in the mean of functional observations, J. Roy. Statist. Soc. Ser. B 71, 927–946.
[5] Besse, P. and Ramsay, J. O. (1986), Principal components analysis of sampled functions,
Psychometrika 51, 285–311.
[6] Brillinger, D. R. (1981), Time Series: Data Analysis and Theory, Holden Day, San Fran-
cisco.
[7] Brockwell, P. J. and Davis, R. A. (1981), Time Series: Theory and Methods, Springer,
New York.
[8] Bosq, D. (2000), Linear Processes in Function Spaces, Springer, New York.
[9] Cardot, H., Ferraty, F. and Sarda, P. (1999), Functional linear model, Statist.& Probab.
Lett. 45, 11–22.
[10] Dauxois, J., Pousse, A. and Romain, Y. (1982), Asymptotic theory for the principal
component analysis of a vector random function: Some applications to statistical inference, J.
Multivariate Anal. 12, 136–154.
[11] Ferraty, F. and Vieu, P. (2006), Nonparametric Functional Data Analysis, Springer, New
York.
[12] Gervini, D. (2007), Robust functional estimation using the median and spherical principal
components, Biometrika 95, 587–600.
[13] Gohberg, I., Goldberg, S. and Kaashoek, M. A. (2003), Basic Classes of Linear Operators, Birkhäuser.
[14] Gokulakrishnan, P., Lawrence, P. D., McLellan, P. J. and Grandmaison, E. W.
(2006), A functional-PCA approach for analyzing and reducing complex chemical mechanisms,
Computers and Chemical Engineering 30, 1093–1101.
[15] Hall, P. and Hosseini-Nasab, M. (2006), On properties of functional principal components
analysis, J. Roy. Statist. Soc. Ser. B 68, 109–126.
[16] Hall, P., Müller, H.-G. and Wang, J.-L. (2006), Properties of principal component methods for functional and longitudinal data analysis, The Annals of Statistics 34, 1493–1517.
[17] Hörmann, S. and Kidziński, L. (2012), A note on estimation in Hilbertian linear models, Scand. J. Stat. (forthcoming).

[18] Hörmann, S. and Kokoszka, P. (2010), Weakly dependent functional data, The Annals of Statistics 38, 1845–1884.

[19] Hörmann, S. and Kokoszka, P. (2012), Functional time series, in Handbook of Statistics: Time Series Analysis—Methods and Applications, 157–186.
[20] Hyndman, R. J. and Ullah, M. S. (2007), Robust forecasting of mortality and fertility
rates: a functional data approach, Computational Statistics & Data Analysis 51, 4942–4956.
[21] James, G. M., Hastie, T. J. and Sugar, C. A. (2000), Principal component models for sparse functional data, Biometrika 87, 587–602.

[22] Jolliffe, I. T. (2005), Principal Component Analysis, Wiley & Sons.
[23] Karhunen, K. (1947), Über lineare Methoden in der Wahrscheinlichkeitsrechnung, Ann. Acad. Sci. Fennicae Ser. A. I. Math.-Phys. 37, 79.
[24] Kneip, A. and Utikal, K. (2001), Inference for density families using functional principal
components analysis, J. Amer. Statist. Assoc. 96, 519–531.
[25] Locantore, N., Marron, J. S., Simpson, D. G., Tripoli, N., Zhang, J. T. and Cohen,
K. L. (1999), Robust principal component analysis for functional data, Test 8, 1–73.
[26] Loève, M. (1946), Fonctions aléatoires de second ordre, Revue Sci. 84, 195–206.
[27] Panaretos, V. M. and Tavakoli, S. (2013), Fourier analysis of stationary time series in function space, The Annals of Statistics 41, 568–603.

[28] Panaretos, V. M. and Tavakoli, S. (2013), Cramér–Karhunen–Loève representation and harmonic principal component analysis of functional time series, Stoch. Proc. Appl. 123, 2779–2807.
[29] Politis, D. N. (2011), Higher-order accurate, positive semi-definite estimation of large-sample
covariance and spectral density matrices, Econometric Theory 27, 703–744.
[30] Ramsay, J. O. and Dalzell, C. J. (1991), Some tools for functional data analysis (with
discussion), J. Roy. Statist. Soc. Ser. B 53, 539–572.
[31] Ramsay, J. and Silverman, B. (2002), Applied Functional Data Analysis, Springer, New
York.
[32] Ramsay, J. and Silverman, B. (2005), Functional Data Analysis (2nd ed.), Springer, New
York.
[33] Reiss, P. T. and Ogden, R. T. (2007), Functional principal component regression and functional partial least squares, J. Amer. Statist. Assoc. 102, 984–996.
[34] Silverman, B. (1996), Smoothed functional principal components analysis by choice of norm,
The Annals of Statistics 24, 1–24.
[35] Shumway, R. and Stoffer, D. (2006), Time Series Analysis and Its Applications (2nd ed.),
Springer, New York.
[36] Stadlober, E., Hörmann, S. and Pfeiler, B. (2008), Quality and performance of a PM10 daily forecasting model, Atmospheric Environment 42, 1098–1109.
[37] Viviani, R., Gron, G. and Spitzer, M. (2005), Functional principal component analysis of
fMRI data, Human Brain Mapping 24, 109–129.