Introduction to FDA and linear models
Nathalie Villa-Vialaneix, nathalie.villa@math.univ-toulouse.fr, http://www.nathalievilla.org
Institut de Mathématiques de Toulouse, IUT de Carcassonne, Université de Perpignan, France
La Havane, September 15th, 2008
Nathalie Villa (IMT & UPVD) Presentation 1 La Havane, Sept. 15th, 2008 1 / 37
Table of contents
1 Motivations
2 Functional Principal Component Analysis
3 Functional linear regression models
4 References
What is Functional Data Analysis (FDA)?
FDA deals with data that are measurements of continuous phenomena on a discrete sampling grid.
Example 1: Regression case 1. Find the fat content of pieces of meat from their absorbance spectra (100 wavelengths).
Example 2: Regression case 2. Find the disease content in wheat from its absorbance spectra (1,049 wavelengths).
Example 3: Classification case 1. Recognize one of five phonemes from its log-periodograms (256 frequencies).
Example 4: Classification case 2. Recognize one of two words from its record in the frequency domain (8,192 time steps).
Example 5: Regression on functional data. [Azaïs et al., 2008] Estimate a typical load curve (electricity consumption) from multivariate economic variables.
Example 6: Curve clustering. [Bensmail et al., 2005] Create a typology of sick cells from their “SELDI mass” spectra.
Specific issues of learning with FDA
High dimensional data: the number of discretization points is often larger, and sometimes much larger, than the number of observations;
Highly correlated data: because of the underlying functional structure, the values at two sampling points are correlated.
Consequences: Direct use of classical statistical methods on the discretization leads to ill-posed problems and provides inaccurate solutions.
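The ill-posedness can be illustrated numerically. The sketch below (synthetic data, not from the talk) shows that when the number p of sampling points exceeds the number n of curves, the p × p empirical covariance matrix is rank-deficient, so the normal equations of an ordinary linear model cannot be inverted:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 200                 # n curves, each discretized on p sampling points
# Cumulated Gaussian noise gives rough "curves" with correlated coordinates
X = rng.standard_normal((n, p)).cumsum(axis=1)

C = X.T @ X / n                # p x p empirical covariance
rank = np.linalg.matrix_rank(C)
print(rank <= n, rank < p)     # True True: C is singular, OLS is ill-posed
```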
Theoretical model
A functional random variable is a random variable X taking its values in a functional space X, where X can be:
an (infinite dimensional) Hilbert space with inner product 〈., .〉X; in particular, L2 is often used;
any (infinite dimensional) Banach space with norm ‖.‖X (less usual); for example, C0.
Hilbertian context
In the Hilbertian context, we are able to define:
the expectation of X as (by the Riesz representation theorem) the unique element E(X) of X such that
∀ u ∈ X, 〈E(X), u〉X = E(〈X, u〉X);
for any u1, u2 ∈ X, the linear operator u1 ⊗ u2:
u1 ⊗ u2 : v ∈ X → 〈u1, v〉X u2 ∈ X;
and, as (X − E(X)) ⊗ (X − E(X)) is an element of the Hilbert space HS(X) of Hilbert-Schmidt operators from X to X (1), the variance of X as the linear operator ΓX:
ΓX = E((X − E(X)) ⊗ (X − E(X))) : u ∈ X → E(〈X − E(X), u〉X (X − E(X))).
(1) This Hilbert space is equipped with the inner product 〈g1, g2〉HS(X) = Σi 〈g1 ei, g2 ei〉X for all g1, g2 ∈ HS(X), where (ei)i is any orthonormal basis of X.
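On a discretization grid, the tensor product u1 ⊗ u2 becomes an ordinary outer-product matrix weighted by the quadrature step. The following sketch (an illustration under the assumption that L2 inner products are approximated by Riemann sums on a regular grid) checks the defining identity (u1 ⊗ u2)v = 〈u1, v〉 u2:

```python
import numpy as np

# Discretize [0, 1] on a regular grid; inner products in L2 are
# approximated by <u, v> ≈ sum(u * v) * dt.
m = 500
t = np.linspace(0.0, 1.0, m)
dt = t[1] - t[0]

u1 = np.sin(2 * np.pi * t)
u2 = np.cos(2 * np.pi * t)
v = t ** 2

# u1 ⊗ u2 acts on v by (u1 ⊗ u2)(v) = <u1, v> u2.
# On the grid this is the outer-product matrix (u2 u1^T) * dt applied to v.
T = np.outer(u2, u1) * dt
lhs = T @ v
rhs = (u1 @ v * dt) * u2
print(np.allclose(lhs, rhs))   # True
```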
Case X = L2([0, 1])
If X = L2([0, 1]), these expressions simplify to:
norm: ‖X‖² = ∫[0,1] (X(t))² dt < +∞;
expectation: for all t ∈ [0, 1], E(X)(t) = E(X(t)) = ∫ X(t) dPX;
variance: for all t, t′ ∈ [0, 1], ΓX ≃ γ(t, t′) = E(X(t)X(t′)) (assuming E(X) = 0 for clarity),
because:
1. for all t ∈ [0, 1], we can define Γ_X^t : u ∈ X → (ΓX u)(t) ∈ R;
2. by Riesz's theorem, there exists ζ^t ∈ X such that ∀ u ∈ X, Γ_X^t u = 〈ζ^t, u〉X. As
(ΓX u)(t) = E(〈X, u〉X X(t)) = 〈E(X(t)X), u〉X,
we have that ζ^t = E(X(t)X);
3. we define γ : (t, t′) ∈ [0, 1]² → ζ^t(t′) = E(X(t)X(t′)). □
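These pointwise formulas suggest direct empirical estimates from sampled curves. The sketch below (synthetic Brownian-motion-like paths, an illustrative assumption) estimates E(X)(t) by the sample mean and γ(t, t′) = E(X(t)X(t′)) by the sample cross-moment, and compares with the known covariance min(t, t′) of Brownian motion:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 2000, 100
t = np.linspace(0.0, 1.0, m)

# Synthetic sample: Brownian-motion-like paths, gamma(t, t') = min(t, t')
X = rng.standard_normal((n, m)) * np.sqrt(1.0 / m)
X = X.cumsum(axis=1)

mean_hat = X.mean(axis=0)            # estimate of E(X)(t)
Xc = X - mean_hat
gamma_hat = Xc.T @ Xc / n            # estimate of gamma(t, t') = E(X(t)X(t'))

gamma_true = np.minimum.outer(t, t)
err = np.abs(gamma_hat - gamma_true).max()
print(err < 0.3)                     # True (sampling error is O(1/sqrt(n)))
```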
Properties of ΓX
ΓX is Hilbert-Schmidt: (by definition) Σi ‖ΓX ei‖² < +∞;
there exists a countable eigensystem ((λi)i≥1, (vi)i≥1) of ΓX: ΓX vi = λi vi (for all i ≥ 1). This eigensystem is such that:
λ1 ≥ λ2 ≥ . . . ≥ 0, and 0 is the only possible accumulation point of (λi)i;
Karhunen-Loève decomposition of ΓX: ΓX = Σi≥1 λi vi ⊗ vi.
ΓX has no inverse in the space of continuous operators from X to X if X has infinite dimension. More precisely, if λi > 0 for all i ≥ 1, then Σi≥1 1/λi = +∞.
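A concrete instance (not from the talk): for the covariance kernel γ(t, t′) = min(t, t′) of Brownian motion, the Karhunen-Loève eigenvalues are known in closed form, λk = 1/((k − 1/2)²π²), so they decay to 0 and Σk 1/λk = π² Σk (k − 1/2)² = +∞. The sketch below discretizes the operator with a midpoint rule and checks the first eigenvalues against the closed form:

```python
import numpy as np

# Discretized covariance operator with kernel gamma(t, t') = min(t, t').
m = 400
t = (np.arange(m) + 0.5) / m             # midpoint grid on [0, 1]
K = np.minimum.outer(t, t)

# (Gamma u)(t_i) ≈ sum_j min(t_i, t_j) u(t_j) / m, a symmetric matrix K/m
evals = np.linalg.eigvalsh(K / m)[::-1]  # eigenvalues in decreasing order

k = np.arange(1, 6)
exact = 1.0 / ((k - 0.5) ** 2 * np.pi ** 2)
print(np.max(np.abs(evals[:5] - exact)) < 1e-3)
```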
Model of the observed data
We focus on:
1. a regression problem: Y ∈ R has to be predicted from X ∈ X,
2. OR a (binary) classification problem: Y ∈ {−1, 1} has to be predicted from X ∈ X.
Learning set - Version I
(x1, y1), . . . , (xn, yn) are i.i.d. realizations of the random pair (X, Y).
Remark: if E(‖X‖²X) < +∞ (i.e., if ΓX exists), a functional version of the Central Limit Theorem holds.
Model of the uncertainty on X
In fact, realizations of X are never observed; only a (possibly noisy) discretization of them is given.
Learning set - Version II
(x1, y1), . . . , (xn, yn) are i.i.d. realizations of the random pair (X, Y) and, for all i = 1, . . . , n, xi^τi = (xi(t))t∈τi is observed, where τi is a finite set.
Questions:
1. How to obtain (xi)i from (xi^τi)i?
2. What are the consequences of this uncertainty on the accuracy of the solution of the regression/classification problem? Can we obtain a solution that is as good as the one obtained from the direct observation of (xi)i?
Noisy data model
Learning set - Version III
(x1, y1), . . . , (xn, yn) are i.i.d. realizations of the random pair (X, Y) and, for all i = 1, . . . , n, xi^τi = (xi(t) + εit)t∈τi is observed, where τi is a finite set and εit is a centered random variable independent of X.
Again:
1. How to obtain (xi)i from (xi^τi)i? (work has been done here: function estimation)
2. What are the consequences of this uncertainty on the accuracy of the solution of the regression/classification problem? Can we obtain a solution that is as good as the one obtained from the direct observation of (xi)i? (related to “errors-in-variables” problems; almost no work in FDA)
In these presentations, we will see work coming from these three points of view.
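The "function estimation" step mentioned in point 1 is typically a least-squares fit on a finite basis. The sketch below is one minimal version of it (Fourier basis, uniform grid, synthetic noise, all illustrative assumptions): it recovers a smooth xi from its noisy discretized observation:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 200
t = np.linspace(0.0, 1.0, m)
x_true = np.sin(2 * np.pi * t) + 0.5 * np.cos(6 * np.pi * t)
x_obs = x_true + rng.normal(scale=0.3, size=m)   # noisy discretization

# Least-squares fit on a small Fourier basis: one classical way to
# estimate x_i from (x_i(t) + eps_it)_{t in tau_i}.
K = 5
B = np.column_stack(
    [np.ones(m)]
    + [np.sin(2 * np.pi * k * t) for k in range(1, K + 1)]
    + [np.cos(2 * np.pi * k * t) for k in range(1, K + 1)]
)
coef, *_ = np.linalg.lstsq(B, x_obs, rcond=None)
x_hat = B @ coef

# The smoothed curve is closer to the true function than the raw data
print(np.mean((x_hat - x_true) ** 2) < np.mean((x_obs - x_true) ** 2))  # True
```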
Table of contents
1 Motivations
2 Functional Principal Component Analysis
3 Functional linear regression models
4 References
Multidimensional PCA: context and notations
Data: A real matrix
X = (x_i^j)i=1,...,n, j=1,...,p,
which is the observation of p variables on n individuals:
n rows, each corresponding to the values of the p variables for an individual: x_i = (x_i^1, . . . , x_i^p)^T;
p columns, each corresponding to the n observations of one variable: x^j = (x_1^j, . . . , x_n^j)^T.
Aim: Find linearly independent variables, that are linear combinations of the original ones, ordered by “importance” in X.
First principal component
Suppose (to simplify):
1. the data are centered: for all j = 1, . . . , p, (1/n) Σ_{i=1}^n x_i^j = 0;
2. the empirical variance of X is Var(X) = (1/n) X^T X.
Problem: Find a* ∈ R^p such that:
a* := arg max_{a : ‖a‖Rp = 1} Var([‖P_a(x_i)‖Rp]_i)   (the inertia).
Solution:
Var([‖P_a(x_i)‖Rp]_i) = (1/n) Σ_{i=1}^n ‖(a^T x_i) a‖²
= (1/n) Σ_{i=1}^n (a^T x_i)² ‖a‖²   (with ‖a‖² = 1)
= (1/n) Σ_{i=1}^n (a^T x_i)(x_i^T a)
= a^T ((1/n) Σ_{i=1}^n x_i x_i^T) a
= a^T Var(X) a.
⇒ a* is the eigenvector of Var(X) associated with the largest (positive) eigenvalue.
An eigenvalue decomposition

Generalization: If ((λ_j)_{j=1,...,p}, (a_j)_{j=1,...,p}) is the eigenvalue decomposition of Var(X) (by decreasing order of the positive λ_j), then the (a_j) are the factorial axes of X. The principal components of X are the coordinates of the projections of the data onto these axes.

Then, we have:

    Var(X) = Σ_{j=1}^p λ_j a_j a_j^T   and   x_i = Σ_{j=1}^p (x_i^T a_j) a_j,

where c_i^j = x_i^T a_j is the j-th principal component of x_i.
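As a minimal numerical illustration of this decomposition (a sketch assuming NumPy and synthetic data; the names `X`, `A`, `a_star` are illustrative, not from the slides): the first factorial axis is the leading eigenvector of the empirical variance, and the principal components reconstruct both Var(X) and the observations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])
X -= X.mean(axis=0)                      # center the data (assumption 1)

V = X.T @ X / n                          # empirical variance Var(X) = (1/n) X^T X
eigvals, eigvecs = np.linalg.eigh(V)     # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # reorder them decreasingly
eigvals, A = eigvals[order], eigvecs[:, order]

a_star = A[:, 0]                         # first factorial axis
C = X @ A                                # principal components c_i^j = x_i^T a_j

# a* maximizes the projected variance, and the eigendecomposition
# reconstructs Var(X) as well as each observation
assert np.isclose(np.var(X @ a_star), eigvals[0])
assert np.allclose(A @ np.diag(eigvals) @ A.T, V)
assert np.allclose(C @ A.T, X)
```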
Ordinary generalization of PCA in FDA

Data: x_1, . . . , x_n are n centered observations of a random functional variable X taking its values in X.

Aim: Find a* ∈ X such that:

    a* := arg max_{a: ‖a‖_X = 1} Var([‖P_a(x_i)‖_X]_i).

Solution:

    Var([‖P_a(x_i)‖_X]_i) = (1/n) Σ_{i=1}^n 〈a, x_i〉_X² = 〈Γ_X^n a, a〉_X,

where Γ_X^n = (1/n) Σ_{i=1}^n x_i ⊗ x_i is the empirical estimate of Γ_X and is of rank at most n (it is a Hilbert–Schmidt operator).

⇒ a* is the eigenvector of Γ_X^n associated with the biggest (positive) eigenvalue: Γ_X^n a* = λ* a*.
Eigenvalue decomposition of Γ_X^n

Factorial axes and principal components: If ((λ_j)_{j≥1}, (a_j)_{j≥1}) is the eigenvalue decomposition of Γ_X^n (by decreasing order of the positive λ_j), then the (a_j) are the factorial axes of x_1, . . . , x_n (note that at most n of the λ_j are nonzero). The principal components of x_1, . . . , x_n are the coordinates of the projections of the data onto these axes.

Then, we have:

    Γ_X^n = Σ_{j=1}^n λ_j a_j ⊗ a_j   and   x_i = Σ_{j=1}^n 〈x_i, a_j〉_X a_j,

where c_i^j = 〈x_i, a_j〉_X is the j-th principal component of x_i.
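A sketch of this functional PCA on a discrete grid (assumptions: synthetic curves sampled on a regular grid of [0, 1], with the L² inner product approximated by a quadrature sum of weight h; all names are illustrative): the discretized Γ_X^n has at most n nonzero eigenvalues, and the principal components reconstruct the curves.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 100, 50                           # grid size, sample size
t = np.linspace(0.0, 1.0, T)
h = t[1] - t[0]                          # quadrature weight for L^2([0,1])

# synthetic centered curves: random combinations of two smooth modes
scores = rng.normal(size=(n, 2)) * np.array([2.0, 0.5])
modes = np.vstack([np.sin(np.pi * t), np.sin(2 * np.pi * t)])
X = scores @ modes
X -= X.mean(axis=0)

# discretized covariance operator Gamma^n_X = (1/n) sum x_i (x) x_i,
# acting on a grid function a as a |-> h * G @ a
G = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(h * G)
order = np.argsort(eigvals)[::-1]
eigvals, A = eigvals[order], eigvecs[:, order] / np.sqrt(h)  # ||a_j||_{L^2} = 1

# at most n eigenvalues are nonzero (here effectively 2: two modes)
assert np.all(eigvals[2:] < 1e-8)
# principal components c_i^j = <x_i, a_j>_{L^2} reconstruct the curves
C = h * X @ A[:, :2]
assert np.allclose(C @ A[:, :2].T, X, atol=1e-8)
```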
Example on the Tecator dataset

[Figures: the data (absorbance spectra); the two first factorial axes; the 3rd and 4th factorial axes.]
Link with the regression problem

[Figure.]
Smoothness of the factorial axes

From a practical point of view, functional PCA in its original version is computed exactly like multivariate PCA (on the discretization of the curves, or on their expansion in a Hilbert basis).

Hence, if the original data are irregular, the factorial axes will not be smooth.
Smooth functional PCA

Aim: introduce a penalty in the optimization problem so as to obtain smooth (regular) factorial axes.

Ordinary functional PCA:

    a* := arg max_{a: ‖a‖_X = 1} Var([‖P_a(x_i)‖_X]_i),

and hence a* is the eigenvector of Γ_X^n associated with the biggest eigenvalue.

Penalized functional PCA: if D^{2m}X ∈ X = L²,

    a* := arg max_{a: ‖a‖_X = 1} { Var([‖P_a(x_i)‖_X]_i) + µ ‖D^m a‖_X² }

(µ > 0), and hence a* is the eigenvector of Γ_X^n + µ D^{2m} associated with the biggest eigenvalue.
Practical implementation of smooth PCA

Let (e_k)_{k≥1} be any functional basis. Then:
1. Approximate the observations by x_i = Σ_{k=1}^K ξ_i^k e_k.
2. Approximate the derivatives of the (e_k)_k by D^{2m} e_{k'} = Σ_{k=1}^K β_{k'k} e_k.
3. Then, we can show that:

       (Γ_X^n + µ D^{2m}) a = Σ_{k=1}^K [ (1/n) Σ_{i=1}^n ((ξ_i)^T E a) ξ_i^k + µ (β_k^T a) ] e_k,

   where E is the matrix containing (〈e_k, e_{k'}〉_X)_{k,k'=1,...,K}.
4. Smooth PCA is then performed by an eigendecomposition of (1/n) Σ_{i=1}^n ξ_i (ξ_i)^T E + µ β^T.

Remark: the decomposition D^{2m} e_{k'} = Σ_{k=1}^K β_{k'k} e_k is easy to obtain when using a spline basis ⇒ splines are well designed to represent data with smoothness properties (see Presentation 4 for further details).
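A sketch of step 4 above (assumptions: an orthonormal Fourier sine basis on [0, 1], so that E = Id and the penalty matrix β is diagonal, since D² e_k = −(2πk)² e_k; with a spline basis, E and β must instead be computed from the basis itself; m = 1 and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
K, n, mu = 12, 40, 1e-3
t = np.linspace(0.0, 1.0, 400)

# orthonormal Fourier sine basis on [0,1]: E = <e_k, e_k'> = Id,
# and D^2 e_k = -(2*pi*k)^2 e_k, so beta is diagonal (m = 1)
k = np.arange(1, K + 1)
basis = np.sqrt(2.0) * np.sin(np.outer(k, 2 * np.pi * t))   # (K, T)
beta = np.diag(-((2 * np.pi * k) ** 2))

# coefficients xi_i of n synthetic curves, mostly low-frequency
xi = rng.normal(size=(n, K)) / k ** 2
xi -= xi.mean(axis=0)

# step 4 of the slide: eigendecomposition of (1/n) sum xi xi^T E + mu beta^T
E = np.eye(K)
M = xi.T @ xi / n @ E + mu * beta.T
eigvals, eigvecs = np.linalg.eigh(M)        # M is symmetric in this setting
a_star = eigvecs[:, np.argmax(eigvals)]     # coefficients of the smooth axis
axis = a_star @ basis                       # the smooth factorial axis as a curve
```

The penalty shifts the eigenvalues of high-frequency directions downwards, so the leading eigenvector favors smooth axes.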
To conclude, several references...

Theoretical background for functional PCA: [Deville, 1974], [Dauxois and Pousse, 1976].
Smooth PCA: [Besse and Ramsay, 1986], [Pezzulli and Silverman, 1993], [Silverman, 1996].
Several examples and discussion: [Ramsay and Silverman, 2002], [Ramsay and Silverman, 1997].
Table of contents
1 Motivations
2 Functional Principal Component Analysis
3 Functional linear regression models
4 References
The model

We are interested here in the functional linear regression model:
- Y is a random variable taking its values in R,
- X is a random variable taking its values in X,
- X and Y satisfy the model

      Y = 〈X, α〉_X + ε,

  where ε is a centered real random variable independent of X, and α is the parameter to be estimated.

We are given a training set of size n, (x_i, y_i)_{i=1,...,n}, of independent realizations of the random pair (X, Y).

Or, alternatively, we are given a training set of size n with errors in variables, (w_i, y_i)_{i=1,...,n}, with:
- w_i = x_i + η_i (η_i is the realization of a centered random variable independent of Y),
- y_i = 〈x_i, α〉_X + ε_i.

This problem will be investigated in Presentation 4.
Basics about the functional linear regression model

To avoid unnecessary difficulties, we will suppose from now on that E(X) = 0. Let us first define the covariance between X and Y as:

    ∆(X,Y) = E(XY) ∈ X.

Then, we have:

    Γ_X α = ∆(X,Y).

But, as Γ_X is Hilbert–Schmidt, it is not invertible; thus, the empirical estimate of α (using a generalized inverse of Γ_X^n) does not converge to α when n tends to infinity. It is an ill-posed inverse problem.
⇒ Penalization or regularization is needed to obtain a relevant estimate.
PCA approach

References: [Cardot et al., 1999], building on the works of [Bosq, 1991] on Hilbertian AR models.

PCA decomposition of X: denote by
- ((λ_i^n, v_i^n))_{i≥1} the eigenvalue decomposition of Γ_X^n (the (λ_i^n)_i are ordered in decreasing order and at most n of them are nonzero; the (v_i^n)_i are orthonormal);
- k_n an integer such that k_n ≤ n and lim_{n→+∞} k_n = +∞;
- P_{k_n} the projector P_{k_n}(u) = Σ_{i=1}^{k_n} 〈v_i^n, u〉_X v_i^n;
- Γ_X^{n,k_n} = P_{k_n} ∘ Γ_X^n ∘ P_{k_n} = Σ_{i=1}^{k_n} λ_i^n 〈v_i^n, .〉_X v_i^n;
- ∆^{n,k_n}(X,Y) = P_{k_n}( (1/n) Σ_{i=1}^n y_i x_i ) = (1/n) Σ_{i=1,...,n, i'=1,...,k_n} y_i 〈x_i, v_{i'}^n〉_X v_{i'}^n.
Definition of a consistent estimate for α

Define:

    α_n = (Γ_X^{n,k_n})^+ ∆^{n,k_n}(X,Y),

where (Γ_X^{n,k_n})^+ denotes the generalized (Moore–Penrose) inverse:

    (Γ_X^{n,k_n})^+ = Σ_{i=1}^{k_n} (λ_i^n)^{-1} 〈v_i^n, .〉_X v_i^n.
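A numerical sketch of this estimator (assumptions: synthetic curves on a regular grid of [0, 1], L² inner products approximated by quadrature sums with weight h; all names are illustrative): α_n inverts Γ_X^n only on the k_n leading eigenfunctions.

```python
import numpy as np

rng = np.random.default_rng(3)
T, n, k_n = 100, 300, 4
t = np.linspace(0.0, 1.0, T)
h = t[1] - t[0]

# synthetic functional linear model y_i = <x_i, alpha> + eps_i
alpha = np.sin(2 * np.pi * t)
modes = np.vstack([np.sin(np.pi * j * t) for j in range(1, 6)])
X = (rng.normal(size=(n, 5)) / np.arange(1, 6)) @ modes
X -= X.mean(axis=0)
y = h * X @ alpha + 0.01 * rng.normal(size=n)

# eigendecomposition of the discretized Gamma^n_X (quadrature weight h)
G = h * (X.T @ X) / n
lam, V = np.linalg.eigh(G)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order] / np.sqrt(h)      # L^2-normalized v_i^n

# alpha_n = sum_{i<=k_n} (lam_i)^{-1} <Delta^n, v_i> v_i,
# with Delta^n = (1/n) sum y_i x_i
Delta = (y @ X) / n
coefs = h * Delta @ V[:, :k_n]                     # <Delta^n, v_i>_{L^2}
alpha_n = V[:, :k_n] @ (coefs / lam[:k_n])

# the estimate recovers alpha up to the projection on the k_n leading axes
assert np.sqrt(h * np.sum((alpha_n - alpha) ** 2)) < 0.3
```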
Assumptions for consistency result

(A1) (λ_i^n)_{i=1,...,k_n} are all distinct and nonzero a.s.
(A2) (λ_i)_{i≥1} are all distinct and nonzero.
(A3) X is a.s. bounded in X (‖X‖_X ≤ C₁ a.s.).
(A4) ε is a.s. bounded (|ε| ≤ C₂ a.s.).
(A5) lim_{n→+∞} n λ_{k_n}^4 / log n = +∞.
(A6) lim_{n→+∞} n λ_{k_n}² / [ (Σ_{j=1}^{k_n} a_j)² log n ] = +∞, where a₁ = 2√2 / (λ₁ − λ₂) and a_j = 2√2 / min(λ_{j−1} − λ_j ; λ_j − λ_{j+1}) for j > 1.

Example of Γ_X satisfying those assumptions: if the eigenvalues of Γ_X are geometrically or exponentially decreasing, these assumptions are fulfilled as long as the sequence (k_n)_n tends slowly enough to +∞. For example: X a Brownian motion on [0, 1] and k_n = o(log n).
Consistency result

Theorem [Cardot et al., 1999]: Under assumptions (A1)–(A6), we have:

    ‖α_n − α‖_X → 0 as n → +∞.
Smoothing approach based on B-splines

References: [Cardot et al., 2003].

Suppose that X takes its values in L²([0, 1]).

Basics on B-splines: Let q and k be two integers and denote by S_{qk} the space of functions s satisfying:
- s is a polynomial of degree q on each interval [(l−1)/k, l/k], for all l = 1, . . . , k;
- s is q − 1 times differentiable on [0, 1].

The space S_{qk} has dimension q + k, and a normalized basis of S_{qk} is denoted by {B_j^{qk}, j = 1, . . . , q + k} (normalized B-splines, see [de Boor, 1978]). These functions are easy to manipulate and have interesting smoothness properties. They can be used to express both X and the parameter α, as well as to impose smoothness constraints on α.
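The normalized B-splines can be evaluated with the Cox–de Boor recursion; a self-contained sketch (assumptions: equispaced knots l/k on [0, 1] with clamped boundary knots; `bspline_basis` is an illustrative helper, not taken verbatim from [de Boor, 1978]):

```python
import numpy as np

def bspline_basis(x, q, k):
    """Evaluate the q + k normalized B-splines of degree q on [0, 1] with
    k equal subintervals (knots at l/k), via the Cox-de Boor recursion."""
    knots = np.concatenate([np.zeros(q), np.linspace(0.0, 1.0, k + 1), np.ones(q)])
    x = np.asarray(x, dtype=float)
    # degree-0 splines: indicators of the knot intervals
    B = ((knots[:-1, None] <= x) & (x < knots[1:, None])).astype(float)
    last = np.max(np.nonzero(knots[1:] > knots[:-1]))
    B[last, x == 1.0] = 1.0                  # include the right endpoint
    for d in range(1, q + 1):                # raise the degree step by step
        nb = len(knots) - d - 1
        Bn = np.zeros((nb, x.size))
        for j in range(nb):
            left = knots[j + d] - knots[j]
            right = knots[j + d + 1] - knots[j + 1]
            if left > 0:
                Bn[j] += (x - knots[j]) / left * B[j]
            if right > 0:
                Bn[j] += (knots[j + d + 1] - x) / right * B[j + 1]
        B = Bn
    return B                                 # shape (q + k, len(x))

x = np.linspace(0.0, 1.0, 201)
B = bspline_basis(x, q=3, k=5)               # dim S_qk = q + k = 8
assert B.shape == (8, 201)
assert np.all(B >= 0)
assert np.allclose(B.sum(axis=0), 1.0)       # normalized: partition of unity
```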
Definition of a consistent estimate for α

Note B_{qk} := (B_1^{qk}, . . . , B_{q+k}^{qk})^T and B_{qk}^{(m)} the m-th derivatives of B_{qk}, for m < q.

A penalized mean square estimate: provided that α is smooth enough, we aim at finding α_n = Σ_{j=1}^{q+k} a_j^n B_j^{qk} = (a^n)^T B_{qk}, a solution of the optimization problem:

    arg min_{a ∈ R^{q+k}}  (1/n) Σ_{i=1}^n ( y_i − 〈a^T B_{qk}, x_i〉_X )²  [mean square criterion]  + µ ‖a^T B_{qk}^{(m)}‖_X²  [smoothness penalization].

The solution of the previous problem is given by

    a^n = (C_n + µ G_{qk}^n)^{-1} b_n,

where C_n is the matrix with components 〈Γ_X^n B_j^{qk}, B_{j'}^{qk}〉_X (j, j' = 1, . . . , q + k), b_n is the vector with components 〈∆^n(X,Y), B_j^{qk}〉_X (j = 1, . . . , q + k), and G_{qk}^n is the matrix with components 〈B_j^{qk(m)}, B_{j'}^{qk(m)}〉_X (j, j' = 1, . . . , q + k).
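A numerical sketch of the closed-form solution above (assumptions: synthetic data; a small monomial basis with m = 2 stands in for the B-spline basis B_{qk}, purely to keep the example short; L² inner products are approximated by quadrature sums; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
T, n, K, m, mu = 200, 150, 8, 2, 1e-4
t = np.linspace(0.0, 1.0, T)
h = t[1] - t[0]

# monomial basis t^j and its m-th derivatives (stand-in for B-splines)
j = np.arange(K)
Bq = t[None, :] ** j[:, None]                               # (K, T)
fact = np.where(j >= m, j * np.clip(j - 1, 1, None), 0.0)   # j(j-1) for m = 2
Bq_m = fact[:, None] * t[None, :] ** np.clip(j - m, 0, None)[:, None]

# functional linear model y_i = <x_i, alpha> + eps_i with a smooth alpha
alpha = np.cos(np.pi * t)
X = rng.normal(size=(n, 3)) @ np.vstack([np.ones(T), t, t ** 2])
X -= X.mean(axis=0)
y = h * X @ alpha + 0.05 * rng.normal(size=n)

# C_n, b_n, G_qk^n from the slide, via quadrature sums
P = h * X @ Bq.T                      # <x_i, B_j>_{L^2}
C_n = P.T @ P / n                     # <Gamma^n_X B_j, B_j'>
b = (y @ P) / n                       # <Delta^n(X,Y), B_j>
G = h * Bq_m @ Bq_m.T                 # <B_j^(m), B_j'^(m)>
a = np.linalg.solve(C_n + mu * G, b)  # a^n = (C_n + mu G)^{-1} b_n
alpha_n = a @ Bq                      # the estimated coefficient function
assert np.all(np.isfinite(alpha_n))
```

The penalty term µG makes the linear system invertible even though C_n alone is rank-deficient, which is exactly the regularization called for by the ill-posedness of the problem.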
Assumptions for consistency result

(A1) $X$ is a.s. bounded in $\mathcal{X}$;
(A2) $\mathrm{Var}(Y \mid X = x) \le C_1$ for all $x \in \mathcal{X}$;
(A3) $E(Y \mid X = x) \le C_2$ for all $x \in \mathcal{X}$;
(A4) there exist an integer $p'$ and a real $\nu \in [0, 1]$ such that $p' + \nu \le q$ and $\left| \alpha^{(p')}(t_1) - \alpha^{(p')}(t_2) \right| \le |t_1 - t_2|^{\nu}$;
(A5) $\mu = O\!\left(n^{-(1-\delta)/2}\right)$ for a $0 < \delta < 1$;
(A6) $\lim_{n \to +\infty} \mu k^{2(m-p)} = 0$ where $p = p' + \nu$;
(A7) $k = O\!\left(n^{1/(4p+1)}\right)$.
Consistency result

Theorem [Cardot et al., 2003]
Under assumptions (A1)-(A7),

$$\lim_{n \to +\infty} P\left(\text{there exists a unique solution to the minimization problem}\right) = 1$$

$$E\left( \|\alpha_n - \alpha\|^2_{\mathcal{X}} \mid x_1, \dots, x_n \right) = O_P\!\left( n^{-2p/(4p+1)} \right)$$
Other functional linear methods
Canonical correlation: [Leurgans et al., 1993]
Factorial Discriminant Analysis: [Hastie et al., 1995]
Partial Least Squares: [Preda and Saporta, 2005]
. . .
Table of contents
1 Motivations
2 Functional Principal Component Analysis
3 Functional linear regression models
4 References
References
Further details on the references are given in the accompanying document.
Azaïs, J., Bercu, S., Chaoui, O., Fort, J., Lagnoux-Renaudie, A., and Lé, P. (2008). Load curves estimation and simultaneous confidence bands. Preprint, available (in French) at http://www.lsp.ups-tlse.fr/Fp/Lagnoux/rapport_final.pdf.

Bensmail, H., Aruna, B., Semmes, O., and Haoudi, A. (2005). Functional clustering algorithm for high-dimensional proteomics data. Journal of Biomedicine and Biotechnology, 2:80-86.

Besse, P. and Ramsay, J. (1986). Principal component analysis of sampled curves. Psychometrika, 51:285-311.

Bosq, D. (1991). Modelization, non-parametric estimation and prediction for continuous time processes. Volume 335 of ASI Series, pages 509-529. NATO.

Cardot, H., Ferraty, F., and Sarda, P. (1999). Functional linear model. Statistics and Probability Letters, 45:11-22.

Cardot, H., Ferraty, F., and Sarda, P. (2003). Spline estimators for the functional linear model. Statistica Sinica, 13:571-591.

Dauxois, J. and Pousse, A. (1976). Les analyses factorielles en calcul des probabilités et en statistique : essai d'étude synthétique. Thèse d'État, Université Toulouse III.

de Boor, C. (1978). A Practical Guide to Splines. Springer, New York.

Deville, J. (1974). Méthodes statistiques et numériques de l'analyse harmonique. Annales de l'INSEE, 15(Janvier-Avril):3-97.

Hastie, T., Buja, A., and Tibshirani, R. (1995). Penalized discriminant analysis. Annals of Statistics, 23:73-102.

Leurgans, S., Moyeed, R., and Silverman, B. (1993). Canonical correlation analysis when the data are curves. Journal of the Royal Statistical Society, Series B, 55:725-740.

Pezzulli, S. and Silverman, B. (1993). Some properties of smoothed principal components analysis for functional data. Computational Statistics, 8:1-16.

Preda, C. and Saporta, G. (2005). Clusterwise PLS regression on a stochastic process. Computational Statistics and Data Analysis, 49(1):99-108.

Ramsay, J. and Silverman, B. (1997). Functional Data Analysis. Springer Verlag, New York.

Ramsay, J. and Silverman, B. (2002). Applied Functional Data Analysis. Springer Verlag.

Silverman, B. (1996). Smoothed functional principal components analysis by choice of norm. Annals of Statistics, 24:1-24.