Transcript of "Understanding multivariate pattern analysis for neuroimaging applications"

Page 1

Understanding multivariate pattern analysis for neuroimaging applications

Dimitri Van De Ville
Medical Image Processing Lab, Institute of Bioengineering, Ecole Polytechnique Fédérale de Lausanne
Department of Radiology & Medical Informatics, Université de Genève

http://miplab.epfl.ch/

April 23, 2015

Page 2

Overview

Brain decoding: motivation and applications
- Limitations of "regular SPM" (GLM)
- "Hall of fame" of brain decoding

Pattern recognition principles for brain decoding
- Basics
- Linear classifier
- Discriminative versus generative model
- Cross-validation: training phase, evaluation phase, testing phase
- Support vector machine (SVM): primal and dual form, kernel trick

2

Page 3

3

[Figure: from fMRI acquisition to GLM statistics]
Whole-brain scan: 2x2x2 mm^3, 20-30 slices, every 2-4 sec
~100'000 intracranial voxels, ~1'000 timepoints
GLM fit over time yields statistics (a t-value per voxel)

Page 4

Limitations of the GLM approach

Massively univariate approach
- Hypothesis testing is applied voxel-wise
- All voxels are treated independently, but are obviously not independent

Textbook multivariate approaches are not practical
- E.g., estimating a spatial covariance matrix (100'000 x 100'000) from 1'000 timepoints will be rank-deficient

fMRI data are complicated ➔ data-driven: let the data speak

Group statistics are great...
- ... but the ultimate goal is often to know how well the results generalize (prediction)
- Learn from training data, apply to the (unseen) test data

4

[Figure: condition 1 vs. condition 2 samples plotted in a two-voxel (voxel 1, voxel 2) feature space]

Page 5

Forward versus backwards inference

5

[Diagram: encoding with a generative model maps stimuli / conditions / disease / disorder to data; decoding with a discriminative model maps data back to stimuli / conditions / disease / disorder; labeled caricatures: phrenology (encoding), blind reading (decoding)]

Page 6

Exploiting “subtle” spatial patterns

[Cox et al., 2003; Haynes and Rees, 2006]

example in a 2D and a 9D feature space

Page 7

Emotions in voice-sensitive cortices

How is emotional information treated in auditory cortex?
- 22 healthy right-handed subjects
- 10 actors pronounced "ne kalibam sout molem" (anger, sadness, neutrality, relief, joy)
- normalization for mean sound intensity; rated by 24 (other) subjects

Classifier approach on auditory cortex
- localizer: human voices > animal sounds
- 20 stimuli for each category (5); 5 classifiers (one-versus-others)

7 [Ethofer, VDV, Scherer, Vuilleumier, Current Biology, 2009; collaboration with LABNIC/CISA, UniGE]

p<0.05 (FWE)

Page 8

8

[Figure: classification maps for Anger, Sadness, Neutral, Relief, Joy shown on axial slices z = 0, +3, +6; stimulus: "Ne kalibam sout molem"]
[Ethofer, VDV, Scherer, Vuilleumier, Current Biology, 2009]

Page 9

Decoding from the emotional brain

9 [Ethofer, VDV, Scherer, Vuilleumier, Current Biology, 2009]

Results for bilateral voice-sensitive cortices
- solid: smoothed (10 mm FWHM); dashed: non-smoothed; dotted: spatial average
- Chance level: 20%; classifier: linear SVM

Evidence that auditory cues for emotional processing are spatially encoded; bilateral representation

Comprehension of emotional prosody is crucial for social functioning; relevant to psychiatric disorders (schizophrenia, bipolar affective disorder, ...)

Page 10

Brain decoding of visual processing

10

- distributed patterns [Haxby et al., 2001]
- subtle patterns in V1 [Kamitani and Tong, 2005, 2006; Haynes and Rees, 2006]
- advanced models for receptive fields [Miyawaki et al., 2008]

Page 11

11

Decoding video sequences

reverse statistical inference ("decode" the stimulus from the data)

[Nishimoto et al, 2011]

Page 12

Basics of pattern recognition

Machine learning aims to learn the right model from a series of labeled examples (supervised learning)

- learning/training phase: from examples {(x_1, y_1), ..., (x_N, y_N)}, learn a model f such that f(x_i) = y_i
- testing phase: for unseen data x_s, predict the label f(x_s)

12

[Diagram: feature vector x_i (D features) → model / function / machine / classifier → label y_i (e.g., -1/+1, control/patient)]

Page 13

Classification as a regression problem

Linear model:

$y = \mathbf{w}^T X + b + \text{noise}$

where the vector $(\mathbf{w}, b)$ characterizes the full model

Learning the model by least-squares fitting:
$$(\mathbf{w}, b) = \arg\min_{(\mathbf{w},b)} \left\| y - \mathbf{w}^T X - b \right\|^2$$

Underdetermined ($D > N$):
$$(\mathbf{w}, b) = \arg\min_{(\mathbf{w},b)} \left\| y - \mathbf{w}^T X - b \right\|^2 + \lambda \|\mathbf{w}\|^2$$

solution: $\mathbf{w} = \mathrm{pinv}(X^T)\, y^T$

Basics of pattern recognition

13

[Diagram: feature matrix X (N instances x D features), labels y, model]

Page 14

Basics of pattern recognition

14

[Diagram: feature matrix X (N instances x D features), labels y, model]

Evaluate classifier on unseen data

Avoid overfitting:

increasingly complex models will fit (any) training data better!

Classify new data $\mathbf{x}_0$ by comparing $\mathbf{w}^T\mathbf{x}_0 + b$ against 0
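As an illustration of this regression view (not part of the original slides), here is a minimal NumPy sketch that absorbs the bias into an extra constant feature, takes the minimum-norm least-squares solution via the pseudoinverse (as in w = pinv(X^T) y^T above, up to the transposed data layout), and classifies an unseen instance by the sign of w^T x_0 + b. The toy data and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N instances, D features (D > N, as in fMRI), labels -1/+1
N, D = 20, 100
y = np.where(np.arange(N) < N // 2, -1.0, 1.0)
X = rng.standard_normal((N, D))
X[y > 0, 0] += 2.0                      # make feature (voxel) 0 informative

# Absorb the bias b as an extra constant feature, then take the
# minimum-norm least-squares solution with the pseudoinverse
Xb = np.hstack([X, np.ones((N, 1))])    # N x (D+1)
wb = np.linalg.pinv(Xb) @ y             # (D+1)-vector: w followed by b
w, b = wb[:-1], wb[-1]

# Classify an unseen instance x0 by comparing w^T x0 + b against 0
x0 = rng.standard_normal(D)
decision = w @ x0 + b
print("decision value:", decision, "-> class", 1 if decision > 0 else -1)
```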

Page 15

Interpretation of the weight vector

15

Training: examples of class 1 and examples of class 2, described by two features (voxel 1, voxel 2), yield the weight vector w: w1 = +5, w2 = -3

Testing: new example x1 = 0.5, x2 = 0.8

predicted label: (w1·x1 + w2·x2) + b = (+5·0.5 - 3·0.8) + 0 = 0.1; positive value -> class 1

• spatial representation of decision function

• multivariate pattern: no local inference!
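The worked example above translates directly into a couple of lines of NumPy; the numbers (w = (+5, -3), b = 0, new example (0.5, 0.8)) are the ones from the slide.

```python
import numpy as np

w = np.array([5.0, -3.0])     # weight map over (voxel 1, voxel 2)
b = 0.0
x_new = np.array([0.5, 0.8])  # new example to be classified

decision = w @ x_new + b      # (+5 * 0.5) + (-3 * 0.8) + 0 = 0.1
print(decision, "-> class 1" if decision > 0 else "-> class 2")
```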

Page 16

Basics of pattern recognition

16

Linear model projects an instance onto a line parallel to w
Decision boundary is a hyperplane in the feature space
If feature dimensions correspond to voxels, then w corresponds to a map
Bias b allows shifting the decision boundary along w

[adapted from Jaakkola]

[Figure: separating hyperplane; decision values -1, 0, +1 along the direction of w]

Page 17

Classification and decision theory

Assume class-conditional densities $p(\mathbf{x}|y)$ and class probabilities $P(y)$ are known

Minimizing the overall probability of error corresponds to
$$y_0 = \arg\max_y P(y|\mathbf{x}_0) = \arg\max_y \frac{p(\mathbf{x}_0|y)P(y)}{p(\mathbf{x}_0)} \;\text{(Bayes' rule)}\; = \arg\max_y p(\mathbf{x}_0|y)P(y)$$

Basics of pattern recognition

17

[Diagram: feature matrix X (N instances x D features), labels y, model; posterior P(y|x), likelihood p(x|y)]

Page 18

Wonderful link between the discriminative model (posterior) and the generative model (likelihood):
$$P(y|\mathbf{x}) \propto p(\mathbf{x}|y)P(y)$$

Discriminative model; e.g., linear classifier, SVM

Generative model; e.g., naïve Bayes
$$p(\mathbf{x}|y) = \prod_{i=1}^{D} p(x_i|y)$$
where the features are modeled as independent random variables

Basics of pattern recognition

18

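To make the generative route concrete, a minimal Gaussian naïve Bayes classifier can be written directly from p(x|y) = prod_i p(x_i|y) and the argmax of p(x|y)P(y). This sketch is illustrative (not from the slides) and assumes Gaussian class-conditional densities per feature.

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Estimate the class priors and per-feature Gaussian parameters."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),        # prior P(y = c)
                     Xc.mean(axis=0),          # per-feature means
                     Xc.var(axis=0) + 1e-6)    # per-feature variances (regularized)
    return params

def predict_naive_bayes(params, x):
    """Return argmax_y log p(x|y) + log P(y), with p(x|y) = prod_i N(x_i; mu_i, var_i)."""
    scores = {}
    for c, (prior, mu, var) in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        scores[c] = log_lik + np.log(prior)
    return max(scores, key=scores.get)

# Toy two-class example
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 5)), rng.normal(1.5, 1.0, (30, 5))])
y = np.array([0] * 30 + [1] * 30)
model = fit_naive_bayes(X, y)
print(predict_naive_bayes(model, np.full(5, 1.4)))   # expected: class 1
```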

Page 19

Basics of pattern recognition

19

[Diagram: feature matrix X (N instances x D features), labels y, model]

[Diagram: split the full dataset into train (TR), evaluation (EV), and test (TE) sets; train on TR, evaluate on EV and tune until the error is OK, then report the final performance measure on TE]

Leave-one-out cross-validation

Repeatedly train on all instances except one, evaluate on left-out instance

Accuracy statistics: class 1 = $\frac{A}{A+B}$; class 2 = $\frac{D}{C+D}$; average = $\frac{A+D}{A+B+C+D}$

Sensitivity = $\frac{D}{C+D} = \frac{TP}{FN+TP}$; specificity = $\frac{A}{A+B} = \frac{TN}{FP+TN}$

Confusion table (rows: truth; columns: estimated):

                 estimated class 1   estimated class 2
truth class 1           A                   B
truth class 2           C                   D
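A leave-one-out loop with the confusion-matrix statistics of the table might look as follows. This is a sketch, not the pipeline used on the slides: it uses scikit-learn's LinearSVC as the classifier and treats class 2 as the "positive" class, so that D = TP and A = TN as above.

```python
import numpy as np
from sklearn.svm import LinearSVC

def loo_confusion(X, y):
    """Leave-one-out CV: train on all instances except one, predict the left-out one."""
    N = len(y)
    A = B = C = D = 0                      # truth x estimated counts, as in the table
    for i in range(N):
        train = np.arange(N) != i
        clf = LinearSVC(C=1.0).fit(X[train], y[train])
        pred = clf.predict(X[i:i + 1])[0]
        if y[i] == 1:                      # truth: class 1
            A += pred == 1                 # A = TN
            B += pred == 2                 # B = FP
        else:                              # truth: class 2
            C += pred == 1                 # C = FN
            D += pred == 2                 # D = TP
    accuracy = (A + D) / N
    sensitivity = D / (C + D)              # TP / (FN + TP)
    specificity = A / (A + B)              # TN / (FP + TN)
    return accuracy, sensitivity, specificity

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (15, 10)), rng.normal(1.0, 1.0, (15, 10))])
y = np.array([1] * 15 + [2] * 15)
print(loo_confusion(X, y))
```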

Page 20

Testing classifiers: the ROC curve

Receiver Operating Characteristic (ROC) curve
- Allows one to see the compromise over the operating range of a classifier
- Can be used for classifier comparison; e.g., compute the Area Under the Curve (AUC) as a summary measure of performance

20

[Plot: ROC curves, sensitivity versus 1 - specificity; classifier 1: AUC 0.79, classifier 2: AUC 0.74]
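A ROC curve is traced by sweeping the decision threshold over the classifier's continuous output scores. The sketch below (illustrative, not the slide's data) computes sensitivity and 1 - specificity at every threshold and the AUC by trapezoidal integration.

```python
import numpy as np

def roc_auc(scores, labels):
    """labels in {0, 1}; higher score means more 'class 1'. Returns FPR, TPR, AUC."""
    order = np.argsort(-scores)                        # sweep threshold from high to low
    lab = labels[order]
    tpr = np.concatenate([[0.0], np.cumsum(lab) / lab.sum()])            # sensitivity
    fpr = np.concatenate([[0.0], np.cumsum(1 - lab) / (1 - lab).sum()])  # 1 - specificity
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)                # trapezoidal area
    return fpr, tpr, auc

rng = np.random.default_rng(0)
labels = np.array([0] * 50 + [1] * 50)
scores = labels + rng.normal(0.0, 1.0, 100)            # noisy scores, AUC well above 0.5
print("AUC =", roc_auc(scores, labels)[2])
```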

Page 21

Learning in high dimensions

Dimensionality of the feature vector (e.g., #voxels) often exceeds the number of instances
- the model is underdetermined and requires additional assumptions (e.g., regularization)

Other solutions
- feature selection & extraction
  - gray-matter mask; region-of-interest (ROI); atlas regions
  - only the most-active voxels (selected inside the training phase!)
  - searchlight approach [Kriegeskorte et al., 2006]: the ROI is 'sliding' around the brain; multivariate capabilities (long-range) are limited
- dimensionality reduction: PCA, ICA, ... (see the sketch below)
- kernel methods

21
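As a concrete instance of the dimensionality-reduction item above, the sketch below fits PCA on the training data only and applies it to the test data before classification; the scikit-learn pipeline and the toy dimensions are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 5000))            # 60 scans, 5000 voxels (D >> N)
y = np.array([0] * 30 + [1] * 30)
X[y == 1, :50] += 0.5                          # weak signal in 50 voxels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit the reduction on the training set only, then apply it to the test set
pca = PCA(n_components=20).fit(X_tr)
clf = LinearSVC(C=1.0).fit(pca.transform(X_tr), y_tr)
print("test accuracy:", clf.score(pca.transform(X_te), y_te))
```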

Page 22

Kernel matrix

Kernel function K
- similarity measure (scalar) between two feature vectors
- simplest example: dot product (linear kernel)
- e.g., dot product of (4, 1) and (-2, 3): (4 · (-2)) + (1 · 3) = -5

N brain scans (feature vectors)

22

[Figure: example feature vectors (4, 1) and (-2, 3); N x N kernel matrix of pairwise similarities between instances (brain scans)]
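For a linear kernel, the kernel matrix is simply the matrix of pairwise dot products between instances; a minimal sketch with illustrative shapes:

```python
import numpy as np

# The slide's example vectors
print(np.dot([4, 1], [-2, 3]))        # 4*(-2) + 1*3 = -5

# N brain scans, each flattened to a D-dimensional feature vector
rng = np.random.default_rng(0)
N, D = 45, 10000
X = rng.standard_normal((N, D))

K = X @ X.T                            # N x N linear-kernel (Gram) matrix
print(K.shape)                         # (45, 45): similarity between instances
```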

Page 23

Maximum-margin classifier

Assume data are linearly separable:
$$\exists\, \mathbf{w} \in \mathbb{R}^D,\ \exists\, b \in \mathbb{R} : \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) > 0,\quad i = 1, \ldots, N$$

w and b can be rescaled such that the closest points to the hyperplane satisfy $\left|\mathbf{w}^T\mathbf{x}_i + b\right| = 1$

Distance $\rho_{\mathbf{x}}$ of a point x to the hyperplane is
$$\rho_{\mathbf{x}} = \frac{\left|\mathbf{w}^T\mathbf{x} + b\right|}{\|\mathbf{w}\|}$$

Margin of the hyperplane = $1/\|\mathbf{w}\|$

Support vector machine

23 [Vapnik, 1963, 1992]

HARD MARGIN

[Figure: separating hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$ with margin hyperplanes $\mathbf{w}^T\mathbf{x} + b = \pm 1$]

Page 24

Quadratic programming optimization problem:
$$\arg\min_{(\mathbf{w},b)} \frac{1}{2}\mathbf{w}^T\mathbf{w} \quad \text{subject to } y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1,\quad i = 1, \ldots, N$$

The constrained problem can be formulated as an unconstrained one using Lagrange multipliers:
$$\arg\min_{(\mathbf{w},b)} \max_{\alpha \geq 0} \underbrace{\frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left( y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \right)}_{L}$$

Points for which $y_i(\mathbf{w}^T\mathbf{x}_i + b) > 1$ do not matter and should have $\alpha_i = 0$

Points on the margin have $y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$ and are called support vectors; these have $\alpha_i > 0$ and thus
$$\frac{\partial L}{\partial \alpha_i} = y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 = 0 \;\rightarrow\; b = y_i - \mathbf{w}^T\mathbf{x}_i$$

Support vector machine (primal form)

24

convex criterion so no local minima

Page 25

Lagrangian functional:
$$\arg\min_{(\mathbf{w},b)} \max_{\alpha \geq 0} \underbrace{\frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left( y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \right)}_{L}$$

Solving L for $\alpha_i$, $b$, $\mathbf{w}$ leads to
$$\frac{\partial L}{\partial \alpha_i} = y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 = 0 \;\rightarrow\; b = y_i - \mathbf{w}^T\mathbf{x}_i \quad \text{(for a support vector)}$$
$$\frac{\partial L}{\partial b} = -\sum_{i=1}^{N} y_i\alpha_i = 0$$
$$\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i = 0 \;\rightarrow\; \mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i$$

Support vector machine (primal form)

25

Page 26

Lagrangian functional can be rewritten as
$$L = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i y_i \underbrace{\mathbf{x}_i^T\mathbf{x}_j}_{K(\mathbf{x}_i,\mathbf{x}_j)} y_j \alpha_j - \sum_{i=1}^{N}\alpha_i, \quad \text{s.t. } \sum_{i=1}^{N} y_i\alpha_i = 0$$

Minimization of this dual form requires finding $\alpha_i \geq 0$ in an N-dimensional space instead of a D-dimensional one!

Weight vector w can be found as a linear combination of the instances
$$\mathbf{w} = \sum_{i'=1}^{N_{SV}} \alpha_{i'} y_{i'} \mathbf{x}_{i'}$$
where only the $N_{SV}$ support vectors (with $\alpha_{i'} > 0$) contribute!

Prediction of an unseen data point x according to
$$f(\mathbf{x}) = \operatorname{sign}\!\left(\sum_{i'=1}^{N_{SV}} y_{i'}\alpha_{i'}\mathbf{x}_{i'}^T\mathbf{x} + b\right)$$

Support vector machine (dual form)

26

Efficient: "sparse" solution
Limited number of support vectors; the rest of the data can be ignored
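A hedged scikit-learn sketch (not from the slides): after fitting a linear SVC, the products alpha_i * y_i and the support vectors are exposed, so the prediction formula above can be checked against the library's decision function. Data and names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (20, 2)), rng.normal(1.0, 1.0, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
alpha_y = clf.dual_coef_[0]            # alpha_i * y_i for the support vectors only
sv = clf.support_vectors_              # the N_SV support vectors
b = clf.intercept_[0]

# f(x) = sign( sum_i' y_i' alpha_i' x_i'^T x + b )
x_new = np.array([0.3, -0.2])
decision = alpha_y @ (sv @ x_new) + b
print(decision, clf.decision_function([x_new])[0])   # the two values should match
print("predicted class:", int(np.sign(decision)))
```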

Page 27

Soft-margin classifier

Introduce slack variables $\xi_i$ to relax the separation:
$\xi_i > 0 \;\rightarrow\; \mathbf{x}_i$ has margin less than 1

New optimization problem (primal):
$$\arg\min_{(\mathbf{w},b),\,\xi_i \geq 0} \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{N}\xi_i \quad \text{subject to } y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i$$

which leads to the dual form:
$$\arg\min_{\alpha \geq 0} \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i y_i \mathbf{x}_i^T\mathbf{x}_j\, y_j\alpha_j - \sum_{i=1}^{N}\alpha_i \quad \text{subject to } \sum_{i=1}^{N} y_i\alpha_i = 0 \text{ and } 0 \leq \alpha_i \leq C$$

Support vector machine

27

SOFT MARGIN
[Figure: soft-margin separating hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$ with margins $\pm 1$ and slack variables $\xi_1$, $\xi_2$ for margin violations]

"box constraint": allows for capacity control
- C = 0: very large safety margin
- C = Inf: close to the hard-margin constraint

[Shawe-Taylor and Cristianini, 2004]

Page 28

Linear classification (by hyperplane) can be limiting

Project into an (even) higher-dimensional feature space V where linear separation corresponds to non-linear separation in the original space

Kernel trick

28

Page 29

Linear classification (by hyperplane) can be limiting

Project into an (even) higher-dimensional feature space V where linear separation corresponds to non-linear separation in the original space

However, this new feature space V is implicit, because in practice we use generalized kernel functions:
$$K(\mathbf{x}_1,\mathbf{x}_2) = \langle \varphi(\mathbf{x}_1), \varphi(\mathbf{x}_2)\rangle_V$$
where $\langle\cdot,\cdot\rangle_V$ should be a proper inner product, that is, $K(\cdot,\cdot)$ should satisfy Mercer's condition:
$$\sum_{i=1}^{N}\sum_{j=1}^{N} K(\mathbf{x}_i,\mathbf{x}_j)\, c_i c_j \geq 0$$
for all finite sequences of points $(\mathbf{x}_1, \ldots, \mathbf{x}_N)$ and real-valued coefficients $(c_1, \ldots, c_N)$. This also means that the kernel matrix $[K(\mathbf{x}_i,\mathbf{x}_j)]_{i,j}$ should be positive semi-definite (i.e., all eigenvalues $\geq 0$)

Kernel trick

29
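Mercer's condition can be checked numerically for a given kernel matrix by verifying that all its eigenvalues are non-negative; a small illustrative sketch with a Gaussian RBF kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
gamma = 0.5

# Gaussian RBF kernel matrix: K_ij = exp(-gamma * ||x_i - x_j||^2)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-gamma * sq_dists)

# Positive semi-definite <=> all eigenvalues >= 0 (up to numerical precision)
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())
```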

Page 30

Commonly used kernel functions $K(\mathbf{x}_1,\mathbf{x}_2) = \langle\varphi(\mathbf{x}_1),\varphi(\mathbf{x}_2)\rangle$:
- Polynomial (homogeneous): $K(\mathbf{x}_1,\mathbf{x}_2) = (\mathbf{x}_1^T\mathbf{x}_2)^p$
- Polynomial (inhomogeneous): $K(\mathbf{x}_1,\mathbf{x}_2) = (\mathbf{x}_1^T\mathbf{x}_2 + 1)^p$
- Gaussian radial basis kernel: $K(\mathbf{x}_1,\mathbf{x}_2) = \exp(-\gamma\|\mathbf{x}_1-\mathbf{x}_2\|^2),\ \gamma > 0$

Example: polynomial kernel $K(\mathbf{x}_1,\mathbf{x}_2) = (\mathbf{x}_1^T\mathbf{x}_2)^2$

For D = 2, developing the kernel function leads to (binomial theorem)
$$K(\mathbf{x}_1,\mathbf{x}_2) = (x_{1,1}x_{2,1} + x_{1,2}x_{2,2})^2 = (x_{1,1}x_{2,1})^2 + 2\,x_{1,1}x_{2,1}x_{1,2}x_{2,2} + (x_{1,2}x_{2,2})^2 = \left\langle [x_{1,1}^2 \;\; \sqrt{2}\,x_{1,1}x_{1,2} \;\; x_{1,2}^2],\ [x_{2,1}^2 \;\; \sqrt{2}\,x_{2,1}x_{2,2} \;\; x_{2,2}^2] \right\rangle$$

From which we can identify $\varphi: (x_1, x_2) \mapsto (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)$

For general D, this kernel maps to $\binom{D+1}{2} = \frac{D(D+1)}{2}$ dimensions (multinomial theorem)

Kernel trick

30
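The identification of phi for the homogeneous polynomial kernel with p = 2 and D = 2 can be verified numerically; the two numbers printed below agree, showing that the kernel computes the inner product in the 3-dimensional feature space implicitly (the example vectors are arbitrary).

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x1, x2) = (x1^T x2)^2 with D = 2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x1 = np.array([0.7, -1.2])
x2 = np.array([2.0, 0.5])

print((x1 @ x2) ** 2)          # kernel evaluated directly in the original 2-D space
print(phi(x1) @ phi(x2))       # inner product in the mapped 3-D feature space
```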

Page 31

The transform $\varphi$ is implicit, but the weight vector is defined as
$$\mathbf{w} = \sum_i \alpha_i y_i \varphi(\mathbf{x}_i)$$

Dot products with w can be computed using the kernel trick:
$$\mathbf{w}^T\varphi(\mathbf{x}) = \sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x})$$

However, in general there exists no $\mathbf{w}'$ such that
$$\mathbf{w}^T\varphi(\mathbf{x}) = K(\mathbf{w}', \mathbf{x})$$
which means there is no "equivalent weight map"

Kernel trick

31

Page 32

Neuroimaging applications often work best with linear kernels
- Very high-dimensional data with small sample sizes (D >> N); the increased complexity of non-linear kernels (more parameters) cannot be turned into an advantage
- Linear models have less risk of overfitting
- The weight vector of a linear model has a direct interpretation as an image, i.e., a "discrimination map"

Linear and non-linear kernels

32

Page 33

[Figure: trade-off between high reproducibility / low predictability and low reproducibility / high predictability]

Model complexity (no free lunch)

Ensemble learning: combine many (weak) classifiers to get better results (see the sketch below)
- bagging: classifiers trained on random training subsets, equal vote
- boosting: adding classifiers that deal with previously misclassified data

What’s the best classifier?

33 [Hastie et al., 2009]
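A minimal bagging sketch with scikit-learn, referenced from the ensemble-learning item above; the base classifier (a shallow decision tree) and the toy data are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 20)), rng.normal(0.7, 1.0, (50, 20))])
y = np.array([0] * 50 + [1] * 50)

# Bagging: many weak classifiers trained on random subsets of the data, equal vote
bag = BaggingClassifier(DecisionTreeClassifier(max_depth=2),
                        n_estimators=50, max_samples=0.8, random_state=0)
print("bagging CV accuracy:", cross_val_score(bag, X, y, cv=5).mean())
```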

Page 34

Recommended reading
- Pereira et al., Machine learning classifiers and fMRI: A tutorial overview, NeuroImage 45, 2009
- Richiardi et al., Machine learning with brain graphs, IEEE Signal Processing Magazine 30, 2013
- Haufe et al., On the interpretation of weight vectors of linear models in multivariate neuroimaging, NeuroImage, in press

34

Page 35

Classifier selection

Is classifier A better than classifier B? Hypothesis testing framework:
"Do the two sets of decisions of classifiers A and B represent two different populations?"

Careful: the samples (sets of decisions) are dependent (same evaluation/testing dataset), and the measurement type is categorical binary (for a two-class problem)

⇒ Use the McNemar test
H0: "there is no difference between the accuracies of the two classifiers"

35

test sample:  1  2  3  4  5  6  7  8  9  10
A:            ✔  ✔  ✔  ✔  ✔  ✔  ✔  ✔  ✘  ✘
B:            ✔  ✔  ✘  ✔  ✔  ✔  ✘  ✔  ✔  ✘

Page 36

Compute relationships between classifier decisions

If H0 is true, we should have N01 = N10 = 0.5(N01 + N10)
We can measure the following test statistic based on the observed counts:

Then, compare $x^2$ to the critical value of $\chi^2$ (1 DOF) at a level of significance $\alpha$
(requires N01 + N10 > 25)

Selection with the McNemar test

36

        B ✔    B ✘
A ✔     N11    N10
A ✘     N01    N00

After [Dietterich, 1998]

$$x^2 = \frac{(|N_{01} - N_{10}| - 1)^2}{N_{01} + N_{10}}$$
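The McNemar statistic and its comparison against the chi-squared critical value (1 DOF) can be computed in a few lines; the disagreement counts below are made up for illustration.

```python
from scipy.stats import chi2

# Counts of disagreements between classifiers A and B on the same test set
N01 = 18    # A wrong, B correct
N10 = 35    # A correct, B wrong   (the test requires N01 + N10 > 25)

x2 = (abs(N01 - N10) - 1) ** 2 / (N01 + N10)
critical = chi2.ppf(1 - 0.05, df=1)       # chi-squared critical value at alpha = 0.05

print(f"x2 = {x2:.2f}, critical value = {critical:.2f}")
print("reject H0: accuracies differ" if x2 > critical else "fail to reject H0")
```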