Transcript of "Understanding multivariate pattern analysis for neuroimaging applications"

Page 1

Understanding multivariate pattern analysis for neuroimaging applications

Dimitri Van De Ville
Medical Image Processing Lab, Institute of Bioengineering, Ecole Polytechnique Fédérale de Lausanne
Department of Radiology & Medical Informatics, Université de Genève

http://miplab.epfl.ch/

April 23, 2015

Page 2

Overview

Brain decoding: motivation and applications
- Limitations of "regular SPM" (GLM)
- "Hall of fame" of brain decoding

Pattern recognition principles for brain decoding
- Basics
- Linear classifier
- Discriminative versus generative model
- Cross-validation: training phase, evaluation phase, testing phase
- Support vector machine (SVM): primal and dual form, kernel trick

2

Page 3

3

[Figure: from fMRI acquisition to GLM statistics]
Whole-brain scan: 2x2x2 mm^3, 20-30 slices, every 2-4 sec
~100'000 intracranial voxels, ~1'000 timepoints
GLM fit over time yields statistics (a t-value per voxel)

Page 4

Limitations of the GLM approach

Massively univariate approach
- Hypothesis testing is applied voxel-wise
- All voxels are treated independently, but are obviously not independent

Textbook multivariate approaches are not practical
- E.g., estimating a spatial covariance matrix (100'000 x 100'000) from 1'000 timepoints will be rank-deficient

fMRI data are complicated ➔ data-driven: let the data speak

Group statistics are great...
- ... but the ultimate goal is often to know how well the results generalize (prediction)
- Learn from training data, apply to the (unseen) test data

4

[Figure: condition 1 vs. condition 2 samples plotted in a two-voxel (voxel 1, voxel 2) feature space]

Page 5

Forward versus backwards inference

5

[Diagram: encoding with a generative model maps stimuli / conditions / disease / disorder to data; decoding with a discriminative model maps data back to stimuli / conditions / disease / disorder; labeled caricatures: phrenology (encoding), blind reading (decoding)]

Page 6

Exploiting “subtle” spatial patterns

[Cox et al., 2003; Haynes and Rees, 2006]

example in a 2D and a 9D feature space

Page 7

Emotions in voice-sensitive cortices

How is emotional information treated in auditory cortex?
- 22 healthy right-handed subjects
- 10 actors pronounced "ne kalibam sout molem" (anger, sadness, neutrality, relief, joy)
- normalization for mean sound intensity; rated by 24 (other) subjects

Classifier approach on auditory cortex
- localizer: human voices > animal sounds
- 20 stimuli for each category (5); 5 classifiers (one-versus-others)

7 [Ethofer, VDV, Scherer, Vuilleumier, Current Biology, 2009; collaboration with LABNIC/CISA, UniGE]

p<0.05 (FWE)

Page 8

8

[Figure: classification maps for Anger, Sadness, Neutral, Relief, Joy shown on axial slices z = 0, +3, +6; stimulus: "Ne kalibam sout molem"]
[Ethofer, VDV, Scherer, Vuilleumier, Current Biology, 2009]

Page 9

Decoding from the emotional brain

9 [Ethofer, VDV, Scherer, Vuilleumier, Current Biology, 2009]

Results for bilateral voice-sensitive cortices
- solid: smoothed (10 mm FWHM); dashed: non-smoothed; dotted: spatial average
- Chance level: 20%; classifier: linear SVM

Evidence that auditory cues for emotional processing are spatially encoded; bilateral representation

Comprehension of emotional prosody is crucial for social functioning; relevant to psychiatric disorders (schizophrenia, bipolar affective disorder, ...)

Page 10

Brain decoding of visual processing

10

- distributed patterns [Haxby et al., 2001]
- subtle patterns in V1 [Kamitani and Tong, 2005, 2006; Haynes and Rees, 2006]
- advanced models for receptive fields [Miyawaki et al., 2008]

Page 11

11

Decoding video sequences

reverse statistical inference ("decode" the stimulus from the data)

[Nishimoto et al, 2011]

Page 12

Basics of pattern recognition

Machine learning aims to learn the right model from a series of labeled examples (supervised learning)

- learning/training phase: from examples {(x_1, y_1), ..., (x_N, y_N)}, learn a model f such that f(x_i) = y_i
- testing phase: for unseen data x_s, predict the label f(x_s)

12

[Diagram: feature vector x_i (D features) → model / function / machine / classifier → label y_i (e.g., -1/+1, control/patient)]

Page 13

Classification as a regression problem

Linear model:

$y = \mathbf{w}^T X + b + \text{noise}$

where the vector $(\mathbf{w}, b)$ characterizes the full model

Learning the model by least-squares fitting:
$$(\mathbf{w}, b) = \arg\min_{(\mathbf{w},b)} \left\| y - \mathbf{w}^T X - b \right\|^2$$

Underdetermined ($D > N$):
$$(\mathbf{w}, b) = \arg\min_{(\mathbf{w},b)} \left\| y - \mathbf{w}^T X - b \right\|^2 + \lambda \|\mathbf{w}\|^2$$

solution: $\mathbf{w} = \mathrm{pinv}(X^T)\, y^T$

Basics of pattern recognition

13

[Diagram: feature matrix X (N instances x D features), labels y, model]

Page 14

Basics of pattern recognition

14

[Diagram: feature matrix X (N instances x D features), labels y, model]

Evaluate classifier on unseen data

Avoid overfitting:

increasingly complex models will fit (any) training data better!

Classify new data $\mathbf{x}_0$ by comparing $\mathbf{w}^T\mathbf{x}_0 + b$ against 0
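As an illustration of this regression view (not part of the original slides), here is a minimal NumPy sketch that absorbs the bias into an extra constant feature, takes the minimum-norm least-squares solution via the pseudoinverse (as in w = pinv(X^T) y^T above, up to the transposed data layout), and classifies an unseen instance by the sign of w^T x_0 + b. The toy data and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N instances, D features (D > N, as in fMRI), labels -1/+1
N, D = 20, 100
y = np.where(np.arange(N) < N // 2, -1.0, 1.0)
X = rng.standard_normal((N, D))
X[y > 0, 0] += 2.0                      # make feature (voxel) 0 informative

# Absorb the bias b as an extra constant feature, then take the
# minimum-norm least-squares solution with the pseudoinverse
Xb = np.hstack([X, np.ones((N, 1))])    # N x (D+1)
wb = np.linalg.pinv(Xb) @ y             # (D+1)-vector: w followed by b
w, b = wb[:-1], wb[-1]

# Classify an unseen instance x0 by comparing w^T x0 + b against 0
x0 = rng.standard_normal(D)
decision = w @ x0 + b
print("decision value:", decision, "-> class", 1 if decision > 0 else -1)
```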

Page 15

Interpretation of the weight vector

15

Training: examples of class 1 and examples of class 2, described by two features (voxel 1, voxel 2), yield the weight vector w: w1 = +5, w2 = -3

Testing: new example x1 = 0.5, x2 = 0.8

predicted label: (w1·x1 + w2·x2) + b = (+5·0.5 - 3·0.8) + 0 = 0.1; positive value -> class 1

• spatial representation of decision function

• multivariate pattern: no local inference!
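The worked example above translates directly into a couple of lines of NumPy; the numbers (w = (+5, -3), b = 0, new example (0.5, 0.8)) are the ones from the slide.

```python
import numpy as np

w = np.array([5.0, -3.0])     # weight map over (voxel 1, voxel 2)
b = 0.0
x_new = np.array([0.5, 0.8])  # new example to be classified

decision = w @ x_new + b      # (+5 * 0.5) + (-3 * 0.8) + 0 = 0.1
print(decision, "-> class 1" if decision > 0 else "-> class 2")
```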

Page 16

Basics of pattern recognition

16

Linear model projects an instance onto a line parallel to w
Decision boundary is a hyperplane in the feature space
If feature dimensions correspond to voxels, then w corresponds to a map
Bias b allows shifting the decision boundary along w

[adapted from Jaakkola]

[Figure: separating hyperplane; decision values -1, 0, +1 along the direction of w]

Page 17

Classification and decision theory

Assume class-conditional densities $p(\mathbf{x}|y)$ and class probabilities $P(y)$ are known

Minimizing the overall probability of error corresponds to
$$y_0 = \arg\max_y P(y|\mathbf{x}_0) = \arg\max_y \frac{p(\mathbf{x}_0|y)P(y)}{p(\mathbf{x}_0)} \;\text{(Bayes' rule)}\; = \arg\max_y p(\mathbf{x}_0|y)P(y)$$

Basics of pattern recognition

17

[Diagram: feature matrix X (N instances x D features), labels y, model; posterior P(y|x), likelihood p(x|y)]

Page 18

Wonderful link between the discriminative model (posterior) and the generative model (likelihood):
$$P(y|\mathbf{x}) \propto p(\mathbf{x}|y)P(y)$$

Discriminative model; e.g., linear classifier, SVM

Generative model; e.g., naïve Bayes
$$p(\mathbf{x}|y) = \prod_{i=1}^{D} p(x_i|y)$$
where the features are modeled as independent random variables

Basics of pattern recognition

18

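To make the generative route concrete, a minimal Gaussian naïve Bayes classifier can be written directly from p(x|y) = prod_i p(x_i|y) and the argmax of p(x|y)P(y). This sketch is illustrative (not from the slides) and assumes Gaussian class-conditional densities per feature.

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Estimate the class priors and per-feature Gaussian parameters."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),        # prior P(y = c)
                     Xc.mean(axis=0),          # per-feature means
                     Xc.var(axis=0) + 1e-6)    # per-feature variances (regularized)
    return params

def predict_naive_bayes(params, x):
    """Return argmax_y log p(x|y) + log P(y), with p(x|y) = prod_i N(x_i; mu_i, var_i)."""
    scores = {}
    for c, (prior, mu, var) in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        scores[c] = log_lik + np.log(prior)
    return max(scores, key=scores.get)

# Toy two-class example
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 5)), rng.normal(1.5, 1.0, (30, 5))])
y = np.array([0] * 30 + [1] * 30)
model = fit_naive_bayes(X, y)
print(predict_naive_bayes(model, np.full(5, 1.4)))   # expected: class 1
```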

Page 19

Basics of pattern recognition

19

[Diagram: feature matrix X (N instances x D features), labels y, model]

[Diagram: split the full dataset into train (TR), evaluation (EV), and test (TE) sets; train on TR, evaluate on EV and tune until the error is OK, then report the final performance measure on TE]

Leave-one-out cross-validation

Repeatedly train on all instances except one, evaluate on left-out instance

Accuracy statistics: class 1 = $\frac{A}{A+B}$; class 2 = $\frac{D}{C+D}$; average = $\frac{A+D}{A+B+C+D}$

Sensitivity = $\frac{D}{C+D} = \frac{TP}{FN+TP}$; specificity = $\frac{A}{A+B} = \frac{TN}{FP+TN}$

Confusion table (rows: truth; columns: estimated):

                 estimated class 1   estimated class 2
truth class 1           A                   B
truth class 2           C                   D
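A leave-one-out loop with the confusion-matrix statistics of the table might look as follows. This is a sketch, not the pipeline used on the slides: it uses scikit-learn's LinearSVC as the classifier and treats class 2 as the "positive" class, so that D = TP and A = TN as above.

```python
import numpy as np
from sklearn.svm import LinearSVC

def loo_confusion(X, y):
    """Leave-one-out CV: train on all instances except one, predict the left-out one."""
    N = len(y)
    A = B = C = D = 0                      # truth x estimated counts, as in the table
    for i in range(N):
        train = np.arange(N) != i
        clf = LinearSVC(C=1.0).fit(X[train], y[train])
        pred = clf.predict(X[i:i + 1])[0]
        if y[i] == 1:                      # truth: class 1
            A += pred == 1                 # A = TN
            B += pred == 2                 # B = FP
        else:                              # truth: class 2
            C += pred == 1                 # C = FN
            D += pred == 2                 # D = TP
    accuracy = (A + D) / N
    sensitivity = D / (C + D)              # TP / (FN + TP)
    specificity = A / (A + B)              # TN / (FP + TN)
    return accuracy, sensitivity, specificity

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (15, 10)), rng.normal(1.0, 1.0, (15, 10))])
y = np.array([1] * 15 + [2] * 15)
print(loo_confusion(X, y))
```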

Page 20

Testing classifiers: the ROC curve

Receiver Operating Characteristic (ROC) curve
- Allows one to see the compromise over the operating range of a classifier
- Can be used for classifier comparison; e.g., compute the Area Under the Curve (AUC) as a summary measure of performance

20

[Plot: ROC curves, sensitivity versus 1 - specificity; classifier 1: AUC 0.79, classifier 2: AUC 0.74]
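A ROC curve is traced by sweeping the decision threshold over the classifier's continuous output scores. The sketch below (illustrative, not the slide's data) computes sensitivity and 1 - specificity at every threshold and the AUC by trapezoidal integration.

```python
import numpy as np

def roc_auc(scores, labels):
    """labels in {0, 1}; higher score means more 'class 1'. Returns FPR, TPR, AUC."""
    order = np.argsort(-scores)                        # sweep threshold from high to low
    lab = labels[order]
    tpr = np.concatenate([[0.0], np.cumsum(lab) / lab.sum()])            # sensitivity
    fpr = np.concatenate([[0.0], np.cumsum(1 - lab) / (1 - lab).sum()])  # 1 - specificity
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)                # trapezoidal area
    return fpr, tpr, auc

rng = np.random.default_rng(0)
labels = np.array([0] * 50 + [1] * 50)
scores = labels + rng.normal(0.0, 1.0, 100)            # noisy scores, AUC well above 0.5
print("AUC =", roc_auc(scores, labels)[2])
```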

Page 21

Learning in high dimensions

Dimensionality of the feature vector (e.g., #voxels) often exceeds the number of instances
- the model is underdetermined and requires additional assumptions (e.g., regularization)

Other solutions
- feature selection & extraction
  - gray-matter mask; region-of-interest (ROI); atlas regions
  - only the most-active voxels (selected inside the training phase!)
  - searchlight approach [Kriegeskorte et al., 2006]: the ROI is 'sliding' around the brain; multivariate capabilities (long-range) are limited
- dimensionality reduction: PCA, ICA, ... (see the sketch below)
- kernel methods

21
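As a concrete instance of the dimensionality-reduction item above, the sketch below fits PCA on the training data only and applies it to the test data before classification; the scikit-learn pipeline and the toy dimensions are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 5000))            # 60 scans, 5000 voxels (D >> N)
y = np.array([0] * 30 + [1] * 30)
X[y == 1, :50] += 0.5                          # weak signal in 50 voxels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit the reduction on the training set only, then apply it to the test set
pca = PCA(n_components=20).fit(X_tr)
clf = LinearSVC(C=1.0).fit(pca.transform(X_tr), y_tr)
print("test accuracy:", clf.score(pca.transform(X_te), y_te))
```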

Page 22

Kernel matrix

Kernel function K
- similarity measure (scalar) between two feature vectors
- simplest example: dot product (linear kernel)
- e.g., dot product of (4, 1) and (-2, 3): (4 · (-2)) + (1 · 3) = -5

N brain scans (feature vectors)

22

[Figure: example feature vectors (4, 1) and (-2, 3); N x N kernel matrix of pairwise similarities between instances (brain scans)]
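For a linear kernel, the kernel matrix is simply the matrix of pairwise dot products between instances; a minimal sketch with illustrative shapes:

```python
import numpy as np

# The slide's example vectors
print(np.dot([4, 1], [-2, 3]))        # 4*(-2) + 1*3 = -5

# N brain scans, each flattened to a D-dimensional feature vector
rng = np.random.default_rng(0)
N, D = 45, 10000
X = rng.standard_normal((N, D))

K = X @ X.T                            # N x N linear-kernel (Gram) matrix
print(K.shape)                         # (45, 45): similarity between instances
```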

Page 23

Maximum-margin classifier

Assume data are linearly separable:
$$\exists\, \mathbf{w} \in \mathbb{R}^D,\ \exists\, b \in \mathbb{R} : \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) > 0,\quad i = 1, \ldots, N$$

w and b can be rescaled such that the closest points to the hyperplane satisfy $\left|\mathbf{w}^T\mathbf{x}_i + b\right| = 1$

Distance $\rho_{\mathbf{x}}$ of a point x to the hyperplane is
$$\rho_{\mathbf{x}} = \frac{\left|\mathbf{w}^T\mathbf{x} + b\right|}{\|\mathbf{w}\|}$$

Margin of the hyperplane = $1/\|\mathbf{w}\|$

Support vector machine

23 [Vapnik, 1963, 1992]

HARD MARGIN

[Figure: separating hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$ with margin hyperplanes $\mathbf{w}^T\mathbf{x} + b = \pm 1$]

Page 24

Quadratic programming optimization problem:
$$\arg\min_{(\mathbf{w},b)} \frac{1}{2}\mathbf{w}^T\mathbf{w} \quad \text{subject to } y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1,\quad i = 1, \ldots, N$$

The constrained problem can be formulated as an unconstrained one using Lagrange multipliers:
$$\arg\min_{(\mathbf{w},b)} \max_{\alpha \geq 0} \underbrace{\frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left( y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \right)}_{L}$$

Points for which $y_i(\mathbf{w}^T\mathbf{x}_i + b) > 1$ do not matter and should have $\alpha_i = 0$

Points on the margin have $y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$ and are called support vectors; these have $\alpha_i > 0$ and thus
$$\frac{\partial L}{\partial \alpha_i} = y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 = 0 \;\rightarrow\; b = y_i - \mathbf{w}^T\mathbf{x}_i$$

Support vector machine (primal form)

24

convex criterion so no local minima

Page 25

Lagrangian functional:
$$\arg\min_{(\mathbf{w},b)} \max_{\alpha \geq 0} \underbrace{\frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left( y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \right)}_{L}$$

Solving L for $\alpha_i$, $b$, $\mathbf{w}$ leads to
$$\frac{\partial L}{\partial \alpha_i} = y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 = 0 \;\rightarrow\; b = y_i - \mathbf{w}^T\mathbf{x}_i \quad \text{(for a support vector)}$$
$$\frac{\partial L}{\partial b} = -\sum_{i=1}^{N} y_i\alpha_i = 0$$
$$\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i = 0 \;\rightarrow\; \mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i$$

Support vector machine (primal form)

25

Page 26

Lagrangian functional can be rewritten as
$$L = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i y_i \underbrace{\mathbf{x}_i^T\mathbf{x}_j}_{K(\mathbf{x}_i,\mathbf{x}_j)} y_j \alpha_j - \sum_{i=1}^{N}\alpha_i, \quad \text{s.t. } \sum_{i=1}^{N} y_i\alpha_i = 0$$

Minimization of this dual form requires finding $\alpha_i \geq 0$ in an N-dimensional space instead of a D-dimensional one!

Weight vector w can be found as a linear combination of the instances
$$\mathbf{w} = \sum_{i'=1}^{N_{SV}} \alpha_{i'} y_{i'} \mathbf{x}_{i'}$$
where only the $N_{SV}$ support vectors (with $\alpha_{i'} > 0$) contribute!

Prediction of an unseen data point x according to
$$f(\mathbf{x}) = \operatorname{sign}\!\left(\sum_{i'=1}^{N_{SV}} y_{i'}\alpha_{i'}\mathbf{x}_{i'}^T\mathbf{x} + b\right)$$

Support vector machine (dual form)

26

Efficient: "sparse" solution
Limited number of support vectors; the rest of the data can be ignored
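A hedged scikit-learn sketch (not from the slides): after fitting a linear SVC, the products alpha_i * y_i and the support vectors are exposed, so the prediction formula above can be checked against the library's decision function. Data and names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (20, 2)), rng.normal(1.0, 1.0, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
alpha_y = clf.dual_coef_[0]            # alpha_i * y_i for the support vectors only
sv = clf.support_vectors_              # the N_SV support vectors
b = clf.intercept_[0]

# f(x) = sign( sum_i' y_i' alpha_i' x_i'^T x + b )
x_new = np.array([0.3, -0.2])
decision = alpha_y @ (sv @ x_new) + b
print(decision, clf.decision_function([x_new])[0])   # the two values should match
print("predicted class:", int(np.sign(decision)))
```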

Page 27

Soft-margin classifier

Introduce slack variables $\xi_i$ to relax the separation:
$\xi_i > 0 \;\rightarrow\; \mathbf{x}_i$ has margin less than 1

New optimization problem (primal):
$$\arg\min_{(\mathbf{w},b),\,\xi_i \geq 0} \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{N}\xi_i \quad \text{subject to } y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i$$

which leads to the dual form:
$$\arg\min_{\alpha \geq 0} \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i y_i \mathbf{x}_i^T\mathbf{x}_j\, y_j\alpha_j - \sum_{i=1}^{N}\alpha_i \quad \text{subject to } \sum_{i=1}^{N} y_i\alpha_i = 0 \text{ and } 0 \leq \alpha_i \leq C$$

Support vector machine

27

SOFT MARGIN
[Figure: soft-margin separating hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$ with margins $\pm 1$ and slack variables $\xi_1$, $\xi_2$ for margin violations]

"box constraint": allows for capacity control
- C = 0: very large safety margin
- C = Inf: close to the hard-margin constraint

[Shawe-Taylor and Cristianini, 2004]

Page 28

Linear classification (by hyperplane) can be limiting

Project into an (even) higher-dimensional feature space V where linear separation corresponds to non-linear separation in the original space

Kernel trick

28

Page 29

Linear classification (by hyperplane) can be limiting

Project into an (even) higher-dimensional feature space V where linear separation corresponds to non-linear separation in the original space

However, this new feature space V is implicit, because in practice we use generalized kernel functions:
$$K(\mathbf{x}_1,\mathbf{x}_2) = \langle \varphi(\mathbf{x}_1), \varphi(\mathbf{x}_2)\rangle_V$$
where $\langle\cdot,\cdot\rangle_V$ should be a proper inner product, that is, $K(\cdot,\cdot)$ should satisfy Mercer's condition:
$$\sum_{i=1}^{N}\sum_{j=1}^{N} K(\mathbf{x}_i,\mathbf{x}_j)\, c_i c_j \geq 0$$
for all finite sequences of points $(\mathbf{x}_1, \ldots, \mathbf{x}_N)$ and real-valued coefficients $(c_1, \ldots, c_N)$. This also means that the kernel matrix $[K(\mathbf{x}_i,\mathbf{x}_j)]_{i,j}$ should be positive semi-definite (i.e., all eigenvalues $\geq 0$)

Kernel trick

29
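Mercer's condition can be checked numerically for a given kernel matrix by verifying that all its eigenvalues are non-negative; a small illustrative sketch with a Gaussian RBF kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
gamma = 0.5

# Gaussian RBF kernel matrix: K_ij = exp(-gamma * ||x_i - x_j||^2)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-gamma * sq_dists)

# Positive semi-definite <=> all eigenvalues >= 0 (up to numerical precision)
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())
```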

Page 30

Commonly used kernel functions $K(\mathbf{x}_1,\mathbf{x}_2) = \langle\varphi(\mathbf{x}_1),\varphi(\mathbf{x}_2)\rangle$:
- Polynomial (homogeneous): $K(\mathbf{x}_1,\mathbf{x}_2) = (\mathbf{x}_1^T\mathbf{x}_2)^p$
- Polynomial (inhomogeneous): $K(\mathbf{x}_1,\mathbf{x}_2) = (\mathbf{x}_1^T\mathbf{x}_2 + 1)^p$
- Gaussian radial basis kernel: $K(\mathbf{x}_1,\mathbf{x}_2) = \exp(-\gamma\|\mathbf{x}_1-\mathbf{x}_2\|^2),\ \gamma > 0$

Example: polynomial kernel $K(\mathbf{x}_1,\mathbf{x}_2) = (\mathbf{x}_1^T\mathbf{x}_2)^2$

For D = 2, developing the kernel function leads to (binomial theorem)
$$K(\mathbf{x}_1,\mathbf{x}_2) = (x_{1,1}x_{2,1} + x_{1,2}x_{2,2})^2 = (x_{1,1}x_{2,1})^2 + 2\,x_{1,1}x_{2,1}x_{1,2}x_{2,2} + (x_{1,2}x_{2,2})^2 = \left\langle [x_{1,1}^2 \;\; \sqrt{2}\,x_{1,1}x_{1,2} \;\; x_{1,2}^2],\ [x_{2,1}^2 \;\; \sqrt{2}\,x_{2,1}x_{2,2} \;\; x_{2,2}^2] \right\rangle$$

From which we can identify $\varphi: (x_1, x_2) \mapsto (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)$

For general D, this kernel maps to $\binom{D+1}{2} = \frac{D(D+1)}{2}$ dimensions (multinomial theorem)

Kernel trick

30
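The identification of phi for the homogeneous polynomial kernel with p = 2 and D = 2 can be verified numerically; the two numbers printed below agree, showing that the kernel computes the inner product in the 3-dimensional feature space implicitly (the example vectors are arbitrary).

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x1, x2) = (x1^T x2)^2 with D = 2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x1 = np.array([0.7, -1.2])
x2 = np.array([2.0, 0.5])

print((x1 @ x2) ** 2)          # kernel evaluated directly in the original 2-D space
print(phi(x1) @ phi(x2))       # inner product in the mapped 3-D feature space
```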

Page 31

The transform $\varphi$ is implicit, but the weight vector is defined as
$$\mathbf{w} = \sum_i \alpha_i y_i \varphi(\mathbf{x}_i)$$

Dot products with w can be computed using the kernel trick:
$$\mathbf{w}^T\varphi(\mathbf{x}) = \sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x})$$

However, in general there exists no $\mathbf{w}'$ such that
$$\mathbf{w}^T\varphi(\mathbf{x}) = K(\mathbf{w}', \mathbf{x})$$
which means there is no "equivalent weight map"

Kernel trick

31

Page 32

Neuroimaging applications often work best with linear kernels
- Very high-dimensional data with small sample sizes (D >> N); the increased complexity of non-linear kernels (more parameters) cannot be turned into an advantage
- Linear models have less risk of overfitting
- The weight vector of a linear model has a direct interpretation as an image, i.e., a "discrimination map"

Linear and non-linear kernels

32

Page 33

[Figure: trade-off between high reproducibility / low predictability and low reproducibility / high predictability]

Model complexity (no free lunch)

Ensemble learning: combine many (weak) classifiers to get better results (see the sketch below)
- bagging: classifiers trained on random training subsets, equal vote
- boosting: adding classifiers that deal with previously misclassified data

What’s the best classifier?

33 [Hastie et al., 2009]
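A minimal bagging sketch with scikit-learn, referenced from the ensemble-learning item above; the base classifier (a shallow decision tree) and the toy data are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 20)), rng.normal(0.7, 1.0, (50, 20))])
y = np.array([0] * 50 + [1] * 50)

# Bagging: many weak classifiers trained on random subsets of the data, equal vote
bag = BaggingClassifier(DecisionTreeClassifier(max_depth=2),
                        n_estimators=50, max_samples=0.8, random_state=0)
print("bagging CV accuracy:", cross_val_score(bag, X, y, cv=5).mean())
```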

Page 34

Recommended reading
- Pereira et al., Machine learning classifiers and fMRI: A tutorial overview, NeuroImage 45, 2009
- Richiardi et al., Machine learning with brain graphs, IEEE Signal Processing Magazine 30, 2013
- Haufe et al., On the interpretation of weight vectors of linear models in multivariate neuroimaging, NeuroImage, in press

34

Page 35

Classifier selection

Is classifier A better than classifier B? Hypothesis testing framework:
"Do the two sets of decisions of classifiers A and B represent two different populations?"

Careful: the samples (sets of decisions) are dependent (same evaluation/testing dataset), and the measurement type is categorical binary (for a two-class problem)

⇒ Use the McNemar test
H0: "there is no difference between the accuracies of the two classifiers"

35

test sample:  1  2  3  4  5  6  7  8  9  10
A:            ✔  ✔  ✔  ✔  ✔  ✔  ✔  ✔  ✘  ✘
B:            ✔  ✔  ✘  ✔  ✔  ✔  ✘  ✔  ✔  ✘

Page 36

Compute relationships between classifier decisions

If H0 is true, we should have N01 = N10 = 0.5(N01 + N10)
We can measure the following test statistic based on the observed counts:

Then, compare $x^2$ to the critical value of $\chi^2$ (1 DOF) at a level of significance $\alpha$
(requires N01 + N10 > 25)

Selection with the McNemar test

36

        B ✔    B ✘
A ✔     N11    N10
A ✘     N01    N00

After [Dietterich, 1998]

$$x^2 = \frac{(|N_{01} - N_{10}| - 1)^2}{N_{01} + N_{10}}$$
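The McNemar statistic and its comparison against the chi-squared critical value (1 DOF) can be computed in a few lines; the disagreement counts below are made up for illustration.

```python
from scipy.stats import chi2

# Counts of disagreements between classifiers A and B on the same test set
N01 = 18    # A wrong, B correct
N10 = 35    # A correct, B wrong   (the test requires N01 + N10 > 25)

x2 = (abs(N01 - N10) - 1) ** 2 / (N01 + N10)
critical = chi2.ppf(1 - 0.05, df=1)       # chi-squared critical value at alpha = 0.05

print(f"x2 = {x2:.2f}, critical value = {critical:.2f}")
print("reject H0: accuracies differ" if x2 > critical else "fail to reject H0")
```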