Understanding multivariate pattern analysis for neuroimaging applications
Dimitri Van De Ville
Medical Image Processing Lab, Institute of Bioengineering, Ecole Polytechnique Fédérale de Lausanne
Department of Radiology & Medical Informatics, Université de Genève
http://miplab.epfl.ch/
April 23, 2015
Overview
Brain decoding: motivation and applications
    Limitations of “regular SPM” (GLM)
    “Hall of fame” of brain decoding
Pattern recognition principles for brain decoding
    Basics
    Linear classifier
    Discriminative versus generative model
    Cross-validation: training phase, evaluation phase, testing phase
    Support vector machine (SVM): primal and dual form, kernel trick
[Figure: GLM analysis pipeline. A whole-brain scan (2x2x2 mm3, 20-30 slices, every 2-4 sec) gives roughly 100'000 intracranial voxels and 1'000 timepoints; voxel-wise GLM statistics yield a t-value map.]
Limitations of the GLM approach
Massively univariate approach
    Hypothesis testing is applied voxel-wise
    All voxels are treated independently, but are obviously not independent
Textbook multivariate approaches are not practical
    E.g., estimating a spatial covariance matrix (100'000 x 100'000) from 1'000 timepoints will be rank-deficient
    fMRI data are complicated ➔ data-driven: let the data speak
Group statistics are great...
    ... but the ultimate goal is often to know how well the results generalize (prediction)
    Learn from training data, apply to the (unseen) test data
[Figure: examples from condition 1 and condition 2 plotted in a two-voxel feature space (voxel 1 versus voxel 2).]
Forward versus backward inference
[Diagram: encoding versus decoding. Encoding (generative model): from stimuli, conditions, disease, disorder, ... to data (illustrated with phrenology). Decoding (discriminative model): from data back to stimuli, conditions, disease, disorder, ... (illustrated with “blind reading”).]
Exploiting “subtle” spatial patterns
[Cox et al., 2003; Haynes and Rees, 2006]
example in 2D; 9D feature space
Emotions in voice-sensitive cortices
How is emotional information treated in auditory cortex?
22 healthy right-handed subjects
10 actors pronounced “ne kalibam sout molem” (anger, sadness, neutrality, relief, joy)
Normalization for mean sound intensity; rated by 24 (other) subjects
Classifier approach on auditory cortex
Localizer: human voices > animal sounds
20 stimuli for each category (5); 5 classifiers (one-versus-others)
[Ethofer, VDV, Scherer, Vuilleumier, Current Biology, 2009; collaboration with LABNIC/CISA, UniGE]
p<0.05 (FWE)
[Figure: axial slices at z = 0, z = +3, z = +6 for Anger, Sadness, Neutral, Relief, Joy; spoken stimulus: “Ne kalibam sout molem”.]
[Ethofer, VDV, Scherer, Vuilleumier, Current Biology, 2009]
Decoding from the emotional brain
[Ethofer, VDV, Scherer, Vuilleumier, Current Biology, 2009]
Results for bilateral voice-sensitive cortices
solid: smoothed (10 mm FWHM); dashed: non-smoothed; dotted: spatial average
Chance level: 20%; classifier: linear SVM
Evidence that auditory cues for emotional processing are spatially encoded; bilateral representation
Comprehension of emotional prosody is crucial for social functioning and is impaired in psychiatric disorders (schizophrenia, bipolar affective disorder, ...)
Brain decoding of visual processing
distributed patterns [Haxby et al., 2001]
subtle patterns in V1 [Kamitani and Tong, 2005, 2006; Haynes and Rees, 2006]
advanced models for receptive fields [Miyawaki et al., 2008]
Decoding video sequences
reverse statistical inference (“decode” the stimulus from the data)
[Nishimoto et al., 2011]
Basics of pattern recognition
Machine learning aims to get the model right based on a series of examples (supervised learning); a minimal code sketch follows below
learning/training phase: from examples {(x_1, y_1), ..., (x_N, y_N)}, learn a model f such that f(x_i) = y_i
testing phase: for unseen data x_s, predict the label f(x_s)
[Diagram: feature vector x_i with D features -> model / function / machine / classifier -> label y_i (e.g., -1/+1, control/patient).]
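As a minimal sketch of this train/test workflow (not from the original slides; the data, labels, and the choice of a linear SVM are placeholder assumptions), in Python with scikit-learn:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))                 # 100 instances, 50 features (e.g., voxels)
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)   # synthetic labels in {-1, +1}

# training phase: from examples (x_i, y_i), learn a model f with f(x_i) ~ y_i
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LinearSVC().fit(X_train, y_train)

# testing phase: predict labels f(x_s) for unseen data
print("test accuracy:", clf.score(X_test, y_test))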
Classification as a regression problem
Linear model:
    y = w^T X + b + noise
where the vector (w, b) characterizes the full model
Learning the model by least-squares fitting:
    (w, b) = arg min_{(w,b)} || y - w^T X - b ||^2
Underdetermined (D > N):
    (w, b) = arg min_{(w,b)} || y - w^T X - b ||^2 + λ ||w||^2
Solution: w = pinv(X^T) y^T
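A small numpy sketch of the regularized least-squares fit above (a sketch only; the synthetic data, the value of λ, and storing instances as rows are assumptions made here):

import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 200                          # underdetermined: more features than instances
X = rng.standard_normal((N, D))         # rows = instances, columns = features
y = np.where(X[:, 0] > 0, 1.0, -1.0)

lam = 1.0                               # regularization weight (lambda)
Xb = np.hstack([X, np.ones((N, 1))])    # absorb the bias b as an extra column
R = np.diag(np.r_[np.ones(D), 0.0])     # penalize ||w||^2 only, not the bias
wb = np.linalg.solve(Xb.T @ Xb + lam * R, Xb.T @ y)   # closed-form ridge solution
w, b = wb[:-1], wb[-1]
print("training sign agreement:", np.mean(np.sign(Xb @ wb) == y))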
Basics of pattern recognition
[Diagram: feature matrix X (N instances x D features), labels y, model.]
Basics of pattern recognition
Evaluate classifier on unseen data
Avoid overfitting: increasingly complex models will fit (any) training data ever better!
Classify new data x_0 by comparing w^T x_0 + b against 0
Interpretation of the weight vector
Training: examples of class 1 and examples of class 2, each described by two features (voxel 1, voxel 2), yield a weight vector w with w_1 = +5, w_2 = -3
Testing: new example with x_1 = 0.5, x_2 = 0.8
Predicted label: (w_1 x_1 + w_2 x_2) + b = (+5 * 0.5 - 3 * 0.8) + 0 = 0.1; positive value -> class 1
• spatial representation of decision function
• multivariate pattern: no local inference!
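The same toy computation in a few lines of numpy (weights and the test pattern are taken from the slide above):

import numpy as np

w = np.array([5.0, -3.0])        # weight vector learned from the two classes
b = 0.0
x_new = np.array([0.5, 0.8])     # new example: voxel 1 = 0.5, voxel 2 = 0.8

score = w @ x_new + b            # 5*0.5 - 3*0.8 + 0 = 0.1
print("class 1" if score > 0 else "class 2")   # positive value -> class 1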
Basics of pattern recognition
Linear model projects an instance onto a line parallel to w
Decision boundary is a hyperplane in the feature space
If feature dimensions correspond to voxels, then w corresponds to a map
Bias b allows the decision boundary to be shifted along w
[adapted from Jaakkola]
[Figure: separating hyperplane (decision values -1, 0, +1) in the feature space.]
Classification and decision theory
Assume class-conditional densities p(x|y) and class probabilities P(y) are known
Minimizing the overall probability of error corresponds to
    y_0 = arg max_y P(y|x_0)
        = arg max_y p(x_0|y) P(y) / p(x_0)    (Bayes' rule)
        = arg max_y p(x_0|y) P(y)
Basics of pattern recognition
Wonderful link between the discriminative model (posterior P(y|x)) and the generative model (likelihood p(x|y)):
    P(y|x) ∝ p(x|y) P(y)
Discriminative model: e.g., linear classifier, SVM
Generative model: e.g., naïve Bayes, with
    p(x|y) = ∏_{i=1}^{D} p(x_i|y)
where the features are modeled as independent random variables
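A sketch of the generative route with scikit-learn's Gaussian naïve Bayes, which models each feature independently per class and predicts by maximizing p(x|y) P(y) (the data here are synthetic; the Gaussian class-conditionals are this example's choice, not something fixed by the slide):

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (40, 10)),     # class -1
               rng.normal(1.0, 1.0, (40, 10))])    # class +1
y = np.r_[-np.ones(40), np.ones(40)]

gnb = GaussianNB().fit(X, y)     # p(x|y) = prod_i p(x_i|y), one Gaussian per feature and class
print(gnb.predict(X[:2]))        # arg max_y p(x|y) P(y), via Bayes' rule
print(gnb.predict_proba(X[:2]))  # posterior P(y|x)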
Basics of pattern recognition
Basics of pattern recognition
[Diagram: split the full dataset into Train (TR), Evaluate (EV), and Test (TE) sets; train on TR, evaluate on EV, tune until the error is OK, then report the final performance measure on TE.]
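A sketch of that protocol: tune on an evaluation split, keep a final test split untouched until the end (scikit-learn; the synthetic data, split sizes, and the grid of C values are arbitrary assumptions):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.standard_normal((120, 30))
y = np.where(X[:, 0] > 0, 1, -1)

# split the full dataset into Train (TR), Evaluate (EV) and Test (TE) sets
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_ev, X_te, y_ev, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# tune on EV: keep the hyperparameter with the best evaluation accuracy
best_C = max((0.01, 0.1, 1.0, 10.0),
             key=lambda C: LinearSVC(C=C).fit(X_tr, y_tr).score(X_ev, y_ev))

# final performance measure on the untouched test set
final = LinearSVC(C=best_C).fit(np.vstack([X_tr, X_ev]), np.r_[y_tr, y_ev])
print("chosen C:", best_C, " test accuracy:", final.score(X_te, y_te))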
Leave-one-out cross-validation
Repeatedly train on all instances except one, evaluate on left-out instance
Confusion counts (rows = truth, columns = estimated):
                    estimated class 1    estimated class 2
    true class 1          A                    B
    true class 2          C                    D

Accuracy statistics: class 1 = A/(A+B); class 2 = D/(C+D); average = (A+D)/(A+B+C+D)
Sensitivity = D/(C+D) = TP/(FN+TP); specificity = A/(A+B) = TN/(FP+TN)
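A sketch of leave-one-out cross-validation and the counts above, using scikit-learn (synthetic data; the choice of a linear SVM and of class +1 as the "positive" class are assumptions of this example):

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 100))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# repeatedly train on all instances except one, predict the left-out instance
y_pred = cross_val_predict(LinearSVC(), X, y, cv=LeaveOneOut())

# rows = truth, columns = estimated, ordered (class 1 = -1, class 2 = +1): [[A, B], [C, D]]
(A, B), (C, D) = confusion_matrix(y, y_pred, labels=[-1, 1])
print("accuracy   :", (A + D) / (A + B + C + D))
print("sensitivity:", D / (C + D))   # TP / (FN + TP)
print("specificity:", A / (A + B))   # TN / (FP + TN)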
Testing classifiers: the ROC curve
Receiver Operating Characteristic (ROC) curve
Allows one to see the compromise over the operating range of a classifier
Can be used for classifier comparison, e.g., by computing the Area Under the Curve (AUC) as a summary measure of performance
[Plot: ROC curves, sensitivity versus 1-specificity; classifier 1: AUC 0.79, classifier 2: AUC 0.74.]
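A sketch of computing an ROC curve and its AUC from continuous classifier scores (the scores here are synthetic stand-ins for decision values such as w^T x + b):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.r_[np.zeros(50), np.ones(50)]
scores = np.r_[rng.normal(0, 1, 50), rng.normal(1, 1, 50)]   # classifier decision values

fpr, tpr, thresholds = roc_curve(y_true, scores)   # fpr = 1 - specificity, tpr = sensitivity
print("AUC:", roc_auc_score(y_true, scores))       # area under the ROC curve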
Learning in high dimensions
Dimensionality of the feature vector (e.g., #voxels) often exceeds the number of instances
    model is underdetermined and requires additional assumptions (e.g., regularization)
Other solutions
    feature selection & extraction
        gray-matter mask; region-of-interest (ROI); atlas regions
        only most-active voxels (inside the training phase! see the sketch below)
    searchlight approach [Kriegeskorte et al., 2006]
        ROI is 'sliding' around the brain
        multivariate capabilities (long-range) are limited
    dimensionality reduction: PCA, ICA, ...
    kernel methods
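One way to keep voxel selection inside the training phase, as cautioned above, is to nest it in a pipeline so that it is re-fit on the training portion of every fold (a sketch; the ANOVA-based SelectKBest step and the value k = 500 are illustrative choices, not the slides' method):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 5000))      # 40 scans, 5000 voxels
y = np.where(X[:, 0] > 0, 1, -1)

# the selection step is fit on each training fold only, avoiding circular analysis
clf = Pipeline([("select", SelectKBest(f_classif, k=500)),
                ("svm", LinearSVC())])
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())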
Kernel matrix
Kernel function K: similarity measure (scalar) between two feature vectors
Simplest example: dot product (linear kernel)
    e.g., dot product of [4, 1] and [-2, 3]: 4*(-2) + 1*3 = -5
N brain scans (feature vectors) give an N x N kernel matrix
[Figure: kernel matrix displayed as an instances x instances image.]
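A sketch of building the N x N linear kernel matrix from N feature vectors (sizes are placeholders; the slide's toy dot product is checked at the end):

import numpy as np

rng = np.random.default_rng(0)
N, D = 45, 10000                   # N scans, D voxels (placeholder sizes)
X = rng.standard_normal((N, D))    # one feature vector (brain scan) per row

K = X @ X.T                        # K[i, j] = x_i . x_j  (linear kernel)
print(K.shape)                     # (N, N): we now work in instance space, not voxel space

print(np.dot([4, 1], [-2, 3]))     # the toy example: 4*(-2) + 1*3 = -5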
Maximum-margin classifier
Assume data are linearly separable:
    ∃ w ∈ R^D, ∃ b ∈ R :  y_i (w^T x_i + b) > 0,  i = 1, ..., N
w and b can be rescaled such that the closest points to the hyperplane satisfy
    | w^T x_i + b | = 1
Distance ρ_x of a point x to the hyperplane is
    ρ_x = | w^T x + b | / ||w||
Margin of the hyperplane = 1 / ||w||

Support vector machine
[Vapnik, 1963, 1992]
HARD MARGIN
[Figure: separating hyperplane w^T x + b = 0 with margin boundaries w^T x + b = +1 and w^T x + b = -1.]
Quadratic programming optimization problem
    arg min_{(w,b)}  (1/2) w^T w    subject to  y_i (w^T x_i + b) ≥ 1,  i = 1, ..., N

Constrained problem can be formulated as an unconstrained one using Lagrange multipliers:
    arg min_{(w,b)}  max_{α ≥ 0}  (1/2) ||w||^2 - ∑_{i=1}^{N} α_i [ y_i (w^T x_i + b) - 1 ]  =: L

Points for which y_i (w^T x_i + b) > 1 do not matter and should have α_i = 0
Points on the margin have y_i (w^T x_i + b) = 1 and are called support vectors; these have α_i > 0 and thus
    ∂L/∂α_i = y_i (w^T x_i + b) - 1 = 0   ->   b = y_i - w^T x_i

Support vector machine (primal form)
Convex criterion, so no local minima
Lagrangian functional
    arg min_{(w,b)}  max_{α ≥ 0}  (1/2) ||w||^2 - ∑_{i=1}^{N} α_i [ y_i (w^T x_i + b) - 1 ]  =: L

Solving L for α_i, b, w leads to
    ∂L/∂α_i = y_i (w^T x_i + b) - 1 = 0   ->   b = y_i - w^T x_i   (for a support vector)
    ∂L/∂b = - ∑_{i=1}^{N} y_i α_i = 0
    ∂L/∂w = w - ∑_{i=1}^{N} α_i y_i x_i = 0   ->   w = ∑_{i=1}^{N} α_i y_i x_i

Support vector machine (primal form)
Lagrangian functional can be rewritten as
    L = (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} α_i y_i  x_i^T x_j  y_j α_j  -  ∑_{i=1}^{N} α_i,   subject to  ∑_{i=1}^{N} y_i α_i = 0
where x_i^T x_j plays the role of the kernel K(x_i, x_j)

Minimization of this dual form requires finding α_i ≥ 0 in an N-dimensional space instead of a D-dimensional one!

Weight vector w can be found as a linear combination of the instances
    w = ∑_{i'=1}^{N_SV} α_{i'} y_{i'} x_{i'}
where only the N_SV support vectors (with α_{i'} > 0) contribute!

Prediction of an unseen data point x according to
    f(x) = sign( ∑_{i'=1}^{N_SV} y_{i'} α_{i'} x_{i'}^T x + b )

Support vector machine (dual form)
Efficient: “sparse” solution; limited number of support vectors, the rest of the data can be ignored
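A sketch checking the dual-form relations with scikit-learn's SVC (linear kernel, synthetic data): only the support vectors receive nonzero α_i, and w = ∑ α_i y_i x_i reproduces the fitted hyperplane:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))])
y = np.r_[-np.ones(30), np.ones(30)]

svc = SVC(kernel="linear", C=1.0).fit(X, y)
print("number of support vectors:", len(svc.support_))    # sparse solution

# dual_coef_ stores alpha_i * y_i for the support vectors only
w = svc.dual_coef_ @ svc.support_vectors_                  # w = sum_i alpha_i y_i x_i
print("matches primal weights:", np.allclose(w, svc.coef_))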
Soft-margin classifier
Introduce slack variables ξ_i to relax the separation:
    ξ_i > 0  ->  x_i has margin less than 1

New optimization problem (primal):
    arg min_{(w,b), ξ_i ≥ 0}  (1/2) w^T w + C ∑_{i=1}^{N} ξ_i    subject to  y_i (w^T x_i + b) ≥ 1 - ξ_i

Which leads to the dual form:
    arg min_{α ≥ 0}  (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} α_i y_i x_i^T x_j y_j α_j - ∑_{i=1}^{N} α_i    subject to  ∑_{i=1}^{N} y_i α_i = 0  and  0 ≤ α_i ≤ C

Support vector machine
SOFT MARGIN
“Box constraint”: allows for capacity control
    C -> 0: very large safety margin
    C -> Inf: close to the hard-margin constraint
[Shawe-Taylor and Cristianini, 2004]
[Figure: soft-margin hyperplane w^T x + b = 0 with margin boundaries w^T x + b = +1 and w^T x + b = -1; slack variables ξ_1, ξ_2 mark points violating the margin.]
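A sketch of the box constraint in practice: varying C in scikit-learn's SVC trades margin width against training errors (data and the C values are arbitrary; small C allows more slack, large C approaches the hard margin):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-0.5, 1, (40, 2)), rng.normal(0.5, 1, (40, 2))])   # overlapping classes
y = np.r_[-np.ones(40), np.ones(40)]

for C in (0.01, 1.0, 1000.0):
    svc = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(svc.coef_)               # margin = 1 / ||w||
    print(f"C={C:8.2f}  margin={margin:.3f}  support vectors={len(svc.support_)}")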
Linear classification (by hyperplane) can be limiting
Project into an (even) higher-dimensional feature space V where linear separation corresponds to non-linear separation in the original space
Kernel trick
Linear classification (by hyperplane) can be limiting
Project into an (even) higher-dimensional feature space V where linear separation corresponds to non-linear separation in the original space
However, this new feature space V is implicit, because in practice we use generalized kernel functions:
    K(x_1, x_2) = ⟨φ(x_1), φ(x_2)⟩_V
where ⟨·,·⟩_V should be a proper inner product, that is, K(·,·) should satisfy Mercer's condition:
    ∑_{i=1}^{N} ∑_{j=1}^{N} K(x_i, x_j) c_i c_j ≥ 0
for all finite sequences of points (x_1, ..., x_N) and real-valued coefficients (c_1, ..., c_N). This also means that the kernel matrix [K(x_i, x_j)]_{i,j} should be positive semi-definite (i.e., all eigenvalues ≥ 0).
Kernel trick
Commonly used kernel functions K(x_1, x_2) = ⟨φ(x_1), φ(x_2)⟩:
    Polynomial (homogeneous):    K(x_1, x_2) = (x_1^T x_2)^p
    Polynomial (inhomogeneous):  K(x_1, x_2) = (x_1^T x_2 + 1)^p
    Gaussian radial basis kernel: K(x_1, x_2) = exp(-γ ||x_1 - x_2||^2),  γ > 0

Example: polynomial kernel K(x_1, x_2) = (x_1^T x_2)^2
For D = 2, developing the kernel function leads to (binomial theorem)
    K(x_1, x_2) = (x_{1,1} x_{2,1} + x_{1,2} x_{2,2})^2
                = (x_{1,1} x_{2,1})^2 + 2 x_{1,1} x_{2,1} x_{1,2} x_{2,2} + (x_{1,2} x_{2,2})^2
                = ⟨ [x_{1,1}^2, √2 x_{1,1} x_{1,2}, x_{1,2}^2], [x_{2,1}^2, √2 x_{2,1} x_{2,2}, x_{2,2}^2] ⟩
From which we can identify φ : (x_1, x_2) -> (x_1^2, √2 x_1 x_2, x_2^2)
For general D, this kernel maps to (D+1 choose 2) = D(D+1)/2 dimensions (multinomial theorem)
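A quick numerical check of the D = 2 derivation above: the homogeneous polynomial kernel (x_1^T x_2)^2 equals the dot product of the explicit features (x_1^2, √2 x_1 x_2, x_2^2) (the particular test vectors are arbitrary):

import numpy as np

def phi(x):
    # explicit feature map for K(x1, x2) = (x1 . x2)**2 with D = 2
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x1 = np.array([0.3, -1.2])
x2 = np.array([2.0, 0.7])
print((x1 @ x2) ** 2)        # kernel evaluated directly in the original 2-D space
print(phi(x1) @ phi(x2))     # same value via the explicit 3-D feature space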
Kernel trick
The transform φ is implicit, but the weight vector is defined as
    w = ∑_i α_i y_i φ(x_i)
Dot products with w can be computed using the kernel trick:
    w^T φ(x) = ∑_i α_i y_i K(x_i, x)
However, in general there exists no w' such that
    w^T φ(x) = K(w', x),
which means there is no “equivalent weight map”
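A sketch of the kernel trick at prediction time: with a non-linear (here RBF) kernel there is no explicit w, but the decision value can still be computed as ∑ α_i y_i K(x_i, x) + b from the support vectors (scikit-learn; the data and γ are arbitrary):

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 5))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)           # non-linearly separable labels

svc = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

x_new = rng.standard_normal((1, 5))
K = rbf_kernel(svc.support_vectors_, x_new, gamma=0.5)       # K(x_i, x) for the support vectors
decision = (svc.dual_coef_ @ K).item() + svc.intercept_[0]   # sum_i alpha_i y_i K(x_i, x) + b
print(decision, svc.decision_function(x_new)[0])             # the two values agree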
Kernel trick
Neuroimaging applications often work best with linear kernels
Very high-dimensional data with small sample sizes (D >> N); the increased complexity of non-linear kernels (more parameters) cannot be turned into an advantage
Linear models have less risk of overfitting
The weight vector of a linear model has a direct interpretation as an image, i.e., a “discrimination map”
Linear and non-linear kernels
Model complexity (no free lunch)
[Figure: trade-off along the model-complexity axis, from high reproducibility / low predictability to low reproducibility / high predictability.]
Ensemble learning: combine many (weak) classifiers to get better results (a sketch follows below)
    bagging: classifiers trained on random training subsets, equal vote
    boosting: adding classifiers that deal with previously misclassified data
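A sketch of the bagging idea mentioned above: many weak classifiers trained on random subsets of the training data and combined by voting (scikit-learn; the base tree, subset size, and number of estimators are arbitrary choices):

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

# 50 weak trees, each fit on a random 80% subset; predictions are combined by (equal) vote
bag = BaggingClassifier(DecisionTreeClassifier(max_depth=2),
                        n_estimators=50, max_samples=0.8, random_state=0)
print("cross-validated accuracy:", cross_val_score(bag, X, y, cv=5).mean())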
What’s the best classifier?
[Hastie et al., 2009]
Recommended reading
Pereira et al., Machine learning classifiers and fMRI: A tutorial overview, NeuroImage 45, 2009
Richiardi et al., Machine learning with brain graphs, IEEE Signal Processing Magazine 30, 2013
Haufe et al., On the interpretation of weight vectors of linear models in multivariate neuroimaging, NeuroImage, in press
Classifier selection
Is classifier A better than classifier B? Hypothesis testing framework
“Do the two sets of decisions of classifiers A and B represent two different populations?”
Careful: the samples (sets of decisions) are dependent (same evaluation/testing dataset), and the measurement type is categorical binary (for a two-class problem)
⇒ Use the McNemar test
H0: “there is no difference between the accuracies of the two classifiers”
Decisions of classifiers A and B on the 10 test samples (✔ = correct, ✘ = wrong):
    test sample   1  2  3  4  5  6  7  8  9  10
    A             ✔  ✔  ✔  ✔  ✔  ✔  ✔  ✔  ✘  ✘
    B             ✔  ✔  ✘  ✔  ✔  ✔  ✘  ✔  ✔  ✘
Compute relationships between the classifier decisions (contingency table below)
If H0 is true, we should have N01 = N10 = 0.5 (N01 + N10)
We can measure the following test statistic based on the observed counts (formula below)
Then compare x^2 to the critical value of χ² (1 DOF) at a level of significance α
(requires N01 + N10 > 25)
Selection with the McNemar test
           B ✔    B ✘
    A ✔    N11    N10
    A ✘    N01    N00
After [Dietterich, 1998]
    x^2 = ( |N01 - N10| - 1 )^2 / ( N01 + N10 )
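A sketch of the McNemar computation, with the disagreement counts read off the toy table above (N10 = 2: A correct, B wrong; N01 = 1: A wrong, B correct); note that these toy counts violate the N01 + N10 > 25 requirement and are used here only to show the arithmetic:

from scipy.stats import chi2

N10, N01 = 2, 1
x2 = (abs(N01 - N10) - 1) ** 2 / (N01 + N10)   # McNemar statistic with continuity correction
critical = chi2.ppf(1 - 0.05, df=1)            # chi-square critical value, 1 DOF, alpha = 0.05
print(x2, critical, "reject H0" if x2 > critical else "cannot reject H0")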