Understanding multivariate pattern analysis for neuroimaging applications
Dimitri Van De Ville
Medical Image Processing Lab, Institute of Bioengineering, Ecole Polytechnique Fédérale de Lausanne
Department of Radiology & Medical Informatics, Université de Genève
http://miplab.epfl.ch/
April 23, 2015
Overview
Brain decoding: motivation and applications
    Limitations of “regular SPM” (GLM)
    “Hall of fame” of brain decoding
Pattern recognition principles for brain decoding
    Basics
    Linear classifier
    Discriminative versus generative model
    Cross-validation: training phase, evaluation phase, testing phase
    Support vector machine (SVM): primal and dual form, kernel trick
[Figure: GLM analysis pipeline. A whole-brain scan (2x2x2 mm3, 20-30 slices, every 2-4 sec) gives roughly 100'000 intracranial voxels and 1'000 timepoints; voxel-wise GLM statistics yield a t-value map.]
Limitations of the GLM approach
Massively univariate approach
    Hypothesis testing is applied voxel-wise
    All voxels are treated independently, but are obviously not independent
Textbook multivariate approaches are not practical
    E.g., estimating a spatial covariance matrix (100'000 x 100'000) from 1'000 timepoints will be rank-deficient
    fMRI data are complicated ➔ data-driven: let the data speak
Group statistics are great...
    ... but the ultimate goal is often to know how well the results generalize (prediction)
    Learn from training data, apply to the (unseen) test data
[Figure: examples from condition 1 and condition 2 plotted in a two-voxel feature space (voxel 1 versus voxel 2).]
Forward versus backward inference
[Diagram: encoding versus decoding. Encoding (generative model): from stimuli, conditions, disease, disorder, ... to data (illustrated with phrenology). Decoding (discriminative model): from data back to stimuli, conditions, disease, disorder, ... (illustrated with “blind reading”).]
Exploiting “subtle” spatial patterns
[Cox et al., 2003; Haynes and Rees, 2006]
example in 2D; 9D feature space
Emotions in voice-sensitive cortices
How is emotional information treated in auditory cortex?
22 healthy right-handed subjects
10 actors pronounced “ne kalibam sout molem” (anger, sadness, neutrality, relief, joy)
Normalization for mean sound intensity; rated by 24 (other) subjects
Classifier approach on auditory cortex
Localizer: human voices > animal sounds
20 stimuli for each category (5); 5 classifiers (one-versus-others)
[Ethofer, VDV, Scherer, Vuilleumier, Current Biology, 2009; collaboration with LABNIC/CISA, UniGE]
p<0.05 (FWE)
[Figure: axial slices at z = 0, z = +3, z = +6 for Anger, Sadness, Neutral, Relief, Joy; spoken stimulus: “Ne kalibam sout molem”.]
[Ethofer, VDV, Scherer, Vuilleumier, Current Biology, 2009]
Decoding from the emotional brain
[Ethofer, VDV, Scherer, Vuilleumier, Current Biology, 2009]
Results for bilateral voice-sensitive cortices
solid: smoothed (10 mm FWHM); dashed: non-smoothed; dotted: spatial average
Chance level: 20%; classifier: linear SVM
Evidence that auditory cues for emotional processing are spatially encoded; bilateral representation
Comprehension of emotional prosody is crucial for social functioning and is impaired in psychiatric disorders (schizophrenia, bipolar affective disorder, ...)
Brain decoding of visual processing
distributed patterns [Haxby et al., 2001]
subtle patterns in V1 [Kamitani and Tong, 2005, 2006; Haynes and Rees, 2006]
advanced models for receptive fields [Miyawaki et al., 2008]
Decoding video sequences
reverse statistical inference (“decode” the stimulus from the data)
[Nishimoto et al., 2011]
Basics of pattern recognition
Machine learning aims to get the model right based on a series of examples (supervised learning); a minimal code sketch follows below
learning/training phase: from examples {(x_1, y_1), ..., (x_N, y_N)}, learn a model f such that f(x_i) = y_i
testing phase: for unseen data x_s, predict the label f(x_s)
[Diagram: feature vector x_i with D features -> model / function / machine / classifier -> label y_i (e.g., -1/+1, control/patient).]
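As a minimal sketch of this train/test workflow (not from the original slides; the data, labels, and the choice of a linear SVM are placeholder assumptions), in Python with scikit-learn:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))                 # 100 instances, 50 features (e.g., voxels)
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)   # synthetic labels in {-1, +1}

# training phase: from examples (x_i, y_i), learn a model f with f(x_i) ~ y_i
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LinearSVC().fit(X_train, y_train)

# testing phase: predict labels f(x_s) for unseen data
print("test accuracy:", clf.score(X_test, y_test))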
Classification as a regression problem
Linear model:
    y = w^T X + b + noise
where the vector (w, b) characterizes the full model
Learning the model by least-squares fitting:
    (w, b) = arg min_{(w,b)} || y - w^T X - b ||^2
Underdetermined (D > N):
    (w, b) = arg min_{(w,b)} || y - w^T X - b ||^2 + λ ||w||^2
Solution: w = pinv(X^T) y^T
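A small numpy sketch of the regularized least-squares fit above (a sketch only; the synthetic data, the value of λ, and storing instances as rows are assumptions made here):

import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 200                          # underdetermined: more features than instances
X = rng.standard_normal((N, D))         # rows = instances, columns = features
y = np.where(X[:, 0] > 0, 1.0, -1.0)

lam = 1.0                               # regularization weight (lambda)
Xb = np.hstack([X, np.ones((N, 1))])    # absorb the bias b as an extra column
R = np.diag(np.r_[np.ones(D), 0.0])     # penalize ||w||^2 only, not the bias
wb = np.linalg.solve(Xb.T @ Xb + lam * R, Xb.T @ y)   # closed-form ridge solution
w, b = wb[:-1], wb[-1]
print("training sign agreement:", np.mean(np.sign(Xb @ wb) == y))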
Basics of pattern recognition
[Diagram: feature matrix X (N instances x D features), labels y, model.]
Basics of pattern recognition
Evaluate classifier on unseen data
Avoid overfitting: increasingly complex models will fit (any) training data ever better!
Classify new data x_0 by comparing w^T x_0 + b against 0
Interpretation of the weight vector
Training: examples of class 1 and examples of class 2, each described by two features (voxel 1, voxel 2), yield a weight vector w with w_1 = +5, w_2 = -3
Testing: new example with x_1 = 0.5, x_2 = 0.8
Predicted label: (w_1 x_1 + w_2 x_2) + b = (+5 * 0.5 - 3 * 0.8) + 0 = 0.1; positive value -> class 1
• spatial representation of decision function
• multivariate pattern: no local inference!
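The same toy computation in a few lines of numpy (weights and the test pattern are taken from the slide above):

import numpy as np

w = np.array([5.0, -3.0])        # weight vector learned from the two classes
b = 0.0
x_new = np.array([0.5, 0.8])     # new example: voxel 1 = 0.5, voxel 2 = 0.8

score = w @ x_new + b            # 5*0.5 - 3*0.8 + 0 = 0.1
print("class 1" if score > 0 else "class 2")   # positive value -> class 1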
Basics of pattern recognition
Linear model projects an instance onto a line parallel to w
Decision boundary is a hyperplane in the feature space
If feature dimensions correspond to voxels, then w corresponds to a map
Bias b allows the decision boundary to be shifted along w
[adapted from Jaakkola]
[Figure: separating hyperplane (decision values -1, 0, +1) in the feature space.]
Classification and decision theory
Assume class-conditional densities p(x|y) and class probabilities P(y) are known
Minimizing the overall probability of error corresponds to
    y_0 = arg max_y P(y|x_0)
        = arg max_y p(x_0|y) P(y) / p(x_0)    (Bayes' rule)
        = arg max_y p(x_0|y) P(y)
Basics of pattern recognition
Wonderful link between the discriminative model (posterior P(y|x)) and the generative model (likelihood p(x|y)):
    P(y|x) ∝ p(x|y) P(y)
Discriminative model: e.g., linear classifier, SVM
Generative model: e.g., naïve Bayes, with
    p(x|y) = ∏_{i=1}^{D} p(x_i|y)
where the features are modeled as independent random variables
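A sketch of the generative route with scikit-learn's Gaussian naïve Bayes, which models each feature independently per class and predicts by maximizing p(x|y) P(y) (the data here are synthetic; the Gaussian class-conditionals are this example's choice, not something fixed by the slide):

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (40, 10)),     # class -1
               rng.normal(1.0, 1.0, (40, 10))])    # class +1
y = np.r_[-np.ones(40), np.ones(40)]

gnb = GaussianNB().fit(X, y)     # p(x|y) = prod_i p(x_i|y), one Gaussian per feature and class
print(gnb.predict(X[:2]))        # arg max_y p(x|y) P(y), via Bayes' rule
print(gnb.predict_proba(X[:2]))  # posterior P(y|x)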
Basics of pattern recognition
Basics of pattern recognition
[Diagram: split the full dataset into Train (TR), Evaluate (EV), and Test (TE) sets; train on TR, evaluate on EV, tune until the error is OK, then report the final performance measure on TE.]
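A sketch of that protocol: tune on an evaluation split, keep a final test split untouched until the end (scikit-learn; the synthetic data, split sizes, and the grid of C values are arbitrary assumptions):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.standard_normal((120, 30))
y = np.where(X[:, 0] > 0, 1, -1)

# split the full dataset into Train (TR), Evaluate (EV) and Test (TE) sets
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_ev, X_te, y_ev, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# tune on EV: keep the hyperparameter with the best evaluation accuracy
best_C = max((0.01, 0.1, 1.0, 10.0),
             key=lambda C: LinearSVC(C=C).fit(X_tr, y_tr).score(X_ev, y_ev))

# final performance measure on the untouched test set
final = LinearSVC(C=best_C).fit(np.vstack([X_tr, X_ev]), np.r_[y_tr, y_ev])
print("chosen C:", best_C, " test accuracy:", final.score(X_te, y_te))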
Leave-one-out cross-validation
Repeatedly train on all instances except one, evaluate on left-out instance
Confusion counts (rows = truth, columns = estimated):
                    estimated class 1    estimated class 2
    true class 1          A                    B
    true class 2          C                    D

Accuracy statistics: class 1 = A/(A+B); class 2 = D/(C+D); average = (A+D)/(A+B+C+D)
Sensitivity = D/(C+D) = TP/(FN+TP); specificity = A/(A+B) = TN/(FP+TN)
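A sketch of leave-one-out cross-validation and the counts above, using scikit-learn (synthetic data; the choice of a linear SVM and of class +1 as the "positive" class are assumptions of this example):

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 100))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# repeatedly train on all instances except one, predict the left-out instance
y_pred = cross_val_predict(LinearSVC(), X, y, cv=LeaveOneOut())

# rows = truth, columns = estimated, ordered (class 1 = -1, class 2 = +1): [[A, B], [C, D]]
(A, B), (C, D) = confusion_matrix(y, y_pred, labels=[-1, 1])
print("accuracy   :", (A + D) / (A + B + C + D))
print("sensitivity:", D / (C + D))   # TP / (FN + TP)
print("specificity:", A / (A + B))   # TN / (FP + TN)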
Testing classifiers: the ROC curve
Receiver Operating Characteristic (ROC) curve
Allows one to see the compromise over the operating range of a classifier
Can be used for classifier comparison, e.g., by computing the Area Under the Curve (AUC) as a summary measure of performance
[Plot: ROC curves, sensitivity versus 1-specificity; classifier 1: AUC 0.79, classifier 2: AUC 0.74.]
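A sketch of computing an ROC curve and its AUC from continuous classifier scores (the scores here are synthetic stand-ins for decision values such as w^T x + b):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.r_[np.zeros(50), np.ones(50)]
scores = np.r_[rng.normal(0, 1, 50), rng.normal(1, 1, 50)]   # classifier decision values

fpr, tpr, thresholds = roc_curve(y_true, scores)   # fpr = 1 - specificity, tpr = sensitivity
print("AUC:", roc_auc_score(y_true, scores))       # area under the ROC curve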
Learning in high dimensions
Dimensionality of the feature vector (e.g., #voxels) often exceeds the number of instances
    model is underdetermined and requires additional assumptions (e.g., regularization)
Other solutions
    feature selection & extraction
        gray-matter mask; region-of-interest (ROI); atlas regions
        only most-active voxels (inside the training phase! see the sketch below)
    searchlight approach [Kriegeskorte et al., 2006]
        ROI is 'sliding' around the brain
        multivariate capabilities (long-range) are limited
    dimensionality reduction: PCA, ICA, ...
    kernel methods
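One way to keep voxel selection inside the training phase, as cautioned above, is to nest it in a pipeline so that it is re-fit on the training portion of every fold (a sketch; the ANOVA-based SelectKBest step and the value k = 500 are illustrative choices, not the slides' method):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 5000))      # 40 scans, 5000 voxels
y = np.where(X[:, 0] > 0, 1, -1)

# the selection step is fit on each training fold only, avoiding circular analysis
clf = Pipeline([("select", SelectKBest(f_classif, k=500)),
                ("svm", LinearSVC())])
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())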
Kernel matrix
Kernel function K: similarity measure (scalar) between two feature vectors
Simplest example: dot product (linear kernel)
    e.g., dot product of [4, 1] and [-2, 3]: 4*(-2) + 1*3 = -5
N brain scans (feature vectors) give an N x N kernel matrix
[Figure: kernel matrix displayed as an instances x instances image.]
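A sketch of building the N x N linear kernel matrix from N feature vectors (sizes are placeholders; the slide's toy dot product is checked at the end):

import numpy as np

rng = np.random.default_rng(0)
N, D = 45, 10000                   # N scans, D voxels (placeholder sizes)
X = rng.standard_normal((N, D))    # one feature vector (brain scan) per row

K = X @ X.T                        # K[i, j] = x_i . x_j  (linear kernel)
print(K.shape)                     # (N, N): we now work in instance space, not voxel space

print(np.dot([4, 1], [-2, 3]))     # the toy example: 4*(-2) + 1*3 = -5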
Maximum-margin classifier
Assume data are linearly separable:
    ∃ w ∈ R^D, ∃ b ∈ R :  y_i (w^T x_i + b) > 0,  i = 1, ..., N
w and b can be rescaled such that the closest points to the hyperplane satisfy
    | w^T x_i + b | = 1
Distance ρ_x of a point x to the hyperplane is
    ρ_x = | w^T x + b | / ||w||
Margin of the hyperplane = 1 / ||w||

Support vector machine
[Vapnik, 1963, 1992]
HARD MARGIN
[Figure: separating hyperplane w^T x + b = 0 with margin boundaries w^T x + b = +1 and w^T x + b = -1.]
Quadratic programming optimization problem
    arg min_{(w,b)}  (1/2) w^T w    subject to  y_i (w^T x_i + b) ≥ 1,  i = 1, ..., N

Constrained problem can be formulated as an unconstrained one using Lagrange multipliers:
    arg min_{(w,b)}  max_{α ≥ 0}  (1/2) ||w||^2 - ∑_{i=1}^{N} α_i [ y_i (w^T x_i + b) - 1 ]  =: L

Points for which y_i (w^T x_i + b) > 1 do not matter and should have α_i = 0
Points on the margin have y_i (w^T x_i + b) = 1 and are called support vectors; these have α_i > 0 and thus
    ∂L/∂α_i = y_i (w^T x_i + b) - 1 = 0   ->   b = y_i - w^T x_i

Support vector machine (primal form)
Convex criterion, so no local minima
Lagrangian functional
    arg min_{(w,b)}  max_{α ≥ 0}  (1/2) ||w||^2 - ∑_{i=1}^{N} α_i [ y_i (w^T x_i + b) - 1 ]  =: L

Solving L for α_i, b, w leads to
    ∂L/∂α_i = y_i (w^T x_i + b) - 1 = 0   ->   b = y_i - w^T x_i   (for a support vector)
    ∂L/∂b = - ∑_{i=1}^{N} y_i α_i = 0
    ∂L/∂w = w - ∑_{i=1}^{N} α_i y_i x_i = 0   ->   w = ∑_{i=1}^{N} α_i y_i x_i

Support vector machine (primal form)
Lagrangian functional can be rewritten as
    L = (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} α_i y_i  x_i^T x_j  y_j α_j  -  ∑_{i=1}^{N} α_i,   subject to  ∑_{i=1}^{N} y_i α_i = 0
where x_i^T x_j plays the role of the kernel K(x_i, x_j)

Minimization of this dual form requires finding α_i ≥ 0 in an N-dimensional space instead of a D-dimensional one!

Weight vector w can be found as a linear combination of the instances
    w = ∑_{i'=1}^{N_SV} α_{i'} y_{i'} x_{i'}
where only the N_SV support vectors (with α_{i'} > 0) contribute!

Prediction of an unseen data point x according to
    f(x) = sign( ∑_{i'=1}^{N_SV} y_{i'} α_{i'} x_{i'}^T x + b )

Support vector machine (dual form)
Efficient: “sparse” solution; limited number of support vectors, the rest of the data can be ignored
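A sketch checking the dual-form relations with scikit-learn's SVC (linear kernel, synthetic data): only the support vectors receive nonzero α_i, and w = ∑ α_i y_i x_i reproduces the fitted hyperplane:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))])
y = np.r_[-np.ones(30), np.ones(30)]

svc = SVC(kernel="linear", C=1.0).fit(X, y)
print("number of support vectors:", len(svc.support_))    # sparse solution

# dual_coef_ stores alpha_i * y_i for the support vectors only
w = svc.dual_coef_ @ svc.support_vectors_                  # w = sum_i alpha_i y_i x_i
print("matches primal weights:", np.allclose(w, svc.coef_))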
Soft-margin classifier
Introduce slack variables ξ_i to relax the separation:
    ξ_i > 0  ->  x_i has margin less than 1

New optimization problem (primal):
    arg min_{(w,b), ξ_i ≥ 0}  (1/2) w^T w + C ∑_{i=1}^{N} ξ_i    subject to  y_i (w^T x_i + b) ≥ 1 - ξ_i

Which leads to the dual form:
    arg min_{α ≥ 0}  (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} α_i y_i x_i^T x_j y_j α_j - ∑_{i=1}^{N} α_i    subject to  ∑_{i=1}^{N} y_i α_i = 0  and  0 ≤ α_i ≤ C

Support vector machine
SOFT MARGIN
“Box constraint”: allows for capacity control
    C -> 0: very large safety margin
    C -> Inf: close to the hard-margin constraint
[Shawe-Taylor and Cristianini, 2004]
[Figure: soft-margin hyperplane w^T x + b = 0 with margin boundaries w^T x + b = +1 and w^T x + b = -1; slack variables ξ_1, ξ_2 mark points violating the margin.]
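A sketch of the box constraint in practice: varying C in scikit-learn's SVC trades margin width against training errors (data and the C values are arbitrary; small C allows more slack, large C approaches the hard margin):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-0.5, 1, (40, 2)), rng.normal(0.5, 1, (40, 2))])   # overlapping classes
y = np.r_[-np.ones(40), np.ones(40)]

for C in (0.01, 1.0, 1000.0):
    svc = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(svc.coef_)               # margin = 1 / ||w||
    print(f"C={C:8.2f}  margin={margin:.3f}  support vectors={len(svc.support_)}")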
Linear classification (by hyperplane) can be limiting
Project into an (even) higher-dimensional feature space V where linear separation corresponds to non-linear separation in the original space
Kernel trick
Linear classification (by hyperplane) can be limiting
Project into an (even) higher-dimensional feature space V where linear separation corresponds to non-linear separation in the original space
However, this new feature space V is implicit, because in practice we use generalized kernel functions:
    K(x_1, x_2) = ⟨φ(x_1), φ(x_2)⟩_V
where ⟨·,·⟩_V should be a proper inner product, that is, K(·,·) should satisfy Mercer's condition:
    ∑_{i=1}^{N} ∑_{j=1}^{N} K(x_i, x_j) c_i c_j ≥ 0
for all finite sequences of points (x_1, ..., x_N) and real-valued coefficients (c_1, ..., c_N). This also means that the kernel matrix [K(x_i, x_j)]_{i,j} should be positive semi-definite (i.e., all eigenvalues ≥ 0).
Kernel trick
Commonly used kernel functions K(x_1, x_2) = ⟨φ(x_1), φ(x_2)⟩:
    Polynomial (homogeneous):    K(x_1, x_2) = (x_1^T x_2)^p
    Polynomial (inhomogeneous):  K(x_1, x_2) = (x_1^T x_2 + 1)^p
    Gaussian radial basis kernel: K(x_1, x_2) = exp(-γ ||x_1 - x_2||^2),  γ > 0

Example: polynomial kernel K(x_1, x_2) = (x_1^T x_2)^2
For D = 2, developing the kernel function leads to (binomial theorem)
    K(x_1, x_2) = (x_{1,1} x_{2,1} + x_{1,2} x_{2,2})^2
                = (x_{1,1} x_{2,1})^2 + 2 x_{1,1} x_{2,1} x_{1,2} x_{2,2} + (x_{1,2} x_{2,2})^2
                = ⟨ [x_{1,1}^2, √2 x_{1,1} x_{1,2}, x_{1,2}^2], [x_{2,1}^2, √2 x_{2,1} x_{2,2}, x_{2,2}^2] ⟩
From which we can identify φ : (x_1, x_2) -> (x_1^2, √2 x_1 x_2, x_2^2)
For general D, this kernel maps to (D+1 choose 2) = D(D+1)/2 dimensions (multinomial theorem)
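A quick numerical check of the D = 2 derivation above: the homogeneous polynomial kernel (x_1^T x_2)^2 equals the dot product of the explicit features (x_1^2, √2 x_1 x_2, x_2^2) (the particular test vectors are arbitrary):

import numpy as np

def phi(x):
    # explicit feature map for K(x1, x2) = (x1 . x2)**2 with D = 2
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x1 = np.array([0.3, -1.2])
x2 = np.array([2.0, 0.7])
print((x1 @ x2) ** 2)        # kernel evaluated directly in the original 2-D space
print(phi(x1) @ phi(x2))     # same value via the explicit 3-D feature space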
Kernel trick
The transform φ is implicit, but the weight vector is defined as
    w = ∑_i α_i y_i φ(x_i)
Dot products with w can be computed using the kernel trick:
    w^T φ(x) = ∑_i α_i y_i K(x_i, x)
However, in general there exists no w' such that
    w^T φ(x) = K(w', x),
which means there is no “equivalent weight map”
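A sketch of the kernel trick at prediction time: with a non-linear (here RBF) kernel there is no explicit w, but the decision value can still be computed as ∑ α_i y_i K(x_i, x) + b from the support vectors (scikit-learn; the data and γ are arbitrary):

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 5))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)           # non-linearly separable labels

svc = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

x_new = rng.standard_normal((1, 5))
K = rbf_kernel(svc.support_vectors_, x_new, gamma=0.5)       # K(x_i, x) for the support vectors
decision = (svc.dual_coef_ @ K).item() + svc.intercept_[0]   # sum_i alpha_i y_i K(x_i, x) + b
print(decision, svc.decision_function(x_new)[0])             # the two values agree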
Kernel trick
Neuroimaging applications often work best with linear kernels
Very high-dimensional data with small sample sizes (D >> N); the increased complexity of non-linear kernels (more parameters) cannot be turned into an advantage
Linear models have less risk of overfitting
The weight vector of a linear model has a direct interpretation as an image, i.e., a “discrimination map”
Linear and non-linear kernels
Model complexity (no free lunch)
[Figure: trade-off along the model-complexity axis, from high reproducibility / low predictability to low reproducibility / high predictability.]
Ensemble learning: combine many (weak) classifiers to get better results (a sketch follows below)
    bagging: classifiers trained on random training subsets, equal vote
    boosting: adding classifiers that deal with previously misclassified data
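A sketch of the bagging idea mentioned above: many weak classifiers trained on random subsets of the training data and combined by voting (scikit-learn; the base tree, subset size, and number of estimators are arbitrary choices):

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

# 50 weak trees, each fit on a random 80% subset; predictions are combined by (equal) vote
bag = BaggingClassifier(DecisionTreeClassifier(max_depth=2),
                        n_estimators=50, max_samples=0.8, random_state=0)
print("cross-validated accuracy:", cross_val_score(bag, X, y, cv=5).mean())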
What’s the best classifier?
[Hastie et al., 2009]
Recommended reading
Pereira et al., Machine learning classifiers and fMRI: A tutorial overview, NeuroImage 45, 2009
Richiardi et al., Machine learning with brain graphs, IEEE Signal Processing Magazine 30, 2013
Haufe et al., On the interpretation of weight vectors of linear models in multivariate neuroimaging, NeuroImage, in press
Classifier selection
Is classifier A better than classifier B? Hypothesis testing framework
“Do the two sets of decisions of classifiers A and B represent two different populations?”
Careful: the samples (sets of decisions) are dependent (same evaluation/testing dataset), and the measurement type is categorical binary (for a two-class problem)
⇒ Use the McNemar test
H0: “there is no difference between the accuracies of the two classifiers”
Decisions of classifiers A and B on the 10 test samples (✔ = correct, ✘ = wrong):
    test sample   1  2  3  4  5  6  7  8  9  10
    A             ✔  ✔  ✔  ✔  ✔  ✔  ✔  ✔  ✘  ✘
    B             ✔  ✔  ✘  ✔  ✔  ✔  ✘  ✔  ✔  ✘
Compute relationships between the classifier decisions (contingency table below)
If H0 is true, we should have N01 = N10 = 0.5 (N01 + N10)
We can measure the following test statistic based on the observed counts (formula below)
Then compare x^2 to the critical value of χ² (1 DOF) at a level of significance α
(requires N01 + N10 > 25)
Selection with the McNemar test
           B ✔    B ✘
    A ✔    N11    N10
    A ✘    N01    N00
After [Dietterich, 1998]
    x^2 = ( |N01 - N10| - 1 )^2 / ( N01 + N10 )
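A sketch of the McNemar computation, with the disagreement counts read off the toy table above (N10 = 2: A correct, B wrong; N01 = 1: A wrong, B correct); note that these toy counts violate the N01 + N10 > 25 requirement and are used here only to show the arithmetic:

from scipy.stats import chi2

N10, N01 = 2, 1
x2 = (abs(N01 - N10) - 1) ** 2 / (N01 + N10)   # McNemar statistic with continuity correction
critical = chi2.ppf(1 - 0.05, df=1)            # chi-square critical value, 1 DOF, alpha = 0.05
print(x2, critical, "reject H0" if x2 > critical else "cannot reject H0")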