Understanding multivariate pattern analysis for neuroimaging applications
Maria Giulia Preti, Dimitri Van De Ville
Medical Image Processing Lab, Institute of Bioengineering, Ecole Polytechnique Fédérale de Lausanne
Department of Radiology & Medical Informatics, Université de Genève
http://miplab.epfl.ch/
April 17, 2014
Overview
- Brain decoding: motivation and applications
  - Limitations of "regular SPM" (GLM)
  - "Hall of fame" of brain decoding
- Pattern recognition principles for brain decoding
  - Basics
  - Data processing pipeline for fMRI decoding
    - Preprocessing
    - Feature extraction
    - Feature selection
    - Cross-validation: training phase, evaluation phase, testing phase
    - Classifier selection
[Figure: the standard analysis chain — a whole-brain scan (2×2×2 mm³, 20–30 slices, one volume every 2–4 s) yields ~100,000 intracranial voxels and ~1,000 timepoints; the GLM and voxel-wise statistics reduce each voxel's timecourse to a t-value.]
Limitations of the GLM approach
- Massively univariate approach
  - Hypothesis testing is applied voxel-wise
  - All voxels are treated independently, but they are obviously not independent
- Textbook multivariate approaches are not practical
  - E.g., estimating a 100,000 × 100,000 spatial covariance matrix from 1,000 timepoints yields a rank-deficient estimate
- fMRI data are complicated → data-driven: let the data speak
- Group statistics are great... but the ultimate goal is often to know how well the results generalize (prediction)
- Learn from the group, apply to the (unseen) individual
[Figure: toy example — two conditions plotted in a two-voxel feature space (voxel 1 vs. voxel 2).]
Forward versus backward inference
[Figure: encoding uses a generative model to predict data from stimuli/conditions/disease/disorder; decoding uses a discriminative model to infer stimuli/conditions/disease/disorder from data. Caricatures: "phrenology" (naive forward inference) vs. "blind reading" (naive reverse inference).]
Exploiting "subtle" spatial patterns
[Cox et al., 2003; Haynes and Rees, 2006]

Brain decoding of visual processing
- distributed patterns [Haxby et al., 2001]
- subtle patterns in V1 [Kamitani and Tong, 2005, 2006; Haynes and Rees, 2006]
- advanced models for receptive fields [Miyawaki et al., 2008]
Decoding video sequences
- reverse statistical inference ("decode" the stimulus from the data) [Nishimoto et al., 2011]
Emotions in voice-sensitive cortices
- How is emotional information treated in auditory cortex?
- 22 healthy right-handed subjects
- 10 actors pronounced "ne kalibam sout molem" with five emotions (anger, sadness, neutrality, relief, joy)
  - normalization for mean sound intensity; rated by 24 (other) subjects
- Classifier approach on auditory cortex
  - localizer: human voices > animal sounds
  - 20 stimuli for each of the 5 categories; 5 classifiers (one-versus-others)
[Ethofer, Van De Ville, Scherer, Vuilleumier, Current Biology, 2009; collaboration with LABNIC/CISA, UniGE]
[Figure: activation patterns for "Ne kalibam sout molem" across Anger, Sadness, Neutral, Relief, and Joy, shown at axial slices z = 0, +3, +6 mm; p < 0.05 (FWE). Ethofer, Van De Ville, Scherer, Vuilleumier, Current Biology, 2009.]
Decoding from the emotional brain
[Ethofer, Van De Ville, Scherer, Vuilleumier, Current Biology, 2009]
- Results for bilateral voice-sensitive cortices
  - solid: smoothed (10 mm FWHM); dashed: non-smoothed; dotted: spatial average
  - chance level: 20%; classifier: linear SVM
- Evidence that auditory cues for emotional processing are spatially encoded; bilateral representation
- Comprehension of emotional prosody is crucial for social functioning and is impaired in psychiatric disorders (schizophrenia, bipolar affective disorder, ...)
Brain decoding
- Automatic classification of structural images
- SVM applied to similarity measures (inner product) [Klöppel et al., 2008]
Basics of pattern recognition
- The feature vector is typically embedded in a vector space, where each feature is a dimension
[Figure: a feature vector x_i with D features is mapped by a model (function / machine / classifier) to a label y_i (e.g., 0/1).]
Basics of pattern recognition
[Figure: a feature matrix X (D features × N instances) and the model produce the label vector y.]
Classification as a regression problem

Linear model:
\[ y^T = X^T w_1 + w_0 + \text{noise} \]
where the weight vector \( w = (w_1, w_0) \) characterizes the model.

Learning the model by least-squares fitting:
\[ w = \arg\min_w \left\| y^T - X^T w_1 - w_0 \right\|^2 \]
(notice the problem when \( D > N \))

Classify new data by comparing \( x_0^T w \) against 0.5.
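The least-squares classification recipe above can be sketched in a few lines of NumPy. The data here are synthetic assumptions (N > D so the fit is well-posed; with D > N the problem would be rank-deficient, as the slide warns):

```python
import numpy as np

# Synthetic toy data: N instances, D features (illustrative assumption).
rng = np.random.default_rng(0)
N, D = 100, 5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = (X @ w_true > 0).astype(float)   # binary labels encoded as 0/1

# Least-squares fit with an intercept: augment X with a column of ones,
# so the solution stacks (w1, w0) into a single weight vector.
Xa = np.hstack([X, np.ones((N, 1))])
w, *_ = np.linalg.lstsq(Xa, y, rcond=None)

# Classify a new instance by thresholding the linear output at 0.5.
x_new = rng.normal(size=D)
score = np.concatenate([x_new, [1.0]]) @ w
label = int(score > 0.5)
```

Regressing 0/1 targets and thresholding at 0.5 is the simplest linear classifier; it works here because the labels were generated by a linear rule.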
Basics of pattern recognition
- The linear model projects an instance onto a line parallel to w
- For a linear model, the decision boundary is a hyperplane in the feature space
- If the feature dimensions correspond to voxels, then w corresponds to a brain map
[adapted from Jaakkola]
Classification and decision theory

Assume the class-conditional densities \( p(x|y) \) and class probabilities \( P(y) \) are known. Minimizing the overall probability of error corresponds to
\[
y_0 = \arg\max_y P(y|x_0) = \arg\max_y \frac{p(x_0|y)\,P(y)}{p(x_0)} \quad \text{(Bayes' rule)} = \arg\max_y p(x_0|y)\,P(y)
\]
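The decision rule above can be sketched for one-dimensional Gaussian class-conditionals; the means, standard deviations, and priors are illustrative assumptions, and the evidence p(x0) is dropped because it does not change the argmax:

```python
import math

def gauss_pdf(x, mu, sigma):
    # Density of a 1-D Gaussian N(mu, sigma^2)
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

priors = {0: 0.7, 1: 0.3}                  # P(y), illustrative
params = {0: (0.0, 1.0), 1: (2.0, 1.0)}    # (mean, std) of p(x|y), illustrative

def bayes_classify(x0):
    # y0 = argmax_y p(x0|y) P(y); the denominator p(x0) cancels out
    return max(priors, key=lambda y: gauss_pdf(x0, *params[y]) * priors[y])
```

With equal variances the decision boundary is a single threshold between the class means, shifted toward the rarer class by the prior ratio.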
Basics of pattern recognition
[Figure: feature matrix X (D features × N instances), model, labels y; the discriminative route models the posterior \( P(y|x) \), the generative route the likelihood \( p(x|y) \).]

Wonderful link between the discriminative model (posterior) and the generative model (likelihood):
\[ P(y|x) \propto p(x|y)\,P(y) \]

Discriminative model; e.g., logistic regression:
\[ \log\frac{P(y=1|x)}{P(y=0|x)} = w_1^T x + w_0 \quad\text{leads to}\quad P(y=1|x) = g(w_1^T x + w_0) \]

Generative model; e.g., naïve Bayes:
\[ p(x|y) = \prod_{i=1}^{D} p(x_i|y) \]
where the features are modeled as independent random variables.
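The logistic-regression link above can be made concrete: g is the sigmoid, and the log-odds equal the linear score. The weights here are illustrative assumptions, not a fitted model:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

w1 = [1.5, -0.5]   # hypothetical weight vector w1
w0 = 0.2           # hypothetical bias w0

def posterior(x):
    # P(y=1|x) = g(w1^T x + w0)
    t = sum(wi * xi for wi, xi in zip(w1, x)) + w0
    return sigmoid(t)

def classify(x):
    # Thresholding the posterior at 0.5 is the same as thresholding
    # the log-odds (the linear score) at 0.
    return int(posterior(x) > 0.5)
```

Note that log(P(y=1|x) / P(y=0|x)) recovers exactly the linear score, which is the identity on the slide.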
Basics of pattern recognition

Model order selection for a model \( \theta_K \) with K parameters:
\[ \underbrace{-2 \log p(x|\theta_K, y)}_{\text{log-likelihood}} + \text{penalty}(\theta_K) \]
- Occam's razor: make the model as simple as possible
- Akaike information criterion: penalty = 2K
- Bayesian information criterion: penalty = K log D
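The two criteria above can be sketched directly; both score a model as −2 × log-likelihood plus a complexity penalty, and the lower score wins. The log-likelihood values below are illustrative assumptions. (The slide writes the BIC penalty with log D; the common textbook form penalizes with the log of the number of data points, which is what this sketch takes as its argument n.)

```python
import math

def aic(log_lik, K):
    # Akaike information criterion: penalty = 2K
    return -2.0 * log_lik + 2 * K

def bic(log_lik, K, n):
    # Bayesian information criterion: penalty grows with log(n)
    return -2.0 * log_lik + K * math.log(n)

# A model with many more parameters needs a clearly better fit to win:
simple_model = aic(-120.0, K=3)    # illustrative log-likelihoods
complex_model = aic(-119.0, K=10)
```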
fMRI processing pipeline
- From fMRI raw data to decisions
[Figure: the pipeline — acquire signal → pre-process signal (motion correction, coregistration, normalisation, parcellation) → extract features (activity or connectivity; mean per voxel or mean per region; univariate or multivariate) → select features (reduce dimensionality from D' to D < D') → train classifier on the training data; the trained classifier is then applied to the test data to produce a decision.]
Preprocessing
- Due to large inter-subject variability (anatomical and functional) and artifacts, (f)MRI data must be preprocessed before feature extraction
- Possible preprocessing steps:
  - Slice-timing correction
  - Susceptibility correction
  - Motion correction
  - Detrending (scanner drift)
  - Physiological artifact removal
  - Normalisation to a common template (e.g., MNI)
    - Exploit structural MRI data (T1)
Feature extraction
- Three key principles:
  - Use engineering know-how (e.g., try to increase SNR)
  - Use domain knowledge (e.g., look at the right regions)
  - Aim at interpretability, depending on the task
- Intra- vs. inter-subject decoding
- Spatial dimensions:
  - Gray-matter segmentation (~10,000 voxels)
  - Atlasing (100–1,000 regions)
  - Interhemispheric differences
  - Searchlight approach [Kriegeskorte et al., 2006]
  - Transform domain (e.g., PCA/ICA, wavelets)
- Time dimension:
  - Exploit the task paradigm
  - Incorporate knowledge about the hemodynamic response
  - Temporal compression
  - Within and between timecourses
Feature extraction (temporal compression)
- local features: one vector per scan (e.g., intensity or derivative)
- segmental features: computed over a number of time points (e.g., sliding-window measures)
- global features: computed over the whole timecourse (e.g., mean, standard deviation)
- statistical dependencies: between timecourses (e.g., functional connectivity based on Pearson correlation)
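The feature types above can be sketched on two hypothetical voxel timecourses: global features (mean, standard deviation), a segmental sliding-window mean, and a statistical-dependency feature (Pearson correlation between timecourses). The signals and the window length are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
ts_a = rng.normal(size=T)                       # hypothetical timecourse A
ts_b = 0.6 * ts_a + 0.8 * rng.normal(size=T)    # B, correlated with A

# Global features: computed over the whole timecourse
global_feats = (ts_a.mean(), ts_a.std())

# Segmental features: sliding-window mean over `win` time points
win = 20
seg_means = np.array([ts_a[t:t + win].mean() for t in range(T - win + 1)])

# Statistical dependency: functional connectivity as Pearson correlation
fc = np.corrcoef(ts_a, ts_b)[0, 1]
```

Each choice compresses the time axis differently: global features give one number per timecourse, segmental features keep a coarse temporal profile, and correlations summarize pairs of timecourses.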
Feature selection
- Establish a subset of the most discriminative features
- Reduce the dimensionality of the feature space to D'
- Feature selection is a combinatorial search problem for the best feature subset, based on a search strategy and an objective function
- Search strategies can be univariate (select one voxel at a time) or multivariate
  - optimal (exhaustive) or suboptimal: best-N, genetic algorithms, backward/forward recursive selection, ...
- Objective functions are of three kinds:
  - filter: the objective function is not a discriminant function
  - wrapper: the objective function is the error rate of the final classifier
  - embedded: the objective function is directly based on the objective function of the final classifier
Simple method: point-biserial correlation
- For each of the D features, measure the point-biserial correlation:
\[ r_{pb} = \frac{\bar{X}_{\omega_1} - \bar{X}_{\omega_0}}{\sigma_X} \sqrt{\frac{N_0 N_1}{N^2}} \]
- Then generate a ranking according to \( |r_{pb}| \) and keep only the top D' features
[Figure: \( |r_{pb}| \) plotted against feature number (AAL brain region), sorted in decreasing order; the top D' features are retained.]
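The filter above can be sketched directly from the formula: compute r_pb per feature from two-class data, then keep the top-D' features by |r_pb|. The synthetic data, class sizes, and D' are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N0, N1, D = 30, 30, 10
X0 = rng.normal(size=(N0, D))     # class omega_0 samples (illustrative)
X1 = rng.normal(size=(N1, D))     # class omega_1 samples
X1[:, :3] += 1.5                  # only the first 3 features discriminate
X = np.vstack([X0, X1])
N = N0 + N1

# r_pb = (mean_1 - mean_0) / sigma_X * sqrt(N0*N1 / N^2), per feature
r_pb = (X1.mean(axis=0) - X0.mean(axis=0)) / X.std(axis=0) \
       * np.sqrt(N0 * N1 / N**2)

# Rank by |r_pb| and keep the top D' features
D_prime = 3
top = np.argsort(np.abs(r_pb))[::-1][:D_prime]
```

Being a correlation, r_pb is bounded by 1 in magnitude; the ranking is univariate, so interactions between features are ignored (the price of a filter method).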
Model learning
- Training set with known labels (supervised learning)
- The model and its parameters depend on the type of classifier
- Discriminative models:
  - Logistic regression
  - Fisher discriminant analysis
  - Support vector machines
  - Random forests
  - Neural networks
- Generative models:
  - Naïve Bayes (can assume different types of densities)
  - Gaussian mixture model
  - Hidden Markov model
Testing classifiers: error analysis
- Classifier predictions can be correct or wrong
- Accuracy statistics can be shown in a confusion matrix (summary table):

                 estimated
                 ω0    ω1
    truth  ω0     A     B
           ω1     C     D

- Class 0 accuracy = A/(A+B)
- Class 1 accuracy = D/(C+D)
- Accuracy = (A+D)/(A+B+C+D)
- Biostats (ω0 = healthy, ω1 = disease):
  - sensitivity ("disease is PRESENT and I report disease"): D/(C+D) = TP/(FN+TP)
  - specificity ("disease is ABSENT and I report no disease"): A/(A+B) = TN/(FP+TN)
- Perfect: B = C = 0. Be suspicious if this happens!
- Random: A = B = C = D. Same as flipping a coin
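The confusion-matrix summaries above can be sketched with the slide's A/B/C/D layout (rows = truth ω0/ω1, columns = estimated); the counts are illustrative assumptions:

```python
# Truth omega_0 (healthy):  A = true negatives, B = false positives
# Truth omega_1 (disease):  C = false negatives, D = true positives
A, B = 40, 10   # illustrative counts
C, D = 5, 45

accuracy = (A + D) / (A + B + C + D)
sensitivity = D / (C + D)   # TP / (FN + TP): disease present, disease reported
specificity = A / (A + B)   # TN / (FP + TN): disease absent, no disease reported
```

With these counts, accuracy is 0.85, sensitivity 0.90, and specificity 0.80; note that accuracy alone hides the asymmetry between the two error types.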
Testing classifiers: the ROC curve
- Increasing sensitivity can never increase specificity at the same time, and vice versa
- The Receiver Operating Characteristic (ROC) curve
  - Allows one to see the compromise over the operating range of a classifier
  - Can be used for classifier comparison, e.g., by computing the Area Under the Curve (AUC) as a summary measure of performance
[Figure: ROC curves (sensitivity vs. 1−specificity) for two classifiers — classifier 1: AUC 0.79; classifier 2: AUC 0.74.]
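An ROC curve can be sketched by sweeping a decision threshold over classifier scores and recording (1−specificity, sensitivity) pairs; the AUC can then be computed via its rank interpretation. The scores below are synthetic assumptions (ties between scores are ignored):

```python
import numpy as np

rng = np.random.default_rng(0)
scores0 = rng.normal(0.0, 1.0, size=200)   # class-0 (e.g., healthy) scores
scores1 = rng.normal(1.5, 1.0, size=200)   # class-1 (e.g., disease) scores

# Sweep the threshold from high to low; each threshold gives one ROC point.
thresholds = np.sort(np.concatenate([scores0, scores1]))[::-1]
sensitivity = np.array([(scores1 > t).mean() for t in thresholds])
one_minus_spec = np.array([(scores0 > t).mean() for t in thresholds])

# AUC via the Mann-Whitney interpretation: the probability that a random
# class-1 score exceeds a random class-0 score.
auc = (scores1[:, None] > scores0[None, :]).mean()
```

Both coordinates are non-decreasing as the threshold is lowered, which is why the curve traces from (0, 0) to (1, 1).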
Proper tuning and error estimation
- Maintain strict separation throughout
- Remember our goal: classify unseen data
- This applies at all stages of the processing chain
- Always ask yourself: what would I do if I acquired data for one more subject after my classifier is trained?
- If not, this is the main cause of over-optimistic (and in some cases even useless) results
[Figure: the full dataset (data + labels) is split into training (TR), evaluation (EV), and test (TE) sets; the classifier is trained on TR and tuned on EV until the error is acceptable, then the final performance measure is computed once on TE.]
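The strict three-way split above can be sketched in a few lines: split once, tune only on TR and EV, and touch TE a single time at the very end. The 60/20/20 proportions are an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
idx = rng.permutation(N)          # shuffle instance indices once

# 60/20/20 split into training (TR), evaluation (EV), test (TE)
tr, ev, te = idx[:60], idx[60:80], idx[80:]

# The three sets must be disjoint and together cover every instance once.
disjoint = not (set(tr) & set(ev) or set(tr) & set(te) or set(ev) & set(te))
```

Any step that looks at TE before the final measurement (feature selection, hyperparameter tuning, model choice) leaks information and inflates the reported accuracy.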
Classifier selection
- How to choose a classifier?
  - No Free Lunch Theorem: you cannot get learning for free; every classifier has its biases
  - Problem characteristics (data structure, dimensions, ...)
  - Prior experience and personal preference
  - Cross-validation + statistical testing
- Model complexity and optimal parameters?
  - Occam's razor: as simple as possible
  - Information-theoretic criteria: AIC, BIC/MDL, ...
  - Empirical approaches: cross-validation, surrogate data
[Hastie et al., 2009]
[Figure: the complexity trade-off — high reproducibility / low predictability vs. low reproducibility / high predictability.]
Challenges and opportunities
- High-dimensional learning problem
  - Large number of features compared to instances
  - Kernel trick: basically, look at X^T X instead of X X^T; the "distance" between instances depends on the definition of the kernel function
- Interpretation
  - How to get "brain maps" (e.g., showing where information is processed)
  - Integration versus segregation
  - Estimate where you are in the bias-variance trade-off
- Multimodal approaches
  - Combine different classifiers ("ensemble learning")
- Intra-subject decoding
  - Fine-grained organization of brain activation patterns ("hyperacuity")
- Inter-subject decoding
  - Multicentric datasets
- Diagnosis and prognosis
  - Prediction at the individual patient level
  - Beyond patient-or-not: predict clinical measures ~ surrogate functional markers
Recommended reading
- Pereira et al., Machine learning classifiers and fMRI: A tutorial overview, NeuroImage 45, 2009
- Richiardi et al., Machine learning with brain graphs, IEEE Signal Processing Magazine 30, 2013
- Haufe et al., On the interpretation of weight vectors of linear models in multivariate neuroimaging, NeuroImage, in press
Classifier selection
- Is classifier A better than classifier B?
- Hypothesis testing framework
  - "Do the two sets of decisions of classifiers A and B represent two different populations?"
  - Careful: the samples (sets of decisions) are dependent (same evaluation/testing dataset), and the measurement type is categorical binary (for a two-class problem)
  - ⇒ Use the McNemar test, with H0: "there is no difference between the accuracies of the two classifiers"

    test sample:  1  2  3  4  5  6  7  8  9  10
    A:            ✔  ✔  ✔  ✔  ✔  ✔  ✔  ✔  ✘  ✘
    B:            ✔  ✔  ✘  ✔  ✔  ✔  ✘  ✔  ✔  ✘
Selection with the McNemar test
- Compute the relationships between the classifier decisions:

            B ✔    B ✘
    A ✔     N11    N10
    A ✘     N01    N00

- If H0 is true, we should have N01 = N10 = 0.5 (N01 + N10)
- We can measure the following test statistic based on the observed counts:
\[ x^2 = \frac{\left(|N_{01} - N_{10}| - 1\right)^2}{N_{01} + N_{10}} \]
- Then compare \( x^2 \) to the critical value of \( \chi^2 \) (1 DOF) at a level of significance α (requires N01 + N10 > 25)
[After Dietterich, 1998]
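The McNemar statistic above (with its continuity correction) takes one line to compute; only the discordant counts N01 and N10 matter. The counts and the α = 0.05 critical value (3.84 for χ² with 1 DOF) below are illustrative:

```python
def mcnemar_stat(n01, n10):
    # x^2 = (|N01 - N10| - 1)^2 / (N01 + N10), continuity-corrected
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

# Illustrative discordant counts (note the requirement N01 + N10 > 25)
x2 = mcnemar_stat(30, 10)
reject_h0 = x2 > 3.84   # chi-square critical value, 1 DOF, alpha = 0.05
```

With 30 vs. 10 discordant decisions, x² = 9.025 > 3.84, so H0 (equal accuracies) would be rejected at α = 0.05.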