Supervised Learning applied to Neuroimaging2 › staff › J.Shawe-Taylor › courses › J2.pdf ·...
Application of Supervised Learning to Neuroimaging
Janaina Mourao-Miranda
Supervised Learning
Input:
X1
X2
X3
Output
y1
y2
y3
Learning/Training
Generate a function or hypothesis f such that
Training Examples:
(X1, y1), (X2, y2), . . .,(Xn, yn)
Test
Prediction
Test Example
Xi
f(xi) -> yi
f(Xi) = yi
f
Learning
Methodology
Automatic procedures that learn a task from a series of examples
No mathematical model available
Machine Learning Methods
• Artificial Neural Networks
• Decision Trees
• Bayesian Networks
• Gaussian Process
• Support Vector Machines
• ...
• SVM is a classifier derived from statistical learning
theory by Vapnik and Chervonenkis
• SVMs introduced by Boser, Guyon, Vapnik in
COLT-92
• Powerful tool for statistical pattern recognition
Advantages of pattern recognition analysis in Neuroimaging
Explore the multivariate nature of neuroimaging data
• MRI/fMRI data are multivariate by nature, since each scan contains information about brain activity at thousands of measured locations (voxels).
• Considering that most brain functions are distributed processes involving a network of brain regions, it seems advantageous to use the spatially distributed information contained in the data to gain a better understanding of brain function.
• Can yield greater sensitivity than conventional analysis.
Can be used to make predictions for new examples
• Enables clinical applications: previously acquired data can be used to make a diagnosis or prognosis for a new subject.
Classical approach: Mass-univariate Analysis (e.g. GLM)
Input: data from task 1 vs. task 2
Output: map of activated regions
Pattern recognition approach: Multivariate Analysis
SVM training. Input: volumes from task 1, volumes from task 2, ... Output: map of discriminating regions between task 1 and task 2
SVM test. Prediction for a new example: task 1 or task 2
[Figure: voxel time series (BOLD signal intensity over time) and experimental design]
fMRI Data Analysis
Each fMRI volume is treated as a vector in an extremely high-dimensional space
(~200,000 voxels, i.e. dimensions, after the mask)
fMRI data as input to a classifier
Each volume is a vector representing the pattern of brain activation, e.g. [2 8 4 2 5 4 8 4 8]
Using pattern recognition to distinguish between object categories
[Figure: data matrix (voxels x time, in trials or scans) split into train and test; a new input yields a classification decision]
Classification in Neuroimaging: 2D toy example
[Figure: volumes at t1-t4, labelled task 1 or task 2, plotted as points in a 2D space (voxel 1 vs. voxel 2); the weight vector w separates the classes, and a volume from a new subject is assigned the label task 1 or task 2]
Classification in High Dimensions
[Figure: many possible separating hyperplanes in the voxel space, with weight vector w]
Data: <xi, yi>, i = 1, ..., N
Observations: xi ∈ R^d
Labels: yi ∈ {-1, +1}
All hyperplanes in R^d are parameterized by a vector w and a constant b. They can be expressed as:
<w, φ(x)> + b = 0, with φ: R^d → R^N
In high dimensions there are many possible hyperplanes.
Our aim is to find a hyperplane/decision function
f(x) = sgn(<w, φ(x)> + b)
that correctly classifies our data, e.g. f(x1) = +1 for (x1, +1) and f(x2) = -1 for (x2, -1).
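As a toy illustration of this decision function (with φ taken as the identity map and hypothetical values for w and b), in Python:

```python
import numpy as np

# Hyperplane parameters (hypothetical values for a 2-voxel toy example)
w = np.array([0.45, 0.89])   # weight vector, one entry per voxel
b = -2.0                     # offset

def f(x):
    """Decision function: the sign of <w, x> + b."""
    return int(np.sign(w @ x + b))

x1 = np.array([4.0, 2.0])    # lands on the +1 side of the hyperplane
x2 = np.array([1.0, 0.5])    # lands on the -1 side
print(f(x1), f(x2))          # 1 -1
```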
Simplest Approach: Fisher Linear Discriminant
[Figure: examples X1(t1), X1(t3), X2(t2), X2(t4) in the voxel space (voxel 1 vs. voxel 2), class means m1 and m2, weight vector w and threshold; projections onto the learnt weight vector are shown for FLD with and without correction]
Fisher Linear Discriminant
[Figure: +/- examples in the voxel space (voxel 1 vs. voxel 2), projected means μ1 and μ2, weight vector w and threshold]
Fisher Discriminant is a classification function:
f(x) = sgn(<w, φ(x)> + b)
where the weight vector w is chosen to maximize the quotient
J(w) = (μw+ - μw-)^2 / (σw+^2 + σw-^2)
μw+, μw-: means of the projections of the +/- examples
σw+, σw-: the corresponding standard deviations
Find the direction w that maximizes the separation of the means, scaled according to the variances in that direction.
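A minimal numerical sketch of this criterion, using the standard closed-form maximizer w ∝ Sw^-1 (μ+ - μ-) on synthetic two-voxel data (all values below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy two-class data: 2 "voxels", 50 examples per class (hypothetical)
Xp = rng.normal(loc=[2.0, 3.0], scale=0.5, size=(50, 2))   # + class
Xn = rng.normal(loc=[1.0, 1.0], scale=0.5, size=(50, 2))   # - class

# Fisher direction: w proportional to Sw^-1 (mu+ - mu-),
# the closed-form maximizer of the quotient J(w)
mu_p, mu_n = Xp.mean(axis=0), Xn.mean(axis=0)
Sw = np.cov(Xp.T) + np.cov(Xn.T)          # pooled within-class scatter
w = np.linalg.solve(Sw, mu_p - mu_n)
w /= np.linalg.norm(w)

# Projections onto w: the projected class means are well separated
proj_p, proj_n = Xp @ w, Xn @ w
print(proj_p.mean() > proj_n.mean())      # True
```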
Regularized version:
J(w) = (μw+ - μw-)^2 / (σw+^2 + σw-^2 + λ‖w‖^2)
Optimal Hyperplane: largest margin classifier
Among all hyperplanes separating the data there is a unique optimal hyperplane: the one with the largest margin γ (the distance of the closest points to the hyperplane).
Let us consider that all test points are generated by adding bounded noise r to the training examples (test and training data are assumed to have been generated by the same distribution).
If the optimal hyperplane has margin γ > r, it will correctly separate the test points. As r is unknown, the best we can do is maximize the margin γ.
Support Vector Machine: the maximal margin classifier
Data: <xi, yi>, i = 1, ..., N
Observations: xi ∈ R^2
Labels: yi ∈ {-1, +1}
Optimization problem (convex quadratic program):
min over w, b, γ, ξ:   -γ + C Σi ξi
subject to:   yi(<w, φ(xi)> + b) ≥ γ - ξi,  ξi ≥ 0,  i = 1, ..., N,  and ‖w‖^2 = 1
w: weight vector;  γ: margin;  ξi: slack variables
C controls the trade-off between the margin and the size of the slack variables. In practice C is chosen by cross-validation. As the parameter C varies, the margin varies smoothly through a corresponding range.
For details on the SVM formulation see Kernel Methods for Pattern Analysis, J. Shawe-Taylor & N. Cristianini.
SVM decision function:
f(x) = sgn( Σ_{i=1}^{N} αi yi K(xi, x) + b )
SVM weights:
w = Σ_{i=1}^{N} αi yi φ(xi)
In the linear case:
φ(x) = x,  K(xi, xj) = <xi, xj>
αi ≠ 0 only for inputs that lie on the margin (i.e. support vectors).
The trade-off parameter C between accuracy and regularization directly controls the size of the αi.
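A small sketch of these quantities using scikit-learn's SVC on synthetic 2-D data (the data and parameter values are hypothetical, not from the fMRI experiments discussed here):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Two well-separated clouds in 2-D (hypothetical toy data)
X = np.vstack([rng.normal([0, 0], 0.5, (20, 2)),
               rng.normal([3, 3], 0.5, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# alpha_i != 0 only for the support vectors: most training points drop out
print(len(clf.support_), "support vectors out of", len(X))

# In the linear case w = sum_i alpha_i y_i x_i;
# sklearn's dual_coef_ already stores the products alpha_i * y_i
w = clf.dual_coef_ @ clf.support_vectors_
assert np.allclose(w, clf.coef_)
```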
How to interpret the SVM weight vector?
[Figure: training examples for task 1 and task 2 in a 2D space (voxel 1 vs. voxel 2), the separating hyperplane H, and the weight vector w]
Weight vector (Discriminating Volume): W = [0.45 0.89]
• The value of each voxel in the discriminating volume indicates the importance of that voxel in differentiating between the two classes or brain states.
Pattern Recognition Method: General Procedure
1. Standard fMRI pre-processing: realignment, normalization, smoothing
2. Split data: training and test
3. Dimensionality reduction and/or feature selection
4. Compute kernel matrix
5. ML training and test
Output: accuracy and discriminating maps (weight vector)
Kernel
A kernel is a function that, given two patterns X and X*, returns a real number characterizing their similarity:
Κ: χ x χ → ℝ
A simple similarity measure between two vectors is the dot product: <X, X*> → Κ(X, X*)
Kernel Matrix
[Figure: kernel matrix of pairwise similarities <Xi, Xj> between all examples X1, X2, ...]
In the linear case: φ(x) = x, K(xi, xj) = <xi, xj>
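A minimal sketch of computing a linear kernel matrix in Python (the data dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data matrix: 6 scans x 10 voxels (real data: ~200,000 voxels)
X = rng.normal(size=(6, 10))

# Linear kernel matrix: K[i, j] = <x_i, x_j>, one dot product per pair of scans
K = X @ X.T

# K is symmetric and only 6 x 6, however many voxels each scan has
print(K.shape)
```

This is why kernel methods suit neuroimaging: the classifier works with the N x N matrix of similarities between scans, not with the ~200,000-dimensional vectors directly.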
• The original input space can be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
Kernel Approaches and Feature Space
Instead of using two steps:
1. Mapping to the high-dimensional space: xi → φ(xi)
2. Computing the dot product in the high-dimensional space: <φ(xi), φ(xj)>
one can use the kernel trick and compute both steps together. A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:
K(xi, xj) := <φ(xi), φ(xj)>
Kernel trick
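The kernel trick can be checked numerically for the degree-2 polynomial kernel on 2-D inputs, where the expanded feature map φ is small enough to write out explicitly:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for 2-D input,
    chosen so that <phi(x), phi(z)> == (1 + <x, z>)^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

def k_poly(x, z):
    """Polynomial kernel: the same dot product, without ever building phi."""
    return (1.0 + x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(phi(x) @ phi(z), k_poly(x, z))   # both equal 25.0
```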
• Examples of commonly used kernel functions:
– Linear kernel: K(xi, xj) = xi^T xj
– Polynomial kernel: K(xi, xj) = (1 + xi^T xj)^p
– Gaussian (Radial Basis Function, RBF) kernel: K(xi, xj) = exp(-‖xi - xj‖^2 / (2σ^2))
– Sigmoid: K(xi, xj) = tanh(β0 xi^T xj + β1)
• In general, functions that satisfy Mercer's condition can be kernel functions.
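The four kernels above can be written directly in Python; the parameter values below are placeholders, not recommended settings:

```python
import numpy as np

def k_linear(xi, xj):
    return xi @ xj

def k_poly(xi, xj, p=2):
    return (1.0 + xi @ xj) ** p

def k_rbf(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def k_sigmoid(xi, xj, beta0=0.01, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)

x = np.array([1.0, 2.0])
# Sanity checks of a point against itself:
print(k_rbf(x, x))      # exp(0) = 1.0
print(k_linear(x, x))   # <x, x> = ||x||^2 = 5.0
```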
How to give the data as input to the classifier?
First Approach: Training with the whole brain
- Additional pre-processing: removal of the baseline and low-frequency components of each voxel
- Advantages: can predict single events
- Disadvantages: low signal-to-noise ratio (SNR), stationarity assumptions
Data matrix: voxels x single volumes (conditions C1 C1 C1 BL BL BL C2 C2 C2 BL BL BL)
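One simple way to sketch the "removal of the baseline and low-frequency components" step, here removing only a linear drift from a synthetic voxel time series (a real pipeline may use a high-pass filter or extra drift regressors):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical voxel time series: task signal + slow scanner drift + noise
t = np.arange(100.0)
drift = 0.05 * t                               # low-frequency (linear) drift
task = np.sin(2 * np.pi * t / 20)              # task-related fluctuation
ts = task + drift + 0.1 * rng.normal(size=100)

# Regress out baseline and linear trend with a least-squares polynomial fit
coeffs = np.polyfit(t, ts, deg=1)
ts_clean = ts - np.polyval(coeffs, t)

# Baseline and drift are gone: the cleaned series is zero-mean
print(ts_clean.mean())
```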
Second Approach: Training with temporally compressed data
- Additional pre-processing: removal of the baseline and low-frequency components of each voxel
- Advantages: high SNR
- Disadvantages: stationarity assumptions
Data matrix: voxels x mean volumes or betas (conditions C1 C1 C1 BL BL BL C2 C2 C2 BL BL BL)
Average the volumes (over blocks or over the experiment) or use the parameter estimates (betas) of the GLM model.
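The temporal-compression step can be sketched as a block average (all shapes below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical run: 6 blocks of 7 scans each, 500 voxels per scan
n_blocks, scans_per_block, n_voxels = 6, 7, 500
data = rng.normal(size=(n_blocks * scans_per_block, n_voxels))

# Temporal compression: average the scans within each block,
# giving one higher-SNR training example per block
compressed = data.reshape(n_blocks, scans_per_block, n_voxels).mean(axis=1)

print(compressed.shape)   # one example per block
```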
Third Approach: Training with regions of interest (ROIs)
- Additional pre-processing: removal of the baseline and low-frequency components of each voxel
- Advantages: lower dimensionality
- Disadvantages: stationarity assumptions; needs an a priori hypothesis to define the ROI; does not use the whole-brain information
Data matrix: selected voxels x single volumes (conditions C1 C1 C1 BL BL BL C2 C2 C2 BL BL BL), after a feature-selection method.
Fourth Approach: Spatiotemporal information
- Additional pre-processing: removal of the baseline and low-frequency components of each voxel
- Advantages: uses temporal and spatial information; no stationarity assumptions
- Disadvantages: low signal-to-noise ratio (SNR)
Data matrix: voxels x spatiotemporal observations (conditions C1 C1 C1 BL BL BL C2 C2 C2 BL BL BL), with the volumes at T1, T2, T3 concatenated into one observation.
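Building a spatiotemporal observation amounts to concatenating the volumes within a trial into one long vector (shapes below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 10 trials, 3 time points (T1-T3) per trial, 500 voxels
n_trials, n_times, n_voxels = 10, 3, 500
data = rng.normal(size=(n_trials, n_times, n_voxels))

# Spatiotemporal observation: flatten time x voxels per trial, so the
# classifier sees spatial and temporal information jointly
spatiotemporal = data.reshape(n_trials, n_times * n_voxels)

print(spatiotemporal.shape)   # one long vector per trial
```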
Examples of Applications
Can we classify brain states
using the whole brain information
from different subjects?
Application I: Classifying cognitive states
We applied an SVM classifier to predict from the fMRI scans whether a subject was looking at an unpleasant or a pleasant image.
Number of subjects: 16
Tasks: viewing unpleasant and pleasant pictures (6 blocks of 7 scans per block)
Pre-Processing Procedures
• Realignment, normalization to standard space, spatial filter.
• Mask to select voxels inside the brain.
Training Examples
• Mean volume per block
Leave-one-subject-out test
• Training: 15 subjects
• Test: 1 subject
This procedure was repeated 16 times and the results (error rates) were averaged.
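A sketch of the leave-one-subject-out procedure on synthetic data (the data, dimensions, and classifier settings are hypothetical):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_subjects, n_voxels = 16, 50
# Hypothetical data: one mean volume per class per subject
X = np.vstack([rng.normal(c, 1.0, (n_subjects, n_voxels)) for c in (-0.5, 0.5)])
y = np.array([-1] * n_subjects + [+1] * n_subjects)
subject = np.tile(np.arange(n_subjects), 2)   # subject id of each example

# Leave one subject out: train on 15 subjects, test on the held-out one
errors = []
for s in range(n_subjects):
    train, test = subject != s, subject == s
    clf = SVC(kernel="linear").fit(X[train], y[train])
    errors.append(1.0 - clf.score(X[test], y[test]))

print("mean error rate:", np.mean(errors))
```

Holding out whole subjects (rather than single scans) is what makes the estimated error rate reflect generalization to a new subject.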
Experimental Design:
Machine Learning Method: Support Vector Machine
[Figure: training subjects view pleasant or unpleasant stimuli in the fMRI scanner; the trained SVM then predicts, for a test subject's scan, whether the subject was viewing a pleasant or an unpleasant stimulus]
Results
[Figure: spatial weight vector maps at slices z = -18, -6, 6, 18, 30, 42; colour scale from -1.00 (unpleasant) to 1.00 (pleasant)]
Mourao-Miranda et al. 2006
Can we make use of the
temporal dimension in decoding?
Experiment: Emotional Images (Pleasant vs. Unpleasant)
[Figure: duty cycle of fixation followed by unpleasant or pleasant stimuli, covering volumes vt1-vt14]
Spatiotemporal observation: Vi = [v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14]
Spatiotemporal SVM: Block Design
Training example: the whole duty cycle (T1-T14), labelled unpleasant or pleasant
Spatial-Temporal weight vector: Dynamic discriminating map
[Figure: colour scale from -1.00 (unpleasant) to 1.00 (pleasant)]
Mourao-Miranda et al. 2007
[Figure: dynamic discriminating maps at T5, slices z = -18 and z = -6, panels A-F; colour scale from -1.00 (unpleasant) to 1.00 (pleasant)]
Can we classify groups using the
whole brain information from
different subjects?
Application II: Classifying groups of subjects
We applied SVM to classify depressed patients vs. healthy controls based on their patterns of activation for emotional stimuli (sad faces).
• 19 medication-free depressed patients vs. 19 healthy controls
• The event-related fMRI paradigm consisted of affective processing of sad facial stimuli with modulation of the intensity of the emotional expression (low, medium, and high intensity).
Pre-Processing Procedures
• Realignment, normalization to standard space, spatial filter.
• GLM analysis.
Training Examples
• GLM coefficients, i.e. one example per subject
Leave-one-pair-out cross-validation
Experimental Design:
Pattern Classification of Brain Activity in Depression (train and test with GLM coefficients)
Collaboration with Cynthia H.Y. Fu
Fu et al. 2008
SVM weight – Low intensity (Hap 0)
SVM weight – Medium intensity (Sad 2)
SVM weight – High intensity (Sad 4)
Can we decode subjective pain
from whole-brain patterns of fMRI? (Andre Marquand)
Application IV: Decoding Pain Perception
We applied GP methods to predict subjective pain levels in an fMRI experiment investigating subjective responses to thermal pain.
Experimental Design:
• 15 subjects scanned 6 times over three visits (repeated-measures design)
• Thermal stimulation was delivered via a thermode attached to the subjects' right forearm
• Stimulation was individually calibrated to three subjective intensity thresholds:
1. Sensory detection threshold (SDT): the temperature at which stimulation is detectable
2. Pain detection threshold (PDT): the temperature at which it becomes painful
3. Pain tolerance threshold (PTT): the maximum tolerable temperature
• Subjects rated the perceived intensity of the stimulus using a visual analogue scale (VAS): 0 = “No sensation”, 100 = “Worst pain imaginable”
• After calibration, the actual temperature applied was invariant throughout the experiment (within subjects and stimulus classes)
Predictive Model:
1. GPR was used to predict the subjective pain rating (VAS score)
2. Whole-brain fMRI volumes were used as input to the model
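A rough sketch of such a predictive model, using scikit-learn's GP regression with a linear kernel on synthetic data (this is an illustration under invented assumptions, not the authors' implementation):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel

rng = np.random.default_rng(0)
# Hypothetical stand-in for whole-brain data: 60 scans x 100 voxels, where a
# latent "pain" variable drives part of the signal in every voxel
pain = rng.uniform(0, 100, size=60)                  # VAS-like ratings
weights = rng.normal(size=100)                       # per-voxel response strength
X = np.outer(pain / 100.0, weights) + rng.normal(size=(60, 100))

# Linear-kernel GP regression from brain volumes to pain ratings
gpr = GaussianProcessRegressor(kernel=DotProduct() + WhiteKernel())
gpr.fit(X, pain)
pred = gpr.predict(X)
print("in-sample correlation:", np.corrcoef(pred, pain)[0, 1])
```

Unlike a classifier, the regression model outputs a continuous rating, which is what allows it to be correlated with the true VAS scores.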
Results: GP Regression
For every stimulus class, GPR provided very accurate predictions of subjective pain intensity (SMSE = 0.51*, p < 1x10^-10 by permutation)
[Figure: true vs. predicted VAS scatter plots; SDT ρS = 0.60, PDT ρS = 0.73, PTT ρS = 0.87]
Marquand et al. 2009
Results: GP Regression
Relating brain activity to subjective pain intensity is not a novel finding: several brain regions have been shown to encode subjective pain intensity. We therefore compared the strength of the correlation between GPR predictions and VAS scores with the correlations derived from a number of intensity-coding brain regions.
Primary Somatosensory cortex:
• Left: ρS = 0.26
• Right: ρS = 0.12
Secondary Somatosensory cortex:
• Left: ρS = 0.27*
• Right: ρS = 0.32*
Anterior Cingulate Cortex:
• Left: ρS = 0.42*
• Right: ρS = 0.41*
Insula:
• Left: ρS = 0.37*
• Right: ρS = 0.36*
No brain region produced a correlation as strong as the GPR predictions derived from the whole brain: 'the whole is greater than any of the parts'.
Marquand et al. 2009