London, UK, 21 May 2012. Janaina Mourao-Miranda, Machine Learning and Neuroimaging Lab.
Machine Learning Methods for Population Neuroimaging
Thomas E. Nichols
University of Warwick
Neuroimaging: More than eye-candy?
Neuroimaging Frisson
• Brain
– Seat of behavior, cognition, consciousness
– Essential for unraveling mental disorders
• Profound cost to society
– Cost of treatment, welfare/disability & loss of earnings
» UK: £100 billion/year | US: $300 billion/year
• Functional neuroimaging
– Before, we only had neuropsychology
• Wait for cerebral accidents & observe behavior change
– E.g. stroke, industrial accidents
– Now have Functional MRI
Phineas Gage
Magnetic Resonance Imaging
T1 Map T2 Map Grey Matter (GM) White Matter (WM)
Fractional Anisotropy (FA) Mean Diffusivity (MD) Susceptibility Weighting (SWI) Cerebral Blood Volume (CBV)
Different Types of MRI Acquisitions/Parameters
Different Possible Analysis Types
BOLD fMRI
Cortical Thickness
Diffusion Tensors
Tract-Based Analysis of FA
• MRI - rich set of tools
– Multiple “dials” to assess brain anatomy & physiology
“Old” Neuroimaging: Observational Studies of Structure
• “Voxel Based Morphometry” Study of London Taxi cab drivers
– n=16 Taxi drivers
• Mean 2 yrs studying “The Knowledge”
– n=30 controls
• Hippocampal volume differences, increasing with time as driver
– Hippocampus key for long-term memory formation
• Methods
– Small, observational sample
– Mass Univariate Model
• Parametric RFT
Maguire et al. (2000). Navigation-related structural change in the hippocampi of taxi drivers. PNAS, 97(8), 4398–403.
Volume Differences vs. Time as Driver
Brain Volume Differences: Taxi Drivers > Controls
Traditional Neuroimaging
• Traditionally a small sample affair
– Typical group size 10!
• Low n = low power
– Power: Prob. of true positive
– But if a study has low power, even with a true effect, you'll never replicate it
Random sample of 300 articles using fMRI in 2007: Carp (2012) NeuroImage.
Power of 730 studies in neuroscience, median power = 20%: Button et al. (2013) Nature Reviews Neuroscience.
Population Neuroimaging
• Population Sampling
– Not using a sample of convenience
– A selection method that should equally sample all members of a population
• E.g. based on voter registration, doctor
– Representative of a larger population
– And a larger sample
Machine Learning for Prediction
Prediction – The basics
• Supervised Learning
– Build a prediction algorithm that maps inputs to an output
• Input: data/features
– E.g. disease duration, blood work, age, etc.
• Output: labels/target variable
– E.g. “success” (no symptoms for 6 months), symptom score
– Must know ‘truth’, true labels
– Build algorithm using ‘training’ set, evaluate with ‘test’ set
• Unsupervised Learning
– Exploring heterogeneity in data
– Do groups of subjects cluster / segregate?
• Based on which variables?
– After unsupervised learning, results may give you an idea for a supervised model
• Core goal of statistics:
– Make an inference on population sampled
– E.g. On average, do women have longer hair than men?
• Randomly sample men, women, measure hair length
• Test H0: μMensHair = μWomensHair vs. Ha: μMensHair < μWomensHair
• If reject Ho, conclude something about population means
– Very likely with sufficient data
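The inference question above can be sketched with simulated data; the means, standard deviations, and sample sizes here are made up for illustration:

```python
# Hypothetical illustration of the inference goal: a one-sided two-sample
# test of mean hair length, on simulated samples of 60 men and 40 women.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
men = rng.normal(loc=10, scale=5, size=60)     # simulated hair lengths (cm)
women = rng.normal(loc=30, scale=10, size=40)

# H0: mu_men = mu_women  vs.  Ha: mu_men < mu_women
t, p = stats.ttest_ind(men, women, equal_var=False, alternative="less")
print(f"t = {t:.2f}, p = {p:.3g}")
```

Rejecting H0 here supports a conclusion about the population means, not about any individual.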
Statistics vs. Machine Learning: Inference vs. Prediction
[Figure: true population distributions of men's and women's hair length, and the sampling distributions of the mean over 60 men and 40 women, with standard errors σM/√60 and σW/√40]
Statistics vs. Machine Learning: Inference vs. Prediction
• Core goal of (supervised) machine learning:
– Make individual predictions, often the inverse question
– Can I use Hair Length to predict gender?
• Yes, but not perfectly
– Population distributions set a limit on accuracy
– No increase in precision from averaging to help you
[Figure: true population distributions of men's and women's hair length]
Multivariate Machine Learning
• What if, instead of just using hair length, we could use other variables?
– Number of pairs of baseball caps owned?
– Indeed… often considering multiple variables gives better prediction
Common to all predictive models:
• Training data, {labels, features}
• Classifier/predictor
• Key aspects:
• What form does the classifier/predictor take?
• How do you train/estimate it?
• Internal prediction vs. external prediction
[Figure: scatter plot of # caps vs. hair length, with predicted label = M or F?]
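A sketch of such a two-feature classifier on made-up data (hair length plus number of caps owned); a logistic regression stands in for the unspecified predictor:

```python
# Illustrative two-feature gender classifier on simulated data:
# features are hair length and number of baseball caps owned.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 100
y = rng.integers(0, 2, size=n)  # labels: 0 = M, 1 = F
hair = np.where(y == 1, rng.normal(30, 10, n), rng.normal(10, 5, n))
caps = np.where(y == 1, rng.poisson(1, n), rng.poisson(4, n))
X = np.column_stack([hair, caps])  # training features

clf = LogisticRegression().fit(X, y)  # train on {labels, features}
print("training accuracy:", clf.score(X, y))
```

Note this reports training (internal) accuracy only; external accuracy requires held-out data, as the cross-validation slides below discuss.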
Key Prediction Concept: Overfitting
• Trying to predict continuous response (green curve)
• For this n=10 dataset, an order-9 polynomial fits perfectly!
• But it likely won’t generalize
• Thus essential to use cross-validation schemes
– Must test on held-out data, to estimate generalization accuracy
[Figure: true function (green curve) vs. fitted polynomial]
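The overfitting point can be reproduced with simulated data (a noisy sine curve stands in for the green true function):

```python
# An order-9 polynomial interpolates n=10 noisy points essentially
# exactly, but generalizes worse than a lower-order fit.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 10)  # noisy training data

p9 = np.polyfit(x, y, 9)  # order-9: zero training error
p3 = np.polyfit(x, y, 3)  # order-3: smoother fit

train_mse9 = np.mean((np.polyval(p9, x) - y) ** 2)  # ~0

# New points from the true function reveal the generalization gap
x_new = np.linspace(0.05, 0.95, 50)
y_new = np.sin(2 * np.pi * x_new)
test_mse9 = np.mean((np.polyval(p9, x_new) - y_new) ** 2)
test_mse3 = np.mean((np.polyval(p3, x_new) - y_new) ** 2)
print(train_mse9, test_mse9, test_mse3)
```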
Fitting Predictive Models: Cross-Validation
• Avoiding Over-fitting
– To get good estimate of accuracy on truly new, unseen data
• Leave One Out Cross-Validation (LOOCV)
– Hold out one case/subject's data, treat as “new”
– Fit model on “held in” data
– Make prediction on “held out” data
– Repeat N times, giving ‘held-out’ estimate for each case
– Simplest approach; gives unbiased estimates of accuracy
– Most computationally expensive, gives variable estimates of accuracy
Leave One Out Cross-Validation
[Figure: 20 subjects; in each of 20 runs (run 1 … run 20), one subject's data is the test datum and the remaining 19 subjects are the training data]
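The LOOCV scheme can be sketched as follows (simulated data; any classifier could stand in for the predictor):

```python
# LOOCV: each of the 20 subjects is held out once; the model is refit
# 20 times on the remaining 19, giving one held-out score per subject.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))  # 20 subjects x 5 features (made up)
y = (X[:, 0] + 0.5 * rng.normal(size=20) > 0).astype(int)

scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
print("held-out accuracy:", scores.mean())  # mean of 20 per-subject scores
```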
Fitting Predictive Models: Cross-Validation
• K-fold cross-validation
– Divide data into K “folds”
– Fit with remaining K-1 “folds”
– Predict each of the held-out data
– Repeat K times
– More computationally efficient, and less variable estimates of accuracy
• Generally K-fold CV is recommended
– K=10 typical
[Figure: illustration of 4-fold CV; each of four folds serves once as test data while the other three are training data]
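The same idea with K=10 folds, again on simulated data:

```python
# 10-fold CV: only K model fits instead of N, with each datum
# predicted exactly once from a model that never saw it.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)  # label determined by first feature

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)
print("10-fold accuracy:", scores.mean())
```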
Prediction Methods: Ridge & Lasso Regression
[Figure panels: GLM (no regularization); Ridge with λ = 1.5×10⁻⁸; Ridge with λ = 1]
[Figure: illustration of 4-fold nested CV; within the training data of outer fold run 1, an inner 4-fold split (run 1/1 … run 1/4) is made]
• Use the inner folds' test data to optimize λ
• Then use the optimal λ to finally predict run 1's test data
• Revisit curve-fitting… with Ridge Regression
• No automatic way to find λ!
– Must use another cross-validation!
– Nested CVs can be very slooow
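A sketch of nested CV for ridge regression on hypothetical data (the λ grid is arbitrary); in scikit-learn the ridge penalty λ is called `alpha`:

```python
# Nested CV: the inner CV picks lambda (alpha); the outer CV estimates
# the generalization accuracy of the whole tuning-plus-fitting procedure.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 20))
y = X @ rng.normal(size=20) + rng.normal(size=80)  # simulated response

inner = KFold(5, shuffle=True, random_state=1)  # tunes lambda
outer = KFold(4, shuffle=True, random_state=2)  # estimates accuracy
grid = GridSearchCV(Ridge(), {"alpha": np.logspace(-3, 3, 7)}, cv=inner)
scores = cross_val_score(grid, X, y, cv=outer)  # 4 outer-fold R^2 values
print("outer-fold R^2:", scores.mean())
```

Note the cost: every outer fold reruns the full inner grid search, which is why nested CVs are slow.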
Population Neuroimaging: HCP Application
Human Connectome Project (HCP) (1)
• Missouri Twin Registry (MOTWIN) – Ascertainment from birth records
• State of Missouri Division of Vital Statistics
– In 1990s, parents of all twins born to Missouri residents from 1975 to 1991 invited to participate
– HCP ran from 2010-2015
• Twin ages at study start: 20-36 years
• Families with at least 4 offspring selected, all siblings invited
• “Extended Twin Design”
– Allows heritability to be estimated, also improves power to detect genetic associations
Human Connectome Project (HCP) (2)
• Target sample
– n=1,200 (300 families of 4)
• Extensive testing with standard psychological, health-history tests
• Extensive state-of-the-art MRI
– Structural MRI
• 2× T1w, 2× T2w, 0.7 mm isotropic
– Functional MRI, Task & Rest
• TR = 0.72 s, 2 mm isotropic, 4× 15 min resting
– Diffusion MRI
• 1.25 mm isotropic
HCP Mega Trawl
• Can we predict fundamental subject features with resting-state fMRI connectivity?
• Would like directional “arrows”
– In practice, all we get are undirected edges
Joint work with Steve Smith, Oxford & HCP team
HCP Mega Trawl
• 50 nodes (defined by ICA)
– Gives 50 × 50 network matrix
HCP Mega Trawl
• 50 nodes (defined by ICA)
– Gives 50 × 50 network matrix… clustered
[Figure: clustered (full) correlation matrix]
HCP Mega Trawl
• 50 nodes (defined by ICA)
– Partial correlation sparser than full
[Figure: partial correlation vs. (full) correlation matrices]
HCP MegaTrawl: Predict each SM with NetMat
• Network Matrix (NetMat) for each subject
– 50×(50-1)/2 = 1,225 unique edges
• Also tried for networks with 25, 50, 100 & 200 nodes
– Partial correlation (r-to-z transformed) between each pair of nodes
• “Subject Measures” (SM) for each subject
– 280 behavioral and demographic measures
• Fluid IQ, life satisfaction, stress, dexterity, smoking, alcohol/drug use, sleep quality, …
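The 1,225 edge count comes from vectorizing the upper triangle of the symmetric netmat; a minimal sketch with a random stand-in matrix:

```python
# Vectorizing a 50x50 netmat: keep only the entries above the diagonal
# as the 50*(50-1)/2 = 1,225 unique edge features.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))
netmat = (A + A.T) / 2            # symmetric stand-in for a netmat
iu = np.triu_indices(50, k=1)     # indices strictly above the diagonal
edges = netmat[iu]
print(edges.shape)                # (1225,)
```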
HCP Mega Trawl
• For each SM
– 10-fold CV to estimate “prediction R2”
– For each fold
• Using held-in (9/10ths) folds
– Nuisance regression estimated & applied to SM & netmats
» Using brain/head size, motion, acquisition quarter
– Feature selection
» Each edge used to predict SM alone; best half of edges kept
– Retained edges jointly predict SM in Elastic Net (L1+L2) regression
» Optimized with 10-fold CV, nested within outer CV
• Held-out (1/10th) fold
– Nuisance adjustment, using pre-estimated model
– Prediction with final model
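The loop above can be sketched in much simplified form (simulated data standing in for netmats and an SM; the nuisance-regression step is omitted here):

```python
# Simplified MegaTrawl-style loop: within each outer fold, keep the best
# half of edges by univariate score, then fit an elastic-net regression
# tuned by an inner 10-fold CV.
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 200))  # subjects x edges (made up)
y = X[:, :10] @ rng.normal(size=10) + rng.normal(size=120)  # one "SM"

model = Pipeline([
    ("select", SelectPercentile(f_regression, percentile=50)),  # best half
    ("enet", ElasticNetCV(l1_ratio=0.5,
                          cv=KFold(10, shuffle=True, random_state=1))),
])
outer = KFold(10, shuffle=True, random_state=0)
r2 = cross_val_score(model, X, y, cv=outer, scoring="r2")
print("prediction R^2:", r2.mean())
```

Putting the feature selection inside the Pipeline is essential: selecting edges on all subjects before splitting would leak test information into training.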
Prediction Evaluation Measures: Continuous Responses – MegaTrawl
• Coefficient of Determination (R²) is 4%
– 4% of total variance explained by predicted values
– Note the difference: 4% ≠ (r)² = 0.24² = 0.058
• CoD is -4%! No useful prediction, no variance explained
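The distinction can be checked numerically. With simulated predictions (illustrative numbers, not the MegaTrawl's), the coefficient of determination 1 - SS_res/SS_tot differs from the squared correlation and can go negative:

```python
# Coefficient of determination vs. squared correlation: R^2 penalizes
# scale and bias errors in the predictions, and can be negative even
# when the correlation r is positive.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=200)
pred = 0.24 * y + rng.normal(size=200)  # weakly informative predictions
pred_bad = pred + 5                     # same correlation, large bias

def r2(y, yhat):
    # coefficient of determination: 1 - SS_res / SS_tot
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)

r = np.corrcoef(y, pred)[0, 1]
print("r^2:", r ** 2, "CoD:", r2(y, pred), "CoD (biased):", r2(y, pred_bad))
```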
HCP Mega Trawl Redux
• Try it out!
– https://db.humanconnectome.org/megatrawl
• Disappointing
– No measure predicted well
• Except boring things like year of data acquisition, gender
• Ideally would like to relate all SMs to all NetMats
– Not just using an entire NetMat to predict one SM
HCP CCA Trawl: Relate all variables of the two sets
• Network Matrix (NetMat) for each subject
– 200×(200-1)/2 = 19,900 unique edges
– Partial correlation (r-to-z transformed) between each pair of nodes
• “Subject Measures” (SM) for each subject
– 280 behavioral and demographic measures
• Fluid IQ, life satisfaction, stress, dexterity, smoking, alcohol/drug use, sleep quality, …
• Canonical Correlation Analysis (CCA)
– Find
• some linear combination of NetMat edges that
• best correlates with
• some linear combination of SMs
Smith, et al. (2015). A positive-negative mode of population covariation links brain connectivity, demographics and behavior. Nature Neuroscience.
Exactly 1 mode found!
Conclusions
• Population Neuroimaging
– Need large sample size to overcome intrinsic power limitations of previous studies
– Need representative samples to generalise to a population other than healthy undergraduates
• Machine Learning
– Prediction is the future
– But building accurate predictive models is harder than finding population differences
• And easier to screw up