Deep Learning: An Overview
Bradley J Erickson, MD PhD
Mayo Clinic, Rochester
Medical Imaging Informatics and Teleradiology Conference, 1:30-2:05 pm, June 17, 2016
Disclosures
• Relationships with commercial interests:
– Board of OneMedNet
– Board of VoiceIT
What is “Machine Learning”?
• It is a part of Artificial Intelligence
• Finds patterns in data
– Patterns that reflect properties of examples (supervised)
– Patterns that separate examples (unsupervised)
• (Other types of artificial intelligence include rule-based systems)
Machine Learning Classes
• Supervised: ANN, SVM, Random Forest, Bayes, DNN
• Unsupervised: Clustering, Adaptive Resonance
• Reinforcement
Machine Learning History
• Artificial Neural Networks (ANN)
– Starting point of machine learning
– Early versions didn’t work well
• Other Machine Learning Methods
– Naïve Bayes
– Support Vector Machine (SVM)
– Random Forest Classifier (RFC)
Artificial Neural Network/Perceptron
[Diagram: inputs T1 Pre, T1 Post, and T2 feed an input layer; each node computes f(Σ), a weighted sum passed through an activation function, across a hidden layer to an output layer labeling Tumor vs. Brain. In the worked example, input values (45, 322, 128) propagate through the network to outputs 1 (Tumor) and 0 (Brain).]
How ANNs Learn
• Propagation
– Multiply each prior-layer node value by its weight and sum
– Apply an activation function, e.g. threshold the sum
• Weight Update
– Compute error = actual output - expected output
– Weight gradient = error × input value
– New weight = old weight - learning rate × gradient
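The propagation and weight-update steps above can be sketched as a minimal perceptron with a threshold activation. The data (logical AND), learning rate, and epoch count are illustrative choices, not from the slides:

```python
# Minimal perceptron sketch of the propagation / weight-update steps above.
def step(x):
    return 1.0 if x >= 0.0 else 0.0  # threshold activation

def train_step(weights, bias, inputs, target, lr=1.0):
    # Propagation: multiply each input by its weight, sum, then activate.
    output = step(sum(w * x for w, x in zip(weights, inputs)) + bias)
    # Weight update: error = actual output - expected output,
    # gradient = error * input, new weight = old weight - lr * gradient.
    error = output - target
    weights = [w - lr * error * x for w, x in zip(weights, inputs)]
    bias = bias - lr * error
    return weights, bias

# Learn logical AND from four examples.
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = [0.0, 0.0], 0.0
for _ in range(20):
    for x, t in data:
        w, b = train_step(w, b, x, t)

preds = [step(sum(wi * xi for wi, xi in zip(w, x)) + b) for x, _ in data]
print(preds)  # [0.0, 0.0, 0.0, 1.0]
```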
Learning = Optimization Problem
• Learning depends on:
– ‘Correct’ gradient directions
– ‘Correct’ gradient multiplier (learning rate)
[Diagram: loss surface showing a global minimum, a local minimum, and a flat region with a small gradient]
Support Vector Machines
• Maps input data to new ‘space’
• Creates hyperplane that separates classes in that space
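The "map to a new space, then separate with a hyperplane" idea can be illustrated with a hand-picked feature map. This is a sketch of the concept, not a real SVM solver; the data, the map, and the threshold are my own illustrative choices:

```python
# Points inside [-1, 1] are class 0, outside are class 1:
# not separable by a single threshold on x alone.
xs     = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [  1,    1,    0,   0,   0,   1,   1 ]

def phi(x):
    # Map 1-D input into a 2-D space (x, x^2); a hyperplane
    # (here a horizontal line) in that space separates the classes.
    return (x, x * x)

def classify(x):
    _, x2 = phi(x)
    return 1 if x2 > 2.25 else 0   # the line x2 = 2.25 splits the classes

print([classify(x) for x in xs] == labels)  # True
```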
Deep Learning: Why the Hype?
Performance in ImageNet Challenge
Team / Software Year Error Rate
XRCE (not Deep Learning) 2011 25.8%
SuperVision (AlexNet) 2012 16.4%
Clarifai 2013 11.7%
GoogLeNet (Inception) 2014 6.66%
Andrej Karpathy (human comparison) 2014 5.1%
BN-Inception (Arxiv) 2015 4.9%
Inception-v3 (Arxiv) 2015 3.46%
What is “Deep Learning”?
• “Deep” because it uses many layers
– ANNs typically had 3 or fewer layers
– DNNs have 15+ layers
Types of DNNs
• Convolutional Neural Network (CNN)
– Early layers have ‘windows’ of image as input
– Multiplied by a ‘kernel’ to get output
– Known as a convolution
[Worked example: a 3×3 kernel (values 1 2 1 / 1 2 1 / 2 4 2) slides across a 5×5 image; at each position the element-wise products of window and kernel are summed and divided by 9, yielding outputs such as 53 and 67]
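The sliding-window computation can be sketched as a plain "valid" convolution (as in most CNN frameworks, really a cross-correlation). The toy image below is illustrative rather than the slide's values, and the per-window division by 9 is omitted:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution: slide the kernel over the image."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each output value is the sum of element-wise products
            # of the current window and the kernel.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 'image'
kernel = np.array([[1., 2., 1.],
                   [1., 2., 1.],
                   [2., 4., 2.]])                  # 3x3 kernel
print(conv2d(image, kernel).shape)  # (3, 3)
```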
Why the Excitement Now? Advances That Addressed Problems
• Many layers -> Overfitting
– Implement sparsity in weights: Dropout
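A sketch of the dropout idea, assuming the common "inverted dropout" formulation (scale surviving activations by 1/(1 - rate) at training time, do nothing at inference). The rate and data are illustrative:

```python
import numpy as np

def dropout(activations, rate=0.5, rng=None, train=True):
    """Randomly zero a fraction of activations during training."""
    if not train:
        return activations  # inference: no-op with inverted dropout
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = rng.random(activations.shape) >= rate
    # Rescale survivors so the expected activation magnitude is unchanged.
    return activations * mask / (1.0 - rate)

a = np.ones(8)
print(dropout(a, rate=0.5, rng=np.random.default_rng(0)))
```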
Why the Excitement Now? Advances That Addressed Problems
• Many layers -> Vanishing Gradients
– Dropout partially addresses this
– Can use ‘pre-trained’ weights for the early layers, fix those, and train only the later layers to learn higher-level features
Typical CNNs
[Diagram: alternating Convolution and Pooling layers followed by a Fully Connected layer]
Typical CNNs
Andrej Karpathy: http://karpathy.github.io/2015/10/25/selfie/
Why the Excitement Now? Batch Normalization
• What should be the initial set of weights connecting nodes?
– All the same = no gradients
– Random. But what range of values?
• BatchNorm:
– After each Convolutional layer
– Subtract mean / divide by standard deviation
• Simple but effective
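The normalization step can be sketched directly. This is simplified: real BatchNorm also learns a per-feature scale and shift and keeps running statistics for inference; the batch here is illustrative:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch: subtract mean, divide by std."""
    mean = x.mean(axis=0)   # per-feature mean over the batch
    var = x.var(axis=0)     # per-feature variance over the batch
    return (x - mean) / np.sqrt(var + eps)  # eps avoids division by zero

batch = np.array([[1., 2.], [3., 4.], [5., 6.]])
normed = batch_norm(batch)
print(normed.mean(axis=0))  # approximately [0, 0]
```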
Why the Excitement Now? Residual Networks
*Targ, ICLR 2016
• A residual connection defines whether and how data is passed through from layer to layer.
• Makes deep network construction reliable
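A sketch of the residual idea, with small dense layers standing in for the conv layers of an actual ResNet block (an illustrative assumption): the block computes F(x) and outputs F(x) + x, so even when the learned part contributes nothing, the input still passes through.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    f = relu(x @ w1) @ w2   # the 'learned' part, F(x)
    return relu(f + x)      # skip connection adds the input back

# With zero weights the block is a pass-through for positive inputs:
x = np.array([1.0, 2.0])
z = np.zeros((2, 2))
print(residual_block(x, z, z))  # [1. 2.]
```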
Why the Excitement Now?
• Deep Neural Network Theory
• Exponential Compute Power Growth
Moore’s Law
Computing performance doubles approximately every 18 months
Exponentials In Real Life
• If you put 1 drop of water into a football stadium, and then double the number of drops each minute:
– At 5 minutes, you will have 32 drops
– At 45 minutes, the water will cover the field 1 inch deep
– At 55 minutes, the stadium will be full
• It is not natural for humans to grasp exponential growth
Deep Learning Works Well on GPUs
• Naturally parallel
• Lower precision (single-precision floating point) can actually be an advantage
• Vendors now build cards with no video output, optimized for Deep Learning (e.g., the P100)
GPUs are Beating Moore’s Law
[Chart: compute performance (log scale, 10 to 1,000,000) from 2000 to 2020 for CPU, GPU, TPU, and FPGA, showing specialized hardware outpacing Moore’s Law]
Deep Learning Myths
• “You Need Millions of Exams to Train and Use Deep Learning Methods”
Ways To Avoid Need For Large Data Sets
• Data Augmentation
– Essentially, creating variants of the data that are different enough to be learnable
– But similar enough that the teaching point is kept
– Mirror/Flip/Rotate/Contrast/Crop
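The listed transforms can be sketched with basic array operations. The image and the contrast factor are illustrative; real augmentation pipelines usually randomize these per example:

```python
import numpy as np

def augment(image):
    """Yield simple label-preserving variants of an image."""
    yield np.fliplr(image)              # mirror (left-right)
    yield np.flipud(image)              # flip (up-down)
    yield np.rot90(image)               # 90-degree rotation
    yield np.clip(image * 1.2, 0, 255)  # mild contrast/intensity change

img = np.arange(16, dtype=float).reshape(4, 4)
variants = list(augment(img))
print(len(variants))  # 4
```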
Ways To Avoid Need For Large Data Sets
• Data Augmentation
• Transfer Learning
[Diagram: Image → stacked Conv and MaxPool layers → three Fully Connected layers → SoftMax]
Train on Large Corpus like ImageNet
Ways To Avoid Need For Large Data Sets
• Data Augmentation
• Transfer Learning
[Diagram: the same network with the early Conv and MaxPool layers frozen (“Freeze These Layers”) and the final Fully Connected layers retrained (“Train this”)]
Take Home Point
• Deep Learning Learns Features and Connections vs Just Connections
[Diagram: traditional pipeline: Hand-Crafted Feature Extraction → Classifier; deep learning: Learned Feature Extractor → Classifier]
Examples of CNN in Medical Imaging: Body Part
*Roth, Arxiv 2016
Examples of CNN in Medical Imaging: Segmentation
Moeskops, IEEE-TMI, 2016
Mayo: AutoEncoder for Segmentation
• Dataset
– Trained on BraTS 2015
– FLAIR enhancing signal
• Preprocessing
– N4 bias correction
– Nyúl intensity normalization
• Autoencoders trained on 110,000 ROIs (size = 12)
• Time: 1 hour for 155 slices (a DNN would take days or weeks)
Korfiatis, Submitted
What is an AutoEncoder?
Korfiatis, Submitted
Dice = 0.92 on the BraTS dataset
Korfiatis, Submitted
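Conceptually, an autoencoder squeezes the input through a narrow "code" and reconstructs it. A minimal sketch with hand-set weights so the idea is visible without a training loop; a real autoencoder learns W_enc and W_dec by minimizing reconstruction error:

```python
import numpy as np

# Orthonormal basis of a 2-D subspace of 4-D space (illustrative).
u = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2)
v = np.array([0.0, 0.0, 1.0, 1.0]) / np.sqrt(2)
W_enc = np.stack([u, v], axis=1)   # encoder: 4-D input -> 2-D code
W_dec = W_enc.T                    # decoder: 2-D code  -> 4-D reconstruction

def encode(x):
    return x @ W_enc

def decode(code):
    return code @ W_dec

x = 3.0 * u + 2.0 * v      # an input lying in the subspace
code = encode(x)           # compressed two-number representation
recon = decode(code)       # reconstruction from the code
print(np.allclose(recon, x))  # True
```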
Machine Learning & Radiomics
• Computers find textures reflecting genomics: 1p19q
• 85 Subjects with FISH results, computed multiple textures, SVM
Method        # Features  Sens  Spec  F-score  Accuracy
SVM           10          0.91  0.87  0.93     0.91
Abstract      10          0.95  0.93  0.96     0.95
Naïve Bayes   12          0.95  0.77  0.92     0.89
Erickson, Proc ASNR, 2016
Machine Learning & Radiomics
• 155 Subjects, GBM, MGMT Methylation
• Compute textures (T2 was best) -> SVM
Korfiatis, Med Phys, 2016
Deep Learning: MGMT Methylation
• Same set of patients, using VGGNet with transfer learning: Az = 0.86
• Autoencoder is giving nearly as good performance and trains about 10x faster
• Now testing DeepMedic and RNN
Korfiatis, unpublished
The Pace of Change
Will Computers Replace Radiologists?
• Deep Learning will likely be able to create reports for diagnostic images in the future.
– 5 years: Mammo & CXR
– 10 years: CT Head, Chest, Abd, Pelvis, MR head, knee, shoulder, US: liver, thyroid, carotids
– 15-20 years: most diagnostic imaging
• Will likely ‘see’ more than we do today
• Will allow radiologists to focus on patient interaction and invasive procedures
How Might Medicine Best Embrace Deep Learning?
How Might Medicine Best Embrace Deep Learning?
• Algorithms for Machine Learning are rapidly improving; CNNs are not the only game in town
• Hardware for Machine Learning is REALLY rapidly improving
• The amount of change in 20 years will be unbelievable
How Might Medicine Best Embrace Deep Learning?
• Medicine needs to remain flexible about hardware and software
• The VALUE is in the data and metadata
• Physicians are OBLIGATED to make sure the data are properly handled.
– Improper interpretation of data will lead to bad implementations and poor patient care
– Non-cooperation is also counter-productive