Deep Learning: An Overview
Bradley J Erickson, MD PhD
Mayo Clinic, Rochester
Medical Imaging Informatics and Teleradiology Conference, 1:30-2:05 pm, June 17, 2016
Disclosures
• Relationships with commercial interests:
– Board of OneMedNet
– Board of VoiceIT
What is “Machine Learning”?
• It is a part of Artificial Intelligence
• Finds patterns in data
– Patterns that reflect properties of examples (supervised)
– Patterns that separate examples (unsupervised)
• (Other types of artificial intelligence include rule-based systems)
Machine Learning Classes
• Supervised: ANN, SVM, Random Forest, Bayes, DNN
• Unsupervised: Clustering, Adaptive Resonance
• Reinforcement
Machine Learning History
• Artificial Neural Networks (ANN)
– Starting point of machine learning
– Early versions didn’t work well
• Other Machine Learning Methods
– Naïve Bayes
– Support Vector Machine (SVM)
– Random Forest Classifier (RFC)
Artificial Neural Network/Perceptron
[Diagram: inputs T1 Pre, T1 Post, and T2 feed an input layer; each node computes f(Σ), a weighted sum passed through an activation function, across a hidden layer to an output layer labeling Tumor vs. Brain. In the worked example, input values (45, 322, 128) propagate through the network to outputs 1 (Tumor) and 0 (Brain).]
How ANNs Learn
• Propagation
– Multiply each prior-layer node value by its weight and sum
– Apply an activation function, e.g. threshold the sum
• Weight Update
– Compute error = actual output - expected output
– Weight gradient = error × input value
– New weight = old weight - learning rate × gradient
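The propagation and weight-update steps above can be sketched as a minimal perceptron with a threshold activation. The data (logical AND), learning rate, and epoch count are illustrative choices, not from the slides:

```python
# Minimal perceptron sketch of the propagation / weight-update steps above.
def step(x):
    return 1.0 if x >= 0.0 else 0.0  # threshold activation

def train_step(weights, bias, inputs, target, lr=1.0):
    # Propagation: multiply each input by its weight, sum, then activate.
    output = step(sum(w * x for w, x in zip(weights, inputs)) + bias)
    # Weight update: error = actual output - expected output,
    # gradient = error * input, new weight = old weight - lr * gradient.
    error = output - target
    weights = [w - lr * error * x for w, x in zip(weights, inputs)]
    bias = bias - lr * error
    return weights, bias

# Learn logical AND from four examples.
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = [0.0, 0.0], 0.0
for _ in range(20):
    for x, t in data:
        w, b = train_step(w, b, x, t)

preds = [step(sum(wi * xi for wi, xi in zip(w, x)) + b) for x, _ in data]
print(preds)  # [0.0, 0.0, 0.0, 1.0]
```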
Learning = Optimization Problem
• Learning depends on:
– ‘Correct’ gradient directions
– ‘Correct’ gradient multiplier (learning rate)
[Diagram: loss surface showing a global minimum, a local minimum, and a flat region with a small gradient]
Support Vector Machines
• Maps input data to new ‘space’
• Creates hyperplane that separates classes in that space
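The "map to a new space, then separate with a hyperplane" idea can be illustrated with a hand-picked feature map. This is a sketch of the concept, not a real SVM solver; the data, the map, and the threshold are my own illustrative choices:

```python
# Points inside [-1, 1] are class 0, outside are class 1:
# not separable by a single threshold on x alone.
xs     = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [  1,    1,    0,   0,   0,   1,   1 ]

def phi(x):
    # Map 1-D input into a 2-D space (x, x^2); a hyperplane
    # (here a horizontal line) in that space separates the classes.
    return (x, x * x)

def classify(x):
    _, x2 = phi(x)
    return 1 if x2 > 2.25 else 0   # the line x2 = 2.25 splits the classes

print([classify(x) for x in xs] == labels)  # True
```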
Deep Learning: Why the Hype?
Performance in ImageNet Challenge
Team / Software Year Error Rate
XRCE (not Deep Learning) 2011 25.8%
SuperVision (AlexNet) 2012 16.4%
Clarifai 2013 11.7%
GoogLeNet (Inception) 2014 6.66%
Andrej Karpathy (human comparison) 2014 5.1%
BN-Inception (Arxiv) 2015 4.9%
Inception-v3 (Arxiv) 2015 3.46%
What is “Deep Learning”?
• “Deep” because it uses many layers
– ANNs typically had 3 or fewer layers
– DNNs have 15+ layers
Types of DNNs
• Convolutional Neural Network (CNN)
– Early layers have ‘windows’ of image as input
– Multiplied by a ‘kernel’ to get output
– Known as a convolution
[Worked example: a 3×3 kernel (values 1 2 1 / 1 2 1 / 2 4 2) slides across a 5×5 image; at each position the element-wise products of window and kernel are summed and divided by 9, yielding outputs such as 53 and 67]
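The sliding-window computation can be sketched as a plain "valid" convolution (as in most CNN frameworks, really a cross-correlation). The toy image below is illustrative rather than the slide's values, and the per-window division by 9 is omitted:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution: slide the kernel over the image."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each output value is the sum of element-wise products
            # of the current window and the kernel.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 'image'
kernel = np.array([[1., 2., 1.],
                   [1., 2., 1.],
                   [2., 4., 2.]])                  # 3x3 kernel
print(conv2d(image, kernel).shape)  # (3, 3)
```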
Why the Excitement Now? Advances That Addressed Problems
• Many layers -> Overfitting
– Implement sparsity in weights: Dropout
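A sketch of the dropout idea, assuming the common "inverted dropout" formulation (scale surviving activations by 1/(1 - rate) at training time, do nothing at inference). The rate and data are illustrative:

```python
import numpy as np

def dropout(activations, rate=0.5, rng=None, train=True):
    """Randomly zero a fraction of activations during training."""
    if not train:
        return activations  # inference: no-op with inverted dropout
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = rng.random(activations.shape) >= rate
    # Rescale survivors so the expected activation magnitude is unchanged.
    return activations * mask / (1.0 - rate)

a = np.ones(8)
print(dropout(a, rate=0.5, rng=np.random.default_rng(0)))
```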
Why the Excitement Now? Advances That Addressed Problems
• Many layers -> Vanishing Gradients
– Dropout partially addresses this
– Can use ‘pre-trained’ weights for the early layers, fix those, and train only the later layers to learn higher-level features
Typical CNNs
[Diagram: alternating Convolution and Pooling layers followed by a Fully Connected layer]
Typical CNNs
Andrej Karpathy: http://karpathy.github.io/2015/10/25/selfie/
Why the Excitement Now? Batch Normalization
• What should be the initial set of weights connecting nodes?
– All the same = no gradients
– Random. But what range of values?
• BatchNorm:
– After each Convolutional layer
– Subtract mean / divide by standard deviation
• Simple but effective
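The normalization step can be sketched directly. This is simplified: real BatchNorm also learns a per-feature scale and shift and keeps running statistics for inference; the batch here is illustrative:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch: subtract mean, divide by std."""
    mean = x.mean(axis=0)   # per-feature mean over the batch
    var = x.var(axis=0)     # per-feature variance over the batch
    return (x - mean) / np.sqrt(var + eps)  # eps avoids division by zero

batch = np.array([[1., 2.], [3., 4.], [5., 6.]])
normed = batch_norm(batch)
print(normed.mean(axis=0))  # approximately [0, 0]
```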
Why the Excitement Now? Residual Networks
*Targ, ICLR 2016
• A residual connection defines whether and how data is passed through from layer to layer.
• Makes deep network construction reliable
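A sketch of the residual idea, with small dense layers standing in for the conv layers of an actual ResNet block (an illustrative assumption): the block computes F(x) and outputs F(x) + x, so even when the learned part contributes nothing, the input still passes through.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    f = relu(x @ w1) @ w2   # the 'learned' part, F(x)
    return relu(f + x)      # skip connection adds the input back

# With zero weights the block is a pass-through for positive inputs:
x = np.array([1.0, 2.0])
z = np.zeros((2, 2))
print(residual_block(x, z, z))  # [1. 2.]
```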
Why the Excitement Now?
• Deep Neural Network Theory
• Exponential Compute Power Growth
Moore’s Law
Computing performance doubles approximately every 18 months
Exponentials In Real Life
• If you put 1 drop of water into a football stadium, and then double the number of drops each minute:
– At 5 minutes, you will have 32 drops
– At 45 minutes, the water will cover the field 1 inch deep
– At 55 minutes, the stadium will be full
• It is not natural for humans to grasp exponential growth
Deep Learning Works Well on GPUs
• Naturally parallel
• Lower precision (single-precision floating point) can actually be an advantage
• Vendors now build cards with no video output, optimized for Deep Learning (e.g., the P100)
GPUs are Beating Moore’s Law
[Chart: compute performance (log scale, 10 to 1,000,000) from 2000 to 2020 for CPU, GPU, TPU, and FPGA, showing specialized hardware outpacing Moore’s Law]
Deep Learning Myths
• “You Need Millions of Exams to Train and Use Deep Learning Methods”
Ways To Avoid Need For Large Data Sets
• Data Augmentation
– Essentially, creating variants of the data that are different enough to be learnable
– But similar enough that the teaching point is kept
– Mirror/Flip/Rotate/Contrast/Crop
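The listed transforms can be sketched with basic array operations. The image and the contrast factor are illustrative; real augmentation pipelines usually randomize these per example:

```python
import numpy as np

def augment(image):
    """Yield simple label-preserving variants of an image."""
    yield np.fliplr(image)              # mirror (left-right)
    yield np.flipud(image)              # flip (up-down)
    yield np.rot90(image)               # 90-degree rotation
    yield np.clip(image * 1.2, 0, 255)  # mild contrast/intensity change

img = np.arange(16, dtype=float).reshape(4, 4)
variants = list(augment(img))
print(len(variants))  # 4
```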
Ways To Avoid Need For Large Data Sets
• Data Augmentation
• Transfer Learning
[Diagram: Image → stacked Conv and MaxPool layers → three Fully Connected layers → SoftMax]
Train on Large Corpus like ImageNet
Ways To Avoid Need For Large Data Sets
• Data Augmentation
• Transfer Learning
[Diagram: the same network with the early Conv and MaxPool layers frozen (“Freeze These Layers”) and the final Fully Connected layers retrained (“Train this”)]
Take Home Point
• Deep Learning Learns Features and Connections vs Just Connections
[Diagram: traditional pipeline: Hand-Crafted Feature Extraction → Classifier; deep learning: Learned Feature Extractor → Classifier]
Examples of CNN in Medical Imaging: Body Part
*Roth, Arxiv 2016
Examples of CNN in Medical Imaging: Segmentation
Moeskops, IEEE-TMI, 2016
Mayo: AutoEncoder for Segmentation
• Dataset
– Trained on BraTS 2015
– FLAIR enhancing signal
• Preprocessing
– N4 bias correction
– Nyúl intensity normalization
• Autoencoders trained on 110,000 ROIs (size = 12)
• Time: 1 hour for 155 slices (a DNN would take days or weeks)
Korfiatis, Submitted
What is an AutoEncoder?
Korfiatis, Submitted
Dice = 0.92 on the BraTS dataset
Korfiatis, Submitted
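Conceptually, an autoencoder squeezes the input through a narrow "code" and reconstructs it. A minimal sketch with hand-set weights so the idea is visible without a training loop; a real autoencoder learns W_enc and W_dec by minimizing reconstruction error:

```python
import numpy as np

# Orthonormal basis of a 2-D subspace of 4-D space (illustrative).
u = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2)
v = np.array([0.0, 0.0, 1.0, 1.0]) / np.sqrt(2)
W_enc = np.stack([u, v], axis=1)   # encoder: 4-D input -> 2-D code
W_dec = W_enc.T                    # decoder: 2-D code  -> 4-D reconstruction

def encode(x):
    return x @ W_enc

def decode(code):
    return code @ W_dec

x = 3.0 * u + 2.0 * v      # an input lying in the subspace
code = encode(x)           # compressed two-number representation
recon = decode(code)       # reconstruction from the code
print(np.allclose(recon, x))  # True
```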
Machine Learning & Radiomics
• Computers find textures reflecting genomics: 1p19q
• 85 Subjects with FISH results, computed multiple textures, SVM
Method        # Features  Sens  Spec  F-score  Accuracy
SVM           10          0.91  0.87  0.93     0.91
Abstract      10          0.95  0.93  0.96     0.95
Naïve Bayes   12          0.95  0.77  0.92     0.89
Erickson, Proc ASNR, 2016
Machine Learning & Radiomics
• 155 Subjects, GBM, MGMT Methylation
• Compute textures (T2 was best) -> SVM
Korfiatis, Med Phys, 2016
Deep Learning: MGMT Methylation
• Same set of patients, using VGGNet with transfer learning: Az = 0.86
• Autoencoder is giving nearly as good performance and trains about 10x faster
• Now testing DeepMedic and RNN
Korfiatis, unpublished
The Pace of Change
Will Computers Replace Radiologists?
• Deep Learning will likely be able to create reports for diagnostic images in the future.
– 5 years: Mammo & CXR
– 10 years: CT Head, Chest, Abd, Pelvis, MR head, knee, shoulder, US: liver, thyroid, carotids
– 15-20 years: most diagnostic imaging
• Will likely ‘see’ more than we do today
• Will allow radiologists to focus on patient interaction and invasive procedures
How Might Medicine Best Embrace Deep Learning?
How Might Medicine Best Embrace Deep Learning?
• Algorithms for Machine Learning are rapidly improving; CNNs are not the only game in town
• Hardware for Machine Learning is REALLY rapidly improving
• The amount of change in 20 years will be unbelievable
How Might Medicine Best Embrace Deep Learning?
• Medicine needs to remain flexible about hardware and software
• The VALUE is in the data and metadata
• Physicians are OBLIGATED to make sure the data are properly handled.
– Improper interpretation of data will lead to bad implementations and poor patient care
– Non-cooperation is also counter-productive