Deep Residual Networks: Deep Learning Gets Way Deeper
8:30-10:30am, June 19, ICML 2016 tutorial
Kaiming He, Facebook AI Research*
*as of July 2016. Formerly affiliated with Microsoft Research Asia.
[Title-slide background: the 152-layer ResNet architecture (7x7 conv, 64, /2, pool /2; repeated 1x1/3x3/1x1 bottleneck blocks of 64/64/256, 128/128/512, 256/256/1024, and 512/512/2048 filters, with stride /2 at each stage transition; average pooling; fc 1000).]
Overview
• Introduction
  • Background
  • From shallow to deep
• Deep Residual Networks
  • From 10 layers to 100 layers
  • From 100 layers to 1000 layers
• Applications
• Q & A
Introduction
Deep Residual Networks (ResNets)
• “Deep Residual Learning for Image Recognition”. CVPR 2016 (next week)
• A simple and clean framework for training “very” deep nets
• State-of-the-art performance for
  • image classification
  • object detection
  • semantic segmentation
  • and more…
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
ResNets @ ILSVRC & COCO 2015 Competitions
• 1st places in all five main tracks
  • ImageNet Classification: “ultra-deep” 152-layer nets
  • ImageNet Detection: 16% better than 2nd
  • ImageNet Localization: 27% better than 2nd
  • COCO Detection: 11% better than 2nd
  • COCO Segmentation: 12% better than 2nd
*improvements are relative numbers
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
[Figure: ImageNet classification top-5 error (%) by year: ILSVRC'10 (shallow): 28.2; ILSVRC'11 (shallow): 25.8; ILSVRC'12 AlexNet (8 layers): 16.4; ILSVRC'13 (8 layers): 11.7; ILSVRC'14 VGG (19 layers): 7.3; ILSVRC'14 GoogleNet (22 layers): 6.7; ILSVRC'15 ResNet (152 layers): 3.57.]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
[Figure: AlexNet, 8 layers (ILSVRC 2012): 11x11 conv, 96, /4, pool /2; 5x5 conv, 256, pool /2; 3x3 conv, 384; 3x3 conv, 384; 3x3 conv, 256, pool /2; fc 4096; fc 4096; fc 1000.]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
[Figure: AlexNet, 8 layers (ILSVRC 2012) alongside VGG, 19 layers (ILSVRC 2014): VGG stacks 3x3 conv layers (64, 128, 256, 512 filters) with pooling between stages, followed by fc 4096, fc 4096, fc 1000.]
[Figure: GoogleNet, 22 layers (ILSVRC 2014): a stem of 7x7 and 3x3 convolutions with local response norm and max pooling, followed by stacked Inception modules (parallel 1x1, 3x3, 5x5 convolutions and 3x3 max pooling, depth-concatenated), two auxiliary softmax classifiers, and a final 7x7 average pool, fc, and softmax.]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
[Figure: side-by-side architectures: AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); ResNet, 152 layers (ILSVRC 2015). The ResNet column: 7x7 conv, 64, /2, pool /2; repeated 1x1/3x3/1x1 bottleneck blocks (64/64/256, 128/128/512, 256/256/1024, 512/512/2048 filters; stride /2 at stage transitions); average pooling; fc 1000.]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
[Figure: PASCAL VOC 2007 object detection mAP (%): HOG, DPM (shallow): 34; AlexNet (RCNN, 8 layers): 58; VGG (RCNN, 16 layers): 66; ResNet (Faster RCNN*, 101 layers): 86.]
*w/ other improvements & more data
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Engines of visual recognition
[Image: ResNet's object detection result on COCO. *the original image is from the COCO dataset]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Very simple, easy to follow
• Many third-party implementations (list in https://github.com/KaimingHe/deep-residual-networks)
  • Facebook AI Research's Torch ResNet
  • Torch, CIFAR-10, with ResNet-20 to ResNet-110, training code, and curves: code
  • Lasagne, CIFAR-10, with ResNet-32 and ResNet-56 and training code: code
  • Neon, CIFAR-10, with pre-trained ResNet-32 to ResNet-110 models, training code, and curves: code
  • Torch, MNIST, 100 layers: blog, code
  • A winning entry in Kaggle's right whale recognition challenge: blog, code
  • Neon, Place2 (mini), 40 layers: blog, code
  • …
• Easily reproduced results (e.g. Torch ResNet: https://github.com/facebook/fb.resnet.torch)
• A series of extensions and follow-ups
  • >200 citations in the 6 months after it was posted on arXiv (Dec. 2015)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Background
From shallow to deep
Traditional recognition
[Figure: increasingly deep hand-engineered pipelines:
pixels → classifier → “bus”?
pixels → edges → classifier → “bus”?
pixels → edges (SIFT/HOG) → histogram → classifier → “bus”?
pixels → edges → K-means/sparse code → histogram → classifier → “bus”?
shallower → deeper]
But what's next?
Deep Learning
edges → K-means/sparse code → histogram → classifier → “bus”?
Specialized components, domain knowledge required.
[deep net of generic layers] → “bus”?
Generic components (“layers”), less domain knowledge.
Repeat elementary layers => going deeper
• End-to-end learning
• Richer solution space
Spectrum of Depth
shallower → deeper
• 5 layers: easy
• >10 layers: initialization, Batch Normalization
• >30 layers: skip connections
• >100 layers: identity skip connections
• >1000 layers: ?
Initialization
LeCun et al 1998 “Efficient Backprop”; Glorot & Bengio 2010 “Understanding the difficulty of training deep feedforward neural networks”
input $x$, weight $W$, output $y = Wx$
If:
• linear activation
• $x$, $y$, $w$ independent
Then:
1-layer: $\mathrm{Var}[y] = (n_{in}\,\mathrm{Var}[w])\,\mathrm{Var}[x]$
Multi-layer: $\mathrm{Var}[y] = \big(\prod_l n^{in}_l\,\mathrm{Var}[w_l]\big)\,\mathrm{Var}[x]$
($n_{in}$: number of inputs of a layer; $n_{out}$: number of outputs)
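As a quick numerical check (illustrative NumPy, not from the tutorial), the 1-layer formula can be verified directly:

```python
import numpy as np

# Illustrative check of Var[y] = n_in * Var[w] * Var[x] for y = w . x
# (linear activation, independent w and x); the numbers are arbitrary.
n_in, var_w = 1000, 0.05 ** 2
samples = [
    (np.random.randn(n_in) * np.sqrt(var_w)) @ np.random.randn(n_in)
    for _ in range(20000)
]
print(np.var(samples))     # empirical Var[y]
print(n_in * var_w * 1.0)  # predicted: 1000 * 0.0025 * 1 = 2.5
```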
Initialization
LeCun et al 1998 “Efficient Backprop”; Glorot & Bengio 2010 “Understanding the difficulty of training deep feedforward neural networks”
Forward: $\mathrm{Var}[y] = \big(\prod_l n^{in}_l\,\mathrm{Var}[w_l]\big)\,\mathrm{Var}[x]$
Backward: $\mathrm{Var}\big[\frac{\partial E}{\partial x}\big] = \big(\prod_l n^{out}_l\,\mathrm{Var}[w_l]\big)\,\mathrm{Var}\big[\frac{\partial E}{\partial y}\big]$
[Figure: signal magnitude vs. depth (1-15), showing exploding, ideal, and vanishing regimes.]
Both forward (response) and backward (gradient) signals can vanish/explode.
Initialization
• Initialization under the linear assumption
LeCun et al 1998 “Efficient Backprop”; Glorot & Bengio 2010 “Understanding the difficulty of training deep feedforward neural networks”
$\prod_l n^{in}_l\,\mathrm{Var}[w_l] = const$ (healthy forward) and
$\prod_l n^{out}_l\,\mathrm{Var}[w_l] = const$ (healthy backward)
⇒ $n^{in}_l\,\mathrm{Var}[w_l] = 1$ or* $n^{out}_l\,\mathrm{Var}[w_l] = 1$
*: $n^{out}_l = n^{in}_{l+1}$, so $\frac{const_{fwd}}{const_{bwd}} = \frac{n^{in}_1}{n^{out}_L} < \infty$. It is sufficient to use either form.
“Xavier” init in Caffe
Initialization
• Initialization under ReLU activation
$\prod_l \frac{1}{2} n^{in}_l\,\mathrm{Var}[w_l] = const$ (healthy forward) and
$\prod_l \frac{1}{2} n^{out}_l\,\mathrm{Var}[w_l] = const$ (healthy backward)
⇒ $\frac{1}{2} n^{in}_l\,\mathrm{Var}[w_l] = 1$ or $\frac{1}{2} n^{out}_l\,\mathrm{Var}[w_l] = 1$
With $D$ layers, a factor of 2 per layer has an exponential impact of $2^D$.
“MSRA” init in Caffe
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”. ICCV 2015.
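For concreteness, a minimal NumPy sketch of the two conditions (function names are invented for this illustration); the 30-layer toy check mirrors the slide's point that the per-layer factor of 2 compounds exponentially:

```python
import numpy as np

def xavier_init(n_in, n_out):
    # "Xavier" condition: n_in * Var[w] = 1  =>  std = sqrt(1 / n_in)
    return np.random.randn(n_out, n_in) * np.sqrt(1.0 / n_in)

def msra_init(n_in, n_out):
    # "MSRA" condition for ReLU: (1/2) * n_in * Var[w] = 1  =>  std = sqrt(2 / n_in)
    return np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)

# Toy check: push a signal through 30 ReLU layers and watch the magnitude.
x = np.random.randn(256)
for init in (xavier_init, msra_init):
    h = x.copy()
    for _ in range(30):
        h = np.maximum(init(h.size, h.size) @ h, 0.0)  # linear layer + ReLU
    print(init.__name__, "output std after 30 layers:", h.std())
# Xavier shrinks by ~(1/2)^30 under ReLU; MSRA keeps the variance roughly constant.
```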
Initialization
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”. ICCV 2015.
[Figures, showing the beginning of training:
• 22-layer ReLU net: good init (ours, $\frac{1}{2}n\,\mathrm{Var}[w]=1$) converges faster than Xavier ($n\,\mathrm{Var}[w]=1$).
• 30-layer ReLU net: good init is able to converge; Xavier is not.]
Batch Normalization (BN)
• Normalizing the input (LeCun et al 1998 “Efficient Backprop”)
• BN: normalizing each layer, for each mini-batch
• Greatly accelerates training
• Less sensitive to initialization
• Improves regularization
S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015.
Batch Normalization (BN)
S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015.
layer: $\hat{x} = \frac{x - \mu}{\sigma}$, $y = \gamma\hat{x} + \beta$
• $\mu$: mean of $x$ in the mini-batch
• $\sigma$: std of $x$ in the mini-batch
• $\gamma$: scale
• $\beta$: shift
• $\mu$, $\sigma$: functions of $x$, analogous to responses
• $\gamma$, $\beta$: parameters to be learned, analogous to weights
Batch Normalization (BN)
S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015.
layer: $\hat{x} = \frac{x - \mu}{\sigma}$, $y = \gamma\hat{x} + \beta$
2 modes of BN:
• Train mode: $\mu$, $\sigma$ are functions of $x$; backprop gradients
• Test mode: $\mu$, $\sigma$ are pre-computed* on the training set
*: by running average, or by post-processing after training
Caution: make sure your BN is in the correct mode (see the sketch below).
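A minimal NumPy sketch of the two modes (class and attribute names are invented for illustration, not the paper's code):

```python
import numpy as np

class BatchNorm1D:
    """Minimal batch-norm sketch with explicit train/test modes."""
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)       # learned scale/shift
        self.run_mu, self.run_var = np.zeros(dim), np.ones(dim)   # test-time statistics
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, train):
        if train:  # train mode: stats are functions of the current mini-batch
            mu, var = x.mean(axis=0), x.var(axis=0)
            self.run_mu = self.momentum * self.run_mu + (1 - self.momentum) * mu
            self.run_var = self.momentum * self.run_var + (1 - self.momentum) * var
        else:      # test mode: stats are pre-computed by running average
            mu, var = self.run_mu, self.run_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1D(8)
y_train = bn(np.random.randn(32, 8), train=True)   # uses batch statistics
y_test = bn(np.random.randn(32, 8), train=False)   # uses running statistics
```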
Batch Normalization (BN)
S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015.
[Figure taken from S. Ioffe & C. Szegedy: accuracy vs. iteration; models with BN reach the best accuracy of the model without BN in far fewer iterations.]
Deep Residual Networks
From 10 layers to 100 layers
Going Deeper
• Initialization algorithms ✓
• Batch Normalization ✓
• Is learning better networks as simple as stacking more layers?
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Simply stacking layers?
[Figure: CIFAR-10 train error (%) and test error (%) vs. iteration (1e4): the 56-layer plain net is above the 20-layer plain net on both curves.]
• Plain nets: stacking 3x3 conv layers…
• The 56-layer net has higher training error and test error than the 20-layer net
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Simply stacking layers?
[Figure: error (%) vs. iteration (1e4); solid: test/val, dashed: train. Left: CIFAR-10 plain nets (plain-20/32/44/56). Right: ImageNet-1000 plain nets (plain-18/34). In both, the deeper plain net has higher error.]
• “Overly deep” plain nets have higher training error
• A general phenomenon, observed in many datasets
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
[Figure: a shallower model (18 layers) vs. a deeper counterpart (34 layers); the deeper plain net is the shallower net with “extra” layers inserted in each stage.]
• Richer solution space
• A deeper model should not have higher training error
• A solution by construction:
  • original layers: copied from a learned shallower model
  • extra layers: set as identity
  • at least the same training error
• Optimization difficulties: solvers cannot find the solution when going deeper…
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Deep Residual Learning
• Plain net
[Diagram: any two stacked layers: $x$ → weight layer → relu → weight layer → relu → $H(x)$]
$H(x)$ is any desired mapping; hope the 2 weight layers fit $H(x)$.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Deep Residual Learning
• Residual net
[Diagram: $x$ → weight layer → relu → weight layer, giving $F(x)$; an identity shortcut carries $x$; the addition yields $H(x) = F(x) + x$, followed by relu]
$H(x)$ is any desired mapping; instead of hoping the 2 weight layers fit $H(x)$, let $H(x) = F(x) + x$ and hope the 2 weight layers fit $F(x)$.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Deep Residual Learning
• $F(x)$ is a residual mapping w.r.t. identity
• If identity were optimal, it is easy to set the weights as 0
• If the optimal mapping is closer to identity, it is easier to find small fluctuations
(see the sketch below)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
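A minimal PyTorch sketch of this two-layer residual unit (illustrative only; the original models were released in Caffe/Torch):

```python
import torch
from torch import nn

class BasicBlock(nn.Module):
    """Two 3x3 conv layers with an identity shortcut: H(x) = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # first weight layer + relu
        out = self.bn2(self.conv2(out))           # second weight layer: F(x)
        return self.relu(out + x)                 # H(x) = F(x) + x, then relu

y = BasicBlock(64)(torch.randn(1, 64, 32, 32))   # shape preserved: (1, 64, 32, 32)
```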
Related Works – Residual Representations
• VLAD & Fisher Vector [Jegou et al 2010], [Perronnin et al 2007]
  • Encoding residual vectors; powerful shallower representations.
• Product Quantization (IVF-ADC) [Jegou et al 2011]
  • Quantizing residual vectors; efficient nearest-neighbor search.
• MultiGrid & Hierarchical Precondition [Briggs et al 2000], [Szeliski 1990, 2006]
  • Solving residual sub-problems; efficient PDE solvers.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
[Figure: the 34-layer plain net vs. the 34-layer ResNet: the same stack of 3x3 conv layers (64, 128, 256, 512 filters; stride /2 at stage transitions; 7x7 conv + pool stem; avg pool + fc 1000), with identity shortcuts added every two layers in the ResNet.]
Network “Design”
• Keep it simple
• Our basic design (VGG-style):
  • all 3x3 conv (almost)
  • spatial size /2 => # filters x2 (~same complexity per layer; see the sketch below)
  • simple design; just deep!
• Other remarks:
  • no hidden fc
  • no dropout
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
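An illustrative PyTorch sketch of that design rule (a hypothetical helper, not the original network definition): halving the spatial size while doubling the filter count keeps per-layer complexity roughly constant.

```python
from torch import nn

def make_plain_stage(num_layers, in_ch, out_ch):
    # First layer halves the spatial size (stride 2) and doubles the filters,
    # so per-layer FLOPs stay roughly constant; the rest are plain 3x3 convs.
    layers = [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
              nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
    for _ in range(num_layers - 1):
        layers += [nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

# Stages of a 34-layer-style plain net: 64 -> 128 -> 256 -> 512 filters.
stages = [make_plain_stage(7, 64, 128), make_plain_stage(7, 128, 256)]
```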
Training
• All plain/residual nets are trained from scratch
• All plain/residual nets use Batch Normalization
• Standard hyper-parameters & augmentation
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
CIFAR-10 experiments
[Figure: error (%) vs. iteration (1e4); solid: test, dashed: train. Left: CIFAR-10 plain nets (plain-20/32/44/56), where deeper is worse. Right: CIFAR-10 ResNets (ResNet-20/32/44/56/110), where deeper is better.]
• Deep ResNets can be trained without difficulties
• Deeper ResNets have lower training error, and also lower test error
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
ImageNet experiments
[Figure: error (%) vs. iteration (1e4); solid: val, dashed: train. Left: ImageNet plain nets (plain-18/34), where the 34-layer net is worse. Right: ImageNet ResNets (ResNet-18/34), where the 34-layer net is better.]
• Deep ResNets can be trained without difficulties
• Deeper ResNets have lower training error, and also lower test error
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
ImageNet experiments
• A practical design for going deeper
[Diagram: two blocks of similar complexity. Left, the all-3x3 block (64-d input): 3x3, 64 → relu → 3x3, 64. Right, the bottleneck block for ResNet-50/101/152 (256-d input): 1x1, 64 → relu → 3x3, 64 → relu → 1x1, 256.]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
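A hedged PyTorch sketch of the bottleneck unit (assuming the input already has 4x the bottleneck width; names are illustrative):

```python
import torch
from torch import nn

class Bottleneck(nn.Module):
    """Bottleneck sketch: 1x1 reduce -> 3x3 -> 1x1 restore, identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            # 1x1 conv reduces 4*channels -> channels
            nn.Conv2d(channels * 4, channels, 1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            # 3x3 conv at the reduced width
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            # 1x1 conv restores channels -> 4*channels
            nn.Conv2d(channels, channels * 4, 1, bias=False),
            nn.BatchNorm2d(channels * 4),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # H(x) = F(x) + x

y = Bottleneck(64)(torch.randn(1, 256, 56, 56))  # 256-d in, 256-d out
```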
ImageNet experiments
[Figure: 10-crop testing, top-5 val error (%): ResNet-34: 7.4; ResNet-50: 6.7; ResNet-101: 6.1; ResNet-152: 5.7. ResNet-152 has lower time complexity than VGG-16/19.]
• Deeper ResNets have lower error
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
ImageNet experiments
[Figure: ImageNet classification top-5 error (%) by year, as before: ILSVRC'10: 28.2; ILSVRC'11: 25.8; ILSVRC'12 AlexNet (8 layers): 16.4; ILSVRC'13 (8 layers): 11.7; ILSVRC'14 VGG (19 layers): 7.3; ILSVRC'14 GoogleNet (22 layers): 6.7; ILSVRC'15 ResNet (152 layers): 3.57.]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Discussions: Representation, Optimization, Generalization
Issues on learning deep models:
• Representation ability
  • ability of the model to fit the training data, if the optimum could be found
  • if model A's solution space is a superset of B's, A should be better
• Optimization ability
  • feasibility of finding an optimum
  • not all models are equally easy to optimize
• Generalization ability
  • once the training data is fit, how good is the test performance?
How do ResNets address these issues?
• Representation ability
  • no explicit advantage on representation (only re-parameterization), but
  • allows models to go deeper
• Optimization ability
  • enables very smooth forward/backward propagation
  • greatly eases the optimization of deeper models
• Generalization ability
  • generalization is not explicitly addressed, but
  • deeper + thinner is good for generalization
On the Importance of Identity Mappings
From 100 layers to 1000 layers
On identity mappings for optimization
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
$x_{l+1} = f(h(x_l) + F(x_l))$
[Diagram: $x_l$ → layer → layer, giving $F(x_l)$; shortcut $h(x_l)$; addition; then $f$]
• shortcut mapping: $h$ = identity
• after-add mapping: $f$ = ReLU
• What if $f$ = identity?
Very smooth forward propagation
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
$x_{l+1} = x_l + F(x_l)$
$x_{l+2} = x_{l+1} + F(x_{l+1}) = x_l + F(x_l) + F(x_{l+1})$
…
$x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$
[Figure: a deep ResNet with a shallow unit $x_l$ and a deep unit $x_L$ marked; the shortcut path connects them directly.]
• Any $x_l$ is directly forward-propagated to any $x_L$, plus residual.
• Any $x_L$ is an additive outcome,
• in contrast to multiplicative: $x_L = \big(\prod_{i=l}^{L-1} W_i\big)\, x_l$.
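A tiny NumPy check (illustrative only) of the unrolled identity $x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$:

```python
import numpy as np

rng = np.random.default_rng(0)
# Five residual functions F_i (random linear + ReLU); the default argument
# binds a distinct weight matrix W to each lambda at definition time.
Fs = [lambda x, W=rng.normal(0, 0.1, (8, 8)): np.maximum(W @ x, 0)
      for _ in range(5)]

x = rng.normal(size=8)
h, residuals = x, []
for F in Fs:               # x_{l+1} = x_l + F(x_l)
    r = F(h)
    residuals.append(r)
    h = h + r

# Unrolled form: x_L = x_l + sum_i F(x_i)
assert np.allclose(h, x + np.sum(residuals, axis=0))
```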
Very smooth backward propagation
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
From $x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$:
$\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L}\frac{\partial x_L}{\partial x_l} = \frac{\partial E}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i)\right)$
[Figure: the same deep ResNet; $\partial E/\partial x_L$ at a deep unit flows directly to $\partial E/\partial x_l$ at a shallow unit through the shortcut path.]
Very smooth backward propagation
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
$\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i)\right)$
• Any $\frac{\partial E}{\partial x_L}$ is directly back-propagated to any $\frac{\partial E}{\partial x_l}$, plus residual.
• Any $\frac{\partial E}{\partial x_l}$ is additive; unlikely to vanish,
• in contrast to multiplicative: $\frac{\partial E}{\partial x_l} = \big(\prod_{i=l}^{L-1} W_i\big)\frac{\partial E}{\partial x_L}$.
Residual for every layer
forward: $x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$
backward: $\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i)\right)$
Enabled by:
• shortcut mapping: $h$ = identity
• after-add mapping: $f$ = identity
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Experiments
• Set 1: what if shortcut mapping $h \neq$ identity?
• Set 2: what if after-add mapping $f$ is identity?
• Experiments on ResNets with more than 100 layers
  • deeper models suffer more from optimization difficulty
Experiment Set 1: what if shortcut mapping $h \neq$ identity?
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
[Figure: shortcut variants and their errors (ResNet-110 on CIFAR-10):
(a) original, $h(x) = x$: 6.6%
(b) constant scaling, $h(x) = 0.5x$: 12.4%
(c) exclusive gating*, $h(x) = \text{gate} \cdot x$: 8.7%
(d) shortcut-only gating, $h(x) = \text{gate} \cdot x$: 12.9%
(e) conv shortcut, $h(x) = \text{conv}(x)$: 12.2%
(f) dropout shortcut, $h(x) = \text{dropout}(x)$: >20%]
*similar to “Highway Network”
Shortcuts blocked by multiplications: if $h$ is multiplicative, e.g. $h(x) = \lambda x$:
forward: $x_L = \lambda^{L-l} x_l + \sum_{i=l}^{L-1} \hat{F}(x_i)$
backward: $\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L}\left(\lambda^{L-l} + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} \hat{F}(x_i)\right)$
(*assuming $f$ = identity; $\hat{F}$ absorbs the scaling factors)
• if $h$ is multiplicative, the shortcuts are blocked
• direct propagation is decayed
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
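A two-line numeric illustration of that $\lambda^{L-l}$ decay (toy values, for illustration only):

```python
lam, L = 0.5, 20  # constant-scaling shortcut h(x) = 0.5x across 20 units
print("identity shortcut:", 1.0 ** L)  # 1.0: the direct path passes unchanged
print("scaled shortcut  :", lam ** L)  # ~9.5e-7: the direct path decays away
```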
[Figure: training curves on CIFAR-10 (solid: test, dashed: train) comparing $h$ = identity with $h$ = gating.]
• gating should have better representation ability (identity is a special case of gating), but
• optimization difficulty dominates the results
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Experiment Set 2: what if after-add mapping $f$ is identity?
[Figure: three residual unit designs, each with two weight layers and BN:
• $f$ is ReLU (original ResNet): weight → BN → ReLU → weight → BN → addition → ReLU
• $f$ is BN + ReLU: BN (then ReLU) follows the addition
• $f$ is identity (pre-activation ResNet): BN → ReLU → weight → BN → ReLU → weight → addition, with nothing after the addition]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
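A hedged PyTorch sketch of the pre-activation unit (illustrative names, not the released model definition):

```python
import torch
from torch import nn

class PreActBlock(nn.Module):
    """Pre-activation residual unit sketch: BN and ReLU come *before* each
    weight layer, and nothing follows the addition (f = identity)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )

    def forward(self, x):
        return x + self.body(x)  # clean addition: shortest path stays untouched

y = PreActBlock(64)(torch.randn(1, 64, 32, 32))
```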
[Figure: training curves (solid: test, dashed: train) comparing $f$ = ReLU with $f$ = BN + ReLU; the BN + ReLU variant is worse.]
• BN could block propagation
• Keep the shortest path as smooth as possible
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
1001-layer ResNets on CIFAR-10
[Figure: training curves (solid: test, dashed: train) comparing $f$ = ReLU with $f$ = identity; the pre-activation design trains markedly better.]
• ReLU could block propagation when there are 1000 layers
• the pre-activation design eases optimization (and improves generalization; see paper)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Comparisons on CIFAR-10/100 (*all based on moderate augmentation)

CIFAR-10:
method                                   error (%)
NIN                                      8.81
DSN                                      8.22
FitNet                                   8.39
Highway                                  7.72
ResNet-110 (1.7M)                        6.61
ResNet-1202 (19.4M)                      7.93
ResNet-164, pre-activation (1.7M)        5.46
ResNet-1001, pre-activation (10.2M)      4.92 (4.89±0.14)

CIFAR-100:
method                                   error (%)
NIN                                      35.68
DSN                                      34.57
FitNet                                   35.04
Highway                                  32.39
ResNet-164 (1.7M)                        25.16
ResNet-1001 (10.2M)                      27.82
ResNet-164, pre-activation (1.7M)        24.33
ResNet-1001, pre-activation (10.2M)      22.71 (22.68±0.22)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
ImageNet Experiments
ImageNet single-crop (320x320) val error:
method                      data augmentation      top-1 error (%)   top-5 error (%)
ResNet-152, original        scale                  21.3              5.5
ResNet-152, pre-activation  scale                  21.1              5.5
ResNet-200, original        scale                  21.8              6.0
ResNet-200, pre-activation  scale                  20.7              5.3
ResNet-200, pre-activation  scale + aspect ratio   20.1*             4.8*
*independently reproduced by: https://github.com/facebook/fb.resnet.torch/tree/master/pretrained#notes
Training code and models available.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Summary of observations
• Keep the shortest path as smooth as possible
  • by making $h$ and $f$ identity
  • forward/backward signals directly flow through this path
• Features of any layers are additive outcomes
• 1000-layer ResNets can be easily trained and have better accuracy
[Diagram: the pre-activation unit: BN → ReLU → weight → BN → ReLU → weight → addition]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Future Works
• Representation
  • skipping 1 layer vs. multiple layers?
  • flat vs. bottleneck?
  • Inception-ResNet [Szegedy et al 2016]
  • ResNet in ResNet [Targ et al 2016]
  • width vs. depth [Zagoruyko & Komodakis 2016]
• Generalization
  • DropOut, MaxOut, DropConnect, …
  • DropLayer (Stochastic Depth) [Huang et al 2016]
• Optimization
  • without residual/shortcut?
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Applications
“Features matter.” (quote from [Girshick et al. 2014], the R-CNN paper)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
task                                  2nd-place winner   ResNets   margin (relative)
ImageNet Localization (top-5 error)   12.0               9.0       27%
ImageNet Detection ([email protected])        53.6               62.1      16%
COCO Detection ([email protected]:.95)        33.5               37.3      11%
COCO Segmentation ([email protected]:.95)     25.1               28.2      12%
(ImageNet Detection: absolute 8.5% better!)
• Our results are all based on ResNet-101
• Deeper features are well transferrable
Revolution of Depth
[Figure: PASCAL VOC 2007 object detection mAP (%), as before: HOG, DPM (shallow): 34; AlexNet (RCNN, 8 layers): 58; VGG (RCNN, 16 layers): 66; ResNet (Faster RCNN*, 101 layers): 86. *w/ other improvements & more data]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Engines of visual recognition
Deep Learning for Computer Vision
[Diagram: a backbone structure is pre-trained as a classification network on ImageNet data; the features are then fine-tuned on target data for a detection network (e.g. R-CNN), a segmentation network (e.g. FCN), a human pose estimation network, a depth estimation network, …]
Example: Object Detection
Image Classification (what?) vs. Object Detection (what + where?)
(e.g., boat, person)
Object Detection: R-CNN
Region-based CNN pipeline: from the input image, extract ~2,000 region proposals; warp each region and run 1 CNN for each region; classify the regions (aeroplane? no. … person? yes. … tvmonitor? no.)
figure credit: R. Girshick et al.
Girshick, Donahue, Darrell, Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014.
Object Detection: R-CNN
• R-CNN: one CNN is run independently on each pre-computed Region-of-Interest (RoI) to extract its feature.
Girshick, Donahue, Darrell, Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014.
Object Detection: Fast R-CNN
• Fast R-CNN: shared conv layers compute a single feature map for the whole image; features for the pre-computed RoIs are extracted by RoI pooling; end-to-end training.
Girshick. Fast R-CNN. ICCV 2015.
Object Detection: Faster R-CNN
• Faster R-CNN
  • solely based on CNN
  • no external modules
  • each step is end-to-end
[Diagram: image → CNN → feature map → Region Proposal Net → proposals → RoI pooling → features; end-to-end training.]
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
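To make that data flow concrete, a toy Python sketch of the pipeline; every function here is a stub invented for illustration, not the paper's implementation:

```python
from typing import Callable, List
import numpy as np

def faster_rcnn(image: np.ndarray,
                backbone: Callable, rpn: Callable, head: Callable) -> List:
    feature_map = backbone(image)            # one shared CNN pass per image
    proposals = rpn(feature_map)             # Region Proposal Net: learned boxes
    rois = [feature_map[y:y + 8, x:x + 8]    # crude stand-in for RoI pooling
            for (x, y) in proposals]
    return [head(r) for r in rois]           # classify each pooled region

# Toy stubs just to run the flow end-to-end.
backbone = lambda img: img.mean(axis=-1)     # fake feature map
rpn = lambda fmap: [(0, 0), (4, 4)]          # fake proposals
head = lambda roi: float(roi.mean() > 0)     # fake classifier
print(faster_rcnn(np.random.rand(32, 32, 3), backbone, rpn, head))
```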
Object Detection
[Diagram: a backbone structure (AlexNet, VGG-16, GoogleNet, ResNet-101, …) is pre-trained as a classification network on ImageNet data; the features are fine-tuned on detection data by a detector (R-CNN, Fast R-CNN, Faster R-CNN, MultiBox, SSD, …).]
“Plug-in” features and detectors are independently developed.
Object Detection
• Simply “Faster R-CNN + ResNet”
COCO detection results:
Faster R-CNN baseline   [email protected]   [email protected]:.95
VGG-16                  41.5       21.5
ResNet-101              48.4       27.2
ResNet-101 has a 28% relative gain vs. VGG-16.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016. Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
Object Detection
• RPN learns proposals by extremely deep nets
  • we use only 300 proposals (no hand-designed proposals)
• Add components:
  • iterative localization
  • context modeling
  • multi-scale testing
• All are based on CNN features; all are end-to-end
• All benefit more from deeper features: cumulative gains!
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016. Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
[Images: ResNet's object detection results on COCO. *the original images are from the COCO dataset]
Results on real video. Models trained on MS COCO (80 categories). (frame-by-frame; no temporal processing)
This video is available online: https://youtu.be/WZmSMkK9VuA
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015. Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
More Visual Recognition Tasks
ResNet-based methods lead on these benchmarks (incomplete list):
• ImageNet classification, detection, localization
• MS COCO detection, segmentation
• PASCAL VOC detection, segmentation
• Human pose estimation [Newell et al 2016]
• Depth estimation [Laina et al 2016]
• Segment proposal [Pinheiro et al 2016]
• …
[Screenshots: ResNet-101 entries atop the PASCAL detection and segmentation leaderboards.]
Potential Applications
ResNets have shown outstanding or promising results on:
• Visual recognition
• Image generation (Pixel RNN, Neural Art, etc.)
• Natural language processing (very deep CNN)
• Speech recognition (preliminary results)
• Advertising, user prediction (preliminary results)
Conclusions of the Tutorial
• Deep Residual Learning:
  • ultra-deep networks can be easy to train
  • ultra-deep networks can gain accuracy from depth
  • ultra-deep representations are well transferrable
  • now 200 layers on ImageNet and 1000 layers on CIFAR!
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016. Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Resources
• Models and Code
  • Our ImageNet models in Caffe: https://github.com/KaimingHe/deep-residual-networks
  • Many available implementations (list in https://github.com/KaimingHe/deep-residual-networks)
  • Facebook AI Research's Torch ResNet: https://github.com/facebook/fb.resnet.torch
  • Torch, CIFAR-10, with ResNet-20 to ResNet-110, training code, and curves: code
  • Lasagne, CIFAR-10, with ResNet-32 and ResNet-56 and training code: code
  • Neon, CIFAR-10, with pre-trained ResNet-32 to ResNet-110 models, training code, and curves: code
  • Torch, MNIST, 100 layers: blog, code
  • A winning entry in Kaggle's right whale recognition challenge: blog, code
  • Neon, Place2 (mini), 40 layers: blog, code
  • …
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016. Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.