Deep Residual Networks: Deep Learning Gets Way Deeper
8:30-10:30am, June 19, ICML 2016 tutorial
Kaiming He, Facebook AI Research*
*as of July 2016. Formerly affiliated with Microsoft Research Asia.
[Title-slide background: the 152-layer ResNet architecture (7x7 conv, 64, /2, pool /2; repeated 1x1/3x3/1x1 bottleneck blocks of 64/64/256, 128/128/512, 256/256/1024, and 512/512/2048 filters, with stride /2 at each stage transition; average pooling; fc 1000).]
Overview
• Introduction
  • Background
  • From shallow to deep
• Deep Residual Networks
  • From 10 layers to 100 layers
  • From 100 layers to 1000 layers
• Applications
• Q & A
Introduction
Deep Residual Networks (ResNets)
• “Deep Residual Learning for Image Recognition”. CVPR 2016 (next week)
• A simple and clean framework for training “very” deep nets
• State-of-the-art performance for
  • image classification
  • object detection
  • semantic segmentation
  • and more…
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
ResNets @ ILSVRC & COCO 2015 Competitions
• 1st places in all five main tracks
  • ImageNet Classification: “ultra-deep” 152-layer nets
  • ImageNet Detection: 16% better than 2nd
  • ImageNet Localization: 27% better than 2nd
  • COCO Detection: 11% better than 2nd
  • COCO Segmentation: 12% better than 2nd
*improvements are relative numbers
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
[Figure: ImageNet classification top-5 error (%) by year: ILSVRC'10 (shallow): 28.2; ILSVRC'11 (shallow): 25.8; ILSVRC'12 AlexNet (8 layers): 16.4; ILSVRC'13 (8 layers): 11.7; ILSVRC'14 VGG (19 layers): 7.3; ILSVRC'14 GoogleNet (22 layers): 6.7; ILSVRC'15 ResNet (152 layers): 3.57.]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
[Figure: AlexNet, 8 layers (ILSVRC 2012): 11x11 conv, 96, /4, pool /2; 5x5 conv, 256, pool /2; 3x3 conv, 384; 3x3 conv, 384; 3x3 conv, 256, pool /2; fc 4096; fc 4096; fc 1000.]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
[Figure: AlexNet, 8 layers (ILSVRC 2012) alongside VGG, 19 layers (ILSVRC 2014): VGG stacks 3x3 conv layers (64, 128, 256, 512 filters) with pooling between stages, followed by fc 4096, fc 4096, fc 1000.]
[Figure: GoogleNet, 22 layers (ILSVRC 2014): a stem of 7x7 and 3x3 convolutions with local response norm and max pooling, followed by stacked Inception modules (parallel 1x1, 3x3, 5x5 convolutions and 3x3 max pooling, depth-concatenated), two auxiliary softmax classifiers, and a final 7x7 average pool, fc, and softmax.]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
[Figure: side-by-side architectures: AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); ResNet, 152 layers (ILSVRC 2015). The ResNet column: 7x7 conv, 64, /2, pool /2; repeated 1x1/3x3/1x1 bottleneck blocks (64/64/256, 128/128/512, 256/256/1024, 512/512/2048 filters; stride /2 at stage transitions); average pooling; fc 1000.]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
[Figure: PASCAL VOC 2007 object detection mAP (%): HOG, DPM (shallow): 34; AlexNet (RCNN, 8 layers): 58; VGG (RCNN, 16 layers): 66; ResNet (Faster RCNN*, 101 layers): 86.]
*w/ other improvements & more data
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Engines of visual recognition
[Image: ResNet's object detection result on COCO. *the original image is from the COCO dataset]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Very simple, easy to follow
• Many third-party implementations (list in https://github.com/KaimingHe/deep-residual-networks)
  • Facebook AI Research's Torch ResNet
  • Torch, CIFAR-10, with ResNet-20 to ResNet-110, training code, and curves: code
  • Lasagne, CIFAR-10, with ResNet-32 and ResNet-56 and training code: code
  • Neon, CIFAR-10, with pre-trained ResNet-32 to ResNet-110 models, training code, and curves: code
  • Torch, MNIST, 100 layers: blog, code
  • A winning entry in Kaggle's right whale recognition challenge: blog, code
  • Neon, Place2 (mini), 40 layers: blog, code
  • …
• Easily reproduced results (e.g. Torch ResNet: https://github.com/facebook/fb.resnet.torch)
• A series of extensions and follow-ups
  • >200 citations in the 6 months after it was posted on arXiv (Dec. 2015)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Background
From shallow to deep
Traditional recognition
[Figure: increasingly deep hand-engineered pipelines:
pixels → classifier → “bus”?
pixels → edges → classifier → “bus”?
pixels → edges (SIFT/HOG) → histogram → classifier → “bus”?
pixels → edges → K-means/sparse code → histogram → classifier → “bus”?
shallower → deeper]
But what's next?
Deep Learning
edges → K-means/sparse code → histogram → classifier → “bus”?
Specialized components, domain knowledge required.
[deep net of generic layers] → “bus”?
Generic components (“layers”), less domain knowledge.
Repeat elementary layers => going deeper
• End-to-end learning
• Richer solution space
Spectrum of Depth
shallower → deeper
• 5 layers: easy
• >10 layers: initialization, Batch Normalization
• >30 layers: skip connections
• >100 layers: identity skip connections
• >1000 layers: ?
Initialization
LeCun et al 1998 “Efficient Backprop”; Glorot & Bengio 2010 “Understanding the difficulty of training deep feedforward neural networks”
input $x$, weight $W$, output $y = Wx$
If:
• linear activation
• $x$, $y$, $w$ independent
Then:
1-layer: $\mathrm{Var}[y] = (n_{in}\,\mathrm{Var}[w])\,\mathrm{Var}[x]$
Multi-layer: $\mathrm{Var}[y] = \big(\prod_l n^{in}_l\,\mathrm{Var}[w_l]\big)\,\mathrm{Var}[x]$
($n_{in}$: number of inputs of a layer; $n_{out}$: number of outputs)
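As a quick numerical check (illustrative NumPy, not from the tutorial), the 1-layer formula can be verified directly:

```python
import numpy as np

# Illustrative check of Var[y] = n_in * Var[w] * Var[x] for y = w . x
# (linear activation, independent w and x); the numbers are arbitrary.
n_in, var_w = 1000, 0.05 ** 2
samples = [
    (np.random.randn(n_in) * np.sqrt(var_w)) @ np.random.randn(n_in)
    for _ in range(20000)
]
print(np.var(samples))     # empirical Var[y]
print(n_in * var_w * 1.0)  # predicted: 1000 * 0.0025 * 1 = 2.5
```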
Initialization
LeCun et al 1998 “Efficient Backprop”; Glorot & Bengio 2010 “Understanding the difficulty of training deep feedforward neural networks”
Forward: $\mathrm{Var}[y] = \big(\prod_l n^{in}_l\,\mathrm{Var}[w_l]\big)\,\mathrm{Var}[x]$
Backward: $\mathrm{Var}\big[\frac{\partial E}{\partial x}\big] = \big(\prod_l n^{out}_l\,\mathrm{Var}[w_l]\big)\,\mathrm{Var}\big[\frac{\partial E}{\partial y}\big]$
[Figure: signal magnitude vs. depth (1-15), showing exploding, ideal, and vanishing regimes.]
Both forward (response) and backward (gradient) signals can vanish/explode.
Initialization
• Initialization under the linear assumption
LeCun et al 1998 “Efficient Backprop”; Glorot & Bengio 2010 “Understanding the difficulty of training deep feedforward neural networks”
$\prod_l n^{in}_l\,\mathrm{Var}[w_l] = const$ (healthy forward) and
$\prod_l n^{out}_l\,\mathrm{Var}[w_l] = const$ (healthy backward)
⇒ $n^{in}_l\,\mathrm{Var}[w_l] = 1$ or* $n^{out}_l\,\mathrm{Var}[w_l] = 1$
*: $n^{out}_l = n^{in}_{l+1}$, so $\frac{const_{fwd}}{const_{bwd}} = \frac{n^{in}_1}{n^{out}_L} < \infty$. It is sufficient to use either form.
“Xavier” init in Caffe
Initialization
• Initialization under ReLU activation
$\prod_l \frac{1}{2} n^{in}_l\,\mathrm{Var}[w_l] = const$ (healthy forward) and
$\prod_l \frac{1}{2} n^{out}_l\,\mathrm{Var}[w_l] = const$ (healthy backward)
⇒ $\frac{1}{2} n^{in}_l\,\mathrm{Var}[w_l] = 1$ or $\frac{1}{2} n^{out}_l\,\mathrm{Var}[w_l] = 1$
With $D$ layers, a factor of 2 per layer has an exponential impact of $2^D$.
“MSRA” init in Caffe
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”. ICCV 2015.
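For concreteness, a minimal NumPy sketch of the two conditions (function names are invented for this illustration); the 30-layer toy check mirrors the slide's point that the per-layer factor of 2 compounds exponentially:

```python
import numpy as np

def xavier_init(n_in, n_out):
    # "Xavier" condition: n_in * Var[w] = 1  =>  std = sqrt(1 / n_in)
    return np.random.randn(n_out, n_in) * np.sqrt(1.0 / n_in)

def msra_init(n_in, n_out):
    # "MSRA" condition for ReLU: (1/2) * n_in * Var[w] = 1  =>  std = sqrt(2 / n_in)
    return np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)

# Toy check: push a signal through 30 ReLU layers and watch the magnitude.
x = np.random.randn(256)
for init in (xavier_init, msra_init):
    h = x.copy()
    for _ in range(30):
        h = np.maximum(init(h.size, h.size) @ h, 0.0)  # linear layer + ReLU
    print(init.__name__, "output std after 30 layers:", h.std())
# Xavier shrinks by ~(1/2)^30 under ReLU; MSRA keeps the variance roughly constant.
```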
Initialization
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”. ICCV 2015.
[Figures, showing the beginning of training:
• 22-layer ReLU net: good init (ours, $\frac{1}{2}n\,\mathrm{Var}[w]=1$) converges faster than Xavier ($n\,\mathrm{Var}[w]=1$).
• 30-layer ReLU net: good init is able to converge; Xavier is not.]
Batch Normalization (BN)
• Normalizing the input (LeCun et al 1998 “Efficient Backprop”)
• BN: normalizing each layer, for each mini-batch
• Greatly accelerates training
• Less sensitive to initialization
• Improves regularization
S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015.
Batch Normalization (BN)
S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015.
layer: $\hat{x} = \frac{x - \mu}{\sigma}$, $y = \gamma\hat{x} + \beta$
• $\mu$: mean of $x$ in the mini-batch
• $\sigma$: std of $x$ in the mini-batch
• $\gamma$: scale
• $\beta$: shift
• $\mu$, $\sigma$: functions of $x$, analogous to responses
• $\gamma$, $\beta$: parameters to be learned, analogous to weights
Batch Normalization (BN)
S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015.
layer: $\hat{x} = \frac{x - \mu}{\sigma}$, $y = \gamma\hat{x} + \beta$
2 modes of BN:
• Train mode: $\mu$, $\sigma$ are functions of $x$; backprop gradients
• Test mode: $\mu$, $\sigma$ are pre-computed* on the training set
*: by running average, or by post-processing after training
Caution: make sure your BN is in the correct mode (see the sketch below).
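A minimal NumPy sketch of the two modes (class and attribute names are invented for illustration, not the paper's code):

```python
import numpy as np

class BatchNorm1D:
    """Minimal batch-norm sketch with explicit train/test modes."""
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)       # learned scale/shift
        self.run_mu, self.run_var = np.zeros(dim), np.ones(dim)   # test-time statistics
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, train):
        if train:  # train mode: stats are functions of the current mini-batch
            mu, var = x.mean(axis=0), x.var(axis=0)
            self.run_mu = self.momentum * self.run_mu + (1 - self.momentum) * mu
            self.run_var = self.momentum * self.run_var + (1 - self.momentum) * var
        else:      # test mode: stats are pre-computed by running average
            mu, var = self.run_mu, self.run_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1D(8)
y_train = bn(np.random.randn(32, 8), train=True)   # uses batch statistics
y_test = bn(np.random.randn(32, 8), train=False)   # uses running statistics
```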
Batch Normalization (BN)
S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015.
[Figure taken from S. Ioffe & C. Szegedy: accuracy vs. iteration; models with BN reach the best accuracy of the model without BN in far fewer iterations.]
Deep Residual Networks
From 10 layers to 100 layers
Going Deeper
• Initialization algorithms ✓
• Batch Normalization ✓
• Is learning better networks as simple as stacking more layers?
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Simply stacking layers?
[Figure: CIFAR-10 train error (%) and test error (%) vs. iteration (1e4): the 56-layer plain net is above the 20-layer plain net on both curves.]
• Plain nets: stacking 3x3 conv layers…
• The 56-layer net has higher training error and test error than the 20-layer net
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Simply stacking layers?
[Figure: error (%) vs. iteration (1e4); solid: test/val, dashed: train. Left: CIFAR-10 plain nets (plain-20/32/44/56). Right: ImageNet-1000 plain nets (plain-18/34). In both, the deeper plain net has higher error.]
• “Overly deep” plain nets have higher training error
• A general phenomenon, observed in many datasets
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
[Figure: a shallower model (18 layers) vs. a deeper counterpart (34 layers); the deeper plain net is the shallower net with “extra” layers inserted in each stage.]
• Richer solution space
• A deeper model should not have higher training error
• A solution by construction:
  • original layers: copied from a learned shallower model
  • extra layers: set as identity
  • at least the same training error
• Optimization difficulties: solvers cannot find the solution when going deeper…
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Deep Residual Learning
• Plain net
[Diagram: any two stacked layers: $x$ → weight layer → relu → weight layer → relu → $H(x)$]
$H(x)$ is any desired mapping; hope the 2 weight layers fit $H(x)$.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Deep Residual Learning
• Residual net
[Diagram: $x$ → weight layer → relu → weight layer, giving $F(x)$; an identity shortcut carries $x$; the addition yields $H(x) = F(x) + x$, followed by relu]
$H(x)$ is any desired mapping; instead of hoping the 2 weight layers fit $H(x)$, let $H(x) = F(x) + x$ and hope the 2 weight layers fit $F(x)$.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Deep Residual Learning
• $F(x)$ is a residual mapping w.r.t. identity
• If identity were optimal, it is easy to set the weights as 0
• If the optimal mapping is closer to identity, it is easier to find small fluctuations
(see the sketch below)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
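A minimal PyTorch sketch of this two-layer residual unit (illustrative only; the original models were released in Caffe/Torch):

```python
import torch
from torch import nn

class BasicBlock(nn.Module):
    """Two 3x3 conv layers with an identity shortcut: H(x) = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # first weight layer + relu
        out = self.bn2(self.conv2(out))           # second weight layer: F(x)
        return self.relu(out + x)                 # H(x) = F(x) + x, then relu

y = BasicBlock(64)(torch.randn(1, 64, 32, 32))   # shape preserved: (1, 64, 32, 32)
```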
Related Works – Residual Representations
• VLAD & Fisher Vector [Jegou et al 2010], [Perronnin et al 2007]
  • Encoding residual vectors; powerful shallower representations.
• Product Quantization (IVF-ADC) [Jegou et al 2011]
  • Quantizing residual vectors; efficient nearest-neighbor search.
• MultiGrid & Hierarchical Precondition [Briggs et al 2000], [Szeliski 1990, 2006]
  • Solving residual sub-problems; efficient PDE solvers.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
[Figure: the 34-layer plain net vs. the 34-layer ResNet: the same stack of 3x3 conv layers (64, 128, 256, 512 filters; stride /2 at stage transitions; 7x7 conv + pool stem; avg pool + fc 1000), with identity shortcuts added every two layers in the ResNet.]
Network “Design”
• Keep it simple
• Our basic design (VGG-style):
  • all 3x3 conv (almost)
  • spatial size /2 => # filters x2 (~same complexity per layer; see the sketch below)
  • simple design; just deep!
• Other remarks:
  • no hidden fc
  • no dropout
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
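An illustrative PyTorch sketch of that design rule (a hypothetical helper, not the original network definition): halving the spatial size while doubling the filter count keeps per-layer complexity roughly constant.

```python
from torch import nn

def make_plain_stage(num_layers, in_ch, out_ch):
    # First layer halves the spatial size (stride 2) and doubles the filters,
    # so per-layer FLOPs stay roughly constant; the rest are plain 3x3 convs.
    layers = [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
              nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
    for _ in range(num_layers - 1):
        layers += [nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

# Stages of a 34-layer-style plain net: 64 -> 128 -> 256 -> 512 filters.
stages = [make_plain_stage(7, 64, 128), make_plain_stage(7, 128, 256)]
```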
Training
• All plain/residual nets are trained from scratch
• All plain/residual nets use Batch Normalization
• Standard hyper-parameters & augmentation
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
CIFAR-10 experiments
[Figure: error (%) vs. iteration (1e4); solid: test, dashed: train. Left: CIFAR-10 plain nets (plain-20/32/44/56), where deeper is worse. Right: CIFAR-10 ResNets (ResNet-20/32/44/56/110), where deeper is better.]
• Deep ResNets can be trained without difficulties
• Deeper ResNets have lower training error, and also lower test error
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
ImageNet experiments
[Figure: error (%) vs. iteration (1e4); solid: val, dashed: train. Left: ImageNet plain nets (plain-18/34), where the 34-layer net is worse. Right: ImageNet ResNets (ResNet-18/34), where the 34-layer net is better.]
• Deep ResNets can be trained without difficulties
• Deeper ResNets have lower training error, and also lower test error
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
ImageNet experiments
• A practical design for going deeper
[Diagram: two blocks of similar complexity. Left, the all-3x3 block (64-d input): 3x3, 64 → relu → 3x3, 64. Right, the bottleneck block for ResNet-50/101/152 (256-d input): 1x1, 64 → relu → 3x3, 64 → relu → 1x1, 256.]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
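A hedged PyTorch sketch of the bottleneck unit (assuming the input already has 4x the bottleneck width; names are illustrative):

```python
import torch
from torch import nn

class Bottleneck(nn.Module):
    """Bottleneck sketch: 1x1 reduce -> 3x3 -> 1x1 restore, identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            # 1x1 conv reduces 4*channels -> channels
            nn.Conv2d(channels * 4, channels, 1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            # 3x3 conv at the reduced width
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            # 1x1 conv restores channels -> 4*channels
            nn.Conv2d(channels, channels * 4, 1, bias=False),
            nn.BatchNorm2d(channels * 4),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # H(x) = F(x) + x

y = Bottleneck(64)(torch.randn(1, 256, 56, 56))  # 256-d in, 256-d out
```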
ImageNet experiments
[Figure: 10-crop testing, top-5 val error (%): ResNet-34: 7.4; ResNet-50: 6.7; ResNet-101: 6.1; ResNet-152: 5.7. ResNet-152 has lower time complexity than VGG-16/19.]
• Deeper ResNets have lower error
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
ImageNet experiments
[Figure: ImageNet classification top-5 error (%) by year, as before: ILSVRC'10: 28.2; ILSVRC'11: 25.8; ILSVRC'12 AlexNet (8 layers): 16.4; ILSVRC'13 (8 layers): 11.7; ILSVRC'14 VGG (19 layers): 7.3; ILSVRC'14 GoogleNet (22 layers): 6.7; ILSVRC'15 ResNet (152 layers): 3.57.]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Discussions: Representation, Optimization, Generalization
Issues on learning deep models:
• Representation ability
  • ability of the model to fit the training data, if the optimum could be found
  • if model A's solution space is a superset of B's, A should be better
• Optimization ability
  • feasibility of finding an optimum
  • not all models are equally easy to optimize
• Generalization ability
  • once the training data is fit, how good is the test performance?
How do ResNets address these issues?
• Representation ability
  • no explicit advantage on representation (only re-parameterization), but
  • allows models to go deeper
• Optimization ability
  • enables very smooth forward/backward propagation
  • greatly eases the optimization of deeper models
• Generalization ability
  • generalization is not explicitly addressed, but
  • deeper + thinner is good for generalization
On the Importance of Identity Mappings
From 100 layers to 1000 layers
On identity mappings for optimization
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
$x_{l+1} = f(h(x_l) + F(x_l))$
[Diagram: $x_l$ → layer → layer, giving $F(x_l)$; shortcut $h(x_l)$; addition; then $f$]
• shortcut mapping: $h$ = identity
• after-add mapping: $f$ = ReLU
• What if $f$ = identity?
Very smooth forward propagation
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
$x_{l+1} = x_l + F(x_l)$
$x_{l+2} = x_{l+1} + F(x_{l+1}) = x_l + F(x_l) + F(x_{l+1})$
…
$x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$
[Figure: a deep ResNet with a shallow unit $x_l$ and a deep unit $x_L$ marked; the shortcut path connects them directly.]
• Any $x_l$ is directly forward-propagated to any $x_L$, plus residual.
• Any $x_L$ is an additive outcome,
• in contrast to multiplicative: $x_L = \big(\prod_{i=l}^{L-1} W_i\big)\, x_l$.
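A tiny NumPy check (illustrative only) of the unrolled identity $x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$:

```python
import numpy as np

rng = np.random.default_rng(0)
# Five residual functions F_i (random linear + ReLU); the default argument
# binds a distinct weight matrix W to each lambda at definition time.
Fs = [lambda x, W=rng.normal(0, 0.1, (8, 8)): np.maximum(W @ x, 0)
      for _ in range(5)]

x = rng.normal(size=8)
h, residuals = x, []
for F in Fs:               # x_{l+1} = x_l + F(x_l)
    r = F(h)
    residuals.append(r)
    h = h + r

# Unrolled form: x_L = x_l + sum_i F(x_i)
assert np.allclose(h, x + np.sum(residuals, axis=0))
```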
Very smooth backward propagation
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
From $x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$:
$\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L}\frac{\partial x_L}{\partial x_l} = \frac{\partial E}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i)\right)$
[Figure: the same deep ResNet; $\partial E/\partial x_L$ at a deep unit flows directly to $\partial E/\partial x_l$ at a shallow unit through the shortcut path.]
Very smooth backward propagation
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
$\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i)\right)$
• Any $\frac{\partial E}{\partial x_L}$ is directly back-propagated to any $\frac{\partial E}{\partial x_l}$, plus residual.
• Any $\frac{\partial E}{\partial x_l}$ is additive; unlikely to vanish,
• in contrast to multiplicative: $\frac{\partial E}{\partial x_l} = \big(\prod_{i=l}^{L-1} W_i\big)\frac{\partial E}{\partial x_L}$.
Residual for every layer
forward: $x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$
backward: $\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i)\right)$
Enabled by:
• shortcut mapping: $h$ = identity
• after-add mapping: $f$ = identity
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Experiments
• Set 1: what if shortcut mapping $h \neq$ identity?
• Set 2: what if after-add mapping $f$ is identity?
• Experiments on ResNets with more than 100 layers
  • deeper models suffer more from optimization difficulty
Experiment Set 1: what if shortcut mapping $h \neq$ identity?
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
[Figure: shortcut variants and their errors (ResNet-110 on CIFAR-10):
(a) original, $h(x) = x$: 6.6%
(b) constant scaling, $h(x) = 0.5x$: 12.4%
(c) exclusive gating*, $h(x) = \text{gate} \cdot x$: 8.7%
(d) shortcut-only gating, $h(x) = \text{gate} \cdot x$: 12.9%
(e) conv shortcut, $h(x) = \text{conv}(x)$: 12.2%
(f) dropout shortcut, $h(x) = \text{dropout}(x)$: >20%]
*similar to “Highway Network”
Shortcuts blocked by multiplications: if $h$ is multiplicative, e.g. $h(x) = \lambda x$:
forward: $x_L = \lambda^{L-l} x_l + \sum_{i=l}^{L-1} \hat{F}(x_i)$
backward: $\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L}\left(\lambda^{L-l} + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} \hat{F}(x_i)\right)$
(*assuming $f$ = identity; $\hat{F}$ absorbs the scaling factors)
• if $h$ is multiplicative, the shortcuts are blocked
• direct propagation is decayed
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
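A two-line numeric illustration of that $\lambda^{L-l}$ decay (toy values, for illustration only):

```python
lam, L = 0.5, 20  # constant-scaling shortcut h(x) = 0.5x across 20 units
print("identity shortcut:", 1.0 ** L)  # 1.0: the direct path passes unchanged
print("scaled shortcut  :", lam ** L)  # ~9.5e-7: the direct path decays away
```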
[Figure: training curves on CIFAR-10 (solid: test, dashed: train) comparing $h$ = identity with $h$ = gating.]
• gating should have better representation ability (identity is a special case of gating), but
• optimization difficulty dominates the results
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Experiment Set 2: what if after-add mapping $f$ is identity?
[Figure: three residual unit designs, each with two weight layers and BN:
• $f$ is ReLU (original ResNet): weight → BN → ReLU → weight → BN → addition → ReLU
• $f$ is BN + ReLU: BN (then ReLU) follows the addition
• $f$ is identity (pre-activation ResNet): BN → ReLU → weight → BN → ReLU → weight → addition, with nothing after the addition]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
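A hedged PyTorch sketch of the pre-activation unit (illustrative names, not the released model definition):

```python
import torch
from torch import nn

class PreActBlock(nn.Module):
    """Pre-activation residual unit sketch: BN and ReLU come *before* each
    weight layer, and nothing follows the addition (f = identity)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )

    def forward(self, x):
        return x + self.body(x)  # clean addition: shortest path stays untouched

y = PreActBlock(64)(torch.randn(1, 64, 32, 32))
```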
[Figure: training curves (solid: test, dashed: train) comparing $f$ = ReLU with $f$ = BN + ReLU; the BN + ReLU variant is worse.]
• BN could block propagation
• Keep the shortest path as smooth as possible
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
1001-layer ResNets on CIFAR-10
[Figure: training curves (solid: test, dashed: train) comparing $f$ = ReLU with $f$ = identity; the pre-activation design trains markedly better.]
• ReLU could block propagation when there are 1000 layers
• the pre-activation design eases optimization (and improves generalization; see paper)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Comparisons on CIFAR-10/100 (*all based on moderate augmentation)

CIFAR-10:
method                                   error (%)
NIN                                      8.81
DSN                                      8.22
FitNet                                   8.39
Highway                                  7.72
ResNet-110 (1.7M)                        6.61
ResNet-1202 (19.4M)                      7.93
ResNet-164, pre-activation (1.7M)        5.46
ResNet-1001, pre-activation (10.2M)      4.92 (4.89±0.14)

CIFAR-100:
method                                   error (%)
NIN                                      35.68
DSN                                      34.57
FitNet                                   35.04
Highway                                  32.39
ResNet-164 (1.7M)                        25.16
ResNet-1001 (10.2M)                      27.82
ResNet-164, pre-activation (1.7M)        24.33
ResNet-1001, pre-activation (10.2M)      22.71 (22.68±0.22)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
ImageNet Experiments
ImageNet single-crop (320x320) val error:
method                      data augmentation      top-1 error (%)   top-5 error (%)
ResNet-152, original        scale                  21.3              5.5
ResNet-152, pre-activation  scale                  21.1              5.5
ResNet-200, original        scale                  21.8              6.0
ResNet-200, pre-activation  scale                  20.7              5.3
ResNet-200, pre-activation  scale + aspect ratio   20.1*             4.8*
*independently reproduced by: https://github.com/facebook/fb.resnet.torch/tree/master/pretrained#notes
Training code and models available.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Summary of observations
• Keep the shortest path as smooth as possible
  • by making $h$ and $f$ identity
  • forward/backward signals directly flow through this path
• Features of any layers are additive outcomes
• 1000-layer ResNets can be easily trained and have better accuracy
[Diagram: the pre-activation unit: BN → ReLU → weight → BN → ReLU → weight → addition]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Future Works
• Representation
  • skipping 1 layer vs. multiple layers?
  • flat vs. bottleneck?
  • Inception-ResNet [Szegedy et al 2016]
  • ResNet in ResNet [Targ et al 2016]
  • width vs. depth [Zagoruyko & Komodakis 2016]
• Generalization
  • DropOut, MaxOut, DropConnect, …
  • DropLayer (Stochastic Depth) [Huang et al 2016]
• Optimization
  • without residual/shortcut?
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Applications
“Features matter.” (quote from [Girshick et al. 2014], the R-CNN paper)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
task                                  2nd-place winner   ResNets   margin (relative)
ImageNet Localization (top-5 error)   12.0               9.0       27%
ImageNet Detection ([email protected])        53.6               62.1      16%
COCO Detection ([email protected]:.95)        33.5               37.3      11%
COCO Segmentation ([email protected]:.95)     25.1               28.2      12%
(ImageNet Detection: absolute 8.5% better!)
• Our results are all based on ResNet-101
• Deeper features are well transferrable
Revolution of Depth
[Figure: PASCAL VOC 2007 object detection mAP (%), as before: HOG, DPM (shallow): 34; AlexNet (RCNN, 8 layers): 58; VGG (RCNN, 16 layers): 66; ResNet (Faster RCNN*, 101 layers): 86. *w/ other improvements & more data]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Engines of visual recognition
Deep Learning for Computer Vision
[Diagram: a backbone structure is pre-trained as a classification network on ImageNet data; the features are then fine-tuned on target data for a detection network (e.g. R-CNN), a segmentation network (e.g. FCN), a human pose estimation network, a depth estimation network, …]
Example: Object Detection
Image Classification (what?) vs. Object Detection (what + where?)
(e.g., boat, person)
Object Detection: R-CNN
Region-based CNN pipeline: from the input image, extract ~2,000 region proposals; warp each region and run 1 CNN for each region; classify the regions (aeroplane? no. … person? yes. … tvmonitor? no.)
figure credit: R. Girshick et al.
Girshick, Donahue, Darrell, Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014.
Object Detection: R-CNN
• R-CNN: one CNN is run independently on each pre-computed Region-of-Interest (RoI) to extract its feature.
Girshick, Donahue, Darrell, Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014.
Object Detection: Fast R-CNN
• Fast R-CNN: shared conv layers compute a single feature map for the whole image; features for the pre-computed RoIs are extracted by RoI pooling; end-to-end training.
Girshick. Fast R-CNN. ICCV 2015.
Object Detection: Faster R-CNN
• Faster R-CNN
  • solely based on CNN
  • no external modules
  • each step is end-to-end
[Diagram: image → CNN → feature map → Region Proposal Net → proposals → RoI pooling → features; end-to-end training.]
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
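To make that data flow concrete, a toy Python sketch of the pipeline; every function here is a stub invented for illustration, not the paper's implementation:

```python
from typing import Callable, List
import numpy as np

def faster_rcnn(image: np.ndarray,
                backbone: Callable, rpn: Callable, head: Callable) -> List:
    feature_map = backbone(image)            # one shared CNN pass per image
    proposals = rpn(feature_map)             # Region Proposal Net: learned boxes
    rois = [feature_map[y:y + 8, x:x + 8]    # crude stand-in for RoI pooling
            for (x, y) in proposals]
    return [head(r) for r in rois]           # classify each pooled region

# Toy stubs just to run the flow end-to-end.
backbone = lambda img: img.mean(axis=-1)     # fake feature map
rpn = lambda fmap: [(0, 0), (4, 4)]          # fake proposals
head = lambda roi: float(roi.mean() > 0)     # fake classifier
print(faster_rcnn(np.random.rand(32, 32, 3), backbone, rpn, head))
```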
Object Detection
[Diagram: a backbone structure (AlexNet, VGG-16, GoogleNet, ResNet-101, …) is pre-trained as a classification network on ImageNet data; the features are fine-tuned on detection data by a detector (R-CNN, Fast R-CNN, Faster R-CNN, MultiBox, SSD, …).]
“Plug-in” features and detectors are independently developed.
Object Detection
• Simply “Faster R-CNN + ResNet”
COCO detection results:
Faster R-CNN baseline   [email protected]   [email protected]:.95
VGG-16                  41.5       21.5
ResNet-101              48.4       27.2
ResNet-101 has a 28% relative gain vs. VGG-16.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016. Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
Object Detection
• RPN learns proposals by extremely deep nets
  • we use only 300 proposals (no hand-designed proposals)
• Add components:
  • iterative localization
  • context modeling
  • multi-scale testing
• All are based on CNN features; all are end-to-end
• All benefit more from deeper features: cumulative gains!
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016. Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
[Images: ResNet's object detection results on COCO. *the original images are from the COCO dataset]
Results on real video. Models trained on MS COCO (80 categories). (frame-by-frame; no temporal processing)
This video is available online: https://youtu.be/WZmSMkK9VuA
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015. Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
More Visual Recognition Tasks
ResNet-based methods lead on these benchmarks (incomplete list):
• ImageNet classification, detection, localization
• MS COCO detection, segmentation
• PASCAL VOC detection, segmentation
• Human pose estimation [Newell et al 2016]
• Depth estimation [Laina et al 2016]
• Segment proposal [Pinheiro et al 2016]
• …
[Screenshots: ResNet-101 entries atop the PASCAL detection and segmentation leaderboards.]
Potential Applications
ResNets have shown outstanding or promising results on:
• Visual recognition
• Image generation (Pixel RNN, Neural Art, etc.)
• Natural language processing (very deep CNN)
• Speech recognition (preliminary results)
• Advertising, user prediction (preliminary results)
Conclusions of the Tutorial
• Deep Residual Learning:
  • ultra-deep networks can be easy to train
  • ultra-deep networks can gain accuracy from depth
  • ultra-deep representations are well transferrable
  • now 200 layers on ImageNet and 1000 layers on CIFAR!
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016. Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Resources
• Models and Code
  • Our ImageNet models in Caffe: https://github.com/KaimingHe/deep-residual-networks
  • Many available implementations (list in https://github.com/KaimingHe/deep-residual-networks)
  • Facebook AI Research's Torch ResNet: https://github.com/facebook/fb.resnet.torch
  • Torch, CIFAR-10, with ResNet-20 to ResNet-110, training code, and curves: code
  • Lasagne, CIFAR-10, with ResNet-32 and ResNet-56 and training code: code
  • Neon, CIFAR-10, with pre-trained ResNet-32 to ResNet-110 models, training code, and curves: code
  • Torch, MNIST, 100 layers: blog, code
  • A winning entry in Kaggle's right whale recognition challenge: blog, code
  • Neon, Place2 (mini), 40 layers: blog, code
  • …
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016. Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.