
Deep Learning
Ronald Parr, CompSci 570
With thanks to Kris Hauser for some content

Late 1990's: Neural Networks Hit the Wall

• Recall that a 3-layer network can approximate any function arbitrarily closely (caveat: might require many, many hidden nodes)

• Q: Why not use big networks for hard problems?
• A: It didn't work in practice!
  – Vanishing gradients
  – Not enough training data
  – Not enough training time


Why Deep?

• Deep learning is a family of techniques for building and training large neural networks

• Why deep and not wide?
  – Deep sounds better than wide :)
  – Maybe easier to structure: think about layers of computation vs. one flat, wide computation (consider the CNF-to-DNF conversion from the midterm)

Vanishing Gradients

• Recall the backprop derivation:

$$\delta_j = \sum_k \frac{\partial E}{\partial a_k}\,\frac{\partial a_k}{\partial a_j} = h'(a_j)\sum_k w_{kj}\,\delta_k$$

• Activation functions are often between -1 and +1
• The further you get from the output layer, the smaller the gradient gets

• Hard to learn when gradients are noisy and small
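A toy numeric sketch of this effect (the layer width, depth, and weight scale are arbitrary choices for illustration, not from the lecture): push a delta backward through a stack of tanh layers using the formula above and watch its typical magnitude shrink.

```python
import numpy as np

# Toy illustration of vanishing gradients: push a delta of 1.0 backward
# through 10 tanh layers and track its typical magnitude.
rng = np.random.default_rng(0)
layer_width, n_layers = 20, 10

# Forward pass through tanh layers, remembering pre-activations.
x = rng.normal(size=layer_width)
weights, pre_acts = [], []
for _ in range(n_layers):
    W = rng.normal(scale=1.0 / np.sqrt(layer_width),
                   size=(layer_width, layer_width))
    a = W @ x
    weights.append(W)
    pre_acts.append(a)
    x = np.tanh(a)

# Backward pass: delta_j = h'(a_j) * sum_k w_kj * delta_k
delta = np.ones(layer_width)
for W, a in zip(reversed(weights), reversed(pre_acts)):
    delta = (1 - np.tanh(a) ** 2) * (W.T @ delta)
    print(f"mean |delta| = {np.abs(delta).mean():.2e}")  # typically shrinks layer by layer
```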


Related Problem: Saturation

• Sigmoid gradient goes to 0 at the tails
• Extreme values (saturation) anywhere along the backprop path cause the gradient to vanish

[Plot: sigmoid-style activation function, flat at both tails (x from -10 to 10, y from -1.5 to 1.5)]
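To make the saturation point concrete, here is a small sketch (assuming the standard logistic sigmoid) of how its derivative collapses toward 0 away from the origin:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_grad(a):
    s = sigmoid(a)
    return s * (1.0 - s)  # at most 0.25 (at a = 0), essentially 0 in the tails

for a in [0.0, 2.0, 5.0, 10.0]:
    print(f"a = {a:5.1f}   sigmoid'(a) = {sigmoid_grad(a):.2e}")
# Any saturated unit along the backprop path multiplies the delta by a
# number like these, so one extreme activation can kill the gradient.
```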

Estimating the Gradient

• Recall: backpropagation is gradient descent
• Computing the exact gradient of the loss function requires summing over all training samples

• Why not update after each training sample?
  – Called online or stochastic gradient
  – Possibility of more efficient learning

• Suppose you need only a small number of samples to estimate the gradient accurately

• Why do lots of unnecessary computation?
  – But, theoretically, it can be unstable unless you use a small step size


Batch/Minibatch Methods

• Find a sweet spot by estimating the gradient using a subset of the samples

• Randomly sample subsets of the training data and sum gradient computations over all samples in the subset

• Take advantage of parallel architectures (multicore/GPU)

• Still requires careful selection of the step size and step size adjustment schedule – art vs. science
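A minimal minibatch-gradient sketch; linear least squares is used here only as a stand-in loss, and the batch size, step size, and data sizes are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # training inputs
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)   # noisy targets

w = np.zeros(5)
step_size, batch_size = 0.1, 32
for epoch in range(20):
    # Shuffle the data, then take one gradient step per minibatch.
    for idx in np.array_split(rng.permutation(1000), 1000 // batch_size):
        Xb, yb = X[idx], y[idx]
        # Gradient of mean squared error, estimated on this minibatch only.
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)
        w -= step_size * grad

print("error in w:", np.linalg.norm(w - true_w))
```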

Tricks for Speeding Things Up

• Second-order methods, e.g., Newton's method – may be computationally intensive in high dimensions

• Conjugate gradient is more computationally efficient, though not yet widely used

• Momentum: use a combination of previous gradients to smooth out oscillations

• Line search: (binary) search in the gradient direction to find the biggest worthwhile step size

• Some methods try to get the benefits of second-order methods without the cost (without computing the full Hessian), e.g., Adam
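For concreteness, a minimal sketch of the momentum and Adam-style update rules applied to a single parameter vector; the hyperparameter values are the commonly used defaults, not anything specific to this lecture:

```python
import numpy as np

def momentum_step(w, grad, velocity, step_size=0.01, beta=0.9):
    """Blend the new gradient with an exponentially decaying average
    of previous gradients to smooth out oscillations."""
    velocity = beta * velocity + grad
    return w - step_size * velocity, velocity

def adam_step(w, grad, m, v, t, step_size=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: per-coordinate step sizes from running moment estimates,
    approximating second-order behavior without a full Hessian."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    return w - step_size * m_hat / (np.sqrt(v_hat) + eps), m, v

# Usage on a toy quadratic loss 0.5 * ||w||^2, whose gradient is just w:
w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    w, m, v = adam_step(w, w.copy(), m, v, t)
print(w)   # moves toward the minimum at 0
```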


Tricks for Breaking Down Problems

• Build up deep networks by training shallow networks, then feeding their output into new layers (may help with vanishing gradients and other problems) – a form of “pretraining”

• Train the network to solve “easier” problems first, then train on harder problems – curriculum learning, a form of “shaping”

Convolutional Neural Networks (CNNs)

• Championed by LeCun (1998)

• Originally used for handwriting recognition

• Now used in state-of-the-art systems in many computer vision applications

• Well-suited to data with a grid-like structure


Convolutions

• What is a convolution?
• A way to combine two functions, e.g., x and w:

$$s(t) = \int x(a)\, w(t-a)\, da$$

• Discrete version (sum over the entire domain):

$$s(t) = \sum_a x(a)\, w(t-a)$$

• Example: suppose s(t) is a decaying average of the values of x around t, with w decreasing as a gets further from t
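A small sketch of the discrete version above; the signal and the decaying kernel w are arbitrary, and numpy's convolve applies the same w(t - a) flip as the definition:

```python
import numpy as np

x = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 2.0, 0.0])   # signal
w = np.array([0.5, 0.3, 0.2])                        # decaying weights

# Direct implementation of s(t) = sum_a x(a) * w(t - a).
def conv1d(x, w):
    s = np.zeros(len(x) + len(w) - 1)
    for t in range(len(s)):
        for a in range(len(x)):
            if 0 <= t - a < len(w):
                s[t] += x[a] * w[t - a]
    return s

print(conv1d(x, w))
print(np.convolve(x, w))   # matches the loop above
```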

Convolutions on Grids

• For an image I and convolution “kernel” K:

$$S(i,j) = \sum_m \sum_n I(m,n)\, K(i-m,\, j-n) = \sum_m \sum_n I(i-m,\, j-n)\, K(m,n)$$

• Examples: a convolution can blur/smooth/noise-filter an image by averaging neighboring pixels. A convolution can also serve as an edge detector

Figure 9.6 from Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville


Convolution on Grid Example

Input (3×4):
  a b c d
  e f g h
  i j k l

Kernel (2×2):
  w x
  y z

Output (2×3, “valid” positions only):
  aw + bx + ey + fz    bw + cx + fy + gz    cw + dx + gy + hz
  ew + fx + iy + jz    fw + gx + jy + kz    gw + hx + ky + lz

Figure 9.1 caption: “An example of 2-D convolution without kernel-flipping. In this case we restrict the output to only positions where the kernel lies entirely within the image, called ‘valid’ convolution in some contexts. We draw boxes with arrows to indicate how the upper-left element of the output tensor is formed by applying the kernel to the corresponding upper-left region of the input tensor.”

Figure 9.1 from Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville
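A sketch reproducing the kind of computation in Figure 9.1: “valid” 2-D convolution without kernel flipping (i.e., cross-correlation), with arbitrary numbers standing in for a–l and w–z:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' convolution without kernel flipping, as in Figure 9.1:
    slide the kernel over positions where it fits entirely inside the image."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.arange(12, dtype=float).reshape(3, 4)   # stands in for a..l
kernel = np.array([[1.0, 2.0],                     # stands in for w, x
                   [3.0, 4.0]])                    # stands in for y, z
print(conv2d_valid(image, kernel))                 # 2x3 output, as in the figure
```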

Application to Images & Nets

• Images have a huge input space: 1000×1000 = 1M pixels
• Fully connected layers = huge number of weights, slow training

• Convolutional layers reduce connectivity by connecting only an m×n window around each pixel

• Can use weight sharing to learn a common set of weights so that the same convolution is applied everywhere (or in multiple places)
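A back-of-the-envelope sketch of why this matters for the 1000×1000 input mentioned above; the 5×5 window and single feature map are illustrative choices, not values from the slide:

```python
# Weight counts for a 1000x1000 input image.
pixels = 1000 * 1000

# Fully connected layer to a hidden layer of the same size:
fully_connected_weights = pixels * pixels          # 10^12 weights

# Convolutional connectivity: each unit sees only a 5x5 window...
local_window_weights = pixels * 5 * 5              # 2.5 * 10^7 weights

# ...and with weight sharing, one 5x5 kernel is reused everywhere:
shared_weights = 5 * 5                             # 25 weights per feature map

print(fully_connected_weights, local_window_weights, shared_weights)
```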


Additional Stages

• Convolutional stages (may) feed to detector stages

• Detectors are nonlinear, e.g., ReLU

• Detectors feed to pooling stages

• Pooling stages summarize upstream nodes, e.g., average, 2-norm, max

Source: Wikipedia

ReLU vs. Sigmoid

• ReLU is faster to compute
• Its derivative is trivial
• It only saturates on one side

• Worry about non-differentiability at 0?
• Can use a sub-gradient
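A minimal sketch of ReLU next to the sigmoid's gradient, using one common sub-gradient convention at 0 (picking 0 there):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def relu_grad(a):
    # Sub-gradient: 1 for a > 0, 0 for a < 0; at exactly 0 any value in
    # [0, 1] is a valid sub-gradient, and here we simply pick 0.
    return (a > 0).astype(float)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(relu(a), relu_grad(a))             # gradient stays 1 on the positive side
print(sigmoid(a) * (1 - sigmoid(a)))     # sigmoid gradient saturates on both sides
```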


Example Convolutional Network

INPUT 28×28 → Convolution → feature maps 4@24×24 → Subsampling → feature maps 4@12×12 → Convolution → feature maps 12@8×8 → Subsampling → feature maps 12@4×4 → Convolution → OUTPUT 26@1×1

From Convolutional Networks for Images, Speech, and Time-Series, LeCun & Bengio

N.B.: Subsampling = averaging

Weight sharing results in 2600 weights shared over 100,000 connections.
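A sketch tracing the feature-map sizes in the network above; the 5×5 convolution kernels, 2×2 subsampling windows, and final 4×4 kernel are assumptions consistent with the listed shapes, not values stated on the slide:

```python
def conv_size(n, k):        # valid convolution shrinks n to n - k + 1
    return n - k + 1

def subsample_size(n, p):   # p x p averaging reduces n by a factor of p
    return n // p

n = 28                              # INPUT 28x28
n = conv_size(n, 5)                 # -> 24x24  (4 feature maps)
n = subsample_size(n, 2)            # -> 12x12  (4 feature maps)
n = conv_size(n, 5)                 # -> 8x8    (12 feature maps)
n = subsample_size(n, 2)            # -> 4x4    (12 feature maps)
n = conv_size(n, 4)                 # -> 1x1    (26 outputs)
print(n)                            # 1
```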

Why This Works

• ConvNets can use weight sharing to reduce the number of parameters learned – mitigates problems with big networks

• Can be structured to learn scale- and position-invariant feature detectors

• Final layers then combine features to learn the target function

• Can be viewed as doing simultaneous feature selection and classification


ConvNets in Practice

• Work surprisingly well in many examples, even those that aren't images

• The number of convolutional layers and the form of the pooling and detecting units may be application-specific – picking these is part art, part science

Other Tricks

• Residual nets introduce connections across layers, which tends to mitigate the vanishing gradient problem

• Techniques such as image perturbation and dropout reduce overfitting and produce more robust solutions
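A minimal sketch of (inverted) dropout applied to a layer's activations during training; the drop probability here is an arbitrary illustrative value:

```python
import numpy as np

def dropout(activations, drop_prob=0.5, rng=None):
    """Randomly zero units during training and rescale the survivors so the
    expected activation is unchanged; at test time use the activations as-is."""
    rng = np.random.default_rng() if rng is None else rng
    keep = (rng.random(activations.shape) >= drop_prob).astype(float)
    return activations * keep / (1.0 - drop_prob)

h = np.ones(10)
print(dropout(h, rng=np.random.default_rng(0)))   # roughly half zeroed, the rest scaled to 2.0
```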


Putting It All Together

• Why is deep learning succeeding now when neural nets lost momentum in the 90's?

• New architectures (e.g., ConvNets) are better suited to (some) learning tasks and reduce the number of weights

• Smarter algorithms make better use of data and handle noisy gradients better

• Massive amounts of data make overfitting less of a concern (but still always a concern)
• Massive amounts of computation make handling massive amounts of data possible

• A large and growing bag of tricks for mitigating overfitting and vanishing gradient issues

Superficial(?) Limitations

• Deep learning results are not easily human-interpretable

• Computationally intensive
• Combination of art, science, and rules of thumb

• Can be tricked:
  – “Intriguing properties of neural networks”


Beyond Classification

• Deep networks (and other techniques) can be used for unsupervised learning

• Example: an autoencoder tries to compress inputs to a lower-dimensional representation
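A minimal sketch of the autoencoder idea: an encoder maps inputs to a lower-dimensional code, a decoder reconstructs the input, and both are trained to minimize reconstruction error. This linear, numpy-only version with arbitrary sizes only illustrates the shape of the objective:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 inputs in 10 dimensions

d_code = 3                               # lower-dimensional representation
W_enc = rng.normal(scale=0.1, size=(10, d_code))
W_dec = rng.normal(scale=0.1, size=(d_code, 10))

step = 0.05
for _ in range(500):
    code = X @ W_enc                     # encode
    recon = code @ W_dec                 # decode
    err = recon - X                      # reconstruction error
    # Gradient steps on mean squared reconstruction error.
    W_dec -= step * code.T @ err / len(X)
    W_enc -= step * X.T @ (err @ W_dec.T) / len(X)

print("reconstruction MSE:", np.mean((X @ W_enc @ W_dec - X) ** 2))
```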

Recurrent Networks

• Recurrent networks feed (part of) the output of the network back to the input

• Why?
  – Can learn (hidden) state, e.g., state in a hidden Markov model
  – Useful for parsing language
  – Can learn a program

• LSTM: a variation on the RNN that handles long-term memories better
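A minimal sketch of the recurrence in a vanilla RNN cell: the hidden state produced at one step is fed back in with the next input. The sizes and the tanh nonlinearity are standard choices, not specific to this lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 4, 8

W_xh = rng.normal(scale=0.1, size=(d_hidden, d_in))      # input -> hidden
W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))  # hidden -> hidden (the feedback loop)

def rnn_step(x_t, h_prev):
    """One step: the new hidden state depends on the current input and on the
    previous hidden state, which is how the network carries state over time."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)

h = np.zeros(d_hidden)
for x_t in rng.normal(size=(5, d_in)):   # a length-5 input sequence
    h = rnn_step(x_t, h)
print(h)
```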


Deeper Limitations

• We get impressive results, but we don't always understand why or whether we really need all of the data and computation used

• Hard to explain results and hard to guard against adversarial special cases (“Intriguing properties of neural networks” and “Universal adversarial perturbations”)

• Not clear how logic and high-level reasoning could be incorporated

• Not clear how to incorporate prior knowledge in a principled way