Deep Learning - Duke University
11/6/18
Deep Learning
Ronald Parr, CompSci 570
With thanks to Kris Hauser for some content
Late 1990's: Neural Networks Hit the Wall
• Recall that a 3-layer network can approximate any function arbitrarily closely (caveat: might require many, many hidden nodes)
• Q: Why not use big networks for hard problems?
• A: It didn't work in practice!
– Vanishing gradients
– Not enough training data
– Not enough training time
Why Deep?
• Deep learning is a family of techniques for building and training large neural networks
• Why deep and not wide?
– Deep sounds better than wide :-)
– Maybe easier to structure; think about layers of computation vs. one flat, wide computation (consider the CNF-to-DNF conversion from the midterm)
Vanishing Gradients
• Recall the backprop derivation:

δ_j = Σ_k (∂E/∂a_k)(∂a_k/∂a_j) = h'(a_j) Σ_k w_kj δ_k

• Activation functions are often between -1 and +1
• The further you get from the output layer, the smaller the gradient gets
• Hard to learn when gradients are noisy and small
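A toy numeric illustration of the recurrence above (my own sketch, not from the slides): with a tanh activation (derivative at most 1) and small hypothetical weights, the backward error signal shrinks geometrically with depth.

```python
import math
import random

# Push an output-layer error delta backward through 20 layers using the
# backprop recurrence delta_j = h'(a_j) * sum_k w_kj * delta_k, here
# collapsed to one hypothetical path per layer.

random.seed(0)
delta = 1.0  # error signal at the output layer
for layer in range(20):
    a = random.uniform(-2.0, 2.0)      # hypothetical pre-activation
    w = random.uniform(-0.5, 0.5)      # hypothetical small weight
    h_prime = 1.0 - math.tanh(a) ** 2  # tanh derivative, at most 1
    delta = h_prime * w * delta

print(abs(delta) < 1e-6)  # True: the gradient has all but vanished
```

Since each factor has magnitude at most 0.5, twenty layers shrink the signal by roughly 0.5^20, which is exactly the vanishing-gradient effect described above.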
Related Problem: Saturation
• Sigmoid gradient goes to 0 at the tails
• Extreme values (saturation) anywhere along the backprop path cause the gradient to vanish
[Figure: plot of a sigmoid activation over x in [-10, 10], flat at both tails]
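The tail behavior can be checked directly (an illustrative sketch using the logistic sigmoid, not code from the slides): the gradient peaks at 0 and collapses toward zero for extreme inputs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # derivative of the logistic sigmoid: s(x) * (1 - s(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(round(sigmoid_grad(0.0), 4))  # 0.25, the maximum
print(sigmoid_grad(10.0) < 1e-4)    # True: saturated at the right tail
print(sigmoid_grad(-10.0) < 1e-4)   # True: saturated at the left tail
```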
Estimating the Gradient
• Recall: Backpropagation is gradient descent
• Computing the exact gradient of the loss function requires summing over all training samples
• Why not update after each training sample?
– Called online or stochastic gradient descent
– Possibility of more efficient learning
• Suppose you need only a small number of samples to estimate the gradient correctly?
• Why do lots of unnecessary computation?
– But, theoretically, can be unstable unless you use a small step size
Batch/Minibatch Methods
• Find a sweet spot by estimating the gradient using a subset of the samples
• Randomly sample subsets of the training data and sum gradient computations over all samples in the subset
• Take advantage of parallel architectures (multicore/GPU)
• Still requires careful selection of step size and step size adjustment schedule – art vs. science
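A minimal minibatch sketch (my own illustrative example, not the slides' notation): fit y = w·x on noise-free synthetic data, estimating the squared-error gradient from a random subset of the samples at each step.

```python
import random

random.seed(1)
data = [(x, 3.0 * x) for x in range(1, 11)]  # true weight is 3.0

w, step_size = 0.0, 0.01
for step in range(200):
    batch = random.sample(data, 4)  # random subset (minibatch)
    # gradient of mean squared error (w*x - y)^2 over the batch
    grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
    w -= step_size * grad           # descend along the estimate

print(abs(w - 3.0) < 0.01)  # True: converges to the true weight
```

Note the fixed, small step size: with a larger one the noisy batch estimates can make the updates oscillate or diverge, which is the instability mentioned above.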
Tricks for Speeding Things Up
• Second-order methods, e.g., Newton's method – may be computationally intensive in high dimensions
• Conjugate gradient is more computationally efficient, though not yet widely used
• Momentum: Use a combination of previous gradients to smooth out oscillations
• Line search: (Binary) search in the gradient direction to find the biggest worthwhile step size
• Some methods try to get the benefits of second-order methods without the cost (without computing the full Hessian), e.g., Adam
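The momentum idea above can be sketched in a few lines (a toy quadratic loss of my own choosing, not from the slides): the previous update is blended into the current gradient step, smoothing the trajectory.

```python
def grad(w):
    return 2.0 * w  # gradient of the toy loss f(w) = w^2

w, velocity = 5.0, 0.0
step_size, beta = 0.1, 0.9  # beta controls how much history is kept
for _ in range(300):
    # momentum update: decay the old velocity, add the new gradient step
    velocity = beta * velocity - step_size * grad(w)
    w += velocity

print(abs(w) < 1e-3)  # True: converged to the minimum at w = 0
```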
Tricks for Breaking Down Problems
• Build up deep networks by training shallow networks, then feeding their output into new layers (may help with vanishing gradients and other problems) – a form of "pretraining"
• Train the network to solve "easier" problems first, then train on harder problems – curriculum learning, a form of "shaping"
Convolutional Neural Networks (CNNs)
• Championed by LeCun (1998)
• Originally used for handwriting recognition
• Now used in state-of-the-art systems in many computer vision applications
• Well suited to data with a grid-like structure
Convolutions
• What is a convolution?
• A way to combine two functions, e.g., x and w:

s(t) = ∫ x(a) w(t − a) da   (integral over the entire domain)

• Discrete version:

s(t) = Σ_a x(a) w(t − a)

Example: Suppose s(t) is a decaying average of values of x around t, with w decreasing as a gets further from t.
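The discrete formula above can be implemented directly (a naive sketch of my own; real code would use a library routine). With a decaying, normalized w, convolving spreads each value of x over its neighbors, exactly the smoothing example described.

```python
def convolve(x, w):
    """Full discrete convolution: s(t) = sum_a x[a] * w[t - a]."""
    n = len(x) + len(w) - 1
    s = [0.0] * n
    for t in range(n):
        for a in range(len(x)):
            if 0 <= t - a < len(w):
                s[t] += x[a] * w[t - a]
    return s

x = [0, 0, 1, 0, 0]   # a single spike
w = [0.5, 0.3, 0.2]   # decaying weights, summing to 1
print(convolve(x, w))  # [0.0, 0.0, 0.5, 0.3, 0.2, 0.0, 0.0]
```

The single spike is smeared out according to w, which is what a smoothing kernel does to every sample of the input at once.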
Convolutions on Grids
• For image I, convolution "kernel" K:

S(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n) = Σ_m Σ_n I(i − m, j − n) K(m, n)

Examples: A convolution can blur/smooth/noise-filter an image by averaging neighboring pixels. A convolution can also serve as an edge detector.
Figure 9.6 from Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville
Convolution on Grid Example

Input (3x4):
a b c d
e f g h
i j k l

Kernel (2x2):
w x
y z

Output (2x3):
aw + bx + ey + fz    bw + cx + fy + gz    cw + dx + gy + hz
ew + fx + iy + jz    fw + gx + jy + kz    gw + hx + ky + lz

Figure 9.1: An example of 2-D convolution without kernel flipping. In this case we restrict the output to only positions where the kernel lies entirely within the image, called "valid" convolution in some contexts. We draw boxes with arrows to indicate how the upper-left element of the output tensor is formed by applying the kernel to the corresponding upper-left region of the input tensor.
Figure 9.1 from Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville
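The "valid" convolution in the figure (no kernel flipping, so really cross-correlation) is easy to sketch; the version below uses numeric stand-ins for the symbolic entries a..l and w..z.

```python
def conv2d_valid(image, kernel):
    """Valid 2-D convolution without kernel flipping (cross-correlation):
    output is computed only where the kernel fits entirely in the image."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            row.append(sum(image[i + m][j + n] * kernel[m][n]
                           for m in range(kh) for n in range(kw)))
        out.append(row)
    return out

# 3x4 input and 2x2 kernel, matching the shapes in Figure 9.1
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
kernel = [[1, 0],
          [0, 1]]
print(conv2d_valid(image, kernel))  # [[7, 9, 11], [15, 17, 19]]
```

As in the figure, a 3x4 input with a 2x2 kernel yields a 2x3 output, since the kernel only fits at (3 − 2 + 1) x (4 − 2 + 1) positions.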
Application to Images & Nets
• Images have a huge input space: 1000x1000 = 1M inputs
• Fully connected layers = huge number of weights, slow training
• Convolutional layers reduce connectivity by connecting only an m x n window around each pixel
• Can use weight sharing to learn a common set of weights so that the same convolution is applied everywhere (or in multiple places)
Additional Stages
• Convolutional stages (may) feed into detector stages
• Detectors are nonlinear, e.g., ReLU
• Detectors feed into pooling stages
• Pooling stages summarize upstream nodes, e.g., average, 2-norm, max

Source: Wikipedia
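A pooling stage is a few lines of code; this sketch (my own, using the max summary mentioned above) collapses each 2x2 window of detector outputs to one number, halving each spatial dimension.

```python
def max_pool_2x2(grid):
    """Max pooling with a 2x2 window and stride 2."""
    out = []
    for i in range(0, len(grid), 2):
        row = []
        for j in range(0, len(grid[0]), 2):
            row.append(max(grid[i][j], grid[i][j + 1],
                           grid[i + 1][j], grid[i + 1][j + 1]))
        out.append(row)
    return out

detector = [[1, 3, 2, 0],
            [4, 2, 1, 1],
            [0, 1, 5, 6],
            [2, 2, 7, 8]]
print(max_pool_2x2(detector))  # [[4, 2], [2, 8]]
```

Swapping `max` for an average over the window gives the average-pooling variant.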
ReLU vs. Sigmoid
• ReLU is faster to compute
• Derivative is trivial
• Only saturates on one side
• Worry about non-differentiability at 0?
• Can use a sub-gradient
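A minimal sketch of the points above (my own toy code): ReLU's derivative is 0 for negative inputs and 1 for positive inputs, and at the kink x = 0 any value in [0, 1] is a valid subgradient; 0 is a common choice.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # subgradient: 0 below zero, 1 above; picks 0 at the kink x == 0
    return 1.0 if x > 0 else 0.0

print([relu(x) for x in (-2.0, 0.0, 3.0)])       # [0.0, 0.0, 3.0]
print([relu_grad(x) for x in (-2.0, 0.0, 3.0)])  # [0.0, 0.0, 1.0]
```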
Example Convolutional Network

INPUT 28x28 → (convolution) → feature maps 4@24x24 → (subsampling) → feature maps 4@12x12 → (convolution) → feature maps 12@8x8 → (subsampling) → feature maps 12@4x4 → (convolution) → OUTPUT 26@1x1

From Convolutional Networks for Images, Speech, and Time-Series, LeCun & Bengio
N.B.: Subsampling = averaging
Weight sharing results in 2600 weights shared over 100,000 connections.
Why This Works
• ConvNets can use weight sharing to reduce the number of parameters learned – mitigates problems with big networks
• Can be structured to learn scale- and position-invariant feature detectors
• Final layers then combine features to learn the target function
• Can be viewed as doing simultaneous feature selection and classification
ConvNets in Practice
• Work surprisingly well in many examples, even those that aren't images
• Number of convolutional layers, form of pooling and detecting units may be application specific – art & science in picking these
Other Tricks
• Residual nets introduce connections across layers, which tends to mitigate the vanishing gradient problem
• Techniques such as image perturbation and dropout reduce overfitting and produce more robust solutions
Putting It All Together
• Why is deep learning succeeding now when neural nets lost momentum in the 90's?
• New architectures (e.g., ConvNets) are better suited to (some) learning tasks and reduce the number of weights
• Smarter algorithms make better use of data and handle noisy gradients better
• Massive amounts of data make overfitting less of a concern (but still always a concern)
• Massive amounts of computation make handling massive amounts of data possible
• Large and growing bag of tricks for mitigating overfitting and vanishing gradient issues
Superficial(?) Limitations
• Deep learning results are not easily human-interpretable
• Computationally intensive
• Combination of art, science, rules of thumb
• Can be tricked:
– "Intriguing properties of neural networks"
Beyond Classification
• Deep networks (and other techniques) can be used for unsupervised learning
• Example: An autoencoder tries to compress inputs to a lower-dimensional representation
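A shape-level autoencoder sketch (hypothetical dimensions and random untrained weights of my own choosing): a 4-d input is squeezed through a 2-d bottleneck and reconstructed; training would minimize the reconstruction error ||decode(encode(x)) − x||².

```python
import random

random.seed(0)

def linear(x, weights):
    """Apply a linear layer: one row of weights per output unit."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weights]

# encoder maps 4 -> 2, decoder maps 2 -> 4 (untrained random weights)
W_enc = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
W_dec = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(4)]

x = [1.0, 0.0, -1.0, 0.5]
code = linear(x, W_enc)      # 2-d bottleneck representation
recon = linear(code, W_dec)  # 4-d reconstruction of the input
print(len(code), len(recon))  # 2 4
```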
Recurrent Networks
• Recurrent networks feed (part of) the output of the network back to the input
• Why?
– Can learn (hidden) state, e.g., state in a hidden Markov model
– Useful for parsing language
– Can learn a program
• LSTM: A variation on the RNN that handles long-term memories better
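The feedback loop can be sketched as a single scalar recurrence (toy weights of my own choosing, not from the slides): the hidden state h is fed back in at every step, so the final state depends on the whole input sequence, not just the last input.

```python
import math

def rnn_run(xs, w_x=0.5, w_h=0.8):
    """Run a one-unit RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h = 0.0
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)  # state fed back each step
    return h

# An early input still influences the final state two steps later
print(rnn_run([1, 0, 0]) != rnn_run([0, 0, 0]))  # True
```

This carried state is exactly what lets an RNN model something like the hidden state of a hidden Markov model; the LSTM variant adds gating so such influence can persist over much longer gaps.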
Deeper Limitations
• We get impressive results, but we don't always understand why, or whether we really need all of the data and computation used
• Hard to explain results and hard to guard against adversarial special cases ("Intriguing properties of neural networks" and "Universal adversarial perturbations")
• Not clear how logic and high-level reasoning could be incorporated
• Not clear how to incorporate prior knowledge in a principled way