Deep Learning - Duke University
11/6/18
Deep Learning
Ronald Parr, CompSci 570
With thanks to Kris Hauser for some content
Late 1990's: Neural Networks Hit the Wall
• Recall that a 3-layer network can approximate any function arbitrarily closely (caveat: might require many, many hidden nodes)
• Q: Why not use big networks for hard problems?
• A: It didn't work in practice!
– Vanishing gradients
– Not enough training data
– Not enough training time
Why Deep?
• Deep learning is a family of techniques for building and training large neural networks
• Why deep and not wide?
– Deep sounds better than wide :-)
– Maybe easier to structure; think about layers of computation vs. one flat, wide computation (consider the CNF-to-DNF conversion from the midterm)
Vanishing Gradients
• Recall the backprop derivation:

δ_j = Σ_k (∂E/∂a_k)(∂a_k/∂a_j) = h'(a_j) Σ_k w_kj δ_k

• Activation functions are often between -1 and +1
• The further you get from the output layer, the smaller the gradient gets
• Hard to learn when gradients are noisy and small
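A toy numeric illustration of the recurrence above (my own sketch, not from the slides): with a tanh activation (derivative at most 1) and small hypothetical weights, the backward error signal shrinks geometrically with depth.

```python
import math
import random

# Push an output-layer error delta backward through 20 layers using the
# backprop recurrence delta_j = h'(a_j) * sum_k w_kj * delta_k, here
# collapsed to one hypothetical path per layer.

random.seed(0)
delta = 1.0  # error signal at the output layer
for layer in range(20):
    a = random.uniform(-2.0, 2.0)      # hypothetical pre-activation
    w = random.uniform(-0.5, 0.5)      # hypothetical small weight
    h_prime = 1.0 - math.tanh(a) ** 2  # tanh derivative, at most 1
    delta = h_prime * w * delta

print(abs(delta) < 1e-6)  # True: the gradient has all but vanished
```

Since each factor has magnitude at most 0.5, twenty layers shrink the signal by roughly 0.5^20, which is exactly the vanishing-gradient effect described above.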
Related Problem: Saturation
• Sigmoid gradient goes to 0 at the tails
• Extreme values (saturation) anywhere along the backprop path cause the gradient to vanish
[Figure: plot of a sigmoid activation over x in [-10, 10], flat at both tails]
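The tail behavior can be checked directly (an illustrative sketch using the logistic sigmoid, not code from the slides): the gradient peaks at 0 and collapses toward zero for extreme inputs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # derivative of the logistic sigmoid: s(x) * (1 - s(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(round(sigmoid_grad(0.0), 4))  # 0.25, the maximum
print(sigmoid_grad(10.0) < 1e-4)    # True: saturated at the right tail
print(sigmoid_grad(-10.0) < 1e-4)   # True: saturated at the left tail
```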
Estimating the Gradient
• Recall: Backpropagation is gradient descent
• Computing the exact gradient of the loss function requires summing over all training samples
• Why not update after each training sample?
– Called online or stochastic gradient descent
– Possibility of more efficient learning
• Suppose you need only a small number of samples to estimate the gradient correctly?
• Why do lots of unnecessary computation?
– But, theoretically, can be unstable unless you use a small step size
Batch/Minibatch Methods
• Find a sweet spot by estimating the gradient using a subset of the samples
• Randomly sample subsets of the training data and sum gradient computations over all samples in the subset
• Take advantage of parallel architectures (multicore/GPU)
• Still requires careful selection of step size and step size adjustment schedule – art vs. science
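A minimal minibatch sketch (my own illustrative example, not the slides' notation): fit y = w·x on noise-free synthetic data, estimating the squared-error gradient from a random subset of the samples at each step.

```python
import random

random.seed(1)
data = [(x, 3.0 * x) for x in range(1, 11)]  # true weight is 3.0

w, step_size = 0.0, 0.01
for step in range(200):
    batch = random.sample(data, 4)  # random subset (minibatch)
    # gradient of mean squared error (w*x - y)^2 over the batch
    grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
    w -= step_size * grad           # descend along the estimate

print(abs(w - 3.0) < 0.01)  # True: converges to the true weight
```

Note the fixed, small step size: with a larger one the noisy batch estimates can make the updates oscillate or diverge, which is the instability mentioned above.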
Tricks for Speeding Things Up
• Second-order methods, e.g., Newton's method – may be computationally intensive in high dimensions
• Conjugate gradient is more computationally efficient, though not yet widely used
• Momentum: Use a combination of previous gradients to smooth out oscillations
• Line search: (Binary) search in the gradient direction to find the biggest worthwhile step size
• Some methods try to get the benefits of second-order methods without the cost (without computing the full Hessian), e.g., Adam
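The momentum idea above can be sketched in a few lines (a toy quadratic loss of my own choosing, not from the slides): the previous update is blended into the current gradient step, smoothing the trajectory.

```python
def grad(w):
    return 2.0 * w  # gradient of the toy loss f(w) = w^2

w, velocity = 5.0, 0.0
step_size, beta = 0.1, 0.9  # beta controls how much history is kept
for _ in range(300):
    # momentum update: decay the old velocity, add the new gradient step
    velocity = beta * velocity - step_size * grad(w)
    w += velocity

print(abs(w) < 1e-3)  # True: converged to the minimum at w = 0
```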
Tricks for Breaking Down Problems
• Build up deep networks by training shallow networks, then feeding their output into new layers (may help with vanishing gradients and other problems) – a form of "pretraining"
• Train the network to solve "easier" problems first, then train on harder problems – curriculum learning, a form of "shaping"
Convolutional Neural Networks (CNNs)
• Championed by LeCun (1998)
• Originally used for handwriting recognition
• Now used in state-of-the-art systems in many computer vision applications
• Well suited to data with a grid-like structure
Convolutions
• What is a convolution?
• A way to combine two functions, e.g., x and w:

s(t) = ∫ x(a) w(t − a) da   (integral over the entire domain)

• Discrete version:

s(t) = Σ_a x(a) w(t − a)

Example: Suppose s(t) is a decaying average of values of x around t, with w decreasing as a gets further from t.
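The discrete formula above can be implemented directly (a naive sketch of my own; real code would use a library routine). With a decaying, normalized w, convolving spreads each value of x over its neighbors, exactly the smoothing example described.

```python
def convolve(x, w):
    """Full discrete convolution: s(t) = sum_a x[a] * w[t - a]."""
    n = len(x) + len(w) - 1
    s = [0.0] * n
    for t in range(n):
        for a in range(len(x)):
            if 0 <= t - a < len(w):
                s[t] += x[a] * w[t - a]
    return s

x = [0, 0, 1, 0, 0]   # a single spike
w = [0.5, 0.3, 0.2]   # decaying weights, summing to 1
print(convolve(x, w))  # [0.0, 0.0, 0.5, 0.3, 0.2, 0.0, 0.0]
```

The single spike is smeared out according to w, which is what a smoothing kernel does to every sample of the input at once.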
Convolutions on Grids
• For image I, convolution "kernel" K:

S(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n) = Σ_m Σ_n I(i − m, j − n) K(m, n)

Examples: A convolution can blur/smooth/noise-filter an image by averaging neighboring pixels. A convolution can also serve as an edge detector.
Figure 9.6 from Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville
Convolution on Grid Example

Input (3x4):
a b c d
e f g h
i j k l

Kernel (2x2):
w x
y z

Output (2x3):
aw + bx + ey + fz    bw + cx + fy + gz    cw + dx + gy + hz
ew + fx + iy + jz    fw + gx + jy + kz    gw + hx + ky + lz

Figure 9.1: An example of 2-D convolution without kernel flipping. In this case we restrict the output to only positions where the kernel lies entirely within the image, called "valid" convolution in some contexts. We draw boxes with arrows to indicate how the upper-left element of the output tensor is formed by applying the kernel to the corresponding upper-left region of the input tensor.
Figure 9.1 from Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville
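The "valid" convolution in the figure (no kernel flipping, so really cross-correlation) is easy to sketch; the version below uses numeric stand-ins for the symbolic entries a..l and w..z.

```python
def conv2d_valid(image, kernel):
    """Valid 2-D convolution without kernel flipping (cross-correlation):
    output is computed only where the kernel fits entirely in the image."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            row.append(sum(image[i + m][j + n] * kernel[m][n]
                           for m in range(kh) for n in range(kw)))
        out.append(row)
    return out

# 3x4 input and 2x2 kernel, matching the shapes in Figure 9.1
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
kernel = [[1, 0],
          [0, 1]]
print(conv2d_valid(image, kernel))  # [[7, 9, 11], [15, 17, 19]]
```

As in the figure, a 3x4 input with a 2x2 kernel yields a 2x3 output, since the kernel only fits at (3 − 2 + 1) x (4 − 2 + 1) positions.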
Application to Images & Nets
• Images have a huge input space: 1000x1000 = 1M inputs
• Fully connected layers = huge number of weights, slow training
• Convolutional layers reduce connectivity by connecting only an m x n window around each pixel
• Can use weight sharing to learn a common set of weights so that the same convolution is applied everywhere (or in multiple places)
Additional Stages
• Convolutional stages (may) feed into detector stages
• Detectors are nonlinear, e.g., ReLU
• Detectors feed into pooling stages
• Pooling stages summarize upstream nodes, e.g., average, 2-norm, max

Source: Wikipedia
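A pooling stage is a few lines of code; this sketch (my own, using the max summary mentioned above) collapses each 2x2 window of detector outputs to one number, halving each spatial dimension.

```python
def max_pool_2x2(grid):
    """Max pooling with a 2x2 window and stride 2."""
    out = []
    for i in range(0, len(grid), 2):
        row = []
        for j in range(0, len(grid[0]), 2):
            row.append(max(grid[i][j], grid[i][j + 1],
                           grid[i + 1][j], grid[i + 1][j + 1]))
        out.append(row)
    return out

detector = [[1, 3, 2, 0],
            [4, 2, 1, 1],
            [0, 1, 5, 6],
            [2, 2, 7, 8]]
print(max_pool_2x2(detector))  # [[4, 2], [2, 8]]
```

Swapping `max` for an average over the window gives the average-pooling variant.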
ReLU vs. Sigmoid
• ReLU is faster to compute
• Derivative is trivial
• Only saturates on one side
• Worry about non-differentiability at 0?
• Can use a sub-gradient
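A minimal sketch of the points above (my own toy code): ReLU's derivative is 0 for negative inputs and 1 for positive inputs, and at the kink x = 0 any value in [0, 1] is a valid subgradient; 0 is a common choice.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # subgradient: 0 below zero, 1 above; picks 0 at the kink x == 0
    return 1.0 if x > 0 else 0.0

print([relu(x) for x in (-2.0, 0.0, 3.0)])       # [0.0, 0.0, 3.0]
print([relu_grad(x) for x in (-2.0, 0.0, 3.0)])  # [0.0, 0.0, 1.0]
```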
Example Convolutional Network

INPUT 28x28 → (convolution) → feature maps 4@24x24 → (subsampling) → feature maps 4@12x12 → (convolution) → feature maps 12@8x8 → (subsampling) → feature maps 12@4x4 → (convolution) → OUTPUT 26@1x1

From Convolutional Networks for Images, Speech, and Time-Series, LeCun & Bengio
N.B.: Subsampling = averaging
Weight sharing results in 2600 weights shared over 100,000 connections.
Why This Works
• ConvNets can use weight sharing to reduce the number of parameters learned – mitigates problems with big networks
• Can be structured to learn scale- and position-invariant feature detectors
• Final layers then combine features to learn the target function
• Can be viewed as doing simultaneous feature selection and classification
ConvNets in Practice
• Work surprisingly well in many examples, even those that aren't images
• Number of convolutional layers, form of pooling and detecting units may be application specific – art & science in picking these
Other Tricks
• Residual nets introduce connections across layers, which tends to mitigate the vanishing gradient problem
• Techniques such as image perturbation and dropout reduce overfitting and produce more robust solutions
Putting It All Together
• Why is deep learning succeeding now when neural nets lost momentum in the 90's?
• New architectures (e.g., ConvNets) are better suited to (some) learning tasks and reduce the number of weights
• Smarter algorithms make better use of data and handle noisy gradients better
• Massive amounts of data make overfitting less of a concern (but still always a concern)
• Massive amounts of computation make handling massive amounts of data possible
• Large and growing bag of tricks for mitigating overfitting and vanishing gradient issues
Superficial(?) Limitations
• Deep learning results are not easily human-interpretable
• Computationally intensive
• Combination of art, science, rules of thumb
• Can be tricked:
– "Intriguing properties of neural networks"
Beyond Classification
• Deep networks (and other techniques) can be used for unsupervised learning
• Example: An autoencoder tries to compress inputs to a lower-dimensional representation
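A shape-level autoencoder sketch (hypothetical dimensions and random untrained weights of my own choosing): a 4-d input is squeezed through a 2-d bottleneck and reconstructed; training would minimize the reconstruction error ||decode(encode(x)) − x||².

```python
import random

random.seed(0)

def linear(x, weights):
    """Apply a linear layer: one row of weights per output unit."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weights]

# encoder maps 4 -> 2, decoder maps 2 -> 4 (untrained random weights)
W_enc = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
W_dec = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(4)]

x = [1.0, 0.0, -1.0, 0.5]
code = linear(x, W_enc)      # 2-d bottleneck representation
recon = linear(code, W_dec)  # 4-d reconstruction of the input
print(len(code), len(recon))  # 2 4
```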
Recurrent Networks
• Recurrent networks feed (part of) the output of the network back to the input
• Why?
– Can learn (hidden) state, e.g., state in a hidden Markov model
– Useful for parsing language
– Can learn a program
• LSTM: A variation on the RNN that handles long-term memories better
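The feedback loop can be sketched as a single scalar recurrence (toy weights of my own choosing, not from the slides): the hidden state h is fed back in at every step, so the final state depends on the whole input sequence, not just the last input.

```python
import math

def rnn_run(xs, w_x=0.5, w_h=0.8):
    """Run a one-unit RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h = 0.0
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)  # state fed back each step
    return h

# An early input still influences the final state two steps later
print(rnn_run([1, 0, 0]) != rnn_run([0, 0, 0]))  # True
```

This carried state is exactly what lets an RNN model something like the hidden state of a hidden Markov model; the LSTM variant adds gating so such influence can persist over much longer gaps.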
Deeper Limitations
• We get impressive results, but we don't always understand why, or whether we really need all of the data and computation used
• Hard to explain results and hard to guard against adversarial special cases ("Intriguing properties of neural networks" and "Universal adversarial perturbations")
• Not clear how logic and high-level reasoning could be incorporated
• Not clear how to incorporate prior knowledge in a principled way