Deep Learning for NLP, Part 3
CS224N, Christopher Manning
(Many slides borrowed from ACL 2012 / NAACL 2013 tutorials by me, Richard Socher, and Yoshua Bengio)
Part 1.5: The Basics: Backpropagation Training
Backprop
• Compute gradient of example-wise loss wrt parameters
• Simply applying the derivative chain rule wisely
• If computing the loss(example, parameters) is O(n) computation, then so is computing the gradient
Simple Chain Rule
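For reference, the single-path case this slide illustrates is just the ordinary one-variable chain rule:

    For z = f(y) and y = g(x):   dz/dx = (dz/dy) (dy/dx)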
Multiple Paths Chain Rule
Multiple Paths Chain Rule - General
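In symbols (the standard statement, with y1, ..., yn the intermediate variables through which x influences z):

    ∂z/∂x = Σ_{i=1..n} (∂z/∂y_i) (∂y_i/∂x)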
Chain Rule in Flow Graph
Flow graph: any directed acyclic graph
  node = computation result
  arc = computation dependency
With {y1, ..., yn} = successors of x:
  ∂z/∂x = Σ_i (∂z/∂y_i) (∂y_i/∂x)
Back-Prop in Multi-Layer Net
h = sigmoid(Vx)
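As a worked example for this layer (derivation sketched here, not shown on the slide): with a scalar loss L and h = sigmoid(Vx), using sigmoid'(a) = sigmoid(a)(1 − sigmoid(a)), the chain rule gives

    delta = (dL/dh) ∘ h ∘ (1 − h)     (∘ = elementwise product)
    dL/dV = delta x^T                  (outer product)
    dL/dx = V^T delta                  (gradient passed to the layer below)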
Back-Prop in General Flow Graph
Single scalar output z
For each node x, with {y1, ..., yn} = successors of x:
  ∂z/∂x = Σ_i (∂z/∂y_i) (∂y_i/∂x)
1. Fprop: visit nodes in topo-sort order
   - Compute value of node given predecessors
2. Bprop:
   - Initialize output gradient = 1
   - Visit nodes in reverse order:
     Compute gradient wrt each node using gradient wrt successors
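A minimal Python sketch of this fprop/bprop procedure, assuming each node stores its predecessor nodes and knows its own forward and local-gradient functions; the Node class and names are illustrative, not the lecture's code (leaf nodes get a constant forward function, and topo_order ends at the single scalar output):

    class Node:
        def __init__(self, inputs, forward_fn, backward_fn):
            self.inputs = inputs              # predecessor nodes
            self.forward_fn = forward_fn      # input values -> value of this node
            self.backward_fn = backward_fn    # (input values, grad wrt output) -> grads wrt inputs
            self.value = None
            self.grad = 0.0

    def fprop(topo_order):
        # visit nodes in topological order, computing each value from its predecessors
        for node in topo_order:
            node.value = node.forward_fn(*[p.value for p in node.inputs])

    def bprop(topo_order):
        # initialize the output gradient to 1, then visit nodes in reverse order
        for node in topo_order:
            node.grad = 0.0
        topo_order[-1].grad = 1.0
        for node in reversed(topo_order):
            input_grads = node.backward_fn([p.value for p in node.inputs], node.grad)
            for pred, g in zip(node.inputs, input_grads):
                pred.grad += g    # sum over all paths (multiple-paths chain rule)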
Automatic Differentiation
• The gradient computation can be automatically inferred from the symbolic expression of the fprop.
• Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output.
• Easy and fast prototyping. See:
  • Theano (Python),
  • TensorFlow (Python/C++), or
  • Autograd (Lua/C++ for Torch)
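For example, a node type only needs a forward rule and a rule for the gradient wrt its inputs given the gradient wrt its output; a hypothetical sigmoid node in numpy (illustrative only, not the API of Theano, TensorFlow, or Autograd):

    import numpy as np

    class SigmoidNode:
        def forward(self, a):
            # output of the node given its input
            self.out = 1.0 / (1.0 + np.exp(-a))
            return self.out

        def backward(self, grad_out):
            # gradient wrt the input, given the gradient wrt the output
            return grad_out * self.out * (1.0 - self.out)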
Deep Learning General Strategy and Tricks
General Strategy
1. Select network structure appropriate for problem
   1. Structure: single words, fixed windows vs. convolutional vs. recurrent/recursive sentence-based vs. bag of words
   2. Nonlinearities [covered earlier]
2. Check for implementation bugs with gradient checks
3. Parameter initialization
4. Optimization
5. Check if the model is powerful enough to overfit
   1. If not, change model structure or make the model "larger"
   2. If you can overfit: regularize
Gradient Checks are Awesome!
• Allows you to know that there are no bugs in your neural network implementation! (But makes it run really slowly.)
• Steps:
  1. Implement your gradient
  2. Implement a finite difference computation by looping through the parameters of your network, adding and subtracting a small epsilon (~10^-4), and estimating derivatives
  3. Compare the two and make sure they are almost the same
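A minimal numpy sketch of steps 2-3, assuming a scalar-valued loss(theta) and an analytic grad(theta) (both placeholder names):

    import numpy as np

    def gradient_check(loss, grad, theta, eps=1e-4):
        analytic = grad(theta)
        numeric = np.zeros_like(theta)
        for i in range(theta.size):                  # loop over every parameter
            old = theta.flat[i]
            theta.flat[i] = old + eps
            loss_plus = loss(theta)
            theta.flat[i] = old - eps
            loss_minus = loss(theta)
            theta.flat[i] = old                      # restore the parameter
            numeric.flat[i] = (loss_plus - loss_minus) / (2 * eps)
        # the two estimates should agree to several decimal places
        return np.max(np.abs(analytic - numeric))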
Parameter Initialization
• Parameter initialization can be very important for success!
• Initialize hidden layer biases to 0 and output (or reconstruction) biases to the optimal value if weights were 0 (e.g., mean target or inverse sigmoid of mean target)
• Initialize weights ~ Uniform(−r, r), with r inversely proportional to fan-in (previous layer size) and fan-out (next layer size):
  r = sqrt(6 / (fan-in + fan-out)) for tanh units, and 4x bigger for sigmoid units [Glorot & Bengio, AISTATS 2010]
• Make initialization slightly positive for ReLU, to avoid dead units
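A numpy sketch of this recipe, using the uniform range r = sqrt(6 / (fan-in + fan-out)) for tanh units; the helper name is illustrative:

    import numpy as np

    def init_weights(fan_in, fan_out, unit="tanh"):
        r = np.sqrt(6.0 / (fan_in + fan_out))    # range for tanh units
        if unit == "sigmoid":
            r *= 4.0                             # 4x bigger for sigmoid units
        return np.random.uniform(-r, r, size=(fan_out, fan_in))

    b_hidden = np.zeros(100)                     # hidden-layer biases start at 0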
Stochastic Gradient Descent (SGD)
• Gradient descent uses the total gradient over all examples per update
  • Shouldn't be used. Very slow.
• SGD updates after each example:  θ_new = θ_old − ε_t ∇_θ L(z_t, θ)
  • L = loss function, z_t = current example, θ = parameter vector, and ε_t = learning rate
• You process an example and then move each parameter a small distance by subtracting a fraction of the gradient
• ε_t should be small ... more on a following slide
• Important: apply all SGD updates at once after the backprop pass
Stochastic Gradient Descent (SGD)
• Rather than do SGD on a single example, people usually do it on a minibatch of 32, 64, or 128 examples
• You sum the gradients in the minibatch (and scale down the learning rate)
• Minor advantage: the gradient estimate is much more robust when estimated on a bunch of examples rather than just one
• Major advantage: code can run much faster iff you can do a whole minibatch at once via matrix-matrix multiplies
• There is a whole panoply of fancier online learning algorithms commonly used now with NNs. Good ones include:
  • Adagrad
  • RMSprop
  • ADAM
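A minimal numpy-style sketch of minibatch SGD as described above; loss_and_grad, data, and the hyper-parameter values are placeholders, not the lecture's code:

    import numpy as np

    def sgd(theta, data, loss_and_grad, lr=0.01, batch_size=32, epochs=10):
        n = len(data)
        for epoch in range(epochs):
            np.random.shuffle(data)
            for start in range(0, n, batch_size):
                batch = data[start:start + batch_size]
                # averaging the per-example gradients plays the role of
                # summing them and scaling down the learning rate
                grad = np.mean([loss_and_grad(theta, z)[1] for z in batch], axis=0)
                theta = theta - lr * grad        # one update per minibatch
        return theta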
Learning Rates
• Setting α correctly is tricky
• Simplest recipe: α fixed and the same for all parameters
• Or start with a learning rate just small enough to be stable in the first pass through the data (epoch), then halve it on subsequent epochs
• Better results can usually be obtained by using a curriculum for decreasing learning rates, typically in O(1/t) because of theoretical convergence guarantees, e.g., with hyper-parameters ε_0 and τ
• Better yet: no hand-set learning rates, by using methods like AdaGrad (Duchi, Hazan, & Singer 2011) [but it may converge too soon; try resetting accumulated gradients]
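For reference, AdaGrad keeps a running sum of squared gradients and scales each parameter's step by its inverse square root; a minimal sketch (names illustrative):

    import numpy as np

    def adagrad_update(theta, grad, accum, lr=0.01, eps=1e-8):
        accum += grad ** 2                             # accumulated squared gradients
        theta -= lr * grad / (np.sqrt(accum) + eps)    # per-parameter step sizes
        # periodically resetting accum is one way to address the
        # "converges too soon" issue noted above
        return theta, accum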
Attempt to overfit training data
Assuming you found the right network structure, implemented it correctly, and optimized it properly, you can make your model totally overfit on your training data (99%+ accuracy)
• If not:
  • Change architecture
  • Make the model bigger (bigger vectors, more layers)
  • Fix optimization
• If yes:
  • Now it's time to regularize the network
Prevent Overfitting: Model Size and Regularization
• Simple first step: reduce model size by lowering the number of units and layers and other parameters
• Standard L1 or L2 regularization on weights
• Early stopping: use the parameters that gave the best validation error
• Sparsity constraints on hidden activations, e.g., add a sparsity penalty term to the cost
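For example, the standard penalties added to the training cost look like this (written out here for concreteness):

    cost = (average example-wise loss) + λ Σ_j θ_j^2     (L2; its gradient adds 2λθ, i.e., weight decay)
    cost = (average example-wise loss) + λ Σ_j |θ_j|     (L1)
    sparsity on activations: add a term such as λ Σ_i |h_i| to the cost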
Prevent Feature Co-adaptation
Dropout (Hinton et al. 2012) http://jmlr.org/papers/v15/srivastava14a.html
• Training time: at each instance of evaluation (in online SGD training), randomly set 50% of the inputs to each neuron to 0
• Test time: halve the model weights (now twice as many)
• This prevents feature co-adaptation: a feature cannot be useful only in the presence of particular other features
• A kind of middle ground between Naïve Bayes (where all feature weights are set independently) and logistic regression models (where weights are set in the context of all the others)
• Can be thought of as a form of model bagging
• It acts as a strong regularizer; see Wager et al. (2013) http://arxiv.org/abs/1307.1493
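A minimal numpy sketch of the 50% dropout recipe above (mask at training time, halve weights at test time); the function names are illustrative:

    import numpy as np

    def dropout_train(h, p_drop=0.5):
        mask = (np.random.rand(*h.shape) >= p_drop)    # randomly zero 50% of the inputs
        return h * mask

    def dropout_test_weights(W, p_drop=0.5):
        return W * (1.0 - p_drop)                      # halve the weights at test time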
Deep Learning Tricks of the Trade
• Y. Bengio (2012), "Practical Recommendations for Gradient-Based Training of Deep Architectures", http://arxiv.org/abs/1206.5533
  • Unsupervised pre-training
  • Stochastic gradient descent and setting learning rates
  • Main hyper-parameters: learning rate schedule & early stopping, minibatches, parameter initialization, number of hidden units, L1 or L2 weight decay, ...
• Y. Bengio, I. Goodfellow, and A. Courville (in press), "Deep Learning", MIT Press, ms. http://goodfeli.github.io/dlbook/
  • Many chapters on deep learning, including optimization tricks
  • Some more recent material than the 2012 paper
Part 1.7: Sharing Statistical Strength
Sharing Statistical Strength
• Besides very fast prediction, the main advantage of deep learning is statistical
• Potential to learn from fewer labeled examples because of sharing of statistical strength:
  • Unsupervised pre-training
  • Multi-task learning
  • Semi-supervised learning
Multi-Task Learning
• Generalizing better to new tasks is crucial to approach AI
• Deep architectures learn good intermediate representations that can be shared across tasks
• Good representations make sense for many tasks
[Figure: raw input x feeds a shared intermediate representation h, which feeds task outputs y1, y2, and y3]
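A toy numpy sketch of the figure above: one shared hidden representation h, with a separate output matrix per task (all names and shapes are illustrative):

    import numpy as np

    def multitask_forward(x, W_shared, task_heads):
        h = np.tanh(W_shared @ x)            # shared intermediate representation
        return [U @ h for U in task_heads]   # y1, y2, y3 for tasks 1-3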
Semi-Supervised Learning
• Hypothesis: P(c|x) can be more accurately computed using shared structure with P(x)
[Figure: the same classification problem shown purely supervised vs. semi-supervised]
Part 2: Advantages of Deep Learning
#4 Unsupervised feature learning
Today, most practical, good NLP & ML methods require labeled training data (i.e., supervised learning)
But almost all data is unlabeled
Most information must be acquired unsupervised
Fortunately, a good model of observed data can really help you learn classification decisions
Commentary: This is more the dream than the reality; most of the recent successes of deep learning have come from regular supervised learning over very large datasets
#5 Handling the recursivity of language
Human sentences are composed from words and phrases
We need compositionality in our ML models
Recursion: the same operator (same parameters) is applied repeatedly on different components
Recurrent models: recursion along a temporal sequence
[Figure: a parse tree for "A small crowd quietly enters the historic church" (NP "A small crowd"; VP "quietly enters the historic church") feeding semantic representations, alongside a recurrent chain of states z at times t−1, t, t+1 over inputs x at times t−1, t, t+1]
#6 Why now?
Despite prior investigation and understanding of many of the algorithmic techniques ...
Before 2006, training deep architectures was unsuccessful
What has changed?
• New methods for unsupervised pre-training (Restricted Boltzmann Machines = RBMs, autoencoders, contrastive estimation, etc.) and deep model training were developed
• More efficient parameter estimation methods
• Better understanding of model regularization
• More data and more computational power
Deep Learning models have already achieved impressive results for HLT
Neural Language Model [Mikolov et al., Interspeech 2011]
MSR MAVIS Speech System [Dahl et al. 2012; Seide et al. 2011; following Mohamed et al. 2011]
"The algorithms represent the first time a company has released a deep-neural-networks (DNN)-based speech-recognition algorithm in a commercial product."
Model (WSJ ASR task)          Eval WER
KN5 Baseline                  17.2
Discriminative LM             16.9
Recurrent NN combination      14.4
Acoustic model & training            Recognition pass   WER: RT03S FSH   WER: Hub5 SWB
GMM 40-mix, BMMI, SWB 309h           1-pass, −adapt     27.4             23.6
DBN-DNN 7 layer x 2048, SWB 309h     1-pass, −adapt     18.5 (−33%)      16.1 (−32%)
GMM 72-mix, BMMI, FSH 2000h          k-pass, +adapt     18.6             17.1
Deep Learning Models Have Interesting Performance Characteristics
Deep learning models can now be very fast in some circumstances
• SENNA [Collobert et al. 2011] can do POS or NER faster than other SOTA taggers (16x to 122x), using 25x less memory
• WSJ POS 97.29% acc; CoNLL NER 89.59% F1; CoNLL Chunking 94.32% F1
Changes in computing technology favor deep learning
• In NLP, speed has traditionally come from exploiting sparsity
• But with modern machines, branches and widely spaced memory accesses are costly
• Uniform parallel operations on dense vectors are faster
These trends are even stronger with multi-core CPUs and GPUs
TREE STRUCTURES WITH CONTINUOUS VECTORS
Compositionality
We need more than word vectors and bags! What of larger semantic units?
How can we know when larger units are similar in meaning?
• The snowboarder is leaping over the mogul
• A person on a snowboard jumps into the air
People interpret the meaning of larger text units (entities, descriptive terms, facts, arguments, stories) by semantic composition of smaller elements
Representing Phrases as Vectors
Can we extend ideas of word vector spaces to phrases?
[Figure: a 2D vector space in which Monday and Tuesday lie close together, France and Germany lie close together, and the phrases "the country of my birth" and "the place where I was born" lie close together]
If the vector space captures syntactic and semantic information, the vectors can be used as features for parsing and interpretation
Vectors for single words are useful as features but limited
How should we map phrases into a vector space?
Use the principle of compositionality! The meaning (vector) of a sentence is determined by (1) the meanings of its words and (2) the rules that combine them.
The model jointly learns compositional vector representations and the tree structure.
[Figure: word vectors for "the country of my birth" are recursively combined into phrase vectors, which land in the same 2D space as Monday, Tuesday, France, Germany, and "the place where I was born"]
Tree Recursive Neural Networks (TreeRNNs)
(Goller & Küchler 1996, Costa et al. 2003, Socher et al. ICML 2011)
Computational unit: a simple neural network layer, applied recursively
[Figure: the phrase "on the mat." parsed as a binary tree; the same neural network combines each pair of child vectors into a parent vector and a score at every node]
Version 1: Simple concatenation TreeRNN
p = tanh(W [c1; c2] + b),  where tanh is applied elementwise
score = V^T p
• Only a single weight matrix = the composition function!
• No real interaction between the input words
• Not adequate as a composition function for human language
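A minimal numpy sketch of this Version-1 composition applied recursively over a binary tree; the tree encoding and names are illustrative, not the lecture's code:

    import numpy as np

    def compose(c1, c2, W, b, v):
        p = np.tanh(W @ np.concatenate([c1, c2]) + b)   # p = tanh(W [c1; c2] + b)
        score = v @ p                                    # score = V^T p
        return p, score

    def tree_rnn(node, W, b, v):
        # node is either a word vector (leaf) or a (left, right) pair of subtrees
        if isinstance(node, np.ndarray):
            return node
        left = tree_rnn(node[0], W, b, v)
        right = tree_rnn(node[1], W, b, v)
        p, _ = compose(left, right, W, b, v)
        return p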
Version 4: Recursive Neural Tensor Network
Allows the two word or phrase vectors to interact multiplicatively
Beyond the bag of words: Sentiment detection
Is the tone of a piece of text positive, negative, or neutral?
• The usual assumption is that sentiment is "easy"
• Detection accuracy for longer documents is ~90%, BUT
  ... loved ... great ... impressed ... marvelous ...
Stanford Sentiment Treebank
• 215,154 phrases labeled in 11,855 sentences
• Can actually train and test compositions
http://nlp.stanford.edu:8080/sentiment/
Better Dataset Helped All Models
[Chart: positive/negative accuracy (roughly 75-84%) of BiNB, RNN, and MV-RNN, training with sentence labels vs. training with the Treebank]
• Hard negation cases are still mostly incorrect
• We also need a more powerful model!
Version 4: Recursive Neural Tensor Network
Idea: allow both additive and mediated multiplicative interactions of vectors
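Concretely, the RNTN composition (as in Socher et al.'s RNTN work, bias omitted) augments the Version-1 matrix term with a tensor term that mediates multiplicative interactions between the children:

    p = tanh( [c1; c2]^T V^{[1:d]} [c1; c2] + W [c1; c2] )

Here V^{[1:d]} is a tensor of d slices, each of size 2d x 2d; slice i produces the i-th coordinate of the multiplicative term, while W [c1; c2] is the additive term from Version 1.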
Recursive Neural Tensor Network
• Use the resulting vectors in the tree as input to a classifier like logistic regression
• Train all weights jointly with gradient descent
Positive/Negative Results on Treebank
[Chart: full-sentence positive/negative accuracy (roughly 74-86%) for BiNB, RNN, MV-RNN, and RNTN, training with sentence labels vs. training with the Treebank]
Classifying sentences: accuracy improves to 85.4%
Note: for more recent work, see Le & Mikolov (2014), Irsoy & Cardie (2014), Tai et al. (2015)
Experimental Results on Treebank
• RNTN can capture constructions like X but Y
• RNTN accuracy of 72%, compared to MV-RNN (65%), biword NB (58%), and RNN (54%)
Negation Results
When negating negatives, positive activation should increase!
Demo: http://nlp.stanford.edu:8080/sentiment/
Conclusion
Developing intelligent machines involves being able to recognize and exploit compositional structure
It also involves other things like top-down prediction, of course
Work is now underway on how to do more complex tasks than straight classification inside deep learning systems