Deep Learning for NLP, Part 3
CS224N, Christopher Manning
(Many slides borrowed from ACL 2012 / NAACL 2013 tutorials by me, Richard Socher, and Yoshua Bengio)
Backpropagation Training
Part 1.5: The Basics
Backprop
• Compute gradient of example-wise loss wrt parameters
• Simply applying the derivative chain rule wisely
• If computing the loss(example, parameters) is O(n) computation, then so is computing the gradient
Simple Chain Rule
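As a reminder, the single-path case: if z depends on x only through an intermediate variable y, with z = f(y) and y = g(x), then

dz/dx = (dz/dy)(dy/dx)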
Multiple Paths Chain Rule
Multiple Paths Chain Rule - General
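With multiple paths, the contributions along each path add: if z depends on x through intermediate variables y1, …, yn, then

∂z/∂x = Σᵢ (∂z/∂yᵢ)(∂yᵢ/∂x)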
Chain Rule in Flow Graph
Flow graph: any directed acyclic graph
  node = computation result
  arc = computation dependency
{y1, y2, …, yn} = successors of x
∂z/∂x = Σᵢ (∂z/∂yᵢ)(∂yᵢ/∂x)
Back-Prop in Multi-Layer Net
h = sigmoid(Vx)
[Figure: forward and backward pass through the layers of a multi-layer net]
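A minimal NumPy sketch of this computation, assuming a linear output layer U and a squared-error loss (both are illustrative assumptions):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fprop_bprop(x, y, V, U):
    # Forward pass: h = sigmoid(Vx), prediction s = U h
    a = V @ x
    h = sigmoid(a)
    s = U @ h
    loss = 0.5 * np.sum((s - y) ** 2)
    # Backward pass: apply the chain rule layer by layer
    d_s = s - y                      # dL/ds
    d_U = np.outer(d_s, h)           # dL/dU
    d_h = U.T @ d_s                  # dL/dh
    d_a = d_h * h * (1 - h)          # sigmoid'(a) = h(1-h)
    d_V = np.outer(d_a, x)           # dL/dV
    return loss, d_V, d_U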
Back-Prop in General Flow Graph
Single scalar output z
{y1, y2, …, yn} = successors of x
1. Fprop: visit nodes in topo-sort order
   - Compute value of node given predecessors
2. Bprop:
   - initialize output gradient = 1
   - visit nodes in reverse order: compute gradient wrt each node using gradient wrt successors
Automatic Differentiation
• The gradient computation can be automatically inferred from the symbolic expression of the fprop.
• Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output.
• Easy and fast prototyping. See:
  • Theano (Python),
  • TensorFlow (Python/C++), or
  • Autograd (Lua/C++ for Torch)
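A toy sketch of this node-centric scheme: each node type implements forward() for its output and backward() for the gradients wrt its inputs given the gradient wrt its output; fprop visits nodes in topo-sort order, bprop in reverse. All names here are illustrative.

class Node:
    """A node in the flow graph; leaves just hold a value."""
    def __init__(self, inputs=()):
        self.inputs, self.value, self.grad = list(inputs), None, 0.0
    def forward(self): pass
    def backward(self): pass

class Add(Node):
    def forward(self):
        self.value = self.inputs[0].value + self.inputs[1].value
    def backward(self):  # d(a+b)/da = d(a+b)/db = 1
        for c in self.inputs:
            c.grad += self.grad

class Mul(Node):
    def forward(self):
        self.value = self.inputs[0].value * self.inputs[1].value
    def backward(self):  # d(ab)/da = b, d(ab)/db = a
        self.inputs[0].grad += self.grad * self.inputs[1].value
        self.inputs[1].grad += self.grad * self.inputs[0].value

def grad(topo_order):
    for n in topo_order:            # 1. fprop: visit nodes in topo-sort order
        n.forward()
    topo_order[-1].grad = 1.0       # 2. bprop: initialize output gradient = 1
    for n in reversed(topo_order):  # visit nodes in reverse order
        n.backward()

# z = (x + 3) * x at x = 2: z = 10, dz/dx = 2x + 3 = 7
x, c = Node(), Node()
x.value, c.value = 2.0, 3.0
s = Add([x, c]); z = Mul([s, x])
grad([x, c, s, z])                  # now z.value == 10.0 and x.grad == 7.0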
Deep Learning General Strategy and Tricks
General Strategy
1. Select network structure appropriate for problem
   1. Structure: single words, fixed windows vs. convolutional vs. recurrent/recursive sentence-based vs. bag of words
   2. Nonlinearities [covered earlier]
2. Check for implementation bugs with gradient checks
3. Parameter initialization
4. Optimization
5. Check if the model is powerful enough to overfit
   1. If not, change model structure or make model “larger”
   2. If you can overfit: regularize
Gradient Checks are Awesome!
• Allows you to know that there are no bugs in your neural network implementation! (But makes it run really slow.)
• Steps:
  1. Implement your gradient
  2. Implement a finite difference computation by looping through the parameters of your network, adding and subtracting a small epsilon (∼10⁻⁴), and estimating derivatives
  3. Compare the two and make sure they are almost the same
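A sketch of steps 2 and 3, assuming loss(theta) returns the scalar loss for a flat parameter vector theta (both names are placeholders for your own code):

import numpy as np

def numerical_gradient(loss, theta, eps=1e-4):
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta[i] += eps
        f_plus = loss(theta)
        theta[i] -= 2 * eps
        f_minus = loss(theta)
        theta[i] += eps                          # restore the parameter
        grad[i] = (f_plus - f_minus) / (2 * eps) # central difference estimate
    return grad

# Step 3: compare with your analytic gradient, e.g.
# assert np.allclose(numerical_gradient(loss, theta), analytic_grad, atol=1e-6)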
Parameter Initialization
• Parameter initialization can be very important for success!
• Initialize hidden layer biases to 0 and output (or reconstruction) biases to optimal value if weights were 0 (e.g., mean target or inverse sigmoid of mean target).
• Initialize weights ∼ Uniform(−r, r), with r inversely proportional to fan-in (previous layer size) and fan-out (next layer size): r = sqrt(6 / (fan-in + fan-out)) for tanh units, and 4x bigger for sigmoid units [Glorot AISTATS 2010]
• Make initialization slightly positive for ReLU, to avoid dead units
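In code, the recipe above looks like this (sketch):

import numpy as np

def init_layer(fan_in, fan_out, unit='tanh'):
    r = np.sqrt(6.0 / (fan_in + fan_out))   # r shrinks with fan-in and fan-out
    if unit == 'sigmoid':
        r *= 4                               # 4x bigger for sigmoid units
    W = np.random.uniform(-r, r, size=(fan_out, fan_in))
    b = np.zeros(fan_out)                    # hidden biases start at 0
    return W, b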
Stochastic Gradient Descent (SGD)
• Gradient descent uses total gradient over all examples per update
• Shouldn’t be used. Very slow.
• SGD updates after each example: θ ← θ − ε_t ∇_θ L(z_t, θ)
• L = loss function, z_t = current example, θ = parameter vector, and ε_t = learning rate.
• You process an example and then move each parameter a small distance by subtracting a fraction of the gradient
• ε_t should be small … more in following slide
• Important: apply all SGD updates at once after backprop pass
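The update in code (sketch; grad_loss(z, theta) is an assumed function returning the gradient of the loss on example z):

def sgd_epoch(theta, examples, grad_loss, lr=0.01):
    for z_t in examples:
        theta -= lr * grad_loss(z_t, theta)  # theta <- theta - eps_t * grad
    return theta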
Stochastic Gradient Descent (SGD)
• Rather than do SGD on a single example, people usually do it on a minibatch of 32, 64, or 128 examples.
• You sum the gradients in the minibatch (and scale down the learning rate)
• Minor advantage: gradient estimate is much more robust when estimated on a bunch of examples rather than just one.
• Major advantage: code can run much faster if you can do a whole minibatch at once via matrix-matrix multiplies
• There is a whole panoply of fancier online learning algorithms commonly used now with NNs. Good ones include:
  • Adagrad
  • RMSprop
  • ADAM
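A minibatch variant (sketch, reusing grad_loss from above): gradients are summed over the batch and the step scaled down; a real implementation pushes the whole batch through matrix-matrix multiplies instead of a Python-level sum.

def minibatch_sgd_epoch(theta, data, grad_loss, lr=0.01, batch_size=64):
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        g = sum(grad_loss(z, theta) for z in batch)  # summed minibatch gradient
        theta -= (lr / len(batch)) * g               # scaled-down learning rate
    return theta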
Learning Rates
• Setting α correctly is tricky
• Simplest recipe: α fixed and same for all parameters
• Or start with learning rate just small enough to be stable in first pass through data (epoch), then halve it on subsequent epochs
• Better results can usually be obtained by using a curriculum for decreasing learning rates, typically in O(1/t) because of theoretical convergence guarantees, e.g., with hyper-parameters ε_0 and τ
• Better yet: no hand-set learning rates, by using methods like AdaGrad (Duchi, Hazan, & Singer 2011) [but may converge too soon – try resetting accumulated gradients]
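AdaGrad in a few lines (sketch, reusing grad_loss from above): each parameter's effective learning rate shrinks as its squared gradients accumulate.

import numpy as np

def adagrad(theta, examples, grad_loss, lr=0.1, eps=1e-8):
    acc = np.zeros_like(theta)                  # accumulated squared gradients
    for z_t in examples:
        g = grad_loss(z_t, theta)
        acc += g * g
        theta -= lr * g / (np.sqrt(acc) + eps)  # per-parameter adaptive step
    return theta
# the slide's fix for too-early convergence: periodically reset acc to zero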
Attempt to overfit training data
Assuming you found the right network structure, implemented it correctly, and optimized it properly, you can make your model totally overfit on your training data (99%+ accuracy)
• If not:
  • Change architecture
  • Make model bigger (bigger vectors, more layers)
  • Fix optimization
• If yes:
  • Now, it’s time to regularize the network
Prevent Overfitting: Model Size and Regularization
• Simple first step: reduce model size by lowering number of units and layers and other parameters
• Standard L1 or L2 regularization on weights
• Early stopping: use parameters that gave best validation error
• Sparsity constraints on hidden activations, e.g., add a penalty such as the L1 norm of the hidden activations to the cost
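For example, L2 regularization just adds a λθ term to every gradient step (sketch, continuing the notation above; lam is the regularization strength):

lam = 1e-4
g = grad_loss(z_t, theta) + lam * theta           # L2 weight decay
# L1 instead: g = grad_loss(z_t, theta) + lam * np.sign(theta)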
Prevent Feature Co-adaptation
Dropout (Hinton et al. 2012) http://jmlr.org/papers/v15/srivastava14a.html
• Training time: at each instance of evaluation (in online SGD-training), randomly set 50% of the inputs to each neuron to 0
• Test time: halve the model weights (now twice as many)
• This prevents feature co-adaptation: a feature cannot be useful only in the presence of particular other features
• A kind of middle ground between Naïve Bayes (where all feature weights are set independently) and logistic regression models (where weights are set in the context of all others)
• Can be thought of as a form of model bagging
• It acts as a strong regularizer; see Wager et al. (2013) http://arxiv.org/abs/1307.1493
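A sketch of dropout for one layer (all names illustrative): mask half the inputs during training, halve the weights at test time.

import numpy as np

def layer(W, x, train=True, p_drop=0.5):
    if train:
        mask = np.random.rand(*x.shape) > p_drop  # randomly zero 50% of inputs
        return W @ (x * mask)
    return (W * (1 - p_drop)) @ x                 # test time: halve the weights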
Deep Learning Tricks of the Trade
• Y. Bengio (2012), “Practical Recommendations for Gradient-Based Training of Deep Architectures” http://arxiv.org/abs/1206.5533
  • Unsupervised pre-training
  • Stochastic gradient descent and setting learning rates
  • Main hyper-parameters: learning rate schedule & early stopping, minibatches, parameter initialization, number of hidden units, L1 or L2 weight decay, …
• Y. Bengio, I. Goodfellow, and A. Courville (in press), “Deep Learning”. MIT Press, ms. http://goodfeli.github.io/dlbook/
  • Many chapters on deep learning, including optimization tricks
  • Some more recent material than the 2012 paper
Part 1.7: Sharing Statistical Strength
Sharing Statistical Strength
• Besides very fast prediction, the main advantage of deep learning is statistical
• Potential to learn from fewer labeled examples because of sharing of statistical strength:
  • Unsupervised pre-training
  • Multi-task learning
  • Semi-supervised learning
Multi-Task Learning
• Generalizing better to new tasks is crucial to approach AI
• Deep architectures learn good intermediate representations that can be shared across tasks
• Good representations make sense for many tasks
[Figure: task outputs y1, y2, y3 each computed from a shared intermediate representation h over raw input x]
Semi-Supervised Learning
• Hypothesis: P(c|x) can be more accurately computed using shared structure with P(x)
[Figure: purely supervised case]
Semi-Supervised Learning
• Hypothesis: P(c|x) can be more accurately computed using shared structure with P(x)
[Figure: semi-supervised case]
Advantages of Deep Learning, Part 2
Part 1.1: The Basics
#4 Unsupervised feature learning
Today, most practical, good NLP & ML methods require labeled training data (i.e., supervised learning)
But almost all data is unlabeled
Most information must be acquired unsupervised
Fortunately, a good model of observed data can really help you learn classification decisions
Commentary: this is more the dream than the reality; most of the recent successes of deep learning have come from regular supervised learning over very large datasets
#5 Handling the recursivity of language
Human sentences are composed from words and phrases
We need compositionality in our ML models
Recursion: the same operator (same parameters) is applied repeatedly on different components
Recurrent models: recursion along a temporal sequence
[Figure: a parse tree for “A small crowd quietly enters the historic church” (S → NP VP) mapped to semantic representations, and a recurrent model unrolled over inputs x_{t−1}, x_t, x_{t+1} with states z_{t−1}, z_t, z_{t+1}]
#6 Why now?
Despite prior investigation and understanding of many of the algorithmic techniques …
Before 2006, training deep architectures was unsuccessful ☹
What has changed?
• New methods for unsupervised pre-training (Restricted Boltzmann Machines = RBMs, autoencoders, contrastive estimation, etc.) and deep model training were developed
• More efficient parameter estimation methods
• Better understanding of model regularization
• More data and more computational power
Deep Learning models have already achieved impressive results for HLT
Neural Language Model [Mikolov et al. Interspeech 2011]
MSR MAVIS Speech System [Dahl et al. 2012; Seide et al. 2011; following Mohamed et al. 2011]
“The algorithms represent the first time a company has released a deep-neural-networks (DNN)-based speech-recognition algorithm in a commercial product.”
Model (WSJ ASR task)          Eval WER
KN5 baseline                  17.2
Discriminative LM             16.9
Recurrent NN combination      14.4

Acoustic model & training           Recognition      WER: RT03S FSH   WER: Hub5 SWB
GMM 40-mix, BMMI, SWB 309h          1-pass, −adapt   27.4             23.6
DBN-DNN 7 layers x 2048, SWB 309h   1-pass, −adapt   18.5 (−33%)      16.1 (−32%)
GMM 72-mix, BMMI, FSH 2000h         k-pass, +adapt   18.6             17.1
Deep Learning Models Have Interesting Performance Characteristics
Deep learning models can now be very fast in some circumstances
• SENNA [Collobert et al. 2011] can do POS or NER faster than other SOTA taggers (16x to 122x), using 25x less memory
• WSJ POS 97.29% acc; CoNLL NER 89.59% F1; CoNLL Chunking 94.32% F1
Changes in computing technology favor deep learning
• In NLP, speed has traditionally come from exploiting sparsity
• But with modern machines, branches and widely spaced memory accesses are costly
• Uniform parallel operations on dense vectors are faster
These trends are even stronger with multi-core CPUs and GPUs
TREE STRUCTURES WITH CONTINUOUS VECTORS
Compositionality: We need more than word vectors and bags! What of larger semantic units?
How can we know when larger units are similar in meaning?
• The snowboarder is leaping over the mogul
• A person on a snowboard jumps into the air
People interpret the meaning of larger text units – entities, descriptive terms, facts, arguments, stories – by semantic composition of smaller elements
Representing Phrases as Vectors
Vectors for single words are useful as features but limited
Can we extend ideas of word vector spaces to phrases?
[Figure: 2-d vector space (axes x1, x2) with word vectors Monday (9, 2), Tuesday (9.5, 1.5), France (2, 2.5), Germany (1, 3), and nearby phrase vectors (1, 5) and (1.1, 4) for “the country of my birth” and “the place where I was born”]
If the vector space captures syntactic and semantic information, the vectors can be used as features for parsing and interpretation
How should we map phrases into a vector space?
Use the principle of compositionality!
The meaning (vector) of a sentence is determined by (1) the meanings of its words and (2) the rules that combine them.
The model jointly learns compositional vector representations and tree structure.
[Figure: a tree over “the country of my birth” composing word vectors bottom-up into phrase vectors, plotted in the same 2-d space as Monday, Tuesday, France, Germany, and “the place where I was born”]
Tree Recursive Neural Networks (TreeRNNs)
(Goller & Küchler 1996, Costa et al. 2003, Socher et al. ICML 2011)
Computational unit: a simple neural network layer, applied recursively
[Figure: the phrase “on the mat” built bottom-up; at each node a neural network combines two child vectors into a parent vector and a score]
Version 1: Simple concatenation TreeRNN
p = tanh(W [c1; c2] + b), where tanh is applied elementwise
score = V^T p
Only a single weight matrix = the composition function!
No real interaction between the input words!
Not adequate as a composition function for human language
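One composition step in code (sketch): children c1, c2 are d-dimensional vectors, W is d x 2d, b is d-dimensional, and v is the scoring vector.

import numpy as np

def compose(c1, c2, W, b, v):
    c = np.concatenate([c1, c2])   # stack children: [c1; c2]
    p = np.tanh(W @ c + b)         # parent vector p = tanh(W [c1; c2] + b)
    score = v @ p                  # score = v^T p: plausibility of this merge
    return p, score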
Version 4: Recursive Neural Tensor Network
Allows the two word or phrase vectors to interact multiplicatively
Beyond the bag of words: Sentiment detection
Is the tone of a piece of text positive, negative, or neutral?
• The sentiment is that sentiment is “easy”
• Detection accuracy for longer documents ~90%, BUT
… loved … great … impressed … marvelous …
Stanford Sentiment Treebank
• 215,154 phrases labeled in 11,855 sentences
• Can actually train and test compositions
http://nlp.stanford.edu:8080/sentiment/
Better Dataset Helped All Models
• Hard negation cases are still mostly incorrect
• We also need a more powerful model!
[Figure: accuracy (75–84%) of BiNB, RNN, and MV-RNN when training with sentence labels vs. training with the treebank]
Version 4: Recursive Neural Tensor Network
Idea: Allow both additive and mediated multiplicative interactions of vectors
Recursive Neural Tensor Network
• Use resulting vectors in tree as input to a classifier like logistic regression
• Train all weights jointly with gradient descent
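A sketch of the RNTN composition (following Socher et al. 2013): each slice V[k] of a tensor lets the children interact multiplicatively, on top of the usual additive TreeRNN term.

import numpy as np

def rntn_compose(c1, c2, V, W, b):
    # c1, c2: (d,); V: (d, 2d, 2d); W: (d, 2d); b: (d,)
    c = np.concatenate([c1, c2])
    bilinear = np.array([c @ V[k] @ c for k in range(V.shape[0])])
    return np.tanh(bilinear + W @ c + b)  # multiplicative + additive interactions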
Positive/Negative Results on Treebank
Classifying sentences: accuracy improves to 85.4
[Figure: accuracy (74–86%) of BiNB, RNN, MV-RNN, and RNTN when training with sentence labels vs. training with the treebank]
Note: for more recent work, see Le & Mikolov (2014), Irsoy & Cardie (2014), Tai et al. (2015)
Experimental Results on Treebank
• RNTN can capture constructions like X but Y
• RNTN accuracy of 72%, compared to MV-RNN (65%), biword NB (58%), and RNN (54%)
Negation Results
When negating negatives, positive activation should increase!
Demo: http://nlp.stanford.edu:8080/sentiment/
Conclusion
Developing intelligent machines involves being able to recognize and exploit compositional structure
It also involves other things like top-down prediction, of course
Work is now underway on how to do more complex tasks than straight classification inside deep learning systems