CS388: Natural Language Processing
Greg Durrett
Lecture 7: Word Embeddings
[Figure: words mapped into a 300-d vector space capturing lexical semantics]
Administrivia
‣ Mini 1 grades out tonight or tomorrow
‣ Project 1 due Tuesday
Clarification: Forward-Backward
‣ Lecture 5 notes updated with F-B on CRFs
‣ Forward-backward slides showed forward-backward in the HMM case (emission scores were probabilities P(x_i | y_i))
‣ For CRFs: use transition/emission potentials (computed from features + weights) instead of probabilities
Recall: Feedforward NNs
[Figure: f(x) (n features) -> V (d x n matrix) -> z -> nonlinearity g (tanh, relu, ...) -> W (num_classes x d matrix) -> softmax -> P(y|x) (num_classes probs)]
P(y|x) = softmax(W g(V f(x)))
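‣ A minimal PyTorch sketch of this architecture (module and size names are illustrative, not from the slides):
import torch
import torch.nn as nn

class FFNN(nn.Module):
    def __init__(self, n_features, d_hidden, num_classes):
        super().__init__()
        self.V = nn.Linear(n_features, d_hidden)    # d x n matrix
        self.g = nn.Tanh()                          # nonlinearity (tanh, relu, ...)
        self.W = nn.Linear(d_hidden, num_classes)   # num_classes x d matrix

    def forward(self, f_x):
        # f_x: [batch_size, n_features]; returns P(y|x) = softmax(W g(V f(x)))
        return torch.softmax(self.W(self.g(self.V(f_x))), dim=-1)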
Recall: Backpropagation
[Figure: same network; the backward pass propagates err(root) through W (giving ∂L/∂W) and err(z) through V (giving ∂L/∂V via ∂z/∂V), back toward f(x)]
P(y|x) = softmax(W g(V f(x)))
This Lecture
‣ Word representations
‣ word2vec/GloVe
‣ Evaluating word embeddings
‣ Training tips

Training Tips
Batching
‣ Batching data gives speedups due to more efficient matrix operations
‣ Need to make the computation graph process a batch at the same time
def make_update(input, gold_label):
    # input is [batch_size, num_feats]
    # gold_label is [batch_size, num_classes]
    ...
    probs = ffnn.forward(input)  # [batch_size, num_classes]
    loss = torch.sum(torch.neg(torch.log(probs)) * gold_label)
    ...
‣ Batch sizes from 1-100 often work well
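‣ One way to form batches and call make_update (a sketch; train_xs and train_ys are hypothetical tensors holding the whole training set):
batch_size = 32
for start in range(0, train_xs.size(0), batch_size):
    batch_xs = train_xs[start:start + batch_size]   # [batch_size, num_feats]
    batch_ys = train_ys[start:start + batch_size]   # [batch_size, num_classes], one-hot
    make_update(batch_xs, batch_ys)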
Training Basics
‣ Basic formula: compute gradients on batch, use first-order optimization method (SGD, Adagrad, etc.)
‣ How to initialize? How to regularize? What optimizer to use?
‣ This lecture: some practical tricks. Take deep learning or optimization courses to understand this further
How does initialization affect learning?
[Figure: the feedforward network from before; P(y|x) = softmax(W g(V f(x)))]
‣ How do we initialize V and W? What consequences does this have?
‣ Nonconvex problem, so initialization matters!
‣ Nonlinear model… how does this affect things?
‣ If cell activations are too large in absolute value, gradients are small
‣ ReLU: larger dynamic range (all positive numbers), but can produce big values, can break down if everything is too negative
Initialization
‣ 1) Can't use zeroes for parameters to produce hidden layers: all values in that hidden layer are always 0 and have gradients of 0, never change
‣ 2) Initialize too large and cells are saturated
‣ Can do random uniform/normal initialization with appropriate scale
‣ Glorot initializer (sketched below): $U\left[-\sqrt{\frac{6}{\text{fan-in} + \text{fan-out}}},\ +\sqrt{\frac{6}{\text{fan-in} + \text{fan-out}}}\right]$
‣ Want variance of inputs and gradients for each layer to be the same
‣ Batch normalization (Ioffe and Szegedy, 2015): periodically shift + rescale each layer to have mean 0 and variance 1 over a batch (useful if net is deep)
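‣ A sketch of Glorot initialization in PyTorch (assuming the ffnn module from earlier; the zero-bias choice is illustrative):
import torch.nn as nn

nn.init.xavier_uniform_(ffnn.V.weight)  # U[-sqrt(6/(fan-in+fan-out)), +sqrt(6/(fan-in+fan-out))]
nn.init.xavier_uniform_(ffnn.W.weight)
nn.init.zeros_(ffnn.V.bias)
nn.init.zeros_(ffnn.W.bias)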
Dropout
‣ Probabilistically zero out parts of the network during training to prevent overfitting, use whole network at test time
Srivastava et al. (2014)
‣ Similar to benefits of ensembling: network needs to be robust to missing signals, so it has redundancy
‣ Form of stochastic regularization
‣ One line in PyTorch/Tensorflow, as sketched below
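‣ A minimal sketch (p = 0.5 is just an illustrative setting, not a recommendation from the slides):
import torch.nn as nn

drop = nn.Dropout(p=0.5)      # probabilistically zeroes entries during training
h = drop(hidden_activations)  # hidden_activations is a hypothetical tensor, e.g. g's output
# model.eval() turns dropout off so the whole network is used at test time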
Optimizer
‣ Adam (Kingma and Ba, ICLR 2015): very widely used. Adaptive step size + momentum
‣ Wilson et al. NIPS 2017: adaptive methods can actually perform badly at test time (Adam is in pink, SGD in black in their plots)
‣ One more trick: gradient clipping (set a max value for your gradients); a sketch follows
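‣ A sketch of both tricks in PyTorch (the learning rate and max norm are illustrative values; ffnn and loss are assumed from the earlier slides):
import torch

optimizer = torch.optim.Adam(ffnn.parameters(), lr=1e-3)          # adaptive step size + momentum
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(ffnn.parameters(), max_norm=5.0)   # gradient clipping
optimizer.step()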
Word Representations

Word Representations
‣ Continuous model <-> expects continuous semantics from input
‣ "You shall know a word by the company it keeps" Firth (1957)
‣ Neural networks work very well at continuous data, but words are discrete
slide credit: Dan Klein
Discrete Word Representations
[Figure: binary Brown-cluster tree over words such as good, enjoyable, great, fish, cat, dog, is, go, with 0/1 branch labels giving each word a bit-string cluster ID]
‣ Brown clusters: hierarchical agglomerative hard clustering (each word has one cluster, not some posterior distribution like in mixture models)
‣ Maximize $P(w_i | w_{i-1}) = P(c_i | c_{i-1}) P(w_i | c_i)$
‣ Useful features for tasks like NER, not suitable for NNs
Brown et al. (1992)
Word Embeddings
‣ Part-of-speech tagging with FFNNs
‣ Word embeddings for each word form the input
[Figure: "… Fed raises interest rates in order to …"; f(x) is built from emb(previous word), emb(curr word), emb(next word) (e.g. emb(raises), emb(interest), emb(rates)), plus other words, feats, etc.]
Botha et al. (2017)

Word Embeddings
‣ What properties should these vectors have?
‣ Want a vector space where similar words have similar embeddings
[Figure: good, enjoyable, and great cluster together; bad, dog, and is lie elsewhere]
the movie was great ≈ the movie was good
Word Embeddings
‣ Goal: come up with a way to produce these embeddings
‣ For each word, want "medium" dimensional vector (50-300 dims) representing it
word2vec/GloVe
Continuous Bag-of-Words
‣ Predict word from context: the dog bit the man
‣ Parameters: d x |V| (one d-length context vector per voc word), |V| x d output parameters (W)
[Figure: d-dimensional context vectors for "the" and "dog" are summed (size d), multiplied by W (size |V| x d), then softmaxed; gold label = bit, no manual labeling required!]
P(w | w_{-1}, w_{+1}) = softmax(W(c(w_{-1}) + c(w_{+1})))
Mikolov et al. (2013)
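‣ A sketch of CBOW as a PyTorch module (vocab_size and d are illustrative; the loss and training loop are omitted):
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, d):
        super().__init__()
        self.c = nn.Embedding(vocab_size, d)           # one d-length context vector per voc word
        self.W = nn.Linear(d, vocab_size, bias=False)  # |V| x d output parameters

    def forward(self, prev_word, next_word):
        # prev_word, next_word: [batch_size] indices of the context words
        hidden = self.c(prev_word) + self.c(next_word)   # sum the context vectors (size d)
        return torch.softmax(self.W(hidden), dim=-1)     # P(w | w_-1, w_+1)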
Skip-Gram
‣ Predict one word of context from word: the dog bit the man
[Figure: e(bit) is multiplied by W, then softmaxed; gold = dog]
‣ Another training example: bit -> the
‣ Parameters: d x |V| vectors, |V| x d output parameters (W) (also usable as vectors!)
P(w' | w) = softmax(W e(w))
Mikolov et al. (2013)
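‣ The corresponding skip-gram sketch (same illustrative naming and imports as the CBOW sketch above):
class SkipGram(nn.Module):
    def __init__(self, vocab_size, d):
        super().__init__()
        self.e = nn.Embedding(vocab_size, d)           # d x |V| word vectors
        self.W = nn.Linear(d, vocab_size, bias=False)  # |V| x d output parameters (also usable as vectors)

    def forward(self, word):
        # word: [batch_size] index of the center word; returns P(w' | w) = softmax(W e(w))
        return torch.softmax(self.W(self.e(word)), dim=-1)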
Hierarchical Softmax
‣ Matmul + softmax over |V| is very slow to compute for CBOW and SG:
P(w | w_{-1}, w_{+1}) = softmax(W(c(w_{-1}) + c(w_{+1})))
P(w' | w) = softmax(W e(w))
‣ Standard softmax: a [|V| x d] times d matrix-vector product
‣ Hierarchical softmax: log(|V|) dot products of size d, still |V| x d parameters
‣ Huffman encode vocabulary, use binary classifiers to decide which branch to take
‣ log(|V|) binary decisions
[Figure: binary tree over the vocabulary (…, the, a, …)]
Mikolov et al. (2013)
Skip-Gram with Negative Sampling
‣ Take (word, context) pairs and classify them as "real" or not. Create random negative examples by sampling from unigram distribution
(bit, the) => +1    (bit, cat) => -1
(bit, a) => -1      (bit, fish) => -1
‣ d x |V| vectors, d x |V| context vectors (same # of params as before)
$P(y = 1 | w, c) = \frac{e^{w \cdot c}}{e^{w \cdot c} + 1}$
‣ Objective = $\log P(y = 1 | w, c) + \frac{1}{k} \sum_{i=1}^{n} \log P(y = 0 | w_i, c)$   (the w_i are sampled)
‣ words in similar contexts select for similar c vectors
Mikolov et al. (2013)
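‣ A sketch of this loss for one (word, context) pair (w_vec and c_vec are 1-D torch tensors, neg_w_vecs is an [n, d] tensor of sampled negative word vectors, all assumed to come from embedding tables set up elsewhere):
import torch

def sgns_loss(w_vec, c_vec, neg_w_vecs, k):
    # log P(y=1 | w, c), with P(y=1 | w, c) = sigmoid(w . c)
    pos = torch.log(torch.sigmoid(torch.dot(w_vec, c_vec)))
    # (1/k) sum_i log P(y=0 | w_i, c) = (1/k) sum_i log sigmoid(-(w_i . c))
    neg = torch.log(torch.sigmoid(-(neg_w_vecs @ c_vec))).sum() / k
    return -(pos + neg)   # negate the objective so it can be minimized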
Connections with Matrix Factorization
‣ Skip-gram model looks at word-word co-occurrences and produces two types of vectors
[Figure: |V| x |V| matrix of word pair counts ≈ (|V| x d word vecs) times (d x |V| context vecs)]
‣ Looks almost like a matrix factorization… can we interpret it this way?
Levy et al. (2014)
Skip-Gram as Matrix Factorization
‣ If we sample negative examples from the uniform distribution over words, the skip-gram objective exactly corresponds to factoring this |V| x |V| matrix:
$M_{ij} = \mathrm{PMI}(w_i, c_j) - \log k$   (k = number of negative samples)
$\mathrm{PMI}(w_i, c_j) = \frac{P(w_i, c_j)}{P(w_i) P(c_j)} = \frac{\mathrm{count}(w_i, c_j)/D}{(\mathrm{count}(w_i)/D)\,(\mathrm{count}(c_j)/D)}$
‣ … and it's a weighted factorization problem (weighted by word freq)
Levy et al. (2014)
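‣ A sketch of building this matrix from counts (counts is a hypothetical |V| x |V| numpy array of word-context co-occurrence counts):
import numpy as np

def shifted_pmi(counts, k):
    D = counts.sum()
    p_wc = counts / D
    p_w = counts.sum(axis=1, keepdims=True) / D   # count(w_i) / D
    p_c = counts.sum(axis=0, keepdims=True) / D   # count(c_j) / D
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))          # -inf / nan for unseen pairs
    return pmi - np.log(k)                        # M_ij = PMI(w_i, c_j) - log k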
GloVe (Global Vectors)
‣ Also operates on counts matrix, weighted regression on the log co-occurrence matrix
‣ Objective = $\sum_{i,j} f(\mathrm{count}(w_i, c_j)) \left(w_i^\top c_j + a_i + b_j - \log \mathrm{count}(w_i, c_j)\right)^2$
‣ Constant in the dataset size (just need counts), quadratic in voc size
‣ By far the most common word vectors used today (5000+ citations)
[Figure: |V| x |V| matrix of word pair counts]
Pennington et al. (2014)
fastText: Sub-word Embeddings
‣ Same as SGNS, but break words down into n-grams with n = 3 to 6
where:
3-grams: <wh, whe, her, ere, re>
4-grams: <whe, wher, here, ere>
5-grams: <wher, where, here>
6-grams: <where, where>
‣ Replace $w \cdot c$ in the skip-gram computation with $\left(\sum_{g \in \text{ngrams}} w_g\right) \cdot c$
‣ Advantages?
Bojanowski et al. (2017)
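‣ A sketch of the sub-word lookup (ngram_vecs is a hypothetical dict from n-gram string to vector; boundary markers < and > are added as in the example above):
def fasttext_word_vector(word, ngram_vecs, n_min=3, n_max=6):
    padded = "<" + word + ">"
    vec = 0
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            vec = vec + ngram_vecs[padded[i:i + n]]   # sum the n-gram vectors
    return vec   # used in place of w in the skip-gram score w . c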
Using Word Embeddings
‣ Approach 1: learn embeddings as parameters from your data
    ‣ Often works pretty well
‣ Approach 2: initialize using GloVe, keep fixed
    ‣ Faster because no need to update these parameters
‣ Approach 3: initialize using GloVe, fine-tune
    ‣ Works best for some tasks
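‣ The three approaches in PyTorch (glove_matrix is a hypothetical [|V|, d] tensor of pretrained GloVe vectors; vocab_size and d as before):
import torch.nn as nn

emb1 = nn.Embedding(vocab_size, d)                               # 1) learn from your data
emb2 = nn.Embedding.from_pretrained(glove_matrix, freeze=True)   # 2) GloVe, keep fixed
emb3 = nn.Embedding.from_pretrained(glove_matrix, freeze=False)  # 3) GloVe, fine-tune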
Preview: Context-dependent Embeddings
‣ How to handle different word senses? One vector for balls:
they hit the balls    vs.    they dance at balls
‣ Train a neural language model to predict the next word given previous words in the sentence, use its internal representations as word vectors
‣ Context-sensitive word embeddings: depend on rest of the sentence
‣ Huge improvements across nearly all NLP tasks over GloVe
Peters et al. (2018)
Compositional Semantics
‣ What if we want embedding representations for whole sentences?
‣ Skip-thought vectors (Kiros et al., 2015), similar to skip-gram generalized to a sentence level (more later)
‣ Is there a way we can compose vectors to make sentence representations? Summing?
‣ Will return to this in a few weeks as we move on to syntax and semantics
Evaluation

Evaluating Word Embeddings
‣ What properties of language should word embeddings capture?
[Figure: vector space where good, enjoyable, great cluster together; dog, cat, wolf, tiger cluster; is, was cluster; bad lies apart]
‣ Similarity: similar words are close to each other
‣ Analogy:
Paris is to France as Tokyo is to ???
good is to best as smart is to ???
Similarity
‣ SVD = singular value decomposition on PMI matrix
‣ GloVe does not appear to be the best when experiments are carefully controlled, but it depends on hyperparameters + these distinctions don't matter in practice
Levy et al. (2015)
Analogies
[Figure: king, queen, man, woman in vector space; (king - man) + woman = queen]
‣ Why would this be?
‣ woman - man captures the difference in the contexts that these occur in
‣ Dominant change: more "he" with man and "she" with woman; similar to difference between king and queen
king + (woman - man) = queen
‣ Can evaluate on this as well
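‣ A sketch of solving an analogy by nearest neighbor in cosine similarity (vecs is a hypothetical dict from word to unit-normalized numpy vector):
import numpy as np

def analogy(a, b, c, vecs):
    # returns the word closest to (b - a) + c, excluding the query words
    target = vecs[b] - vecs[a] + vecs[c]
    target = target / np.linalg.norm(target)
    candidates = (w for w in vecs if w not in (a, b, c))
    return max(candidates, key=lambda w: np.dot(vecs[w], target))

# analogy("man", "king", "woman", vecs) should give "queen" with good embeddings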
What can go wrong with word embeddings?
‣ What's wrong with learning a word's "meaning" from its usage?
‣ What data are we learning from?
‣ What are we going to learn from this data?
What do we mean by bias?
‣ Identify she-he axis in word vector space, project words onto this axis
‣ Nearest neighbor of (b - a + c)
Bolukbasi et al. (2016)
Manzini et al. (2019)
Debiasing
‣ Identify gender subspace with gendered words (she, he, woman, man)
‣ Project words onto this subspace
‣ Subtract those projections from the original word (e.g. homemaker -> homemaker')
Bolukbasi et al. (2016)
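‣ A sketch of the projection step (g is a hypothetical unit vector spanning the gender subspace, e.g. built from she - he; w is a word vector, both numpy arrays):
import numpy as np

def remove_gender_component(w, g):
    projection = np.dot(w, g) * g   # component of w along the gender direction
    return w - projection           # e.g. homemaker -> homemaker'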
Hardness of Debiasing
‣ Not that effective… and the male and female words are still clustered together
‣ Bias pervades the word embedding space and isn't just a local property of a few words
Gonen and Goldberg (2019)
Takeaways
‣ Lots to tune with neural networks
‣ Training: optimizer, initializer, regularization (dropout), …
‣ Hyperparameters: dimensionality of word embeddings, layers, …
‣ Word vectors: learning word -> context mappings has given way to matrix factorization approaches (constant in dataset size)
‣ Lots of pretrained embeddings work well in practice, they capture some desirable properties
‣ Even better: context-sensitive word embeddings (ELMo)
‣ Next time: RNNs and CNNs