Word-embeddings and Word2Vec and The Computation Graph
Yoav Goldberg

Transcript of u.cs.biu.ac.il/~89-687/lec5.pdf

Page 1:

Word-embeddings and Word2Vec
and The Computation Graph

Yoav Goldberg

Page 2:

Last Time

• Language modeling.

• 1-hot X matrix

• Embedding layer

• (pre-trained) Word Embeddings.

Page 3:

Neural LM

softmax(·)                                       ∈ R^d_out
  ↑ ·W3 + b3                                     ∈ R^d_out
  ↑ g(·W2 + b2)                                  ∈ R^d_2
  ↑ g(·W1 + b1)                                  ∈ R^d_1
  ↑ x                                            ∈ R^d_in
  ↑ encode(x_{k-4}, x_{k-3}, x_{k-2}, x_{k-1}) = E[x_{k-4}] ∘ E[x_{k-3}] ∘ E[x_{k-2}] ∘ E[x_{k-1}]

What's in W3? The columns of W3 correspond to vocab items!
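A minimal numpy sketch of this stack (sizes, initialization, and the choice of g = tanh are illustrative assumptions, not from the slide):

import numpy as np

V, d_emb, d1, d2 = 10000, 50, 100, 100            # hypothetical sizes; output dim = |V|
E  = np.random.randn(V, d_emb) * 0.01             # embedding matrix, one row per word
W1 = np.random.randn(4 * d_emb, d1) * 0.01; b1 = np.zeros(d1)
W2 = np.random.randn(d1, d2) * 0.01;        b2 = np.zeros(d2)
W3 = np.random.randn(d2, V) * 0.01;         b3 = np.zeros(V)   # columns of W3 correspond to vocab items

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lm_probs(context_ids):                        # ids of x_{k-4} .. x_{k-1}
    x = np.concatenate([E[i] for i in context_ids])   # E[x_{k-4}] ∘ ... ∘ E[x_{k-1}]
    h1 = np.tanh(x @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    return softmax(h2 @ W3 + b3)                  # distribution over the next word

p = lm_probs([17, 4, 256, 3])                     # four (arbitrary) previous-word ids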

Page 4:

• Consider the columns of W3.

• Consider the rows of E.

Neural Word Embeddings

Page 5:

• We trained a language model.

• We ended up with vector representations of words.

• These representations are useful -- they encode various aspects of word similarity.

Neural Word Embeddings

Page 6:

Neural LM

softmax(·)                                       ∈ R^d_out
  ↑ ·W3 + b3                                     ∈ R^d_out
  ↑ g(·W2 + b2)                                  ∈ R^d_2
  ↑ g(·W1 + b1)                                  ∈ R^d_1
  ↑ x                                            ∈ R^d_in
  ↑ encode(x_{k-4}, x_{k-3}, x_{k-2}, x_{k-1}) = E[x_{k-4}] ∘ E[x_{k-3}] ∘ E[x_{k-2}] ∘ E[x_{k-1}]

How are the softmax scores determined?


Page 8:

Neural LM

softmax(·)
  ↑ ·W3 + b3
  ↑ g(·W2 + b2)
  ↑ g(·W1 + b1)
  ↑ x
  ↑ encode(x_{k-4}, x_{k-3}, x_{k-2}, x_{k-1})

How are the softmax scores determined?

x is a vector representation of the context words (here "the big black dog ___"); each column of W3 corresponds to a word (throw, barked, bark, green, ate, ball, park, who, feel, one, is, are, ...).

Page 9:

• Training the language model is expensive. (Why?)

• Predicting a word based on previous words is nice, but can we do better?

• If all we care about are the word vectors...

towards word2vec

Page 10:

• Radically simplify the neural LM.

• Very fast training.

• Obtain good word representations.

Word2Vec

Page 11:

word2vec

Page 12:

word2vec

dog → cat, dogs, dachshund, rabbit, puppy, poodle, rottweiler, mixed-breed, doberman, pig
sheep → cattle, goats, cows, chickens, sheeps, hogs, donkeys, herds, shorthorn, livestock
november → october, december, april, june, february, july, september, january, august, march
jerusalem → tiberias, jaffa, haifa, israel, palestine, nablus, damascus, katamon, ramla, safed
teva → pfizer, schering-plough, novartis, astrazeneca, glaxosmithkline, sanofi-aventis, mylan, sanofi, genzyme, pharmacia

Page 13:

• instead of predicting word based on previous words...

• ...predict word based on surrounding words.

Collobert and Weston model

10.4. WORD EMBEDDING ALGORITHMS

which each dimension corresponds to a specific context the word occurs in, the dimensions in the distributed representation are not interpretable, and specific dimensions do not necessarily correspond to specific concepts. The distributed nature of the representation means that a given aspect of meaning may be captured by (distributed over) a combination of many dimensions, and that a given dimension may contribute to capturing several aspects of meaning.[6]

Consider the language modeling network in equation (9.3) in chapter 9. The context of a word is the kgram of words preceding it. Each word is associated with a vector, and their concatenation is encoded into a d_hid-dimensional vector h using a non-linear transformation. The vector h is then multiplied by a matrix W2 in which each column corresponds to a word, and interactions between h and columns in W2 determine the probabilities of the different words given the context. The columns of W2 (as well as the rows of the embeddings matrix E) are distributed representations of words: the training process determines good values for the embeddings such that they produce correct probability estimates for a word in the context of a kgram, capturing the "meaning" of the words in the columns of W2 associated with them.

Collobert and Weston

The design of the network in equation 9.3 is driven by the language modeling task, which poses two important requirements: the need to produce a probability distribution over words, and the need to condition on contexts that can be combined using the chain-rule of probability to produce sentence-level probability estimates. The need to produce a probability distribution dictates the need to compute an expensive normalization term involving all the words in the output vocabulary, while the need to decompose according to the chain-rule restricts the conditioning context to preceding kgrams.

If we only care about the resulting representations, both of the constraints can be relaxed, as was done by Collobert and Weston [2008] in a model which was refined and presented in greater depth by Bengio et al. [2009]. The first change introduced by Collobert and Weston was changing the context of a word from the preceding kgram (the words to its left) to a word-window surrounding it (i.e. computing P(w3 | w1, w2, w4, w5) instead of P(w5 | w1, w2, w3, w4)). The generalization to other kinds of fixed-sized contexts c_{1:k} is straightforward.

The second change introduced by Collobert and Weston is to abandon the probabilistic output requirement. Instead of computing a probability distribution over target words given a context, their model only attempts to assign a score to each word, such that the correct word scores above incorrect ones. This removes the need to compute the costly normalization term over the vocabulary.

[6] We note that in many ways the explicit distributional representation is also "distributed": different aspects of the meaning of a word are captured by groups of contexts the word occurs in, and a given context can contribute to different aspects of meaning. Moreover, after performing dimensionality reduction over the word-context matrix, the dimensions are no longer easily interpretable.


Page 14:

• Getting rid of the softmax over the vocabulary:

• probabilities --> scores

• score the quality of a sequence, not all the words

• ranking based loss

Collobert and Weston model

Page 15:

Collobert and Weston model

score(D, A, _, C, F; G) = g(xU)·v
x = (E[D] ∘ E[A] ∘ E[G] ∘ E[C] ∘ E[F])

Page 16:

Collobert and Weston model

score(D, A, _, C, F; G) = g(xU)·v
x = (E[D] ∘ E[A] ∘ E[G] ∘ E[C] ∘ E[F])

score(c1, c2, _, c3, c4; w) > score(c1, c2, _, c3, c4; w') + 1
    (w: observed word, w': random word)

Page 17:

Collobert and Weston model

score(D, A, _, C, F; G) = g(xU)·v
x = (E[D] ∘ E[A] ∘ E[G] ∘ E[C] ∘ E[F])

score(c1, c2, _, c3, c4; w) > score(c1, c2, _, c3, c4; w') + 1
    (w: observed word, w': random word)

L(w, c, w') = max(0, 1 - (score(c_{1:k}; w) - score(c_{1:k}; w')))
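A minimal sketch of this ranking loss under assumed shapes; E, U and v stand in for the embedding matrix and scoring parameters on the slide, and all sizes here are illustrative:

import numpy as np

d_emb, d_hid = 50, 100
E = np.random.randn(1000, d_emb) * 0.01           # word embeddings (toy vocab of 1000)
U = np.random.randn(5 * d_emb, d_hid) * 0.01      # score(...; w) = g(xU)·v, with g = tanh here
v = np.random.randn(d_hid) * 0.01

def score(c1, c2, w, c3, c4):
    # x = E[c1] ∘ E[c2] ∘ E[w] ∘ E[c3] ∘ E[c4]  (w fills the middle slot)
    x = np.concatenate([E[c1], E[c2], E[w], E[c3], E[c4]])
    return np.tanh(x @ U) @ v

def hinge_loss(context, w, w_neg):
    c1, c2, c3, c4 = context
    # L = max(0, 1 - (score(context; w) - score(context; w')))
    return max(0.0, 1.0 - (score(c1, c2, w, c3, c4) - score(c1, c2, w_neg, c3, c4)))

print(hinge_loss((3, 14, 15, 92), 7, 123))        # arbitrary toy word ids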

Page 18:

Collobert and Weston model

score(D, A, _, C, F; G) = g(xU)·v
x = (E[D] ∘ E[A] ∘ E[G] ∘ E[C] ∘ E[F])

score(c1, c2, _, c3, c4; w) > score(c1, c2, _, c3, c4; w') + 1
    (w: observed word, w': random word)

L(w, c, w') = max(0, 1 - (score(c_{1:k}; w) - score(c_{1:k}; w')))

How do you train this? (discuss)

Page 19:

from Collobert and Weston to Word2Vec

• Back to probabilities
• Simplify the scoring function

Page 20:

Word2Vec

• Back to probabilities

L(w, c, w') = max(0, 1 - (score(c_{1:k}; w) - score(c_{1:k}; w')))

P(D = 1 | w, c) = σ(s(w, c)) = 1 / (1 + e^{-s(w,c)})

Page 21:

Word2Vec

• Back to probabilities

P(D = 1 | w, c) = σ(s(w, c)) = 1 / (1 + e^{-s(w,c)})

L(Θ; D, D̄) = Σ_{(w,c) ∈ D} log P(D = 1 | w, c) + Σ_{(w',c) ∈ D̄} log P(D = 0 | w', c)

Page 22:

Word2Vec

• Back to probabilities

P(D = 1 | w, c) = σ(s(w, c)) = 1 / (1 + e^{-s(w,c)})

L(Θ; D, D̄) = Σ_{(w,c) ∈ D} log P(D = 1 | w, c) + Σ_{(w',c) ∈ D̄} log P(D = 0 | w', c)
              (good word+context pairs)            (bad word+context pairs)

Page 23:

Word2Vec

• Back to probabilities

P(D = 1 | w, c) = σ(s(w, c)) = 1 / (1 + e^{-s(w,c)})

L(Θ; D, D̄) = Σ_{(w,c) ∈ D} log P(D = 1 | w, c) + Σ_{(w',c) ∈ D̄} log P(D = 0 | w', c)

P(D = 0 | w, c) = 1 - P(D = 1 | w, c)
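A minimal numpy sketch of the per-pair contribution to this objective; w_vec, c_vec and the sampled negative word vectors are assumed inputs, and s(w, c) is taken to be a dot product (as in the CBOW and skip-gram variants that follow):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_log_likelihood(w_vec, c_vec, neg_w_vecs):
    # log P(D=1|w,c) for the observed pair ...
    good = np.log(sigmoid(w_vec @ c_vec))
    # ... plus log P(D=0|w',c) = log(1 - sigmoid(s(w',c))) for each sampled negative word
    bad = sum(np.log(1.0 - sigmoid(wn @ c_vec)) for wn in neg_w_vecs)
    return good + bad     # the corpus objective sums this over D and D̄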

Page 24:

Word2Vec

• Simplify the score function:

CBOW    Skip-grams

Page 25:

Word2Vec

• Simplify the score function:

score(w; c_1, ..., c_k) = (Σ_{i=1}^{k} E[c_i]) · E'[w]

CBOW
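A minimal numpy sketch of this score; E (context embeddings) and E_prime (word embeddings, the slide's E') are assumed toy arrays:

import numpy as np

def cbow_score(w, context_ids, E, E_prime):
    # score(w; c_1..c_k) = (sum_i E[c_i]) · E'[w]
    c = E[context_ids].sum(axis=0)     # sum of the context embedding vectors
    return c @ E_prime[w]

E = np.random.randn(1000, 50)          # toy tables: vocabulary of 1000, dimension 50
E_prime = np.random.randn(1000, 50)
s = cbow_score(7, [3, 14, 15, 92], E, E_prime)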

10.4. WORD EMBEDDING ALGORITHMS

Word2Vec replaces the margin-based ranking objective with a probabilistic one. Consider a set D of correct word-context pairs, and a set D̄ of incorrect word-context pairs. The goal of the algorithm is to estimate the probability P(D = 1 | w, c) that the word-context pair came from the correct set D. This should be high (1) for pairs from D and low (0) for pairs from D̄. The probability constraint dictates that P(D = 1 | w, c) = 1 - P(D = 0 | w, c). The probability function is modeled as a sigmoid over the score s(w, c):

P(D = 1 | w, c) = 1 / (1 + e^{-s(w,c)})     (10.6)

The corpus-wide objective of the algorithm is to maximize the log-likelihood of the data D ∪ D̄:

L(Θ; D, D̄) = Σ_{(w,c) ∈ D} log P(D = 1 | w, c) + Σ_{(w,c) ∈ D̄} log P(D = 0 | w, c)     (10.7)

The positive examples D are generated from a corpus. The negative examples D̄ can be generated in many ways. In Word2Vec, they are generated by the following process: for each good pair (w, c) ∈ D, sample k words w_{1:k} and add each of (w_i, c) as a negative example to D̄. This results in the negative-samples data D̄ being k times larger than D. The number of negative samples k is a parameter of the algorithm.

The negative words w can be sampled according to their corpus-based frequency #(w) / Σ_{w'} #(w'), or, as done in the Word2Vec implementation, according to a smoothed version in which the counts are raised to the power of 3/4 before normalizing: #(w)^{0.75} / Σ_{w'} #(w')^{0.75}. This second version gives more relative weight to less frequent words, and results in better word similarities in practice.

CBOW   Other than changing the objective from margin-based to a probabilistic one, Word2Vec also considerably simplifies the definition of the word-context scoring function s(w, c). For a multi-word context c_{1:k}, the CBOW variant of Word2Vec defines the context vector c to be a sum of the embedding vectors of the context components: c = Σ_{i=1}^{k} c_i. It then defines the score to be simply s(w, c) = w · c, resulting in:

P(D = 1 | w, c_{1:k}) = 1 / (1 + e^{-(w·c_1 + w·c_2 + ... + w·c_k)})

The CBOW variant loses the order information between the context's elements. In return, it allows the use of variable-length contexts. However, note that for contexts of bounded length, CBOW can still retain the order information by including the relative position as part of the context element itself, i.e. by assigning a different embedding vector to context elements in different relative positions.

Skip-Gram   The skip-gram variant of Word2Vec scoring decouples the dependence between the context elements even further. For a k-element context c_{1:k}, the skip-gram
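A small sketch of the smoothed negative-sampling distribution described above; the counts dictionary is a toy stand-in for real corpus counts:

import numpy as np

counts = {"the": 50000, "dog": 120, "barked": 15, "dachshund": 3}   # toy corpus counts
words = list(counts)
p = np.array([counts[w] for w in words], dtype=float) ** 0.75
p /= p.sum()                                   # P(w) proportional to #(w)^0.75

def sample_negative_words(k):
    # k negative words for one (w, c) pair, as in the Word2Vec process above
    return list(np.random.choice(words, size=k, p=p))

print(sample_negative_words(5))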

Page 26:

Word2Vec

• Simplify the score function:

10. PRE-TRAINED WORD REPRESENTATIONS

variant assumes that the elements c_i in the context are independent from each other, essentially treating them as k different contexts. I.e., a word-context pair (w, c_{1:k}) will be represented in D as k different contexts: (w, c_1), ..., (w, c_k). The scoring function s(w, c) is defined as in the CBOW version, but now each context is a single embedding vector:

P(D = 1 | w, c_i) = 1 / (1 + e^{-w·c_i})

P(D = 1 | w, c_{1:k}) = Π_{i=1}^{k} P(D = 1 | w, c_i) = Π_{i=1}^{k} 1 / (1 + e^{-w·c_i})

log P(D = 1 | w, c_{1:k}) = Σ_{i=1}^{k} log [1 / (1 + e^{-w·c_i})]     (10.8)

While introducing strong independence assumptions between the elements of the context, the skip-gram variant is very effective in practice, and very commonly used.

10.4.3 CONNECTING THE WORLDS

Both the distributional "count-based" methods and the distributed "neural" ones are based on the distributional hypothesis, attempting to capture the similarity between words based on the similarity between the contexts in which they occur. In fact, Levy and Goldberg [2014] show that the ties between the two worlds are deeper than they appear at first sight.

The training of Word2Vec models results in two embedding matrices, E^W ∈ R^{|V_W| × d_emb} and E^C ∈ R^{|V_C| × d_emb}, representing the words and the contexts respectively. The context embeddings are discarded after training, and the word embeddings are kept. However, imagine keeping the context embedding matrix E^C and consider the product E^W · (E^C)^T = M' ∈ R^{|V_W| × |V_C|}. Viewed this way, Word2Vec is factorizing an implicit word-context matrix M'. What are the elements of matrix M'? An entry M'[w,c] corresponds to the dot product of the word and context embedding vectors w · c. Levy and Goldberg show that for the combination of skip-gram contexts and the negative-sampling objective with k negative samples, the global objective is optimized by setting w · c = M'[w,c] = PMI(w, c) - log k. That is, Word2Vec is implicitly factorizing a matrix which is closely related to the well-known word-context PMI matrix! Remarkably, it does so without ever explicitly constructing the matrix M'.[7]

[7] If the optimal assignment were satisfiable, the skip-grams with negative-sampling (SGNS) solution would be the same as the SVD over the word-context matrix solution. Of course, the low dimensionality d_emb of w and c may make it impossible to satisfy w · c = PMI(w, c) - log k for all w and c pairs, and the optimization procedure will attempt to find the best achievable solution, while paying a price for each deviation from the optimal assignment. This is where the SGNS and the SVD objectives differ: SVD puts a quadratic penalty on each deviation, while SGNS uses a more complex penalty term.

Skip-grams:
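A minimal sketch of the skip-gram factorization in equation (10.8), with the same assumed E / E_prime toy tables as in the CBOW sketch above:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skipgram_log_prob(w, context_ids, E, E_prime):
    # log P(D=1 | w, c_1..c_k) = sum_i log sigmoid(w · c_i)
    w_vec = E_prime[w]
    return sum(np.log(sigmoid(w_vec @ E[c])) for c in context_ids)

E = np.random.randn(1000, 50)
E_prime = np.random.randn(1000, 50)
print(skipgram_log_prob(7, [3, 14, 15, 92], E, E_prime))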

Page 27:

Word2Vec

• Simplify the score function:

Skip-grams:

L(Θ; D, D̄) = Σ_{(w,c) ∈ D} log P(D = 1 | w, c) + Σ_{(w',c) ∈ D̄} log P(D = 0 | w', c)

Page 28:

Word2Vec

• How to train a word2vec model? (discuss)

Page 29:

"FastText"

Page 30:

"FastText"Represent a word by the sum of its char ngrams.

Page 31:

"FastText"

vwhere =<latexit sha1_base64="cLdkzOvJFiGW+v/P2UaWlDlN1Sg=">AAAB8nicbVBNS8NAEJ3Ur1q/qh69LBbBU0lE0ItQ9OKxgv2ANpTNdtIu3WzC7qZSQn+GFw+KePXXePPfuG1z0NYHA4/3ZpiZFySCa+O6305hbX1jc6u4XdrZ3ds/KB8eNXWcKoYNFotYtQOqUXCJDcONwHaikEaBwFYwupv5rTEqzWP5aCYJ+hEdSB5yRo2VOuNe9jREhVNy0ytX3Ko7B1klXk4qkKPeK391+zFLI5SGCap1x3MT42dUGc4ETkvdVGNC2YgOsGOppBFqP5ufPCVnVumTMFa2pCFz9fdERiOtJ1FgOyNqhnrZm4n/eZ3UhNd+xmWSGpRssShMBTExmf1P+lwhM2JiCWWK21sJG1JFmbEplWwI3vLLq6R5UfXcqvdwWand5nEU4QRO4Rw8uIIa3EMdGsAghmd4hTfHOC/Ou/OxaC04+cwx/IHz+QMyLZEt</latexit><latexit sha1_base64="cLdkzOvJFiGW+v/P2UaWlDlN1Sg=">AAAB8nicbVBNS8NAEJ3Ur1q/qh69LBbBU0lE0ItQ9OKxgv2ANpTNdtIu3WzC7qZSQn+GFw+KePXXePPfuG1z0NYHA4/3ZpiZFySCa+O6305hbX1jc6u4XdrZ3ds/KB8eNXWcKoYNFotYtQOqUXCJDcONwHaikEaBwFYwupv5rTEqzWP5aCYJ+hEdSB5yRo2VOuNe9jREhVNy0ytX3Ko7B1klXk4qkKPeK391+zFLI5SGCap1x3MT42dUGc4ETkvdVGNC2YgOsGOppBFqP5ufPCVnVumTMFa2pCFz9fdERiOtJ1FgOyNqhnrZm4n/eZ3UhNd+xmWSGpRssShMBTExmf1P+lwhM2JiCWWK21sJG1JFmbEplWwI3vLLq6R5UfXcqvdwWand5nEU4QRO4Rw8uIIa3EMdGsAghmd4hTfHOC/Ou/OxaC04+cwx/IHz+QMyLZEt</latexit><latexit sha1_base64="cLdkzOvJFiGW+v/P2UaWlDlN1Sg=">AAAB8nicbVBNS8NAEJ3Ur1q/qh69LBbBU0lE0ItQ9OKxgv2ANpTNdtIu3WzC7qZSQn+GFw+KePXXePPfuG1z0NYHA4/3ZpiZFySCa+O6305hbX1jc6u4XdrZ3ds/KB8eNXWcKoYNFotYtQOqUXCJDcONwHaikEaBwFYwupv5rTEqzWP5aCYJ+hEdSB5yRo2VOuNe9jREhVNy0ytX3Ko7B1klXk4qkKPeK391+zFLI5SGCap1x3MT42dUGc4ETkvdVGNC2YgOsGOppBFqP5ufPCVnVumTMFa2pCFz9fdERiOtJ1FgOyNqhnrZm4n/eZ3UhNd+xmWSGpRssShMBTExmf1P+lwhM2JiCWWK21sJG1JFmbEplWwI3vLLq6R5UfXcqvdwWand5nEU4QRO4Rw8uIIa3EMdGsAghmd4hTfHOC/Ou/OxaC04+cwx/IHz+QMyLZEt</latexit><latexit sha1_base64="cLdkzOvJFiGW+v/P2UaWlDlN1Sg=">AAAB8nicbVBNS8NAEJ3Ur1q/qh69LBbBU0lE0ItQ9OKxgv2ANpTNdtIu3WzC7qZSQn+GFw+KePXXePPfuG1z0NYHA4/3ZpiZFySCa+O6305hbX1jc6u4XdrZ3ds/KB8eNXWcKoYNFotYtQOqUXCJDcONwHaikEaBwFYwupv5rTEqzWP5aCYJ+hEdSB5yRo2VOuNe9jREhVNy0ytX3Ko7B1klXk4qkKPeK391+zFLI5SGCap1x3MT42dUGc4ETkvdVGNC2YgOsGOppBFqP5ufPCVnVumTMFa2pCFz9fdERiOtJ1FgOyNqhnrZm4n/eZ3UhNd+xmWSGpRssShMBTExmf1P+lwhM2JiCWWK21sJG1JFmbEplWwI3vLLq6R5UfXcqvdwWand5nEU4QRO4Rw8uIIa3EMdGsAghmd4hTfHOC/Ou/OxaC04+cwx/IHz+QMyLZEt</latexit> E["<where>"] + E["<where"] + .....

"<where>""<where","where>","<wher","where", "here>","<whe","wher","here","ere>","<wh",....

where =

Represent a word by the sum of its char ngrams.
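A small sketch of the character n-gram decomposition shown above (n-gram lengths 3-6, matching the example; the hashing/bucketing used by the real FastText implementation is omitted):

def char_ngrams(word, n_min=3, n_max=6):
    w = "<" + word + ">"                      # add boundary markers
    grams = {w}                               # the full token "<where>" is kept too
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.add(w[i:i + n])
    return grams

# the word vector is then the sum of the vectors of these n-grams:
# v_where = sum(E[g] for g in char_ngrams("where"))
print(sorted(char_ngrams("where")))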

Page 32:

Other word vectors

• Other contexts are also possible.

• Other algorithms are also possible.

• GloVe

• NCE

• SVD

• Implementations:

• word2vec, word2vecf, GloVe, gensim, your own (in HW)

Page 33:

Using Word Vectors

• Use for initializing the word embeddings in other models. ("pre-training". more soon.)

• Use for finding similar words. (how?)

• Find a word similar to a group of words. (how?)

• Find the word that does not belong to a group. (how?)

Page 34:

Working with Dense Vectors

Word Similarity

• Similarity is calculated using cosine similarity:

  sim(dog, cat) = (dog · cat) / (||dog|| ||cat||)

• For normalized vectors (||x|| = 1), this is equivalent to a dot product:

  sim(dog, cat) = dog · cat

• Normalize the vectors when loading them.
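A short numpy sketch of the normalize-then-dot trick (toy matrix; load_and_normalize_vectors on a later slide is presumably doing the same row normalization when loading):

import numpy as np

def normalize_rows(W):
    # after this, W[i] @ W[j] is the cosine similarity of words i and j
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / norms

W = normalize_rows(np.random.randn(5, 50))     # toy 5-word embedding matrix
sim = W[0] @ W[1]                              # cosine similarity of words 0 and 1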

Page 35:

Working with Dense Vectors

Finding the most similar words to dog

• Compute the similarity from word vector v to all other words.
• This is a single matrix-vector product: W · v⊤
• Result is a |V|-sized vector of similarities.
• Take the indices of the k highest values.
• FAST! For 180k words, d=300: ~30ms

Page 36:

Working with Dense Vectors

Most Similar Words, in python+numpy code:

W, words = load_and_normalize_vectors("vecs.txt")
# W and words are numpy arrays.
w2i = {w: i for i, w in enumerate(words)}

dog = W[w2i['dog']]   # get the dog vector

sims = W.dot(dog)     # compute similarities

most_similar_ids = sims.argsort()[-1:-10:-1]
sim_words = words[most_similar_ids]

Page 37:

Working with Dense Vectors

Similarity to a group of words

• "Find me words most similar to cat, dog and cow."
• Calculate the pairwise similarities and sum them:

  W · cat + W · dog + W · cow

• Now find the indices of the highest values as before.
• Matrix-vector products are wasteful. Better option:

  W · (cat + dog + cow)
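The same idea as a tiny self-contained numpy sketch; W and w2i follow the conventions of the snippet on the previous slide, with toy stand-ins here:

import numpy as np

W = np.random.randn(4, 50)
W /= np.linalg.norm(W, axis=1, keepdims=True)        # row-normalized embedding matrix
w2i = {"cat": 0, "dog": 1, "cow": 2, "pig": 3}       # toy word-to-row map

sims = W @ (W[w2i["cat"]] + W[w2i["dog"]] + W[w2i["cow"]])   # one matrix-vector product
top = sims.argsort()[::-1][:3]                               # indices of the most similar words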

Page 38:

tagging + pre-training

The brown fox jumped over → NOUN
brown fox jumped over the → VERB
fox jumped over the lazy → PREP

we can use the E we got from LM training to initialize E for the POS tagging task.

(why is that helpful?)

Page 39:

Pre-training

• A large part of the success of feed-forward networks in NLP comes from the use of pre-trained word embeddings.

• Pre-trained embeddings are an easy way to perform semi-supervised learning (or transfer learning).

• But notice: fine-tuning the pre-trained embeddings means that some features change, while most stay the same...

Page 40:

Pre-training

• Define an auxiliary task that you suspect is correlated with your prediction problem.

• Train a model to perform this task.

• Take feature representations from this model as inputs to another model.

Page 41:

Pre-training

• In word2vec, auxiliary tasks are "predict word based on a window of size k around it" (CBOW) or "predict neighboring words in a window of k around a focus word" (skipgram).

• More generally, "predict word based on some context of the word".

• This is useful, as we can get tons of training data for free.

• The choice of contexts determines the resulting word representations.

Page 42:

Some possible contexts

• Window of k around a word.
  (smaller k: more syntactic. Larger k: more semantic)

• Positional window around a word. (more syntactic)

• Aligned words in a different language. Requires parallel corpus

(synonyms, paraphrases)

• Neighbors in a dependency tree. Requires parser. (functional similarity)

Page 43:

Effect of Context

Embedding Similarity with Different Contexts

Target word: Hogwarts (Harry Potter's school)
  Bag of Words (k=5): Dumbledore, hallows, half-blood, Malfoy, Snape  (related to Harry Potter)
  Dependencies: Sunnydale, Collinwood, Calarts, Greendale, Millfield  (schools)

"Dependency-Based Word Embeddings", Levy & Goldberg, ACL 2014

Page 44:

Effect of Context

Embedding Similarity with Different Contexts

Target word: Turing (computer scientist)
  Bag of Words (k=5): nondeterministic, non-deterministic, computability, deterministic, finite-state  (related to computability)
  Dependencies: Pauling, Hotelling, Heting, Lessing, Hamming  (scientists)

"Dependency-Based Word Embeddings", Levy & Goldberg, ACL 2014

Page 45:

Effect of Context

Embedding Similarity with Different Contexts

Target word: dancing (dance gerund)
  Bag of Words (k=5): singing, dance, dances, dancers, tap-dancing  (related to dance)
  Dependencies: singing, rapping, breakdancing, miming, busking  (gerunds)

"Dependency-Based Word Embeddings", Levy & Goldberg, ACL 2014

Page 46:

• no layers.

• no softmax. (so negative sampling.)

• cbow or skipgrams objectives.

• many implementations available.

• more in the NLP course.

Word2Vec

Page 47:

Back to some technical stuff

• The computation graph.

• Backprop Algorithm.

Page 48:

Gradient based training

• Computing the gradients:
  • The network (and loss calculation) is a mathematical function.
  • Calculus rules apply.
  • (a bit hairy, but carefully follow the chain rule and you'll get there)

ℓ(x, k) = -log(softmax(W3 g2(W2 g1(W1 x + b1) + b2) + b3)[k])

(chain rule - on whiteboard)
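One concrete piece of that chain rule, as a numpy sketch: the gradient of -log(softmax(z)[k]) with respect to the pre-softmax scores z is softmax(z) minus a one-hot vector at index k (a standard identity; the remaining layers follow the same chain-rule pattern, which is done on the whiteboard):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_and_grad_wrt_scores(z, k):
    p = softmax(z)
    loss = -np.log(p[k])
    grad = p.copy()
    grad[k] -= 1.0          # d loss / d z = softmax(z) - onehot(k)
    return loss, grad

print(loss_and_grad_wrt_scores(np.array([1.0, 2.0, 0.5]), 1))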

Page 49:

The Computation Graph (CG)

• a DAG.

• Leaves are inputs (or parameters).

• Nodes are operators (functions).

• Edges are results (values).

• Can be built for any function.

RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2014) are designed to select the learning rate for each minibatch, sometimes on a per-coordinate basis, potentially alleviating the need of fiddling with learning rate scheduling. For details of these algorithms, see the original papers or (Bengio et al., 2015, Sections 8.3, 8.4). As many neural-network software frameworks provide implementations of these algorithms, it is easy and sometimes worthwhile to try out different variants.

6.2 The Computation Graph Abstraction

While one can compute the gradients of the various parameters of a network by hand and implement them in code, this procedure is cumbersome and error prone. For most purposes, it is preferable to use automatic tools for gradient computation (Bengio, 2012). The computation-graph abstraction allows us to easily construct arbitrary networks, evaluate their predictions for given inputs (forward pass), and compute gradients for their parameters with respect to arbitrary scalar losses (backward pass).

A computation graph is a representation of an arbitrary mathematical computation as a graph. It is a directed acyclic graph (DAG) in which nodes correspond to mathematical operations or (bound) variables and edges correspond to the flow of intermediary values between the nodes. The graph structure defines the order of the computation in terms of the dependencies between the different components. The graph is a DAG and not a tree, as the result of one operation can be the input of several continuations. Consider for example a graph for the computation of (a*b + 1) * (a*b + 2):

[small graph: inputs a and b feed a "*" node; its output feeds a "+1" node and a "+2" node, whose outputs feed a final "*" node]

The computation of a*b is shared. We restrict ourselves to the case where the computation graph is connected.

Since a neural network is essentially a mathematical expression, it can be represented as a computation graph.

For example, Figure 3a presents the computation graph for a 1-layer MLP with a softmax output transformation. In our notation, oval nodes represent mathematical operations or functions, and shaded rectangle nodes represent parameters (bound variables). Network inputs are treated as constants, and drawn without a surrounding node. Input and parameter nodes have no incoming arcs, and output nodes have no outgoing arcs. The output of each node is a matrix, the dimensionality of which is indicated above the node.

This graph is incomplete: without specifying the inputs, we cannot compute an output. Figure 3b shows a complete graph for an MLP that takes three words as inputs, and predicts the distribution over part-of-speech tags for the third word. This graph can be used for prediction, but not for training, as the output is a vector (not a scalar) and the graph does not take into account the correct answer or the loss term. Finally, the graph in 3c shows the computation graph for a specific training example, in which the inputs are the (embeddings of) the words "the", "black", "dog", and the expected output is "NOUN" (whose index is 5).


Page 50:

[Figure 3: computation graph for MLP1, shown three times: (a) with unbound input x (1×150); (b) with concrete input: "the", "black", "dog" looked up in E (|V|×50) and concatenated to 1×150; (c) with concrete input, expected output (index 5), and loss node (pick → log → neg). Parameter nodes: W1 (150×20), b1 (1×20), W2 (20×17), b2 (1×17); operation nodes: MUL, ADD, tanh, softmax.]

Figure 3: Computation Graph for MLP1. (a) Graph with unbound input. (b) Graph with concrete input. (c) Graph with concrete input, expected output, and loss node.

Once the graph is built, it is straightforward to run either a forward computation (compute the result of the computation) or a backward computation (computing the gradients), as we show below. Constructing the graphs may look daunting, but is actually very easy using dedicated software libraries and APIs.

Forward Computation   The forward pass computes the outputs of the nodes in the graph. Since each node's output depends only on itself and on its incoming edges, it is trivial to compute the outputs of all nodes by traversing the nodes in a topological order and computing the output of each node given the already computed outputs of its predecessors.

More formally, in a graph of N nodes, we associate each node with an index i according to their topological ordering. Let f_i be the function computed by node i (e.g. multiplication, addition, ...). Let π(i) be the parent nodes of node i, and π⁻¹(i) = {j | i ∈ π(j)} the children nodes of node i (these are the arguments of f_i). Denote by v(i) the output of node i.

MLP1

input

parameters

hidden layer

output layer
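A tiny sketch of the forward pass just described: nodes are visited in topological order, and each node's output is computed from the already-computed outputs of its arguments. The graph encoded below is the (a*b+1)*(a*b+2) example from the excerpt on the previous page; the node encoding is illustrative, not any package's API.

import operator

# each node: (function or "input", list of argument node indices); indices are topologically ordered
nodes = [
    ("input", []),                # 0: a
    ("input", []),                # 1: b
    (operator.mul, [0, 1]),       # 2: a*b
    (lambda x: x + 1, [2]),       # 3: a*b + 1
    (lambda x: x + 2, [2]),       # 4: a*b + 2
    (operator.mul, [3, 4]),       # 5: (a*b+1)*(a*b+2)
]

def forward(inputs):                       # inputs: values for the input nodes
    v = {}
    for i, (f, args) in enumerate(nodes):
        v[i] = inputs[i] if f == "input" else f(*(v[j] for j in args))
    return v

print(forward({0: 2.0, 1: 3.0})[5])        # (2*3+1)*(2*3+2) = 56.0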

Page 51:

with concrete input

[Figure 3 repeated: MLP1 computation graph, here with concrete input.]

MLP1

input

parameters

hidden layer

output layer

Embedding matrix

Page 52:

[Figure 3 repeated: MLP1 computation graph.]

MLP1

input

parameters

hidden layer

output layer

Embedding matrix

with concrete input and loss

loss

expected output

Page 53:

• Create a graph for each training example.

• Once graph is built, we have two essential algorithms:

• Forward: compute all values.

• Backward (backprop): compute all gradients.

[Figure 3 repeated: MLP1 computation graph with concrete input, expected output, and loss node.]

Page 54:

[Figure 3 repeated: MLP1 computation graph.]

Computing the Gradients (backprop)

• Consider the chain-rule (example on blackboard)

• Each node needs to know how to:

• Compute forward.

• Compute its local gradient.
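A minimal sketch of "each node knows its forward and its local gradient", for a multiplication node; the class and method names are illustrative, not a real framework's API:

class MulNode:
    def forward(self, a, b):
        # remember the inputs; they are needed for the local gradient
        self.a, self.b = a, b
        return a * b

    def backward(self, grad_out):
        # local gradients d(a*b)/da = b and d(a*b)/db = a, chained with the incoming gradient
        return grad_out * self.b, grad_out * self.a

node = MulNode()
y = node.forward(2.0, 3.0)          # 6.0
da, db = node.backward(1.0)         # (3.0, 2.0)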

Page 55:

CG Software Packages• Theano (Bengio's lab, python, low level, grandfather of CG, retired).

• Torch (Lua, wide support, Facebook backed, very fast on GPU, almost retired)

• Tensor Flow (Google, python / C++ hybrid)

• Chainer (python)

• PyTorch (python, dynamic, by Facebook)

• DyNet (C++/Python, by Chris Dyer, Graham Neubig, and Yoav Goldberg)

• Keras (python, high level, theano/TF backends) best bet for out-of-the-box models

shines for dynamic graphs, recursive nets

Page 56:

The Python Neural Networks Toolkits Landscape (partial)

Page 57:

The Python Neural Networks Toolkits Landscape (partial)

high-level

low-level

Page 58:

The Python Neural Networks Toolkits Landscape (partial)

high-level

static graphs
dynamic graphs

Page 59:

The Python Neural Networks Toolkits Landscape (partial)

high-level

static graphs
dynamic graphs

- fast also on CPU
- automatic batching

Page 60:

Network Training algorithm:

• For each training example(or mini-batch):

• Create graph for computing loss.

• Compute loss (forward).

• Compute gradients (backwards).

• Update model parameters.

[Figure 3 repeated: MLP1 computation graph.]
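The recipe above as a schematic Python loop. build_computation_graph and update_parameters are placeholders supplied by the caller (the same names used in the book excerpt on the next slide), so this is a sketch of the control flow, not a concrete library API:

def train(training_set, parameters, build_computation_graph, update_parameters, epochs=1):
    # one computation graph per training example (or mini-batch)
    for _ in range(epochs):
        for x, y in training_set:
            loss = build_computation_graph(x, y, parameters)   # returns a single loss node
            loss.forward()                  # compute the loss
            loss.backward()                 # compute gradients of the loss w.r.t. the parameters
            update_parameters(parameters)   # optimizer-specific update rule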

Page 61:

DyNet Example

[Figure 3 repeated: MLP1 computation graph.]

50 5. NEURAL NETWORKS TRAINING

the computation. Thus, the gradient of pick(x, 5) is a vector g with the dimensionalityof x where g[5] = 1 and g[i 6=5] = 0. Similarly, for the function max(0, x) the value ofthe gradient is 1 for x > 0 and 0 otherwise.

For further information on automatic di↵erentiation see [Neidinger, 2010, Section7], [Baydin et al., 2015]. For more in depth discussion of the backpropagation algorithmand computation graphs (also called flow graphs) see [Bengio et al., 2016, Section 6.4],[Bengio, 2012, LeCun et al., 1998b]. For a popular yet technical presentation, see ChrisOlah’s description at http://colah.github.io/posts/2015-08-Backprop/.

5.1.3 SOFTWARE

Several software packages implement the computation-graph model, including Theano1,TensorFlow2, Chainer3, and CNN/pyCNN4. All these packages support all the essentialcomponents (node types) for defining a wide range of neural network architectures, coveringthe structures described in this book and more. Graph creation is made almost transparentby use of operator overloading. The framework defines a type for representing graph nodes(commonly called expressions), methods for constructing nodes for inputs and parameters,and a set of functions and mathematical operations that take expressions as input and resultin more complex expressions. For example, the python code for creating the computationgraph from Figure (5.1c) using the pyCNN framework is:

from pycnn import *

# model initialization.model = Model()

mW1 = model.add_parameters((20,150))

mb1 = model.add_parameters(20)

mW2 = model.add_parameters((17,20))

mb2 = model.add_parameters(17)

lookup = model.add_lookup_parameters((100, 50))

# Building the computation graph:renew_cg() # create a new graph.# Wrap the model parameters as graph-nodes.W1 = parameter(mW1)

b1 = parameter(mb1)

W2 = parameter(mW2)

b2 = parameter(mb2)

def get_index(x): return 1

# Generate the embeddings layer.vthe = lookup[get_index("the")]

vblack = lookup[get_index("black")]

vdog = lookup[get_index("dog")]

[1] http://deeplearning.net/software/theano/
[2] https://www.tensorflow.org/
[3] http://chainer.org
[4] https://github.com/clab/cnn


# Connect the leaf nodes into a complete graph.
x = concatenate([vthe, vblack, vdog])
output = softmax(W2*(tanh(W1*x)+b1)+b2)
loss = -log(pick(output, 5))

loss_value = loss.forward()
loss.backward() # the gradient is computed and stored
                # in the corresponding parameters.

Most of the code involves various initializations: the first block defines model parameters that are shared between different computation graphs (recall that each graph corresponds to a specific training example). The second block turns the model parameters into graph-node (Expression) types. The third block retrieves the Expressions for the embeddings of the input words. Finally, the fourth block is where the graph is created. Note how transparent the graph creation is – there is an almost one-to-one correspondence between creating the graph and describing it mathematically. The last block shows a forward and backward pass. The other software frameworks follow similar patterns.

Theano and TensorFlow involve an optimizing compiler for computation graphs, which is both a blessing and a curse. On the one hand, once compiled, large graphs can be run efficiently on either the CPU or a GPU, making these frameworks ideal for large graphs with a fixed structure, where only the inputs change between instances. However, the compilation step itself can be costly, and it makes the interface a bit cumbersome to work with. In contrast, the other packages focus on building large and dynamic computation graphs and executing them “on the fly” without a compilation step. While the execution speed may suffer with respect to Theano and TensorFlow’s optimized versions, these packages are especially convenient when working with the recurrent and recursive networks described in chapters 14 and 18, as well as in the structured prediction settings described in chapter 19. Finally, packages such as Keras [5] provide a higher-level interface on top of packages such as Theano and TensorFlow, allowing the definition and training of complex neural networks with even fewer lines of code.
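To make the contrast concrete, the following is a minimal define-then-run sketch of the MLP from Figure 3a, written against the TensorFlow 1.x placeholder/Session API; it is an illustrative assumption, not code from the book. The graph is defined once, and only the value fed for the input placeholder changes between runs.

import numpy as np
import tensorflow as tf

x  = tf.placeholder(tf.float32, shape=[1, 150])     # unbound input, bound only at run time
W1 = tf.Variable(tf.random_normal([150, 20]))
b1 = tf.Variable(tf.zeros([20]))
W2 = tf.Variable(tf.random_normal([20, 17]))
b2 = tf.Variable(tf.zeros([17]))
h      = tf.tanh(tf.matmul(x, W1) + b1)             # 1x20 hidden layer
output = tf.nn.softmax(tf.matmul(h, W2) + b2)       # 1x17 output distribution

with tf.Session() as sess:                          # the compiled graph is reused across runs
    sess.run(tf.global_variables_initializer())
    probs = sess.run(output, feed_dict={x: np.random.randn(1, 150).astype("float32")})
    print(probs.shape)                              # (1, 17)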

5.1.4 IMPLEMENTATION RECIPE

Using the computation graph abstraction, the pseudo-code for a network training algorithm is given in Algorithm 5.

Here, build_computation_graph is a user-defined function that builds the computation graph for the given input, output and network structure, returning a single loss node. update_parameters is an optimizer-specific update rule. The recipe specifies that a new graph is created for each training example. This accommodates cases in which the network structure varies between training examples, such as recurrent and recursive neural networks,

[5] https://keras.io
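Algorithm 5 itself is not shown above, but the recipe it describes can be sketched with the same pyCNN operations as the listing earlier in this section. In the sketch below, SimpleSGDTrainer is assumed to be the pyCNN optimizer class, and build_computation_graph, get_index and the toy training_data are illustrative stand-ins rather than code from the book.

from pycnn import *

# shared model parameters (same shapes as in the listing above).
model = Model()
mW1 = model.add_parameters((20, 150))
mb1 = model.add_parameters(20)
mW2 = model.add_parameters((17, 20))
mb2 = model.add_parameters(17)
lookup = model.add_lookup_parameters((100, 50))
trainer = SimpleSGDTrainer(model)          # assumed optimizer; holds the shared parameters

def get_index(w): return 1                 # dummy word-to-index mapping, as above

def build_computation_graph(words, gold_index):
    # build a fresh graph for one training example and return its loss node.
    renew_cg()
    W1, b1 = parameter(mW1), parameter(mb1)
    W2, b2 = parameter(mW2), parameter(mb2)
    x = concatenate([lookup[get_index(w)] for w in words])
    output = softmax(W2*(tanh(W1*x)+b1)+b2)
    return -log(pick(output, gold_index))

training_data = [(("the", "black", "dog"), 5)]   # toy corpus: (context words, gold index)
for words, gold in training_data:
    loss = build_computation_graph(words, gold)
    loss.forward()                         # forward pass: compute the loss value
    loss.backward()                        # backward pass: accumulate gradients
    trainer.update()                       # optimizer-specific parameter update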


Page 66: Word-embeddings and Word2Vec and The Computation Graphu.cs.biu.ac.il/~89-687/lec5.pdf · Word-embeddings and Word2Vec and The Computation Graph Yoav Goldberg

Questions?