Part-of-Speech Tagging
+
Neural Networks 3: Word Embeddings
CS 287
Review: Neural Networks
One-layer multi-layer perceptron architecture,
NN_MLP1(x) = g(xW^1 + b^1)W^2 + b^2

- xW + b; perceptron
- x is the dense representation in R^{1×d_in}
- W^1 ∈ R^{d_in×d_hid}, b^1 ∈ R^{1×d_hid}; first affine transformation
- W^2 ∈ R^{d_hid×d_out}, b^2 ∈ R^{1×d_out}; second affine transformation
- g : R^{d_hid} → R^{d_hid} is an activation non-linearity (often pointwise)
- g(xW^1 + b^1) is the hidden layer
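As a concrete illustration (not part of the original slides), here is a minimal numpy sketch of this one-layer MLP forward pass, using tanh as the pointwise non-linearity and toy dimensions:

```python
import numpy as np

def mlp1_forward(x, W1, b1, W2, b2):
    """NN_MLP1(x) = g(xW1 + b1)W2 + b2 with g = tanh applied pointwise."""
    hidden = np.tanh(x @ W1 + b1)   # 1 x d_hid hidden layer
    return hidden @ W2 + b2          # 1 x d_out output scores

# toy dimensions: d_in = 4, d_hid = 3, d_out = 2
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))
W1, b1 = rng.normal(size=(4, 3)), np.zeros((1, 3))
W2, b2 = rng.normal(size=(3, 2)), np.zeros((1, 2))
print(mlp1_forward(x, W1, b1, W2, b2).shape)  # (1, 2)
```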
Review: Non-Linearities Tanh
Hyperbolic Tangent:

tanh(t) = (exp(t) − exp(−t)) / (exp(t) + exp(−t))

- Intuition: Similar to the sigmoid, but its range is between -1 and 1.
Review: Backpropagation
Chain rule through the composition f_{i+1}(f_i(. . . f_1(x_0))), where layer i+1 computes f_{i+1}(·; θ_{i+1}):

- given ∂L/∂f_{i+1}(. . . f_1(x_0)) from above, compute ∂L/∂f_i(. . . f_1(x_0)) to pass down,
- and ∂L/∂θ_{i+1} to update the layer's parameters.
Quiz
One common class of operations in neural network models is known as pooling. Informally, a pooling layer consists of an aggregation unit, typically unparameterized, that reduces the input to a smaller size.

Consider three pooling functions of the form f : R^n → R,

1. f(x) = max_i x_i
2. f(x) = min_i x_i
3. f(x) = ∑_i x_i / n

What effect does each of these functions have? What are their gradients? How would you implement backpropagation for these units?
Quiz
- Max pooling: f(x) = max_i x_i
  - Keeps only the most activated input
  - Fprop is simple; however, must store the arg max ("switch")
  - Bprop gradient is zero except at the switch, which gets gradOutput
- Min pooling: f(x) = min_i x_i
  - Keeps only the least activated input
  - Fprop is simple; however, must store the arg min ("switch")
  - Bprop gradient is zero except at the switch, which gets gradOutput
- Avg pooling: f(x) = ∑_i x_i / n
  - Keeps the average of the input activations
  - Fprop is simply the mean.
  - gradOutput is averaged and passed to all inputs.
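A minimal numpy sketch of these pooling units and their backward passes (illustrative only, not the course's Torch code); grad_output is the scalar gradient arriving from the layer above, and min pooling is symmetric to max pooling with argmin:

```python
import numpy as np

def max_pool_forward(x):
    switch = int(np.argmax(x))            # store the "switch"
    return x[switch], switch

def max_pool_backward(x, switch, grad_output):
    grad = np.zeros_like(x)
    grad[switch] = grad_output            # zero everywhere except at the switch
    return grad

def avg_pool_forward(x):
    return x.mean()

def avg_pool_backward(x, grad_output):
    return np.full_like(x, grad_output / x.size)  # gradOutput spread evenly over inputs

x = np.array([0.1, 2.0, -1.0])
_, sw = max_pool_forward(x)
print(max_pool_backward(x, sw, 1.0))      # [0. 1. 0.]
print(avg_pool_backward(x, 1.0))          # [0.333... 0.333... 0.333...]
```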
Contents
Embedding Motivation
C&W Embeddings
word2vec
Evaluating Embeddings
1. Use dense representations instead of sparse
2. Use windowed area instead of sequence models
3. Use neural networks to model windowed interactions
What about rare words?
Word Embeddings
Embedding layer,
x^0 W^0

- x^0 ∈ R^{1×d_0}: a one-hot word.
- W^0 ∈ R^{d_0×d_in}, d_0 = |V|

Notes:

- d_0 >> d_in, e.g. d_0 = 10,000, d_in = 50
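Since x^0 is one-hot, the product x^0 W^0 is just a row lookup into the embedding matrix; a small sketch with the hypothetical sizes above:

```python
import numpy as np

d0, d_in = 10_000, 50                          # |V| and embedding dimension
rng = np.random.default_rng(0)
W0 = rng.normal(scale=0.01, size=(d0, d_in))   # embedding matrix

word_index = 42                                # index of the word in V
x0 = np.zeros((1, d0)); x0[0, word_index] = 1.0

# one-hot matrix product and direct row lookup give the same embedding
assert np.allclose(x0 @ W0, W0[word_index])
```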
Pretraining Representations
- We would like strong shared representations of words
- However, the PTB has only ~1M labeled words, which is relatively small
- Collobert et al. (2008, 2011) use a semi-supervised method.
- (Close connection to Bengio et al. (2003), next topic)
Semi-Supervised Training
Idea: Train representations separately on more data
1. Pretrain the word embeddings W^0 first.
2. Substitute them in as the first NN layer.
3. Fine-tune the embeddings for the final task.
   - Modify the first layer based on supervised gradients
   - Optional; some work skips this step
Large Corpora
To learn rare word embeddings, we need many more tokens:

- C&W
  - English Wikipedia (631 million word tokens)
  - Reuters Corpus (221 million word tokens)
  - Total vocabulary size: 130,000 word types
- word2vec
  - Google News (6 billion word tokens)
  - Total vocabulary size: ≈1M word types
But this data has no labels...
Contents
Embedding Motivation
C&W Embeddings
word2vec
Evaluating Embeddings
C&W Embeddings
- Assumption: Text in Wikipedia is coherent (in some sense).
- Most randomly corrupted text is incoherent.
- Embeddings should distinguish coherence.
- Common idea in unsupervised learning (distributional hypothesis).
C&W Setup
Let V be the vocabulary of English and let s score any window of size d_win = 5. If we see the phrase
[ the dog walks to the ]
It should score higher by s than
[ the dog house to the ]
[ the dog cats to the ]
[ the dog skips to the ]
...
C&W Setup
Can estimate the score s as a windowed neural network,

s(w_1, . . . , w_{d_win}) = hardtanh(xW^1 + b^1)W^2 + b

with

x = [v(w_1) v(w_2) . . . v(w_{d_win})]

- d_in = d_win × 50, d_hid = 100, d_win = 11, d_out = 1
Example: Function s

x = [v(w_3) v(w_4) v(w_5) v(w_6) v(w_7)]

(each block v(w_i) has size d_in/d_win)
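A numpy sketch of the window scorer s under these definitions (illustrative; the dimensions are the example values above, and the lookup-and-concatenate step follows x = [v(w_1) . . . v(w_dwin)]):

```python
import numpy as np

def hardtanh(z):
    return np.clip(z, -1.0, 1.0)

def score_window(word_ids, W0, W1, b1, W2, b2):
    """s(w_1, ..., w_dwin) = hardtanh(xW1 + b1)W2 + b, with x = concatenated embeddings."""
    x = np.concatenate([W0[i] for i in word_ids])[None, :]   # 1 x (d_win * 50)
    return (hardtanh(x @ W1 + b1) @ W2 + b2).item()          # scalar coherence score

d_win, d_emb, d_hid, V = 5, 50, 100, 130_000
rng = np.random.default_rng(0)
W0 = rng.normal(scale=0.01, size=(V, d_emb))
W1 = rng.normal(scale=0.01, size=(d_win * d_emb, d_hid)); b1 = np.zeros((1, d_hid))
W2 = rng.normal(scale=0.01, size=(d_hid, 1));             b2 = np.zeros((1, 1))
print(score_window([3, 17, 2048, 9, 3], W0, W1, b1, W2, b2))
```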
Training?
- Different setup than previous experiments.
- No direct supervision y.
- Train to rank good examples better.
Ranking Loss
Given only examples {x_1, . . . , x_n}, and for each example a set D(x) of alternatives,

L(θ) = ∑_i ∑_{x′ ∈ D(x_i)} L_ranking(s(x_i; θ), s(x′; θ))

L_ranking(y, y′) = max{0, 1 − (y − y′)}
Example: C&W ranking
x = [the dog walks to the]
D(x) = { [the dog skips to the], [the dog in to the], . . . }
- (Torch nn.RankingCriterion)
- Note: slightly different setup.
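A tiny sketch of this hinge-style ranking loss, taking the score of the true window and of one corrupted window:

```python
def ranking_loss(score_good, score_corrupt):
    """L_ranking = max{0, 1 - (s(x) - s(x'))}: the true window should win by a margin of 1."""
    return max(0.0, 1.0 - (score_good - score_corrupt))

print(ranking_loss(2.5, 0.3))   # 0.0  -> margin satisfied, no gradient
print(ranking_loss(0.2, 0.1))   # 0.9  -> still incurs loss, drives an update
```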
C&W Embeddings in Practice
- Vocabulary size |D(x)| > 100,000
- Training time: 4 weeks
- (Collobert is a main author of Torch)
Sampling (Sketch of Wsabie (Weston, 2011))
Observation: in many contexts,

L_ranking(y, y′) = max{0, 1 − (y − y′)} = 0

This is particularly true later in training.

For difficult contexts, it may still be easy to find

L_ranking(y, y′) = max{0, 1 − (y − y′)} ≠ 0

We can therefore sample from D(x) to find an update.
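A sketch of that sampling trick (assumptions: corruption replaces the center word with a uniformly random word, and score_fn is a window scorer like the one sketched earlier): keep sampling from D(x) until a corruption violates the margin, then update on that pair.

```python
import random

def sample_violation(window, score_fn, vocab_size, max_tries=100):
    """Sample corrupted windows until one has nonzero ranking loss (or give up)."""
    good = score_fn(window)
    mid = len(window) // 2
    for _ in range(max_tries):
        corrupt = list(window)
        corrupt[mid] = random.randrange(vocab_size)       # random replacement center word
        if max(0.0, 1.0 - (good - score_fn(corrupt))) > 0.0:
            return corrupt                                # found an example worth updating on
    return None                                           # every sampled corruption had zero loss
```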
C&W Results
1. Use dense representations instead of sparse
2. Use windowed area instead of sequence models
3. Use neural networks to model windowed interactions
4. Use semi-supervised learning to pretrain representations.
Contents
Embedding Motivation
C&W Embeddings
word2vec
Evaluating Embeddings
word2vec
- Contributions:
  - Scale the embedding process to massive sizes
  - Experiments with several architectures
  - Empirical evaluations of embeddings
  - Influential release of software/data.
- Differences with C&W:
  - Instead of an MLP, uses a (bi)linear model (linear in the paper)
  - Instead of a ranking model, directly predicts the word (cross-entropy)
  - Various other extensions.
- Two different models:
  1. Continuous Bag-of-Words (CBOW)
  2. Continuous Skip-gram
word2vec (Bilinear Model)
Back to a pure bilinear model, but with a much bigger output space,

ŷ = softmax(((∑_i x_i^0 W^0) / (d_win − 1)) W^1)

- x_i^0 ∈ R^{1×d_0}: input words as one-hot vectors.
- W^0 ∈ R^{d_0×d_in}; d_0 = |V|, word embeddings
- W^1 ∈ R^{d_in×d_out}; d_out = |V|, output embeddings

Notes:

- Bilinear parameter interaction.
- d_0 >> d_in, e.g. 50 ≤ d_in ≤ 1000, 10,000 ≤ |V| ≤ 1M or more
word2vec (Mikolov, 2013)
Continuous Bag-of-Words (CBOW)
ŷ = softmax(((∑_i x_i^0 W^0) / (d_win − 1)) W^1)

- Attempt to predict the middle word

[ the dog walks to the ]

Example: CBOW

x = (v(w_3) + v(w_4) + v(w_6) + v(w_7)) / (d_win − 1)

y = δ(w_5)

W^1 is no longer partitioned by row (order is lost)
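A numpy sketch of the CBOW forward pass: average the context-word embeddings (order is lost) and score every word in V against the output embeddings. The sizes here are hypothetical.

```python
import numpy as np

def cbow_forward(context_ids, W0, W1):
    """y_hat = softmax((sum of context embeddings / (d_win - 1)) W1)."""
    h = W0[context_ids].mean(axis=0)        # average of v(w_i), shape (d_in,)
    scores = h @ W1                          # one score per word in V
    scores -= scores.max()                   # shift for numerical stability
    p = np.exp(scores)
    return p / p.sum()

V, d_in = 10_000, 50
rng = np.random.default_rng(0)
W0 = rng.normal(scale=0.01, size=(V, d_in))   # word embeddings
W1 = rng.normal(scale=0.01, size=(d_in, V))   # output embeddings
p = cbow_forward([3, 17, 9, 12], W0, W1)      # context words w3, w4, w6, w7
print(p.shape, round(p.sum(), 6))             # (10000,) 1.0
```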
Continuous Skip-gram

ŷ = softmax((x^0 W^0) W^1)

- Also a bilinear model
- Attempt to predict each context word from the middle word
[ the dog ]
Example: Skip-gram
x = v(w5)
y = δ(w3)
Done for each word in the window.
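The mirror-image sketch for skip-gram: embed the middle word and take the cross-entropy loss for one context word; in training this is repeated for every word in the window.

```python
import numpy as np

def skipgram_loss(center_id, context_id, W0, W1):
    """-log softmax((x^0 W^0) W^1)[context_id] for one (center, context) pair."""
    scores = W0[center_id] @ W1              # one score per word in V
    scores -= scores.max()                   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[context_id]

V, d_in = 10_000, 50
rng = np.random.default_rng(0)
W0 = rng.normal(scale=0.01, size=(V, d_in))
W1 = rng.normal(scale=0.01, size=(d_in, V))
print(skipgram_loss(5, 3, W0, W1))           # loss for predicting w3 from w5
```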
Additional aspects
- The window size d_win is sampled for each SGD step
- SGD is done less often for frequent words.
- We have slightly simplified the training objective.
Softmax Issues
Use a softmax to force a distribution,
softmax(z) = exp(z) / ∑_{c∈C} exp(z_c)

log softmax(z) = z − log ∑_{c∈C} exp(z_c)

- Issue: the class set C is huge.
- For C&W, 100,000 types; for word2vec, 1,000,000 types
- Note: the largest dataset is 6 billion words
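The expense is in the normalizer: every log-softmax evaluation touches all |C| scores. A small numerically stable sketch:

```python
import numpy as np

def log_softmax(z):
    """log softmax(z) = z - log sum_c exp(z_c), computed stably by shifting by max(z)."""
    z = z - z.max()
    return z - np.log(np.exp(z).sum())       # still O(|C|) work per evaluation

z = np.random.randn(1_000_000)               # word2vec-scale output layer
print(log_softmax(z)[:3])
```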
Two-Layer Softmax
First, cluster words into hard classes (for instance, Brown clusters).

Groups words into classes based on word context.

(class tree: e.g., class 3 = {dog, cat, horse, . . .}, class 5 = {car, truck, motorcycle, . . .})
Two-Layer Softmax
Assume that we first generate a class C and then a word,

P(Y | X) ≈ P(Y | C, X; θ) P(C | X; θ)

Estimate the distributions with a shared embedding layer,

P(C | X; θ):

ŷ^(1) = softmax((x^0 W^0) W^1 + b)

P(Y | C = class, X; θ):

ŷ^(2) = softmax((x^0 W^0) W^class + b)
Softmax as Tree
(the same class tree as above, with classes at the first level and words at the leaves)

ŷ^(1) = softmax((x^0 W^0) W^1 + b)

ŷ^(2) = softmax((x^0 W^0) W^class + b)

L_2SM(ŷ^(1), ŷ^(2), y^(1), y^(2)) = − log p(y | x, class(y)) − log p(class(y) | x)
= − log ŷ^(1)_{c_1} − log ŷ^(2)_{c_2}

where c_1 is the index of the true class and c_2 is the index of the true word within that class.
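A numpy sketch of this two-layer softmax loss: one small softmax over classes, then one over the words inside the true class, instead of a single softmax over all of |V|. The class partition and per-class output weights here are hypothetical.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def two_layer_softmax_loss(h, W_class, W_within, c1, c2):
    """-log p(class(y) | x) - log p(y | x, class(y)); h stands in for x^0 W^0."""
    class_logp = log_softmax(h @ W_class)          # distribution over classes
    word_logp = log_softmax(h @ W_within[c1])      # distribution over words in class c1 only
    return -(class_logp[c1] + word_logp[c2])

d_in, n_classes, class_size = 50, 100, 100          # roughly sqrt(|V|) each
rng = np.random.default_rng(0)
h = rng.normal(size=d_in)
W_class = rng.normal(scale=0.01, size=(d_in, n_classes))
W_within = rng.normal(scale=0.01, size=(n_classes, d_in, class_size))
print(two_layer_softmax_loss(h, W_class, W_within, c1=3, c2=42))
```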
Speed
- Computing the loss only requires walking a path down the tree.
- With a balanced two-layer tree, computing the loss requires O(√|V|).
- (Note: computing the full distribution requires O(|V|))
Hierarchical Softmax (HSM)

- Build a multiple-layer tree

L_HSM(ŷ^(1), . . . , ŷ^(C), y^(1), . . . , y^(C)) = − ∑_i log ŷ^(i)_{c_i}

- A balanced tree only requires O(log_2 |V|)
- Experiments on website (Mnih and Hinton, 2008)
HSM with Huffman Encoding
- Requires O(log_2 perp(unigram))
- Reduces training time to only 1 day for 1.6 billion tokens
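A sketch of how the Huffman codes could be built from unigram counts with a heap; the code length of a word is the number of binary decisions HSM pays for that word, so frequent words get cheap updates (illustrative toy counts):

```python
import heapq

def huffman_code_lengths(freqs):
    """freqs: dict word -> count. Returns dict word -> Huffman code length (tree depth)."""
    heap = [(count, i, {word: 0}) for i, (word, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)               # two least frequent subtrees
        c2, _, d2 = heapq.heappop(heap)
        merged = {w: depth + 1 for w, depth in {**d1, **d2}.items()}  # one more bit for each word
        heapq.heappush(heap, (c1 + c2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

print(huffman_code_lengths({"the": 1000, "dog": 50, "walks": 20, "zygote": 1}))
# frequent words get short codes, rare words long ones
```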
Contents
Embedding Motivation
C&W Embeddings
word2vec
Evaluating Embeddings
How good are embeddings?
- Qualitative Analysis/Visualization
- Analogy task
- Extrinsic Metrics
Metrics
Dot product:

x_cat x_dog^⊤

Cosine Similarity:

x_cat x_dog^⊤ / (||x_cat|| ||x_dog||)
k-nearest neighbors (cosine sim)
dog
cat 0.921800527377
dogs 0.851315870426
horse 0.790758298322
puppy 0.775492121034
pet 0.772470734611
rabbit 0.772081457265
pig 0.749006160038
snake 0.73991884888
- Intuition: trained to match words that act the same.
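A sketch of the cosine-similarity nearest-neighbor query behind a table like the one above (the embedding matrix and vocabulary list here are hypothetical stand-ins):

```python
import numpy as np

def nearest_neighbors(word, vocab, E, k=8):
    """Return the k words whose embeddings have the highest cosine similarity to `word`."""
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)   # unit-normalize every row
    q = E_norm[vocab.index(word)]
    sims = E_norm @ q                                       # cosine similarity to every word
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != word][:k]

vocab = ["dog", "cat", "horse", "car", "truck"]
E = np.random.default_rng(0).normal(size=(len(vocab), 50))  # stand-in embeddings
print(nearest_neighbors("dog", vocab, E, k=3))
```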
Empirical Measures: Analogy task
Analogy questions:
A:B::C:
- 5 types of semantic questions, 9 types of syntactic
Embedding Tasks
Analogy Prediction
A:B::C:
x′ = xB − xA + xC
Project to the closest word,
arg max_{D∈V} x_D x′^⊤ / (||x_D|| ||x′||)

- Code example (a sketch follows below)
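The slide's code example is not reproduced in this transcript; the following is a sketch of the standard analogy query it describes, reusing the same kind of normalized embedding matrix and vocabulary list as in the nearest-neighbor sketch:

```python
import numpy as np

def analogy(a, b, c, vocab, E):
    """A:B::C:? -> the word D maximizing cos(x_D, x_B - x_A + x_C), excluding A, B, C."""
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    idx = {w: i for i, w in enumerate(vocab)}
    query = E_norm[idx[b]] - E_norm[idx[a]] + E_norm[idx[c]]
    query /= np.linalg.norm(query)
    sims = E_norm @ query
    for i in np.argsort(-sims):                      # best match that is not a query word
        if vocab[i] not in (a, b, c):
            return vocab[i]

vocab = ["man", "woman", "king", "queen", "car"]
E = np.random.default_rng(0).normal(size=(len(vocab), 50))  # random stand-in embeddings
print(analogy("man", "woman", "king", vocab, E))  # would be "queen" with real trained embeddings
```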
Extrinsic Tasks
- Text classification
- Part-of-speech tagging
- Many, many others over the last couple of years
Conclusion
- Word Embeddings
- Scaling issues and tricks
- Next Class: Language Modeling