Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting...

113
Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Distributed word representations Christopher Potts Stanford Linguistics CS 224U: Natural language understanding April 8 and 13 1 / 90

Transcript of Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting...

Page 1: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Distributed word representations

Christopher Potts

Stanford Linguistics

CS 224U: Natural language understandingApril 8 and 13

1 / 90

Page 2: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Plan

1. High-level goals and guiding hypotheses2. Matrix designs3. Vector comparison4. Basic reweighting5. Subword information6. Visualization7. Dimensionality reduction

a. Latent Semantic Analysisb. Autoencodersc. GloVed. word2vec

8. Retrofitting

2 / 90

Page 3: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Meaning latent in co-occurrence patterns

against age agent ages ago agree ahead ain’t air aka al

against 2003 90 39 20 88 57 33 15 58 22 24age 90 1492 14 39 71 38 12 4 18 4 39

agent 39 14 507 2 21 5 10 3 9 8 25ages 20 39 2 290 32 5 4 3 6 1 6ago 88 71 21 32 1164 37 25 11 34 11 38

agree 57 38 5 5 37 627 12 2 16 19 14ahead 33 12 10 4 25 12 429 4 12 10 7

ain’t 15 4 3 3 11 2 4 166 0 3 3air 58 18 9 6 34 16 12 0 746 5 11

aka 22 4 8 1 11 19 10 3 5 261 9al 24 39 25 6 38 14 7 3 11 9 861

3 / 90

Page 4: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Meaning latent in co-occurrence patterns

Class Word

0 awful0 terrible0 lame0 worst0 disappointing1 nice1 amazing1 wonderful1 good1 awesome

Pr(Class = 1) Word

? w1? w2? w3? w4

A hopeless learning scenario

3 / 90

Page 5: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Meaning latent in co-occurrence patterns

Class Word

0 awful0 terrible0 lame0 worst0 disappointing1 nice1 amazing1 wonderful1 good1 awesome

Pr(Class = 1) Word

? w1? w2? w3? w4

A hopeless learning scenario

3 / 90

Page 6: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Meaning latent in co-occurrence patterns

Class Word excellent terrible

0 awful 6 1130 terrible 8 3090 lame 1 690 worst 9 2020 disappointing 19 291 nice 118 21 amazing 91 61 wonderful 66 71 good 21 91 awesome 67 2

Pr(Class=1) Word excellent terrible

≈0 w1 4 82≈0 w2 5 84≈1 w3 49 3≈1 w4 41 5

A promising learning scenario

3 / 90

Page 7: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Meaning latent in co-occurrence patterns

Class Word excellent terrible

0 awful 6 1130 terrible 8 3090 lame 1 690 worst 9 2020 disappointing 19 291 nice 118 21 amazing 91 61 wonderful 66 71 good 21 91 awesome 67 2

Pr(Class=1) Word excellent terrible

≈0 w1 4 82≈0 w2 5 84≈1 w3 49 3≈1 w4 41 5

A promising learning scenario

3 / 90

Page 8: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

High-level goals

1. Begin thinking about how vectors can encode themeanings of linguistic units.

2. Foundational concepts for vector-space model (VSMs).

3. A foundation for deep learning NLU models.

4. In your assignment and projects, you’re likely to userepresentations like these:É to understand and model linguistic and social

phenomena; and/orÉ as inputs to other machine learning models.

4 / 90

Page 9: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Associated materials

1. Codea. vsm.pyb. vsm_01_distributional.ipynbc. vsm_02_dimreduce.ipynbd. vsm_03_retrofitting.ipynb

2. Homework 1 and bake-off 1: hw_wordsim.ipynb3. Screencasts:

a. Overview [link]b. Vector comparison [link]c. Reweighting [link]d. Dimensionality reduction [link]

4. Core readings: Turney and Pantel 2010; Smith 2019;Pennington et al. 2014; Faruqui et al. 2015

5 / 90

Page 10: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Guiding hypothesesFirth (1957)“You shall know a word by the company it keeps.”

Firth (1957)“the complete meaning of a word is always contextual, andno study of meaning apart from context can be takenseriously.”

Wittgenstein (1953)“the meaning of a word is its use in the language”

Harris (1954)“distributional statements can cover all of the material of alanguage without requiring support from other types ofinformation.”

Turney and Pantel (2010)“If units of text have similar vectors in a text frequencymatrix, then they tend to have similar meanings.”

6 / 90

Page 11: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Great power, a great many design choicestokenizationannotationtaggingparsingfeature selection... cluster texts by date/author/discourse context/. . .⇓ w

Matrix design

word × documentword × wordword × search proximityadj. × modified nounword × dependency rel.

...

Reweighting

probabilitieslength norm.TF-IDFPMIPositive PMI

...

Dimensionalityreduction

LSAPLSALDAPCANNMF

...

Vectorcomparison

EuclideanCosineDiceJaccardKL

...

Nearly the full cross-product to explore; only a handful of the com-binations are ruled out mathematically. Models like GloVe andword2vec offer packaged solutions to design/weighting/reduction andreduce the importance of the choice of comparison method.

7 / 90

Page 12: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Designs

1. High-level goals and guiding hypotheses2. Matrix designs3. Vector comparison4. Basic reweighting5. Subword information6. Visualization7. Dimensionality reduction

a. Latent Semantic Analysisb. Autoencodersc. GloVed. word2vec

8. Retrofitting

8 / 90

Page 13: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

word x word

against age agent ages ago agree ahead ain’t air aka al

against 2003 90 39 20 88 57 33 15 58 22 24age 90 1492 14 39 71 38 12 4 18 4 39

agent 39 14 507 2 21 5 10 3 9 8 25ages 20 39 2 290 32 5 4 3 6 1 6ago 88 71 21 32 1164 37 25 11 34 11 38

agree 57 38 5 5 37 627 12 2 16 19 14ahead 33 12 10 4 25 12 429 4 12 10 7

ain’t 15 4 3 3 11 2 4 166 0 3 3air 58 18 9 6 34 16 12 0 746 5 11

aka 22 4 8 1 11 19 10 3 5 261 9al 24 39 25 6 38 14 7 3 11 9 861

9 / 90

Page 14: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

word x document

d1 d2 d3 d4 d5 d6 d7 d8 d9 d10

against 0 0 0 1 0 0 3 2 3 0age 0 0 0 1 0 3 1 0 4 0

agent 0 0 0 0 0 0 0 0 0 0ages 0 0 0 0 0 2 0 0 0 0ago 0 0 0 2 0 0 0 0 3 0

agree 0 1 0 0 0 0 0 0 0 0ahead 0 0 0 1 0 0 0 0 0 0

ain’t 0 0 0 0 0 0 0 0 0 0air 0 0 0 0 0 0 0 0 0 0

aka 0 0 0 1 0 0 0 0 0 0

10 / 90

Page 15: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

word x discourse context

Upper left corner of an interjection × dialog-act tag matrixderived from the Switchboard Dialog Act Corpus:

% + ˆ2 ˆg ˆh ˆq aa

absolutely 0 2 0 0 0 0 95actually 17 12 0 0 1 0 4anyway 23 14 0 0 0 0 0

boy 5 3 1 0 5 2 1bye 0 1 0 0 0 0 0

bye-bye 0 0 0 0 0 0 0dear 0 0 0 0 1 0 0

definitely 0 2 0 0 0 0 56exactly 2 6 1 0 0 0 294

gee 0 3 0 0 2 1 1goodness 1 0 0 0 2 0 0

11 / 90

Page 16: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

phonological segment × feature valuesDerived from http://www.linguistics.ucla.edu/people/hayes/120a/.Dimensions: (141× 28).

sylla

bic

stre

ss

long

conso

nan

tal

sonor

ant

conti

nuan

t

del

ayed

.rel

ease

app

roxi

man

t

tap

trill

· · ·

6 1 −1 −1 −1 1 1 0 1 −1 −1

· · ·

A 1 −1 −1 −1 1 1 0 1 −1 −1Œ 1 −1 −1 −1 1 1 0 1 −1 −1a 1 −1 −1 −1 1 1 0 1 −1 −1æ 1 −1 −1 −1 1 1 0 1 −1 −12 1 −1 −1 −1 1 1 0 1 −1 −1O 1 −1 −1 −1 1 1 0 1 −1 −1o 1 −1 −1 −1 1 1 0 1 −1 −1G 1 −1 −1 −1 1 1 0 1 −1 −1@ 1 −1 −1 −1 1 1 0 1 −1 −1...

...

12 / 90

Page 17: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

phonological segment × feature valuesDerived from http://www.linguistics.ucla.edu/people/hayes/120a/.Dimensions: (141× 28).

ɒɑ

ɶ

ʌ

ɔ

o ɤ

ɘ

œ

əә

e

ɞ

ø

ɛ

ɵ

ɯu

ʊ

ɨʉ

y i

ʏɪ

ŋ

ʟ

ɫ

ɴ

ʀ

ɲ

ʎ

ŋŋ

ʟʟ

ɳ

ʙ

ɭ

ɺ

ɻ

ɽr

nm

l

ɾ

ɱ

ʔɣx

kg

kxg ɣ

ħʕ

ʁq

χ

ɢ

ɕ

ɟ

ʝ

c

ç

dʑtç

ɣɣ

xx

k k

g g

ʑ

ʈ ɖ

ɬ

ʐ

ɸ

ʂ ʒ

z

v

t

ʃ

s

p

f

d

b

θ

ɮ

ð

β

dʒdz

d ɮ tʃtɬ tstɬ ts

tɬ d zd ɮ ʈʂ

ɖʑ

pf bv

pɸbβ

tθdð

cçɟʝ

kxkxgɣ

g ɣqχɢʁ

ɧ

kpgb

pt

bd

ɰɰ

w ɥ

ʋ

ʍ ɦh

12 / 90

Page 18: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Feature representations of data

• the movie was horrible becomes [4,0,1/4].

• The complex, real-world response of an experimentalsubject to a particular example becomes [0,1] or [118,1].

• A human is modeled as a vector [24,140,5,12].

• A continuous, noisy speech stream is reduced to arestricted set of acoustic features.

13 / 90

Page 19: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Other designs

• word × dependency rel.• word × syntactic context• adj. × modified noun• word × search query• person × product• word × person• word × word × pattern• verb × subject × object...

14 / 90

Page 20: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Windows and scaling: What is a co-occurrence?

• Larger, flatter windows capture more semanticinformation.

• Small, more scaled windows capture more syntactic(collocational) information.

• Textual boundaries can be separately controlled; coreunit as the sentence/paragraph/document will havemajor consequences.

15 / 90

Page 21: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Windows and scaling: What is a co-occurrence?

from swerve of shore to bend of bay , brings

4 3 2 1 0 1 2 3 4 5

• Larger, flatter windows capture more semanticinformation.

• Small, more scaled windows capture more syntactic(collocational) information.

• Textual boundaries can be separately controlled; coreunit as the sentence/paragraph/document will havemajor consequences.

15 / 90

Page 22: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Windows and scaling: What is a co-occurrence?

from swerve of shore to bend of bay , brings

4 3 2 1 0 1 2 3 4 5

from swerve of shore to bend of bay , brings

Window: 3 4 3 2 1 0 1 2 3 4 5

Scaling: flat 0 1 1 1 1 1 1 1 0 0

Scaling: 1n 0 1

312

11 1 1

112

13 0 0

• Larger, flatter windows capture more semanticinformation.

• Small, more scaled windows capture more syntactic(collocational) information.

• Textual boundaries can be separately controlled; coreunit as the sentence/paragraph/document will havemajor consequences.

15 / 90

Page 23: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Windows and scaling: What is a co-occurrence?

from swerve of shore to bend of bay , brings

4 3 2 1 0 1 2 3 4 5

from swerve of shore to bend of bay , brings

Window: 3 4 3 2 1 0 1 2 3 4 5

Scaling: flat 0 1 1 1 1 1 1 1 0 0

Scaling: 1n 0 1

312

11 1 1

112

13 0 0

• Larger, flatter windows capture more semanticinformation.

• Small, more scaled windows capture more syntactic(collocational) information.

• Textual boundaries can be separately controlled; coreunit as the sentence/paragraph/document will havemajor consequences.

15 / 90

Page 24: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Windows and scaling: What is a co-occurrence?

from swerve of shore to bend of bay , brings

4 3 2 1 0 1 2 3 4 5

• Larger, flatter windows capture more semanticinformation.

• Small, more scaled windows capture more syntactic(collocational) information.

• Textual boundaries can be separately controlled; coreunit as the sentence/paragraph/document will havemajor consequences.

15 / 90

Page 25: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Code snippets

snippets

March 31, 2019

In [1]: import syssys.path.append("../cs224u")

In [3]: import osimport pandas as pd

DATA_HOME = os.path.join('data', 'vsmdata')

# IMDB: Window size = 5; scaling = 1/nimdb5 = pd.read_csv(

os.path.join(DATA_HOME, 'imdb_window5-scaled.csv.gz'), index_col=0)

# IMDB: Window size = 20; scaling = flatimdb20 = pd.read_csv(

os.path.join(DATA_HOME, 'imdb_window20-flat.csv.gz'), index_col=0)

# Gigaword: Window size = 5; scaling = 1/ngiga5 = pd.read_csv(

os.path.join(DATA_HOME, 'giga_window5-scaled.csv.gz'), index_col=0)

# Gigaword: Window size = 20; scaling = flatgiga20 = pd.read_csv(

os.path.join(DATA_HOME, 'giga_window20-flat.csv.gz'), index_col=0)

In [48]: import osimport pandas as pdimport vsm

In [32]: ABC = pd.DataFrame([[ 2.0, 4.0],[10.0, 15.0],[14.0, 10.0]],index=['A', 'B', 'C'],columns=['x', 'y'])

In [33]: A = ABC.loc['A']B = ABC.loc['B']

1

16 / 90

Page 26: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Vector comparison

1. High-level goals and guiding hypotheses2. Matrix designs3. Vector comparison4. Basic reweighting5. Subword information6. Visualization7. Dimensionality reduction

a. Latent Semantic Analysisb. Autoencodersc. GloVed. word2vec

8. Retrofitting

17 / 90

Page 27: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Running example

dx dy

A 2 4B 10 15C 14 10

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

(10,15)B

(2,4)A

(14,10)C

• Focus on distance measures• Illustrations with row vectors

18 / 90

Page 28: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

EuclideanBetween vectors u and v of dimension n:

euclidean(u,v) =

n∑

i=1

|ui − vi|2

dx dy

A 2 4B 10 15C 14 10

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

(10,15)B

(2,4)A

(14,10)C

10 − 142 + 15 − 102 = 6.4

2 − 102 + 4 − 152 = 13.6

19 / 90

Page 29: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Length normalization

Given a vector u of dimension n, the L2-length of u is

||u||2 =

n∑

i=1

u2i

and the length normalization of u is�

u1

||u||2,

u2

||u||2, · · · ,

un

||u||2

20 / 90

Page 30: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Length normalization

dx dy ||u||2A 2 4 4.47B 10 15 18.03C 14 10 17.20

row L2 norm⇒

dx dy

A 24.47

44.47

B 1018.03

1518.03

C 1417.20

1017.20

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

(10,15)B

(2,4)A

(14,10)C

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

(0.55,0.83)B

(0.45,0.89)A

(0.81,0.58)C

20 / 90

Page 31: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Cosine distanceBetween vectors u and v of dimension n:

cosine(u,v) = 1−

∑ni=1 ui × vi

||u||2 × ||v||2

dx dy

A 2 4B 10 15C 14 10

21 / 90

Page 32: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Cosine distanceBetween vectors u and v of dimension n:

cosine(u,v) = 1−

∑ni=1 ui × vi

||u||2 × ||v||2

dx dy

A 2 4B 10 15C 14 10

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

(10,15)B

(2,4)A

(14,10)C

1 −(10

× 14) + (15

× 10)

||10, 1

5||× ||14

, 10||= 0.

065

1−

(2× 1

0)+ (

4× 1

5)

||2, 4

|| ×||1

0, 1

5||=

0.00

8

21 / 90

Page 33: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Cosine distanceBetween vectors u and v of dimension n:

cosine(u,v) = 1−

∑ni=1 ui × vi

||u||2 × ||v||2

dx dy

A 2 4B 10 15C 14 10

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

(0.55,0.83)B

(0.45,0.89)A

(0.81,0.58)C

1 −(0.

55× 0.

81) + (0.

83× 0.

58)

||0.55

, 0.8

1||× ||0.

81, 0

.58||= 0.

065

1−

(0.4

0.55

) +(0

.89

×0.

83)

||0.4

5, 0

.89||

×||0

.55,

0.8

3||=

0.00

8

21 / 90

Page 34: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Matching-based methods

Matching coefficientmatching(u,v) =

n∑

i=1

min(ui,vi)

Jaccard distancejaccard(u,v) = 1−

matching(u,v)∑n

i=1 max(ui,vi)

Dice distancedice(u,v) = 1−

2×matching(u,v)∑n

i=1 ui + vi

Overlapoverlap(u,v) = 1−

matching(u,v)

min�∑n

i=1 ui ,∑n

i=1 vi�

22 / 90

Page 35: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

KL divergenceBetween probability distributions p and q:

D(p ‖ q) =n∑

i=1

pi log�

pi

qi

p is the reference distribution. Before calculation, smooth byadding ε.

23 / 90

Page 36: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

KL divergence

dx dy

A 2 4B 10 15C 14 10

Normalize the rows⇒

dx dy

A 0.33 0.67B 0.40 0.60C 0.58 0.42

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

(10,15)B

(2,4)A

(14,10)C

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

(0.4,0.6)B

(0.33,0.67)A

(0.58,0.42)C

(0.33 × log(0.330.4

)) + (0.67 × log(0.670.6

)) = 0.01

(0.4 × log( 0.40.58

)) + (0.6 × log( 0.60.42

)) = 0.13

23 / 90

Page 37: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

KL variantsSymmetric KL

D(p ‖ q) +D(q ‖ p)

KL-divergence with skew

D(p ‖ αq+ (1− α)p) 0 ≤ α ≤ 1

p q

α = 1 ; SkewKL = 1.17

0.10.2

0.7

0.10.2

0.7

p q

α = 0.8 ; SkewKL = 0.63

0.10.2

0.7

0.200.22

0.58

p q

α = 0.5 ; SkewKL = 0.25

0.10.2

0.7

0.2

0.40.4

p q

α = 0.2 ; SkewKL = 0.05

0.10.2

0.7

0.200.22

0.58

p q

α = 0 ; SkewKL = 0

0.10.2

0.7

0.10.2

0.7

Jensen–Shannon distance√

√1

2D�

p ‖p+ q

2

+1

2D�

q ‖p+ q

2

24 / 90

Page 38: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Relationships and generalizations

1. Euclidean, Jaccard, and Dice with raw count vectors willtend to favor raw frequency over distributional patterns.

2. Euclidean with L2-normed vectors is equivalent to cosinew.r.t. ranking (Manning and Schütze 1999:301).

3. Jaccard and Dice are equivalent w.r.t. ranking.

4. Both L2-norms and probability distributions can obscuredifferences in the amount/strength of evidence, whichcan in turn have an effect on the reliability of cosine,normed-euclidean, and KL divergence. Theseshortcomings might be addressed through weightingschemes.

25 / 90

Page 39: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Proper distance metric?To qualify as a distance metric, a vector comparison methodd has to be symmetric (d(x,y) = d(y,x)), assign 0 to identicalvectors (d(x,x) = 0), and satisfy the triangle inequality:

d(x, z) ¶ d(x,y) + d(y, z)

Cosine distance as I defined itdoesn’t satisfy this:

Distance metric?

Yes: Euclidean, Jaccard forbinary vectors, Jensen–Shannon,cosine as

cos−1�∑n

i=1ui × vi

||u||2×||v||2

π

No: Matching, Jaccard, Dice,Overlap, KL divergence,Symmetric KL, KL with skew

26 / 90

Page 40: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Comparing the two versions of cosineRandom sample of 100 vectors from our giga20 countmatrix. Correlation is 99.8.

0.2 0.3 0.4 0.5Improper cosine distance

0.175

0.200

0.225

0.250

0.275

0.300

0.325

0.350

Prop

er c

osin

e di

stan

ce

27 / 90

Page 41: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Code snippetsIn [ ]: import sys

sys.path.append("../cs224u")

DATA_HOME = os.path.join('../cs224u/data', 'vsmdata')

In [1]: import osimport pandas as pdimport vsm

In [2]: ABC = pd.DataFrame([[ 2.0, 4.0],[10.0, 15.0],[14.0, 10.0]], index=['A', 'B', 'C'], columns=['x', 'y'])

In [3]: vsm.euclidean(ABC.loc['A'], ABC.loc['B'])

Out[3]: 13.601470508735444

In [4]: vsm.vector_length(ABC.loc['A'])

Out[4]: 4.47213595499958

In [5]: vsm.length_norm(ABC.loc['A']).values

Out[5]: array([0.4472136 , 0.89442719])

In [6]: vsm.cosine(ABC.loc['A'], ABC.loc['B'])

Out[6]: 0.007722123286332261

In [7]: vsm.matching(ABC.loc['A'], ABC.loc['B'])

Out[7]: 6.0

In [8]: vsm.jaccard(ABC.loc['A'], ABC.loc['B'])

Out[8]: 0.76

In [9]: DATA_HOME = os.path.join('data', 'vsmdata')

imdb5 = pd.read_csv(os.path.join(DATA_HOME, 'imdb_window5-scaled.csv.gz'), index_col=0)

In [10]: vsm.cosine(imdb5.loc['good'], imdb5.loc['excellent'])

Out[10]: 0.9644382411451131

In [11]: vsm.cosine(imdb5.loc['good'], imdb5.loc['bad'])

Out[11]: 0.9480014759326252

2

28 / 90

Page 42: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Code snippetsIn [ ]:

In [ ]:

In [9]: DATA_HOME = os.path.join('data', 'vsmdata')

imdb5 = pd.read_csv(os.path.join(DATA_HOME, 'imdb_window5-scaled.csv.gz'), index_col=0)

In [10]: vsm.cosine(imdb5.loc['good'], imdb5.loc['excellent'])

Out[10]: 0.9644382411451131

In [11]: vsm.cosine(imdb5.loc['good'], imdb5.loc['bad'])

Out[11]: 0.9480014759326252

In [12]: vsm.neighbors('bad', imdb5).head()

Out[12]: bad 0.000000guys 0.823744. 0.844851taste 0.893747guy 0.896312dtype: float64

In [13]: vsm.neighbors('bad', imdb5, distfunc=vsm.jaccard).head(3)

Out[13]: bad 0.000000think 0.783744better 0.788782dtype: float64

In [ ]:

3

28 / 90

Page 43: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Basic reweighting

1. High-level goals and guiding hypotheses2. Matrix designs3. Vector comparison4. Basic reweighting5. Subword information6. Visualization7. Dimensionality reduction

a. Latent Semantic Analysisb. Autoencodersc. GloVed. word2vec

8. Retrofitting

29 / 90

Page 44: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Goals of reweighting

• Amplify the important, the trustworthy, the unusual;deemphasize the mundane and the quirky.

• Absent a defined objective function, this will remainfuzzy.

• The intuition behind moving away from raw counts isthat frequency is a poor proxy for the above values.

• So we should ask of each weighting scheme:É How does it compare to the raw count values?É How does it compare to the word frequencies?É What overall distribution of values does it deliver?

• We hope to do no feature selection based on counts,stopword dictionaries, etc. Rather, we want our methodsto reveal what’s important without these ad hocinterventions.

30 / 90

Page 45: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

NormalizationL2 norming (repeated from earlier)Given a vector u of dimension n, the L2-length of u is

||u||2 =

n∑

i=1

u2i

and the length normalization of u is�

u1

||u||2,

u2

||u||2, · · · ,

un

||u||2

Probability distributionGiven a vector u of dimension n containing all positive values, let

sum(u) =n∑

i=1

ui

and then the probability distribution of u is�

u1

sum(u),

u2

sum(u), · · · ,

un

sum(u)

31 / 90

Page 46: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Observed/Expected

rowsum(X, i) =n∑

j=1

Xij colsum(X, j) =m∑

i=1

Xij sum(X) =m∑

i=1

n∑

j=1

Xij

expected(X, i, j) =rowsum(X, i) · colsum(X, j)

sum(X)

oe(X, i, j) =Xij

expected(X, i, j)

32 / 90

Page 47: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Observed/Expected

rowsum(X, i) =n∑

j=1

Xij colsum(X, j) =m∑

i=1

Xij sum(X) =m∑

i=1

n∑

j=1

Xij

expected(X, i, j) =rowsum(X, i) · colsum(X, j)

sum(X)

oe(X, i, j) =Xij

expected(X, i, j)

a b rowsum

x 34 11 45y 47 7 54

colsum 81 18 99

oe⇒

a b

x 3445·81

99

1145·18

99

y 4754·81

99

754·18

99

32 / 90

Page 48: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Observed/Expected

rowsum(X, i) =n∑

j=1

Xij colsum(X, j) =m∑

i=1

Xij sum(X) =m∑

i=1

n∑

j=1

Xij

expected(X, i, j) =rowsum(X, i) · colsum(X, j)

sum(X)

oe(X, i, j) =Xij

expected(X, i, j)

Observed

tabs reading birds

keep 20 20 20enjoy 1 20 20

keep and tabs co-occur more thanexpected given their frequencies,enjoy and tabs less than expected

Expected

tabs reading birds

keep 60·21101

60·40101

60·40101

enjoy 41·21101

41·40101

41·40101

=

tabs reading birds

keep 12.48 23.76 23.76enjoy 8.5 16.24 16.24

32 / 90

Page 49: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Pointwise Mutual Information (PMI)

PMI is observed/expected in log-space (with log(0) = 0):

pmi(X, i, j) = log�

Xij

expected(X, i, j)

= log�

P(Xij)

P(Xi∗) · P(X∗j)

d1 d2 d3 d4

A 10 10 10 10B 10 10 10 0C 10 10 0 0D 0 0 0 1

P(w,d) P(w)

A 0.11 0.11 0.11 0.11 0.44B 0.11 0.11 0.11 0.00 0.33C 0.11 0.11 0.00 0.00 0.22D 0.00 0.00 0.00 0.01 0.01

P(d) 0.33 0.33 0.22 0.12

PMI⇓

d1 d2 d3 d4

A −0.28 −0.28 0.13 0.73B 0.01 0.01 0.42 0.00C 0.42 0.42 0.00 0.00D 0.00 0.00 0.00 2.11

33 / 90

Page 50: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Selected PMI valuesSelected PMI values

P(word)

P(context)

P(word, context) =

1.02

0

-0.67

-1.18

0.51 0.17 -0.08

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0 0.25

34 / 90

Page 51: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Positive PMI

The issuePMI is actually undefined when Xij = 0. The usual response isthe one given above: set PMI to 0 in such cases. However,this is arguably not coherent (Levy and Goldberg 2014):

• Larger than expected count ⇒ large PMI• Smaller than expected count ⇒ small PMI• 0 count ⇒ placed right in the middle!?

PPMI

ppmi(X, i, j) = max(0,pmi(X, i, j))

35 / 90

Page 52: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

TF-IDFFor a corpus of documents D:

• Term frequency (TF): P(w|d)

• Inverse document frequency (IDF): log�

|D|�

�{d∈D:w∈d}�

(log(0) = 0)

• TF-IDF: TF × IDF

d1 d2 d3 d4

A 10 10 10 10B 10 10 10 0C 10 10 0 0D 0 0 0 1

⇒IDF

A 0.00B 0.29C 0.69D 1.39

⇓TF

d1 d2 d3 d4

A 0.33 0.33 0.50 0.91B 0.33 0.33 0.50 0.00C 0.33 0.33 0.00 0.00D 0.00 0.00 0.00 0.09

TF-IDFd1 d2 d3 d4

A 0.00 0.00 0.00 0.00B 0.10 0.10 0.14 0.00C 0.23 0.23 0.00 0.00D 0.00 0.00 0.00 0.13

36 / 90

Page 53: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

IDF values

docCount

0 1 2 3 4 5 6 7 8 9 10

000.110.220.360.510.690.92

1.2

1.61

2.3

= corpus size

IDF

= lo

g(10

/ do

cCou

nt)

37 / 90

Page 54: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Selected TF-IDF valuesSelected TF-IDF values

TF

docCount

0.23

0.07

0.01

0.46

0.14

0.02

1.15

0.35

0.05

2.3

0.69

0.11

0.11

0.18

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

0

1

2

3

4

5

6

7

8

9

10

38 / 90

Page 55: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Other weighting/normalization schemes

• t-test: P(w,d)−P(w)P(d)pP(w)P(d)

• TF-IDF variants that seek to be sensitive to the empiricaldistribution of words (For discussion and references,Manning and Schütze 1999:553.)

• Pairwise distance matrices:

dx dy

A 2 4B 10 15C 14 10

cosine⇒

A B C

A 0 0.008 0.116B 0.008 0 0.065C 0.116 0.065 0

39 / 90

Page 56: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Weighting scheme cell-value distributions

0.00 0.25 0.50 0.75 1.00 1.25Cell value 1e8

101

103

105

107

Raw counts

0.0 0.2 0.4 0.6 0.8 1.0Cell value

101

102

103

104

105

106

107

L2 norming

0.0 0.2 0.4 0.6 0.8Cell value

101

103

105

107

Probabilities

0.0 0.5 1.0 1.5 2.0 2.5 3.0Cell value 1e7

101

103

105

107

Observed/Expected

10 5 0 5 10 15Cell value

101

102

103

104

105

106

107

PMI

0 5 10 15Cell value

101

102

103

104

105

106

107

PPMI

0.0 0.5 1.0 1.5 2.0 2.5Cell value

101

103

105

107

TF-IDF

Uses the giga5 matrix loaded earlier. Others look similar.40 / 90

Page 57: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Weighting scheme relationships to counts

Uses the giga5 matrix loaded earlier. Others look similar.41 / 90

Page 58: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Relationships and generalizations• The theme running through nearly all these schemes is

that we want to weight a cell value Xij relative to thevalue we expect given Xi∗ and X∗j.

• Many weighting schemes end up favoring rare eventsthat may not be trustworthy.

• The magnitude of counts can be important; [1,10] and[1000,10000] might represent very different situations;creating probability distributions or length normalizingwill obscure this.

• PMI and its variants will amplify the values of counts thatare tiny relative to their rows and columns.Unfortunately, with language data, these are often noise

• TF-IDF severely punishes words that appear in manydocuments – it behaves oddly for dense matrices, whichcan include word × word matrices.

42 / 90

Page 59: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Code snippets

vsm_01_code4_solved

April 1, 2019

In [1]: import osimport pandas as pdimport vsm

DATA_HOME = os.path.join('data', 'vsmdata')

imdb5 = pd.read_csv(os.path.join(DATA_HOME, 'imdb_window5-scaled.csv.gz'), index_col=0)

imdb5_oe = vsm.observed_over_expected(imdb5)

imdb5_norm = imdb5.apply(vsm.length_norm, axis=1)

imdb5_ppmi = vsm.pmi(imdb5)

imdb5_pmi = vsm.pmi(imdb5, positive=False)

imdb5_tfidf = vsm.tfidf(imdb5)

In [2]: vsm.neighbors('bad', imdb5).head()

Out[2]: bad 0.000000guys 0.823744. 0.844851taste 0.893747guy 0.896312dtype: float64

In [3]: vsm.neighbors('bad', imdb5_ppmi).head()

Out[3]: bad 0.000000good 0.701241awful 0.757309terrible 0.763324horrible 0.763637dtype: float64

1

43 / 90

Page 60: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Code snippets

vsm_01_code4_solved

April 1, 2019

In [1]: import osimport pandas as pdimport vsm

DATA_HOME = os.path.join('data', 'vsmdata')

imdb5 = pd.read_csv(os.path.join(DATA_HOME, 'imdb_window5-scaled.csv.gz'), index_col=0)

imdb5_oe = vsm.observed_over_expected(imdb5)

imdb5_norm = imdb5.apply(vsm.length_norm, axis=1)

imdb5_ppmi = vsm.pmi(imdb5)

imdb5_pmi = vsm.pmi(imdb5, positive=False)

imdb5_tfidf = vsm.tfidf(imdb5)

In [2]: vsm.neighbors('bad', imdb5).head()

Out[2]: bad 0.000000guys 0.823744. 0.844851taste 0.893747guy 0.896312dtype: float64

In [3]: vsm.neighbors('bad', imdb5_ppmi).head()

Out[3]: bad 0.000000good 0.701241awful 0.757309terrible 0.763324horrible 0.763637dtype: float64

1

43 / 90

Page 61: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Subword information

1. High-level goals and guiding hypotheses2. Matrix designs3. Vector comparison4. Basic reweighting5. Subword information6. Visualization7. Dimensionality reduction

a. Latent Semantic Analysisb. Autoencodersc. GloVed. word2vec

8. Retrofitting

44 / 90

Page 62: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Motivation

1. Schütze (1993) pioneered subword modeling to improverepresentations by reducing sparsity, thereby increasingthe density of connections in a VSM.

2. Subword modeling will also

a. Pull morphological variants closer togetherb. Facilitate modeling out-of-vocabulary itemsc. Reduce the importance of any particular tokenization

scheme

45 / 90

Page 63: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Technique

Bojanowski et al. (2016) (the fastText team) motivate astraightforward approach:

1. Given a word-level VSM, the vector for a character-leveln-gram x is the sum of all the vectors of wordscontaining x.

2. Represent each word w as the sum of its character-leveln-grams.

3. Add in the representation of w if available

A linguistically richer variant might use sequences ofmorphemes rather than characters.

Example with 4-gramssuperbly becomes

[<w>sup, supe, uper, perb, erbl, rbly, bly</w>]

46 / 90

Page 64: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Code snippets

vsm_01_code5_solved

April 1, 2019

In [1]: import osimport pandas as pdimport vsm

DATA_HOME = os.path.join('data', 'vsmdata')

imdb5 = pd.read_csv(os.path.join(DATA_HOME, 'imdb_window5-scaled.csv.gz'), index_col=0)

In [2]: imdb5_ngrams = vsm.ngram_vsm(imdb5, n=4)

In [3]: imdb5_ngrams.loc['<w>sup'].values

Out[3]: array([3.41545000e+03, 3.70000000e+01, 4.95458333e+04, ...,2.23950000e+02, 4.64833333e+01, 3.12166667e+01])

In [4]: imdb5_ngrams.shape

Out[4]: (9806, 5000)

In [5]: vsm.get_character_ngrams("superbly", n=4)

Out[5]: ['<w>sup', 'supe', 'uper', 'perb', 'erbl', 'rbly', 'bly</w>']

In [6]: def character_level_rep(word, cf, n=4):ngrams = vsm.get_character_ngrams(word, n)ngrams = [n for n in ngrams if n in cf.index]reps = cf.loc[ngrams].valuesreturn reps.sum(axis=0)

In [7]: superbly = character_level_rep("superbly", imdb5_ngrams)

In [8]: superbly.shape

Out[8]: (5000,)

1

47 / 90

Page 65: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Word-piece tokenizing

Later in the term, we’ll encounter tokenizers that break somewords into subword chunks:

Encode this sentence.

En##codethissentence.

Bert knows Snuffleupagus

BertknowsS##nu##ffle##up##agu##s

48 / 90

https://github.com/google/sentencepiece

Page 66: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Visualization

1. High-level goals and guiding hypotheses2. Matrix designs3. Vector comparison4. Basic reweighting5. Subword information6. Visualization7. Dimensionality reduction

a. Latent Semantic Analysisb. Autoencodersc. GloVed. word2vec

8. Retrofitting

49 / 90

Page 67: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Techniques

• Our goal is to visualize very high-dimensional spaces intwo or three dimensions. This will inevitably involvecompromises.

• Still, visualization can give you a feel for what is in yourVSM, especially if you pair it with other kinds ofqualitative exploration (e.g., using vsm.neighbors).

• There are many visualization techniques implemented insklearn.manifold; see this user guide for an overviewand discussion of trade-offs.

50 / 90

Page 68: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

t-SNE on the giga20 PPMI VSM

80 60 40 20 0 20 40 60 80

60

40

20

0

20

40

60

!

);

.

...

;p

</p><p>

?

___

abandoned

abdomen

abduct

ability

able

abortion

abraham

abroad absence

absorb

abstain

abstract

abu

abundance

abuse

accelerate

acceptaccepted

access

accident

accommodation

accomplish

accord

according

accountaccountingaccounts

accumulate

accuse

accused

ache

achieve

acknowledge

acknowledged

acoustic

acquire

acquit

acre

across acrylic

act

acting

action

actions

active

activists

activities

activity

actoractress

actually

ad

add

added

adding

additionadditional

address

adhesive

adjourn

adjustment

administration

admiralty

admire

admission

admit

admitted

adopt

adore

adorn

adult

adv

advance

advanced

advantage

adventure

adversaryadvertisement

advertiseradvertising

advice

advise

adviser

aerial

aeronautics

affairs

affect

affected

afghanafghanistan

afraid

africa

african

afternoon

age

agencies

agency

agenda

agent

agents

aggravate

aggression

aggressive

ago

agony

agreeagreed

agreement

ahead

aid

aides

aids

aim

aimed

airaircraft

airlineairlines

airplane

airport

aisle

al

al-qaida

alan

alarm

albanian

albert

alcohol

alexander

algorithm

ali

alien

alive

allegations

allegedallegedly

allen

alley

alliance

allies

allow

allowedallowing

allows

alloy

almost

alone

along

alphabet

alreadyalso

alt

alter

alternative

although

aluminum

always

amateur

amaze

ambassador

ambitiousamerica

america's

american

americans

amid

among

amount

amphibians

amuse

amusement

analyst

analysts

analyze

anarchy

anatomy

anchor

ancient

anderson

andrew

angel

angeles

anger

angle

angola

angry

animalanimals

ankle

annihilate

anniversary

announceannouncedannouncement

annoy

annual

anonymity

another

answer

ant

antecedent

antonio

anxiety

anxious

anyone

anything

ap

apart

apartment

apollo

apparel

apparent

apparently

appeal

appeals

appear

appearance

appeared

appears

apple

appliance

apply

appoint

appointedappointment

approach

approvalapproveapprovedapproving

april

aquarium

arab

arabia

arafat

arc

arch

archbishop

architecture

archive

area

areas

argentina

argue

argued

argument

arithmetic

arizona

arm

armed

armor

arms

army

aroma

around

arrange

arrangement

arrestarrested

arrivalarrivearrived

arrow

art

articlearticles

artifact

artillery

artist

artistsarts

ascend

ashamed

ashes

asiaasian

aside

askaskedasking

aspen

asphaltass

assassination

assault

assaulted

assembly

assets

assignment

assist

assistance

assistant

associateassociated

association

assume

astronomer

asylum

athens

athleteathletes

athletics

atlanta

atlantic

atmosphere

atom

atomic

attach

attack

attacked

attacksattemptattempts

attendattended

attention

attitude

attorney

attract

attraction

attribute

audience

aug

august

aunt

austin

australia

australian

austria

author

authorities

authority

autoautomobile

autumnavailable

avenue

average

aviation

avoid

awake

awardawards

aware

awareness

away

awful

b

baby

back

backed

bacon

bad

badge

bag

baghdad

bail

bait

bake

bakery

balance

ball

ballad

balloon

ballots

baltimore

ban

banana

band

bank

banker

banking

bankruptcy

banks

banned

bar

bargain

barn

barrel

barter

base

baseball

based

basic

basin

basis

basket

basketball

bathbathroom

battered

battle

battleship

bay

be

beach

beads

beam

bean

bear

beard

beast

beat

beating

beautifulbeauty

became

become

becomes

becoming

bedbedroom

bee

beef

beer

beethoven

beetles

beg

began

begin

beginner

beginning

begins

begun

behavebehavior

behind

beijing

belgium

belief

believe

believed

believes

bell

belly

belong

belt

bench

benchmark

bend

benefitbenefits

berlin

berry

best

bet

betray

better

beverage

beware

beyond

bias

bible

bicycle

bid

big

bigger

biggest

bike

bikini

bill

billboard

billion

bills

bin

binary

biography

biology

bird

birds

birth

birthday

bishopbishops

bit

bite

bitter

bizarre

black

blade

blair

blame

blamed

blanket

blast

blaze

bleach

blend

bless

blizzard

block

blonde

blood

bloombloomberg

blossom

blow

blue

blues

bluff

blur

blurred

blush

board

boardwalk

boat

bob

bodies

body

boeing boil

bold

bomb

bombingbombings

bombs

bondbonds

bone

book

booklet

books

boom

boost

boot

booth

border

born

borrow

bosniabosnian

boston

bother

bottle

bottom

bought

bounce

bound

boundary

bow

bowl

box

boxer

boxing

boy

boys

brace

bracelet

brain

brake

branch

brand

brandy

brass

brazil

bread

break

breakfast

breaking

breathe

breed

brian

brick

bride

bridgebridges

brief

bright

bring

bringingbritain

british

broad

broadcastbroadcasting

brochure

broil

broke

broken

brokerage

bronze

brother

brothers

brought

brow

brown

brush

brussels

bubble

buck

bucket

buddy

budget

buds

buffer

bug

buildbuilders

buildingbuildings

built

bulb

bulgarian

bull

bulletin

bumble

bump

bun

bunch

bunny

bureau

burger

burial

burn

burnedburning

burst

bury

bus

bushbush's

business

businesses

butter

butterfly

button

buybuying

buzz

c

cab

cabbage

cabin

cabinet

cable

cache

cactus

cafe

cage

cake

calculatecalculation

calendar

calf

calif

california

call

called

callingcalls

came

camel

camera

camp

campaigncampaigning

can

can't

canada

canadian

canal

cancer

candidatecandidates

candles

candy

caninecannon

cannot

canyon

cap

capability

capacity

capital

capitalism

capitulate

captain

capturecaptured

car

carbon

card

cardboard

cardinals

cards

care

career

carlos

carnival

carnivore

carolina

carpet

carriage

carried

carrots

carry

carrying

cars

cart

carter

cartoon

case

cases

casey

cash

cast

castle

cat

catastrophe

catch

category

caterpillars

cathedral

catholiccatholics

cattle

caught

causecaused

cautious

cave

cbs

cd

cdy

cease-fire

ceiling

celebratecelebration

cell

cellar

cells

cement

cemetery

cent

center

centers

central

cents

century

ceramic

cereal

ceremony

certain

certainly

certificatecertify

chain

chair

chairman

challenge

chamber

champagne

champion

championschampionship

championships

chance

chances

chandler

change

changed

changes

changing

channel

chaos

chapel

chapter

charactercharacters

charcoal

charge

chargedcharges

charles

chase

cheap

cheat

chechnya

check

cheek

cheerful

cheerleaderscheers

cheesecheetah

chef

chemical

chemistry

cherish

cherry

chess

chest

chew

chicago

chick

chicken

chief

child

childhood

childish

children

chile

chin

chinachina'schinese

chipmunkchirp

chocolate

choice

choir

choke

choose

chop

chopper

chord

chris

christian

christmas

chronicle

chuck

church

cigar

cigarette

circle

circulation

circumstance

cited

cities

citing

citizen

citizenscitizenship

citrus

city

city's

civil

civilian

civilians

claim

claimed

claims

clarify

clash

class

classic

classics

classroom

clean

clear

clearly

clench

cleveland

clever

click

clients

cliff

climb

clinic

clintonclinton's

clock

close

closed

closely

closer

closet

closing

cloth

clothes

cloud

cloudy

clown

clr

club

clubhouse

clue

co

coach

coaches

coal

coalition

coast

coat

cock

cocktail

code

coffee

cognition

coil

coincoins

cold

collaborate

collage

collapse

collar

colleagues

collect

collection

college

collision

colombia

colonel

colonies

color

colorado

colorful

coloring

colour

colt

columbia

column

combat

combination

combine

combined

come

comes

comfort

coming

comma

commandcommander

comment

commentary

comments

commerce

commercial

commission

commissioner

commit

commitment

committed

committee

common

communicatecommunication

communications

communist

communitiescommunity

commuters

companies

companion

companycompany's

compare

compared

comparison

compete

competence

competition

competitive

complain

complained

complete

completed

completely

complex

compose

composercomposers

compound

comprehend

computationcompute

computercomputers

con

concentrate

concept

concern

concerned

concerns

concert

conclude

conclusion

concrete

concur

condemncondemned

condition

conditions

conduct

conducted

cone

conference

confess

confession

confidence

confident

confirmed

conflict

confuse

confusion

congresscongressional

connect

connection

conquer

conquest

consecutive

conservation

conservative

consider

considered

considering

console

conspiracy

conspire

constellation

constitutionconstitutional

construct

construction

consumer

consumers

contact

contained

container

contemplate

contend

continent

continue

continued

continues

continuing

contractcontracts

contributed

control

controls

controversy

convention

conversation

convertible

convicted

convince

cook

cookies

cooking

cool

cooperate

cooperation

coordinator

cop

cope

copper

copy

copyright

cord

corner

corp

corporate

corporation

correct

correspondencecorridor

corruptcorruption

costcosts

costumecostumes

cottage

cotton

couch

cough

could

council

count

counter

countries

country

country's

county

couple

course

court

courts

cousin

cover

coverage

covered

covering

cow

cox

crab

crack

crackle

cradle

craft

crane

crash crave

crawlcrazy

create

created

creating

creation

creativity

creator

creaturecreatures

credibility

credit

creek

crew

crib

crime

crimes

criminal

crisis

criterion

critic

critical

criticismcriticize

criticized

critics

croak

croatia

crochet

crop

cross

crow

crowd

crowded

crown

crucial

cruel

cruise

crunch

crush

cry

crystal

cuba

cube

cucumber

cuddle

cuisine

culturalculture

cup

curator

curled

currency

current

currently

curtain

curve

customers

cut

cute

cuts

cutter

cutting

cylinder

czech

dad

daffodils

daily

daisy

dallas

damage

damages

damn

dan

dancedancers

dandelion

dangerdangerous

daniel

dare

dark

dash

dashboard

data

database

date

daughter

david

davis

dawn

day

days

de

dead

deadline

deal

deals

death

deaths

debate

debt

dec

decade

decades

decay

deceive

december

decide

decided

decision

decisions

deck

declare

declared

decline

declined

decompose

decoratedecoration

decrease

dedicate

deep

deer

def

defeat

defeateddefeating

defend

defending

defense

defensive

deficit

definedefinition

defrost

degrade

degree

del

delay

delightful

deliverdelivered

delivery

demand

demandeddemanding

demands

democracy

democrat

democratic

democrats

demolishdemolition

demon

demonstrate

denial

denied

dense

dentist

denver

deny

depart

department

departure

depend

deployment

deposit

depression

depth

deputy

descend

descent

describe

described

desert

deserve

designdesigned

desirable

desire

desk

despair

despise

despite

dessert

destroy

destroyed

destruction

detach

details

detaineddeteriorate

determination

determine

determined

detroit

develop

developed

developing

development

device

devil

devote

dew

dialogue

diamond

dice

dictionary

die

died

diego

diet

differ

difference

differences

different

difficult

difficulty

dig

digit

digital

dignity

diminish

diner

dinner

dip

diplomaticdiplomats

direct

directed

direction

directly

director

dirt

dirty

disability

disagree

disallow

disappear

disappoint

disaster

disbelieve

disc

discharge

discipline

discourage

discover

discovered

discovery

discussdiscussed

discussion

disease

disguise

disgust

dish

disintegrate

dislike

dismay

dismissdismissed

disney

disorganize

disown

disperse

display

dispose

disposition

disprove

dispute

disregard

dissolve

distance

distribution

distributor

district

disturb

disturbances

dive

diversion

divide

divided

dividend

diving

division

dlrs

do

dock

doctordoctors

document

documents

dodgers

dog

dole

dollar

dollars

dolls

domain

dome

domestic

dominate

done

donkey

donut

doodle

doordoorway

double

doubt

dow

downtown

doze

dozendozens

dr

draft

drag

dragon

dragonfly

drain

drama

draw

drawer

drawingdrawn

dream

dreary

drench

dress

dressing

drew

drift

drill

drink

drip

dripping

drivedriverdriversdriving

drizzle

drop

droplets

dropped

drought

drove

drown

drugdrugs

drum

dryer

duck

dude

due

duel

dull

dumb

dump

dunk

duplicate

dusk

dutch

duty

dye

e

e-mail

eagle

ear

earlier

early

earnearnedearning

earnings

earth

earthquake

ease

easier

easily

east

easter

eastern

easy

eat

ecology

economic

economist

economy

ecuador

edge

editing

editor

editors

educate

education

effect

effective

effects

effort

efforts

eggeggs

ego

egypt

eight

eighth

einstein

either

el

elastic

elbow

elect

elected

election

elections

electric

electricity

electronic

electronics

elegance

element

elephant

elevator

else

elsewhere

embarrass

embassy

embrace

emergency

emerging

emission

emotion

employeeemployees

employer

en

encourage

encouragement

encyclopedia

end

endedending

endorsement

ends

endurance

energy

enforce

enforcement

engage

engagement

engine

engineering

england

english enjoy

enough

enrageensure

enter

entered

enterprise

entertain

entertainment

entire

entity

entrance

environment

environmental

episcopal

equality

equation

equipment

equity

era

eric

erode

error

erupt

escalator

escape

especially

espionage

essay

essential

established

establishment

estate

estimateestimatedestimates

et

ethnic

eu

euro

europe

european

evacuate

evaluate

even

evening

eventevents

eventually

ever

every

everybody

everyone

everything

evict

evidence

exactly

examinationexamine

examiner

example

exceed

excel

except

exchange

excuse

executiveexecutives

exercise

exhale

exhibitexhibition

exile

exist

existing

exit

exotic

expand

expansion

expect

expectations

expected

expects

expedition

expensive

experience

experts

explainexplanation

explode

exploit

exploitation

explore

explosion

explosive

exports

express

expressed

extended

extra

extractextracts

extremely

eyeeyes

f

fabric

face

faced

faces

facilities

facing

fact

factory

fade

fail

failedfailure

fair

faith

fallfalling

fame

familiar

families

family

famousfans

fantasy

far

farley

farmfarmer

farmers

fashion

fast fasten

father

fault

fauna

favor

favorite

fax

fbi

fear

fears

featherfeathers

featurefeatures

feb

february

fed

federal

federation

fee

feed

feedback

feelfeeling

fees

feetfeline

fell

fellow

feltfemale

fence

ferry

fertility

festival

fever

fewer

fiction

field

fifth

fight

fighters

fighting

figure

figures

file

filed

fill

filled

filmfilmsfinal

finally

finance

financial

find

finding

fine

finger

fingerprint

finishfinished

fire

fired

fireworks

firm

firms

first

fiscal

fish

fishing

fit

five

fix

fla

flag

flame

flamingo

flap

flash

flat

flavor

fledfleefleeing

flesh

flex

flexible

flight

flights

fling

flip

float

flood

floor

flora

florida

flour

flow

flowerflowers

flu

flush

flute

flutter

fly flyerflying

focusfocused

fog

foil

fold

foliage

follow

followed

following

follows

food

foolish

foot

football

footprint

forbear

forbes

forbid

force

forced

forces

ford

forecast

foreign

forest

forged

forget

forgive

form

formal

format

formationformed

former

formula

forswearing

fort

forthcoming

forward

fought

found

foundation

fountain

four

fourth

fox

fragile

fragrance

frame

framework

france

francisco

frank

fraternity

fraud

free

freedom

freeze

french

fresh

freud

friday

friday's

friedman

friend

friendly

friends

frighten

frigid

frog

front

frost

frozen

fruit

frustrate

frustration

fry

fuck

fuel

full

fully

fun

fund

funding

funds

funeral

fungi

funny

fur

furnace

furniture

fury

future

g

gain

gained

gains

galaxy

gallon

gamble

gambling

game

games

garage

garbage

garden

garfield

garlic

garment

gary

gas

gasoline

gate

gateway

gather

gathered

gathering

gauge

gave

gay

gaza

gear

gem

gen

gender

generalgenerally

generation

generous

geniusgenre

gentleman

george

georgia

german

germany

get

gets

getting

giant

giants

giggle

gin

giraffe

girl

girls

give

given

gives

giving

glad

glance

glare

glass

glide

glittering

global

globe

glove

glow

glue

gmt

go

goalgoals

goat

goats

god

goes

going

gold

golden

golf

gone

good

goods

goose

gop

gore

gospel

gossip

got

gov

govermentgovernmentgovernment's

governments

governor

grab

grade

graduate

graffiti

grand

grant

grapes

graphicgraphics

grasp

grass

grate

gravegravestone

graveyard

gray

graze

great

greater

greatest

greece

greek

green

greet

grew

grey

grief

grill

grind

grip

grocery

groom

ground

group

groups

grow

growing

grown

growth

guarantee

guard

guardian

guerrillas

guess

guest

guild

guilt

guilty

guitar

gulf

gull

gulp

gum

gun

guns

guru

gut

guy

guys

gymnastics

h

hack

hair

haircut

half

hallhalloween

hallway

hamas

hamburger

hamlet

hamper

hamster

hancock

hand

handbag

handle

hands

handwriting

hanghanging

happenhappenedhappening

happens

happiness

happy

harborharbour

hard

hardware

harm

harris

harsh

harvard

hat

hatch

hate

haul

haunt

have

hawk

haze

he'shead

headed

headquarters

heads

heal

health

healthy

hearheard

hearing

heart

heat

heaven

heavily

heavy

height

held

helium

hell

helmet

help

helped

helper

helping

hen

henry

herb

heritage

hero

heroin

heroine

hesitate

hide

high

higher

highest

highly

highway

hike

hill

hip

hired

historic

history

hit

hitch

hits

hitting

hockey

holdholding

holds

hole

holiday

holler

hollywood

holy

home

homeless

homer

homes

honest

honey

hong

honolulu

honor

hood

hoothop

hope

hoped

hopes

hoping

horace

horizon

horn

horrible

horse

hose

hospital

hospitals

host

hostility

hot

hotel

hound

hourhours

house

houses

housing

houston

howard

however

hug

huge

hum

human

humiliate

hummingbird

hundredhundreds

hunt

hurricanehurry

hurt

husband

husky

hussein

hustle

hut

hydrogen

hymn

hypertension

hypnotize

hysteria

i'di'll

i'mi've

ice

icon

ideaideas

identified

ignorance

ignore

ii

ill

illegal

illness

illusion

illustrations

imageimages

imagine

imitate

immediate

immediately

immigrantsimmigration

immoral

immunity

impact

impartiality

impatient

impersonate

implementimplementation

implode

importanceimportant

impose

imposed

impossible

impression

imprisoned

improve

improved

improving

impulse

inc

inchinches

incident

incline

include

included

includes

including

income

increaseincreasedincreases

increasing

increasingly

indeed

independence

independent

index

india

indian

indicated

indication

individual

indonesia

industrial

industriesindustry

inexpensive

infection

infinite

inflation

influence

inform

information

infrastructure

inhale

initial

initially

injuredinjuries

injury

ink

inmate

inn

inninginnings

inquire

insaneinsectinsects

insert

inside

insight

insisted

inspect

inspectors

installation

instance

instead

institute

institution

institutions

instruct

instruction

instructor

instrumentinstrumentation

insult

insurance

insure

insured

integrate

integration

intellect

intelligence

intelligent

intend

intended

intensity

interaction

interest

interested

interests

interim

interior

internal

international

internet

interrupt

intervention

interview

intoxicate

introduceintroduced

intuition

invalidate

invent

inventory

investigate

investigatinginvestigationinvestigators

investmentinvestments

investor

investors

invite

invoice

involve

involvedinvolving

ipod

iran

iraqiraq's

iraqi

ireland

iris

irish

iron

irritate

islamic

islandislands

isolationisraelisrael'sisraeliisraelis issue

issued

issues

italian

italy

items

ivy

j

jack

jacket

jackson

jaguar

jail

jamaica

james

jan

january

japan

japan'sjapanese

jar

jaw

jay

jazzjean

jeff

jellyfish

jerk

jerry

jersey

jerusalem

jet

jets

jewel

jewelry

jewishjews

jim

job

jobs

joe

jog

john

johnson

join

joined

joint

joke

jones

jordan

jose

joseph

journal

journal-constitution

journalists

journey

joy

jr

juan

judge

judges

judging

judgment

juice

july

jump

jumper

june

jurisdiction

jury

justice

justify

juvenile

k

kansas

keepkeeping

kennedy

kept

kerry

kevin

key

keyboard

kick

kid

kidnap

kidney

kids

kill

killedkilling

kilometer

kilometers

kim

kind

king

kiss

kitchen

kittens

kitty

klan

knee

kneel

knew

knife

knight

knit

knitting

knock

know

knowledgeknown

knows

kong

koreakoreankosovo

kremlin

l

la

lab

labor

laboratory

lace

lack

lad

laden

lady

lake

lakers

lamb

lamp

land

landlord

landscape

language

lantern

large

largely

larger

largest

larry

las

last

late

later

latest

latex

latin

laugh

launch

launched

laundering

law

lawmakers

lawn

lawslawsuit

lawyer

lawyers

lay

layer

lead

leader

leaders

leadership

leading

leads

leaf

league

lean

leap

learn

learned

least

leather

leave

leaves

leaving

lebanon

led

ledge

lee

left

leg

legal

legion

legislationlegislature

lego

lemon

lend

lens

less

lesson

let

letter

letters

level

levels

lewis

liability

liberal

library

libya

license

licklicking

lie

lien

life

lift

light

lighthouse lighting

lightning

like

likely

lily

limb

limit

limited

limits

limp

lincoln

line

linen

lines

lineup

lingerie

link

linked

lion

liplips

liquid

liquor

list

listed

listen

listing

literature

litter

little

live

lived

liver

liveslivinglizard

load

loanloans

lobster

local

locate

location

locked

locomotives

log

logic

london

long

long-term

longer

look

looked

looking

looks

loop

loosen

los

lose

losing

loss

losses

lost

lot

louis

louisiana

lounge

love

lover

low

lowerlowest

ltd

luck

luggage

lunch

lung

luxury

lyric

mac

machine

mad

made

madhouse

madrid

magazine

magic

magician

magnify

magnitude

magnolia

maid

mail

mainmainly

maintain

major

majority

make

maker

makes

makeup

making

malaysia

malemales

mallard

mammalmammals

man

manage

managed

management

manager

managers

managing

manchester

manhattan

mannequin

manner

manslaughter

manufacturer

many

mapmaple

maradona

marathon

marble

march

marching

mare

marijuana

mark

marked

market

marketing

markets

marks

marriage

married

marrow

marry

mars

martin

mary

mask

mass

massachusetts

massive

matchmatches

mate

materialmaterialsmatter

matters

maxwell

may

maybe

mayor

meal

mean

means

meant

meanwhile

measure

measures

meat

mechanic

mechanism

medal

media

medicalmedicine

medium

meet

meetingmeetings

melody

melt

membermembers

memorabilia

memorial

memory

men

men's

mend

menumere

merger

message

met

metal

metermeters

metro

mets

mexican

mexico

miami

michael

michigan

microsoft

microwave

midday

middle

might

mike

milan

mile

miles

militantmilitants

militarymilitia

milk

mill

miller

million

millions

milosevic

mimic

mind

miniature

minister

ministers

ministry

mink

minnesota

minor

minority

minute

minutes

mirror

misery

miss

missed

missilemissiles

missing

mission

misspend

mist

mistake

mistreat

misty

mix

mixed

mixture

mm

mob

mode

model

modern

modest

modification

molecule

mom

moment

momentum

monday

monetary

money

monk

monkeys

monster

monthmonths

montreal

monument

mood

moon

morality

morgan

morning

mortal

mortality

moscow

moss

mostly

motel

mother

motion

motive

motor

motorcycle

motto

mount

mountain

mouse

mouth

move

moved

movement

moves

moviemovies

moving

mow

mozart

mph

mr

mrsms

much

mud

mug

multiply

munch

munich

mural

murder

murdered

murphy

muscle

museum

mushrooms

music

musician

musicians

muslimmuslims

mussolini

must

mustard

mutual

mystery

myth

n

nag

nailnails

name

named

names

nap

narrow

nasdaq

nation

nation's

national

nations

nationwide

native

nato

natural

nature

navy

nazi

nba

near

nearby

nearly

necessary

neck

necklace

need

needed

needle

needs

neglect

negotiations

negroes

neighborhood

neighboring

neighbors

neither

neon

nephew

nerve

nervousnest

net

netanyahu

netherlands

networknetworksneutral

nevernew

news

newspaper

newspapers

next

nfl

nice

nick

nickel

night

nine

ninth

nn

nobody

noise

none

noodlesnoon

normal

north

northeast

northern

northwest

nose

note

notebook

noted

notes

nothing

notice

noticeable

notify

nov

novel

november

novice

nuclear

nude

nuisance

nullify

numbernumbers

nursenurses

nut

nutrition

nuts

nyt

oak

obey

object

objective

obligation

observation

observatory

observe

obtain

obvious

obviously

occasion

occupation

occur

occurred

occurrence

ocean

oct

october

odd

odyssey

offend

offense

offensive

offer

offered

offering

offersoffice

officer

officers

offices

officialofficials

often

ohio

oil

oklahoma

old

older

oldest

olympicolympics

one

ones

onion

online

onto

opec

open

opened

opening

opera

operate

operating

operation

operations

operative

operator

opinion

opponent

opponents

opportunities

opportunity

opposed

oppositionoption

optional

optionsoracle

orange

orchestra

orchids

order

ordered

orders

organ

organism

organizationorganizations

organizeorganized

origami

origin

original

originate

orleans

orthodontist

others

otherwise

outdoor

outfit

outlet

outnumber

outside

oval

overall

overcome

overflow

overhead

overnight

overpower

overseas

overwhelm

owe

owl

own

owned

owner

owners

owns

oxoxford

oxygen

p

pace

pacific

package

packaging

packed

packet

pact

padding

page

pages

paid

pain

paint

painted

painting

pair

pakistan

palace

palestinianpalestinians

palm

pan

panda

panel

panorama

paper

papers

parade

paragraph

parcel

pardon

parent

parents

paris

parish

parkparking

parliamentparliamentary

parrot

part

participant

participate

particular

particularly

parties

partly

partnerpartners

parts

party

pass

passagepassed

passengers

passing

passion

passport

past

pat

patch

patent

path

patients

patio

patrick

patternpatterns

paul

pause

paws

paypaying

payment

payments

peace

peaceful

peacock

pearl

pebbles

peel

pelican

pelt

pen

penalty

pencil

pennsylvania

pentagon

people

people's

pepper

per

perceive

percent

percentage

perception

perfect

performperformance

performer

perfumeperhaps

period

perish

perjury

permanent

permission

permit

person

personal

personnel

perspire

persuade

pet

petals

peter

pets

phantom

phenomenon

philadelphia

philip

philippines

phoenix

phone

photo

photographer

photography

photos

phrase

physical

physician

physics

piano

piazza

pick

picked

picturepictures

pie

piecepieces

pier

pig

pigeon

pigs

pillow

pilot

pilots

pin

pinch

pink

pinnacle

pipe

pitch

pittsburgh

pizzaplace

placed

places

plague

plan

plane

planes

planet

plannedplanning

plans

plant

plantation

plants

plastic

plate

plates

playplayedplayerplayers

playground

playing

playoffplayoffs

plays

pleaplead

please

pledges

plenty

plot

plow

pluck

plus

poach

pod

poem

poetpoetry

pointpoints

poise

poker

poland

pole

poles

police

policiespolicy

polish polite

political

politician

politicians

politics

pollpolls

pollution

polo

polyester

pond

ponder

poodle

pool

poor

pop

popcorn

pope

poppy

popular

population

porch

pork

port

portion

portrait

portray

position

positions

positive

possess

possession

possibilitypossible

possibly

post

postage

postcards

posted

poster

pot

potato

potatoes

potential

poultry

pounce

poundpounds

pour

poverty

powell

powerpowerful

powers

practice

prayprayer

precedent

predict

predicted

prefer

preference

pregnancy

pregnant

prejudice

premier

preparationprepareprepared

preparing

presence

presentpresented

preservation

presidencypresident

president's

presidential

press

pressure

prestige

presume

pretend

pretty

prevent

preview

previous

previously

prey

price

prices

prick

pride

priest

primary

prime

princeprincess

print

printer

priority

prison

prisoners

privateprivileges

prize

probability

probably

problemproblems

procedure

proceedings

process

processing

proclaim

produce

produced

producer

product

production

products

profession

professional

professionals

professor

profitprofits

programprograms

progress

prohibit

project

projector

projects

prominence

prominent

promise

promised

promote

promoter

proof

propaganda

proper

property

proposalproposalsproposed

prose

prosecute

prosecutorprosecutors

prosper

protectprotection

protective

protest

protestant

protesters

protests

protocol

proton

proud

prove

provideprovided

provides

providing

province

proving

proximity

psychiatrypsychology

pub

public

publication

publicly

published

puddlepuerto

pugpull

pulled

pump

pumpkin

punch

punish

punishment

punk

pup

pupil

puppy

purchase

purple

purpose

purse

pursue

pushpushedpushing

putputting

pyramid

q

quality

quantity

quarantine

quarrel

quarter

quarterback

quartz

queen

queens

quench

query

quest

question

questions

queue

quick

quickly

quiet

quit

quite

quiz

quota

quotationquote

quoted

r

rabbi

rabbit

race

racer

racesracing

racism

racket

radar

radiation

radical

radio

railrailroad

rails

railway

rain

rainbow

raiseraised raising

rally

ran

range

rangers

rap

rapid

rare

rash

raspberry

rat

raterates

ratherrationalize

rattle

ray

reachreached

reaction

read

reader

reading

readyreal

reality

realize

really

reap

reappear

rearrangereason

reasons

rebelrebels

recall

recalled

receivereceived

receiver

recent

recently

recess

recognition

recommend

recommendation

record

recorder

records

recovery

recreation

recruit

recyclerecycling

red

redhead

reduce

reduced

reef reel

reference

referring

reflect

reflection

reform

reforms

refrain

refrigerator

refugees

refuserefused

regime

region

regional

registerregistration

regret

regular

reichstag

rejectrejected

rejoice

related

relation

relations

relationship

relativerelatively

relatives

relax

relaxation

relaxed

relay

release

released

relief

religion

religious

rely

remain

remained

remaining

remains

remark

remarks

remedy

remember

remindremove

removed

renounce

rental

rep

repair

repeal

repeat

repeatedly

replacereplaced

reply

report

reported

reportedly

reporter

reporters

reports

representation

representativerepresentatives

represents

repress

reprimand

reproduce

reptiles

republic

republicanrepublicans

request requirerequiredrequires

rescuerescued

research

researchers

reservation

reserve

residents

resolution

resort

resources

respect

respond

responded

response

responsibilityresponsible

rest

restaurant

restless

restore

restrict

result

results

resume

retail

retailer

retain

retire

retired

retirement

retiring

retreat

return

returnedreturning

returns

revenuerevenues

review

revolution

reward

rhodes

rhythm

rice

rich

richard

rick

ride

ridge

ridicule

right

rights

ring

rinse

rip

ripples

riserising

risk

rival

river

riverside

rn

road

roam

roar

rob

robert rock

rocker

rocket

rod

rodents

role

roles

roll

romance

roof

rook

room

rooms

rooster

root

rope

rose

rough

round

route

rover

row

rowing

royal

rub

rubber

rubbish

rugby

ruins

rule

ruled

rules

ruling

run

running

runs

rush

russia

russia'srussian

rust

rusty

sabbath

sad

saddam

safe

safety

said

sailsailing

sailor

saint

saints

sake

salad

salary

sale

sales

salt

salute

san

sanctions

sand

sandwich

santa

sat

satan

satellite

satin

satisfy

saturday

saturday's

sauce

saudi

sausage

save

savings

saw

say

saying

says

scale

scandal

scarce

scare

scene

scenery

schedule

scheduled

scheme

school

schools

sciencescientist scientists

scold

scooter

scorescored

scores

scoring

scotch

scott

scottish

scouting

scramble

scratch

scream

screen

scribble

script

scrutiny

sculpture

sea

seafood

seagull

search

seashore

seasonseasons

seatseats

seattle

second

seconds

secret

secretary

section

sector

secure

securities

security

see

seed

seeing

seekseeking

seem

seemed

seems

seen

seepage

segment

seize

seized

seizure

selectselection

self

sellselling

seminar

sensenatesenator

send

sending

senior

sense

sent

sentencesentenced

sentiment

sentry

separate

separation

sept

september

serbserbs

serial

series

serious

seriously

serve

served

server

service

services

serving

session

set

sets

setting

settle

settlement

seven

seventh

several

severe

sewsewing

sexsexual

sexy

shade

shadow

shake

shame

share

shareholders

shares

shariff

shark

sharon

sharp

shatter

shed

sheep

sheet

shell

shelter

sheriff

shift

shine

ship

shirt

shiver

shock

shoes

shoot

shooting

shopshopping

shore

short

shortage

shortly

shot

shots

shoulder

shove

show

showed

shower

showingshown

shows

shrink

shuffle

shun

shut

sicksickness

side

sides

sidewalksight

sign

signal

signature

signed

significant

signs

silence

silhouette

silver

similar

similarity

simple

simply

simulation

since

sing

singapore

singer

single

singles

sink

sinner

sip

sister

sit

sitesites

sitting

situation

six

sixteen

sixth

size

skate

skateboard

skatingsketch

skiskiing

skill

skin

skipskirt

skull

sky

skylineskyscraper

slap

slash

slaveslaveryslaves

slay

slaying

sleep

sleeve

slice

slide

slightly

slip

slither

slope

slow

slug

slurp

sly

small

smaller

smart

smash

smear

smell

smile

smith

smoke

smoking

smother

smudge

snake

snap

snatchsneak

sneeze

sniff

snoozesnore

snow

snowboarding

snowman

snuggle

so-called

soak

soap

soar

sob

soccer

social

socialism

society

socks

sofa

softball

software

soil

solar

sold

soldier

soldiers

sole

solid

solution

solve

someonesomething

sometimes

son

songsongs

soon

soothe

sorrow

sort

sought

soul

sound

soup

source

sources

south

southeast

southern

sovietsoviets

sow

sox

soybean

space

spaghetti

spain

spanish

spank

spare

spark

spawnspeak

speaker

speaking

special

species

specific

speculation

speech

speed

spell

spend

spending

spent

sphere

spider

spill

spin

spine

spiral

spirit

splash

split

spoil

spoke

spokesmanspokeswoman

spoon

sport

sports

spot

sprain

spray

spread

spring

springfield

sprinkle

sprint

spy

squad

square

squash

squeaksqueal

squeeze

squint

squirrel

sri

st

stab

stability

stadium

staff

stage

stain

staircase

stairs

stake

stalin

stamp

stance

stand

standard

standards

standingstands

stanley

star

stare

stars

start

started

starter

startingstarts

starve

statestate's

statement

statements

states

station

stations

statue

status

stay

stayed

steak

steal

steel

steeple

steer

stem

stencil

step

stephen

steps

steve

stevenson

stew

stewart

stick

stickers

still

sting

stink stir

stitch

stitches

stock

stockings

stocks

stomach

stone

stood

stop

stopped

store

stores

stories

stork

storm

story

stove

straight

strain

strange

stranger

strategy

straw

strawberry

stray

streak

stream

street

streets

strength

stretch

strike

strikes

string

strip

stripes

strive

stroke

strongstronger

struck

structure

struggle

struggling

stud

studentstudents

studiesstudy

stuff

stumble

stupid

style

subcommittee

subject

submarine

submit

substance

subtract

subway

succeed

successsuccessful

suck

suddenly

suds

sue

sufferedsuffering

sugar

suggest

suggested

suicide

suit

sum

summer

summit

sun

sunday

sunday's

sunflower

sunglasses

sunk

sunlight

sunny

sunrisesunset

sunshine

super

supper

suppliessupply

supportsupported

supporter

supporters

suppose

supposed

supreme

sure

surf

surface

surfers

surgery

surname

surpass

surprise surprised

surprises

survey

survive

survived

sushi

suspect

suspected

suspects

suspended

suspicion

swallow

swamp

swan

swap

sway

swear

sweat

sweater

sweden

sweet

swimswimming

swimsuitswing

swiss

switzerland

swoon

sydney symbol

syndicate

syria

system

systems

table

tableware

tackle

tag tail

taiwan

take

taken

takes

taking

taliban

talktalked

talking

talks

tall

tampa

tan

tank

tap

target

targets

tarnish

task

taste

tattoo

tax

taxation

taxes

taxi

taxpayer

taylor

tea

teach teacher

teachers

teaching

team

team's

teams

tear

tears

tease

technical technician

technology

teeth

tel

telecommunications

telephone

television

tell

telling

temper

temperature

temple

tend

tennis

tenor

tent

term

terms

terrible

terrier

terrific

territory

terrorterrorismterroristterrorists

testtested

testifytestimony

testingtests

texas

text

textbook

textile

texture

thailand

thanks

that's

thaw

theatertheatre

theft

theme

theory

there's

they're

they've

thief

thingthings

think

thinkingthinks

third

thomas

though

thoughtthousands

threat

threatened

threats

three

threw

throat

throne

throughout

throw

thumb

thunderstorm

thursday

thus

ticker

tickettickets

tickle

tietied

ties

tiger

tight

tiles

tim

timber

time

timer

times

tin

tiny

tire

title

toast

tobacco

today

today's

toe

toes

together

toil

toilet

tokyo

told

tolerance

tom

tomato

tone

tongue

tonight

tons

tony

took

tool

tooth

top

topic

toronto

toss

total

tote

touch

tough

tour

touristtourists

tournament

tow

toward

tower

town

towns

toytoys

track

tract

trade

traded

traders

trading

traditional

traffic

tragedy

trail

train

trainer

training

tram

transformation

transformers

transmission

transplant

transporttransportation

trap

travel

traveling

tread

treasury

treat

treatedtreatment

treaty

treetrees

trial

tribunal

trick

tried

trim

trinity

trip

triumph

troops

tropical

trouble

troubles

truck

trucks

true

trunk

trust

truth

trytrying

tsunami

tub

tuesday

tug

tulip

tumble

tune

tunnel

turkey

turkish

turks

turn

turned

turning

turns

tv

twice

twigs

twist

two

type

u

ultimate

umbrella

unable

uncle

undated

underground

understand

underwater

unemployment

unhappy

uniform

union

unit

unite

united

units

universe

university

unless

unlike

unlikely

unload

unnecessary

unusual

update

upon

upset

urban

urge

urged

us

useused

users

usesusing

usually

usurp

v

vacate

vacation

valentine

valley

valor

value

van

vanish

variety

various

vary

vault

veer

vehiclevehicles

vein

venture

verdict

verify

verse

version

vessel

veteran

via

vice

victim

victims

victor

victory

video

vietnamvietnamese

view

viewer

views

villa

village

villages

vine

vinegar

vintage

vinyl

violation

violenceviolent

violet

violin

virginia

virtually

virtuoso

virus

vision

visit

visitedvisiting

visitorvisitors

vitamin

vocal

vodka

voice

volcano

volume

volunteer

vomit

votevotedvotersvotes

voting

voyage

vs

w

wag

wagner

wagon

waist

wait

waiting

wake

walk

walked

walking

wall

wander

wantwanted

wants

war

warn

warned

warner

warning

warranty

warrior

warships

wash

washerwashing

washington

wasp

waste

watch

watched

watching

waterwaterfall

wave

way

ways

we'llwe'rewe've

weak

wealth

weapon

weapons

wear

wearing

weather

weave

web

wed

wedding

wednesday

weed

week

week's

weekend

weekly

weeks

weep

weight

weird

welcome

welfare

well

went

west

western

wet

whale

what's

whatever

wheat

whether

whip

whiskers

whiskey

whisper

whistle

white

whole

whose

wide

widely

wife

wig

wiggle

wild

william

williams

willing

wilson

win

wind

windmill

window

windows

wine

wing

wingswinner

winningwins

winter

wipe

wire

wisdom

wise

wit

withdrawwithdrawal

withdrawn

within

without

witnesses

wizard

wolf

woman

women

women's

wonderful

wood

woodland

woods

wool

word

words

workworked

worker

workers

working

workplace

worksworkshopworld

world's

worldsources

worldwide

worm

worried

worryworse

worst

worth

would

wounded

wrap

wreck

wrist

write

writer

writes

writingwritten

wrong

wrote

x

yacht

yale

yankees

yardyards yarn

yawn

year

year's

year-old

yearn

years

yell

yellow

yeltsin

yen

yes

yesterday

yet

yield

york

young

younger

youth

yugoslavyugoslavia

zealand

zebra

zinc

zombie

zone

zoo

51 / 90

Page 69: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

t-SNE on the giga20 PPMI VSM

cooking conflict

51 / 90

Page 70: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

t-SNE on the imdb20 PPMI VSM

60 40 20 0 20 40 60

80

60

40

20

0

20

40

60

80

!

);

.

.....

:)

?

abandoned

abdomen

abduct

ability

able

abortion

abraham

absence

absolute

absolutely

absorb

abstain

abstract

absurd

abundance

abuse

academy

accelerate

accent

accept

access

accident

accommodation

accomplish

accordingaccount

accumulate

accurate

accuse

acheachieve

acknowledge

acoustic

acquire

acquit

acre

across

acrylic

act

acted

acting

action

actions

activity

actor

actors

actressactresses

acts

actual

actually

ad

adam

adaptation

addadded

adding

addition

address

adds

adhesive

adjourn

adjustment

administration

admiralty

admire

admission

admit

adopt

adore

adorn

adultadults

advance

adventure

adventures

adversary

advertisement

advertiser

adviceadvise

aerial

aeronautics

affair

affect

afghanistan

afraid

africa

african

afternoon

age

agency

agent

ages

aggravate

aggression

ago

agony

agree

agreement

ahead

aid

aim

ain't

airaircraft

airplane

airport

aisle

al

alan

alarm

albeit

albert

alcohol

alex

algorithm

alice

alienaliens

alive

allen

alley

allowallowed

allows

alloy

almost

alone

along

alphabet

already

alright

also

alt

alter

alternative

although

aluminum

always

amateur

amaze

amazed

amazing

amazingly

ambitious

america

american

americans

amongamongst

amount

amphibians

amuse

amusement

amusing

analyze

anarchy

anatomy

anchor

ancient

anderson

andyangel

anger

angle

angles

angola

angry

animalanimals

animatedanimation

anime

ankle

annanna

anne

annihilate

anniversary

announce

announced

announcement

annoy

annoying

another

answeranswers

ant

antecedent

anthony

anxiety

anxious

anybodyanymore

anyone

anythinganyway

anywhere

apart

apartment

apollo

apparel

apparent apparently

appeal

appealing

appear

appearance

appeared

appears

apple

appliance

apply

appoint

appointment

appreciateappreciated

approach

appropriate

approval

approve

approving

aquarium

arafat

arc

arch

archbishop

architecture

archive

area

argentina

argue

argument

arithmetic

arm

armor

arms

army

arnold

aroma

around

arrange

arrangement

arrest

arrivalarrivearrives

arrow

art

arthur

article

artifact

artillery

artist

artistic

arts

ascend

ashamed

ashes

asia

asian

aside

ask

askedasking

asks

asleep

aspectaspects

aspen

asphalt

ass

assassination

assault

assaulted

assembly

assets

assignmentassist

associate

association

assume

astronomer

asylum

athens

athleteathletics

atmosphere

atom

atomic

attach

attackattacks

attemptattempts

attend

attention

attitude

attorney

attract

attraction

attractive

attribute

audience

audiences

august

aunt

author

autoautomobile

autumn

available

avenue

average

aviation

avoid

awake

awardawards

aware

awareness

away

awesome

awful

awkward

b

baby

back

background

bacon

bad

badge

badly

bag

bail

bait

bake

bakery

balance

ball

ballad

balloon

ballots

ban

banana

band

bankbanker

bankruptcy

bar

barely

bargain

barn

barrel

barter

base

baseball

based

basic

basically

basin

basis

basket

basketball

bath

bathroom

batman

battered

battlebattles

battleship

bay

be

beach

beads

beam

bean

bear

beard

beast

beat

beautifulbeautifullybeauty

became

become

becomes

becoming

bedbedroom

bee

beef

beer

beethoven

beetles

beg

began

begin

beginner

beginning

begins

behave

behavior

behind

belief

believable

believe

believed

believes

bell

belly

belong

belt

ben

bench

benchmark

bend

berlin

berry

besides

best

bet

betray

better

beverage

beware

beyond

bias

bible

bicycle

bigbigger

biggest

bike

bikini

bill

billboard

billion

billy

bin

binary

biography

biology

birdbirds

birth

birthday

bishop

bishops

bit

bite

bits

bitter

bizarre

black

blade

blair

blame

bland

blanket

blaze

bleach

blend

bless

blind

blizzard

block

blockbuster

blonde

bloodbloody

bloom

blossom

blow

blown

blue

blues

bluff

blurblurred

blush

board

boardwalk

boat

bob

bodiesbodyboil

bold

bomb

bond

bone

bookbooklet

books

boom

boost

boot

booth

border

boredboring

born

borrow

boss

boston

bother

bothered

bottle

bottom

bought

bounce

bound

boundary

bow

bowl

box

boxerboxing

boy

boyfriend

boys

br

brace

bracelet

brad

brain

brake

branch

brand

brandy

brass

brave

brazil

bread

break

breakfast

breaking

breaks

breathbreathe

breathtaking

breed

brian

brick

bride

bridge

bridges

brief

bright

brilliant

brilliantly

bringbringing

brings

britain

british

broad

broadcastbroadcasting

brochure

broil

broken

brokerage

bronze

brooks

brother

brothers

brought

brow

brown

bruce

brush

brussels

brutal

bubble

buck

bucket

buddy

budget

buds

buffer

bug

build

builders

building

built

bulb

bulgarian

bull

bulletin

bumble

bumpbun

bunch

bunny

burger

burial

burnburnedburning

burst

burton

bury

bus

bushbusiness

butter

butterfly

button

buy

buzz

c

cab

cabbage

cabin

cable

cache

cactus

cafe

cage

cake

calculate

calculation

calendar

calf

call

called

calling

calls

came

camel

cameo

camera

cameron

camp

campaign

campaigning

cancan't

canal

cancer

candidate

candles

candy

canine

cannon

cannot

cant

canyon

cap

capability

capable

capital

capitalism

capitulate

captain

capturecaptured

captures

car

carbon

card

cardboard

cardinals

care

career

cares

carnival

carnivore

carpet

carrey

carriage

carried

carries

carrots

carry

cars

cart

carter

cartooncartoons

casecases

casey

cash

castcasting

castle

cat

catastrophe

catch

category

caterpillars

cathedral

catholics

cattle

caught

cause

caused

cautious

cave

cd

ceiling

celebratecelebration

cell

cellarcement

cemetery

cent

center

central

century

ceramic

cereal

ceremony

certain

certainly

certificate

certify

cgi

chain

chair

challenge

chamber

champagne

championchampionship

chan

chance

chandler

change

changed

changeschanging

channel

chaos

chapel

chapter

character

character's

characters

charcoal

charge

charged

charlescharlie

charm

charming

chase

cheap

cheat

check

cheek

cheerful

cheerleaders

cheers

cheese

cheesy

cheetah

chef

chemical

chemistry

cherish

cherry

chess

chest

chew

chick

chicken

chief

childchildhood

childish

children

children's

chile

chilling

chin

chinese

chipmunk

chirp

chocolate

choice

choices

choir

choke

choose

chop

chopper

chord

chose

chosen

chris

christchristian

christmas

christopher

chronicle

chuck

church

cigarcigarette

cinema

cinematic

cinematography

circle

circulation

circumstancecircumstances

citizen

citizenship

citrus

city

claim

clarify

clark

clash

class

classicclassics

classroom

clean

clear

clearly

clench

clever

clichéclichés

click

cliff

climax

climb

clinic

clock

close

closer

closet

closing

clothclothes

cloud

clown

club

clubhouse

clue

coach

coal

coast

coat

cock

cocktail

code

coffee

cognition

coil

coin coins

cold

collaborate

collage

collapse

collar

collect

collection

college

collision

colonel

colonies

color

colorful

coloringcolorscolour

colt

column

combinationcombinecombined

come

comedic

comediescomedy

comes

comfort

comiccomics

coming

comma

commandcommander

comment

commentary

comments

commerce

commercial

commission

commit

commitment

committed

common

communicatecommunicationcommunity

commuters

companion

company

comparecomparedcomparison

compelling

compete

competence

competition

complain

complaint

complete

completely

complex

complicated

composecomposercomposers

compound

comprehend

computation

compute

computer

con

concentrateconcept

concern

concerned

concert

conclude

conclusion

concrete

concur

condemn

condemned

condition

conditions

cone

confess

confession

confidenceconfident

conflict

confuse

confused

confusing

confusion

congress

connect

connection

conquerconquest

conservation

considerconsidered

considering

console

conspiracy

conspire

constantconstantly

constellation

constitution

construct

construction

consumer

contact

contain

container

contains

contemplate

contemporary

contend

content

context

continent

continue

continues

contract

contrast

contrived

control

controversy

conversation

convertible

convinceconvinced

convincing

cook

cookies

cooking

cool

cooperate

coordinator

cop

cope

copper

cops

copy

cord

core

corner

corny

corporation

correct

correspondence

corridor

corruptcorruption

cost

costume

costumes

cottage

cotton

couch

cough

could

could've

council

count

counter

country

couple

course

court

cousin

cover

covering

cow

crab

crack

crackle

cradle

craft

crafted

crane

crap

crash

crave

crawl

crazy

create

createdcreatescreating

creation

creativecreativity

creator

creaturecreatures

credibility

credit

credits

creek

creepy

crew

crib

crimecriminal

crisis

criterion

critic

criticalcriticismcriticize

critics

croak

crochet

crop

cross

crow

crowd

crowded

crowe

crown

crucial

cruelcruise

crunch

crush

crycrying

crystal

cube

cucumber

cuddle

cuisine

cult

culture

cup

curator

curious

curledcurrency

current

curtain

curve

customers

cut

cute

cuts

cutter

cutting

cylinder

cynical

dad

daffodils

daisy

damage

damages

damn

dan

dancedancersdancing

dandelion

dangerdangerous

daniel

danny

dare

dark

darker

darkness

dash

dashboard

database

date

dated

daughter

david

davis

dawndaydays

de

dead

deadly

deal

dealingdeals

dean

deathdeaths

debate

debt

debut

decade

decades

decay

deceive

decent

decide

decided

decides

decision

deck

declare

decline

decompose

decorate

decoration

decrease

dedicate

deep

deeper

deeply

deer

defeat

defeating

defend

defensive

deficit

define

definitely

definition

defrost

degrade

degree

delay

delightdelightful

deliverdelivered

delivers

delivery

demand

demolishdemolition

demon

demonstrate

denial

dennis

dense

dentist

deny

depart

department

departure

depend

depicteddepiction

deployment

deposit

depp

depressing

depression

depth

deputy

descend

descent

describe

described

desert

deserve

deserveddeserves

design

designed desirable

desire

desk

despair

desperate

despise

despite

dessert

destroydestroyeddestruction

detach

detaildetails

detective

deteriorate

determination

determine

determined

develop

developed

development

device

devil

devote

dew

dialogdialogue

diamond

dice

dickdictionary

die

dieddies

diet

differdifference

different

difficult

difficulty

dig

digit

dignity

diminish

diner

dinner

dip

direct

directed

directingdirection

directly

director

director's

directors

dirt

dirty

disability

disagree

disallow

disappear

disappoint

disappointeddisappointing

disappointment

disaster

disbelief

disbelieve

disc

discharge

discipline

discourage

discover

discovered

discovers

discovery

discussdiscussion

disease

disguise

disgust

disgusting

dish

disintegrate

dislike

dismay

dismiss

disney

disorganize

disown

disperse

display

dispose

disposition

disprove

disregard

dissolve

distance

distributiondistributor

disturb

disturbances

disturbing

dive

diversion

divide

dividend

diving

division

do

dock

doctor

document

documentary

dogdogs

dollardollars

dolls

domain

dome

dominate

donald

done

donkey

donut

doodle

door

doorway

double

doubt

douglas

downtown

doze

dr

dracula

draft

drag

dragon

dragonfly

drain

drama

dramatic

draw

drawer

drawing

drawn

dreamdreams

dreary

drench

dressdresseddressing

drew

drift

drill

drink

dripdripping

drive

driven

driver

driving

drizzle

drop

droplets

dropped

drought

drown

drugdrugs

drum

drunk

dry

dryerduck

dudedue

duel

dull

dumb

dump

dunk

duplicate

dusk

duty

dvd

dye

dying

e

eagle

ear

earlier

early

earnearning

earth

earthquake

ease

easily

east

easter

eastwood

easy

eateating

ecology

ecuador

ed

eddie

edge

edited

editing

editor

educate

edward

eerie effect

effective

effectively

effects

effort

efforts

eggeggs

ego

eight

einstein

either

elastic

elbow

elect

electronics

elegance

elementelements

elephant

elevator

elizabeth

else

embarrass

embrace

emergency

emission

emotionemotional

emotionallyemotions

empire

employeeemployer

empty

encounter

encourage

encouragement

encyclopedia

end

ended

ending

endless

endorsement

ends

endurance

enemy

energy

enforce

engage

engagement

engaging

engine

engineering

england

english

enjoy

enjoyableenjoyed

enjoying

enough

enrage

ensemble

ensure

enter

enterprise

entertain

entertained

entertaining

entertainment

entire

entirely

entity

entrance

environment

epic

episcopal

episodeepisodes

equal

equality

equally

equation

equipment

era

eric

erode

error

erupt

escalator

escape

especially

espionageessay

essenceessential

essentially

establishment

etc

europe

european

evacuate

evaluate

even

evening

eventevents

eventually

ever

every

everybody

everyday

everyone

everything

everywhere

evict

evidence

evil

exact

exactly

examinationexamine

examiner

example

exceed

excel

excellent

except

exception exceptional

exchange

excited

excitementexciting

excuse

executedexecution

executive

exhale

exhibit

exhibition

exile

exist

existence

exit

exotic

expand

expect

expectationsexpectedexpecting

expedition

experience

experienced

experiences

explainexplained

explains

explanation

explode

exploit

exploitation

explore

explosion

explosions

explosive

express

expressionexpressions

extended

extent

extra

extract

extracts

extraordinary

extras

extreme

extremely

eye

eyes

f

fabric

fabulous

face

faces

fact

factor

factory

facts

fade

fail

failed

fails

failure

fair

fairlyfairy

faith

faithful

fake

fall

falling

fallsfalse

fame

familiar

families

family

famous

fanfans

fantastic

fantasy

far

fare

farley

farmfarmer

fascinating

fashion

fast

fasten

fat

fate

father

fault

fauna

favor

favoritefavoritesfavourite

fbi

fear

featherfeathers

feature

featured

features

featuring

fee

feed

feedback

feelfeeling

feelings

feels

feet

feline

fell

fellow

felt

female

fence

ferry

fertility

festival

fever

fiction

field

fightfightingfights

figure

figured

fill

filled

film

film's

film-making

filmedfilming

filmmaker

filmmakers

films

final

finale

finally

find

finding

finds

fine

finest

finger

fingerprint

finish

finished

fire

fireworks

first

fish

fishing

fit

fits

five

fix

flag

flame

flamingoflap

flash

flashbacks

flat

flavor

flaw

flawed

flawless

flaws

fleefleeing

flesh

flexflexible

flick

flicks

flight

fling

flip

float

flood

floor

flora

flour

flow

flowerflowers

flu

flush

flute

flutter

fly

flyer

flying

focusfocused

focuses

fog

foil

foldfoliage

folks

follow

followed

following

follows

food

foolish

foot

footage

football

footprint

forbear

forbes

forbid

force

forced

forces

ford

forecastforeign

forest

forever

forged

forget

forgive

forgot

forgotten

form

formal

format

formation

former

formula

forswearing

forth

forthcoming

forward

foster

found

foundation

fountain

four

fourth

fox

fragile

fragrance

frame

framework

france

franchise

frank

frankly

fraternity

fraud

fred

freddy

free

freedom

freeman

freeze

french

fresh

freud

friday

friedman friend

friendly

friends

friendship

frighten

frightening

frigid

frog

front

frost

frozen

fruit

frustrate

frustration

fry

fuck

fuel

full

fully

fun

fundfunds funeral

fungi

funnierfunniestfunny

fur

furnace

furniture

fury

future

g

gags

gain

galaxy

gallon

gamble

gambling

gamegames

gang

gangster

garage

garbage

garden

garfield

garlic

garment

gary

gasgasoline

gate

gateway

gathergathering

gauge

gave

gay

gear

gem

gender

gene

general

generally

generation

generous

genius

genre

gentleman

genuine

genuinely

george

germangermany

get

gets

getting

ghost

giant

gibson

giggle

gin

giraffe

girl

girlfriend

girls

give

given

gives

giving

glad

glance

glare

glass

glide

glittering

globe

glory

glove

glow

glue

go

goal

goat

goats

god

godfather

goes

going

gold

golden

golf

gone

gonna

good

goofy

goose

gordon

gore

gorgeous

gory

gospel

gossip

got

gotten

goverment

government

governor

grab

grace

grade

graduate

graffiti

grand

grant

granted

grapes

graphic

graphics

grasp

grass

grate

gratuitous

grave

gravestone

graveyard

gray

graze

great

greater

greatest

greatly

green

greet

grew

grey

grief

grill

grind

grip

gripping

gritty

grocery

groom

gross

ground

group

grow

growing

grown

growth

guarantee

guardian

guess

guest

guild

guilt

guilty

guitar

gulf

gull

gulp

gum

gunguns

guru

gut

guy

guys

gymnastics

h

hack

hairhaircut

half

hall

halloween

hallway

hamburger

hamlet

hamper

hamster

hancock

hand

handbag

handle

handled

hands

handsome

handwriting

hang

hanging

hanks

happenhappened

happening

happens

happiness

happy

harbor

harbour

hard

hardly

hardware

harm

harris

harryharsh

harvard

hat

hatch

hate

hated

haul

haunt

haunted

haunting

have

hawk

haze

he's

head

heads

heal

health

hear

heard

hearing

heart

heatheaven

heavily

heavy

heck

height

held

helen

helium

hell

helmet

help

helped

helper

helps

hen

henry

herb

here's

heritage

hero

heroes

heroin

heroine

hesitate

hey

hidden

hide

high

higher

highlight

highly

highway

hike

hilarious

hill

hip

historicalhistory

hit

hitch

hitchcock

hits

hockey

hoffman

hold

holding

holds

hole

holes

holiday

holler

hollywood

holmes

holy

home

homeless

homer

honest

honestly

honey

honolulu

hood

hooked

hoot

hop

hopehopefully

hopes

hoping

hopkins

horace

horizon

horn

horrible

horror

horse

hose

hospital

hostility

hot

hotel

hound

hourhours

house

housing

houston

howard

however

hug

huge

hugh

hum

humanhumanity

humans

humiliate

hummingbird

humor

humorous

humour

hundred

hunthunter

hurricane

hurry

hurt

husband

husky

hustle

hut

hydrogen

hymn

hype

hypertension

hypnotize

hysteria

i'd

i'll

i'm

i've

ian

ice

icon

idea

ideas

identity

ignorance

ignore

ii

iii

illegal

illness

illusion

illustrations

image

imageryimages

imagination

imagine

imdb

imitate

immediately

immoral

immunity

impact

impartiality

impatient

impersonate

implementimplementation

implode

importance

important

impose

impossible

impressed

impression

impressive

imprisoned

improving

impulse

inch

incline

include

included

includesincluding

income

increase

incredible

incredibly

indeed

independence

independent

indexindian

indiana

indication

individual

industrial

industry

inexpensive

infection

infinite

influence

inform

information

infrastructure

inhale

initial

initially

ink

inmate

inn

inner

innocence

innocent

inquire

insane

insectinsects

insert

inside

insight

inspect

inspired

installation

installment

instance

instead

institution

instructinstruction

instructor

instrument

instrumentation

insult

insurance

insureinsured

integrateintegration

intellect

intelligence

intelligent

intend

intended

intenseintensity

interaction

interest

interested

interesting

interim

interior

international

internet

interpretation

interrupt

intervention

interview

intoxicate

intriguing

introduce

introducedintroduction

intuition

invalidate

invent

inventory

investigateinvestigation

investment

investor invite

invoice

involve

involved

involvesinvolving

ipod

iris

irish

iron

irritate

irritating

island

isolation

israel

issueissues

italian

italy

ivy

j

jack

jacket

jackie

jackson

jaguar

jail

jake

jamaica

james

jamie

jane

japanjapanese

jar

jason

jaw

jay

jazz

jean

jeff

jellyfish

jennifer

jerk

jerry

jerusalem

jessica

jesus

jet

jewel

jewelry

jewish

jim

jimmy

joan

job

joe

jog

john

johnny

join

joint

joke

jokes

jones

journal

journey

joy

jr

judge

judging

judgment

juice

julia

julie

jump

jumper

jungle

jurisdiction

jury

justice

justify

juvenile

kkane

kansas

kate

keaton

keep

keeping

keeps

kelly

kennedy

kept

kevin

key

keyboard

kick

kid

kidnap

kidney

kids

killkilled

killerkillers

killingkills

kilometer

kind

kinda

kinds

king

kiss

kitchen

kittens

kitty

klan

knee

kneel

knew

knife

knight

knit

knitting

knock

know

knowing

knowledge

knownknows

kong

kremlin

kubrick

l

la

lablaboratory

lace

lacklackinglacks

lad

laden

ladies

lady

lake

lamb

lame

lamp

land

landlord

landscape

language

lantern

large

largely

larry

last

late

later

latest

latex

latin

latter

laugh

laughable

laughedlaughing

laughs

laughter

launch

laundering

laura

law

lawn

lawyer

lay

layer

lead

leader

leadingleads

leaf

lean

leap

learnlearned

learns

least

leather

leave

leavesleaving

lebanon

led

ledge

lee

left

leg

legend

legendary

legion

lego

lemon

lend

length

lens

less

lesson

let

let's

lets

letter

levellevels

lewis

liability

library

libya

license

lick

licking

lie

lien

lies

life

lift

light

lighthouse

lighting

lightning

lights

likable

like

liked

likely

likes

lily

limb

limit

limited

limp

lincoln

line

linen

lines

lineup

lingerie

link

lion

lip

lips

liquid

liquor

list

listen

listing

literally

literature

litter

little

livelived

liver

lives

living

lizard

load

loan

lobster

local

locate

locationlocations

locked

locomotives

log

logic

london

lonely

long

longer

looklookedlooking

looks

loop

loose

loosen

lord

lose

loseslosing

loss

losses

lost

lot

lots

loud

louisiana

lounge

love

loved

lovely

lover

lovers

lovesloving

low

lower

lucas

luck

lucky

luggage

luke

lunch

lung

luxury

lynch

lyric

mac

machine

mad

made

madhouse

magazine

magicmagical

magician

magnificent

magnify

magnitude

magnolia

maid

mail

main

mainly

mainstream

majormajority

make

make-up

maker

makers

makes

makeup

making

malemales

mallard

mammalmammals

manman's

managemanaged

management

manager

manages

manchester

mannequin

manner

manslaughter

manufacturer

many

map

maple

maradona

marathon

marble

marching

mare

marijuana

mark

market

marriagemarried

marrow

marry

mars

martial

martin

mary

mask

massacre

massive

master

masterpiece

match

mate

material

matrix

matt

matter

matters

matthew

mature

max

maxwell

may

maybe

mayor

meal

mean

meaning

means

meant

meanwhile

measure

meat

mechanic

mechanism

medal

media

medicine

mediocre

medium

meetmeeting

meets

mel

melody

melt

membermembers

memorabilia

memorable

memorial

memoriesmemory

men

menace

mend

mental

mention

mentioned

menu

meremerely

mess

message

met

metal

meter

metro

mexico

michael

michelle

microwave

midday

middle

might

mike

mile

miles

military

militia

milk

mill

miller

millionmillions

mimic

mindmindless

minds

mine

miniature

minister

ministry

mink

minor

minority

minuteminutes

mirror

misery

miss

missed

missile

missing

mission

misspend

mist

mistake

mistakes

mistreat

misty

mixmixedmixture

mob

mode

model

modern

modest

modification

molecule

mom

moment

moments

momentum

money

monk

monkeys

monstermonsters

monthmonths

monument

mood

moon

moore

moralmorality

morgan

morning

mortal

mortality

moss

mostly

motel

mother

motion

motive

motormotorcycle

motto

mount

mountain

mouse

mouth

move

moved

movement

moves

movie

movie's

movies

moving

mow

mozart

mr

mrs

ms

much

mud

mug

multiple

multiply

mummy

munch

munich

mural

murdermurdered

murders

murphymurray

musclemuseum

mushrooms

musicmusical

musicianmusicians

mussolini

must

mustard

myers

mysterious

mystery

myth

nag

nailnails

naked

name

named

namesnap

narration

narrative

narrow

nasty

nation

national

native

natural

naturally

nature

navy

nazi

near

nearly

necessarily

necessary

neck

necklace

need

needed

needle

needs

negative

neglect

negroes

neither

neon

nephew

nerve

nervous

nest

net

network

neutral

never

nevertheless

new

news

newspaper

next

nice

nicely

nicholson

nick

nickel

night

nightmare

nobody

noir

noise

nominated

none

nonetheless

nonsense

noodles

noon

normal

normally

north

nose

notch

note

notebook

nothing

notice

noticeable

noticed

notify

novel

novice

nowhere

nuclear

nude

nudity

nuisance

nullify

numbernumbers

numerous

nursenurses

nut

nutrition

nuts

oak

obey

objectobjective

obligation

observation

observatory

observe

obsessed

obtain

obvious

obviously

occasion

occasionally

occupation

occuroccurrence

ocean

odd

odyssey

offend

offer

offers

office

officer

official

often

oh

oil

okokay

old

older

oldest

oliver

one

one's

ones

onion

onto

opec

open

opening

opens

opera

operateoperationoperative

operator

opinion

opponent

opportunity

opposite

option

oracle

orange

orchestra

orchids

order

ordinary

organ

organism

organization organize

origami

origin

original

originality

originally

originate

orthodontist

oscaroscars

others otherwise

outdoor

outfit

outlet

outnumber

outside

outstanding

oval

over-the-top

overall

overcome

overflow

overhead

overly

overpower

overwhelm

owe

owen

owl

own

owner

ox

oxford

oxygen

p

pacepaced

pacific

pacing

pacino

package

packaging packed

packet

pact

padding

page

paid

pain

painful

paintpainted

painting

pair

pakistan

palace

palestinianpalestinians

palm

pan

panda

panorama

paper

papers

par

parade

paragraph

parcel

pardon

parentparents

paris

parish

park

parker

parking

parody

parrot

part

participant

participate

particular

particularly

partner

parts

party

pass

passage

passed

passion

passport

past

pat

patch

patent

path

pathetic

patients

patio

patrick

patternpatterns

paul

pause

paws

pay

payment

peace

peaceful

peacock

pearl

pebbles

peel

pelicanpelt

pen

pencil

pennsylvania

people

people's

pepper

perceive

percent

perception

perfect

perfection

perfectly

perform

performanceperformances

performedperformer

perfume

perhaps

period

perish

perjury

permission

permit

person

personal

personality

personally

personnel

perspective

perspire

persuade

pet

petals

peter

pets

pg

phantom

phenomenonphilip

phone

photographer

photography

photos

phrase

physical

physician

physics

piano

piazza

pickpicked

picture

pictures

pie

piecepieces

pier

pig

pigeon

pigs

pillowpilot

pin

pinch

pink

pinnacle

pipe

pitch

pitt

pity

pizza

place

placed

places

plague

plain

plan

plane

planet

planningplans

plant

plantation

plastic

plateplates

play

played

playerplayers

playground

playing

plays

pleaplead

pleasant

pleasantly

please

pleasure

pledges

plenty

plot

plots

plow

pluck

plus

poach

pod

poem

poet

poetry

poignant

point

pointless

points

poise

poker

poland

pole

poles

police

polish

polite

political

politician

politics

pollution

polo

polyester

pond

ponder

poodle

pool

poor

poorly

pop

popcorn

poppy

popular

population

porch

pork

porn

port

portrait

portray

portrayalportrayed

portrayingportrays

position

positive

possess

possession

possibility

possible

possibly

post

postage

postcards

posted

poster

pot

potatopotatoes

potential

potter

poultry

pounce

pour

poverty

power

powerful

powers

practically

practice

praise

pray

prayer

precedent

predict

predictable

prefer

preference

pregnancy

pregnant

prejudice

premise

preparation

prepare

prepared

presence

presentpresented

presents

preservation

president

press

pressure prestige

presume

pretend

pretentious

pretty

preview

previous

previously

prey

price

prick

pride

priest

prime

princeprincess

print

printer

prior

prison

prisoners

private

privileges

probability

probably

problem

problems

procedure

proceedings

process

processing

proclaim

produce

produced

producer

producers

product

production

profession

professional

professionals

professorprofit

program

prohibit

project

projector

prominence

promise

promote

promoter

proof

propaganda

proper

property

prose

prosecute

prosper

protagonist

protectprotection

protective

protest

protestant

protocol

proton

proud

prove

provedproves

provideprovidedprovides

province

proving

proximity

psychiatry

psycho

psychological

psychology

pub

public

publication

puddle

pug

pullpulled

pulls

pulp

pump

pumpkin

punch

punishpunishment

punk

pup

pupil

puppy

purchase

pure

purple

purpose

purse

pursue

push

put

puts

putting

pyramid

qualities

quality

quantity

quarantine

quarrel

quarterback

quartz

queenqueens

quench

query

quest

questionquestions

queue

quick

quickly

quiet

quirky

quit

quite

quiz

quota

quotation

quote

r

rabbi

rabbit

race

racer

rachel

racing

racism

racket

radar

radiation

radical

radio

railrailroad

rails

railway

rain

rainbow

raise

raised

rally

ran

random

range

rap

rape

rapid

rarerarely

rash

raspberry

rat

rate

rated

rather

rating

rationalize

rattle

raw

ray

reach

reaction

read

reader

reading

ready

real

realism

realistic

reality

realize

realized

realizes

really

reap

reappear

rearrange

reason

reasons

rebel

recall

receivereceived

receiver

recent

recently

recess

recognition

recognize

recommend

recommendation

recommended

record

recorder

recovery

recreation

recruit

recyclerecycling

red

redeeming

redemption

redheadreduce

reef

reel

reeves

reference

references

reflectreflection

refrain

refreshing

refrigerator

refugees

refuse

regarding

regardless

region

register

registration

regret

regular

reichstag

reject

rejoice

relate

relation relationship

relationships

relative

relatively

relax

relaxation

relaxed

relay

releasereleased

relief

religionreligious

rely

remain

remains

remake

remark

remarkable

remedy

remember

remembered

remindremindedreminds

reminiscent

remove

renounce

rentrental

rented

repair

repeal

repeat

replacereply

report

representation

representative

repress

reprimandreproduce

reptiles

republic

republicans

reputation

request

requirerequired

rescuerescued

researchresearchers

reservation

reserve

respect

respond

responsible

rest

restaurant

restless

restore

restrict

result

results

retailer

retain

retireretiring

retreat

returnreturns

revealrevealed

revenge

review

reviewersreviews

revolution

reward

rhodes

rhythm

rice

rich

richard

ride

ridge

ridicule

ridiculous

right

ring

rings

rinse

rip

ripples

rise

risk

river

riverside

road

roamroar

rob

robert

roberts

robin

robot

rock

rocker

rocket

rocky

rod

rodents

roger

roleroles

roll

rolling

romanceromantic

ron

roof

rook

room

rooster

root

rope

rose

roughround

routine

rover

row

rowing

rub

rubber

rubbish

rugby

ruin

ruined

ruins

rule

rules

run

running

runs

rush

russell

russiarussian

rust

rusty

ryan

sabbath

sad

sadly

safe

safety

said

sailsailing

sailor

saint

saints

sake

salad

salary

sale

sales

salt

salutesam

sanctions

sand

sandwich

santa

sarah

sat

satan

satellite

satin

satire

satisfy

satisfying

saturday

sauce

sausage

savesaved

saving

saw

say

saying

says

scale

scandal

scarce

scarescared

scaresscary

scene

scenery

scenes

scheme

school

sci-fi

science

scientist

scold

scooter

score

scotch

scott

scottish

scouting

scramble

scratch

scream

screaming

screen

screening

screenplay

scribble

script

scrutiny

sculpture

sea

seafood

seagull

sean

search

seashore

seasonseasons

seat

second

seconds

secret

secretary

secure

security

see

seed

seeing

seek

seem

seemed

seemingly

seems

seenseepage

sees

segment

seizeseizure

select

selection

self

sell

seminar

senate

send

sense

sensitive

sent

sentence

sentenced

sentiment

sentry

separate

separation

sequelsequels

sequence

sequences

serial

series

serious

seriously

serveserved

server

serves

service

serving

session

set

sets

setting

settle

seven

several

sew

sewing

sexsexual

sexy

shade

shadow

shake

shakespeare

shallowshame

share

shariff

shark

sharp

shatter

shed

sheep

sheer

sheet

shell

shelter

sheriff

shift

shineshines

ship

shirt

shiver

shock

shocked

shocking

shoes

shootshooting

shopshopping

shore

short

shortage

shotshots

shoulder

shove

show

showed

shower

showingshown

shows

shrink

shuffle

shun

shut

sick

sickness

side

sides

sidewalk

sight

sign

signal

signature

signed

significant

signs

silence

silent

silhouette

silly

silver

similar

similaritysimon

simple

simply

simulation

since

sing

singapore

singer

singing

single

sink

sinner

sip

sister

sisters

sit

site

sitting

situation

situations

six

sixteen

sixth

size

skateskateboard

skating

sketch

skiskiing

skillskills

skin

skip

skirt

skull

sky

skyline

skyscraper

slap

slapstick

slash

slasher

slaveslaveryslaves

slayslaying

sleep

sleeve

slice

slide

slightly

slip

slither

slope

slow

slowly

slurp

sly

small

smart

smash

smear

smell

smile

smith

smoke

smoking

smother

smudge

snake

snap

snatch

sneak

sneeze

sniff

snoozesnore

snow

snowboarding

snowman

snuggle

soak

soap

soar

sob

soccer

social

socialism

society

socks

sofa

softball

software

soil

solar

soldiersoldiers

sole

solid

solve

somebody

somehow

someone

something

sometimes

somewhat

somewhere

son

song

songs

soon

soothe

sorrow

sorry

sort

sorts

soul

soundsounds

soundtrack

soup

source

south

soviets

sow

soybean

space

spaghetti

spanish

spank

spare

spark

spawn

speakspeakingspeaks

special

species

spectacular

speculation

speech

speed

spell

spendspending

spent

sphere

spider

spider-man

spielberg

spill

spin

spine

spiral

spirit

spite

splash

split

spoilspoilerspoilers

spoon

sportsports

spot

sprain

spray

spread

spring

springfield

sprinkle

sprint

spy

squad

square

squash

squeaksqueal

squeeze

squint

squirrel

stab

stability

stadium

stage

stain

staircase

stairs

stake

stalin

stamp

stance

stand

standardstandards

standing

stands

star

stare

starring

stars

start

started

starter

starting

starts

starve

state

statement

states

station

statue

status

stay

stayed

stays

steak

steal

steals

steel

steeple

steer

stem

stencil

stepstephen

stereotypes

steve

steven

stevenson

stew

stewart

stick

stickers

still

sting

stink

stir

stitch

stitches

stock

stockings

stomach

stonestop

stopped

store

stories

stork

storm

storystoryline

storytelling

stove

straight

strain

strange

stranger

strategy

straw

strawberry

stray

stream

street

streets

strength

stretch

strike

string

strip

stripes

strive

stroke

strong

strongly

structure

struggle

struggling

stuck

stud

studentstudents

studio

study

stuff

stumble

stunning

stunts

stupid

stylestylish

subcommittee

subject

submarine

submitsubstance

subtitles

subtle

subtract

subway

succeed

succeeds

successsuccessful

sucksuckedsucks

sudden

suddenly

suds

sue

suffering

suffers

sugar

suggest

suicide

suit

sum

summary

summer

sun

sunflower

sunglasses

sunk

sunlightsunny

sunrisesunset

sunshine

super

superb

superhero

superior

superman

supernatural

supper

supply

support

supporter

supporting

suppose supposed

supposedly

sure

surely

surf

surface

surfers

surgery

surname

surpass

surprise

surprisedsurprises

surprising

surprisingly

surreal

survive

susan

sushi

suspect

suspensesuspenseful

suspicion

swallowswamp

swan

swap

sway

swear

sweat

sweater

sweden

sweet

swimswimming

swimsuit

swing

swoon

sword

sydney

symbol

sympathetic

sympathy

system

table

tableware

tackle

tag

tail

take

taken

takestaking

tale

talenttalentedtalents

tales

talktalking

talks

tall

tan

tank

tap

tape

tarantino

target

tarnish

task

taste

tattoo

tax

taxation

taxi

taxpayer

taylor

tea

teach teacherteaching

team

teartears

tease

technical

technician

technology

teen

teenageteenager

teenagersteens

teeth

telephone

television

tell

telling

tells

temper

temperature

temple

ten

tend

tennis

tenor

tensetension

tent

term

terminator

terms

terrible terribly

terrier

terrific

terrifying

territory

terror

test

tested

testify

texas

text

textbook

textile

texture

thailand

thank

thankfully

thanks

that's

thats

thaw

theater

theaters

theatre

theatrical

theft

theme

themes

theory

there's

therefore

they're

they've

thief

thin

thing

things

think

thinking

thinks

third

thomas

thoroughlythough

thought

thoughts

threatthreats

three

thrillerthrilling

throat

throne

throughout

throw

thrown

thumb

thunderstorm

thus

ticker

tickettickets

tickle

tie

tiger

tight

tiles

till

tim

timber

time

timer

times

tin

tiny

tire

tired

titanic

title

toast

tobacco

todaytoday's

toe

toes

together

toil

toilet

told

tolerance

tom

tomato

tommy

tone

tongue

tony

took

tool

tooth

top

topic

torture

toss

total

totally

tote

touch

touched

touches

touching

tough

tourist

tournament

tow

towardtowards

tower

town

toytoys

track

tract

tradetrading

traditional

traffic

tragedytragic

trail

trailertrailers

train

trainer

training

tram

transformation

transformers

transmission

transplant

transporttransportation

trap

trapped

trash

traveltraveling

tread

treasure

treat

treatedtreatment

treaty

tree

trek

trial

tribunal

trick

tried

tries

trilogy

trim

trinity

trip

triumph

troops

tropical

trouble

troubles

truck

true

truly

trunk

trust

truth

trytrying

tsunami

tub

tug

tulip

tumble

tune

tunnel

turkey

turks

turn

turned

turning

turns

tv

twenty

twice

twigs

twist

twisted

twists

two

type

types

typical

u

ugly

uk

ultimate

ultimately

umbrella

unable

unbelievable

uncle

underground

underrated

understand

understanding

understood

underwater

unexpected

unforgettable

unfortunate

unfortunately

unhappy

uniform

union

unique

unit

unite

united

universal

universe

unknown

unless

unlike

unlikely

unload

unnecessary

unrealistic

unusual

update

upon

upset

urban

urge

us

usa

useused

usesusing

usual

usually

usurp

utter

utterly

v

vacate

vacation

valentine

valley

valor

value

values

vampirevampires

van

vanish

variety

various

vary

vault

veer

vehicle

vein

verdict

verify

verse

versionversions

vessel

veteran

vhs

victimvictims

victor

victory

video

vietnamvietnamese

view

viewed

viewer

viewers

viewing

views

villa

villagevillages

villainvillains

vincent

vine

vinegar

vintage

vinyl

violation

violenceviolent

violet

violin

virtually

virtuoso

vision

visit

visitor

visualvisuallyvisuals

vitamin

vocal

vodka

voicevoices

volcano

volunteer

vomit

vote

voyage

vs

w

wag

wagner

wagon

waist

wait

waiting

wake

walk

walked

walking

walkswall

walter

wander

want

wanted

wanting

wants

war

warm

warnwarning

warranty

warrior

wars

warships

wash

washer

washing

washington

wasp

waste

wasted

watch

watchable

watchedwatching

water

waterfall

wave

way

wayne

ways

we're

we've

weak

wealth

weaponweapons

wearwearing

weather

weave

web

wed

wedding

wednesday

weed

weekweekendweeks

weep

weight

weird

welcome

welfare

well

went

west

western

wet

whale

what's

whatever

whatsoever

wheat

wheneverwhereas

whether

whilst

whip

whiskers

whiskey

whisper

whistle

white

who's

whole

whose

wide

wife

wig

wiggle

wild

william

williams

willing

willis

wilson

win

wind

windmill

window

wine

wing

wings

winnerwinning

winter

wipe

wire

wisdom

wise

wish

wit

witch

withdrawwithdrawal

withdrawn

within

without

witness

witty

wizard

wolf

woman

women

wonder

wonderful

wonderfully

wondering

wood

wooden

woodland

woods

woody

wool

wordwords

workworked

worker

working

workplace

works

workshop

world

worlds

worm

worry

worseworst

worth

worthwhile

worthy

would

would've

wow

wrap

wreck

wrist

write

writer

writers

writingwritten

wrong

wrote

x

x-men

yacht

yale

yard

yarn

yawn

yeah

year

year-oldyearn

years

yell

yellow

yen

yes

yet

yield

york

young

younger

youth

zebra

zero

zinc

zombie

zombies

zone

zoo

?

?

´

52 / 90

Page 71: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

t-SNE on the imdb20 PPMI VSM

positivity negativity

52 / 90

Page 72: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Code snippets

vsm_01_code6_solved

April 1, 2019

In [1]: %matplotlib inlinefrom nltk.corpus import opinion_lexiconimport osimport pandas as pdimport vsm

In [2]: DATA_HOME = os.path.join('data', 'vsmdata')

imdb5 = pd.read_csv(os.path.join(DATA_HOME, 'imdb_window5-scaled.csv.gz'), index_col=0)

In [3]: imdb5_ppmi = vsm.pmi(imdb5)

In [4]: # Supply a str filename to write the output to a file:vsm.tsne_viz(imdb5_ppmi, output_filename=None)

In [5]: # To display words in different colors based on external criteria:positive = set(opinion_lexicon.positive())negative = set(opinion_lexicon.negative())

colors = []for w in imdb5_ppmi.index:

if w in positive:color = 'red'

elif w in negative:color = 'blue'

else:color = 'gray'

colors.append(color)

vsm.tsne_viz(imdb5_ppmi, colors=colors)

1

53 / 90

Page 73: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Latent Semantic Analysis (LSA)

1. High-level goals and guiding hypotheses2. Matrix designs3. Vector comparison4. Basic reweighting5. Subword information6. Visualization7. Dimensionality reduction

a. Latent Semantic Analysisb. Autoencodersc. GloVed. word2vec

8. Retrofitting

54 / 90

Page 74: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Overview

• Due to Deerwester et al. 1990.

• One of the oldest and most widely used dimensionalityreduction techniques.

• Also known as Truncated Singular Value Decomposition(Truncated SVD).

• Standard baseline, often very tough to beat.

55 / 90

Page 75: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Guiding intuitions for LSA

x

y

0 2 10 11 12

0

34

1415

A

B

C

D

56 / 90

Page 76: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

The LSA method

Singular value decompositionFor any matrix of real numbers A of dimension (m× n) thereexists a factorization into matrices T, S, D such that

Am×n = Tm×mSm×mDTn×m

· · · ·· · · ·· · · ·

=

· · ·· · ·· · ·

···

· · ·· · ·· · ·· · ·

T

A3×4 = T3×3 S3×3 DT4×3

57 / 90

Page 77: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Idealized LSA example

d1 d2 d3 d4 d5 d6

gnarly 1 0 1 0 0 0wicked 0 1 0 1 0 0

awesome 1 1 1 1 0 0lame 0 0 0 0 1 1

terrible 0 0 0 0 0 1

⇓⇑

Distance from gnarly

1. gnarly2. awesome3. terrible4. wicked5. lame

T(erm)

gnarly 0.41 0.00 0.71 0.00 -0.58wicked 0.41 0.00 -0.71 0.00 -0.58

awesome 0.82 -0.00 -0.00 -0.00 0.58lame 0.00 0.85 0.00 -0.53 0.00

terrible 0.00 0.53 0.00 0.85 0.00

×

S(ingular values)

2.45 0.00 0.00 0.00 0.000.00 1.62 0.00 0.00 0.000.00 0.00 1.41 0.00 0.000.00 0.00 0.00 0.62 0.000.00 0.00 0.00 0.00 -0.00

×

D(ocument)

d1 0.50 -0.00 0.50 0.00 -0.71d2 0.50 0.00 -0.50 0.00 0.00d3 0.50 -0.00 0.50 0.00 0.71d4 0.50 -0.00 -0.50 -0.00 0.00d5 -0.00 0.53 0.00 -0.85 0.00d6 0.00 0.85 0.00 0.53 0.00

T

gnarly 0.41 0.00wicked 0.41 0.00

awesome 0.82 -0.00lame 0.00 0.85

terrible 0.00 0.53

× 2.45 0.000.00 1.62 =

gnarly 1.00 0.00wicked 1.00 0.00

awesome 2.00 0.00lame 0.00 1.38

terrible 0.00 0.85

Distance from gnarly

1. gnarly2. wicked3. awesome4. terrible5. lame

58 / 90

Page 78: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Cell-value comparisons (k = 100)

0.00 0.25 0.50 0.75 1.00 1.25Cell value 1e8

101

103

105

107

Raw counts

1.0 0.5 0.0 0.5Cell value 1e8

100

101

102

103

104

105

LSA (k=100)

10 5 0 5 10 15Cell value

101

102

103

104

105

106

107

PMI

50 0 50 100 150 200Cell value

101

102

103

104

105

PMI+LSA (k=100)

59 / 90

Page 79: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Choosing the LSA dimensionality

0 10 20 30 40Singular value rank

0.3

0.4

0.5

0.6

0.7

Sing

ular

val

ue

Dream scenario for choosing k

0 1000 2000 3000 4000 5000Singular value rank

5.0

2.5

0.0

2.5

5.0

7.5

Sing

ular

val

ue (l

og-s

cale

)

PMI+LSA

60 / 90

Page 80: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Related dimensionality reduction techniques

• Principal Components Analysis (PCA)• Non-negative Matrix Factorization (NMF)• Probabilistic LSA (PLSA; Hofmann 1999)• Latent Dirichlet Allocation (LDA; Blei et al. 2003)• t-SNE (van der Maaten and Hinton 2008)

See sklearn.decomposition and sklearn.manifold

61 / 90

Page 81: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Code snippetsvsm_01_code6_solved

April 1, 2019

In [1]: import osimport pandas as pdimport vsm

In [2]: DATA_HOME = os.path.join('data', 'vsmdata')

giga5 = pd.read_csv(os.path.join(DATA_HOME, 'giga_window5-scaled.csv.gz'), index_col=0)

In [3]: giga5.shape

Out[3]: (5000, 5000)

In [4]: giga5_lsa100 = vsm.lsa(giga5, k=100)

In [5]: giga5_lsa100.shape

Out[5]: (5000, 100)

1

62 / 90

Page 82: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Autoencoders

1. High-level goals and guiding hypotheses2. Matrix designs3. Vector comparison4. Basic reweighting5. Subword information6. Visualization7. Dimensionality reduction

a. Latent Semantic Analysisb. Autoencodersc. GloVed. word2vec

8. Retrofitting

63 / 90

Page 83: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Overview

• Autoencoders are a flexible class of deep learningarchitectures for learning reduced dimensionalrepresentations.

• Chapter 14 of Goodfellow et al. (2016) is an excellentdiscussion.

64 / 90

Page 84: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

The basic autoencoder model

x

x_hat = hW_hy + b_hy

h = f(xW_xh + b_xh)

Seeks to predict its own input.

High-dimensional inputs are fed through a narrow hidden layer

(or multiple hidden layers).This is the representation of

interest – akin to LSA output.

This might be preceded by a separate dimensionality

reduction step (e.g., LSA)

y_err = x_hat - x

d_b_hy = y_err

h_err = y_err.dot(W_hy.T) * f’(h)

d_W_hy outer(h, y_err)

d_W_xh = outer(x, h_err)

d_b_xh = h_err

Assume f = tanh and so f’(z) = 1.0 - z 2. Per example error is ∑i 0.5 * (x_hati - xi)2

65 / 90

Page 85: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Autoencoder code snippets

vsm_01_code7

April 2, 2019

In [1]: from np_autoencoder import Autoencoderimport osimport pandas as pdfrom torch_autoencoder import TorchAutoencoderimport vsm

In [2]: DATA_HOME = os.path.join('data', 'vsmdata')

giga5 = pd.read_csv(os.path.join(DATA_HOME, 'giga_window5-scaled.csv.gz'), index_col=0)

In [3]: # You'll likely need a larger network, trained longer, for good results.ae = Autoencoder(max_iter=10, hidden_dim=50)

In [4]: # Scaling the values first will help the network learn:giga5_l2 = giga5.apply(vsm.length_norm, axis=1)

In [5]: # The `fit` method returns the hidden reps:giga5_ae = ae.fit(giga5_l2)

Finished epoch 10 of 10; error is 0.4883386066987744

In [6]: torch_ae = TorchAutoencoder(max_iter=10, hidden_dim=50)

In [7]: # A potentially interesting pipeline:giga5_ppmi_lsa100 = vsm.lsa(vsm.pmi(giga5), k=100)

In [8]: giga5_ppmi_lsa100_ae = torch_ae.fit(giga5_ppmi_lsa100)

Finished epoch 10 of 10; error is 1.2230274677276611

In [ ]:

In [ ]:

In [ ]:

In [ ]:

In [ ]:

1

66 / 90

Page 86: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Autoencoder code snippetsIn [9]: vsm.neighbors("finance", giga5).head()

Out[9]: finance 0.000000minister 0.870300. 0.880074</p> 0.896013ministry 0.897051dtype: float64

In [10]: vsm.neighbors("finance", giga5_ae).head()

Out[10]: finance 0.000000article 0.504076style 0.526473domain 0.538920investigators 0.548903dtype: float64

In [11]: vsm.neighbors("finance", giga5_ppmi_lsa100_ae).head()

Out[11]: finance 0.000000affairs 0.232635management 0.248080commerce 0.255099banking 0.256428dtype: float64

2

66 / 90

Page 87: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Global Vectors (GloVe)

1. High-level goals and guiding hypotheses2. Matrix designs3. Vector comparison4. Basic reweighting5. Subword information6. Visualization7. Dimensionality reduction

a. Latent Semantic Analysisb. Autoencodersc. GloVed. word2vec

8. Retrofitting

67 / 90

Page 88: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Overview

• Pennington et al. (2014)

• Roughly speaking, the objective is to learn vectors forwords such that their dot product is proportional to theirprobability of co-occurrence.

• We’ll use the implementation in the mittens package(Dingwall and Potts 2018). There is a referenceimplementation in vsm.py. For really big vocabularies,the GloVe team’s C implementation is probably the bestchoice.

• We’ll make use of the GloVe team’s pretrainedrepresentations throughout this course.

68 / 90

Page 89: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

The GloVe objective

w>iewk + bi + ebk = log(Xik)

Equation (6):

w>iewk = log(Pik) = log(Xik)− log(Xi)

Allowing different rows and columns:

w>iewk = log(Pik) = log(Xik)− log(Xi∗ · X∗k)

That’s PMI!

pmi(X, i, j) = log�

Xij

expected(X, i, j)

= log�

P(Xij)

P(Xi∗) · P(X∗j)

By the equivalence log(xy ) = log(x)− log(y)

69 / 90

Page 90: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

The weighted GloVe objective

Original

w>iewk + bi + ebk = log(Xik)

Weighted

|V|∑

i,j=1

f�

Xij�

w>iewj + bi + ebj − logXij

�2

where V is the vocabulary and f is

f (x)

¨

(x/xmax)α if x < xmax

1 otherwise

Typically, α is set to 0.75 and xmax to 100.

70 / 90

Page 91: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

GloVe hyperparameters

• Learned representation dimensionality.• xmax, which flattens out all high counts.• α, which scales the values as (x/xmax)

α.

f (x)

¨

(x/xmax)α if x < xmax

1 otherwise

f��

100 99 75 10 1��

=�

1.00 0.99 0.81 0.18 0.03�

71 / 90

Page 92: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

GloVe learningThe loss calculations

f (Xij)�

w>iewj − logXij

show how gnarly and wicked arepulled toward awesome. Biasterms left out for simplicity.gnarly and wicked deliberatelyfar apart in w0 and ew0.

0.4 0.2 0.0 0.2 0.4GloVe dimension 1

0.4

0.2

0.0

0.2

0.4

GloV

e di

men

sion

2

gnarly

wicked

awesome

terribleBefore iteration 1

0.5 1.0 1.5 2.0GloVe dimension 1

1.5

1.0

0.5

0.0

0.5

GloV

e di

men

sion

2

gnarly

wicked

awesome

terribleAfter iteration 1

Counts gnarly wicked awesome terrible

gnarly 10 0 9 1wicked 0 10 9 1awesome 9 9 19 1terrible 1 1 1 3

Weights(xmax = 10, α = 0.75)gnarly wicked awesome terrible

gnarly 1.00 0.00 0.92 0.18wicked 0.00 1.00 0.92 0.18awesome 0.92 0.92 1.00 0.18terrible 0.18 0.18 0.18 0.41

w0

gnarly 0.27 −0.27wicked −0.27 0.27awesome 0.36 −0.50terrible 0.08 0.16

ew0

gnarly 0.18 −0.18wicked −0.18 0.18awesome 0.03 0.20terrible 0.17 0.32

0.92�

0.27 −0.27�> � 0.03 0.20

− log( 9 )�

= −2.06

0.92�

−0.27 0.27�> � 0.03 0.20

− log( 9 )�

= −1.98

w1

gnarly 0.99 −0.85wicked 0.74 −0.54awesome 0.37 −0.26terrible 0.12 0.21

ew1

gnarly 0.97 −0.82wicked 0.73 −0.54awesome 0.34 −0.25terrible 0.20 0.34

0.92�

0.99 −0.85�> � 0.34 −0.25

− log( 9 )�

= −1.51

0.92�

0.74 −0.54�> � 0.34 −0.25

− log( 9 )�

= −1.66

72 / 90

Page 93: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

GloVe cell-value comparisons (n = 50)

0.00 0.25 0.50 0.75 1.00 1.25Cell value 1e8

101

103

105

107

Raw counts

2 1 0 1 2Cell value

103

104

GloVe (n=50)

73 / 90

Page 94: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

GloVe code snippets

vsm_01_code8

April 2, 2019

In [1]: from mittens import GloVeimport numpy as npimport osimport pandas as pdimport vsm

In [2]: DATA_HOME = os.path.join('data', 'vsmdata')

imdb5 = pd.read_csv(os.path.join(DATA_HOME, 'imdb_window5-scaled.csv.gz'), index_col=0)

imdb20 = pd.read_csv(os.path.join(DATA_HOME, 'imdb_window20-flat.csv.gz'), index_col=0)

In [3]: # What percentage of the non-zero values are being mapped to 1 by f?def percentage_nonzero_vals_above(df, n=100):

v = df.values.reshape(1, -1).squeeze()v = v[v > 0]above = v[v > n]return len(above) / len(v)

In [4]: percentage_nonzero_vals_above(imdb5)

Out[4]: 0.017534398942316464

In [5]: percentage_nonzero_vals_above(imdb20)

Out[5]: 0.1519065095882084

In [ ]:

In [ ]:

In [ ]:

In [ ]:

In [ ]:

In [ ]:

1

74 / 90

Page 95: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

GloVe code snippetsIn [ ]:

In [6]: glv = GloVe(max_iter=100, n=50)

In [7]: imdb5_glv = glv.fit(imdb5)

Iteration 100: loss: 536157.755

In [8]: glv.sess.close()

In [9]: imdb20_glv = glv.fit(imdb20)

Iteration 100: loss: 1043351.625

In [10]: # Restore the original `pd.DataFrame` structure:imdb20_glv = pd.DataFrame(imdb20_glv, index=imdb20.index)

In [11]: # To what a degree is the GloVe objective achieved?def correlation_test(true, pred):

mask = true > 0M = pred.dot(pred.T)with np.errstate(divide='ignore'):

log_cooccur = np.log(true)log_cooccur[np.isinf(log_cooccur)] = 0.0row_prob = np.log(true.sum(axis=1))row_log_prob = np.outer(row_prob, np.ones(true.shape[1]))prob = log_cooccur - row_log_prob

return np.corrcoef(prob[mask], M[mask])[0, 1]

In [12]: correlation_test(imdb5.values, imdb5_glv)

Out[12]: 0.38032242586515264

In [13]: correlation_test(imdb20.values, imdb20_glv.values)

Out[13]: 0.484126476892789

In [ ]:

In [ ]:

In [ ]:

In [14]: %matplotlib inlineimport matplotlib.pyplot as plt

In [15]: def hist(df, output_filename, color, title):figdir = "/Users/cgpotts/Documents/teaching/2018-2019/spring/cs224u/vsm/fig/"output_filename = os.path.join(figdir, output_filename)vals = df.values.reshape(1, -1).squeeze()plt.hist(vals, log=True, facecolor=color)plt.xlabel("Cell value")plt.title(title)plt.savefig(output_filename)plt.close()

2

74 / 90

Page 96: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

word2vec

1. High-level goals and guiding hypotheses2. Matrix designs3. Vector comparison4. Basic reweighting5. Subword information6. Visualization7. Dimensionality reduction

a. Latent Semantic Analysisb. Autoencodersc. GloVed. word2vec

8. Retrofitting

75 / 90

Page 97: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Overview

• Introduced by Mikolov et al. (2013).

• Goldberg and Levy (2014) identify the relationshipbetween word2vec and PMI.

• The TensorFlow tutorial Vector representations of wordsis very clear and links to code.

• Gensim package has a highly scalable implementation.

76 / 90

Page 98: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

word2vec: From corpus to labeled data

it was the best of times, it was the worst of times, . . .

With window size 2:

x y

it wasit thewas itwas thewas bestthe wasthe itthe bestthe of

. . .

77 / 90

Page 99: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

word2vec: Basic skip-gramThe basic skip-gram model estimates the probability of aninput–output pair (a,b) as

P(b | a) =exp(xawb)

b′∈V exp(xawb′)

where xa is the row-vector representation of word a and wb isthe column vector representation of word b. Minimize:

−m∑

i=1

|V|∑

k=1

1{ci = k} logexp(xiwk)

∑|V|j=1 exp(xiwj)

where V is the vocabulary and c is a one-hot encoded vectorof the same length as V. This gives rise to a classifier:

C = softmax(XW + b)

We’re back to our core insight for this unit: word and contextmatrix, pushing their dot products in a specific direction.

78 / 90

Page 100: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

word2vec: Noise contrastive estimation

Training the basic skip-gram model directly is expensive forlarge vocabularies, because W, b, and C get so large. Noisecontrastive estimation addresses that:

a,b∈D− logσ(xawb) +

a,b∈D′logσ(xawb)

with σ the sigmoid activation function 11+exp(−x) . D′ is a

sample of pairs that don’t appear in the training data.

79 / 90

Page 101: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Retrofitting

1. High-level goals and guiding hypotheses2. Matrix designs3. Vector comparison4. Basic reweighting5. Subword information6. Visualization7. Dimensionality reduction

a. Latent Semantic Analysisb. Autoencodersc. GloVed. word2vec

8. Retrofitting

80 / 90

Page 102: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Central goals

• Distributional representations are powerful and easy toobtain, but they tend to reflect only similarity(synonymy, connotation).

• Structured resources are sparse and hard to obtain, butthey support learning rich, diverse semantic distinctions.

• Can we have the best aspects of both? Retrofitting is oneway of saying, “Yes”.

• Retrofitting is due to Faruqui et al. (2015).

81 / 90

Page 103: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Purely distributional representations

• High-dimensional

• Meaning from dense linguistic inter-relationships

• Meaning solely from (nth-order) co-occurrence

• No grounding in physical or social contexts

• Not symbolic

82 / 90

Page 104: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Grounding via supervisionWord vectors to maximize unsupervised log-likelihood ofwords given documents and supervised prediction accuracy:

(Maas et al. 2011)

83 / 90

Page 105: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Hidden representations from a deep classifier

84 / 90

Page 106: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

The retrofitting model

i∈Vαi

qi − qi

2+

(i,j,r)∈Eβij

qi − qj

2

• Balances fidelity to theoriginal vector qi

• against looking more likeone’s graph neighbors.

• Forces are balanced withα = 1 and β = 1

Degree(i)Figure 1: Word graph with edges between related wordsshowing the observed (grey) and the inferred (white)word vector representations.

Experimentally, we show that our method workswell with different state-of-the-art word vector mod-els, using different kinds of semantic lexicons andgives substantial improvements on a variety ofbenchmarks, while beating the current state-of-the-art approaches for incorporating semantic informa-tion in vector training and trivially extends to mul-tiple languages. We show that retrofitting givesconsistent improvement in performance on evalua-tion benchmarks with different word vector lengthsand show a qualitative visualization of the effect ofretrofitting on word vector quality. The retrofittingtool is available at: https://github.com/mfaruqui/retrofitting.

2 Retrofitting with Semantic Lexicons

Let V = {w1, . . . , wn} be a vocabulary, i.e, the setof word types, and⌦ be an ontology that encodes se-mantic relations between words in V . We represent⌦ as an undirected graph (V,E) with one vertex foreach word type and edges (wi, wj) 2 E ✓ V ⇥ Vindicating a semantic relationship of interest. Theserelations differ for different semantic lexicons andare described later (§4).

The matrix Q will be the collection of vector rep-resentations qi 2 Rd, for each wi 2 V , learnedusing a standard data-driven technique, where d isthe length of the word vectors. Our objective isto learn the matrix Q = (q1, . . . , qn) such that thecolumns are both close (under a distance metric) totheir counterparts in Q and to adjacent vertices in ⌦.Figure 1 shows a small word graph with such edgeconnections; white nodes are labeled with the Q vec-

tors to be retrofitted (and correspond to V⌦); shadednodes are labeled with the corresponding vectors inQ, which are observed. The graph can be interpretedas a Markov random field (Kindermann and Snell,1980).

The distance between a pair of vectors is definedto be the Euclidean distance. Since we want theinferred word vector to be close to the observedvalue qi and close to its neighbors qj ,8j such that(i, j) 2 E, the objective to be minimized becomes:

(Q) =

nX

i=1

24↵ikqi � qik2 +

X

(i,j)2E

�ijkqi � qjk2

35

where ↵ and � values control the relative strengthsof associations (more details in §6.1).

In this case, we first train the word vectors inde-pendent of the information in the semantic lexiconsand then retrofit them. is convex in Q and its so-lution can be found by solving a system of linearequations. To do so, we use an efficient iterativeupdating method (Bengio et al., 2006; Subramanyaet al., 2010; Das and Petrov, 2011; Das and Smith,2011). The vectors in Q are initialized to be equalto the vectors in Q. We take the first derivative of with respect to one qi vector, and by equating it tozero arrive at the following online update:

qi =

Pj:(i,j)2E �ijqj + ↵iqiP

j:(i,j)2E �ij + ↵i(1)

In practice, running this procedure for 10 iterationsconverges to changes in Euclidean distance of ad-jacent vertices of less than 10�2. The retrofittingapproach described above is modular; it can be ap-plied to word vector representations obtained fromany model as the updates in Eq. 1 are agnostic to theoriginal vector training model objective.

Semantic Lexicons during Learning. Our pro-posed approach is reminiscent of recent work onimproving word vectors using lexical resources (Yuand Dredze, 2014; Bian et al., 2014; Xu et al., 2014)which alters the learning objective of the originalvector training model with a prior (or a regularizer)that encourages semantically related vectors (in ⌦)to be close together, except that our technique is ap-plied as a second stage of learning. We describe the

1607

85 / 90

Page 107: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Simple retrofitting examples

i∈Vαi

qi − qi

2+

(i,j,r)∈Eβij

qi − qj

2

α = 0

86 / 90

Page 108: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Simple retrofitting examples

i∈Vαi

qi − qi

2+

(i,j,r)∈Eβij

qi − qj

2

α = 0

86 / 90

Page 109: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Simple retrofitting examples

i∈Vαi

qi − qi

2+

(i,j,r)∈Eβij

qi − qj

2

α = 0

86 / 90

Page 110: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Extensions

Drop the assumption that every edge means ‘similar’:

• Mrksic et al. (2016) AntonymRepel, SynonymAttract, andVectorSpacePreservation for different edge types.

• Lengerich et al. (2018): functional retrofitting to learnthe semantics of any edge types.

• This work is closely related to graph embedding(learning distributed representations for nodes), forwhich see Hamilton et al. 2017.

87 / 90

Page 111: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting

Retrofitting code snippets

vsm_01_code9

April 2, 2019

In [1]: import pandas as pdfrom retrofitting import Retrofitter

In [2]: Q_hat = pd.DataFrame([[0.0, 0.0],[0.0, 0.5],[0.5, 0.0]],

columns=['x', 'y'])

edges = {0: {1, 2}, 1: set(), 2: set()}

In [3]: Q_hat

Out[3]: x y0 0.0 0.01 0.0 0.52 0.5 0.0

In [4]: retro = Retrofitter(verbose=True)

In [5]: X_retro = retro.fit(Q_hat, edges)

Converged at iteration 2; change was 0.0000

In [6]: X_retro

Out[6]: x y0 0.125 0.1251 0.000 0.5002 0.500 0.000

In [7]: # For an application to WordNet, see `vsm_03_retrofitting`.

1

88 / 90

Page 112: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

References

References IDavid M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research,

3:993–1022.Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword

information. ArXiv:1607.04606.S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. 1990. Indexing by latent semantic analysis.

Journal of the American Society for Information Science, 41(6):391–407.Nicholas Dingwall and Christopher Potts. 2018. Mittens: An extension of GloVe for learning domain-specialized

representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association forComputational Linguistics: Human Language Technologies, pages 212–217, Stroudsburg, PA. Association forComputational Linguistics.

Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting wordvectors to semantic lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Associationfor Computational Linguistics: Human Language Technologies, pages 1606–1615, Stroudsburg, PA. Association forComputational Linguistics.

John R. Firth. 1957. A synopsis of linguistic theory 1930–1955. In Studies in Linguistic Analysis, pages 1–32. Blackwell,Oxford.

Yoav Goldberg and Omer Levy. 2014. word2vec explained: Deriving Mikolov et al.’s negative-sampling word-embeddingmethod. Technical Report arXiv:1402.3722v1, Bar Ilan University.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.http://www.deeplearningbook.org.

William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Representation learning on graphs: Methods and applications. InIEEE Data Engineering Bulletin, pages 52–74. IEEE Press.

Zellig Harris. 1954. Distributional structure. Word, 10(23):146–162.Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM

SIGIR Conference on Research and Development in Information Retrieval, pages 50–57, New York. ACM.Benjamin J. Lengerich, Andrew L. Maas, and Christopher Potts. 2018. Retrofitting distributional embeddings to knowledge

graphs with functional relations. In Proceedings of the 27th International Conference on Computational Linguistics,pages 2423–2436, Stroudsburg, PA. Association for Computational Linguistics.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in NeuralInformation Processing Systems.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning wordvectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for ComputationalLinguistics, pages 142–150, Portland, Oregon. Association for Computational Linguistics.

Laurens van der Maaten and Geoffrey E. Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research,9:2579–2605.

89 / 90

Page 113: Christopher Potts - Stanford University...Overview Designs Vector comparison Basic reweighting Subwords Viz LSA AE GloVe w2v Retrofitting Meaning latent in co-occurrence patterns

References

References II

Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press,Cambridge, MA.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words andphrases and their compositionality. In Christopher J. C. Burges, Leon Bottou, Max Welling, Zoubin Ghahramani, andKilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. CurranAssociates, Inc.

Nikola Mrksic, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gasic, Lina M. Rojas-Barahona, Pei-Hao Su, David Vandyke,Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. In Proceedings of the2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human LanguageTechnologies, pages 142–148. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. InProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages1532–1543, Doha, Qatar. Association for Computational Linguistics.

Hinrich Schütze. 1993. Word space. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural InformationProcessing Systems 5, pages 895–902. Morgan-Kaufmann.

Noah A. Smith. 2019. Contextual word representations: A contextual introduction. ArXiv:1902.06006v2.Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of

Artificial Intelligence Research, 37:141–188.Ludwig Wittgenstein. 1953. Philosophical Investigations. The MacMillan Company, New York. Translated by

G. E. M. Anscombe.

90 / 90