Clustering Semantically Similar Words


Transcript of Clustering Semantically Similar Words

Page 1: Clustering Semantically Similar Words

Clustering Semantically Similar Words

[Title-slide illustration: two word vectors a and a′ with similarity score 0.397]

DSW Camp & Jam, December 4th, 2016

Bayu Aldi Yansyah

Page 2: Clustering Semantically Similar Words

Overview: Our Goals

- Understand step-by-step how to cluster words based on their semantic similarity
- Understand how a deep learning model is applied to Natural Language Processing

Page 3: Clustering Semantically Similar Words

Overview: I assume …

- You understand the basics of Natural Language Processing and Machine Learning
- You are familiar with artificial neural networks

Page 4: Clustering Semantically Similar Words

Overview: Outline

1. Introduction to Word Clustering
2. Introduction to Word Embedding
   - Feed-forward Neural Net Language Model
   - Continuous Bag-of-Words Model
   - Continuous Skip-gram Model
3. Similarity metrics
   - Cosine similarity
   - Euclidean similarity
4. Clustering algorithm: Consensus clustering

Page 5: Clustering Semantically Similar Words

1. WORD CLUSTERING: INTRODUCTION

- Word clustering is a technique for partitioning a set of words into subsets of semantically similar words.
- Suppose we have a set of words W = {w_1, w_2, …, w_n}, n ∈ ℕ. Our goal is to find C = {C_1, C_2, …, C_k}, k ∈ ℕ, where:
- w_c is the centroid of cluster C_i,
- similarity(w_c, w) is a function that measures the similarity score,
- and t is a threshold value such that similarity(w_c, w) ≥ t means w_c and w are semantically similar.
- For w_1 ∈ C_i and w_2 ∈ C_j (with C_i ≠ C_j) it holds that similarity(w_1, w_2) < t, so:

  C_i = {w | ∀w ∈ W where similarity(w_c, w) ≥ t}
  C_i ∩ C_j = ∅, ∀ C_i, C_j ∈ C

Page 6: Clustering Semantically Similar Words

1. WORD CLUSTERING: INTRODUCTION

In order to perform word clustering, we need to:

1. Represent each word as a semantic vector, so we can compute similarity and dissimilarity scores.
2. Find the centroid w_c of each cluster.
3. Choose the similarity metric similarity(w_c, w) and the threshold value t.
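As a concrete illustration of these three steps, here is a minimal sketch that assumes the word vectors and centroid words are already available and uses cosine similarity; the function and variable names are illustrative, not the talk's actual code:

```python
# Minimal sketch: assign each word to its most similar centroid,
# but only if the similarity clears the threshold t.
import numpy as np

def cosine_similarity(v1, v2):
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def assign_clusters(word_vectors, centroid_words, t=0.5):
    """word_vectors: dict word -> np.ndarray; centroid_words: list of words w_c."""
    clusters = {c: [] for c in centroid_words}
    for word, vec in word_vectors.items():
        # Find the centroid w_c with the highest similarity to this word.
        best = max(centroid_words,
                   key=lambda c: cosine_similarity(word_vectors[c], vec))
        if cosine_similarity(word_vectors[best], vec) >= t:
            clusters[best].append(word)
    return clusters
```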

Page 7: Clustering Semantically Similar Words

Semantic ≠ Synonym

“Words are similar semantically if they have the same thing, are opposite of each other, used in the same way, used in the same context and one is a type of another.” − Gomaa and Fahmy (2013)

Page 8: Clustering Semantically Similar Words

2. WORD EMBEDDING: INTRODUCTION

- Word embedding is a technique to represent a word as a vector.
- The result of word embedding is frequently referred to as a “word vector” or a “distributed representation of words”.
- There are 3 main approaches to word embedding:
  1. Neural-network-model based
  2. Dimensionality-reduction based
  3. Probabilistic-model based
- We focus on (1).
- The idea of these approaches is to learn vector representations of words in an unsupervised manner.

Page 9: Clustering Semantically Similar Words

2. WORD EMBEDDING: INTRODUCTION

- Some neural network models that can learn representations of words are:
  1. Feed-forward Neural Net Language Model by Bengio et al. (2003).
  2. Continuous Bag-of-Words Model by Mikolov et al. (2013).
  3. Continuous Skip-gram Model by Mikolov et al. (2013).
- We will compare these 3 models.
- Fun fact: the last two models are highly inspired by the first one.
- Only the Feed-forward Neural Net Language Model is considered a deep learning model.

Page 10: Clustering Semantically Similar Words

2. WORD EMBEDDING: COMPARING NEURAL NETWORK MODELS

- We will use the notation from Collobert et al. (2011) to describe the models. This helps us compare the models easily.
- Any feed-forward neural network with L layers can be seen as a composition of functions f_θ^l(x), one for each layer l:

  f_θ(x) = f_θ^L(f_θ^{L-1}(… f_θ^1(x) …))

- with parameters for each layer l:

  θ = (θ_1, θ_2, …, θ_L)

- Usually each layer l has a weight W and a bias b, so θ_l = (W_l, b_l).
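To make the notation concrete, here is a tiny illustrative sketch (not from the talk) of a network built as a composition of per-layer functions:

```python
# Minimal sketch of the "composition of functions" view: each layer l is a
# function f_theta_l(x) = activation(W_l.T @ x + b_l), and the network
# applies f_theta_1, ..., f_theta_L in order.
import numpy as np

def make_layer(W, b, activation=lambda z: z):
    return lambda x: activation(W.T @ x + b)

def compose(layers):
    def f_theta(x):
        for layer in layers:
            x = layer(x)
        return x
    return f_theta
```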

Page 11: Clustering Semantically Similar Words

2.1. FEED-FORWARD NEURAL NET LANGUAGE MODEL: Bengio et al. (2003)

- The training data is a sequence of words w_1, w_2, …, w_T with w_t ∈ V.
- The model tries to predict the next word w_t based on the previous context (the previous n words: w_{t-1}, w_{t-2}, …, w_{t-n}). (Figure 2.1.1)
- The model consists of 4 layers: input layer, projection layer, hidden layer(s) and output layer. (Figure 2.1.2)
- Known as NNLM.

Figure 2.1.1: the model predicts the next word w_t from the previous words w_{t-1}, w_{t-2}, w_{t-3}, w_{t-4} (example phrase: “Keren Sale Stock bisa dirumah …”).

Page 12: Clustering Semantically Similar Words

2.1. FEED-FORWARD NEURAL NET LANGUAGE MODEL: COMPOSITION OF FUNCTIONS (INPUT LAYER)

- x_{t-1}, x_{t-2}, …, x_{t-n} are the 1-of-|V| (one-hot-encoded) vectors of w_{t-1}, w_{t-2}, …, w_{t-n}.
- n is the number of previous words.
- The input layer just acts as a placeholder here.

  x'_t = f_θ(x_{t-1}, …, x_{t-n})

  Output layer               : f_θ^4 = x'_t = σ(W_4ᵀ f_θ^3 + b_4)
  Hidden layer               : f_θ^3 = tanh(W_3ᵀ f_θ^2 + b_3)
  Projection layer           : f_θ^2 = W_2ᵀ f_θ^1
  Input layer (i-th example) : f_θ^1 = (x_{t-1}, x_{t-2}, …, x_{t-n})

Page 13: Clustering Semantically Similar Words

2.1. FEED-FORWARD NEURAL NET LANGUAGE MODEL: COMPOSITION OF FUNCTIONS (PROJECTION LAYER)

- The idea of this layer is to project the |V|-dimensional vectors down to a smaller dimension.
- W_2 is a |V|×m matrix, also known as the embedding matrix, where each row is a word vector.
- Unlike the hidden layer, there is no non-linearity here.
- This layer is also known as “the shared word features layer”.

  x'_t = f_θ(x_{t-1}, …, x_{t-n})

  Output layer               : f_θ^4 = x'_t = σ(W_4ᵀ f_θ^3 + b_4)
  Hidden layer               : f_θ^3 = tanh(W_3ᵀ f_θ^2 + b_3)
  Projection layer           : f_θ^2 = W_2ᵀ f_θ^1
  Input layer (i-th example) : f_θ^1 = (x_{t-1}, x_{t-2}, …, x_{t-n})

Page 14: Clustering Semantically Similar Words

2.1. FEED-FORWARD NEURAL NET LANGUAGE MODEL: COMPOSITION OF FUNCTIONS (HIDDEN LAYER)

- W_3 is an h×nm matrix, where h is the number of hidden units.
- b_3 is an h-dimensional vector.
- The activation function is the hyperbolic tangent.

  x'_t = f_θ(x_{t-1}, …, x_{t-n})

  Output layer               : f_θ^4 = x'_t = σ(W_4ᵀ f_θ^3 + b_4)
  Hidden layer               : f_θ^3 = tanh(W_3ᵀ f_θ^2 + b_3)
  Projection layer           : f_θ^2 = W_2ᵀ f_θ^1
  Input layer (i-th example) : f_θ^1 = (x_{t-1}, x_{t-2}, …, x_{t-n})

Page 15: Clustering Semantically Similar Words

2.1. FEED-FORWARD NEURAL NET LANGUAGE MODEL: COMPOSITION OF FUNCTIONS (OUTPUT LAYER)

- W_4 is an h×|V| matrix.
- b_4 is a |V|-dimensional vector.
- The activation function is softmax.
- x'_t is a |V|-dimensional vector.

  x'_t = f_θ(x_{t-1}, …, x_{t-n})

  Output layer               : f_θ^4 = x'_t = σ(W_4ᵀ f_θ^3 + b_4)
  Hidden layer               : f_θ^3 = tanh(W_3ᵀ f_θ^2 + b_3)
  Projection layer           : f_θ^2 = W_2ᵀ f_θ^1
  Input layer (i-th example) : f_θ^1 = (x_{t-1}, x_{t-2}, …, x_{t-n})
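Putting the four layers together, here is a minimal numpy sketch of the NNLM forward pass; it is illustrative only and follows the layer shapes listed on these slides, not the original implementation:

```python
# Minimal NNLM forward pass: n previous words -> P(w_t | context).
import numpy as np

V, n, m, h = 10, 4, 2, 5        # |V|, context size, embedding dim, hidden units

W2 = np.random.randn(V, m)      # projection / embedding matrix (one row per word)
W3 = np.random.randn(h, n * m)  # hidden layer weights (h x nm, as on the slide)
b3 = np.zeros(h)
W4 = np.random.randn(h, V)      # output layer weights (h x |V|)
b4 = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nnlm_forward(context_ids):
    """context_ids: vocabulary indices of w_{t-1}, ..., w_{t-n}."""
    # Projection layer: look up and concatenate the n word vectors
    # (equivalent to W2.T @ one_hot for each word); no non-linearity.
    f2 = np.concatenate([W2[i] for i in context_ids])   # shape (n*m,)
    # Hidden layer: tanh non-linearity.
    f3 = np.tanh(W3 @ f2 + b3)                           # shape (h,)
    # Output layer: softmax over the vocabulary.
    return softmax(W4.T @ f3 + b4)                       # shape (|V|,)

probs = nnlm_forward([1, 4, 7, 2])   # P(w_t | the 4 previous words)
```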

Page 16: Clustering Semantically Similar Words

2.1. FEED-FORWARD NEURAL NET LANGUAGE MODEL: LOSS FUNCTION

- Where N is the number of training examples.
- The goal is to maximize this objective (the average log-likelihood).
- The neural network is trained using stochastic gradient ascent.

  L = (1/N) Σ_{i=1}^{N} log f_θ(x_{t-1}, …, x_{t-n}; θ)^{(i)}

Page 17: Clustering Semantically Similar Words

Figure 2.1.2: Flow of the tensors of the Feed-forward Neural Net Language Model with vocabulary size |V| and hyperparameters n = 4, m = 2 and h = 5. (The inputs x_{t-1}, …, x_{t-4} are projected to word vectors v_{t-1}, …, v_{t-4} and mapped to the output x'_t.)

Page 18: Clustering Semantically Similar Words

2.2. CONTINUOUS BAG-OF-WORDS MODEL: Mikolov et al. (2013)

- The training data is a sequence of words w_1, w_2, …, w_T with w_t ∈ V.
- The model tries to predict the word w_t based on the surrounding context (n words from the left: w_{t-1}, w_{t-2} and n words from the right: w_{t+1}, w_{t+2}). (Figure 2.2.1)
- There is no hidden layer in this model.
- The projection layer is averaged across the input words.

Figure 2.2.1: predicting w_t from its surrounding words w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} (example phrase: “Keren Sale bisa bayar dirumah …”).

Page 19: Clustering Semantically Similar Words

2.2. CONTINUOUS BAG-OF-WORDS MODEL: COMPOSITION OF FUNCTIONS (INPUT LAYER)

- x_{t+j} is the 1-of-|V| (one-hot-encoded) vector of w_{t+j}.
- n is the number of words on the left and on the right.

  x'_t = f_θ(x_{t-n}, …, x_{t-1}, x_{t+1}, …, x_{t+n})

  Output layer               : f_θ^3 = x'_t = σ(W_3ᵀ f_θ^2 + b_3)
  Projection layer           : f_θ^2 = v = (1/2n) Σ_{-n ≤ j ≤ n, j ≠ 0} W_2ᵀ f_θ^1(j)
  Input layer (i-th example) : f_θ^1(j) = x_{t+j},  -n ≤ j ≤ n, j ≠ 0

Page 20: Clustering Semantically Similar Words

2.2. CONTINUOUS BAG-OF-WORDS MODEL: COMPOSITION OF FUNCTIONS (PROJECTION LAYER)

- The difference from the previous model is that this model projects all the inputs to a single m-dimensional vector v.
- W_2 is a |V|×m matrix, also known as the embedding matrix, where each row is a word vector.

  x'_t = f_θ(x_{t-n}, …, x_{t-1}, x_{t+1}, …, x_{t+n})

  Output layer               : f_θ^3 = x'_t = σ(W_3ᵀ f_θ^2 + b_3)
  Projection layer           : f_θ^2 = v = (1/2n) Σ_{-n ≤ j ≤ n, j ≠ 0} W_2ᵀ f_θ^1(j)
  Input layer (i-th example) : f_θ^1(j) = x_{t+j},  -n ≤ j ≤ n, j ≠ 0

Page 21: Clustering Semantically Similar Words

2.2. CONTINUOUS BAG-OF-WORDS MODEL: COMPOSITION OF FUNCTIONS (OUTPUT LAYER)

- W_3 is an m×|V| matrix.
- b_3 is a |V|-dimensional vector.
- The activation function is softmax.
- x'_t is a |V|-dimensional vector.

  x'_t = f_θ(x_{t-n}, …, x_{t-1}, x_{t+1}, …, x_{t+n})

  Output layer               : f_θ^3 = x'_t = σ(W_3ᵀ f_θ^2 + b_3)
  Projection layer           : f_θ^2 = v = (1/2n) Σ_{-n ≤ j ≤ n, j ≠ 0} W_2ᵀ f_θ^1(j)
  Input layer (i-th example) : f_θ^1(j) = x_{t+j},  -n ≤ j ≤ n, j ≠ 0
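A minimal numpy sketch of the CBOW forward pass described above (illustrative only, with hypothetical values for |V|, n and m):

```python
# Minimal CBOW forward pass: 2n surrounding words -> P(w_t | context).
import numpy as np

V, n, m = 10, 2, 2

W2 = np.random.randn(V, m)   # embedding matrix (one row per word)
W3 = np.random.randn(m, V)   # output layer weights (m x |V|)
b3 = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cbow_forward(context_ids):
    """context_ids: vocabulary indices of the 2n surrounding words."""
    # Projection layer: average the context word vectors into one m-dim vector v.
    v = W2[context_ids].mean(axis=0)
    # Output layer: softmax over the vocabulary.
    return softmax(W3.T @ v + b3)

probs = cbow_forward([3, 5, 1, 8])   # w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}
```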

Page 22: Clustering Semantically Similar Words

2.2. CONTINUOUS BAG-OF-WORDS MODEL: LOSS FUNCTION

- Where N is the number of training examples.
- The goal is to maximize this objective (the average log-likelihood).
- The neural network is trained using stochastic gradient ascent.

  L = (1/N) Σ_{i=1}^{N} log f_θ(x_{t-n}, …, x_{t-1}, x_{t+1}, …, x_{t+n})^{(i)}

Page 23: Clustering Semantically Similar Words

Figure 2.2.2: Flow of the tensors of the Continuous Bag-of-Words Model with vocabulary size |V| and hyperparameters n = 2, m = 2. (The inputs x_{t-2}, x_{t-1}, x_{t+1}, x_{t+2} are averaged into v and mapped to the output x'_t.)

Page 24: Clustering Semantically Similar Words

2.3. CONTINUOUS SKIP-GRAM MODEL: Mikolov et al. (2013)

- The training data is a sequence of words w_1, w_2, …, w_T with w_t ∈ V.
- The model tries to predict the surrounding context (n words from the left: w_{t-1}, w_{t-2} and n words from the right: w_{t+1}, w_{t+2}) based on the word w_t. (Figure 2.3.1)

Figure 2.3.1: predicting the surrounding words w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} from w_t (example phrase: “Keren … bisa …”).

Page 25: Clustering Semantically Similar Words

2.3. CONTINUOUS SKIP-GRAM MODEL: COMPOSITION OF FUNCTIONS (INPUT LAYER)

- x_t is the 1-of-|V| (one-hot-encoded) vector of w_t.

  X' = f_θ(x_t)

  Output layer               : f_θ^3 = X' = σ(W_3ᵀ f_θ^2 + b_3)
  Projection layer           : f_θ^2 = W_2ᵀ f_θ^1
  Input layer (i-th example) : f_θ^1 = x_t

Page 26: Clustering Semantically Similar Words

2.3. CONTINUOUS SKIP-GRAM MODEL: COMPOSITION OF FUNCTIONS (PROJECTION LAYER)

- W_2 is a |V|×m matrix, also known as the embedding matrix, where each row is a word vector.
- Same as in the Continuous Bag-of-Words model.

  X' = f_θ(x_t)

  Output layer               : f_θ^3 = X' = σ(W_3ᵀ f_θ^2 + b_3)
  Projection layer           : f_θ^2 = W_2ᵀ f_θ^1
  Input layer (i-th example) : f_θ^1 = x_t

Page 27: Clustering Semantically Similar Words

2.3. CONTINUOUS SKIP-GRAM MODEL: COMPOSITION OF FUNCTIONS (OUTPUT LAYER)

- W_3 is an m×2n|V| matrix.
- b_3 is a 2n|V|-dimensional vector.
- The activation function is softmax.
- X' is a 2n|V|-dimensional vector and can be written as
  X' = (p(w_{t-n}|w_t), …, p(w_{t-1}|w_t), p(w_{t+1}|w_t), …, p(w_{t+n}|w_t))

  X' = f_θ(x_t)

  Output layer               : f_θ^3 = X' = σ(W_3ᵀ f_θ^2 + b_3)
  Projection layer           : f_θ^2 = W_2ᵀ f_θ^1
  Input layer (i-th example) : f_θ^1 = x_t
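A minimal numpy sketch of the skip-gram forward pass (illustrative only; the 2n|V|-dimensional output is read as one softmax per context position, which is one way to interpret the X' above):

```python
# Minimal skip-gram forward pass: centre word w_t -> one distribution
# p(w_{t+j} | w_t) per surrounding position j.
import numpy as np

V, n, m = 10, 2, 2

W2 = np.random.randn(V, m)            # embedding matrix (one row per word)
W3 = np.random.randn(m, 2 * n * V)    # output layer weights (m x 2n|V|)
b3 = np.zeros(2 * n * V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def skipgram_forward(t_id):
    """t_id: vocabulary index of the centre word w_t."""
    v = W2[t_id]                      # projection layer: word vector of w_t
    logits = W3.T @ v + b3            # shape (2n*|V|,)
    # Split into 2n blocks of |V| logits and normalise each one.
    return [softmax(block) for block in logits.reshape(2 * n, V)]

context_distributions = skipgram_forward(3)
```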

Page 28: Clustering Semantically Similar Words

2.3. CONTINUOUS SKIP-GRAM MODEL: LOSS FUNCTION

- Where N is the number of training examples.
- The goal is to maximize this objective (the average log-likelihood).
- The neural network is trained using stochastic gradient ascent.

  L = (1/N) Σ_{i=1}^{N} log Π_{-n ≤ j ≤ n, j ≠ 0} p(w_{t+j} | w_t)^{(i)}

Page 29: Clustering Semantically Similar Words

Figure 2.3.2: The flow of the tensors of the Continuous Skip-gram Model with vocabulary size |V| and hyperparameters n = 2, m = 2. (The input x_t is projected to v and mapped to the outputs for x_{t-2}, x_{t-1}, x_{t+1}, x_{t+2}.)

Page 30: Clustering Semantically Similar Words

3. SIMILARITY METRICS: INTRODUCTION

- Recall similarity(w_c, w) ≥ t.
- Similarity metrics for words: Character-Based Similarity Measures and Term-Based Similarity Measures (Gomaa and Fahmy 2013).
- We focus on Term-Based Similarity Measures.

Page 31: Clustering Semantically Similar Words

3. SIMILARITY METRICS: INTRODUCTION

- Recall similarity(w_c, w) ≥ t.
- Similarity metrics for words: Character-Based Similarity Measures and Term-Based Similarity Measures (Gomaa and Fahmy 2013).
- We focus on Term-Based Similarity Measures: Cosine & Euclidean.

Page 32: Clustering Semantically Similar Words

3.1. SIMILARITY METRICS: COSINE

- Where v_i is a word vector.
- Range: −1 ≤ similarity(v_1, v_2) ≤ 1.
- Recommended threshold value: t ≥ 0.5.

  similarity(v_1, v_2) = (v_1 · v_2) / (‖v_1‖ ‖v_2‖)
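A minimal sketch of this metric (illustrative):

```python
# Cosine similarity between two word vectors.
import numpy as np

def cosine_similarity(v1, v2):
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```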

Page 33: Clustering Semantically Similar Words

3.2. SIMILARITY METRICS: EUCLIDEAN

- Where v_i is a word vector.
- Range: 0 ≤ similarity(v_1, v_2) ≤ 1.
- Recommended threshold value: t ≥ 0.75.

  similarity(v_1, v_2) = 1 / (1 + distance(v_1, v_2))

  distance(v_1, v_2) = √( Σ_{i=1}^{n} (v_{1,i} − v_{2,i})² )
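A minimal sketch of this metric (illustrative; it converts Euclidean distance into a score in (0, 1], so identical vectors get similarity 1):

```python
# Euclidean-distance-based similarity between two word vectors.
import numpy as np

def euclidean_similarity(v1, v2):
    distance = np.linalg.norm(np.asarray(v1) - np.asarray(v2))
    return 1.0 / (1.0 + distance)
```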

Page 34: Clustering Semantically Similar Words

4. CONSENSUS CLUSTERING: INTRODUCTION

- The basic idea here is that we want to find the centroids w_c based on consensus.
- There are 3 approaches to consensus clustering: Iterative Voting Consensus, Iterative Probabilistic Voting Consensus and Iterative Pairwise Consensus (Nguyen and Caruana 2007).
- We use a slightly modified version of Iterative Voting Consensus.

Page 35: Clustering Semantically Similar Words

4.1. CONSENSUS CLUSTERING: THE ALGORITHM

Figure 4.1.1: Iterative Voting Consensus with a slight modification.
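The figure itself is not reproduced in this transcript. As a rough reference, here is a minimal sketch of the unmodified Iterative Voting Consensus of Nguyen and Caruana (2007); the talk's slight modification is not captured here:

```python
# Rough sketch of standard Iterative Voting Consensus: combine several base
# clusterings of the same points into one consensus clustering.
import numpy as np

def iterative_voting_consensus(labelings, k, n_iter=100, seed=0):
    """labelings: (n_points, n_clusterings) int matrix of cluster labels
    from several base clusterings; k: number of consensus clusters."""
    rng = np.random.default_rng(seed)
    n_points, n_clusterings = labelings.shape
    consensus = rng.integers(0, k, size=n_points)        # random initial assignment
    for _ in range(n_iter):
        # Each consensus cluster gets a "center": the majority label per base clustering.
        centers = np.zeros((k, n_clusterings), dtype=int)
        for c in range(k):
            members = labelings[consensus == c]
            if len(members) == 0:
                continue
            for j in range(n_clusterings):
                centers[c, j] = np.bincount(members[:, j]).argmax()
        # Reassign each point to the center with the smallest Hamming distance.
        new_consensus = np.array([
            np.argmin([(labelings[i] != centers[c]).sum() for c in range(k)])
            for i in range(n_points)
        ])
        if np.array_equal(new_consensus, consensus):
            break
        consensus = new_consensus
    return consensus
```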

Page 36: Clustering Semantically Similar Words

5. CASE STUDY OR DEMO

- Let’s do this

Page 37: Clustering Semantically Similar Words

thanks! | @bayualsyah

Notes available here: https://github.com/pyk/talks