Statistical Methods for NLP LT 2202 - Göteborgs universitet · Statistical Methods for NLP LT 2202...

15
Statistical Methods for NLP LT 2202 Clustering with the k-means algorithm March 6, 2014 Richard Johansson Clustering with the k-means algorithm

Transcript of Statistical Methods for NLP LT 2202 - Göteborgs universitet · Statistical Methods for NLP LT 2202...

Page 1: Statistical Methods for NLP LT 2202 - Göteborgs universitet · Statistical Methods for NLP LT 2202 Clustering with the k-means algorithm March 6, 2014 Richard Johansson. Unsupervised

Statistical Methods for NLP

LT 2202

Clustering with the k-means algorithm

March 6, 2014

Richard Johansson

Clustering with the k-means algorithm

Page 2: Statistical Methods for NLP LT 2202 - Göteborgs universitet · Statistical Methods for NLP LT 2202 Clustering with the k-means algorithm March 6, 2014 Richard Johansson. Unsupervised

Unsupervised learning

•So far, we have seen how to train

classifiers by learning from hand-tagged

training sets

•What if there are no tagged data?

Statistical Methods for NLP Richard Johansson

•What if there are no tagged data?

•Examples:

–Discover categories of documents

–Discover syntactic relations

•This is called unsupervised learning

or clustering

Page 3: Statistical Methods for NLP LT 2202 - Göteborgs universitet · Statistical Methods for NLP LT 2202 Clustering with the k-means algorithm March 6, 2014 Richard Johansson. Unsupervised

Preliminaries

•We have a collection of objects (e.g.

documents)

•Assume that each object can be

represented as a point in a geometric

Statistical Methods for NLP Richard Johansson

represented as a point in a geometric

space

•Typically the points are constructed by

using word frequencies (bag of words)

–See appendix for details

Page 4: Statistical Methods for NLP LT 2202 - Göteborgs universitet · Statistical Methods for NLP LT 2202 Clustering with the k-means algorithm March 6, 2014 Richard Johansson. Unsupervised

k-means clustering

•Assume that there are k classes

•For every class, create a centroid: a

point that is in the center of the class

Statistical Methods for NLP Richard Johansson

•Find centroids so that all the points in

each class are as near as possible

•Computationally hard to do exactly

Page 5: Statistical Methods for NLP LT 2202 - Göteborgs universitet · Statistical Methods for NLP LT 2202 Clustering with the k-means algorithm March 6, 2014 Richard Johansson. Unsupervised

k-means clustering

Statistical Methods for NLP Richard Johansson

Page 6: Statistical Methods for NLP LT 2202 - Göteborgs universitet · Statistical Methods for NLP LT 2202 Clustering with the k-means algorithm March 6, 2014 Richard Johansson. Unsupervised

k-means clustering: iterative approach

1. Start with a random assignment into

clusters

Statistical Methods for NLP Richard Johansson

2. Compute centroids of each cluster

3. Assign each point to the cluster of the

nearest centroid

4. If any change, repeat from step 2

Page 7: Statistical Methods for NLP LT 2202 - Göteborgs universitet · Statistical Methods for NLP LT 2202 Clustering with the k-means algorithm March 6, 2014 Richard Johansson. Unsupervised

K-means example (1)

Statistical Methods for NLP Richard Johansson

Page 8: Statistical Methods for NLP LT 2202 - Göteborgs universitet · Statistical Methods for NLP LT 2202 Clustering with the k-means algorithm March 6, 2014 Richard Johansson. Unsupervised

K-means example (2)

Statistical Methods for NLP Richard Johansson

Page 9: Statistical Methods for NLP LT 2202 - Göteborgs universitet · Statistical Methods for NLP LT 2202 Clustering with the k-means algorithm March 6, 2014 Richard Johansson. Unsupervised

K-means example (3)

Statistical Methods for NLP Richard Johansson

Page 10: Statistical Methods for NLP LT 2202 - Göteborgs universitet · Statistical Methods for NLP LT 2202 Clustering with the k-means algorithm March 6, 2014 Richard Johansson. Unsupervised

K-means example (4)

Statistical Methods for NLP Richard Johansson

Page 11: Statistical Methods for NLP LT 2202 - Göteborgs universitet · Statistical Methods for NLP LT 2202 Clustering with the k-means algorithm March 6, 2014 Richard Johansson. Unsupervised

Implementations

•k-means is simple and popular and is

implemented in many software libraries

–NLTK

–Scikit-learn

Statistical Methods for NLP Richard Johansson

–Scikit-learn

Page 12: Statistical Methods for NLP LT 2202 - Göteborgs universitet · Statistical Methods for NLP LT 2202 Clustering with the k-means algorithm March 6, 2014 Richard Johansson. Unsupervised

Appendix

Statistical Methods for NLP Richard Johansson

Page 13: Statistical Methods for NLP LT 2202 - Göteborgs universitet · Statistical Methods for NLP LT 2202 Clustering with the k-means algorithm March 6, 2014 Richard Johansson. Unsupervised

Bag of words

• In a bag-of-words representation, we

assign one dimension for each word in

the vocabulary

•To represent a document, we build a list

Statistical Methods for NLP Richard Johansson

•To represent a document, we build a list

of word frequencies:

[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, ...

•Very often used in document classification

Page 14: Statistical Methods for NLP LT 2202 - Göteborgs universitet · Statistical Methods for NLP LT 2202 Clustering with the k-means algorithm March 6, 2014 Richard Johansson. Unsupervised

Bag of words: example

“The same year, the Company bought

the Waterstone’s chain of bookshops"

•Word frequencies in this text:

Statistical Methods for NLP Richard Johansson

the: 3

same: 1

year: 1

...

[0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 0, 1, 0, ...]

Page 15: Statistical Methods for NLP LT 2202 - Göteborgs universitet · Statistical Methods for NLP LT 2202 Clustering with the k-means algorithm March 6, 2014 Richard Johansson. Unsupervised

Implementation note: sparse vectors

• In NLP, most features are very rare

–Word, bigram features

•Assume that the feature list is

Statistical Methods for NLP Richard Johansson

[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0]

• It is more efficient to use a sparse

representation:

Index list: [6, 12]

Value list: [1, 2]

Alternatively: list of pairs: [(6, 1), (12, 2)]