Statistical Methods for NLP LT 2202 - Göteborgs universitet · Statistical Methods for NLP LT 2202...

Statistical Methods for NLP

LT 2202

Clustering with the k-means algorithm

March 6, 2014

Richard Johansson

Clustering with the k-means algorithm

Unsupervised learning

•So far, we have seen how to train

classifiers by learning from hand-tagged

training sets

•What if there are no tagged data?

Statistical Methods for NLP Richard Johansson

•What if there are no tagged data?

•Examples:

–Discover categories of documents

–Discover syntactic relations

•This is called unsupervised learning

or clustering

Preliminaries

•We have a collection of objects (e.g.

documents)

•Assume that each object can be

represented as a point in a geometric

•Typically the points are constructed by

using word frequencies (bag of words)

–See appendix for details

k-means clustering

•Assume that there are k classes

•For every class, create a centroid: a

point that is in the center of the class

•Find centroids so that all the points in

each class are as near as possible

•Computationally hard to do exactly

k-means clustering

k-means clustering: iterative approach

1. Start with a random assignment into

clusters

2. Compute centroids of each cluster

3. Assign each point to the cluster of the

nearest centroid

4. If any change, repeat from step 2

K-means example (1)

K-means example (2)

K-means example (3)

K-means example (4)

Implementations

•k-means is simple and popular and is

implemented in many software libraries

–NLTK

–Scikit-learn

Appendix

Bag of words

• In a bag-of-words representation, we

assign one dimension for each word in

the vocabulary

•To represent a document, we build a list

of word frequencies:

[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, ...

•Very often used in document classification

Bag of words: example

“The same year, the Company bought

the Waterstone’s chain of bookshops"

•Word frequencies in this text:

the: 3

same: 1

year: 1

[0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 0, 1, 0, ...]

Implementation note: sparse vectors

• In NLP, most features are very rare

–Word, bigram features

•Assume that the feature list is

[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0]

• It is more efficient to use a sparse

representation:

Index list: [6, 12]

Value list: [1, 2]

Alternatively: list of pairs: [(6, 1), (12, 2)]

Statistical Methods for NLP LT 2202 - Göteborgs universitet · Statistical Methods for NLP LT 2202...

Documents

Transcript of Statistical Methods for NLP LT 2202 - Göteborgs universitet · Statistical Methods for NLP LT 2202...

NURSING 2202

Course Manual 2202

(Nlp) Nlp Secrets

2202 Business Model Innovation H Chesbrough

NLP-10005586(1-2018) · NLP BusinessNLP NLP Meta-Thinkingprocess BusinessNLPŽ r * -J NLP Module 1 BusinessNLP Practitioner Foundation Skills BusinessNLP (NLP Quotient) BusinessNLPŽ

FlexTop 2202 Temperature Transmitter · 2019. 10. 13. · FlexTop 2202 is embedded in silicone which makes it resistant to humid environments. FlexTop 2202, fitting into the DIN B

C:UsersgauffreDocumentsMarathon2019Marathon2019 2202 Gf … · 2019. 3. 7. · Title: C:UsersgauffreDocumentsMarathon2019Marathon2019 2202 Gf semi A3 (1) Author: GAUFFRE Created Date:

Life Changing NLP Training - Home - Dr Bridget NLP · Life Changing NLP Training ... Certified Master Practitioner of NLP (Neuro Linguistic Programming), ... NLP PRACTITIONER COURSE

Electric Kettle 2202 Booklet - sunsetcookware.co.za

55'00 2203 2202 - University of Texas at Austinlegacy.lib.utexas.edu/maps/topo/haiti/cotes_de_fer-haiti... · 2010. 1. 26. · 55'00 2203 2202 03 03 02 03 02 2203 55'oo't 2202 2205

Chemistry 2202 Midterm 2012 - Mr. Rumbolt's Course ...mrrumbolt.weebly.com/uploads/4/9/5/2/4952485/chemistry... · Web viewChemistry 2202 Midterm 2012 18 CHEMISTRY 2202 MID-YEAR EXAMINATION

NASA-STD-2202-93 Effective Date: April 1993 NASA-STD-2202-93 · NASA-STD-2202-93 Effective Date: April 1993 NASA ... IEEEStandard Glossary of Software Engineering Terminology. ...

NLP Practitioner Heart of NLP - NLP Courses

Ball and Roller Bearing Catalog, 2202

Mobi Google 2202 system 3

New Doc 2018-03-13 - WordPress.com...2202 02 tot AA 05 02 tot AA 0524) 2202 02 tot AA 05 oa 0210t AA 0533) 2202 02 tot AA 0504 02 101 AA 0542) 2202 02 tot AA 05 05 02 tot AA 0559 2202

Wi-Fi Module QFM-2202

2202 GL Trials Summary

2202 surface area.notebook - Weebly

OSHA 2202 - Construction Industry Digest