
Machine Learning: Unit 5, Introduction to Artificial Intelligence, Stanford online course

Made by: Maor Levy, Temple University 2012


The unsupervised learning problem

Many data points, no labels


Google Street View

Unsupervised Learning?


K-Means

Many data points, no labels


Choose a fixed number of clusters

Choose cluster centers and point-cluster allocations to minimize error

We can't do this by exhaustive search, because there are too many possible allocations.

Algorithm (alternate two steps; a code sketch follows the objective below):

◦ Fix the cluster centers; allocate each point to the closest cluster

◦ Fix the allocation; compute the best cluster centers (the cluster means)

x could be any set of features for which we can compute a distance (careful about scaling)

K-Means

Objective (within-cluster sum of squared distances):

$$\Phi = \sum_{i \in \text{clusters}} \;\; \sum_{j \in \text{elements of the } i\text{th cluster}} \left\lVert x_j - \mu_i \right\rVert^2$$

* From Marc Pollefeys COMP 256 2003
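A minimal numpy sketch of the two alternating steps and the objective above; the names (kmeans, X, k) are illustrative, not from the course materials.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initial cluster centers
    for _ in range(n_iters):
        # Fix the centers: allocate each point to its closest cluster.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Fix the allocation: the best center under squared error is the cluster mean.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # allocations stopped changing
            break
        centers = new_centers
    return centers, labels

# Toy example: two well-separated 2-D blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centers, labels = kmeans(X, k=2)
```

Each iteration can only decrease the objective, which is why the alternation converges (to a local optimum that depends on the initial centers).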


K-Means


K-Means

* From Marc Pollefeys COMP 256 2003


K-means clustering using intensity alone and color alone

[Figure panels: original image | clusters on intensity | clusters on color]

* From Marc Pollefeys COMP 256 2003

Results of K-Means Clustering:


K-Means is an approximation to EM:
◦ Model (hypothesis space): a mixture of N Gaussians
◦ Latent variables: correspondence between data points and Gaussians

We notice:
◦ Given the mixture model, it is easy to calculate the correspondence
◦ Given the correspondence, it is easy to estimate the mixture model

K-Means


Data generated from mixture of Gaussians

Latent variables: Correspondence between Data Items and Gaussians

Expectation Maximization: Idea


Generalized K-Means (EM)


Gaussians


ML Fitting Gaussians


Learning a Gaussian Mixture (with known covariance)

E-Step: compute the expected cluster memberships given the current means,

$$E[z_{ij}] = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{n=1}^{k} p(x = x_i \mid \mu = \mu_n)} = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{k} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}$$

M-Step: re-estimate each mean from the points, weighted by those memberships,

$$\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}$$
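A short sketch of the E-step / M-step above for a 1-D mixture of k Gaussians with a known, shared variance sigma^2; the names (em_gmm, resp, sigma) are mine and purely illustrative.

```python
import numpy as np

def em_gmm(x, k, sigma, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)            # initial means
    for _ in range(n_iters):
        # E-step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2)).
        p = np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2 * sigma ** 2))
        resp = p / p.sum(axis=1, keepdims=True)          # normalize over the k Gaussians
        # M-step: mu_j = sum_i E[z_ij] x_i / sum_i E[z_ij].
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu, resp

# Toy data drawn from two Gaussians centered at -3 and +3.
x = np.concatenate([np.random.randn(100) - 3.0, np.random.randn(100) + 3.0])
means, resp = em_gmm(x, k=2, sigma=1.0)
```

With hard 0/1 responsibilities instead of soft ones, the loop reduces to K-means, which is the sense in which K-means approximates EM.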


Converges! Proof [Neal/Hinton, McLachlan/Krishnan]:

◦ Each E/M step does not decrease the data likelihood
◦ Converges to a local optimum or saddle point

But subject to local minima

Expectation Maximization


EM Clustering: Results

http://www.ece.neu.edu/groups/rpl/kmeans/


Number of clusters unknown
Suffers (badly) from local minima
Algorithm:

◦ Start a new cluster center if many points are "unexplained"

◦ Kill a cluster center that doesn't contribute

◦ (Use an AIC/BIC criterion for all this, if you want to be formal; see the sketch below)

Practical EM
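One hedged way to make the AIC/BIC suggestion concrete: fit mixtures with different numbers of components and keep the one with the lowest BIC. The sketch below uses scikit-learn's GaussianMixture purely for brevity; the slides do not prescribe any particular library.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: three 2-D blobs.
X = np.vstack([np.random.randn(100, 2) + c for c in ([0, 0], [5, 5], [0, 5])])

best_k, best_bic = None, np.inf
for k in range(1, 8):
    gm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
    bic = gm.bic(X)        # lower BIC = better fit after the complexity penalty
    if bic < best_bic:
        best_k, best_bic = k, bic
print("BIC selects k =", best_k)
```

Multiple restarts (n_init) address the local-minima problem the slide warns about; the BIC penalty addresses the unknown number of clusters.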


Spectral Clustering


The Two Spiral Problem


Spectral Clustering: Overview

Data → Similarities → Block-Detection

* Slides from Dan Klein, Sep Kamvar, Chris Manning, Natural Language Group Stanford University


Block matrices have block eigenvectors:

Near-block matrices have near-block eigenvectors: [Ng et al., NIPS 02]

Eigenvectors and Blocks

Block matrix:

    | 1  1  0  0 |                        e1 = (.71, .71, 0, 0)
    | 1  1  0  0 |   --- eigensolver -->  e2 = (0, 0, .71, .71)
    | 0  0  1  1 |
    | 0  0  1  1 |                        λ1 = 2, λ2 = 2, λ3 = 0, λ4 = 0

Near-block matrix:

    | 1    1    .2   0  |                        e1 = (.71, .69, .14, 0)
    | 1    1    0   -.2 |   --- eigensolver -->  e2 = (0, -.14, .69, .71)
    | .2   0    1    1  |
    | 0   -.2   1    1  |                        λ1 = 2.02, λ2 = 2.02, λ3 = -0.02, λ4 = -0.02

* Slides from Dan Klein, Sep Kamvar, Chris Manning, Natural Language Group Stanford University
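The example above can be reproduced numerically with numpy's symmetric eigensolver; values should match the slide up to sign, rounding, and the basis numpy chooses within a repeated eigenvalue.

```python
import numpy as np

A_block = np.array([[1, 1, 0, 0],
                    [1, 1, 0, 0],
                    [0, 0, 1, 1],
                    [0, 0, 1, 1]], dtype=float)
vals, vecs = np.linalg.eigh(A_block)
print(vals)           # 0, 0, 2, 2 (ascending)
print(vecs[:, -2:])   # top eigenvectors span (.71, .71, 0, 0) and (0, 0, .71, .71)

A_near = np.array([[1.0, 1.0, 0.2, 0.0],
                   [1.0, 1.0, 0.0, -0.2],
                   [0.2, 0.0, 1.0, 1.0],
                   [0.0, -0.2, 1.0, 1.0]])
vals, vecs = np.linalg.eigh(A_near)
print(vals)           # about -0.02, -0.02, 2.02, 2.02
print(vecs[:, -2:])   # near-block eigenvectors, close to the slide's e1 and e2
```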


Can put items into blocks by eigenvectors:

Resulting clusters independent of row ordering:

Spectral Space

    | 1    1    .2   0  |        e1 = (.71, .69, .14, 0)
    | 1    1    0   -.2 |        e2 = (0, -.14, .69, .71)
    | .2   0    1    1  |        (plot each item at its (e1, e2) coordinates)
    | 0   -.2   1    1  |

Same data with the rows and columns reordered:

    | 1    .2   1    0  |        e1 = (.71, .14, .69, 0)
    | .2   1    0    1  |        e2 = (0, .69, -.14, .71)
    | 1    0    1   -.2 |        (the same clusters appear in (e1, e2) space)
    | 0    1   -.2   1  |

* Slides from Dan Klein, Sep Kamvar, Chris Manning, Natural Language Group Stanford University


The key advantage of spectral clustering is the spectral space representation:

The Spectral Advantage

* Slides from Dan Klein, Sep Kamvar, Chris Manning, Natural Language Group Stanford University


Measuring Affinity

Intensity: $\operatorname{aff}(x, y) = \exp\!\left(-\tfrac{1}{2\sigma_i^2}\,\lVert I(x) - I(y)\rVert^2\right)$

Distance: $\operatorname{aff}(x, y) = \exp\!\left(-\tfrac{1}{2\sigma_d^2}\,\lVert x - y\rVert^2\right)$

Texture (color): $\operatorname{aff}(x, y) = \exp\!\left(-\tfrac{1}{2\sigma_t^2}\,\lVert c(x) - c(y)\rVert^2\right)$

* From Marc Pollefeys COMP 256 2003
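A small sketch of the intensity affinity above: similar intensities get affinity near 1, dissimilar ones near 0, with sigma_i setting the scale. Function and variable names are mine, for illustration only.

```python
import numpy as np

def intensity_affinity(I, sigma_i):
    """I: 1-D array of pixel intensities; returns the full n-by-n affinity matrix."""
    d2 = (I[:, None] - I[None, :]) ** 2              # squared intensity differences
    return np.exp(-d2 / (2.0 * sigma_i ** 2))

I = np.array([0.10, 0.12, 0.90, 0.88])               # two groups of similar pixels
print(intensity_affinity(I, sigma_i=0.1).round(2))   # shows the 2x2 block structure
```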


Scale affects affinity

* From Marc Pollefeys COMP 256 2003

* From Marc Pollefeys COMP 256 2003


The Space of Digits (in 2D)

M. Brand, MERL


Dimensionality Reduction with PCA


Fit multivariate Gaussian

Compute eigenvectors of Covariance

Project onto eigenvectors with largest eigenvalues

Linear: Principal Components
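A minimal sketch of the three steps above: fit a Gaussian, take the eigenvectors of its covariance, and project onto the top ones. Names (pca_project, Z) are illustrative.

```python
import numpy as np

def pca_project(X, n_components):
    mu = X.mean(axis=0)                          # fit the Gaussian's mean
    cov = np.cov(X - mu, rowvar=False)           # ... and its covariance
    vals, vecs = np.linalg.eigh(cov)             # eigenvectors of the covariance
    top = np.argsort(vals)[::-1][:n_components]  # largest eigenvalues first
    return (X - mu) @ vecs[:, top]               # project onto those directions

# Correlated toy data in 5 dimensions, reduced to 2.
X = np.random.randn(200, 5) @ np.random.randn(5, 5)
Z = pca_project(X, n_components=2)
```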


Other examples of unsupervised learning

Mean face (after alignment)

Slide credit: Santiago Serrano


Eigenfaces

Slide credit: Santiago Serrano


Isomap, Locally Linear Embedding

Non-Linear Techniques

Isomap, Science, M. Balasubramanian and E. Schwartz


SCAPE (Drago Anguelov et al.)


References:
◦ Sebastian Thrun and Peter Norvig, Artificial Intelligence, Stanford University, http://www.stanford.edu/class/cs221/notes/cs221-lecture6-fall11.pdf