
Day 8: Unsupervised learning and dimensional reduction

ME314: Introduction to Data Science and Machine Learning

LSE Methods Summer Programme 2019

8 August 2019


Day 8 Outline

Unsupervised Learning

Principal Components Analysis

Clustering


Unsupervised Learning



Unsupervised vs Supervised Learning:

- Most of this course focuses on supervised learning methods such as regression and classification.

- In that setting we observe both a set of features $X_1, X_2, \ldots, X_p$ for each object, as well as a response or outcome variable $Y$. The goal is then to predict $Y$ using $X_1, X_2, \ldots, X_p$.

- Here we instead focus on unsupervised learning, where we observe only the features $X_1, X_2, \ldots, X_p$. We are not interested in prediction, because we do not have an associated response variable $Y$.


The Goals of Unsupervised Learning

- The goal is to discover interesting things about the measurements: is there an informative way to visualize the data? Can we discover subgroups among the variables or among the observations?

- We discuss two methods:
  - principal components analysis, a tool used for data visualization or data pre-processing before supervised techniques are applied, and
  - clustering, a broad class of methods for discovering unknown subgroups in data.


The Challenge of Unsupervised Learning

- Unsupervised learning is more subjective than supervised learning, as there is no simple goal for the analysis, such as prediction of a response.

- But techniques for unsupervised learning are of growing importance in a number of fields:
  - subgroups of breast cancer patients grouped by their gene expression measurements,
  - groups of shoppers characterized by their browsing and purchase histories,
  - movies grouped by the ratings assigned by movie viewers.


Another advantage

- It is often easier to obtain unlabeled data – from a lab instrument or a computer – than labeled data, which can require human intervention.

- For example, it is difficult to automatically assess the overall sentiment of a movie review: is it favorable or not?


Principal Components Analysis

- PCA produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated.

- Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization.


Principal Components Analysis: details

- The first principal component of a set of features $X_1, X_2, \ldots, X_p$ is the normalized linear combination of the features

$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p$$

that has the largest variance. By normalized, we mean that $\sum_{j=1}^{p} \phi_{j1}^{2} = 1$.

- We refer to the elements $\phi_{11}, \ldots, \phi_{p1}$ as the loadings of the first principal component; together, the loadings make up the principal component loading vector, $\phi_1 = (\phi_{11}\ \phi_{21}\ \ldots\ \phi_{p1})^{T}$.

- We constrain the loadings so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance.
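A minimal sketch of this in code (Python with NumPy and scikit-learn, illustrative choices not taken from the slides, on synthetic data): the fitted loading vector has unit sum of squares, and its scores are the linear combination with the largest variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: n = 200 observations on p = 3 correlated features (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.5, 0.2, 0.3]])

pca = PCA().fit(X)              # scikit-learn centers the columns internally

phi1 = pca.components_[0]       # loadings of the first principal component
print("phi_1:", phi1)
print("sum of squared loadings:", np.sum(phi1 ** 2))  # equals 1 (the normalization constraint)

z1 = pca.transform(X)[:, 0]     # scores z_11, ..., z_n1 of the first principal component
print("variance of Z_1:", z1.var())
```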


PCA: example

[Figure: scatterplot of ad spending against population for 100 cities, with the two principal component directions overlaid.]

- The population size (pop) and ad spending (ad) for 100 different cities are shown as purple circles.

- The green solid line indicates the first principal component, and the blue dashed line indicates the second principal component.


Computation of Principal Components

- Suppose we have an $n \times p$ data set $X$. Since we are only interested in variance, we assume that each of the variables in $X$ has been centered to have mean zero (that is, the column means of $X$ are zero).

- We then look for the linear combination of the sample feature values of the form

$$z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \cdots + \phi_{p1} x_{ip} \tag{1}$$

for $i = 1, \ldots, n$ that has largest sample variance, subject to the constraint that $\sum_{j=1}^{p} \phi_{j1}^{2} = 1$.

- Since each of the $x_{ij}$ has mean zero, then so does $z_{i1}$ (for any values of $\phi_{j1}$). Hence the sample variance of the $z_{i1}$ can be written as $\frac{1}{n} \sum_{i=1}^{n} z_{i1}^{2}$.


Computation: continued

- Plugging in (1), the first principal component loading vector solves the optimization problem

$$\underset{\phi_{11},\ldots,\phi_{p1}}{\text{maximize}} \;\; \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{j1} x_{ij} \right)^{2} \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^{2} = 1.$$

- This problem can be solved via a singular value decomposition of the matrix $X$, a standard technique in linear algebra.

- We refer to $Z_1$ as the first principal component, with realized values $z_{11}, \ldots, z_{n1}$.
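The same computation can be done directly with the SVD; a NumPy sketch on synthetic data (an illustrative choice): the first right singular vector is the loading vector $\phi_1$, and the sample variance of its scores equals the squared first singular value divided by $n$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
Xc = X - X.mean(axis=0)                 # center each column, as assumed above

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)   # Xc = U diag(d) Vt

phi1 = Vt[0]                            # first right singular vector = loading vector phi_1
z1 = Xc @ phi1                          # realized values z_11, ..., z_n1

n = Xc.shape[0]
print("sample variance of z_1:", (z1 ** 2).mean())   # (1/n) * sum_i z_i1^2
print("d_1^2 / n:             ", d[0] ** 2 / n)       # the same quantity from the SVD
```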


Geometry of PCA

- The loading vector $\phi_1$ with elements $\phi_{11}, \phi_{21}, \ldots, \phi_{p1}$ defines a direction in feature space along which the data vary the most.

- If we project the $n$ data points $x_1, \ldots, x_n$ onto this direction, the projected values are the principal component scores $z_{11}, \ldots, z_{n1}$ themselves.


Further principal components

- The second principal component is the linear combination of $X_1, \ldots, X_p$ that has maximal variance among all linear combinations that are uncorrelated with $Z_1$.

- The second principal component scores $z_{12}, z_{22}, \ldots, z_{n2}$ take the form

$$z_{i2} = \phi_{12} x_{i1} + \phi_{22} x_{i2} + \cdots + \phi_{p2} x_{ip},$$

where $\phi_2$ is the second principal component loading vector, with elements $\phi_{12}, \phi_{22}, \ldots, \phi_{p2}$.


Further principal components: continued

- It turns out that constraining $Z_2$ to be uncorrelated with $Z_1$ is equivalent to constraining the direction $\phi_2$ to be orthogonal (perpendicular) to the direction $\phi_1$. And so on.

- The principal component directions $\phi_1, \phi_2, \phi_3, \ldots$ are the ordered sequence of right singular vectors of the matrix $X$, and the variances of the components are $\frac{1}{n}$ times the squares of the singular values. There are at most $\min(n-1, p)$ principal components.


Illustration

- USArrests data: For each of the fifty states in the United States, the data set contains the number of arrests per 100,000 residents for each of three crimes: Assault, Murder, and Rape. We also record UrbanPop (the percent of the population in each state living in urban areas).

- The principal component score vectors have length $n = 50$, and the principal component loading vectors have length $p = 4$.

- PCA was performed after standardizing each variable to have mean zero and standard deviation one.
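A sketch of this analysis in Python (scikit-learn and pandas are illustrative choices; the file USArrests.csv is a hypothetical export of R's USArrests data with columns Murder, Assault, UrbanPop, and Rape). Loading signs are arbitrary and may be flipped relative to the loading table shown a few slides later.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical CSV export of the USArrests data (50 states x 4 variables)
usarrests = pd.read_csv("USArrests.csv", index_col=0)
features = ["Murder", "Assault", "UrbanPop", "Rape"]

# Standardize each variable to mean zero and standard deviation one, then fit PCA
X = StandardScaler().fit_transform(usarrests[features])
pca = PCA().fit(X)
scores = pca.transform(X)                       # 50 x 4 matrix of principal component scores

loadings = pd.DataFrame(pca.components_.T, index=features,
                        columns=[f"PC{m + 1}" for m in range(len(features))])
print(loadings[["PC1", "PC2"]])                 # compare with the loading table shown later
print(scores.shape)                             # (50, 4): score vectors of length n = 50
```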


USArrests data: PCA plot

[Figure: biplot of the first two principal components for the USArrests data, showing the score for each state in blue and the loading vectors for Murder, Assault, UrbanPop, and Rape as orange arrows, with the loading axes on the top and right.]


Figure details

The first two principal components for the USArrests data.

- The blue state names represent the scores for the first two principal components.

- The orange arrows indicate the first two principal component loading vectors (with axes on the top and right). For example, the loading for Rape on the first component is 0.54, and its loading on the second principal component is 0.17 [the word Rape is centered at the point (0.54, 0.17)].

- This figure is known as a biplot, because it displays both the principal component scores and the principal component loadings.


                 PC1         PC2
Murder     0.5358995  -0.4181809
Assault    0.5831836  -0.1879856
UrbanPop   0.2781909   0.8728062
Rape       0.5434321   0.1673186


Another Interpretation of Principal Components



[Figure: observations plotted against the first and second principal components.]


PCA finds the hyperplane closest to the observations

- The first principal component loading vector has a very special property: it defines the line in $p$-dimensional space that is closest to the $n$ observations (using average squared Euclidean distance as a measure of closeness).

- The notion of principal components as the dimensions that are closest to the $n$ observations extends beyond just the first principal component.

- For instance, the first two principal components of a data set span the plane that is closest to the $n$ observations, in terms of average squared Euclidean distance.
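This closeness property can be checked numerically. Below is a sketch (Python with NumPy and scikit-learn on synthetic data, all illustrative) that projects centered data onto the plane spanned by the first two loading vectors and compares the average squared distance with that of a random two-dimensional plane.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))   # synthetic, correlated features
Xc = X - X.mean(axis=0)

pca = PCA(n_components=2).fit(Xc)
X_pc = pca.transform(Xc) @ pca.components_        # projection onto the first-two-PC plane
pc_dist = np.mean(np.sum((Xc - X_pc) ** 2, axis=1))

Q, _ = np.linalg.qr(rng.normal(size=(5, 2)))      # a random orthonormal 2-dimensional basis
X_rand = (Xc @ Q) @ Q.T                           # projection onto that random plane
rand_dist = np.mean(np.sum((Xc - X_rand) ** 2, axis=1))

print("avg squared distance to PC plane:    ", pc_dist)
print("avg squared distance to random plane:", rand_dist)   # at least as large as pc_dist
```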


Scaling of the variables matters

- If the variables are in different units, scaling each to have standard deviation equal to one is recommended.

- If they are in the same units, you might or might not scale the variables.

[Figure: biplots of the first two principal components for the USArrests data computed on scaled variables (left, "Scaled") and on the raw variables (right, "Unscaled").]


Proportion Variance Explained

- To understand the strength of each component, we are interested in knowing the proportion of variance explained (PVE) by each one.

- The total variance present in a data set (assuming that the variables have been centered to have mean zero) is defined as

$$\sum_{j=1}^{p} \mathrm{Var}(X_j) = \sum_{j=1}^{p} \frac{1}{n} \sum_{i=1}^{n} x_{ij}^{2},$$

and the variance explained by the $m$th principal component is

$$\mathrm{Var}(Z_m) = \frac{1}{n} \sum_{i=1}^{n} z_{im}^{2}.$$

- It can be shown that $\sum_{j=1}^{p} \mathrm{Var}(X_j) = \sum_{m=1}^{M} \mathrm{Var}(Z_m)$, with $M = \min(n-1, p)$.


Proportion Variance Explained: continued

- Therefore, the PVE of the $m$th principal component is given by the positive quantity between 0 and 1

$$\frac{\sum_{i=1}^{n} z_{im}^{2}}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^{2}}.$$

- The PVEs sum to one. We sometimes display the cumulative PVEs.
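A short sketch (Python with NumPy and scikit-learn on synthetic data, both illustrative) that computes the PVE directly from the definition above and compares it with scikit-learn's explained_variance_ratio_:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)                           # centered, as assumed in the definition

pca = PCA().fit(Xc)
Z = pca.transform(Xc)                             # principal component scores

pve_manual = (Z ** 2).sum(axis=0) / (Xc ** 2).sum()
print("PVE (from the definition):", np.round(pve_manual, 3))
print("PVE (scikit-learn):       ", np.round(pca.explained_variance_ratio_, 3))
print("cumulative PVE:           ", np.round(np.cumsum(pve_manual), 3))   # rises to 1
```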

[Figure: scree plot of the proportion of variance explained by each principal component (left) and the cumulative proportion of variance explained (right).]


How many principal components should we use?

If we use principal components as a summary of our data, how many components are sufficient?

- No simple answer to this question, as cross-validation is not available for this purpose.

- The “scree plot” on the previous slide can be used as a guide: we look for an “elbow.”
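A sketch of such a scree plot (Python with scikit-learn and matplotlib, illustrative choices, on synthetic data):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6))   # synthetic data

pve = PCA().fit(X).explained_variance_ratio_
components = np.arange(1, len(pve) + 1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(components, pve, marker="o")
ax1.set_xlabel("Principal Component")
ax1.set_ylabel("Prop. Variance Explained")           # look for an "elbow" in this panel
ax2.plot(components, np.cumsum(pve), marker="o")
ax2.set_xlabel("Principal Component")
ax2.set_ylabel("Cumulative Prop. Variance Explained")
plt.tight_layout()
plt.show()
```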


Clustering

- Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set.

- We seek a partition of the data into distinct groups so that the observations within each group are quite similar to each other.

- To make this concrete, we must define what it means for two or more observations to be similar or different.

- Indeed, this is often a domain-specific consideration that must be made based on knowledge of the data being studied.


PCA vs Clustering

- PCA looks for a low-dimensional representation of the observations that explains a good fraction of the variance.

- Clustering looks for homogeneous subgroups among the observations.


Clustering for Market Segmentation

- Suppose we have access to a large number of measurements (e.g. median household income, occupation, distance from nearest urban area, and so forth) for a large number of people.

- Our goal is to perform market segmentation by identifying subgroups of people who might be more receptive to a particular form of advertising, or more likely to purchase a particular product.

- The task of performing market segmentation amounts to clustering the people in the data set.


Two clustering methods

- In K-means clustering, we seek to partition the observations into a pre-specified number of clusters.

- In hierarchical clustering, we do not know in advance how many clusters we want; in fact, we end up with a tree-like visual representation of the observations, called a dendrogram, that allows us to view at once the clusterings obtained for each possible number of clusters, from 1 to $n$.


K-means clustering

[Figure: three panels showing K-means clustering results for K = 2, K = 3, and K = 4.]

- A simulated data set with 150 observations in 2-dimensional space.

- Panels show the results of applying K-means clustering with different values of K, the number of clusters.

- The color of each observation indicates the cluster to which it was assigned using the K-means clustering algorithm.

- Note that there is no ordering of the clusters, so the cluster coloring is arbitrary. These cluster labels were not used in clustering; instead, they are the outputs of the clustering procedure.


Details of K-means clustering

- Let $C_1, \ldots, C_K$ denote sets containing the indices of the observations in each cluster. These sets satisfy two properties:

  1. $C_1 \cup C_2 \cup \cdots \cup C_K = \{1, \ldots, n\}$. In other words, each observation belongs to at least one of the $K$ clusters.

  2. $C_k \cap C_{k'} = \emptyset$ for all $k \neq k'$. In other words, the clusters are non-overlapping: no observation belongs to more than one cluster.

- For instance, if the $i$th observation is in the $k$th cluster, then $i \in C_k$.


Details of K-means clustering: continued

- The idea behind K-means clustering is that a good clustering is one for which the within-cluster variation is as small as possible.

- The within-cluster variation for cluster $C_k$ is a measure $\mathrm{WCV}(C_k)$ of the amount by which the observations within a cluster differ from each other.

- Hence we want to solve the problem

$$\underset{C_1,\ldots,C_K}{\text{minimize}} \left\{ \sum_{k=1}^{K} \mathrm{WCV}(C_k) \right\}. \tag{2}$$

- In words, this formula says that we want to partition the observations into $K$ clusters such that the total within-cluster variation, summed over all $K$ clusters, is as small as possible.


How to define within-cluster variation?

- Typically we use Euclidean distance

$$\mathrm{WCV}(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^{2}, \tag{3}$$

where $|C_k|$ denotes the number of observations in the $k$th cluster.

- Combining (2) and (3) gives the optimization problem that defines K-means clustering,

$$\underset{C_1,\ldots,C_K}{\text{minimize}} \left\{ \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^{2} \right\}. \tag{4}$$
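A sketch (Python/NumPy on synthetic data, illustrative) that evaluates the objective in (4) for an arbitrary partition, using the pairwise-distance form of the within-cluster variation in (3):

```python
import numpy as np

def total_within_cluster_variation(X, labels):
    """Objective (4): sum over clusters of WCV(C_k) as defined in (3)."""
    total = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        diffs = Xk[:, None, :] - Xk[None, :, :]      # all pairs (i, i') within cluster k
        total += (diffs ** 2).sum() / len(Xk)        # (1/|C_k|) * sum of squared distances
    return total

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 2))
labels = rng.integers(0, 3, size=150)                # an arbitrary partition into 3 clusters
print("objective (4) for a random partition:", total_within_cluster_variation(X, labels))
```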


K-Means Clustering Algorithm

1. Randomly assign a number, from 1 to $K$, to each of the observations. These serve as initial cluster assignments for the observations.

2. Iterate until the cluster assignments stop changing:

   2.1 For each of the $K$ clusters, compute the cluster centroid. The $k$th cluster centroid is the vector of the $p$ feature means for the observations in the $k$th cluster.

   2.2 Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
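A minimal sketch of this algorithm in Python/NumPy (an illustrative implementation on synthetic data, not production code; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, K, rng, max_iter=100):
    labels = rng.integers(0, K, size=len(X))              # Step 1: random initial assignment
    for _ in range(max_iter):
        # Step 2.1: centroid of each cluster = vector of feature means
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 2.2: reassign each observation to the closest centroid (Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):             # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in ([0, 0], [4, 0], [0, 4])])
labels, centroids = kmeans(X, K=3, rng=rng)
print("cluster sizes:", np.bincount(labels))
```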


Properties of the Algorithm

- This algorithm is guaranteed to decrease the value of the objective (4) at each step. Note that

$$\frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^{2} = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^{2},$$

where $\bar{x}_{kj} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij}$ is the mean for feature $j$ in cluster $C_k$.

- However it is not guaranteed to give the global minimum.
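The identity above can be checked numerically; a short NumPy sketch on random data (illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
Xk = rng.normal(size=(20, 3))                  # the observations in one cluster C_k

# Left-hand side: (1/|C_k|) * sum over all pairs (i, i') of squared distances
diffs = Xk[:, None, :] - Xk[None, :, :]
lhs = (diffs ** 2).sum() / len(Xk)

# Right-hand side: 2 * sum of squared distances to the cluster feature means
rhs = 2 * ((Xk - Xk.mean(axis=0)) ** 2).sum()

print(lhs, rhs)                                # equal up to floating-point error
```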


K-means clustering

[Figure: six panels showing the progress of the algorithm: Data; Step 1; Iteration 1, Step 2a; Iteration 1, Step 2b; Iteration 2, Step 2a; Final Results.]


Details of Previous Figure

The progress of the K-means algorithm with K = 3.

- Top left: The observations are shown.

- Top center: In Step 1 of the algorithm, each observation is randomly assigned to a cluster.

- Top right: In Step 2(a), the cluster centroids are computed. These are shown as large colored disks. Initially the centroids are almost completely overlapping because the initial cluster assignments were chosen at random.

- Bottom left: In Step 2(b), each observation is assigned to the nearest centroid.

- Bottom center: Step 2(a) is once again performed, leading to new cluster centroids.

- Bottom right: The results obtained after 10 iterations.


Example: different starting values

[Figure: six K-means solutions on the same data, with the objective values 320.9, 235.8, 235.8, 235.8, 235.8, and 310.9 shown above the panels.]


Details of Previous Figure

- K-means clustering performed six times on the data from the previous figure with K = 3, each time with a different random assignment of the observations in Step 1 of the K-means algorithm.

- Above each plot is the value of the objective (4).

- Three different local optima were obtained, one of which resulted in a smaller value of the objective and provides better separation between the clusters.

- Those labeled in red all achieved the same best solution, with an objective value of 235.8.
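A sketch of the same experiment with scikit-learn's KMeans (an illustrative choice, on synthetic data). Note that its inertia_ is the sum of squared distances to the closest centroid, which is half of objective (4) by the identity on the Properties slide, so comparisons across runs are unaffected.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in ([0, 0], [3, 3], [0, 6])])

# Six runs, each with a single random initialization; inertia_ is the fitted objective
for seed in range(6):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    print(f"run {seed}: objective = {km.inertia_:.1f}")

# In practice, use many restarts and keep the solution with the smallest objective
best = KMeans(n_clusters=3, init="random", n_init=20, random_state=0).fit(X)
print("best of 20 restarts:", round(best.inertia_, 1))
```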


Hierarchical Clustering

- K-means clustering requires us to pre-specify the number of clusters K. This can be a disadvantage (later we discuss strategies for choosing K).

- Hierarchical clustering is an alternative approach which does not require that we commit to a particular choice of K.

- Here, we describe bottom-up or agglomerative clustering. This is the most common type of hierarchical clustering, and refers to the fact that a dendrogram is built starting from the leaves and combining clusters up to the trunk.


Hierarchical Clustering Algorithm

- Start with each point in its own cluster.

- Identify the closest two clusters and merge them.

- Repeat.

- Ends when all points are in a single cluster.
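A sketch of agglomerative clustering with SciPy (an illustrative choice), on synthetic data loosely mimicking the 45-observation example that follows:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(loc=c, scale=0.7, size=(15, 2))
               for c in ([-4, 1], [0, -1], [2, 2])])        # 45 points, 3 latent groups

# Agglomerative clustering with complete linkage and Euclidean distance
Z = linkage(X, method="complete", metric="euclidean")

dendrogram(Z)
plt.ylabel("Height")
plt.show()
```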


An Example

[Figure: scatterplot of 45 observations in two-dimensional space (X1 against X2), colored by their true class.]

- 45 observations generated in 2-dimensional space.

- In reality there are three distinct classes, shown in separate colors.

- However, we will treat these class labels as unknown and will seek to cluster the observations in order to discover the classes from the data.


Application of hierarchical clustering

[Figure: three dendrograms obtained from hierarchical clustering of these 45 observations.]


Details of previous figure

- Left: Dendrogram obtained from hierarchically clustering the data from the previous slide, with complete linkage and Euclidean distance.

- Center: The dendrogram from the left-hand panel, cut at a height of 9 (indicated by the dashed line). This cut results in two distinct clusters, shown in different colors.

- Right: The dendrogram from the left-hand panel, now cut at a height of 5. This cut results in three distinct clusters, shown in different colors. Note that the colors were not used in clustering, but are simply used for display purposes in this figure.
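A sketch of cutting a dendrogram at different heights with SciPy's fcluster (illustrative; on synthetic data the cut heights that yield two or three clusters will generally differ from the 9 and 5 used in the figure):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(loc=c, scale=0.7, size=(15, 2))
               for c in ([-4, 1], [0, -1], [2, 2])])

Z = linkage(X, method="complete", metric="euclidean")

# Cutting the tree at different heights yields different numbers of clusters
for height in (9, 5):
    labels = fcluster(Z, t=height, criterion="distance")
    print(f"cut at height {height}: {labels.max()} clusters")

# Equivalently, ask directly for a fixed number of clusters
labels3 = fcluster(Z, t=3, criterion="maxclust")
print("cluster sizes:", np.bincount(labels3)[1:])            # labels start at 1
```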


Another Example

[Figure: left, a dendrogram of nine observations; right, the corresponding raw data plotted in two-dimensional space (X1 against X2), with the observations labeled 1–9.]

- An illustration of how to properly interpret a dendrogram with nine observations in two-dimensional space. The raw data on the right was used to generate the dendrogram on the left.

- Observations 5 and 7 are quite similar to each other, as are observations 1 and 6.

- However, observation 9 is no more similar to observation 2 than it is to observations 8, 5, and 7, even though observations 9 and 2 are close together in terms of horizontal distance.

- This is because observations 2, 8, 5, and 7 all fuse with observation 9 at the same height, approximately 1.8.


Merges in previous example

[Figure: four snapshots of the nine observations (X1 against X2) illustrating successive merges of the hierarchical clustering algorithm.]


Types of Linkage

Complete: Maximal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.

Single: Minimal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities.

Average: Mean inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.

Centroid: Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. Centroid linkage can result in undesirable inversions.


Choice of Dissimilarity Measure

- So far we have used Euclidean distance.

- An alternative is correlation-based distance, which considers two observations to be similar if their features are highly correlated.

- This is an unusual use of correlation, which is normally computed between variables; here it is computed between the observation profiles for each pair of observations.

[Figure: the values of three observations plotted against variable index, illustrating observation profiles used for correlation-based distance.]
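A sketch contrasting correlation-based and Euclidean distance between observation profiles, using SciPy's pdist (an illustrative choice with made-up profiles):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(11)
base = rng.normal(size=20)                       # a common profile across 20 variables
obs1 = base + rng.normal(scale=0.1, size=20)     # similar shape and level to the base profile
obs2 = 3 * base + 10                             # same shape, very different level
obs3 = rng.normal(size=20)                       # unrelated profile
X = np.vstack([obs1, obs2, obs3])

# Correlation-based distance (1 - correlation) between the rows, versus Euclidean distance
print(np.round(squareform(pdist(X, metric="correlation")), 2))
print(np.round(squareform(pdist(X, metric="euclidean")), 1))
# Observations 1 and 2 are close in correlation distance but far apart in Euclidean distance.
```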


Scaling of the variables matters

[Figure: three bar charts of purchases of Socks and Computers, displayed on three different scales.]


Practical issues

- Should the observations or features first be standardized in some way? For instance, maybe the variables should be centered to have mean zero and scaled to have standard deviation one.

- In the case of hierarchical clustering,
  - What dissimilarity measure should be used?
  - What type of linkage should be used?

- How many clusters to choose? (in both K-means and hierarchical clustering). Difficult problem. No agreed-upon method. See Elements of Statistical Learning, chapter 13 for more details.


Conclusions

- Unsupervised learning is important for understanding the variation and grouping structure of a set of unlabeled data, and can be a useful pre-processor for supervised learning.

- It is intrinsically more difficult than supervised learning because there is no gold standard (like an outcome variable) and no single objective (like test set accuracy).

- It is an active field of research, with many recently developed tools such as self-organizing maps, independent components analysis and spectral clustering.

- See The Elements of Statistical Learning, chapter 14.