Cluster Analysis Classifying the Exoplanets. Cluster Analysis Simple idea, difficult execution Used...

Post on 16-Jan-2016


Cluster Analysis

Classifying the Exoplanets

Cluster Analysis

• Simple idea, difficult execution

• Used for indexing large amounts of data in databases. (very hot skill to have 70/hour)

• “The best form of cluster analysis is ordination, because ordination is not a form of cluster analysis.” –Morgan Byron
– No formal definition of a cluster
– Results are descriptive and subjective.

R Commands

• library("scatterplot3d")
• scatterplot3d(log(planets$mass), log(planets$period), log(planets$eccen), type = "h", angle = 55, scale.y = 0.7, pch = 16, y.ticklabs = seq(0, 10, by = 2), y.margin.add = 0.1)
– Takes the log of each data point
– Sets the angle and the physical scale so the plot looks like a box
– pch is the symbol used for the data points
– seq() generates the tick labels for the y axis
– y.margin.add adds a bit of space between the y-axis tick labels and the axis label

Interpretation

• No real insight after our first view of the data, but it looks neat.

R Commands

• rge <- apply(planets, 2, max) - apply(planets, 2, min)
– Stores the range of each column of the data
– 2 indicates the column margin of the data matrix
• planet.dat <- sweep(planets, 2, rge, FUN = "/")
– Divides each element in the matrix by the range of its column
• n <- nrow(planet.dat)
• wss <- rep(0, 10)
– Creates a vector of ten 0’s
• wss[1] <- (n-1)*sum(apply(planet.dat, 2, var))
– This is the sum of squares of all the points about the overall mean, i.e. what we get if we put all the data in 1 group.
• for (i in 2:10) wss[i] <- sum(kmeans(planet.dat, centers = i)$withinss)
– Using the kmeans method, calculates the total within-group sum of squares as the number of groups increases from 2 to 10.
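The commands above can be tried without the planets data. The following is a minimal sketch on made-up data: the data frame `toy` and its values are hypothetical stand-ins (not the real exoplanet measurements), and `nstart = 10` is an addition that was not in the original commands; it only makes the k-means result less dependent on the random starting centres.

```r
set.seed(42)

# Hypothetical stand-in for the planets data: three loose groups,
# three variables. These are NOT the real exoplanet values.
toy <- data.frame(
  mass   = c(rnorm(30, 1, 0.3),  rnorm(30, 4, 0.5),   rnorm(30, 10, 1)),
  period = c(rnorm(30, 50, 10),  rnorm(30, 600, 60),  rnorm(30, 1700, 150)),
  eccen  = c(runif(30, 0, 0.2),  runif(30, 0.3, 0.6), runif(30, 0.2, 0.5))
)

# Range of each column (margin 2 = columns), then divide each column by it
rge <- apply(toy, 2, max) - apply(toy, 2, min)
toy.dat <- sweep(toy, 2, rge, FUN = "/")

n <- nrow(toy.dat)
wss <- numeric(10)

# One group: total sum of squares about the overall column means
wss[1] <- (n - 1) * sum(apply(toy.dat, 2, var))

# Two to ten groups: total within-group sum of squares from k-means
for (k in 2:10) {
  # nstart = 10 is an assumption added for stability
  wss[k] <- sum(kmeans(toy.dat, centers = k, nstart = 10)$withinss)
}

print(round(wss, 3))
```

Plotting the result with plot(1:10, wss, type = "b") gives the kind of graph in which one would look for a sharp bend.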

The K-Means Method

• This method partitions the data so as to minimize some numerical criterion, often a notion of distance.

• The criterion used in this analysis is the sum of squares of the data within each group: for a given number of groups, we look for the partition with the lowest within-group SS.

• Checking every possible partition is impractical, because the number of partitions increases very quickly as the number of groups and data points increases.
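To get a feel for how quickly the number of partitions grows: the number of ways to split n objects into k non-empty groups is the Stirling number of the second kind, S(n, k). The sketch below computes it with the standard recurrence S(n, k) = k S(n-1, k) + S(n-1, k-1). The function name `stirling2` is made up for this illustration and is not part of the original analysis.

```r
# Number of ways to partition n objects into k non-empty groups
# (Stirling number of the second kind), for n >= 1 and k >= 1.
# stirling2 is a hypothetical helper name used only in this sketch.
stirling2 <- function(n, k) {
  # S[i + 1, j + 1] holds S(i, j); row/column 1 correspond to 0
  S <- matrix(0, nrow = n + 1, ncol = k + 1)
  S[1, 1] <- 1  # S(0, 0) = 1
  for (i in 1:n) {
    for (j in 1:min(i, k)) {
      # Either object i joins one of j existing groups,
      # or it starts a new group on its own
      S[i + 1, j + 1] <- j * S[i, j + 1] + S[i, j]
    }
  }
  S[n + 1, k + 1]
}

print(stirling2(4, 2))    # small case that can be checked by hand
print(stirling2(10, 3))
print(stirling2(101, 3))  # computed in floating point, so approximate
```

With 101 objects and 3 groups the count is on the order of 10^47, which is why k-means searches for a good partition instead of checking them all.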

The “Elbow”

• In choosing a good number of partitions, looking for the “elbow”, the sharpest angle in the graph, is an easy approach.
– The sharpest angles look to be at 3 and 5 groups.

Number of planets in the groups

• planet_kmeans3 <- kmeans(planet.dat, centers = 3)
– We chose to try 3 groups

• table(planet_kmeans3$cluster)
– 1 2 3
– 14 53 34

• ccent <- function(cl) {
– f <- function(i) colMeans(planets[cl == i, ])
• Finds the mean of each variable for cluster i
– x <- sapply(sort(unique(cl)), f)
• Sorts the cluster labels and applies f to each one
– colnames(x) <- sort(unique(cl))
– return(x) }

The results

• > ccent(planet_kmeans3$cluster)

Cluster           1            2            3
mass       10.56786    1.6710566    2.9276471
period   1693.17201  427.7105892  616.0760882
eccen       0.36650    0.1219491    0.4953529
Number           14           53           34
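The cluster-mean helper can also be tried on stand-in data. This is a sketch under stated assumptions: the data frame `toy` and the labels `cl` are made up, and unlike the original `ccent`, this version takes the data as an argument (`dat`) so that it does not depend on a global `planets` object.

```r
set.seed(1)

# Hypothetical stand-in data and made-up cluster labels (not the real planets)
toy <- data.frame(
  mass   = c(rnorm(20, 1, 0.2),   rnorm(20, 5, 0.5)),
  period = c(rnorm(20, 100, 10),  rnorm(20, 900, 50)),
  eccen  = c(runif(20, 0, 0.2),   runif(20, 0.3, 0.6))
)
cl <- rep(c(2, 1), each = 20)

# Variant of ccent that takes the data explicitly (an assumption of this sketch)
ccent <- function(dat, cl) {
  # Column means of the rows belonging to cluster i
  f <- function(i) colMeans(dat[cl == i, , drop = FALSE])
  labels <- sort(unique(cl))
  x <- sapply(labels, f)  # one column per cluster, one row per variable
  colnames(x) <- labels
  x
}

centres <- ccent(toy, cl)

# Append the cluster sizes as a last row, as in the table of results
print(rbind(centres, Number = as.vector(table(cl))))
```

The means are computed on the original, unscaled variables, so they can be read in the original units even though the clustering itself was done on range-standardized data.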

Model-Based Clustering in brief

– The subjective decision or assumption is the number of clusters.

– After that, it becomes a problem of maximizing the likelihood that a partition is the best.

Mclust function

• Mclust finds an appropriate model AND the optimal number of groups.
– Not free?!! Need a license agreement from University of Washington.

• R Commands:
– library("mclust")
– planet_mclust <- Mclust(planet.dat)
– plot(planet_mclust, planet.dat)
– print(planet_mclust)

• The best model is of diagonal clusters of varying volume and shape with 3 groups

Homework

• Spend 30 minutes attempting exercise 15.1 and send me what you get done.

• Stick it to the Man!

• Then practice your air guitar

• zweihanderdawg@gmail.com