Post on 16-Jan-2016
Cluster Analysis
Classifying the Exoplanets
Cluster Analysis
• Simple idea, difficult execution
• Used for indexing large amounts of data in databases (a very hot skill to have, at around 70/hour).
• “The best form of cluster analysis is ordination, because ordination is not a form of cluster analysis.” –Morgan Byron
– No formal definition of a cluster
– Results are descriptive and subjective.
R Commands
• library("scatterplot3d")
• scatterplot3d(log(planets$mass), log(planets$period), log(planets$eccen), type = "h", angle = 55, scale.y = 0.7, pch = 16, y.ticklabs = seq(0, 10, by = 2), y.margin.add = 0.1)
– Taking the log of each data point
– Setting the angle and the physical scale so it looks like a box
– pch is the symbol used for the data point
– seq() generates the y-axis tick labels (0 to 10 in steps of 2)
– y.margin.add adds a bit of space between the y-axis tick labels and the axis label
Interpretation
• No real insight after our first view of the data, but it looks neat.
R Commands
• rge <- apply(planets, 2, max) - apply(planets, 2, min)
– Stores the range of each variable
– 2 indicates the column margin of the data matrix
• planet.dat <- sweep(planets, 2, rge, FUN = "/")
– Divides each element in the matrix by the range of its column
• n <- nrow(planet.dat)
• wss <- rep(0, 10)
– Creates a vector of ten 0’s
• wss[1] <- (n - 1) * sum(apply(planet.dat, 2, var))
– This is the sum of squares of all the points, i.e. the within-group sum of squares if we put the data in 1 group.
• for (i in 2:10) wss[i] <- sum(kmeans(planet.dat, centers = i)$withinss)
– Using the k-means method, calculates the total within-group sum of squares as the number of groups goes from 2 to 10.
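The scaling and sum-of-squares steps above can be tried out on made-up data. This is only a sketch: the toy data frame is a hypothetical stand-in for planets, the loop stops at 6 groups to keep it short, and nstart = 10 is an added assumption (several random starts) that the original commands do not use.

```r
# Hypothetical stand-in for the planets data: three columns on very different scales
set.seed(1)
toy <- data.frame(
  mass   = c(rnorm(20, 1, 0.2),    rnorm(20, 10, 1)),
  period = c(rnorm(20, 400, 50),   rnorm(20, 1700, 100)),
  eccen  = c(rnorm(20, 0.1, 0.02), rnorm(20, 0.4, 0.05))
)

# Range of each column (margin 2 = columns), then divide each column by its range
rge <- apply(toy, 2, max) - apply(toy, 2, min)
toy.dat <- sweep(toy, 2, rge, FUN = "/")

# After scaling, every column spans exactly 1 unit
scaled_range <- apply(toy.dat, 2, max) - apply(toy.dat, 2, min)

# Within-group sum of squares for 1 to 6 groups
n <- nrow(toy.dat)
wss <- rep(0, 6)
wss[1] <- (n - 1) * sum(apply(toy.dat, 2, var))
# nstart = 10 is an assumption: it reruns k-means from 10 random starts
for (i in 2:6) wss[i] <- sum(kmeans(toy.dat, centers = i, nstart = 10)$withinss)
```

Dividing by the range keeps period (hundreds to thousands) from swamping eccen (below 1) in the distance calculations.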
The K-Means Method
• This method partitions the data into a chosen number of groups so as to minimize a numerical criterion, often a notion of distance.
• The criterion used in this analysis is the sum of squares (SS) of the data within each group: for each candidate number of groups, we look for the partition with the lowest SS.
• Checking every possible partition is impractical, because the number of partitions increases very quickly as the number of groups and data points increases.
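To get a feel for how quickly the number of partitions grows, here is a small sketch that counts the ways of splitting n points into exactly k non-empty groups (a Stirling number of the second kind). The helper name n_partitions is made up for illustration; 101 is the number of planets in the group counts reported in these slides.

```r
# Number of ways to split n points into exactly k non-empty groups,
# computed with the inclusion-exclusion formula (floating point, so approximate for large n)
n_partitions <- function(n, k) {
  j <- 0:k
  sum((-1)^(k - j) * choose(k, j) * j^n) / factorial(k)
}

small <- n_partitions(10, 3)    # 10 points, 3 groups: 9330 partitions
big   <- n_partitions(101, 3)   # 101 planets, 3 groups: on the order of 10^47
```

Even with only 3 groups, trying every partition of 101 planets is out of the question, which is why k-means searches iteratively instead.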
The “Elbow”
• An easy approach to choosing a good number of groups is to look for the “elbow”, the sharpest angle in the plot of within-group SS against the number of groups.
– The sharpest angles look to be at 3 and 5 groups.
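The elbow plot itself can be sketched as follows. The data here are hypothetical (three well-separated groups in two dimensions), and nstart = 20 is an added assumption, so the picture will not match the planets plot; it only shows the mechanics.

```r
# Hypothetical toy data: three well-separated groups in two dimensions
set.seed(42)
centers <- rbind(c(0, 0), c(5, 5), c(0, 5))
toy <- do.call(rbind, lapply(1:3, function(g)
  cbind(rnorm(30, centers[g, 1], 0.5), rnorm(30, centers[g, 2], 0.5))))

# Within-group sum of squares for 1 to 8 groups
n <- nrow(toy)
wss <- rep(0, 8)
wss[1] <- (n - 1) * sum(apply(toy, 2, var))
for (i in 2:8) wss[i] <- kmeans(toy, centers = i, nstart = 20)$tot.withinss

# Plot SS against the number of groups and look for the sharpest bend
plot(1:8, wss, type = "b", xlab = "Number of groups",
     ylab = "Within groups sum of squares")

# Drop in SS gained by each extra group; the drops flatten out after the elbow
drops <- -diff(wss)
```

With data this clean the elbow sits at 3 groups. Real data such as the planets rarely give so clear a bend, which is why more than one candidate can show up.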
Number of planets in the groups
• planet_kmeans3 <- kmeans(planet.dat, centers = 3)
– We chose to try 3 groups
• table(planet_kmeans3$cluster)
–  1  2  3
– 14 53 34
• ccent <- function(cl) {
– f <- function(i) colMeans(planets[cl == i, ])
• Finds the mean of each variable within cluster i
– x <- sapply(sort(unique(cl)), f)
• Applies f to each cluster label, in sorted order
– colnames(x) <- sort(unique(cl))
– return(x) }
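To see what ccent returns, here is a sketch of a variant that takes the data as an argument instead of reading the global planets. The name cluster_means and the toy data are made up for illustration.

```r
# Variant of ccent that takes the data as an argument (hypothetical helper)
cluster_means <- function(dat, cl) {
  labels <- sort(unique(cl))
  # One column of variable means per cluster label
  x <- sapply(labels, function(i) colMeans(dat[cl == i, , drop = FALSE]))
  colnames(x) <- labels
  x
}

# Hypothetical toy data with a known grouping
toy <- data.frame(mass = c(1, 3, 10, 12), eccen = c(0.1, 0.3, 0.5, 0.7))
cl  <- c(2, 2, 1, 1)

res <- cluster_means(toy, cl)
# Column "1" holds the means of rows 3 and 4, column "2" those of rows 1 and 2
```

Note that the means are taken on the unscaled data, so the results read in the original units.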
The results
• > ccent(planet_kmeans3$cluster)
Cluster           1           2           3
mass       10.56786   1.6710566   2.9276471
period   1693.17201 427.7105892 616.0760882
eccen       0.36650   0.1219491   0.4953529
Number           14          53          34
Model-Based Clustering in brief
– The subjective decision or assumption is the number of clusters.
– After that, it becomes a problem of finding the partition that maximizes the likelihood of the data.
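A much simplified sketch of the likelihood idea: score a partition by the log-likelihood of the data when each group is treated as its own Gaussian. The function name, the 1-D toy data, and the one-Gaussian-per-group setup are all assumptions made for illustration; this is not how Mclust works internally.

```r
# Log-likelihood of 1-D data under a given partition, assuming each group is
# Gaussian with its own mean and standard deviation (hypothetical helper)
partition_loglik <- function(x, cl) {
  sum(sapply(split(x, cl), function(g) {
    s <- sqrt(mean((g - mean(g))^2))   # ML estimate of sd (divides by n, not n - 1)
    sum(dnorm(g, mean = mean(g), sd = s, log = TRUE))
  }))
}

# Hypothetical toy data: two clearly separated groups
x <- c(1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1)

good <- c(1, 1, 1, 1, 2, 2, 2, 2)   # matches the visible grouping
bad  <- c(1, 2, 1, 2, 1, 2, 1, 2)   # mixes the two groups

ll_good <- partition_loglik(x, good)
ll_bad  <- partition_loglik(x, bad)   # lower than ll_good
```

The partition that matches the visible grouping gets the higher log-likelihood, which is the sense in which one partition is "better" than another.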
Mclust function
• Mclust finds an appropriate model AND the optimal number of groups.
– Not free?!! Need a license agreement from the University of Washington.
• R Commands:
– library("mclust")
– planet_mclust <- Mclust(planet.dat)
– plot(planet_mclust, planet.dat)
• Recent mclust versions take no data argument here; use e.g. plot(planet_mclust, what = "BIC")
– print(planet_mclust)
• The best model is of diagonal clusters of varying volume and shape with 3 groups
Homework
• Spend 30 minutes attempting exercise 15.1 and send me what you get done.
• Stick it to the Man!
• Then practice your air guitar
• zweihanderdawg@gmail.com