Uploaded by dominic-golden
Cluster Analysis
Classifying the Exoplanets
Cluster Analysis
• Simple idea, difficult execution
• Used for indexing large amounts of data in databases (a very hot skill to have: 70/hour)
• “The best form of cluster analysis is ordination, because ordination is not a form of cluster analysis.” –Morgan Byron
– No formal definition of a cluster
– Results are descriptive and subjective.
R Commands
• library("scatterplot3d")
• scatterplot3d(log(planets$mass), log(planets$period), log1p(planets$eccen), type = "h", angle = 55, scale.y = 0.7, pch = 16, y.ticklabs = seq(0, 10, by = 2), y.margin.add = 0.1)
– Takes the log of each data point (log1p, i.e. log(1 + x), for eccen, because an eccentricity of 0 would give log(0) = -Inf)
– Sets the angle and the physical scale so the plot looks like a box
– pch is the symbol used for the data points
– seq() supplies the tick labels for the y axis
– y.margin.add adds a bit of space between the y-axis tick labels and the axis label
Interpretation
• No real insight after our first view of the data, but it looks neat.
R Commands
• rge <- apply(planets, 2, max) - apply(planets, 2, min)
– Stores the range of each column of the data
– 2 indicates the column margin of the data matrix
• planet.dat <- sweep(planets, 2, rge, FUN = "/")
– Divides each element in the matrix by the range of its column
• n <- nrow(planet.dat)
• wss <- rep(0, 10)
– Creates a vector of ten 0's
• wss[1] <- (n - 1) * sum(apply(planet.dat, 2, var))
– This is the sum of squares of all the points, i.e. the within-group sum of squares if we put the data in 1 group.
• for (i in 2:10) wss[i] <- sum(kmeans(planet.dat, centers = i)$withinss)
– Using the kmeans method, calculates the total within-group sum of squares for each number of groups from 2 to 10.
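A minimal sketch of the scaling step, run on simulated data rather than the real planets table (the data frame toy and its values are made up for illustration). It checks two things that the commands above rely on: after sweep() every column spans exactly one unit, and the one-group sum of squares matches what kmeans reports as totss.

```r
# Hypothetical stand-in for the planets data: three numeric columns on very
# different scales (simulated values, not real exoplanet measurements).
set.seed(1)
toy <- data.frame(mass   = rexp(50, rate = 0.3),
                  period = rexp(50, rate = 0.002),
                  eccen  = runif(50, 0, 0.9))

# Range of each column (margin 2 = columns).
rge <- apply(toy, 2, max) - apply(toy, 2, min)

# Divide every column by its own range.
toy.dat <- sweep(toy, 2, rge, FUN = "/")

# After scaling, the range of every column is 1.
scaled_rge <- apply(toy.dat, 2, max) - apply(toy.dat, 2, min)
print(scaled_rge)

# Sum of squares for a single group, computed two ways.
n <- nrow(toy.dat)
wss1 <- (n - 1) * sum(apply(toy.dat, 2, var))
totss <- kmeans(toy.dat, centers = 1)$totss
print(c(wss1 = wss1, totss = totss))
```

Scaling by the range keeps a column with large raw values, such as period, from dominating the distance calculation.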
The K-Means Method
• This method partitions the data so as to minimize a numerical criterion, often based on a notion of distance.
• The criterion used in this analysis is the within-group sum of squares (SS): for a given number of groups, we look for the partition with the lowest SS.
• Checking every possible partition is impractical, because the number of partitions grows very quickly as the number of groups and data points increases.
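To get a feel for how quickly the number of partitions grows, here is a small sketch that counts the ways to split n points into exactly k non-empty groups (the Stirling numbers of the second kind). The function name n_partitions is made up for this example; the value 101 is the total number of planets in the three-group table (14 + 53 + 34).

```r
# Number of ways to split n points into exactly k non-empty groups,
# using the recurrence S(n, k) = k * S(n - 1, k) + S(n - 1, k - 1).
# Assumes n >= 1 and k >= 1.
n_partitions <- function(n, k) {
  S <- matrix(0, nrow = n + 1, ncol = k + 1)  # S[i + 1, j + 1] holds S(i, j)
  S[1, 1] <- 1                                # S(0, 0) = 1
  for (i in 1:n) {
    for (j in 1:min(i, k)) {
      S[i + 1, j + 1] <- j * S[i, j + 1] + S[i, j]
    }
  }
  S[n + 1, k + 1]
}

print(n_partitions(4, 2))    # small case that can be checked by hand
print(n_partitions(101, 3))  # 101 planets in 3 groups (floating-point approximation)
```

With 101 points and 3 groups the count is already on the order of 10^47, which is why kmeans searches iteratively from a starting configuration instead of enumerating partitions.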
The “Elbow”
• In choosing a good number of partitions, an easy approach is to look for the “elbow”, the sharpest bend in the graph of within-group SS against the number of groups.
– The sharpest bends look to be at 3 and 5 groups.
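The elbow graph can be sketched as follows on simulated data (three made-up blobs, not the planets). Two things here are additions for the example rather than part of the slides: the nstart argument, which restarts kmeans from several random configurations, and the second-difference rule at the end, which is only one crude way to read the elbow numerically.

```r
set.seed(42)
# Hypothetical data: three well-separated blobs in two dimensions.
centers <- rbind(c(0, 0), c(5, 5), c(0, 5))
toy <- do.call(rbind, lapply(1:3, function(g)
  cbind(rnorm(30, centers[g, 1], 0.5), rnorm(30, centers[g, 2], 0.5))))

n <- nrow(toy)
wss <- rep(0, 10)
wss[1] <- (n - 1) * sum(apply(toy, 2, var))
for (i in 2:10) {
  # nstart = 20 makes the curve less sensitive to a bad random start.
  wss[i] <- sum(kmeans(toy, centers = i, nstart = 20)$withinss)
}

plot(1:10, wss, type = "b", xlab = "Number of groups",
     ylab = "Within groups sum of squares")

# Crude numeric reading of the elbow: the largest second difference of wss.
# diff(..., differences = 2)[j] is centred on j + 1 groups, hence the + 1.
elbow <- which.max(diff(wss, differences = 2)) + 1
print(elbow)
```

On real data the curve is rarely this clean, so the plot should still be inspected by eye.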
Number of planets in the groups
• planet_kmeans3 <- kmeans(planet.dat, centers = 3)
– We chose to try 3 groups
• table(planet_kmeans3$cluster)
– 1 2 3
– 14 53 34
• ccent <- function(cl) {
– f <- function(i) colMeans(planets[cl == i, ])
• Finds the column means for cluster i
– x <- sapply(sort(unique(cl)), f)
• Sorts the cluster labels and applies f to each one
– colnames(x) <- sort(unique(cl))
– return(x) }
The results
• > ccent(planet_kmeans3$cluster)
Cluster          1            2            3
mass      10.56786    1.6710566    2.9276471
period  1693.17201  427.7105892  616.0760882
eccen      0.36650    0.1219491    0.4953529
Number    14           53           34
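A table like this one can be rebuilt in a few lines. The sketch below uses simulated data with the same column names (toy is hypothetical, and so is the helper cluster_means, which follows the same idea as ccent but takes the data frame as an argument instead of reading a global). Note that the means are taken on the original, unscaled columns, while the clustering itself is done on the range-scaled data.

```r
set.seed(7)
# Hypothetical stand-in for planets: two simulated groups of 20 rows each.
toy <- data.frame(mass   = c(rnorm(20, 2, 0.3),   rnorm(20, 10, 0.3)),
                  period = c(rnorm(20, 400, 20),  rnorm(20, 1700, 20)),
                  eccen  = c(rnorm(20, 0.1, 0.01), rnorm(20, 0.4, 0.01)))

# Scale by column range, then cluster the scaled data.
rge <- apply(toy, 2, max) - apply(toy, 2, min)
toy.dat <- sweep(toy, 2, rge, FUN = "/")
toy_kmeans2 <- kmeans(toy.dat, centers = 2, nstart = 10)

# Column means of the unscaled data within each cluster.
cluster_means <- function(dat, cl) {
  labs <- sort(unique(cl))
  x <- sapply(labs, function(i) colMeans(dat[cl == i, , drop = FALSE]))
  colnames(x) <- labs
  x
}

# Append the cluster sizes as a final "Number" row.
res <- rbind(cluster_means(toy, toy_kmeans2$cluster),
             Number = as.vector(table(toy_kmeans2$cluster)))
print(res)
```

The cluster labels 1 and 2 are arbitrary, so which simulated group ends up in which column can change from run to run.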
Model-Based Clustering in brief
– The subjective decision or assumption is the number of clusters.
– After that, it becomes a problem of finding the partition and model parameters that maximize the likelihood of the data.
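As a rough illustration of the likelihood idea, here is a hand-written EM loop for a two-component normal mixture in one dimension. This is a sketch under simplifying assumptions (simulated data, one variable, a fixed number of clusters, crude starting values); it is not how any particular package implements model-based clustering.

```r
set.seed(3)
# Hypothetical 1-D data drawn from two normal populations.
x <- c(rnorm(60, mean = 0, sd = 1), rnorm(40, mean = 6, sd = 1))

k <- 2                                        # number of clusters, fixed up front
mu <- as.numeric(quantile(x, c(0.25, 0.75)))  # crude starting means
sigma <- rep(sd(x), k)                        # starting standard deviations
prop <- rep(1 / k, k)                         # starting mixing proportions
loglik <- numeric(0)

for (iter in 1:100) {
  # E-step: weighted density of every point under each component.
  dens <- sapply(1:k, function(j) prop[j] * dnorm(x, mu[j], sigma[j]))
  loglik <- c(loglik, sum(log(rowSums(dens))))
  resp <- dens / rowSums(dens)                # membership probabilities

  # M-step: re-estimate the parameters from the weighted points.
  nk <- colSums(resp)
  prop <- nk / length(x)
  mu <- colSums(resp * x) / nk
  sigma <- sqrt(colSums(resp * outer(x, mu, "-")^2) / nk)

  # Stop once the log-likelihood has essentially stopped improving.
  if (iter > 1 && diff(tail(loglik, 2)) < 1e-8) break
}

cluster <- max.col(resp)   # assign each point to its most probable component
print(round(mu, 2))
print(table(cluster))
```

Each pass should leave the log-likelihood no lower than before, and the final partition comes from assigning every point to the component with the highest membership probability.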
Mclust function
• Mclust finds an appropriate model AND the optimal number of groups.
– Not free?!! Needs a license agreement from the University of Washington.
• R Commands:
– library("mclust")
– planet_mclust <- Mclust(planet.dat)
– plot(planet_mclust, planet.dat)
– print(planet_mclust)
• The best model has diagonal clusters of varying volume and shape (VVI in mclust's naming), with 3 groups
Homework
• Spend 30 minutes attempting exercise 15.1 and send me what you get done.
• Stick it to the Man!
• Then practice your air guitar