Apache Mahout. Mahout Introduction Machine Learning Clustering K-means Canopy Clustering Fuzzy...
Uploaded by amanda-copeland
• Mahout Introduction
• Machine Learning
• Clustering
• K-means
• Canopy Clustering
• Fuzzy K-Means
• Conclusion
What is Mahout?
• Distributed machine learning libraries
– “scalable to reasonably large data sets”
– Runs on Hadoop
What?
• Hadoop brings:
– Map/Reduce API
– HDFS
– In other words, scalability and fault-tolerance
• Mahout brings:
– Library of machine learning algorithms
– Examples
Why Mahout?
• Many Open Source ML libraries either:
– Lack Community
– Lack Documentation and Examples
– Lack Scalability
– Lack the Apache License ;-)
– Or are research-oriented
Clustering
• Unsupervised
• Find Natural Groupings
– Documents
– Search Results
– People
– Genetic traits in groups
– Many, many more uses
Types
• Supervised
– Using labeled training data, create a function that predicts the output of unseen inputs
• Unsupervised
– Using unlabeled data, create a function that predicts output (such as a cluster assignment)
• Semi-Supervised
– Uses labeled and unlabeled data
K-means Algorithm
1) Pick a number (k) of cluster centers
2) Assign every element to its nearest cluster center
3) Move each cluster center to the mean of its assigned elements
4) Repeat 2-3 until convergence
Figure 1: K-means algorithm. Training examples are shown as dots, and cluster centroids are shown as crosses.
K-means Example
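As a concrete illustration, steps 1 to 4 of the K-means algorithm can be sketched in plain Python. This is a single-machine sketch, not Mahout's Map/Reduce implementation. The function names (`kmeans`, `squared_distance`, `mean`), the use of squared Euclidean distance, the random choice of initial centers from the data, and the "assignments stopped changing" convergence test are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centers, assignment) for a list of point tuples.

    Assumption: initial centers are k distinct points drawn at random.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 1: pick k centers
    assignment = None
    for _ in range(max_iter):
        # step 2: assign every point to its nearest center
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:     # step 4: converged
            break
        assignment = new_assignment
        # step 3: move each center to the mean of its assigned points
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                      # keep old center if cluster is empty
                centers[c] = mean(members)
    return centers, assignment


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centers, assignment = kmeans(data, k=2)
    print(centers)
    print(assignment)
```

The result depends on the initial centers, which is one reason a cheap pre-clustering pass is sometimes used to choose them.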
Canopy Clustering
• Canopy Clustering is a very simple, fast and surprisingly accurate method for grouping objects into clusters.
Define two thresholds
Tight: T1
Loose: T2
Put all records into a set S
While S is not empty
  Remove any record r from S and create a canopy centered at r
  For each other record ri, compute cheap distance d from r to ri
    If d < T2, place ri in r’s canopy
    If d < T1, remove ri from S
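The pseudocode above can be sketched in Python as follows. It follows the slide's naming, with T1 as the tight threshold and T2 as the loose one (so T1 < T2). Several details are assumptions for the example: the names `canopy_clustering` and `manhattan`, Manhattan distance as the "cheap distance", taking the first remaining record as r so the result is deterministic, and comparing r only against records still in S.

```python
def manhattan(a, b):
    """A cheap distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def canopy_clustering(records, t1, t2, distance=manhattan):
    """Return a list of (center, members) canopies.

    t1 is the tight threshold, t2 the loose one, as on the slide.
    """
    if not t1 < t2:
        raise ValueError("expected tight threshold t1 < loose threshold t2")
    remaining = list(records)                # the set S
    canopies = []
    while remaining:
        center = remaining.pop(0)            # remove a record r from S
        members = [center]
        still_remaining = []
        for record in remaining:
            d = distance(center, record)
            if d < t2:                       # loosely close: join r's canopy
                members.append(record)
            if d >= t1:                      # not tightly close: stays in S
                still_remaining.append(record)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


if __name__ == "__main__":
    data = [(0.0,), (0.5,), (1.5,), (10.0,), (10.4,)]
    for center, members in canopy_clustering(data, t1=1.0, t2=2.0):
        print(center, members)
```

A record that falls between the two thresholds joins the canopy but stays in S, so canopies may overlap and that record can later become a canopy center itself.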
Canopy Clustering
SequenceFile (WritableComparable, VectorWritable)
Invocation using the command line takes the form:
Fuzzy K-Means
Fuzzy K-Means (also called Fuzzy C-Means) is an extension of K-Means, the popular simple clustering technique. Where K-Means assigns each point to exactly one cluster, Fuzzy K-Means lets a point belong to more than one cluster, each with some probability.
Like K-Means, Fuzzy K-Means works on objects that can be represented in an n-dimensional vector space and for which a distance measure is defined. The algorithm is similar to K-Means.
Initialize k clusters
Until converged
  Compute the probability of a point belonging to a cluster, for every (point, cluster) pair
  Re-compute the cluster centers using the above probability membership values of points to clusters
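The loop above can be sketched in plain Python using the standard Fuzzy C-Means update rules. The slide does not give the formulas, so the following are assumptions for the example: a fuzziness parameter m > 1, memberships proportional to d^(-2/(m-1)) where d is the Euclidean distance to each center, centers recomputed as means weighted by membership^m, and convergence declared when no center moves more than `tol`. The function names are hypothetical, and this is a single-machine sketch rather than Mahout's implementation.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def memberships(point, centers, m):
    """Membership of one point in every cluster; the values sum to 1."""
    dists = [euclidean(point, c) for c in centers]
    zeros = [j for j, d in enumerate(dists) if d == 0.0]
    if zeros:  # point sits exactly on one or more centers
        return [1.0 / len(zeros) if j in zeros else 0.0
                for j in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    inverse = [d ** -exponent for d in dists]
    total = sum(inverse)
    return [v / total for v in inverse]


def fuzzy_kmeans(points, k, m=2.0, tol=1e-6, max_iter=200, seed=0):
    """Return (centers, membership_rows), one membership row per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # initialize k clusters
    dims = len(points[0])
    for _ in range(max_iter):
        # membership of every point in every cluster
        u = [memberships(p, centers, m) for p in points]
        # re-compute each center as a membership-weighted mean
        new_centers = []
        for j in range(k):
            weights = [row[j] ** m for row in u]
            total = sum(weights)
            new_centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dims)))
        shift = max(euclidean(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:                      # converged
            break
    return centers, [memberships(p, centers, m) for p in points]


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centers, u = fuzzy_kmeans(data, k=2)
    print(centers)
    for row in u:
        print([round(v, 3) for v in row])
```

As m approaches 1 the memberships become nearly hard assignments and the behaviour approaches ordinary K-Means; larger m gives softer clusters.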
Conclusion
• Mahout did not scale well
• Mahout was not easy to learn
• Mahout was not easily modifiable
• For performance and efficiency, it is better to
– Understand the data set
– Understand data mining
– Understand the methodology