Introducing Apache Mahout
![Page 1: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/1.jpg)
Introducing Apache Mahout
Scalable Machine Learning for All!
Grant Ingersoll
Lucid Imagination
![Page 2: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/2.jpg)
Overview
• What is Machine Learning?
• Mahout
![Page 3: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/3.jpg)
Definition
• “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
– Intro. to Machine Learning by E. Alpaydin
• Subset of Artificial Intelligence
– Many other fields: comp sci., biology, math, psychology, etc.
![Page 4: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/4.jpg)
Types
• Supervised
– Using labeled training data, create function that predicts output of unseen inputs
• Unsupervised
– Using unlabeled data, create function that predicts output
• Semi-Supervised
– Uses labeled and unlabeled data
![Page 5: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/5.jpg)
Characterizations
• Lots of Data
• Identifiable Features in that Data
• Too big/costly for people to handle
– People still can help
![Page 6: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/6.jpg)
Clustering
• Unsupervised
• Find Natural Groupings
– Documents
– Search Results
– People
– Genetic traits in groups
– Many, many more uses
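
To make "find natural groupings" concrete, here is a minimal k-means sketch in plain Python. It is an illustration only, not Mahout's implementation: the function names, the sample points, and the choice to seed the centroids with the first k points are all assumptions made for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=20):
    """Group points into k clusters (hypothetical helper, not a Mahout API).

    Assumption: the first k points serve as the initial centroids, which
    keeps the sketch deterministic. Real implementations usually pick the
    starting centroids more carefully.
    """
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster,
        # keeping the old centroid if the cluster ended up empty.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # assignments can no longer change
        centroids = updated
    return centroids, clusters


# Made-up 2-D points forming two visibly separate groups.
sample_points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (9.2, 8.8), (0.8, 1.1), (8.9, 9.1)]
sample_centroids, sample_clusters = k_means(sample_points, k=2)
print(sample_clusters)
```

No labels are supplied anywhere, which is what makes this unsupervised: the groups come only from distances between the points.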
![Page 7: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/7.jpg)
Example: Clustering
Google News
![Page 8: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/8.jpg)
Collaborative Filtering
• Unsupervised
• Recommend people and products
– User-User
• User likes X, you might too
– Item-Item
• People who bought X also bought Y
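
The item-item idea ("people who bought X also bought Y") can be sketched with simple co-occurrence counts. This is a toy illustration, not how Taste works internally: the function names, the purchase data, and the scoring by raw co-occurrence are assumptions for the example.

```python
from collections import defaultdict
from itertools import permutations


def build_cooccurrence(baskets):
    """Count how often each ordered pair of items appears for the same user."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in baskets.values():
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts


def recommend(user, baskets, counts, top_n=3):
    """Suggest items that co-occur most with what the user already has."""
    owned = set(baskets[user])
    scores = defaultdict(int)
    for item in owned:
        for other, n in counts[item].items():
            if other not in owned:
                scores[other] += n
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Made-up purchase histories.
purchases = {
    "alice": ["book", "lamp"],
    "bob": ["book", "lamp", "mug"],
    "carol": ["book", "mug"],
    "dave": ["book"],
}
cooccurrence = build_cooccurrence(purchases)
print(recommend("dave", purchases, cooccurrence))  # ['lamp', 'mug']
```

A user-user variant would instead compare whole purchase histories between users and borrow items from the most similar ones.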
![Page 9: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/9.jpg)
Example: Collab Filtering
Amazon.com
![Page 10: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/10.jpg)
Classification/Categorization
• Many, many types
• Spam Filtering
• Named Entity Recognition
• Phrase Identification
• Sentiment Analysis
• Classification into a Taxonomy
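
As one illustration of the spam-filtering case, the sketch below trains a small multinomial Naïve Bayes classifier with add-one smoothing. It is a from-scratch toy, not Mahout's classifier: the function names, the whitespace tokenization, and the four training messages are assumptions for the example.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Build per-label document and word counts from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior: how common the label is in the training data.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up labeled training data.
training_data = [
    ("win cash prize now", "spam"),
    ("cheap pills win big", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
spam_model = train(training_data)
print(classify("win a cash prize", spam_model))  # spam
```

The labeled examples are what make this supervised: the model predicts labels for messages it has not seen before.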
![Page 11: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/11.jpg)
Example: NER
NER?
Excerpt from Yahoo News
![Page 12: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/12.jpg)
Example: Categorization
![Page 13: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/13.jpg)
Info. Retrieval
• Learning Ranking Functions
• Learning Spelling Corrections
• User Click Analysis and Tracking
![Page 14: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/14.jpg)
Other
• Image Analysis
• Robotics
• Games
• Higher level natural language processing
• Many, many others
![Page 15: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/15.jpg)
What is Apache Mahout?
• A Mahout is an elephant trainer/driver/keeper, hence…
Hadoop (and other distributed techniques) + Machine Learning = Mahout
![Page 16: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/16.jpg)
What?
• Hadoop brings:
– Map/Reduce API
– HDFS
– In other words, scalability and fault-tolerance
• Mahout brings:
– Library of machine learning algorithms
– Examples
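
For readers new to the Map/Reduce model, here is a single-process word-count sketch of its three phases. It only imitates the programming model in plain Python: it does not use the Hadoop API, and the function names are assumptions. The distribution across machines, HDFS storage, and fault tolerance that Hadoop adds are not shown.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence inside one process."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Made-up input documents.
docs = ["mahout rides hadoop", "hadoop stores data", "mahout learns from data"]
print(word_count(docs))
```

Because each map call looks at one document and each reduce call looks at one key, both phases can be spread over many machines, which is where the scalability comes from.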
![Page 17: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/17.jpg)
Why Mahout?
• Many Open Source ML libraries either:
– Lack Community
– Lack Documentation and Examples
– Lack Scalability
– Lack the Apache License ;-)
– Or are research-oriented
![Page 18: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/18.jpg)
Why Mahout?
• Intelligent Apps are the Present and Future
• Thus, Mahout’s Goal is:
– Scalable Machine Learning with Apache License
![Page 19: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/19.jpg)
Current Status
• What’s in it:
– Simple Matrix/Vector library
– Taste Collaborative Filtering
– Clustering
• Canopy/K-Means/Fuzzy K-Means/Mean-shift/Dirichlet
– Classifiers
• Naïve Bayes
• Complementary NB
– Evolutionary
• Integration with Watchmaker for fitness function
![Page 20: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/20.jpg)
How?
• Examples
– Taste
– Clustering
– Classification
– Evolutionary
![Page 21: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/21.jpg)
Taste: Movie Recommendations
• Given ratings by users of movies, recommend other movies
• http://lucene.apache.org/mahout/taste.html#demo
![Page 22: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/22.jpg)
Taste Demo
• http://localhost:8080/mahout-taste-webapp/RecommenderServlet?userID=12&debug=true
• http://localhost:8080/mahout-taste-webapp/RecommenderServlet?userID=43&debug=true
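
The two demo URLs differ only in the `userID` query parameter. A small helper like the one below can build them for other users. The helper name is hypothetical, and the sketch assumes the web app is deployed locally at the address shown on the slide. It only constructs the URL, since the slides do not describe the servlet's response format.

```python
from urllib.parse import urlencode

# Address taken from the slide; assumes a local deployment of the demo web app.
BASE_URL = "http://localhost:8080/mahout-taste-webapp/RecommenderServlet"


def recommender_url(user_id, debug=True):
    """Build a demo request URL for one user (hypothetical helper)."""
    query = {"userID": user_id}
    if debug:
        query["debug"] = "true"
    return BASE_URL + "?" + urlencode(query)


print(recommender_url(12))
print(recommender_url(43))
```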
![Page 23: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/23.jpg)
Clustering: Synthetic Control Data
• http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series
• Each clustering impl. has an example Job for running in <MAHOUT_HOME>/examples
– o.a.mahout.clustering.syntheticcontrol.*
• Outputs clusters…
![Page 24: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/24.jpg)
Classification: NB and CNB Examples
• 20 Newsgroups
– http://cwiki.apache.org/confluence/display/MAHOUT/TwentyNewsgroups
• Wikipedia
– http://cwiki.apache.org/confluence/display/MAHOUT/WikipediaBayesExample
![Page 25: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/25.jpg)
Evolutionary
• Traveling Salesman
– http://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman
• Class Discovery
– http://cwiki.apache.org/confluence/display/MAHOUT/Class+Discovery
![Page 26: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/26.jpg)
What’s Next?
• More Examples
• Winnow/Perceptron (MAHOUT-85)
• Text Clustering
• Association Rules (MAHOUT-108)
• Logistic Regression
• Solr Integration (SOLR-769)
• GSOC
![Page 27: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/27.jpg)
When, Who
• When? Now!
– Mahout is growing
• Who? You!
– We want programmers who:
• Are comfortable with math
• Like to work on hard problems
– We want others to:
• Kick the tires
![Page 28: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/28.jpg)
Where?
• http://lucene.apache.org/mahout
– Hadoop - http://hadoop.apache.org
• http://cwiki.apache.org/MAHOUT
• mahout-{user|dev}@lucene.apache.org
– http://www.lucidimagination.com/search/p:mahout
![Page 29: Introducing Apache Mahout](https://reader035.fdocuments.us/reader035/viewer/2022062321/568134da550346895d9c0bd7/html5/thumbnails/29.jpg)
Resources
• “Programming Collective Intelligence” by Segaran
• “Data Mining - Practical Machine Learning Tools and Techniques” by Witten and Frank
• “Taming Text” by Ingersoll and Morton