Xebia Knowledge Exchange (March 2011) - Machine Learning with Apache Mahout
Machine Learning with Apache Mahout
Classification, Clustering and Recommendation
3/3/2011 Michaël Figuière
Machine Learning
Machine Learning is a subset of Artificial Intelligence
[Diagram: Machine Learning shown inside Artificial Intelligence]
NoSQL, Search and Machine Learning
NoSQL, Search and Machine Learning complement each other well!
[Diagram: Machine Learning, Search, NoSQL]
Machine Learning algorithms
• Recommendations: advise users with recommended items
• Classification: automatically classify documents based on a given set of examples
• Clustering: automatically discover groups within a set of documents
• Pattern mining, evolutionary algorithms, ...
Recommendation - User based
Amazon suggests articles bought
by similar customers
Recommendation - Item based
On the article page Amazon leverages item based recommendation
Similarities between users
[Diagram: users 1 and 2 linked to items among A, B, C, D, E, F]
Here we observe that users 1 and 2 have similar tastes
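One common way to put a number on "similar tastes" is the Pearson correlation between two users' ratings, which is also the similarity used in the Mahout example later in this deck. The sketch below is a plain-Java illustration, not Mahout's code: the class name `PearsonSketch` and the ratings are made up, and the correlation is computed only over items both users have rated.

```java
import java.util.Map;

class PearsonSketch {
    // Pearson correlation over the items rated by both users.
    // Returns NaN when it is undefined (fewer than 2 common items, or no variance).
    static double pearson(Map<String, Double> a, Map<String, Double> b) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) continue; // item not rated by the second user
            double x = e.getValue();
            double y = other;
            n++;
            sumA += x;
            sumB += y;
            sumAA += x * x;
            sumBB += y * y;
            sumAB += x * y;
        }
        if (n < 2) return Double.NaN;
        double cov = sumAB - sumA * sumB / n;
        double varA = sumAA - sumA * sumA / n;
        double varB = sumBB - sumB * sumB / n;
        if (varA <= 0 || varB <= 0) return Double.NaN;
        return cov / Math.sqrt(varA * varB);
    }

    public static void main(String[] args) {
        // Hypothetical ratings: user 2 rates every common item one point lower than user 1.
        Map<String, Double> user1 = Map.of("A", 5.0, "B", 3.0, "C", 4.0);
        Map<String, Double> user2 = Map.of("A", 4.0, "B", 2.0, "C", 3.0, "D", 5.0);
        System.out.println(pearson(user1, user2));
    }
}
```

A value close to 1 means the two users rank their common items the same way, a value close to -1 means they rank them in opposite ways.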
Recommendation use cases
• Advise users with items on e-commerce websites, and increase revenue
• Advise users about features they may be interested in within a Web application, as most features are usually unknown
• Filter and adapt the scoring of search engine results, based on similar users' clicks, ...
Classification
Emails classified as spam by Gmail
Classification use cases
• Automatically attach tags to documents, based on existing manual tagging, Wikipedia, ...
• Extract suspicious documents: spam, corrupted documents, ...
Clustering
Trending topics discovered by Google News
Clustering with K-Means
[Diagram: data points A, B, C, D, E, F and the cluster centers, shown over successive steps]
Cluster centers with random initial position
Data are attached to the nearest cluster center
Cluster centers are moved in order to minimize the sum of distances
The data point C is then attached to the first center as it has become the nearest
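The steps above can be sketched in plain Java. This is a minimal illustration, not Mahout's implementation: the class `KMeansSketch`, the coordinates and the initial centers are all made up so that, as on the slides, point C changes cluster once the centers have moved. It uses the usual K-Means update, which moves each center to the mean of its points (this minimizes the sum of squared distances).

```java
import java.util.Arrays;

class KMeansSketch {
    // Squared Euclidean distance between two points of the same dimension.
    static double dist2(double[] p, double[] q) {
        double sum = 0;
        for (int i = 0; i < p.length; i++) sum += (p[i] - q[i]) * (p[i] - q[i]);
        return sum;
    }

    // Runs K-Means from the given initial centers (updated in place).
    // Returns, for each point, the index of the cluster it ends up in.
    static int[] cluster(double[][] points, double[][] centers, int maxIterations) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Step 1: attach each point to its nearest center.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < centers.length; c++) {
                    if (dist2(points[i], centers[c]) < dist2(points[i], centers[best])) best = c;
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) break; // assignments are stable, nothing left to move
            // Step 2: move each center to the mean of the points attached to it.
            for (int c = 0; c < centers.length; c++) {
                double[] sum = new double[centers[c].length];
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] != c) continue;
                    count++;
                    for (int d = 0; d < sum.length; d++) sum[d] += points[i][d];
                }
                if (count == 0) continue; // an empty cluster keeps its center
                for (int d = 0; d < sum.length; d++) centers[c][d] = sum[d] / count;
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Hypothetical coordinates for the points A, B, C, D, E, F.
        double[][] points = { {1, 1}, {2, 1}, {4, 2}, {8, 5}, {9, 5}, {9, 6} };
        double[][] centers = { {0, 0}, {6, 3} }; // arbitrary initial positions
        System.out.println(Arrays.toString(cluster(points, centers, 10)));
    }
}
```

With these made-up values, C is first attached to the second center, then moves to the first one after the centers are updated.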
Clustering use cases
• Find key topics in a set of documents: news feeds, business documents, ...
• Find typical behaviors within a set of users: visit frequency, buying habits, ...
Apache Mahout
In a few words
• Implementation of machine learning algorithms in Java: a continuously growing collection of algorithms
• Most of them come in a MapReduce implementation for Hadoop: scalable to huge datasets
• Still quite young but growing fast: started in early 2009
• Intended to be for Machine Learning what Lucene is for Information Retrieval
Documentation
Recommendation example
// Load userID,itemID,preference lines from a CSV file
DataModel model = new FileDataModel(new File("data.csv"));
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
// Consider the 2 users most similar to the target user
UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
// Ask for 1 recommended item for user 1
List<RecommendedItem> recommendations = recommender.recommend(1, 1);
The code for a basic recommendation is pretty straightforward!
Classification with Mahout
[Diagram: Training examples → Training algorithm → Model; copy of the Model; New data → Model → Decision]
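The flow on this slide (training examples feed a training algorithm that produces a model, and the model then turns new data into a decision) can be illustrated with a tiny word-count Naive Bayes classifier. This is only a sketch under stated assumptions: it does not use Mahout, the class `NaiveBayesSketch` and its methods are hypothetical names, and the training sentences are invented.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class NaiveBayesSketch {
    // The "model": word counts per label, gathered from the training examples.
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Training step: record one labeled example.
    void train(String label, String text) {
        totalDocs++;
        docCounts.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : tokenize(text)) {
            if (word.isEmpty()) continue;
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    // Decision step: return the most probable label for new data (null if nothing was trained).
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            // Log prior of the label, then log likelihood of each word with add-one smoothing.
            double score = Math.log((double) docCounts.get(label) / totalDocs);
            for (String word : tokenize(text)) {
                if (word.isEmpty()) continue;
                int count = wordCounts.get(label).getOrDefault(word, 0);
                int total = totalWords.getOrDefault(label, 0);
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    static String[] tokenize(String text) {
        return text.toLowerCase().split("\\W+");
    }

    public static void main(String[] args) {
        NaiveBayesSketch model = new NaiveBayesSketch();
        model.train("spam", "win money now");
        model.train("spam", "cheap money offer");
        model.train("ham", "meeting agenda for monday");
        model.train("ham", "project status meeting");
        System.out.println(model.classify("win cheap money"));
    }
}
```

Training and decision are kept as separate methods to mirror the diagram: once trained, the same model object can be reused for every new document.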
Clustering with Mahout
[Diagram: Documents → Clustering algorithm → List of clusters]
Relevance evaluation
Data used for training
Data used to evaluate relevance of an algorithm and its settings
Entire dataset
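The split described here, where one part of the entire dataset is used for training and the rest is held back to evaluate an algorithm and its settings, might look like the following sketch. The class `HoldoutSplitSketch`, the seed parameter and the 80/20 ratio in `main` are illustrative assumptions, not something stated on the slide.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class HoldoutSplitSketch {
    // Shuffles a copy of the dataset and returns two lists: [training, evaluation].
    static <T> List<List<T>> split(List<T> dataset, double trainingFraction, long seed) {
        if (trainingFraction < 0 || trainingFraction > 1) {
            throw new IllegalArgumentException("trainingFraction must be between 0 and 1");
        }
        List<T> shuffled = new ArrayList<>(dataset);
        // A fixed seed makes the split reproducible between runs.
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainingFraction);
        List<T> training = new ArrayList<>(shuffled.subList(0, cut));
        List<T> evaluation = new ArrayList<>(shuffled.subList(cut, shuffled.size()));
        return List.of(training, evaluation);
    }

    public static void main(String[] args) {
        List<Integer> dataset = List.of(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
        List<List<Integer>> parts = split(dataset, 0.8, 42L);
        System.out.println("training size:   " + parts.get(0).size());
        System.out.println("evaluation size: " + parts.get(1).size());
    }
}
```

Keeping the evaluation part out of training is what makes the measured relevance meaningful: the algorithm is scored on data it has never seen.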
A search engine use case
A Search Engine
[Mockup: search box with the query "MyCustomer" and three results]
Document: Non Disclosure Agreement, 12 days ago ... MyCustomer agrees not to disclose any part of ...
Document: 2010 Sales Report, 1 month ago ... MyCustomer: 12 M€ with 3 deals ...
Phone Call, 2 days ago. Customer: MyCustomer, Time: 9:55am, Duration: 13min, Description: Invoice not received for order #2354E
Indexing Pipeline
[Diagram components: Text Extractor (Tika), Phone Call, Analyzer, Analyzer, Search Index (Lucene)]
A more complex Search Engine
[Mockup: search for "MyCustomer" with results and the categories Sales, Legal, Accounting]
Document: 2010 Sales Report, 1 month ago ... MyCustomer: 12 M€ with 3 deals ...
Phone Call, 2 days ago. Customer: MyCustomer, Time: 9:55am, Duration: 13min, Description: Invoice not received for order #2354E
Indexing Pipeline with Mahout
[Diagram components: Text Extractor (Tika), Phone Call, Analyzer, Analyzer, two Classifier steps (Mahout), Search Index (Lucene)]
Query pipeline
[Diagram components: Query, Analyzer, Search Index (Lucene), Results]
Query pipeline with Mahout
Using Mahout recommendations
[Diagram components: Query, Analyzer, Analyzer, Custom Scoring, Search Index (Lucene), Results]
Conclusion
• Machine learning brings a lot of valuable features for enterprises: increased revenue, better productivity, user adoption, ...
• Mahout is growing fast and is becoming a great choice for Java apps, with easy integration into business applications
• Business people are not used to this kind of use case: collaboration with technical folks is mandatory
Questions / Answers
?
@mfiguiere
blog.xebia.fr