Machine Learning and Apache Mahout : An Introduction

Post on 08-Sep-2014


An Introductory presentation on Machine Learning and Apache Mahout. I presented it at the BigData Meetup - Pune Chapter's first meetup (http://www.meetup.com/Big-Data-Meetup-Pune-Chapter/).

Transcript of Machine Learning and Apache Mahout : An Introduction

+

Varad Meru, Software Development Engineer, Orzota, Inc. (about.me/vrdmr)

Machine Learning

and Apache Mahout

© Varad Meru, 2013

+Who Am I

Orzota, Inc.: Making BigData Easy. Designing a Cloud-based platform for ETL, Analytics.

Past Work Experience: Persistent Systems Ltd. Recommendation Engines and User Behavior Analytics.

Areas of Interest: Machine Learning, Distributed Systems, Recommendation Engines

+Outline

Introduction

Machine Learning: Introduction and History, Types of Learning Algorithms, Applications, What’s New

Apache Mahout: History, Architecture, Applications and Examples

Conclusion

+

Machine Learning: Rise of the Machine-Era

+Introduction

Term coined by Arthur Samuel: “Field of study that gives computers the ability to learn without being explicitly programmed”.

Branch of Artificial Intelligence and Statistics

Focuses on prediction based on known properties

Used as a sub-process in Data Mining. Data Mining focuses on discovering new, unknown properties.

“Machine Learning is Programming Computers to optimize a Performance Criterion using Example Data or Past Experience”

+Learning Algorithms

Supervised Learning: Labelled input data. Creating classifiers to predict unseen inputs.

Unsupervised Learning: Unlabelled input data. Creating a function to predict the relation and output.

Semi-Supervised Learning: Combines Supervised and Unsupervised Learning methodology.

Reinforcement Learning: Reward-Punishment based agent.

+Supervised Learning: Introduction

Learn from the Data

Data is already labelled: expert, crowd-sourced or case-based labelling of data.

Applications
• Handwriting Recognition
• Spam Detection
• Information Retrieval
  • Personalisation based on ranks
• Speech Recognition

+Supervised Learning: Algorithms

Decision Trees

k-Nearest Neighbours

Naive Bayes

Logistic Regression

Perceptron and Multi-layer Perceptrons

Neural Networks

SVM and Kernel estimation

+Supervised Learning Example: Naive Bayes Classifier

[Figure: President Obama’s Speech’s Word Map]

+Supervised Learning Example: Naive Bayes Classifier

[Figure: A Spam Document’s Word Map]

+Supervised Learning Example: Naive Bayes Classifier

Running a test on the Classifier

[Figure: the text “Order a trial Adobe chicken daily EAB-List new summer savings, welcome!” is passed to the Classifier, which places it in the Spam bin.]
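The spam test on this slide can be tried with a minimal sketch of a multinomial Naive Bayes classifier. This is not Mahout code: the training sentences, the labels "spam" and "ham", and the helper names `tokenize`, `train` and `classify` are all assumptions invented for illustration, and Laplace (add-one) smoothing is one common choice among several.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text, split on whitespace and strip basic punctuation."""
    words = (w.strip(".,!?\"'").lower() for w in text.split())
    return [w for w in words if w]


def train(labelled_docs):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labelled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        label_total = sum(word_counts[label].values())
        for word in tokenize(text):
            score += math.log((word_counts[label][word] + 1) / (label_total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set, invented for illustration only.
training = [
    ("order a trial now and get summer savings", "spam"),
    ("new savings daily, order today", "spam"),
    ("the meetup agenda and slides are attached", "ham"),
    ("notes from the machine learning talk", "ham"),
]
word_counts, doc_counts = train(training)
label = classify(
    "Order a trial Adobe chicken daily EAB-List new summer savings, welcome!",
    word_counts,
    doc_counts,
)
print(label)  # spam
```

With this toy data most words of the test sentence appear only in the spam examples, so the spam label wins. A real classifier would need far more labelled documents.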

+Unsupervised Learning: Introduction

Finding hidden structure in data

Unlabelled Data

SMEs are needed after processing to verify, validate and use the output

Used in exploratory analysis rather than predictive analytics

Applications
• Pattern Recognition
• Groupings based on a distance measure: groups of people, objects, ...

+Unsupervised Learning: Algorithms

Clustering: k-Means, MinHash, Hierarchical Clustering

Hidden Markov Models

Feature Extraction methods

Self-organizing Maps (Neural Nets)

+Unsupervised Learning Example: K-Means

Source: http://apandre.wordpress.com/visible-data/cluster-analysis/
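As a rough illustration of what k-Means does, the sketch below runs the standard Lloyd iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points) on a handful of 2-D points. The points, the starting centroids and the function names are assumptions made up for this example, and `math.dist` requires Python 3.8 or later.

```python
import math


def closest(point, centroids):
    """Index of the centroid nearest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, centroids, max_iterations=100):
    """Lloyd's algorithm: alternate the assignment and centroid-update steps."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[closest(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return centroids, clusters


# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

The result depends on the starting centroids; practical implementations usually try several initialisations.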

+Learning Problem: Cat and Dog Problem

Humans can easily classify which is a cat and which is a dog.

But how can a computer do that?

Some attempts used clustering mechanisms to solve it: Co-occurrence Clustering, Deep Learning

+

Apache Mahout: Scalable Machine Learning Library

+History and Etymology

Inspired by “Map-Reduce for Machine Learning on Multicore”, Ng et al.

Written in Java. Apache License.

Founders
• Mahout – Isabel Drost, Grant Ingersoll, Karl Wettin
• Taste – Sean Owen

Mahout – Keeper/Driver of Elephants.

Current Release – 0.8 (stable)

+Need

BigData
• Ever-growing data. Yesterday’s methods to process tomorrow’s data
• Cheap Storage

Scalable from Ground Up
• Should be built on top of any existing Distributed Systems framework
• Should contain distributed versions of ML algorithms

Size | Classification | Tools
Lines | Sample Data | Analysis and Visualisation: Whiteboard, Bash, ...
KBs – low MBs | Prototype Data | Analysis and Visualisation: Matlab, Octave, R, Processing, Bash, ...
MBs – low GBs | Online Data | Storage: MySQL (DBs), ...; Analysis: NumPy, SciPy, Pandas, Weka, ...; Visualisation: Flare, AmCharts, Raphael
GBs – TBs – PBs | Big Data | Storage: HDFS, HBase, Cassandra, ...; Analysis: Hive, Giraph, Hama, Mahout

+Mahout Modules

Evolutionary Algorithms, Classification, Clustering, Recommenders, Regression, FPM, Dimension Reduction

Utilities: Lucene/Vectorizer

Math: Vectors/Matrices/SVD

Collections (Primitives)

Hadoop

Applications

+Recommender Systems


+Recommender Systems: Introduction

Types of Recommender Systems
• Content Based Recommendations
• Collaborative Filtering Recommendations
  • User-User Recommendations
  • Item-Item Recommendations
• Dimensionality Reduction (SVD) Recommendations

Applications
• Products you would like to buy
• People you might want to connect with
• Potential Life-Partners
• Recommending Songs you might like
• ...

+Recommender Systems: Collaborative Filtering in Action

[Figure: table of people and movies. 1: seen, 0: not seen]

Assuming people have seen at least one movie. Cold Start?

+Collaborative Filtering in Action

Tanimoto Coefficient

$$T(a, b) = \frac{N_C}{N_A + N_B - N_C}$$

N_A – Number of Customers who bought A

N_B – Number of Customers who bought B

N_C – Number of Customers who bought A and B
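The Tanimoto coefficient divides the number of customers who bought both items by the number who bought at least one of them, that is N_C / (N_A + N_B - N_C). A minimal sketch, assuming each item is represented by the set of customer ids that bought it (the ids and the function name `tanimoto` are made up for illustration):

```python
def tanimoto(buyers_a, buyers_b):
    """T(a, b) = N_C / (N_A + N_B - N_C) for two sets of customer ids."""
    n_a, n_b = len(buyers_a), len(buyers_b)
    n_c = len(buyers_a & buyers_b)  # customers who bought both A and B
    union = n_a + n_b - n_c         # customers who bought A or B
    return n_c / union if union else 0.0


# Hypothetical customer ids, invented for illustration.
bought_a = {"u1", "u2", "u3", "u4"}
bought_b = {"u3", "u4", "u5"}
print(tanimoto(bought_a, bought_b))  # 2 / (4 + 3 - 2) = 0.4
```

Returning 0.0 when nobody bought either item is a convention chosen here to avoid a division by zero.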

+Collaborative Filtering in Action

Cosine Coefficient

$$C(a, b) = \frac{N_C}{\sqrt{N_A \times N_B}}$$

N_A – Number of Customers who bought A

N_B – Number of Customers who bought B

N_C – Number of Customers who bought A and B
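For 0/1 purchase data the cosine coefficient reduces to N_C / sqrt(N_A × N_B). The sketch below computes it from the columns of a small purchase matrix; the matrix and the function name `cosine` are assumptions invented for this example.

```python
import math


def cosine(column_a, column_b):
    """C(a, b) = N_C / sqrt(N_A * N_B) for two 0/1 columns of equal length."""
    n_a = sum(column_a)
    n_b = sum(column_b)
    n_c = sum(x and y for x, y in zip(column_a, column_b))
    return n_c / math.sqrt(n_a * n_b) if n_a and n_b else 0.0


# Hypothetical purchase matrix: one row per customer, one column per item.
purchases = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
]
item_a, item_b, item_c = (list(col) for col in zip(*purchases))
print(cosine(item_a, item_b))  # N_A = 4, N_B = 3, N_C = 3
print(cosine(item_b, item_c))  # N_A = 3, N_B = 2, N_C = 1
```

Unlike the Tanimoto coefficient, the denominator here is the geometric mean of the two purchase counts, so a very popular item is penalised less for customers it does not share.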

+Apache Mahout: Recommender System Architecture

Two Modes
• Stand-alone, non-distributed (“Taste”)
• Scalable, distributed algorithmic version for Collaborative Filtering

Top-level Packages: Data Model, User Similarity, Item Similarity, User Neighbourhood, Recommender
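To show how the components named on this slide fit together, here is a small Python stand-in for a user-based recommender. It is not Mahout's Java API: the function names only mirror the package names (data model, user similarity, user neighbourhood, recommender), the data is invented, and Tanimoto similarity with similarity-weighted scoring is just one plausible choice.

```python
def user_similarity(data_model, user_a, user_b):
    """Tanimoto similarity between the item sets of two users."""
    items_a, items_b = data_model[user_a], data_model[user_b]
    union = len(items_a | items_b)
    return len(items_a & items_b) / union if union else 0.0


def user_neighbourhood(data_model, user, size):
    """The `size` users most similar to `user`."""
    others = [u for u in data_model if u != user]
    others.sort(key=lambda u: user_similarity(data_model, user, u), reverse=True)
    return others[:size]


def recommend(data_model, user, size=2, how_many=3):
    """Score unseen items by the summed similarity of the neighbours who have them."""
    scores = {}
    for neighbour in user_neighbourhood(data_model, user, size):
        weight = user_similarity(data_model, user, neighbour)
        for item in data_model[neighbour] - data_model[user]:
            scores[item] = scores.get(item, 0.0) + weight
    # Highest score first; ties broken by item id to keep the output stable.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:how_many]


# Hypothetical data model: user id -> set of item ids the user has seen.
data_model = {
    "alice": {"m1", "m2", "m3"},
    "bob": {"m1", "m2", "m4"},
    "carol": {"m2", "m3", "m5"},
    "dave": {"m6"},
}
print(recommend(data_model, "alice"))  # ['m4', 'm5']
```

Swapping `user_similarity` for another measure changes the neighbourhood and the ranking without touching the other functions, which is the point of keeping the components separate.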

+Naive Bayes Classifier

[Figure: the text “Order a trial Adobe chicken daily EAB-List new summer savings, welcome!” is passed to the Classifier.]

+Naive Bayes Classifier

Naive Bayes is a pretty complex process in Mahout: training the classifier requires four separate Hadoop jobs.

Training
1. Read the Features
2. Calculate per-Document Statistics
3. Normalize across Categories
4. Calculate normalizing factor of each label

Testing: Classification (fifth job, explicitly invoked)

+K-Means Clustering: Iterations

+K-Means Clustering: MapReduce Version
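One k-Means iteration maps naturally onto MapReduce: the mapper assigns each point to its nearest centroid, and the reducer averages the points of each cluster. The sketch below imitates that flow in a single Python process. It is an illustration under stated assumptions, not Mahout's implementation: the function names, the in-memory `shuffle` step and the sample points are all invented.

```python
import math
from collections import defaultdict


def map_phase(points, centroids):
    """Mapper: emit (index of nearest centroid, point) for every point."""
    for point in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
        yield nearest, point


def shuffle(pairs):
    """Group mapper output by key, as the framework would between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, centroids):
    """Reducer: the new centroid of each cluster is the mean of its points."""
    updated = list(centroids)  # clusters that received no points keep their centroid
    for key, members in groups.items():
        updated[key] = tuple(sum(c) / len(members) for c in zip(*members))
    return updated


# Hypothetical input; a driver would repeat this until the centroids stop moving.
points = [(1.0, 1.0), (2.0, 1.0), (9.0, 9.0), (10.0, 9.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
centroids = reduce_phase(shuffle(map_phase(points, centroids)), centroids)
print(centroids)  # [(1.5, 1.0), (9.5, 9.0)]
```

Each pass over the data corresponds to one job, so the number of iterations directly drives the cost on a cluster.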

+Summary

• Machine Learning
  • Learning Algorithms
  • Varied Applications

• Mahout
  • Scaling to Giga/Tera/Peta Scale
  • Free and Open Source

+More Info.

1. “Scalable Similarity-Based Neighborhood Methods with MapReduce” by Sebastian Schelter, Christoph Boden and Volker Markl. – RecSys 2012.

2. “Case Study Evaluation of Mahout as a Recommender Platform” by Carlos E. Seminario and David C. Wilson - Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012)

3. http://mahout.apache.org/ - Apache Mahout Project Page

4. http://www.ibm.com/developerworks/java/library/j-mahout/ - Introducing Apache Mahout

5. [VIDEO] “Collaborative filtering at scale” by Sean Owen

6. [BOOK] “Mahout in Action” by Owen et al., Manning Pub.


+

Questions?

+Thank You

Go BigData!!!

© Varad Meru, 2014