Cloud Computing - Carnegie Mellon Universitymsakr/15319-s12/lectures/Lecture09_1… · Apache...

Post on 25-Jun-2020

1 views 1 download

Transcript of Cloud Computing - Carnegie Mellon Universitymsakr/15319-s12/lectures/Lecture09_1… · Apache...

Apache MahoutFeb 13, 2012

Shannon Quinn

Cloud ComputingCS 15-319

MapReduce Review

C0 C1 C2 C3

M0 M1 M2 M3

IO0 IO1 IO2 IO3

R0 R1

FO0 FO1

chunks

mappers

Reducers

Map

Pha

seR

educ

e Ph

ase Shuffling Data

Scalable programming modelMap phaseShuffleReduce phaseMapReduceImplementations

GoogleHadoop Figure from lecture 6: MapReduce

MapReduce Review

C0 C1 C2 C3

M0 M1 M2 M3

IO0 IO1 IO2 IO3

R0 R1

FO0 FO1

chunks

mappers

Reducers

Map

Pha

seR

educ

e Ph

ase Shuffling Data

Scalable programming modelMap phaseShuffleReduce phaseMapReduceImplementations

GoogleHadoop Figure from lecture 6: MapReduce

This is our focus!

Apache Mahout

A scalable machine learning library

Apache Mahout

A scalable machine learning libraryBuilt on Hadoop

Apache Mahout

A scalable machine learning libraryBuilt on HadoopPhilosophy of Mahout (and Hadoop by proxy)

What does Mahout do?

Recommendation

Classification

Clustering

Other Mahout Algorithms

Dimensionality Reduction

Regression

Evolutionary Algorithms

Mahout

1. Recommendation

2. Classification

3. Clustering

Recommendation Overview

Help users find items they might like based on historical preferences

Recommendation Overview

Mathematically

Recommendation Overview

Alice

Bob

Peter

5 1 4

? 2 5

4 3 2 *based on example bySebastian Schelter

Recommendation Overview

5 1 4

- 2 5

4 3 2 *based on example bySebastian Schelter

Recommendation Overview

5 1 4

? 2 5

4 3 2

Bob

*based on example bySebastian Schelter

Recommendation Overview

5 1 4

1.5 2 5

4 3 2

Bob

*based on example bySebastian Schelter

Recommendation in Mahout

1st Map phase: process input

*based on example bySebastian Schelter

Recommendation in Mahout

1st Map phase: process input

1st Reduce phase: list by user

*based on example bySebastian Schelter

Recommendation in Mahout

2nd Map phase: Emit co-occurred ratings

*based on example bySebastian Schelter

Recommendation in Mahout

2nd Map phase: Emit co-occurred ratings

2nd Reduce phase: Compute similarities

*based on example bySebastian Schelter

Mahout

1. Recommendation

2. Classification

3. Clustering

Classification Overview

Assigning data to discrete categories

Classification Overview

Assigning data to discrete categoriesTrain a model on labeled data

Spam Not spam

Classification Overview

Assigning data to discrete categoriesTrain a model on labeled dataRun the model on new, unlabeled dataSpam Not spam

?

Naïve Bayes Example

Naïve Bayes Example

Prob (token | label) =

Naïve Bayes ExampleNot spam

Naïve Bayes Example

President Obama’s Nobel Prize Speech

Not spam

Naïve Bayes ExampleSpam

Naïve Bayes ExampleSpam

Spam email content

Naïve Bayes Example

Naïve Bayes Example

“Order a trial Adobe chicken dailyEAB-List new summer savings, welcome!”

Naïve Bayes in Mahout

Complex!

Naïve Bayes in Mahout

Complex!Training1. Read the features

Naïve Bayes in Mahout

Complex!Training1. Read the features2. Calculate per-document statistics

Naïve Bayes in Mahout

Complex!Training1. Read the features2. Calculate per-document statistics3. Normalize across categories

Naïve Bayes in Mahout

Complex!Training1. Read the features2. Calculate per-document statistics3. Normalize across categories4. Calculate normalizing factor of each label

Naïve Bayes in Mahout

Complex!Training1. Read the features2. Calculate per-document statistics3. Normalize across categories4. Calculate normalizing factor of each label

TestingClassification

Other Classification Algorithms

Stochastic Gradient Descent

Other Classification Algorithms

Stochastic Gradient DescentSupport Vector Machines

Other Classification Algorithms

Stochastic Gradient DescentSupport Vector MachinesRandom Forests

Mahout

1. Recommendation

2. Classification

3. Clustering

Clustering Overview

Grouping unstructured data

Clustering Overview

Grouping unstructured dataSmall intra-cluster distance

Clustering Overview

Grouping unstructured dataSmall intra-cluster distanceLarge inter-cluster distance

K-Means Clustering Example

K-Means Clustering Example

K-Means Clustering Example

K-Means Clustering Example

K-Means Clustering Example

K-Means Clustering Example

K-Means Clustering Example

K-Means Clustering Example

K-Means Clustering Example

K-Means Clustering Example

Cats

Dogs

K-Means Clustering in Mahout

+

C0 C1 C2 C3

M0 M1 M2 M3

IO0 IO1 IO2 IO3

R0 R1

FO0 FO1

chunks

mappers

Reducers

Map

Pha

seR

educ

e Ph

ase Shuffling Data

Figure from lecture 6: MapReduce

K-Means Clustering in Mahout

Assume: # clusters <<< # points

K-Means Clustering in Mahout

Assume: # clusters <<< # points

M0 M1 M2 M3

K-Means Clustering in Mahout

Assume: # clusters <<< # points

M0 M1 M2 M3

<clusterID, observation>

R0 R1

K-Means Clustering in Mahout

Map phase: assign cluster IDs

K-Means Clustering in Mahout

Map phase: assign cluster IDs

Reduce phase: reset centroids

K-Means Clustering in Mahout

Important notes--maxIter--convergenceDeltamethod

Other Clustering Algorithms

Latent Dirichlet AllocationTopic models

Other Clustering Algorithms

Latent Dirichlet AllocationTopic models

Fuzzy K-MeansPoints are assigned multiple clusters

Other Clustering Algorithms

Latent Dirichlet AllocationTopic models

Fuzzy K-MeansPoints are assigned multiple clusters

Canopy clusteringFast approximations of clusters

Other Clustering Algorithms

Latent Dirichlet AllocationTopic models

Fuzzy K-MeansPoints are assigned multiple clusters

Canopy clusteringFast approximations of clusters

Spectral clusteringTreat points as a graph

Other Clustering Algorithms

Latent Dirichlet AllocationTopic models

Fuzzy K-MeansPoints are assigned multiple clusters

Canopy clusteringFast approximations of clusters

Spectral clusteringTreat points as a graph

K-Means & Eigencuts

Mahout in Summary

Mahout in Summary

Scalable library

Mahout in Summary

Scalable libraryThree primary areas of focus

Mahout in Summary

Scalable libraryThree primary areas of focusOther algorithms

Mahout in Summary

Scalable libraryThree primary areas of focusOther algorithmsAll in your friendly neighborhood MapReduce

Mahout in Summary

Scalable libraryThree primary areas of focusOther algorithmsAll in your friendly neighborhood MapReducehttp://mahout.apache.org/

Thank you!