Knitting boar - Toronto and Boston HUGs - Nov 2012


Transcript of Knitting boar - Toronto and Boston HUGs - Nov 2012

Page 1: Knitting boar - Toronto and Boston HUGs - Nov 2012

KNITTING BOAR
Machine Learning, Mahout, and Parallel Iterative Algorithms

Josh Patterson, Principal Solutions Architect

Page 2: Knitting boar - Toronto and Boston HUGs - Nov 2012

Hello

✛ Josh Patterson
  > Master’s Thesis: self-organizing mesh networks
    ∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
  > Conceived, built, and led Hadoop integration for the openPDC project at the Tennessee Valley Authority (TVA)
  > Twitter: @jpatanooga
  > Email: [email protected]

Page 3: Knitting boar - Toronto and Boston HUGs - Nov 2012

Outline

✛ Introduction to Machine Learning
✛ Mahout
✛ Knitting Boar and YARN
✛ Parting Thoughts

Page 4: Knitting boar - Toronto and Boston HUGs - Nov 2012

Introduction to MACHINE LEARNING

Page 5: Knitting boar - Toronto and Boston HUGs - Nov 2012

Basic Concepts

✛ What is Data Mining?
  > “the process of extracting patterns from data”
✛ Why are we interested in Data Mining?
  > Raw data is essentially useless
    ∗ Data is simply recorded facts
    ∗ Information is the patterns underlying the data
✛ Machine Learning
  > Algorithms for acquiring structural descriptions from data “examples”
    ∗ The process of learning “concepts”

Page 6: Knitting boar - Toronto and Boston HUGs - Nov 2012

Shades of Gray

✛ Information Retrieval
  > information science, information architecture, cognitive psychology, linguistics, and statistics
✛ Natural Language Processing
  > grounded in machine learning, especially statistical machine learning
✛ Statistics
  > Math and stuff
✛ Machine Learning
  > Considered a branch of artificial intelligence

Page 7: Knitting boar - Toronto and Boston HUGs - Nov 2012

Hadoop in Traditional Enterprises Today

✛ ETL
✛ Joining multiple disparate data sources
✛ Filtering data
✛ Aggregation
✛ Cube materialization

“Descriptive Statistics”

Page 8: Knitting boar - Toronto and Boston HUGs - Nov 2012

Hadoop All The Time?

✛ Don’t always assume you need “scale” and parallelization
  > Try it out on a single machine first
  > See if it becomes a bottleneck!

✛ Will the data fit in memory on a beefy machine?

✛ We can always use the constructed model back in MapReduce to score a ton of new data

Page 9: Knitting boar - Toronto and Boston HUGs - Nov 2012

Twitter Pipeline

✛ http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
  > Studies data with descriptive statistics in the hope of building models for predictive analytics
✛ Does the majority of ML work via custom Pig integrations
  > The pipeline is very “Pig-centric”
  > Example: https://github.com/tdunning/pig-vector
  > They mostly use SGD and ensemble methods, which are conducive to large-scale data mining
✛ Questions they try to answer
  > Is this tweet spam?
  > What star rating might this user give this movie?

Page 10: Knitting boar - Toronto and Boston HUGs - Nov 2012

Typical Pipeline for Cloudera Customer

✛ Data collection performed with Flume
✛ Data cleansing / ETL performed with Hive or Pig
✛ ML work performed with
  > SAS
  > SPSS
  > R
  > Mahout

Page 11: Knitting boar - Toronto and Boston HUGs - Nov 2012

Introduction to MAHOUT

Page 12: Knitting boar - Toronto and Boston HUGs - Nov 2012

Algorithm Groups in Apache Mahout


✛ Classification
  > “Fraud detection”
✛ Recommendation
  > “Collaborative Filtering”
✛ Clustering
  > “Segmentation”
✛ Frequent Itemset Mining

Page 13: Knitting boar - Toronto and Boston HUGs - Nov 2012

Classification

✛ Stochastic Gradient Descent
  > Single process
  > Logistic Regression model construction
✛ Naïve Bayes
  > MapReduce-based
  > Text classification
✛ Random Forests
  > MapReduce-based


Page 14: Knitting boar - Toronto and Boston HUGs - Nov 2012

What Are Recommenders?

✛ An algorithm that looks at a user’s past actions and suggests
  > Products
  > Services
  > People
✛ Advertisement
  > Cloudera has a great Data Science training course on this topic
  > http://university.cloudera.com/training/data_science/introduction_to_data_science_-_building_recommender_systems.html

Page 15: Knitting boar - Toronto and Boston HUGs - Nov 2012

Clustering: Topic Modeling

✛ Cluster words across docs to identify topics
✛ Latent Dirichlet Allocation

Page 16: Knitting boar - Toronto and Boston HUGs - Nov 2012

Taking a Breath For a Minute

✛ Why Machine Learning?
  > Growing interest in predictive modeling
✛ Linear Models are Simple, Useful
  > Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression
✛ Building Models Is Still Time Consuming
  > The “need for speed”
  > “More data beats a cleverer algorithm”

Page 17: Knitting boar - Toronto and Boston HUGs - Nov 2012

Introducing KNITTING BOAR

Page 18: Knitting boar - Toronto and Boston HUGs - Nov 2012

Goals

✛ Parallelize Mahout’s Stochastic Gradient Descent
  > With as few extra dependencies as possible
✛ Explore parallel iterative algorithms using YARN
  > Wanted a first-class Hadoop/YARN citizen
  > Work through dev progressions towards a stable state
  > Worry about “frameworks” later

Page 19: Knitting boar - Toronto and Boston HUGs - Nov 2012


Stochastic Gradient Descent

✛ Training
  > Simple gradient descent procedure
  > The loss function needs to be convex
✛ Prediction
  > Logistic Regression:
    ∗ Sigmoid function with the (parameter vector · example) dot product as its argument

(Diagram: Training Data → SGD → Model)
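
To make the prediction and training steps concrete, here is a minimal Java sketch of the sigmoid prediction (parameter vector dot example) and a single SGD update for log loss. It is illustrative only, not Mahout’s OnlineLogisticRegression; the class name, the fixed learning rate, and the dense double[] feature representation are all simplifying assumptions.

    // Minimal illustrative sketch (not Mahout's OnlineLogisticRegression).
    // predict() is the sigmoid of the parameter-vector / example dot product;
    // train() applies one SGD step for log loss with a fixed learning rate.
    public class SgdSketch {

        static double predict(double[] beta, double[] x) {
            double dot = 0.0;
            for (int i = 0; i < beta.length; i++) {
                dot += beta[i] * x[i];
            }
            return 1.0 / (1.0 + Math.exp(-dot));   // sigmoid(beta . x)
        }

        static void train(double[] beta, double[] x, double label, double learningRate) {
            // For log loss, the gradient with respect to beta is (label - predicted) * x.
            double error = label - predict(beta, x);
            for (int i = 0; i < beta.length; i++) {
                beta[i] += learningRate * error * x[i];
            }
        }
    }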

Page 20: Knitting boar - Toronto and Boston HUGs - Nov 2012


Current Limitations

✛ Sequential algorithms on a single node only go so far
✛ The “Data Deluge”
  > Presents algorithmic challenges when combined with large data sets
  > Need to design algorithms that are able to perform in a distributed fashion
✛ MapReduce only fits certain types of algorithms

Page 21: Knitting boar - Toronto and Boston HUGs - Nov 2012


Distributed Learning Strategies

✛ Langford, 2007
  > Vowpal Wabbit
✛ McDonald, 2010
  > Distributed Training Strategies for the Structured Perceptron
✛ Dekel, 2010
  > Optimal Distributed Online Prediction Using Mini-Batches

Page 22: Knitting boar - Toronto and Boston HUGs - Nov 2012


MapReduce vs. Parallel Iterative

(Diagram: MapReduce: Input → Map, Map, Map → Reduce, Reduce → Output. Parallel Iterative: processors compute in lock-step across Superstep 1, Superstep 2, …)

Page 23: Knitting boar - Toronto and Boston HUGs - Nov 2012


Why Stay on Hadoop?

“Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution? If no, then operationally, we can consider the Hadoop stack … there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc.”

– Lin, 2012

Page 24: Knitting boar - Toronto and Boston HUGs - Nov 2012


The Boar

✛ Parallel Iterative implementation of SGD on YARN
✛ Workers work on partitions of the data
✛ Master keeps global copy of merged parameter vector

Page 25: Knitting boar - Toronto and Boston HUGs - Nov 2012


Worker

✛ Each worker is given a split of the total dataset
  > Similar to a map task
✛ Uses a modified OLR
  > Processes N samples in an epoch (a subset of the split)
✛ Local parameter vector sent to the master node
  > Master averages all workers’ vectors together
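
As a rough sketch of what one worker epoch might look like under this description: run OLR-style SGD updates over up to N samples of the worker’s split, then return the local parameter vector so the master can average it. The class, method, and parameter names below are hypothetical, not Knitting Boar’s actual API.

    import java.util.List;

    // Hypothetical worker-side epoch; names are illustrative, not Knitting Boar's API.
    public class WorkerSketch {

        // Train on up to samplesPerEpoch examples from this worker's split,
        // updating the local parameter vector in place, then return it so the
        // master can average it with the other workers' vectors.
        public static double[] runEpoch(List<double[]> features, List<Double> labels,
                                        double[] localBeta, int samplesPerEpoch,
                                        double learningRate) {
            int n = Math.min(samplesPerEpoch, features.size());
            for (int i = 0; i < n; i++) {
                double[] x = features.get(i);
                double dot = 0.0;
                for (int j = 0; j < localBeta.length; j++) {
                    dot += localBeta[j] * x[j];
                }
                double error = labels.get(i) - 1.0 / (1.0 + Math.exp(-dot));
                for (int j = 0; j < localBeta.length; j++) {
                    localBeta[j] += learningRate * error * x[j];   // one SGD step, as on the earlier slide
                }
            }
            return localBeta;   // sent to the master after the epoch
        }
    }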

Page 26: Knitting boar - Toronto and Boston HUGs - Nov 2012


Master

✛ Gathers and averages worker parameter vectors
  > From worker OLR runs
✛ Produces new global parameter vector
  > By averaging workers’ vectors
✛ Sends update to all workers
  > Workers replace local parameter vector with new global parameter vector
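
A minimal sketch of the master’s merge step, assuming a simple unweighted average of the workers’ vectors (the slide does not specify any weighting); the class and method names are illustrative, not Knitting Boar’s actual API.

    import java.util.List;

    // Hypothetical master-side merge; names are illustrative, not Knitting Boar's API.
    public class MasterSketch {

        // Average the workers' parameter vectors into a new global vector,
        // which is then broadcast back so each worker replaces its local copy.
        public static double[] average(List<double[]> workerVectors) {
            int dims = workerVectors.get(0).length;
            double[] global = new double[dims];
            for (double[] w : workerVectors) {
                for (int i = 0; i < dims; i++) {
                    global[i] += w[i];
                }
            }
            for (int i = 0; i < dims; i++) {
                global[i] /= workerVectors.size();
            }
            return global;
        }
    }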

Page 27: Knitting boar - Toronto and Boston HUGs - Nov 2012


IterativeReduce

✛ ComputableMaster
  > Setup()
  > Compute()
  > Complete()
✛ ComputableWorker
  > Setup()
  > Compute()

(Diagram: in each superstep the Workers send results to a Master, repeated across supersteps …)
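
To give a feel for how those callbacks fit together, here is a sketch of the loop they imply. The interface shapes, generic types, and the driver comment below are assumptions for illustration only, not IterativeReduce’s actual signatures; see the repository linked on the Bits slide for the real API.

    import java.util.List;

    // Illustrative-only shapes for the callbacks named above; the real
    // IterativeReduce signatures differ (see the GitHub link on the Bits slide).
    interface ComputableWorker<T> {
        void setup();                      // e.g. load this worker's split
        T compute();                       // run one pass, return a partial result
    }

    interface ComputableMaster<T> {
        void setup();                      // initialize global state
        T compute(List<T> partials);       // merge partials into a new global result
        boolean complete(T global);        // true when training should stop
    }

    // A driver would loop: every worker computes a partial, the master merges
    // them, the merged result goes back to the workers, until complete() is true.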

Page 28: Knitting boar - Toronto and Boston HUGs - Nov 2012


Comparison: OLR vs POLR

(Diagram: OnlineLogisticRegression: Training Data → Model. Knitting Boar’s POLR: Splits 1..N → Workers 1..N, each producing a Partial Model → Master → Global Model)

Page 29: Knitting boar - Toronto and Boston HUGs - Nov 2012


20Newsgroups

Input Size vs Processing Time

(Chart: processing time from 0 to 300 on the y-axis vs. input size from 4.1 to 41 on the x-axis, comparing OLR and POLR)

Page 30: Knitting boar - Toronto and Boston HUGs - Nov 2012

Knitting Boar: PARTING THOUGHTS

Page 31: Knitting boar - Toronto and Boston HUGs - Nov 2012


Knitting Boar Lessons Learned

✛ Parallel SGD
  > The Boar is temperamental, experimental
    ∗ Linear speedup (roughly)
✛ Developing YARN Applications
  > More complex than just MapReduce
  > Requires lots of “plumbing”
✛ IterativeReduce
  > Great native-Hadoop way to implement algorithms
  > Easy to use and well integrated

Page 32: Knitting boar - Toronto and Boston HUGs - Nov 2012


Bits

✛ Knitting Boar
  > https://github.com/jpatanooga/KnittingBoar
  > 100% Java
  > ASF 2.0 Licensed
  > Quick Start
    ∗ https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start
✛ IterativeReduce
  > https://github.com/emsixteeen/IterativeReduce
  > 100% Java
  > ASF 2.0 Licensed

Page 33: Knitting boar - Toronto and Boston HUGs - Nov 2012


✛ Machine Learning is hard
  > Don’t believe the hype
  > Do the work
✛ Model development takes time
  > Lots of iterations
  > Speed is key here

Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg

Page 34: Knitting boar - Toronto and Boston HUGs - Nov 2012


References

✛ Strata / Hadoop World 2012 slides
  > http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html
✛ Mahout’s SGD implementation
  > http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
✛ MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!
  > http://arxiv.org/pdf/1209.2191v1.pdf

Page 35: Knitting boar - Toronto and Boston HUGs - Nov 2012


References

✛ Langford
  > http://hunch.net/~vw/
✛ McDonald, 2010
  > http://dl.acm.org/citation.cfm?id=1858068