CS525: Big Data Analytics, Machine Learning on Hadoop, Fall 2013, Elke A. Rundensteiner
Machine Learning with Hadoop
October 4, 2011. Presented to Hadoop-DC.
Training on a pluggable machine learning platform
Machine Learning on Hadoop at Huffington Post | AOL
A Little Bit about Us
Core Services Team at HPMG | AOL
Thu Kyaw ([email protected])
• Principal Software Engineer
• Worked on machine learning, data mining, and natural language processing
Sang Chul Song, Ph.D. ([email protected])
• Senior Software Engineer
• Worked on data-intensive computing: data archiving / information retrieval
Machine Learning: Supervised Classification
Business examples:
• Investments are taxed as …
• the top tax bracket for …
• Well, Mr. Geithner, …
• the financial crises are unfair …

Non-business examples:
• Are you dense or just clueless?
• numbas is numbas …
• This is a joke, right?
• My nephew is a hedge fund manager …
1. Learning Phase: labeled examples → Train → Model
2. Classifying Phase: new text (“capital gains to be taxed …”) → Classify (using the Model) → Result (one of “Business”, “Entertainment”, “Politics”)
Two Machine Learning Use Cases at HuffPost | AOL
Comment Moderation
• Evaluate All New HuffPost User Comments Every Day
• Identify Abusive / Aggressive Comments
• Auto Delete / Publish ~25% of Comments Every Day
Article Classification
• Tag Articles for Advertising
• E.g.: scary, salacious, …
Our Classification Tasks
Comment Moderation: label each comment as abusive or non-abusive.
Article Classification: tag each article, e.g. scary, sexy.
In Order to Meet Our Needs, We Require…
Support for important algorithms, including
• SVM
• Perceptron / Winnow
• Bayesian
• Decision Tree
• AdaBoost
• …
Ability to build tons of models on a regular basis, and pick the best
• Because, in general, it’s difficult to know in advance which algorithm / parameter set will work best
However,
With N algorithms, K parameters per algorithm, and L values per parameter, there are N × L^K combinations, which is often too many to deal with sequentially.
For example, N = 5, K = 5, L = 10 gives 5 × 10^5 = 500K combinations.
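To make the explosion concrete, here is a minimal, self-contained Java sketch of such a parameter grid; the parameter names and values are made up for illustration, and each combination would become one line of the parameter files used later.

import java.util.ArrayList;
import java.util.List;

// Minimal sketch: enumerate every (preprocess, train) parameter combination
// that would have to be trained; values here are made-up examples.
public class ParamGrid {
    public static void main(String[] args) {
        String[] stemmers = {"none", "porter", "snowball"};
        int[]    ngrams   = {1, 2, 3};
        String[] algos    = {"svm", "bayesian", "decision_tree"};
        double[] cValues  = {0.1, 1.0, 10.0};

        List<String> combos = new ArrayList<>();
        for (String stem : stemmers)
            for (int n : ngrams)
                for (String algo : algos)
                    for (double c : cValues)
                        combos.add(stem + "\t" + n + "\t" + algo + "\t" + c);

        // Each entry would be one row (one parameter set) of a request file.
        System.out.println(combos.size() + " parameter sets to train");  // 3*3*3*3 = 81
    }
}

Even this toy grid with a handful of values per parameter already yields 81 models to train; real grids grow far faster.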
So, we parallelize on Hadoop
Good news:
• Mahout, a parallel machine learning tool, is already available.
• There are Mallet, libsvm, Weka, … that support the necessary algorithms.
Bad news:
• Mahout doesn’t support all the necessary algorithms yet.
• The other libraries do not run natively on Hadoop.
Therefore, we do…
We build a flexible ML platform running on Hadoop that supports a wide range of algorithms, leveraging publicly available implementations.
On top of our platform, we generate / test hundreds of thousands of models, and choose the best.
We use Pig for Hadoop implementation.
CONVENTIONAL: Training Data → Train (sequential) → Model

OUR APPROACH: Training Data → train request to the Hadoop cluster (AdaBoost, SVM, Decision Tree, Bayesian and a lot of others) → 1000s of models returned (one for each param set) → Select Best → Model

Our approach: more algorithms (thus better models), and faster parallel processing.
What Parallelization?
[Diagram: many independent Training Tasks running in parallel across the cluster]
General Processing Flow
Preprocess Parameters
• Stopword use, n-gram size, stemming, etc.
Train Parameters
• Algorithm and algorithm-specific parameters (e.g. for SVM: C, ε, and other kernel parameters)
Flow: Training Docs → Preprocess → Vectorized Docs → Train → Model
Our Parallel Processing Flow
Training Docs → Preprocessing on Hadoop (see next slide) → many sets of Vectorized Docs (one per preprocess parameter set) → Training on Hadoop → many Models (one per train parameter set)
Preprocessing on Hadoop
Preprocessing Request (a parameter set per line):
279 68 ngram_stem_stopword 1 snowball true
279 68 ngram_stem_stopword 2 snowball true
279 68 ngram_stem_stopword 3 snowball true
279 68 ngram_stem_stopword 1 porter true
279 68 ngram_stem_stopword 2 porter true
279 68 ngram_stem_stopword 3 none false
…

Training Data:
business  Investments are taxed as capital gains .....
business  It was the overleveraged and underregulated banks …
none      I am afraid we may be headed for …
none      In the famous words of Homer Simpson, “it takes 2 to lie …”
…

Output: Vector 1, Vector 2, Vector 3, …, Vector k (one vectorized output per parameter set)
Preprocessing on Hadoop: Big Picture

par = LOAD param_file AS (par1, par2, …);
run = FOREACH par GENERATE RunPreprocess(par1, par2, …);
STORE run …;

The FOREACH invokes the RunPreprocess() UDF, which emits Vector 1, Vector 2, …, Vector k.
Preprocessors (Pluggable Pipes)
Tokenizer → StopwordFilter → Stemmer → Vectorizer → FeatureSelector
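As an illustration of the pluggable-pipes idea, here is a minimal Java sketch. The Pipe interface and the concrete classes below are hypothetical stand-ins (the talk does not show this code); a real platform would more likely plug in pipes from Mallet or a similar library.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical pluggable-pipe sketch: each preprocessing step is a Pipe that
// transforms a token list, and a pipeline is just an ordered list of Pipes.
interface Pipe {
    List<String> apply(List<String> tokens);
}

class StopwordFilter implements Pipe {
    private final Set<String> stopwords = Set.of("the", "a", "an", "is", "are");
    public List<String> apply(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) if (!stopwords.contains(t)) out.add(t);
        return out;
    }
}

class LowercaseStemmer implements Pipe {  // trivial stand-in for a Porter/Snowball stemmer
    public List<String> apply(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) out.add(t.toLowerCase(Locale.ROOT).replaceAll("(s|ing|ed)$", ""));
        return out;
    }
}

public class PreprocessPipeline {
    public static void main(String[] args) {
        // The preprocess parameter set (stemmer choice, stopword use, n-gram size, ...)
        // decides which pipes get plugged in and in what order.
        List<Pipe> pipes = Arrays.asList(new StopwordFilter(), new LowercaseStemmer());

        List<String> tokens = Arrays.asList("Investments", "are", "taxed", "as", "capital", "gains");
        for (Pipe p : pipes) tokens = p.apply(tokens);
        System.out.println(tokens);  // [investment, tax, a, capital, gain]
    }
}

Because every step shares the same tiny interface, swapping a Porter stemmer for a Snowball stemmer, or dropping the stopword filter, is just a change to the parameter set rather than to the pipeline code.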
Training on Hadoop
Train Request (a parameter set per line):
73 923 balanced_winnow 5 1 10 …
73 923 balanced_winnow 5 2 10 …
73 923 balanced_winnow 5 3 10 …
73 923 balanced_winnow 5 1 20 …
73 923 balanced_winnow 5 2 20 …
73 923 balanced_winnow 5 3 20 …
…

Vectors:
010101101020101100010101110100010101011100…
010111010100010100100010101011100110110101…
011101011010101011101011011010001010010101…
010010111010100010101010001010111010101010…
111010110001110101011010100101011010001011…

Output: Model 1, Model 2, …, Model k (built with Mahout, Weka, Mallet, or libsvm)
Training on Hadoop: Big Picture

par = LOAD param_file AS (par1, par2, …);
run = FOREACH par GENERATE RunTrainer(par1, par2, …);
STORE run …;

The FOREACH invokes the RunTrainer() UDF, which emits Model 1, Model 2, …, Model k.
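For readers unfamiliar with Pig UDFs, this is roughly the shape of a RunTrainer-style UDF in Java. It is a hedged sketch, not the platform's actual code: the talk does not show the UDF source, and trainOneModel() is a hypothetical placeholder for the call into the ML library.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Sketch of a trainer UDF: Pig hands each parameter-set row to exec(), which
// runs one training job and returns something Pig can STORE (here, a string
// naming the model that was written out).
public class RunTrainer extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        // The whole tuple is one parameter set from the request file (algorithm
        // name, its settings, pointers to the input vectors, ...).
        String paramString = input.toDelimitedString(" ");

        // Delegate to Mahout / Mallet / Weka / libsvm depending on the parameters
        // and return the location of the resulting model.
        return trainOneModel(paramString);
    }

    // Hypothetical placeholder for the library-specific training call.
    private String trainOneModel(String paramString) {
        throw new UnsupportedOperationException("plug in Mahout/Mallet/Weka/libsvm here");
    }
}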
Mahout
• Bayesian
• Logistic Regression
• …

Mallet
• AdaBoost (M2)
• Bagging
• Balanced Winnow
• C45
• Decision Tree
• …

Weka
• AdaBoostM1
• Bagging
• Additive Regression
• …

libsvm
• SVM
Training on Hadoop: Trick #1
Each model can be generated independently, so this is an easy parallelization problem (aka “embarrassingly parallel”). But how do we achieve parallelism with Pig?
Naive version (the tiny parameter file fits in a single input split, so the whole FOREACH runs in one map task and the UDF calls execute one after another):

par = LOAD param_file AS (par1, par2, …);
run = FOREACH par GENERATE RunTrainer(par1, par2, …);
STORE run …;

Parallel version (the GROUP BY forces a reduce phase, and PARALLEL 50 spreads the groups, one per parameter set, across 50 reducers, so many trainers run at once):

par = LOAD param_file AS (par1, par2, …);
grp = GROUP par BY (par1, par2, …) PARALLEL 50;
fltn = FOREACH grp GENERATE group.par1 AS par1, …;
run = FOREACH fltn GENERATE RunTrainer(par1, …);
STORE run …;
Training on Hadoop: Trick #2
We call ML functions from a UDF. Some functions can take too long to return, and Hadoop will kill the task if it does not report progress in the meantime.
So while the main thread is blocked inside RunTrainer(), a separate “Pig heartbeat” thread keeps reporting progress.
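A minimal sketch of that heartbeat idea is below. It assumes the UDF runs inside Pig on Hadoop and uses PigStatusReporter to signal progress; the exact reporting call and interval in the real platform may differ, and longRunningTrain() is a hypothetical stand-in for the blocking ML library call.

import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.pig.tools.pigstats.PigStatusReporter;

// Sketch of the "Pig heartbeat" trick: while the main thread is blocked in a
// long ML library call, a daemon thread keeps telling Hadoop the task is alive
// so it is not killed for failing to report progress.
public class HeartbeatedCall {

    public static String callWithHeartbeat() throws Exception {
        final AtomicBoolean done = new AtomicBoolean(false);

        Thread heartbeat = new Thread(() -> {
            PigStatusReporter reporter = PigStatusReporter.getInstance();
            while (!done.get()) {
                if (reporter != null) {
                    reporter.progress();   // "still alive" signal to Hadoop
                }
                try {
                    Thread.sleep(30_000);  // report every 30 seconds
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        heartbeat.setDaemon(true);
        heartbeat.start();

        try {
            return longRunningTrain();     // main thread blocks here
        } finally {
            done.set(true);
            heartbeat.interrupt();
        }
    }

    // Hypothetical stand-in for the slow Mahout/Mallet/Weka/libsvm call.
    private static String longRunningTrain() {
        return "model.bin";
    }
}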
As a result, we now see…
We are now able to build tens of thousands of models within an hour and choose the best.
• Previously, the same task took us days.
As we can generate more models more frequently, we become more adaptive to the fast-changing Internet community, catching up with newly-coined terms, etc.
Useful Resources
Mahout: http://mahout.apache.org/
Mallet: http://mallet.cs.umass.edu/
Weka: http://www.cs.waikato.ac.nz/ml/weka/
libsvm: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
OpenNLP: http://incubator.apache.org/opennlp/
Pig: http://pig.apache.org/
THANK YOU!