CS525: Big Data Analytics, Machine Learning on Hadoop, Fall 2013, Elke A. Rundensteiner
Machine Learning with Hadoop
October 4, 2011. Presented to Hadoop-DC.
Training on a pluggable machine learning platform
Machine Learning on Hadoop at Huffington Post | AOL
A Little Bit about Us
Core Services Team at HPMG | AOL
Thu Kyaw ([email protected])
• Principal Software Engineer
• Worked on machine learning, data mining, and natural language processing
Sang Chul Song, Ph.D. ([email protected])
• Senior Software Engineer
• Worked on data-intensive computing: data archiving / information retrieval
Machine Learning: Supervised Classification
Business examples:
• Investments are taxed as …
• the top tax bracket for …
• Well, Mr. Geithner, …
• the financial crises are unfair …

Non-business examples:
• Are you dense or just clueless?
• numbas is numbas …
• This is a joke, right?
• My nephew is a hedge fund manager …
1. Learning Phase: labeled examples → Train → Model
2. Classifying Phase: new text (“capital gains to be taxed …”) → Classify (using the Model) → Result (one of “Business”, “Entertainment”, “Politics”)
Two Machine Learning Use Cases at HuffPost | AOL
Comment Moderation
• Evaluate All New HuffPost User Comments Every Day
• Identify Abusive / Aggressive Comments
• Auto Delete / Publish ~25% of Comments Every Day
Article Classification
• Tag Articles for Advertising
• E.g.: scary, salacious, …
Our Classification Tasks
Comment Moderation: label each comment as abusive or non-abusive.
Article Classification: tag each article, e.g. scary, sexy.
In Order to Meet Our Needs, We Require…
Support for important algorithms, including
• SVM
• Perceptron / Winnow
• Bayesian
• Decision Tree
• AdaBoost
• …
Ability to build tons of models on a regular basis, and pick the best
• Because, in general, it’s difficult to know in advance which algorithm / parameter set will work best
However,
With N algorithms, K parameters per algorithm, and L values per parameter, there are N × L^K combinations, which is often too many to deal with sequentially.
For example, N = 5, K = 5, L = 10 gives 5 × 10^5 = 500K combinations.
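To make the explosion concrete, here is a minimal, self-contained Java sketch of such a parameter grid; the parameter names and values are made up for illustration, and each combination would become one line of the parameter files used later.

import java.util.ArrayList;
import java.util.List;

// Minimal sketch: enumerate every (preprocess, train) parameter combination
// that would have to be trained; values here are made-up examples.
public class ParamGrid {
    public static void main(String[] args) {
        String[] stemmers = {"none", "porter", "snowball"};
        int[]    ngrams   = {1, 2, 3};
        String[] algos    = {"svm", "bayesian", "decision_tree"};
        double[] cValues  = {0.1, 1.0, 10.0};

        List<String> combos = new ArrayList<>();
        for (String stem : stemmers)
            for (int n : ngrams)
                for (String algo : algos)
                    for (double c : cValues)
                        combos.add(stem + "\t" + n + "\t" + algo + "\t" + c);

        // Each entry would be one row (one parameter set) of a request file.
        System.out.println(combos.size() + " parameter sets to train");  // 3*3*3*3 = 81
    }
}

Even this toy grid with a handful of values per parameter already yields 81 models to train; real grids grow far faster.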
So, we parallelize on Hadoop
Good news:
• Mahout, a parallel machine learning tool, is already available.
• There are Mallet, libsvm, Weka, … that support the necessary algorithms.
Bad news:
• Mahout doesn’t support all the necessary algorithms yet.
• The other libraries do not run natively on Hadoop.
Therefore, we do…
We build a flexible ML platform running on Hadoop that supports a wide range of algorithms, leveraging publicly available implementations.
On top of our platform, we generate / test hundreds of thousands of models, and choose the best.
We use Pig for Hadoop implementation.
CONVENTIONAL: Training Data → Train (sequential) → Model

OUR APPROACH: Training Data → train request to the Hadoop cluster (AdaBoost, SVM, Decision Tree, Bayesian and a lot of others) → 1000s of models returned (one for each param set) → Select Best → Model

Our approach: more algorithms (thus better models), and faster parallel processing.
What Parallelization?
[Diagram: many independent Training Tasks running in parallel across the cluster]
General Processing Flow
Preprocess Parameters
• Stopword use, n-gram size, stemming, etc.
Train Parameters
• Algorithm and algorithm-specific parameters (e.g. for SVM: C, ε, and other kernel parameters)
Flow: Training Docs → Preprocess → Vectorized Docs → Train → Model
Our Parallel Processing Flow
Training Docs → Preprocessing on Hadoop (see next slide) → many sets of Vectorized Docs (one per preprocess parameter set) → Training on Hadoop → many Models (one per train parameter set)
Preprocessing on Hadoop
Preprocessing Request (a parameter set per line):
279 68 ngram_stem_stopword 1 snowball true
279 68 ngram_stem_stopword 2 snowball true
279 68 ngram_stem_stopword 3 snowball true
279 68 ngram_stem_stopword 1 porter true
279 68 ngram_stem_stopword 2 porter true
279 68 ngram_stem_stopword 3 none false
…

Training Data:
business  Investments are taxed as capital gains .....
business  It was the overleveraged and underregulated banks …
none      I am afraid we may be headed for …
none      In the famous words of Homer Simpson, “it takes 2 to lie …”
…

Output: Vector 1, Vector 2, Vector 3, …, Vector k (one vectorized output per parameter set)
Preprocessing on Hadoop: Big Picture

par = LOAD param_file AS (par1, par2, …);
run = FOREACH par GENERATE RunPreprocess(par1, par2, …);
STORE run …;

The FOREACH invokes the RunPreprocess() UDF, which emits Vector 1, Vector 2, …, Vector k.
Preprocessors (Pluggable Pipes)
Tokenizer → StopwordFilter → Stemmer → Vectorizer → FeatureSelector
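As an illustration of the pluggable-pipes idea, here is a minimal Java sketch. The Pipe interface and the concrete classes below are hypothetical stand-ins (the talk does not show this code); a real platform would more likely plug in pipes from Mallet or a similar library.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical pluggable-pipe sketch: each preprocessing step is a Pipe that
// transforms a token list, and a pipeline is just an ordered list of Pipes.
interface Pipe {
    List<String> apply(List<String> tokens);
}

class StopwordFilter implements Pipe {
    private final Set<String> stopwords = Set.of("the", "a", "an", "is", "are");
    public List<String> apply(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) if (!stopwords.contains(t)) out.add(t);
        return out;
    }
}

class LowercaseStemmer implements Pipe {  // trivial stand-in for a Porter/Snowball stemmer
    public List<String> apply(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) out.add(t.toLowerCase(Locale.ROOT).replaceAll("(s|ing|ed)$", ""));
        return out;
    }
}

public class PreprocessPipeline {
    public static void main(String[] args) {
        // The preprocess parameter set (stemmer choice, stopword use, n-gram size, ...)
        // decides which pipes get plugged in and in what order.
        List<Pipe> pipes = Arrays.asList(new StopwordFilter(), new LowercaseStemmer());

        List<String> tokens = Arrays.asList("Investments", "are", "taxed", "as", "capital", "gains");
        for (Pipe p : pipes) tokens = p.apply(tokens);
        System.out.println(tokens);  // [investment, tax, a, capital, gain]
    }
}

Because every step shares the same tiny interface, swapping a Porter stemmer for a Snowball stemmer, or dropping the stopword filter, is just a change to the parameter set rather than to the pipeline code.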
Training on Hadoop
Train Request (a parameter set per line):
73 923 balanced_winnow 5 1 10 …
73 923 balanced_winnow 5 2 10 …
73 923 balanced_winnow 5 3 10 …
73 923 balanced_winnow 5 1 20 …
73 923 balanced_winnow 5 2 20 …
73 923 balanced_winnow 5 3 20 …
…

Vectors:
010101101020101100010101110100010101011100…
010111010100010100100010101011100110110101…
011101011010101011101011011010001010010101…
010010111010100010101010001010111010101010…
111010110001110101011010100101011010001011…

Output: Model 1, Model 2, …, Model k (built with Mahout, Weka, Mallet, or libsvm)
Training on Hadoop: Big Picture

par = LOAD param_file AS (par1, par2, …);
run = FOREACH par GENERATE RunTrainer(par1, par2, …);
STORE run …;

The FOREACH invokes the RunTrainer() UDF, which emits Model 1, Model 2, …, Model k.
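For readers unfamiliar with Pig UDFs, this is roughly the shape of a RunTrainer-style UDF in Java. It is a hedged sketch, not the platform's actual code: the talk does not show the UDF source, and trainOneModel() is a hypothetical placeholder for the call into the ML library.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Sketch of a trainer UDF: Pig hands each parameter-set row to exec(), which
// runs one training job and returns something Pig can STORE (here, a string
// naming the model that was written out).
public class RunTrainer extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        // The whole tuple is one parameter set from the request file (algorithm
        // name, its settings, pointers to the input vectors, ...).
        String paramString = input.toDelimitedString(" ");

        // Delegate to Mahout / Mallet / Weka / libsvm depending on the parameters
        // and return the location of the resulting model.
        return trainOneModel(paramString);
    }

    // Hypothetical placeholder for the library-specific training call.
    private String trainOneModel(String paramString) {
        throw new UnsupportedOperationException("plug in Mahout/Mallet/Weka/libsvm here");
    }
}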
Mahout
• Bayesian
• Logistic Regression
• …

Mallet
• AdaBoost (M2)
• Bagging
• Balanced Winnow
• C45
• Decision Tree
• …

Weka
• AdaBoostM1
• Bagging
• Additive Regression
• …

libsvm
• SVM
Training on Hadoop: Trick #1
Each model can be generated independently, so this is an easy parallelization problem (aka “embarrassingly parallel”). But how do we achieve parallelism with Pig?
Naive version (the tiny parameter file fits in a single input split, so the whole FOREACH runs in one map task and the UDF calls execute one after another):

par = LOAD param_file AS (par1, par2, …);
run = FOREACH par GENERATE RunTrainer(par1, par2, …);
STORE run …;

Parallel version (the GROUP BY forces a reduce phase, and PARALLEL 50 spreads the groups, one per parameter set, across 50 reducers, so many trainers run at once):

par = LOAD param_file AS (par1, par2, …);
grp = GROUP par BY (par1, par2, …) PARALLEL 50;
fltn = FOREACH grp GENERATE group.par1 AS par1, …;
run = FOREACH fltn GENERATE RunTrainer(par1, …);
STORE run …;
Training on Hadoop: Trick #2
We call ML functions from a UDF. Some functions can take too long to return, and Hadoop will kill the task if it does not report progress in the meantime.
So while the main thread is blocked inside RunTrainer(), a separate “Pig heartbeat” thread keeps reporting progress.
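A minimal sketch of that heartbeat idea is below. It assumes the UDF runs inside Pig on Hadoop and uses PigStatusReporter to signal progress; the exact reporting call and interval in the real platform may differ, and longRunningTrain() is a hypothetical stand-in for the blocking ML library call.

import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.pig.tools.pigstats.PigStatusReporter;

// Sketch of the "Pig heartbeat" trick: while the main thread is blocked in a
// long ML library call, a daemon thread keeps telling Hadoop the task is alive
// so it is not killed for failing to report progress.
public class HeartbeatedCall {

    public static String callWithHeartbeat() throws Exception {
        final AtomicBoolean done = new AtomicBoolean(false);

        Thread heartbeat = new Thread(() -> {
            PigStatusReporter reporter = PigStatusReporter.getInstance();
            while (!done.get()) {
                if (reporter != null) {
                    reporter.progress();   // "still alive" signal to Hadoop
                }
                try {
                    Thread.sleep(30_000);  // report every 30 seconds
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        heartbeat.setDaemon(true);
        heartbeat.start();

        try {
            return longRunningTrain();     // main thread blocks here
        } finally {
            done.set(true);
            heartbeat.interrupt();
        }
    }

    // Hypothetical stand-in for the slow Mahout/Mallet/Weka/libsvm call.
    private static String longRunningTrain() {
        return "model.bin";
    }
}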
As a result, we now see…
We are now able to build tens of thousands of models within an hour and choose the best.
• Previously, the same task took us days.
As we can generate more models more frequently, we become more adaptive to the fast-changing Internet community, catching up with newly-coined terms, etc.
Useful Resources
Mahout: http://mahout.apache.org/
Mallet: http://mallet.cs.umass.edu/
Weka: http://www.cs.waikato.ac.nz/ml/weka/
libsvm: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
OpenNLP: http://incubator.apache.org/opennlp/
Pig: http://pig.apache.org/
THANK YOU!