Machine Learning on Cell Processor
Supervisor: Dr. Eric McCreath Student: Robin Srivastava
Background and Motivation
Machine Learning: Batch Learning vs. Online Learning
Online learning is sequential in nature
[Diagram: a stream of emails (Email-N … Email-2, Email-1) classified one at a time as HAM or SPAM]
Objective
Performance evaluation of a parallel online machine learning algorithm (Langford et al. [1])
Target Machines
Cell Processor: one 3 GHz 64-bit IBM PowerPC core, six specialized co-processors (SPEs)
Intel Dual Core Machine: 2 GHz dual-core processor, 1.86 GB of main memory
Stochastic Gradient Descent
Step 1: Initialize the weight vector w(0) with arbitrary values
Step 2: Update the weight vector as follows:
    w(t+1) = w(t) − η∇E(w(t))
where ∇E is the gradient of the error function and η is the learning rate
Step 3: Repeat Step 2 for every unit of data
Delayed Stochastic Gradient Descent
Step 1: Initialize the weight vector w(0) with arbitrary values
Step 2: Update the weight vector using a gradient delayed by τ steps:
    w(t+1) = w(t) − η∇E(w(t−τ))
where ∇E is the gradient of the error function and η is the learning rate
Step 3: Repeat Step 2 for every unit of data
Implementation Model
[Diagram: complete dataset distributed to the parallel units]
Implementation
Dataset: TREC 2007 Public Corpus
Number of mails: 75,419, each classified as either 'ham' or 'spam'
Pre-processing
Total number of features extracted: 2,218,878
Pre-processed email format:
<Number of features><space><index>:<count><space>…<index>:<count>
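A parser for this line format can be sketched in a few lines of C. This is an illustrative reading of the format, assuming the leading number is the count of <index>:<count> pairs on the line; the function name and buffer layout are not from the project.

```c
#include <stdlib.h>

/* Parse one pre-processed mail line of the form
 *   "<nfeat> <index>:<count> <index>:<count> ..."
 * into parallel index/count arrays.  Returns the number of
 * pairs stored (at most max). */
static int parse_mail(const char *line, int *idx, float *cnt, int max)
{
    char *end;
    int nfeat = (int)strtol(line, &end, 10);  /* leading feature count */
    int n = 0;
    while (n < nfeat && n < max) {
        int i = (int)strtol(end, &end, 10);   /* feature index */
        if (*end != ':')
            break;                            /* malformed pair: stop */
        float c = strtof(end + 1, &end);      /* feature count */
        idx[n] = i;
        cnt[n] = c;
        n++;
    }
    return n;
}
```

Parsing directly into index/count arrays matches the sparse update sketched earlier, so no dense 2.2-million-entry vector is ever materialized per mail.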
Memory Requirement
Algorithm implemented: online logistic regression with delayed update
Requirement per level of parallelization:
- Two private copies of the weight vector
- Two shared copies of the weight vector
- Two error gradients
Required dimension of each = number of features = 2,218,878
Data type: float (4 bytes on Cell)
Total = (6 × 2,218,878) × 4 = 53,253,072 bytes = 50.78 MB, plus auxiliary variables
Alternatively: let only the shared copies use the full dimension; total = (2 × 2,218,878) × 4 = 16.9 MB + others
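The memory arithmetic on this slide can be checked with a small helper (illustrative only; the function name is not from the project):

```c
/* Each weight vector or gradient holds one float per feature. */
static long bytes_for_vectors(long nvec, long nfeatures)
{
    return nvec * nfeatures * 4L;   /* a float occupies 4 bytes on Cell */
}

/* bytes_for_vectors(6, 2218878) == 53253072 bytes (~50.78 MB): all six
 *   full-dimension vectors, far beyond an SPE's 256 KB local store.
 * bytes_for_vectors(2, 2218878) == 17751024 bytes (~16.9 MB): only the
 *   two shared copies kept at full dimension. */
```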
Limitations on Cell
Memory limitation of the SPE:
Available: 256 KB; required: approx. 51 MB
Work-around: reduced the number of features with one more level of pre-processing
SIMD limitation:
The time spent preparing the data for SIMD surpassed its benefits for this implementation
Results
Serial implementation of logistic regression on the Intel dual-core machine took 36.93 s and 36.45 s for two consecutive executions.
Parallel implementation using stochastic gradient descent
Results (contd.)
Performance on Cell
[Chart: execution time in microseconds]
References
[1] John Langford, Alexander J. Smola and Martin Zinkevich. Slow Learners are Fast. Journal of Machine Learning Research 1 (2009).
[2] Michael Kistler, Michael Perrone and Fabrizio Petrini. Cell Multiprocessor Communication Network: Built for Speed.
[3] Thomas Chen, Ram Raghavan, Jason Dale and Eiji Iwata. Cell Broadband Engine Architecture and its First Implementation.
[4] Jonathan Bartlett. Programming High-Performance Applications on the Cell/B.E. Processor, Part 6: Smart Buffer Management with DMA Transfers.
[5] Introduction to Statistical Machine Learning, 2010 course, Assignment 1.
[6] Christopher Bishop. Pattern Recognition and Machine Learning.