Machine Learning on Cell Processor
Supervisor: Dr. Eric McCreath Student: Robin Srivastava
Background and Motivation
Machine Learning: Batch Learning vs. Online Learning
Online learning is sequential in nature
[Diagram: a stream of emails (Email-N … Email-2, Email-1) classified one at a time as HAM or SPAM]
Objective
Performance evaluation of a parallel online machine learning algorithm (Langford et al. [1])
Target Machines
Cell Processor: one 3 GHz 64-bit IBM PowerPC core, six specialized co-processors (SPEs)
Intel Dual Core Machine: 2 GHz dual-core processor, 1.86 GB of main memory
Stochastic Gradient Descent
Step 1: Initialize the weight vector w(0) with arbitrary values
Step 2: Update the weight vector as follows:
    w(t+1) = w(t) − η∇E(w(t))
where ∇E is the gradient of the error function and η is the learning rate
Step 3: Repeat Step 2 for every unit of data
Delayed Stochastic Gradient Descent
Step 1: Initialize the weight vector w(0) with arbitrary values
Step 2: Update the weight vector using a gradient delayed by τ steps:
    w(t+1) = w(t) − η∇E(w(t−τ))
where ∇E is the gradient of the error function and η is the learning rate
Step 3: Repeat Step 2 for every unit of data
Implementation Model
[Diagram: complete dataset distributed to the parallel units]
Implementation
Dataset: TREC 2007 Public Corpus
Number of mails: 75,419, each classified as either 'ham' or 'spam'
Pre-processing
Total number of features extracted: 2,218,878
Pre-processed email format:
<Number of features><space><index>:<count><space>…<index>:<count>
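A parser for this line format can be sketched in a few lines of C. This is an illustrative reading of the format, assuming the leading number is the count of <index>:<count> pairs on the line; the function name and buffer layout are not from the project.

```c
#include <stdlib.h>

/* Parse one pre-processed mail line of the form
 *   "<nfeat> <index>:<count> <index>:<count> ..."
 * into parallel index/count arrays.  Returns the number of
 * pairs stored (at most max). */
static int parse_mail(const char *line, int *idx, float *cnt, int max)
{
    char *end;
    int nfeat = (int)strtol(line, &end, 10);  /* leading feature count */
    int n = 0;
    while (n < nfeat && n < max) {
        int i = (int)strtol(end, &end, 10);   /* feature index */
        if (*end != ':')
            break;                            /* malformed pair: stop */
        float c = strtof(end + 1, &end);      /* feature count */
        idx[n] = i;
        cnt[n] = c;
        n++;
    }
    return n;
}
```

Parsing directly into index/count arrays matches the sparse update sketched earlier, so no dense 2.2-million-entry vector is ever materialized per mail.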
Memory Requirement
Algorithm implemented: online logistic regression with delayed update
Requirement per level of parallelization:
- Two private copies of the weight vector
- Two shared copies of the weight vector
- Two error gradients
Required dimension of each = number of features = 2,218,878
Data type: float (4 bytes on Cell)
Total = (6 × 2,218,878) × 4 = 53,253,072 bytes = 50.78 MB, plus auxiliary variables
Alternatively: let only the shared copies use the full dimension; total = (2 × 2,218,878) × 4 = 16.9 MB + others
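The memory arithmetic on this slide can be checked with a small helper (illustrative only; the function name is not from the project):

```c
/* Each weight vector or gradient holds one float per feature. */
static long bytes_for_vectors(long nvec, long nfeatures)
{
    return nvec * nfeatures * 4L;   /* a float occupies 4 bytes on Cell */
}

/* bytes_for_vectors(6, 2218878) == 53253072 bytes (~50.78 MB): all six
 *   full-dimension vectors, far beyond an SPE's 256 KB local store.
 * bytes_for_vectors(2, 2218878) == 17751024 bytes (~16.9 MB): only the
 *   two shared copies kept at full dimension. */
```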
Limitations on Cell
Memory limitation of the SPE:
Available: 256 KB; required: approx. 51 MB
Work-around: reduced the number of features with one more level of pre-processing
SIMD limitation:
The time spent preparing the data for SIMD surpassed its benefits for this implementation
Results
Serial implementation of logistic regression on the Intel dual-core machine took 36.93 s and 36.45 s for two consecutive executions.
Parallel implementation using stochastic gradient descent
Results (contd.)
Performance on Cell
[Chart: execution time in microseconds]
References
[1] John Langford, Alexander J. Smola and Martin Zinkevich. Slow Learners are Fast. Journal of Machine Learning Research 1 (2009).
[2] Michael Kistler, Michael Perrone and Fabrizio Petrini. Cell Multiprocessor Communication Network: Built for Speed.
[3] Thomas Chen, Ram Raghavan, Jason Dale and Eiji Iwata. Cell Broadband Engine Architecture and its First Implementation.
[4] Jonathan Bartlett. Programming High-Performance Applications on the Cell/B.E. Processor, Part 6: Smart Buffer Management with DMA Transfers.
[5] Introduction to Statistical Machine Learning, 2010 course, Assignment 1.
[6] Christopher Bishop. Pattern Recognition and Machine Learning.