
[Figure 1 diagram: hidden neurons h1-h4 fully connected to visible neurons v1-v4]

Sang Kyun Kim, Lawrence McAfee, Peter McMahon, Kunle Olukotun

[Figure 6 block diagram labels: Avalon MM slave and master ports, main controller, MUX, stream logic, and four groups GRP0-GRP3, each with buffers, a local FSM, memory, and a sigmoid / RNG / compare unit; a 256-bit visible neuron broadcast, tree add/accumulate units with a tree-add value broadcast, and 32-bit CPU↔RBM register, 256-bit CPU↔RBM memory, 256-bit DDR2↔RBM, and main↔local datapaths]

Restricted Boltzmann Machine

[Figure 3 diagram: alternating Gibbs sampling between visible units $v_i$ and hidden units $h_j$, with $\langle v_i h_j\rangle^0$ measured at $t = 0$ (data) and $\langle v_i h_j\rangle^1$ at $t = 1$ (reconstruction); the weight update is $\Delta w_{ij} = \varepsilon\left(\langle v_i h_j\rangle^0 - \langle v_i h_j\rangle^1\right)$]

Figure 1. RBM Structure (above): An RBM is a two-layer neural network with all-to-all connections between the layers.

Figure 2. Deep Belief Nets (right): A Deep Belief Network is a multi-layer generative model. The network is first learned with all the weights tied, which is equivalent to an RBM. The first layer is then frozen and the remaining weights are learned, which is again equivalent to training another RBM.
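A minimal software sketch of this greedy layer-wise procedure (not the poster's implementation; `train_rbm` is a placeholder for any RBM trainer, e.g. CD-1, that returns a weight matrix):

```python
import numpy as np

def train_dbn(data, layer_sizes, train_rbm):
    """Greedy layer-wise DBN training.

    Each layer is trained as an RBM on the previous layer's hidden
    representation and then frozen; train_rbm(inputs, n_hidden) returns
    that layer's weight matrix.
    """
    reps, weights = data, []
    for n_hidden in layer_sizes:
        W = train_rbm(reps, n_hidden)              # learn this layer as an RBM
        weights.append(W)                          # freeze its weights
        reps = 1.0 / (1.0 + np.exp(-(reps @ W)))   # hidden activations feed the next layer
    return weights
```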


Figure 3. RBM Training (from Hinton's tutorial at NIPS'07): Training begins with the data at the visible layer, which is used to compute the probabilities of the hidden units. The hidden units stochastically fire, and the resulting hidden layer is used to reconstruct the visible layer. The hidden layer is then recomputed from the reconstructed visible layer. The weights are updated based on the difference between the visible-hidden products of the original training example and of the reconstructed data.

Introduction
Restricted Boltzmann Machines (RBMs) - the building block for the newly popular Deep Belief Networks (DBNs) - are a promising new tool for machine learning practitioners. However, future research in applications of DBNs is hampered by the considerable computation that training requires. We have designed a novel architecture and FPGA implementation that accelerates the training of general RBMs in a scalable manner, with the goal of producing a system that machine learning researchers can use to investigate ever larger networks. Our current (single-FPGA) design uses a highly efficient, fully-pipelined architecture based on 16-bit arithmetic for performing RBM training on an FPGA. Single-board results show a speedup of 25-30X over an optimized software implementation on a high-end CPU. Current design efforts target a multi-board implementation.

RBM Training Procedure
Contrastive divergence (CD) learning is used to simplify the infinite alternating Gibbs sampling that exact training would require.
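As a software illustration only (not the FPGA datapath described on this poster), one CD-1 update for a binary RBM, following the steps of Figure 3, can be sketched in NumPy as below; biases are omitted and the learning rate `epsilon` is an assumed parameter:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, epsilon=0.1, rng=None):
    """One contrastive-divergence (CD-1) step for a binary RBM.

    W  : (n_visible, n_hidden) weight matrix, biases omitted for brevity
    v0 : (batch, n_visible) mini-batch of training vectors
    """
    rng = rng or np.random.default_rng()
    # Hidden phase: p(h = 1 | v0), then stochastic binary firing
    p_h0 = sigmoid(v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(v0.dtype)
    # Reconstruction phase: p(v = 1 | h0)
    p_v1 = sigmoid(h0 @ W.T)
    # Recompute the hidden probabilities from the reconstruction
    p_h1 = sigmoid(p_v1 @ W)
    # Weight update: epsilon * (<v_i h_j>^0 - <v_i h_j>^1), averaged over the batch
    dW = epsilon * (v0.T @ p_h0 - p_v1.T @ p_h1) / v0.shape[0]
    return W + dW
```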

Experimental Platform
A stackable Altera Stratix III FPGA board with a DDR2 SDRAM interface.

Figure 4. Terasic DE3 Fast Prototyping Board: The left image shows the DE3 board. It has an Altera Stratix III FPGA with high-speed I/O interfaces for communication with multiple other boards, and it also includes a DDR2 SO-DIMM interface, a USB JTAG interface, and a USB 2.0 interface. The right image illustrates multiple DE3 boards connected in a stack.

Implementation Details
A single-FPGA implementation of the RBM has been developed; a multi-FPGA version is currently being designed.

[Figure 5 block diagram labels: Nios II CPU, DDR2 controller, ISP1761 USB controller, JTAG UART, and the RBM module (main controller, weight array, multiply array, adder array, sigmoid array, RNG/compare array, neuron array update logic, memory stream), connected over the Avalon MM fabric with 32-bit and 256-bit datapaths]

Figure 5. Overall System: The single FPGA contains a CPU (Nios II), a DDR2 controller, and an RBM module. These components are connected via Altera's Avalon interface.

Figure 6. RBM Module: The RBM module is the most important module in the design. It was designed to exploit all the available multipliers, maximizing performance, and it was segmented into groups to avoid long wires and enforce localization. The weight matrix was assumed to fit on-chip.
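For intuition about what each group's sigmoid / RNG / compare unit does, the following is a rough software model; the 256-entry lookup table, the Q4.12 fixed-point format, and the saturation range are illustrative assumptions, since the poster states only that 16-bit arithmetic is used:

```python
import numpy as np

# Assumed parameters (not given on the poster): Q4.12 fixed point,
# a 256-entry sigmoid lookup table, saturation outside [-8, 8].
FRAC_BITS = 12
LUT_SIZE = 256
LUT_RANGE = 8.0

_inputs = np.linspace(-LUT_RANGE, LUT_RANGE, LUT_SIZE)
SIGMOID_LUT = np.round((1.0 / (1.0 + np.exp(-_inputs))) * 0xFFFF).astype(np.uint16)

def fire(activation_q4_12, rng):
    """Sigmoid via table lookup, then compare against a 16-bit uniform random value."""
    x = activation_q4_12 / (1 << FRAC_BITS)                 # back to real units
    idx = int(np.clip((x + LUT_RANGE) / (2 * LUT_RANGE) * (LUT_SIZE - 1),
                      0, LUT_SIZE - 1))
    p_fixed = int(SIGMOID_LUT[idx])                         # p(h = 1) as a 16-bit fraction
    return int(rng.integers(0, 1 << 16) < p_fixed)          # stochastic binary neuron output
```

For example, `fire(int(0.7 * (1 << FRAC_BITS)), np.random.default_rng())` returns 1 with probability approximately sigmoid(0.7).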

Figure 7. Core Multiply Array: To eliminate the transpose operation in the learning algorithm, the matrix multiply is performed as (a) a linear combination of weight vectors in the hidden phase, and (b) vector inner products in the reconstruction phase.
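A small NumPy sketch of the idea (the row-per-visible-neuron storage layout is an inference from the figure rather than a stated detail): both phases stream the weight rows in the same order, so no transposed copy of W is ever formed.

```python
import numpy as np

def hidden_phase(W, v):
    """Hidden phase: a linear combination of weight rows, scaled by the visible values."""
    acc = np.zeros(W.shape[1])
    for i, row in enumerate(W):      # weight rows streamed in storage order
        acc += v[i] * row
    return acc                       # equals v @ W, no transpose needed

def reconstruction_phase(W, h):
    """Reconstruction phase: an inner product of each weight row with the hidden vector."""
    return np.array([row @ h for row in W])   # equals W @ h, same row order
```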

Performance Results
The RBM module runs at 200 MHz. The comparison was made against one core of a 2.4 GHz Intel Core2 processor running MATLAB.

Figure 8. Speedup: The speedup over a single-precision floating-point MATLAB run is around 25X. The speedup depends on the network structure and size.

Extending to a Multiple-FPGA System
Two issues are being tackled:
1. The weight matrix grows as O(n²) with the number of nodes, so it no longer fits on-chip and must be streamed from DRAM. Using a batch size greater than 16, we can exploit data parallelism to reduce the required bandwidth to a level feasible for DDR2 SDRAM (a rough, illustrative estimate follows this list).
2. The single-FPGA scheme broadcasts the visible data every cycle. If we extend this to multiple FPGAs, the height of the board stack may limit the maximum clock rate driving the RBM module.
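For intuition on issue 1 only, with purely illustrative parameters (none of these numbers come from the poster): streaming the weight matrix once per mini-batch, rather than once per training example, divides the required weight-streaming bandwidth by the batch size.

```python
# Illustrative estimate with assumed parameters (not taken from the poster).
n = 4096                     # neurons per layer (assumed)
bytes_per_weight = 2         # 16-bit weights
examples_per_second = 2000   # training examples processed per second (assumed)

matrix_bytes = n * n * bytes_per_weight            # one full pass over the weights
for batch in (1, 16, 64):
    # the streamed weight matrix is reused across all examples in a batch
    gb_per_s = matrix_bytes * examples_per_second / batch / 1e9
    print(f"batch {batch:3d}: ~{gb_per_s:.1f} GB/s of weight streaming")
```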

[Figure 8 chart: speedup for 50 epochs on 256x256, 256x1024, and 512x512 networks, plotted on a 0-35X scale against single-precision and double-precision baselines]

Conclusion
Deep Belief Nets are an emerging area with the Restricted Boltzmann Machine at their heart. FPGAs can effectively exploit the inherent fine-grained parallelism in RBMs to reduce the computational bottleneck for large-scale DBN research. As a prototype of a fast DBN research machine, we implemented a high-speed, configurable RBM on a single FPGA. The RBM has shown approximately 25X speedup compared to a single-precision software implementation of the RBM running on an Intel Core2 processor. Working with Stanford's AI research group, we expect our future multiple-FPGA implementation to provide enough speedup to attack large problems that have remained unsolved for decades.