
[Figure 1 diagram: hidden neurons h1-h4 fully connected to visible neurons v1-v4]

Sang Kyun Kim, Lawrence McAfee, Peter McMahon, Kunle Olukotun

[Figure 6 block diagram labels: Avalon MM slave and master ports, main controller, MUX, stream logic, and four groups GRP0-GRP3, each with buffers, a local FSM, memory, and a sigmoid / RNG / compare unit; a 256-bit visible neuron broadcast, tree add/accumulate units with a tree-add value broadcast, and 32-bit CPU↔RBM register, 256-bit CPU↔RBM memory, 256-bit DDR2↔RBM, and main↔local datapaths]

Restricted Boltzmann Machine

[Figure 3 diagram: alternating Gibbs sampling between visible units $v_i$ and hidden units $h_j$, with $\langle v_i h_j\rangle^0$ measured at $t = 0$ (data) and $\langle v_i h_j\rangle^1$ at $t = 1$ (reconstruction); the weight update is $\Delta w_{ij} = \varepsilon\left(\langle v_i h_j\rangle^0 - \langle v_i h_j\rangle^1\right)$]

Figure 1. RBM Structure (above): An RBM is a two-layer neural network with all-to-all connections between the layers.

Figure 2. Deep Belief Nets (right): A Deep Belief Network is a multi-layer generative model. The network is first learned with all the weights tied, which is equivalent to an RBM. The first layer is then frozen and the remaining weights are learned, which is again equivalent to training another RBM.
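A minimal software sketch of this greedy layer-wise procedure (not the poster's implementation; `train_rbm` is a placeholder for any RBM trainer, e.g. CD-1, that returns a weight matrix):

```python
import numpy as np

def train_dbn(data, layer_sizes, train_rbm):
    """Greedy layer-wise DBN training.

    Each layer is trained as an RBM on the previous layer's hidden
    representation and then frozen; train_rbm(inputs, n_hidden) returns
    that layer's weight matrix.
    """
    reps, weights = data, []
    for n_hidden in layer_sizes:
        W = train_rbm(reps, n_hidden)              # learn this layer as an RBM
        weights.append(W)                          # freeze its weights
        reps = 1.0 / (1.0 + np.exp(-(reps @ W)))   # hidden activations feed the next layer
    return weights
```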


Figure 3. RBM Training (from Hinton's tutorial at NIPS'07): Training begins with the data at the visible layer, which is used to compute the probabilities of the hidden units. The hidden units stochastically fire, and the resulting hidden layer is used to reconstruct the visible layer. The hidden layer is then recomputed from the reconstructed visible layer. The weights are updated based on the difference between the visible-hidden products of the original training example and of the reconstructed data.

Introduction
Restricted Boltzmann Machines (RBMs) - the building block for the newly popular Deep Belief Networks (DBNs) - are a promising new tool for machine learning practitioners. However, future research in applications of DBNs is hampered by the considerable computation that training requires. We have designed a novel architecture and FPGA implementation that accelerates the training of general RBMs in a scalable manner, with the goal of producing a system that machine learning researchers can use to investigate ever larger networks. Our current (single-FPGA) design uses a highly efficient, fully-pipelined architecture based on 16-bit arithmetic for performing RBM training on an FPGA. Single-board results show a speedup of 25-30X over an optimized software implementation on a high-end CPU. Current design efforts target a multi-board implementation.

RBM Training Procedure
Contrastive divergence (CD) learning is used to simplify the infinite alternating Gibbs sampling that exact training would require.
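As a software illustration only (not the FPGA datapath described on this poster), one CD-1 update for a binary RBM, following the steps of Figure 3, can be sketched in NumPy as below; biases are omitted and the learning rate `epsilon` is an assumed parameter:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, epsilon=0.1, rng=None):
    """One contrastive-divergence (CD-1) step for a binary RBM.

    W  : (n_visible, n_hidden) weight matrix, biases omitted for brevity
    v0 : (batch, n_visible) mini-batch of training vectors
    """
    rng = rng or np.random.default_rng()
    # Hidden phase: p(h = 1 | v0), then stochastic binary firing
    p_h0 = sigmoid(v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(v0.dtype)
    # Reconstruction phase: p(v = 1 | h0)
    p_v1 = sigmoid(h0 @ W.T)
    # Recompute the hidden probabilities from the reconstruction
    p_h1 = sigmoid(p_v1 @ W)
    # Weight update: epsilon * (<v_i h_j>^0 - <v_i h_j>^1), averaged over the batch
    dW = epsilon * (v0.T @ p_h0 - p_v1.T @ p_h1) / v0.shape[0]
    return W + dW
```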

Experimental Platform
A stackable Altera Stratix III FPGA board with a DDR2 SDRAM interface.

Figure 4. Terasic DE3 Fast Prototyping Board: The left image shows the DE3 board. It has an Altera Stratix III FPGA with high-speed I/O interfaces for communication with multiple other boards, and it also includes a DDR2 SO-DIMM interface, a USB JTAG interface, and a USB 2.0 interface. The right image illustrates multiple DE3 boards connected in a stack.

Implementation Details
A single-FPGA implementation of the RBM has been developed; a multi-FPGA version is currently being designed.

[Figure 5 block diagram labels: Nios II CPU, DDR2 controller, ISP1761 USB controller, JTAG UART, and the RBM module (main controller, weight array, multiply array, adder array, sigmoid array, RNG/compare array, neuron array update logic, memory stream), connected over the Avalon MM fabric with 32-bit and 256-bit datapaths]

Figure 5. Overall System: The single FPGA contains a CPU (Nios II), a DDR2 controller, and an RBM module. These components are connected via Altera's Avalon interface.

Figure 6. RBM Module: The RBM module is the most important module in the design. It was designed to exploit all the available multipliers, maximizing performance, and it was segmented into groups to avoid long wires and enforce localization. The weight matrix was assumed to fit on-chip.
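For intuition about what each group's sigmoid / RNG / compare unit does, the following is a rough software model; the 256-entry lookup table, the Q4.12 fixed-point format, and the saturation range are illustrative assumptions, since the poster states only that 16-bit arithmetic is used:

```python
import numpy as np

# Assumed parameters (not given on the poster): Q4.12 fixed point,
# a 256-entry sigmoid lookup table, saturation outside [-8, 8].
FRAC_BITS = 12
LUT_SIZE = 256
LUT_RANGE = 8.0

_inputs = np.linspace(-LUT_RANGE, LUT_RANGE, LUT_SIZE)
SIGMOID_LUT = np.round((1.0 / (1.0 + np.exp(-_inputs))) * 0xFFFF).astype(np.uint16)

def fire(activation_q4_12, rng):
    """Sigmoid via table lookup, then compare against a 16-bit uniform random value."""
    x = activation_q4_12 / (1 << FRAC_BITS)                 # back to real units
    idx = int(np.clip((x + LUT_RANGE) / (2 * LUT_RANGE) * (LUT_SIZE - 1),
                      0, LUT_SIZE - 1))
    p_fixed = int(SIGMOID_LUT[idx])                         # p(h = 1) as a 16-bit fraction
    return int(rng.integers(0, 1 << 16) < p_fixed)          # stochastic binary neuron output
```

For example, `fire(int(0.7 * (1 << FRAC_BITS)), np.random.default_rng())` returns 1 with probability approximately sigmoid(0.7).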

Figure 7. Core Multiply Array: To eliminate the transpose operation in the learning algorithm, the matrix multiply is performed as (a) a linear combination of weight vectors in the hidden phase, and (b) vector inner products in the reconstruction phase.
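A small NumPy sketch of the idea (the row-per-visible-neuron storage layout is an inference from the figure rather than a stated detail): both phases stream the weight rows in the same order, so no transposed copy of W is ever formed.

```python
import numpy as np

def hidden_phase(W, v):
    """Hidden phase: a linear combination of weight rows, scaled by the visible values."""
    acc = np.zeros(W.shape[1])
    for i, row in enumerate(W):      # weight rows streamed in storage order
        acc += v[i] * row
    return acc                       # equals v @ W, no transpose needed

def reconstruction_phase(W, h):
    """Reconstruction phase: an inner product of each weight row with the hidden vector."""
    return np.array([row @ h for row in W])   # equals W @ h, same row order
```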

Performance Results
The RBM module runs at 200 MHz. The comparison was made against one core of a 2.4 GHz Intel Core2 processor running MATLAB.

Figure 8. Speedup: The speedup over a single-precision floating-point MATLAB run is around 25X. The speedup depends on the network structure and size.

Extending to a Multiple-FPGA System
Two issues are being tackled:
1. The weight matrix grows as O(n²) with the number of nodes, so it no longer fits on-chip and must be streamed from DRAM. Using a batch size greater than 16, we can exploit data parallelism to reduce the required bandwidth to a level feasible for DDR2 SDRAM (a rough, illustrative estimate follows this list).
2. The single-FPGA scheme broadcasts the visible data every cycle. If we extend this to multiple FPGAs, the height of the board stack may limit the maximum clock rate driving the RBM module.
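For intuition on issue 1 only, with purely illustrative parameters (none of these numbers come from the poster): streaming the weight matrix once per mini-batch, rather than once per training example, divides the required weight-streaming bandwidth by the batch size.

```python
# Illustrative estimate with assumed parameters (not taken from the poster).
n = 4096                     # neurons per layer (assumed)
bytes_per_weight = 2         # 16-bit weights
examples_per_second = 2000   # training examples processed per second (assumed)

matrix_bytes = n * n * bytes_per_weight            # one full pass over the weights
for batch in (1, 16, 64):
    # the streamed weight matrix is reused across all examples in a batch
    gb_per_s = matrix_bytes * examples_per_second / batch / 1e9
    print(f"batch {batch:3d}: ~{gb_per_s:.1f} GB/s of weight streaming")
```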

[Figure 8 chart: speedup for 50 epochs on 256x256, 256x1024, and 512x512 networks, plotted on a 0-35X scale against single-precision and double-precision baselines]

Conclusion
Deep Belief Nets are an emerging area with the Restricted Boltzmann Machine at their heart. FPGAs can effectively exploit the inherent fine-grained parallelism in RBMs to reduce the computational bottleneck for large-scale DBN research. As a prototype of a fast DBN research machine, we implemented a high-speed, configurable RBM on a single FPGA. The RBM has shown approximately 25X speedup compared to a single-precision software implementation of the RBM running on an Intel Core2 processor. Working with Stanford's AI research group, we expect our future multiple-FPGA implementation to provide enough speedup to attack large problems that have remained unsolved for decades.