
Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks

Chen Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan, Jason Cong

◆ Challenge 1: Programmability
§ Network definitions (& layer parameters) change across applications
• CNNs are getting deeper; higher accuracy is achieved with deeper networks
• Network parameters vary across layers; many parameters are user definable
§ Hardware has limited programmability
• Re-synthesize a bitstream for each network? – Not preferred
• Re-program the FPGA for each layer? – No!

Challenge for CNN/DNN accelerators
Caffeine: a SW/HW co-designed FPGA-based deep learning library

Convolution Layer & Fully Connected Layer

Results
◆ Portability to different boards

Summary:
• Performance: 7.3x speedup over an Intel Xeon 12-core CPU
• Energy efficiency: 43.5x over the Intel Xeon, 1.5x over a GPU

Automation flow from Caffe to FPGA
Bandwidth improvement

◆ Contributions
§ A uniformed convolutional MM representation adapted for efficient FPGA acceleration of both CONV and FCN layers in CNN/DNN.
§ A HW/SW co-designed, efficient, and reusable CNN/DNN engine, Caffeine, whose FPGA accelerator maximizes computing and bandwidth resource utilization.
§ The first published attempt to incorporate FPGAs into the industry-standard deep learning framework Caffe.

◆ Fully Connected Layer: Communication Intensive

◆ Challenge 2: Various Computation Patterns
§ Convolution layer (CONV) & fully connected layer (FCN)
• CONV – compute intensive; FCN – communication intensive
§ Hardware has limited bandwidth resource
• Convolution layer – extensively studied in prior work
• Fully connected layer – how to maximize bandwidth utilization?
• CONV + FCN – how to re-use hardware for both kernels?

[Overview: CONV / FCN / POOL / ReLU layers → Uniformed Representation → FPGA-based Accelerator Design (Computation Optimization, Bandwidth Optimization) → Automation flow]

[Figure: the Caffe framework and our work. Caffe takes network definitions (feedforward) over CONV / FCN / POOL / ReLU layer definitions and dispatches CPU_forward to MKL/BLAS on the CPU and GPU_forward to cuDNN on the GPU. Our work adds an FPGA path: a high-level network description goes through computation model optimization for customized platforms to a customized accelerator, with applications (with CNNs) on the CPU and the FPGA connected via PCIe and DRAM.]

Convolution layer example:
Type  input  output  row  col  kernel  stride  in+out+weight size (MB)  Mega floating operations  Comp/Data ratio
CONV  64     64      224  224  3       1       6.16                     3700                      600.3

Fully connected layer example:
Type  input  output  in+out+weight size (MB)  Mega floating operations  Comp/Data ratio
FCN   25088  4096    98                       2100                      21.4
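The gap between the two rows can be sanity-checked with a short sketch. The byte-per-element and batching assumptions below are illustrative, so the absolute values differ from the table; the point is the orders-of-magnitude gap between CONV and FCN in compute-to-data ratio.

```python
# Illustrative sketch (4-byte elements, batch of 1 -- assumptions, so the
# absolute numbers differ from the table): compute-to-data ratio of the
# CONV example (64 -> 64 channels, 224x224, 3x3 kernel) vs. the FCN
# example (25088 -> 4096).

def conv_ratio(n_in, n_out, rows, cols, k, bpe=4):
    flops = 2 * n_in * n_out * rows * cols * k * k        # MACs, counted as 2 ops
    data = (n_in * rows * cols + n_out * rows * cols      # input + output maps
            + n_in * n_out * k * k) * bpe                 # + weights
    return flops / data

def fcn_ratio(n_in, n_out, bpe=4):
    flops = 2 * n_in * n_out
    data = (n_in + n_out + n_in * n_out) * bpe            # weights dominate
    return flops / data

print(f"CONV ratio: {conv_ratio(64, 64, 224, 224, 3):.1f}")
print(f"FCN ratio:  {fcn_ratio(25088, 4096):.2f}")
```

Each CONV weight is reused across every output pixel, while each FCN weight is used exactly once per image, which is why FCN is communication bound.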

◆ Convolution

◆ FPGA bandwidth
§ Much smaller than GPU (10 vs 200)
§ Sensitive to burst length & bus bitwidth
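The burst-length sensitivity can be illustrated with a toy latency model (the peak bandwidth and setup latency below are made-up round numbers, not measurements from this work): each DRAM burst pays a fixed setup cost, so short bursts waste most of the peak bandwidth.

```python
# Toy model (made-up constants): effective bandwidth of a DRAM burst that
# pays a fixed setup latency before streaming at the peak rate.

def effective_bw_gbps(burst_bytes, peak_gbps=10.0, setup_ns=100.0):
    transfer_ns = burst_bytes / peak_gbps   # peak_gbps equals bytes per ns
    return burst_bytes / (setup_ns + transfer_ns)

for burst in (64, 1024, 65536):
    print(f"{burst:>6} B burst -> {effective_bw_gbps(burst):.2f} GB/s")
```

Only long bursts approach the peak rate, which is why the accelerator batches and merges transfers rather than issuing many small reads.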

Input-Major Mapping
[Figure: raw mapping of the fully connected layer's input/output onto the convolution engine's input/output feature maps, then the two improvements: 1. Batching, 2. Merging]
◆ Two ways of improving effective bandwidth: 1. Batching, 2. Merging

Weight-Major Mapping
[Figure: raw mapping of the fully connected layer's weights and inputs onto the convolution engine's input/output feature maps, then the same two improvements: 1. Batching, 2. Merging]
◆ Two ways of improving effective bandwidth: 1. Batching, 2. Merging

Method 1: Input-major Mapping Method 2: Weight-major Mapping
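The payoff of batching, which both mappings exploit, can be sketched in a few lines: with a batch of B input vectors, one pass over the n_in x n_out weight matrix serves all B images, so per-image weight traffic shrinks by B. The sizes reuse the FCN example (25088 -> 4096); the batch sizes are illustrative.

```python
# Sketch: per-image weight traffic for an FCN layer when B inputs are
# batched so each streamed weight is reused B times.

def per_image_weight_traffic(n_in, n_out, batch):
    # The whole weight matrix is streamed once per batch, regardless of B.
    return n_in * n_out / batch

for b in (1, 16, 32):
    traffic = per_image_weight_traffic(25088, 4096, b)
    print(f"batch {b:>2}: {traffic:,.0f} weights per image")
```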

◆ Design Space in Roofline Model
Execute once for a new network: layer configurations, weights
Execute once for a new image: images

DRAM space layout:
  Instruction Region: layer1, layer2, …
  Weight & Bias Region: Weight1, Bias1, Weight2, …
  Feature Map Region: image, fm1, fm2, …
FPGA on-chip: Instruction Cache, Input Buffer, Weight Buffer, Output Buffer, Computation Engine
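The three regions above can be packed into one contiguous address map. The builder below is an illustrative sketch (region names, per-instruction size, and the byte counts passed in are assumptions, not Caffeine's actual layout): instructions and weights are written once per network, so only the feature-map region is rewritten per image.

```python
# Illustrative sketch (names and sizes assumed): laying out the
# instruction, weight/bias, and feature-map regions contiguously in DRAM.

def build_address_map(instr_bytes, weight_bytes, fmap_bytes):
    regions = [("instructions", instr_bytes),
               ("weights_biases", weight_bytes),
               ("feature_maps", fmap_bytes)]
    offset, amap = 0, {}
    for name, size in regions:
        amap[name] = (offset, size)     # (base address, length in bytes)
        offset += size
    return amap

amap = build_address_map(instr_bytes=3 * 64,   # e.g. 3 layer instructions
                         weight_bytes=600,
                         fmap_bytes=110)
print(amap)
```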

layers {
  bottom: "conv1_1"
  top: "conv1_2"
  name: "conv1_2"
  type: CONVOLUTION
  convolution_param {
    num_output: 64
    pad: 1
    kernel_size: 3
  }
}
layers {
  bottom: "conv1_2"
  top: "pool1"
  name: "pool1"
  type: POOLING
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layers {
  bottom: "conv1_2"
  top: "conv1_2"
  name: "relu1_2"
  type: RELU
}
input: "data"
input_dim: 3
input_dim: 224
input_dim: 224

Type      input  output  row  col  kernel  stride  pad  ReLU  POOL  size  stride
CONV/FCN  3      64      224  224  3       1       1    1     1     2     2
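The mapping from the prototxt fragment to that instruction row can be sketched as follows. The field order follows the table above, but the encoder itself is a hypothetical illustration, not Caffeine's actual instruction format; the CONV + ReLU + POOL group around conv1_2 collapses into one row.

```python
# Hypothetical encoder (field order from the table above): flatten a fused
# CONV(+ReLU)(+POOL) layer group into a single accelerator instruction row.

def encode_layer(layer_type, n_in, n_out, row, col, kernel, stride, pad,
                 relu=False, pool_size=0, pool_stride=0):
    return [layer_type, n_in, n_out, row, col, kernel, stride, pad,
            int(relu), int(pool_size > 0), pool_size, pool_stride]

# conv1_2 as in the table: input 3, output 64, 224x224, 3x3 kernel,
# stride 1, pad 1, fused with relu1_2 and 2x2 stride-2 max pooling.
instr = encode_layer("CONV/FCN", 3, 64, 224, 224, 3, 1, 1,
                     relu=True, pool_size=2, pool_stride=2)
print(instr)
```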

Customized Accelerator Instructions

CNN Layer Definitions

◆ Different data types

Deep learning & applications

Unmanned Vehicles, Speech & Audio, Text & Language
Genomics, Image & Video, Multimedia

Back-end Training
— big training data
— big models
— fast iteration

Online Processing
— real-time response
— critical cost & power demands
— on forward stage

Intelligent Device
— billions of requests/day
— critical cost & power demands
— on forward stage

Center for Energy-Efficient Computing and Applications, Peking University, Beijing, China
Center for Domain-Specific Computing, University of California, Los Angeles, US
Falcon Computing Solutions Inc., Los Angeles, US