
Riptide: Fast End-to-End Binarized Neural Networks

Josh Fromm, Meghan Cowan, Matthai Philipose, Luis Ceze, and Shwetak Patel

Canziani et al., “An analysis of deep neural network models for practical applications.” 2016

• Quantize floats to +/-1
• 1.122 * -3.112 ==> 1 * -1
• Notice:
  • 1 * 1 = 1
  • 1 * -1 = -1
  • -1 * 1 = -1
  • -1 * -1 = 1
• Replacing -1 with 0, this is just XNOR
• Retrain the model to convergence

[Diagram: 64 floats (1.2, 3.12, -11.2, 3.4, -2.12, -132.1, …, 0.2, -121.1, …) are bit-packed into 64 bits (0b110100…1 = 0xD0…).]

A[:64] . W[:64] == 2*popc(A[:64] XNOR W[:64]) - 64
(with m matching bit positions, the bipolar dot product is m - (64 - m) = 2m - 64)

1-bit Matrix Operations

Rastegari et al., “Xnor-net: Imagenet classification using binary convolutional neural networks.” 2016

float x[], y[], w[];
...
for i in 1…N:
    y[j] += x[i] * w[i];

2N ops

unsigned long x_b[], y[], w_b[];
...
for i in 1…N/64:
    y[j] += 64 - 2*popc(x_b[i] xor w_b[i]);

3N/64 ops

~40x faster, 32x smaller

Typically, lose ~10% accuracy
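As a runnable illustration of the binary inner loop above, here is a minimal C sketch (the function name is hypothetical; it assumes the usual -1 -> 0 bit encoding and the GCC/Clang popcount builtin):

#include <stdint.h>

// Dot product of two bipolar vectors of length 64*n, bit-packed into
// uint64_t words (bit 1 encodes +1, bit 0 encodes -1).
// Per word: matches - mismatches = 64 - 2*popcount(x XOR w).
int binary_dot(const uint64_t *x_b, const uint64_t *w_b, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += 64 - 2 * __builtin_popcountll(x_b[i] ^ w_b[i]);
    return acc;
}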

1-bit Matrix Operations: Cost/Benefit


Runtime:
• Unoptimized Binary Network: 380 ms
• Full Precision Baseline: 1904 ms

Implementation Challenges

Binary networks:
• CPUs have no native support for low-bit data types (uint1, uint2); we need to work on packed data
• No optimized linear algebra libraries like BLAS to leverage
• Optimizations must be implemented from scratch and tuned for a specific CPU

Full-precision baselines:
• Optimized linear algebra libraries
• Hardware support for conventional deep learning
• Baselines are incredibly well optimized


Are Binary Networks Actually Fast?

Majority of work in binarization is simulated

• Which binarization techniques can be implemented efficiently?

• What are the runtime bottlenecks in a binary model?

• How do I deploy a fast binary model on my platform?

To address these questions we introduce Riptide.


A one-stop solution to training and deploying fast binary networks on a variety of hardware platforms.

• Addresses implementation issues in mixed polarity quantization

• Introduces the Fused Glue operation, removing all floating-point arithmetic from binary models.

• Provides high-performance bitserial operators through TVM.

• Yields 4-12X speedups across various models and bitwidths while maintaining state-of-the-art accuracy.

• Available open-source today at github.com/jwfromm/Riptide

Implementing Binary Layers

[Diagram, built up step by step: a standard layer multiply-accumulates float features against float kernels to produce float activations. In the binary layer, a quantizer QA converts the features to an int array, a quantizer QW converts the kernels to an int array, and a bitserial accumulate produces int activations.]

Quantization Polarity

Bipolar quantization: x̂ = sign(x), mapping values to {-1, +1}
• Implemented with bitwise-xnor and popcount
• Well-suited for weights, which represent correlation (1) or inverse-correlation (-1)

Unipolar quantization: x̂ = (x > 0), mapping values to {0, 1}
• Implemented with bitwise-and and popcount
• Well-suited for activations, which represent pattern-match (1) or no pattern-match (0)

(A bit-packing sketch for both polarities follows below.)
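A minimal sketch of packing 64 floats under each polarity (the helper names are hypothetical, not the Riptide API; ties at zero are resolved arbitrarily):

#include <stdint.h>

// Bipolar: bit 1 encodes +1, bit 0 encodes -1 (sign quantization).
uint64_t pack_bipolar(const float *x) {
    uint64_t word = 0;
    for (int i = 0; i < 64; i++)
        if (x[i] >= 0.0f) word |= (1ULL << i);
    return word;
}

// Unipolar: bit 1 encodes 1, bit 0 encodes 0 (x > 0 quantization).
uint64_t pack_unipolar(const float *x) {
    uint64_t word = 0;
    for (int i = 0; i < 64; i++)
        if (x[i] > 0.0f) word |= (1ULL << i);
    return word;
}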

Quantization Polarity

• XnorNet (all bipolar) -> 44.2% accuracy
• DorefaNet (bipolar weights, unipolar activations) -> 50.0% accuracy

Zhou et al., “DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients.” 2016

A (unipolar): 1 1 0 1 0 1 …
W (bipolar):  0 1 0 0 1 0 …
Expected:     0 1 0 -1 0 -1 …

Because the 0 bit means different things in the two operands, the mixed-polarity product cannot be implemented with a plain xnor-popcount or and-popcount.

Mixed Polarity Operation


The mixed-polarity product can still be computed in two steps (see the sketch below):
• Count the bit multiplications whose output should be 1
• Subtract the cases whose output should be -1

• Enables mixed-polarity binary networks
• Doubles the inner-loop compute but does not require additional memory operations
• Mixed polarity may offer compelling points on the speedup-versus-accuracy curve compared to pure bipolar
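A minimal C sketch of that two-step inner loop for one 64-bit word (hypothetical helper name; unipolar activations a_b, bipolar weights w_b, both bit-packed):

#include <stdint.h>

// a_b: unipolar activations (bit 1 = 1, bit 0 = 0)
// w_b: bipolar weights      (bit 1 = +1, bit 0 = -1)
int mixed_polarity_dot(uint64_t a_b, uint64_t w_b) {
    int pos = __builtin_popcountll(a_b & w_b);   // products that should be +1
    int neg = __builtin_popcountll(a_b & ~w_b);  // products that should be -1
    return pos - neg;                            // two popcounts instead of one
}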

Multibit Quantization


Quantization function: x̂ = linear(x)

[Diagram: the range [0, 1] divided into uniform levels (e.g. 0.3, 0.6).]

• Translates naturally to an integer representation (see the sketch below)
• Does not necessarily fit the distribution

Zhou et al., “DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients.” 2016
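As a reference point, a hedged sketch of a DoReFa-style linear quantizer (the function name and the [0, 1] clamping range are assumptions, not the exact Riptide code):

#include <math.h>

// Linear n-bit quantization: snap x (assumed to lie in [0, 1]) to one of
// 2^n evenly spaced levels and return the integer code.
unsigned linear_quantize(float x, int n) {
    float levels = (float)((1 << n) - 1);   // number of intervals
    if (x < 0.0f) x = 0.0f;                 // clamp to [0, 1]
    if (x > 1.0f) x = 1.0f;
    return (unsigned)lroundf(x * levels);   // integer code in [0, 2^n - 1]
}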

Multibit Quantization


Quantization function: x̂ = HWGQ(x)

Cai et al., “Deep Learning with Low Precision by Half-wave Gaussian Quantization.” 2017

[Diagram: non-uniform quantization levels (e.g. 0.2, 0.6, 1.1, 2.1).]

• Better fit for a Gaussian distribution
• Not implementable

Multibit Quantization

Cai et al., “Deep Learning with Low Precision by Half-wave Gaussian Quantization.” 2017

The HWGQ levels (e.g. 0.2, 0.6, 1.1, 2.1) map to the bit pairs 00, 01, 10, 11.

Each value corresponds to a unique bit pair rather than a weighted combination of bits (01 + 10 ≠ 11), and these unique bit combinations are lost during popcount.

Implementing Binary Layers

[Diagram: the binary layer again, with the quantizers annotated. QA applies N-bit linear bipolar or unipolar quantization to the features, QW applies 1-bit bipolar quantization to the kernels, and the bitwise accumulate is an xnor-popcount or mixed-polarity popcount.]

Implementing Binary Models


[Diagram: the model as a chain of Conv, QConv, and QDense layers, with a legend distinguishing full-precision from binary layers.]

Between quantized layers sits a stack of full-precision glue operations:

BatchNorm -> Dequantize -> WeightScale -> Activation -> Quantize -> Bitpack

Computational complexity: roughly NKKFHWC for the QConv itself, versus 4HWF (BatchNorm) + HWF (Dequantize) + 4HWF (WeightScale) + HWF (Activation) + 5HWF (Quantize) + 3HWF (Bitpack) = 18HWF for the glue.

Estimated Impact of Glue Layers

• Impact of glue layers is too high

• We must derive binarized glue for decent end-to-end speedups


Weight Scaling


[Diagram: between two Qconv layers, the output is rescaled by the mean weight magnitude: α = mean(|W|), q(a) = α·a.]

• Introduced in XnorNet

• Allows scale of weights to be preserved

• Brought accuracy from 27% to 44%

• Now used ubiquitously

Rastegari et al., “Xnor-net: Imagenet classification using binary convolutional neural networks.” 2016

Quantized Weight Scaling


[Diagram: the same structure, with the multiplication by α = mean(|W|) replaced by its approximate power of 2.]

Rastegari et al., “Xnor-net: Imagenet classification using binary convolutional neural networks.” 2016

• Use approximate power of 2 (AP2)

• Replaces multiply with bitwise shift

• Constant at inference time

• Requires only a single instruction
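A minimal sketch of the AP2 idea, assuming hypothetical helper names (not the Riptide code):

#include <math.h>

// Approximate power of two: round the scale to the nearest 2^k so that
// multiplying by it becomes a bitwise shift at inference time.
static inline int ap2_exponent(float alpha) {
    return (int)lroundf(log2f(alpha));      // k such that 2^k ~= alpha
}

// Scaling an integer accumulator by alpha ~= 2^k is a single shift.
static inline int scale_by_ap2(int acc, int k) {
    return (k >= 0) ? (acc << k) : (acc >> -k);
}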

BatchNormalization

• Centers and scales output activations

• Essential for quantization, used in all binary techniques

• Must derive quantized versions of both centering and scaling

μ = (1/m) · Σ_{i=1..m} a_i
σ² = (1/m) · Σ_{i=1..m} (a_i - μ)²
â_i = (a_i - μ) / sqrt(σ² + ε)

Ioffe et al., “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” 2015

Binary Scaling

• We can simply compute the AP2 of the standard deviation


Binary Center

To add a constant to a binarized tensor, we must use Fixed Point Quantization (FPQ) with the same bits and scale

[Diagram: an N-bit input to the next layer (e.g. 1 1 1 1 0 1 1 0 …) is extended with wb fractional bits weighted 1/2 + 1/4 + 1/8 + ⋯]

B = N + wb

S = 1 + Σ_{i=1..wb} (1/(2^B - 1)) · (1/2)^i = 1 + (1/(2^B - 1)) · (1 - 1/2^wb)

μ̂ = FPQ(μ, B, S)

Fused Glue Operation

B = N + wb,  S = 1 + (1/(2^B - 1)) · (1 - 1/2^wb),  μ̂ = FPQ(μ, B, S)

• This is the fused glue operation
• All terms are constant at runtime except the activation a
• Only requires two integer operations per element (see the sketch below)
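A minimal sketch of that idea (the variable names and fixed-point layout are assumptions, not the Riptide kernel): with the scale folded into an AP2 shift and the batch-norm mean pre-quantized with FPQ, the glue reduces to a shift and a subtraction.

#include <stdint.h>

// Fused glue on an integer accumulator:
//   shift: AP2 exponent folding the weight scale and 1/sigma
//   mu_q : batch-norm mean, pre-quantized with FPQ(mu, B, S) onto the same
//          fixed-point grid as the shifted accumulator (constant at runtime)
static inline int32_t fused_glue(int32_t acc, int shift, int32_t mu_q) {
    return (acc >> shift) - mu_q;   // two integer operations per element
}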

Fully Binarized Network

Traditional binary network: between QConv layers the glue stack is BatchNorm (4HWF) -> Dequantize (HWF) -> WeightScale (4HWF) -> Activation (HWF) -> Quantize (5HWF) -> Bitpack (3HWF). Total = 18HWF.

Fully binarized network: between QConv layers the glue stack is Fused Glue (2HWF) -> Clip (HWF) -> Bitpack (3HWF). Total = 6HWF.

• 3X fewer glue operations

• No floating-point data

• No multiplication or division

FBN Accuracy

• Our system is comparable to state-of-the-art techniques

• Unipolar quantization yields higher accuracies as expected

• Effective across various models


Measurement Platform


Raspberry Pi (ARM Cortex-A53)

• Widely available and inexpensive

• Representative of IoT devices: Qualcomm Snapdragon, Azure Sphere

• Resource constrained / in need of acceleration

TVM: an optimizing deep learning compiler
• Separates compute and implementation into a declaration and a schedule
• Schedules contain knobs that are tuned for the backend
• Provides a tensor expression language, a schedule optimization space, and AutoTVM for optimizing tensor operators

Chen et al., “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.” 2018

TVM Schedule Intrinsics

• Tiling: Break computation into chunks for better locality

• Vectorization: Use hardware SIMD instructions for more efficient operation execution.

• Parallelization: Leverage MIMD facilities such as multiple cores.

• Loop Unrolling: Replicate the body of loops to reduce overhead.

Chen et al., “Learning to Optimize Tensor Programs.” 2019
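As a generic illustration of tiling and unrolling (plain C, not TVM-generated code), applied to the binary inner loop:

#include <stdint.h>

// Binary dot product tiled over the packed dimension; the innermost four
// iterations are fully unrolled so the compiler can keep the accumulator
// in registers and emit SIMD-friendly code.
int binary_dot_tiled(const uint64_t *a, const uint64_t *w, int words /* multiple of 4 */) {
    int acc = 0;
    for (int t = 0; t < words; t += 4) {   // tile of 4 packed words
        acc += 64 - 2 * __builtin_popcountll(a[t + 0] ^ w[t + 0]);
        acc += 64 - 2 * __builtin_popcountll(a[t + 1] ^ w[t + 1]);
        acc += 64 - 2 * __builtin_popcountll(a[t + 2] ^ w[t + 2]);
        acc += 64 - 2 * __builtin_popcountll(a[t + 3] ^ w[t + 3]);
    }
    return acc;
}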

Fast Popcount

[Diagram: Int-N bit-packed activations (Int-8 storage) feed a BinaryConv that accumulates popcounts in Int-16 (a 2NHWC-byte intermediate); a fused shift/scale produces Int-16 quantized prepacked bits with the upper bits unused (another 2NHWC-byte intermediate); a Bit Pack step then emits Int-N bit-packed outputs (Int-8 storage). The pipeline is shown twice, the second time with the quantize and bit-pack steps fused into the convolution.]

Bitpack Fusion
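A hedged sketch of the bit-plane packing that bitserial kernels consume (the layout and names are assumptions): each bit of an N-bit quantized activation goes into its own 64-wide word, so the convolution can combine per-plane popcounts weighted by powers of two.

#include <stdint.h>

// Pack 64 n-bit quantized activations into n bit-planes.
// planes[b] holds bit b of every activation.
void bitpack_planes(const uint8_t *q, int n_bits, uint64_t *planes) {
    for (int b = 0; b < n_bits; b++) {
        uint64_t word = 0;
        for (int i = 0; i < 64; i++)
            word |= ((uint64_t)((q[i] >> b) & 1)) << i;
        planes[b] = word;
    }
}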

Impact of Optimizations

• Combination of TVM optimizations gives 12X Speedup over baseline

• Each optimization has a significant impact

• Speedups from bitpack fusion are due to fewer memory operations

• With a high-quality implementation, we can study our design choices


Optimization Ablation Study

• Removing any optimization has a significant impact on performance

• Using fused glue gives nearly a 2X speedup, as predicted by opcount estimates


Glue Layer Impact

• Glue consistently takes a similar amount of time as core compute layers

• Our fused glue operation almost completely removes this cost


Impact of Polarity

• Baseline is near optimal

• Quantized layers have much more memory overhead

• Although unipolar quantization has twice as many operations, it is only marginally slower than bipolar quantization


Cumulative Speedup


Layerwise Speedup

• Speedup is not consistent across layers

• May be possible to design a network of binarizable layers


Thank You!

Code:


Paper:

Backup Slides
