An Analytical Method to Determine Minimum Per-Layer...


Transcript of An Analytical Method to Determine Minimum Per-Layer Precision of Deep Neural Networks

Page 1:

An Analytical Method to Determine Minimum Per-Layer Precision of Deep

Neural Networks

Department of Electrical and Computer Engineering

University of Illinois at Urbana-Champaign

Charbel Sakr, Naresh Shanbhag

Code: https://github.com/charbel-sakr/Precision-Analysis-of-Neural-Networks

Also found on sakr2.web.engr.Illinois.edu

Page 2:

Machine Learning ASICs

How are they choosing these precisions? Why is it working? Can it be determined analytically?

• Eyeriss [Sze’16, ISSCC]: AlexNet accelerator, 16b fixed-point
• PuDianNao [Chen’15, ASPLOS]: ML accelerator, 16b fixed-point
• TPU [Google’17, ISCA]: TensorFlow accelerator, 8b fixed-point

Page 3:

Current Approaches

• Stochastic rounding during training [Gupta, ICML’15 – Hubara, NIPS’16]
→ Difficulty of training in a discrete space
• Trial-and-error approach [Sung, SiPS’14]
→ Exhaustive search is expensive
• SQNR-based precision allocation [Lin, ICML’16]
→ Lack of precision/accuracy understanding

[Figure: fixed-point quantization — no accuracy vs. precision understanding]

Page 4:

International Conference on Machine Learning (ICML) - 2017

Page 5:

Precision in Neural Networks

[Sakr et al., ICML’17]

Classification Output Quantization

The classification outputs of the floating-point network and the fixed-point network are compared.

𝑝𝑚: “mismatch probability” — the probability that the fixed-point network’s predicted label differs from the floating-point network’s
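Written out, with $\hat{Y}_{fl}$ and $\hat{Y}_{fx}$ denoting the labels predicted by the floating-point and fixed-point networks (notation assumed here, following Sakr et al., ICML’17), the mismatch probability is:

```latex
p_m = \Pr\left( \hat{Y}_{fx} \neq \hat{Y}_{fl} \right)
```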

Page 6:

Second Order Bound on 𝒑𝒎

[Sakr et al., ICML’17]

• Input/Weight precision trade-off:

→ Optimal precision allocation by balancing the sum

• Data dependence (compute once and reuse)

→ Derivatives obtained in last step of backprop

→ Only one forward-backward pass needed

[Equation labels from slide: activation quantization noise gain; weight quantization noise gain]
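The bound itself is not legible in this transcript; as a sketch of its structure (the symbols $\Delta_A$, $\Delta_W$ for the activation and weight quantization step sizes and $E_A$, $E_W$ for the two noise gains are assumed here), the second-order bound of Sakr et al. (ICML’17) takes the form

```latex
p_m \;\le\; \frac{\Delta_A^2}{24}\, E_A \;+\; \frac{\Delta_W^2}{24}\, E_W
```

Balancing the two summands against each other is what yields the input/weight precision trade-off mentioned above.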

Page 7:

Proof Sketch [Sakr et al., ICML’17]

• For one input, when do we have a mismatch?
→ The FL network predicts label “𝑗”
→ But the FX network predicts label “𝑖” where 𝑖 ≠ 𝑗 (output re-ordering due to quantization)
→ This happens with some probability, computed from the known quantization noise statistics (symmetry of quantization noise) and their variance
→ Applying Chebyshev’s inequality and the law of total probability (LTP) yields the result
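The equations on this slide are not legible in the transcript; the Chebyshev step can be sketched as follows (notation assumed). If $e_i$ denotes the zero-mean quantization perturbation of the $i$-th output $z_i$, a label flip from $j$ to $i$ requires the perturbation to overcome the floating-point margin, and Chebyshev’s inequality bounds that event:

```latex
\Pr\big(\hat{Y}_{fx} = i \mid \hat{Y}_{fl} = j\big)
\;\le\; \Pr\big(e_i - e_j \ge z_j - z_i\big)
\;\le\; \frac{\operatorname{Var}(e_i - e_j)}{(z_j - z_i)^2}
```

Summing over $i \neq j$ and averaging over inputs via the law of total probability gives the bound on $p_m$.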

Page 8:

Tighter Bound on 𝒑𝒎

𝑀: number of classes
𝑆: signal-to-quantization-noise ratio
𝑃1 & 𝑃2: correction factors

• Mismatch probability decreases double-exponentially with precision
→ Theoretically stronger than Theorem 1
→ Unfortunately, less practical to evaluate

[Sakr et al., ICML’17]

Page 9:

Per-layer Precision Assignment

• Per-layer precision allows for more aggressive complexity reductions
• Search space is huge: a 2L-dimensional grid (one activation precision and one weight precision per layer)
• Example: 5-layer network, precisions considered up to 16 bits → 16^10 ≈ 10^12 (one trillion) design points
• We need an analytical way to reduce the search space → maybe using the analysis of Sakr et al.

[Figure: network from Layer 1 (input features) through Layer L, showing per-layer activations and weights]

Key idea: equalization of reflected quantization noise variances
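The scale of that grid is easy to check with a little arithmetic (a sketch, not from the slides):

```python
# Per-layer precision search: one activation precision and one weight
# precision per layer -> a 2L-dimensional grid of candidate assignments.
L = 5                      # layers in the example network
max_bits = 16              # precisions considered: 1..16 bits

grid_points = max_bits ** (2 * L)
print(grid_points)         # 16**10 = 1_099_511_627_776 (~10^12)

# The equalization method collapses this to a 1-D sweep over a single
# reference precision, i.e. only max_bits candidate operating points:
sweep_points = max_bits
print(grid_points // sweep_points)  # 2**36 = 68_719_476_736x fewer points
```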

Page 10:

Fine-grained Precision Analysis

[Sakr et al., ICML’17]

[Sakr et al., ICASSP’18]

per-layer quantization noise gains
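The fine-grained bound is not legible in this transcript; following the structure of the ICML’17 bound, the per-layer version of Sakr et al. (ICASSP’18) takes the form below, where $\Delta_{A_\ell}$, $\Delta_{W_\ell}$ are the per-layer step sizes and $E_{A_\ell}$, $E_{W_\ell}$ the per-layer quantization noise gains (notation assumed here):

```latex
p_m \;\le\; \sum_{\ell=1}^{L} \frac{\Delta_{A_\ell}^2}{24}\, E_{A_\ell}
     \;+\; \sum_{\ell=1}^{L} \frac{\Delta_{W_\ell}^2}{24}\, E_{W_\ell}
```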

Page 11:

Per-layer Precision Assignment Method

• Key idea: equalization of quantization noise variances
• Search space is reduced to a one-dimensional axis of reference (minimum) precision
• Step 1: compute the least quantization noise gain
• Step 2: select each layer’s precision so as to equalize all quantization noise variances
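The two steps above can be sketched in code. This assumes a quantization step $\Delta = 2^{-(B-1)}$, so each extra bit divides $\Delta^2$ by 4; the function name and the sample noise gains are hypothetical, not from the paper:

```python
import math

def assign_precisions(noise_gains, b_min):
    """Equalization rule (sketch): give the least-sensitive layer
    (smallest quantization noise gain E) the reference precision b_min,
    and add bits to the other layers until all reflected noise variances
    Delta_l**2 * E_l are (approximately) equal:
        B_l = b_min + ceil(0.5 * log2(E_l / E_min))
    """
    e_min = min(noise_gains)                    # Step 1: least noise gain
    return [b_min + math.ceil(0.5 * math.log2(e / e_min))
            for e in noise_gains]               # Step 2: equalize variances

# Hypothetical per-layer noise gains (larger = more sensitive layer):
gains = [64.0, 16.0, 4.0, 1.0]
print(assign_precisions(gains, b_min=4))        # [7, 6, 5, 4]
```

Note how precision falls with depth when the noise gains do, matching the profile observed on the CIFAR-10 slide.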

Page 12:

Comparison with Related Works

• Simplified but meaningful model of complexity
→ Computational cost: total number of full adders (FAs) used, assuming folded MACs with bit growth allowed
→ Number of MACs is equal to the number of dot products computed
→ Number of FAs per MAC:
→ Representational cost: total number of bits needed to represent weights and activations — a high-level measure of area and communication cost (data movement)
• Other works considered
→ Stochastic quantization (SQ) [Gupta’15, ICML]
784 – 1000 – 1000 – 10 (MNIST)
64C5 – MP2 – 64C5 – MP2 – 64FC – 10 (CIFAR-10)
→ BinaryNet (BN) [Hubara’16, NIPS]
784 – 2048 – 2048 – 2048 – 10 (MNIST) & VGG (CIFAR-10)

Page 13:

Precision Profiles – CIFAR-10

VGG-11 ConvNet on CIFAR-10: 32C3-32C3-MP2-64C3-64C3-MP2-128C3-128C3-256FC-256FC-10
Precision chosen such that 𝑝𝑚 ≤ 1%

• Weight precision requirements are greater [Sakr et al., ICML’17]
• Precision decreases with depth – early layers are more sensitive to perturbations [Raghu et al., ICML’17]

Page 14:

Precision – Accuracy Trade-offs

• Precision reduction is greatest when using the proposed fine-grained precision assignment

Page 15:

Precision – Accuracy – Complexity Trade-offs

• Best complexity reduction is obtained with the proposed precision assignment
• Complexity is even much smaller than a BinaryNet’s at the same accuracy, because the BinaryNet requires a much more complex network

Page 16:

Conclusion & Future Work

• Presented an analytical method to determine per-layer precision of neural networks
• Method based on equalization of reflected quantization noise variances
• Per-layer precision assignment reveals interesting properties of neural networks
• Method leads to significant overall complexity reduction, e.g., reduction of minimum precision to 2 bits and lower implementation cost than BinaryNets
• Future work:
→ Trade-offs between precision vs. depth vs. width
→ Trade-offs between precision and pruning

Page 17:

Thank you!

Acknowledgment: This work was supported in part by Systems on Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by MARCO and DARPA.