Kaan Kara, Dan Alistarh, Gustavo Alonso, Onur Mutlu, Ce Zhang
FPGA-accelerated Dense Linear Machine Learning: A Precision-Convergence Trade-off
1
What is the most efficient way of training dense linear models on an FPGA?
• FPGAs can handle floating-point arithmetic, but they are certainly not the best choice for it.
We have tried: on par with a 10-core Xeon, not a clear win. Yet!
• How about some recent developments in machine learning research?
In theory, low-precision data leads to EXACTLY the same end result.
2
Stochastic Gradient Descent (SGD)
while not converged do
    for i from 1 to M do
        g_i = (a_i · x − b_i) a_i
        x ← x − γ g_i
    end
end
Data set A (M×N): M samples, N features. Model x (N×1). Labels b (M×1).
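As a concrete reference for the loop above, here is a minimal Python sketch of plain SGD for the least-squares objective; the arguments (data set A, labels b, step size gamma, epoch count) are placeholders, and the fixed epoch count stands in for a real convergence check:

```python
import numpy as np

def sgd(A, b, gamma, num_epochs):
    """Plain SGD for min_x 0.5 * sum_i (a_i . x - b_i)^2."""
    M, N = A.shape
    x = np.zeros(N)
    for _ in range(num_epochs):           # stands in for "while not converged"
        for i in range(M):                # for i from 1 to M
            a_i = A[i]
            g_i = (a_i @ x - b[i]) * a_i  # g_i = (a_i . x - b_i) a_i
            x -= gamma * g_i              # x <- x - gamma * g_i
    return x
```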
[Diagram: the data set streams from a data source (DRAM, SSD, sensor) through memory (CPU cache, DRAM, FPGA BRAM) to the compute unit (CPU, GPU, FPGA), which reads a row a_r and the model x and computes the gradient via dot(a_r, x).]
3
Advantage of Low-Precision Data + FPGA
(Same SGD loop and system diagram as before.) Compress! Quantize the data set A (M samples × N features) to low precision before streaming it from the data source to the compute unit, so fewer bytes have to be moved per gradient computation.
4
Key Takeaway
[Plot: training loss vs. iterations for naïve rounding, stochastic rounding, and full precision.]
• Using an FPGA, we can increase the hardware efficiency while maintaining the statistical efficiency of SGD for dense linear models.
[Plot: training loss vs. execution time — same quality, reached faster.]
• In practice, the system opens up a multivariate trade-off space: data properties, FPGA implementation, SGD parameters…
5
Outline for the Rest…
1. Stochastic quantization and data layout
2. Target platform: Intel Xeon+FPGA
3. Implementation: From 32-bit to 1-bit SGD on FPGA.
4. Evaluation: Main results. Side effects.
5. Conclusion
6
Stochastic Quantization [1]
Example: quantize the value 0.7 to the levels {0, 1}.
Naïve solution, nearest rounding (always 1) => converges to a different solution.
Stochastic rounding: 0 with probability 0.3, 1 with probability 0.7 => the expectation remains the same, converges to the same solution.
Objective: min_x (1/2) Σ_r (a_r · x − b_r)²
Gradient: g_i = (a_i · x − b_i) a_i
We need 2 independent samples: g_i = (Q1(a_i) · x − b_i) Q2(a_i)
…and each iteration of the algorithm needs fresh samples.
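A minimal Python sketch of the idea, assuming the feature values are already normalized to [0, 1] and quantized to num_levels evenly spaced levels; the function names, the num_levels parameter, and the rng argument (a numpy Generator such as np.random.default_rng()) are illustrative, not the paper's API:

```python
import numpy as np

def stochastic_quantize(a, num_levels, rng):
    """Round each entry of a (assumed in [0, 1]) to one of num_levels evenly
    spaced levels, up or down at random so that E[Q(a)] = a."""
    scaled = a * (num_levels - 1)
    low = np.floor(scaled)
    round_up = rng.random(a.shape) < (scaled - low)   # P(up) = fractional part
    return (low + round_up) / (num_levels - 1)

def quantized_gradient(a_i, b_i, x, num_levels, rng):
    """Unbiased gradient with two independent quantization samples:
    g_i = (Q1(a_i) . x - b_i) * Q2(a_i)."""
    q1 = stochastic_quantize(a_i, num_levels, rng)
    q2 = stochastic_quantize(a_i, num_levels, rng)    # fresh, independent sample
    return (q1 @ x - b_i) * q2
```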
[1] H. Zhang, K. Kara, J. Li, D. Alistarh, J. Liu, C. Zhang: ZipML: An End-to-end Bitwise Framework for Dense Generalized Linear Models, arXiv:1611.05402
7
Before the FPGA: The Data Layout
Original layout: for each of the m samples (sample 1 … sample m), every feature (feature 1, feature 2, feature 3, …, feature n) is stored as a 32-bit float value.
Quantize to 4-bit. We need 2 independent quantizations!
Quantized layout: for each of the m samples, every feature (feature 1 … feature n) is stored as a "4-bit | 4-bit" pair holding the two independent quantizations => 4x compression!
For each iteration, a new quantization index is selected (one quantization index per iteration). Quantization is currently slow!
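A sketch of how the two independent 4-bit quantizations of a feature could be packed into one byte; pack_two_4bit/unpack_two_4bit and the level-index inputs (integers 0..15 produced by a stochastic quantizer) are illustrative names, not the paper's code:

```python
import numpy as np

def pack_two_4bit(q1_levels, q2_levels):
    """Pack two independent 4-bit level indices (0..15) per feature into one
    uint8: high nibble = first quantization, low nibble = second quantization.
    One byte instead of a 32-bit float -> 4x compression."""
    assert q1_levels.max() < 16 and q2_levels.max() < 16
    return (q1_levels.astype(np.uint8) << 4) | q2_levels.astype(np.uint8)

def unpack_two_4bit(packed):
    """Recover the two 4-bit level indices from the packed byte."""
    return packed >> 4, packed & 0x0F
```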
8
Target Platform
Intel Xeon+FPGA
[Diagram: Intel Xeon CPU (Ivy Bridge, Xeon E5-2680 v2, 10 cores, 25 MB L3) with caches and memory controller, attached to 96 GB main memory at ~30 GB/s; Altera Stratix V FPGA hosting a QPI endpoint and the accelerator, connected to the CPU over QPI at 6.5 GB/s.]
9
Full Precision SGD on FPGA
[Circuit diagram: a 64 B cache line delivers 16 float values per cycle from the "a" FIFO. Dot-product stage: 16 float multipliers, 16 float-to-fixed converters, and 16 fixed adders accumulate a·x. Gradient-calculation stage: a "b" FIFO, 1 fixed-to-float converter, 1 float adder, and 1 float multiplier produce γ(a·x − b). Model-update stage: 16 float multipliers, 16 fixed adders, and 16 fixed-to-float converters compute x − γ(a·x − b)·a, and the model x is written back and reloaded once the batch size is reached.]
32-bit floating-point: 16 values
Processing rate: 12.8 GB/s
[Diagram: computation in custom logic, model storage in FPGA BRAM, data source in DRAM, SSD, or a sensor.]
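A host-side Python sketch of what this pipeline computes, as a functional model rather than the FPGA implementation; batch_size, gamma and the epoch count are illustrative parameters:

```python
import numpy as np

def fpga_style_sgd(A, b, gamma, batch_size, num_epochs):
    """Functional model of the pipeline: per sample, a dot product a.x, a
    scaled residual gamma*(a.x - b), and an accumulated update gamma*(a.x - b)*a;
    the model x is only written back once the batch size is reached."""
    M, N = A.shape
    x = np.zeros(N)
    for _ in range(num_epochs):
        update = np.zeros(N)                  # accumulated model update
        in_batch = 0
        for i in range(M):
            a_i = A[i]
            dot = a_i @ x                     # dot-product stage
            scaled = gamma * (dot - b[i])     # gradient-calculation stage
            update += scaled * a_i            # per-sample contribution
            in_batch += 1
            if in_batch == batch_size:        # "batch size is reached?"
                x -= update                   # model-update stage
                update[:] = 0.0
                in_batch = 0
        if in_batch:                          # flush a partial final batch
            x -= update
    return x
```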
10
Scale out the design for low-precision data!
How can we increase the internal parallelism while maintaining the processing rate?
11
Challenge + Solution
[Same full-precision SGD circuit as before: a 64 B cache line delivers 16 float values per cycle.]
With low-precision data, one 64 B cache line carries more values:
Q8 (8-bit): 32 values, Q4 (4-bit): 64 values, Q2 (2-bit): 128 values, Q1 (1-bit): 256 values.
=> Scaling out is not trivial!
1) We can get rid of floating-point arithmetic.
2) We can further simplify integer arithmetic for the lowest-precision data.
12
Trick 1: Selection of quantization levels
1. Select the lower and upper bounds [L, U].
2. Select the size of the quantization interval Δ.
=> All quantized values are integers (each value is a small integer index into the levels L, L+Δ, L+2Δ, …, U).
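A Python sketch of why this helps, under the assumption that the quantizer emits the integer level index of each feature: the per-element multiply-accumulate then only involves small integers, and the rescaling by L and Δ is applied once per dot product. The function name and the explicit L/delta rescaling are illustrative, not the paper's exact scheme:

```python
import numpy as np

def dot_from_level_indices(level_indices, x, L, delta):
    """Compute Q(a) . x where Q(a) = L + delta * level_indices.
    The inner multiply-accumulate uses only the small integer indices;
    the scaling by L and delta happens once, outside the loop."""
    int_dot = level_indices.astype(np.int64) @ x   # small-integer weights times x
    return L * x.sum() + delta * int_dot
```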
13
This is enough for Q4 and Q8
With K = 32 (Q8), 64 (Q4), or 128 (Q2) quantized values per 64 B cache line:
[Circuit diagram: the same three-stage pipeline, now entirely in fixed point. Dot product: K fixed multipliers and K fixed adders accumulate Q(a)·x from the "a" FIFO. Gradient calculation: 1 fixed adder and 1 bit-shift produce γ(Q(a)·x − b), with b from the "b" FIFO. Model update: K fixed multipliers and K fixed adders apply x ← x − γ(Q(a)·x − b)·Q(a) once the batch size is reached.]
14
Trick 2: Implement Multiplication using Multiplexer
[Diagram: Q2 multiplier — the 2-bit quantized value Q2value[1:0], normalized to [0, 2] or [−1, 1], drives a multiplexer that outputs 0, In[31:0], or In[31:0] << 1 (respectively 0, In[31:0], or −In[31:0]) as Result[31:0]. Q1 multiplier — the 1-bit quantized value Q1value[0] selects between 0 and In[31:0]. No DSP multipliers are needed.]
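A functional Python sketch of the two multiplexer-based multipliers, assuming the Q2 operand is the unsigned value in {0, 1, 2} and the Q1 operand is in {0, 1}; the function names are illustrative:

```python
def q2_multiply(value_in, q2_value):
    """Mux-style multiply for a 2-bit quantized operand in {0, 1, 2}:
    select 0, the input, or the input shifted left by one bit (x2)."""
    if q2_value == 0:
        return 0
    if q2_value == 1:
        return value_in
    return value_in << 1      # q2_value == 2

def q1_multiply(value_in, q1_value):
    """Mux-style multiply for a 1-bit quantized operand in {0, 1}."""
    return value_in if q1_value else 0
```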
15
Support for all Qx
Circuit for Q8, Q4 and Q2 SGD: the fixed-point pipeline above with K = 32, 64, or 128 quantized values per 64 B cache line. Processing rate: 12.8 GB/s.
Circuit for Q1 SGD: a 64 B cache line carries K = 256 quantized values and is split into 2 parts of 128 values each, processed by 128 fixed multipliers and 128 fixed adders. Processing rate: 6.4 GB/s.
16
Support for all Qx
(Same Q8, Q4, Q2 and Q1 circuits as on the previous slide.) FPGA resource utilization:
Data Type Logic DSP BRAM
float 38% 12% 7%
Q8 35% 25% 6%
Q4 36% 50% 6%
Q2, Q1 43% 1% 6%
17
Data sets for evaluation
Name # Samples # Features # Classes
MNIST 60,000 780 10
GISETTE 6,000 5,000 2
EPSILON 10,000 2,000 2
SYN100 10,000 100 Regression
SYN1000 10,000 1,000 Regression
18
Following the Intel legal guidelines on publishing performance numbers, we would like to make the reader aware that results in this publication were generated using preproduction hardware and software, and may not reflect the performance of production or future systems.
SGD Performance Improvement
[Plot: SGD on Synthetic100 — 4-bit works. Training loss vs. time (s) for float CPU 1-thread, float CPU 10-threads, Q4 FPGA, and float FPGA; annotated speedup: 7.2x.]
[Plot: SGD on Synthetic1000 — 8-bit works. Training loss vs. time (s) for float CPU 1-thread, float CPU 10-threads, Q8 FPGA, and float FPGA; annotated speedup: 2x.]
19
1-bit also works on some data sets
[Plot: SGD on MNIST, digit 7 — 1-bit works. Training loss vs. time (s) for float CPU 1-thread, float CPU 10-threads, Q1 FPGA, and float FPGA; annotated speedup: 10.6x.]
MNIST multi-class classification accuracy:
Precision         Accuracy   Training Time
float CPU-SGD     85.82%     19.04 s
1-bit FPGA-SGD    85.87%     2.41 s
20
Naïve Rounding vs. Stochastic Rounding
[Plot: training loss vs. number of epochs for float, Q2/F2, Q4/F4, and Q8/F8; a "Bias" annotation marks the gap left by the naïvely rounded variants.]
Stochastic quantization results in unbiased convergence.
21
Effect of Step Size
[Plot: training loss (log scale) vs. number of epochs for: float with step size 1/2^9 vs. Q2 with step size 2/2^9; float with 1/2^12 vs. Q2 with 2/2^12; float with 1/2^15 vs. Q2 with 2/2^15.]
Large step size + full precision
vs.
small step size + low precision
22
Conclusion
• Highly scalable, parametrizable FPGA-based stochastic gradient descent implementations for linear model training.
• Open source: www.systems.ethz.ch/fpga/ZipML_SGD
Key Takeaways:
1. Linear models on an FPGA should be trained using stochastically rounded, low-precision data.
2. Multivariate trade-off space: precision vs. end-to-end runtime, convergence quality, design complexity, data and system properties.
23
Backup
Why do dense linear models matter?
• Workhorse algorithm for regression and classification.
• Sparse, high-dimensional data sets can be converted into dense ones (transfer learning, random features).
Why is training speed important?
• Often a human-in-the-loop process.
• Configuring parameters, selecting features, handling data dependencies, etc.
[Diagram: train ↔ evaluate loop.]
25