A Case Study on Optimizing Accurate Half Precision Average
Kenny Peou1,2,3
Alan Kelly1, Joel Falcou1,2,3
NUMSCALE1, LRI2 and Université Paris-Saclay3
24/09/2018
Introduction
Floating Point Format
Parallelization
Performance Considerations
Calculating the Average
Conclusion
2 / 32 Introduction A Case Study on Optimizing Accurate Half Precision Average
A Very Average Presentation
• Average is a fundamental component of some ML algorithms: k-means, meanshift, average pooling
• FP16 hardware support is incoming: Pascal GPUs, ARM SVE
• FP16 precision imposes serious limitations
• Performance is important
Floating Point Format - Specifications
[Figure: floating point bit layout - sign, exponent, mantissa]
Precision   Exponent   Mantissa   Max          Min
Half        5          10         65504        6.1 · 10^-5
Single      8          23         3.4 · 10^38  1.2 · 10^-38
Double      11         52         1.8 · 10^308 2.2 · 10^-308
Limitations
• Overflow - too big: 65504 + 32 = ∞; 256 · 256 = ∞
• Underflow/Subnormals - too small: 1 ÷ 66000 = 0; 0.0625 ÷ 64992 = 9.53674e-07
• Exponent (mis)alignment - juuuuuuust wrong: 2048 + 1 = 2048; 2048 + 3.5 = 2050
• Limited hardware support for FP16: emulated via the half C++ library
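The misalignment case can be reproduced without FP16 hardware by moving to the matching magnitude in single precision, where the same effect appears at 2^24; a minimal sketch (the helper name is ours, not from the slides):

```cpp
#include <cassert>

// FP16 loses "+1" at 2048 because 1 is below the ULP there.
// float shows the same behaviour at 2^24, where ULP(2^24) == 2.
bool addend_below_ulp_is_lost() {
    float big = 16777216.0f;   // 2^24
    return big + 1.0f == big;  // the +1 rounds away entirely
}
```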
Unit of Least Precision (ULP)
[Figure: the ULP is the value of the last bit of the mantissa]
Base Value   Half ULP   Float ULP   Double ULP
2^-10        2^-20      2^-33       2^-63
1            2^-10      2^-23       2^-53
2^10         1          2^-13       2^-43
Table: Value of 1 ULP at different base values for half, single, and double precisions.
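The float column can be checked directly with std::nextafter; a small sketch (ulp_at is a hypothetical helper, not from the slides):

```cpp
#include <cassert>
#include <cmath>

// ULP of a float at a given magnitude: the gap between x and the
// next representable value. It grows with the exponent of x.
float ulp_at(float x) {
    return std::nextafterf(x, INFINITY) - x;
}
```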
Parallelization (SIMD)
Single Instruction Single Data: each instruction performs one addition on a single pair of elements, so the array is processed one element at a time.
Single Instruction Multiple Data: each instruction performs the same addition on a whole vector of elements at once.
SIMD Technologies
• GPU - CUDA, OpenCL
• Multithread - OpenMP, MPI
• CPU Vectorization - AVX, NEON, PPC
  - No communication/transfer
  - Fixed register sizes
  - Not portable (without NSIMD)
Vectorization - NSIMD
// AVX
__m256 a = _mm256_load_ps( &(A[i]) );
__m256 b = _mm256_load_ps( &(B[i]) );
__m256 c = _mm256_add_ps( a, b );
_mm256_store_ps( &(C[i]), c );

// NEON
float32x4_t a = vld1q_f32( &(A[i]) );
float32x4_t b = vld1q_f32( &(B[i]) );
float32x4_t c = vaddq_f32( a, b );
vst1q_f32( &(C[i]), c );

// VMX
vector float a = vec_ld( 0, &(A[i]) );
vector float b = vec_ld( 0, &(B[i]) );
vector float c = vec_add( a, b );
vec_st( c, 0, &(C[i]) );

// NSIMD (All architectures)
nsimd::pack<float> b = nsimd::load( &(B[i]) );
nsimd::pack<float> c = nsimd::load( &(C[i]) );
nsimd::pack<float> a = b * c;
nsimd::store( &(A[i]), a );
Performance Considerations
• Number of Operations/Instructions
• Instruction Latency
• Data Types (SIMD consideration)
Operation        Integer   Float   SIMD Float   SIMD Double
Load/Store       1         1       1            1
Addition         1         3       3            3
Subtraction      1         3       3            3
Multiplication   3         5       5            5
Division         22        10      10           10
Table: Minimum cycles required to perform basic arithmetic operations on an Intel Haswell CPU. (Agner Fog)
Calculating the Average
Seems simple, right?
Avg(X) = (1/N) · Σ_{i=1}^{N} x_i
Not with half precision!
Precision Benchmarks
Each method (Naive, Kahan, Iterative, Upcast, Cascade) is tested on the following inputs:

Input         Definition                           Parameters   N
Sequential    Input[i] = s · i                     s = 1        100, 1000, 10000
                                                   s = 0.001    100, 1000, 10000
Fixed Ratio   even i: i ÷ 2; odd i: i ÷ (2 · r)    r = N/2      100, 1000, 10000
Fixed Diff    even i: i ÷ 2; odd i: (i ÷ 2) + d    d = N/2      100, 1000, 10000
Random        Input[i] = rand()                    [0, 1]       1000
                                                   [0, 10]      1000
Fixed         Input[i] = c                         c = 10       1000, 10000, 1000000
Repeating     Input[i] = 10 + (i % 3)              10 - 12      300, 3000, 30000
Image1                                                          2073600
Image2                                                          2073600
Performance Benchmarks
Processor Used     Clock Speed   RAM
Intel i7-4790S     3.20GHz       16GB
Power8 8348-21C    2.061GHz      64GB
ARM Cortex-A57r1   1.91GHz       4GB

Benchmark   Compiler Used   Compilation Flags
Scalar      All of below    -O3
SSE         GCC 6.3.1       -O3 -msse4.2
AVX         GCC 6.3.1       -O3 -mavx2
NEON        GCC 5.4.1       -O3 -march=armv8-a+simd
Altivec     GCC 6.3.0       -O3 -maltivec
Performance measured using float instead of half
Naive Average
function Naive Average(Array)
    sum = 0
    for a in Array do
        sum += a
    end for
    avg = sum / length(Array)
end function
• Rounding errors get worse and worse
• Susceptible to overflow
• Computationally simple
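As a concrete sketch, the naive method in C++ using float in place of half, as the performance benchmarks do (naive_average is our name, not from the slides):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive average: one growing accumulator, one final divide.
// The accumulator is exactly what overflows in FP16, and its
// magnitude is why late additions lose low-order bits.
float naive_average(const std::vector<float>& xs) {
    float sum = 0.0f;
    for (float x : xs) sum += x;
    return sum / static_cast<float>(xs.size());
}
```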
Naive Average - Results
Input                    N         Naive
Sequential, s = 1        100       9
                         1000      fail
                         10000     fail
Sequential, s = 0.001    100       16
                         1000      198
                         10000     fail
Fixed Ratio, r = N/2     100       1
                         1000      fail
                         10000     fail
Fixed Diff, d = N/2      100       13
                         1000      fail
                         10000     fail
Random [0, 1]            1000      271
Random [0, 10]           1000      fail
Fixed, c = 10            1000      152
                         10000     1070
                         1000000   1277
Repeating, 10 - 12       300       17
                         3000      709
                         30000     1998
Image1                   2073600   fail
Image2                   2073600   1761
Naive Average
[Chart: Speedup for All Architectures - Naive: Scalar 1, SSE 3.96, AVX 7.05, Altivec 3.96, NEON 2.6]
Average Using Kahan Sum
function Kahan Average(Array)
    sum = 0
    rem = 0
    for a in Array do
        y = a - rem
        t = sum + y
        rem = ( t - sum ) - y
        sum = t
    end for
    avg = sum / length(Array)
end function
• Mostly compensates for rounding errors
• Still susceptible to overflow and misalignment
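A C++ sketch of the same loop, again with float standing in for half (kahan_average is our name, not from the slides):

```cpp
#include <cassert>
#include <vector>

// Kahan (compensated) summation: rem carries the low-order bits
// that the running sum rounded away, and feeds them back into the
// next addition. Assumes no -ffast-math style reassociation.
float kahan_average(const std::vector<float>& xs) {
    float sum = 0.0f, rem = 0.0f;
    for (float a : xs) {
        float y = a - rem;    // input corrected by the carried error
        float t = sum + y;    // big + small: low bits of y are lost
        rem = (t - sum) - y;  // recover exactly what was lost
        sum = t;
    }
    return sum / static_cast<float>(xs.size());
}
```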
Average Using Kahan Sum - Results
Input                    N         Naive   Kahan
Sequential, s = 1        100       9       1
                         1000      fail    fail
                         10000     fail    fail
Sequential, s = 0.001    100       16      1
                         1000      198     1
                         10000     fail    fail
Fixed Ratio, r = N/2     100       1       1
                         1000      fail    fail
                         10000     fail    fail
Fixed Diff, d = N/2      100       13      1
                         1000      fail    fail
                         10000     fail    fail
Random [0, 1]            1000      271     0
Random [0, 10]           1000      fail    fail
Fixed, c = 10            1000      152     0
                         10000     1070    fail
                         1000000   1277    fail
Repeating, 10 - 12       300       17      0
                         3000      709     1
                         30000     1998    fail
Image1                   2073600   fail    fail
Image2                   2073600   1761    391
Average Using Kahan Sum
[Chart: Speedup for All Architectures - Naive: Scalar 1, SSE 3.96, AVX 7.05, Altivec 3.96, NEON 2.6; Kahan: Scalar 0.24, SSE 1, AVX 1.97, Altivec 1, NEON 0.99]
Iterative Average
function Iterative Average(Array)
    avg = 0
    for i in 1 → length(Array) do
        avg += (Array[i] - avg) / i
    end for
end function
• Removes risk of overflow
• Misalignment, underflow, and subnormals get progressively worse
• lim_{i→∞} (x_i − Avg_i) / i = 0
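A C++ sketch of the running mean, with float standing in for half (iterative_average is our name, not from the slides):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Running mean: avg_i = avg_{i-1} + (x_i - avg_{i-1}) / i.
// No growing sum, so no overflow; but the correction term shrinks
// as i grows and eventually falls below the ULP of avg.
float iterative_average(const std::vector<float>& xs) {
    float avg = 0.0f;
    for (std::size_t i = 0; i < xs.size(); ++i)
        avg += (xs[i] - avg) / static_cast<float>(i + 1);
    return avg;
}
```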
Iterative Average - Results
Input                    N         Naive   Kahan   Iterative
Sequential, s = 1        100       9       1       0
                         1000      fail    fail    0
                         10000     fail    fail    993
Sequential, s = 0.001    100       16      1       20
                         1000      198     1       249
                         10000     fail    fail    1486
Fixed Ratio, r = N/2     100       1       1       17
                         1000      fail    fail    3
                         10000     fail    fail    994
Fixed Diff, d = N/2      100       13      1       23
                         1000      fail    fail    227
                         10000     fail    fail    737
Random [0, 1]            1000      271     0       249
Random [0, 10]           1000      fail    fail    261
Fixed, c = 10            1000      152     0       0
                         10000     1070    fail    0
                         1000000   1277    fail    0
Repeating, 10 - 12       300       17      0       74
                         3000      709     1       128
                         30000     1998    fail    128
Image1                   2073600   fail    fail    1037
Image2                   2073600   1761    391     1701
Iterative Average
[Chart: Speedup for All Architectures - Naive: Scalar 1, SSE 3.96, AVX 7.05, Altivec 3.96, NEON 2.6; Kahan: Scalar 0.24, SSE 1, AVX 1.97, Altivec 1, NEON 0.99; Iterative: Scalar 0.16, SSE 0.63, AVX 0.89, Altivec 0.58, NEON 0.58]
Increased Precision
function Upcast Average(Array)
    sum = (upcast) 0
    for a in Array do
        sum += (upcast) a
    end for
    avg = sum / length(Array)
end function
• Same fundamental problems as the naive average
• Limitations less problematic
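A C++ sketch; the slides upcast half to float, shown here one level up with float accumulated in double (upcast_average is our name, not from the slides):

```cpp
#include <cassert>
#include <vector>

// Upcast average: accumulate in a wider format, round back once at
// the end. The wider exponent and mantissa push the naive method's
// overflow and rounding problems out of reach for these input sizes.
float upcast_average(const std::vector<float>& xs) {
    double sum = 0.0;
    for (float x : xs) sum += x;  // each float is exact in double
    return static_cast<float>(sum / static_cast<double>(xs.size()));
}
```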
Increased Precision - Results
Input                    N         Naive   Kahan   Iterative   Upcast
Sequential, s = 1        100       9       1       0           0
                         1000      fail    fail    0           0
                         10000     fail    fail    993         0
Sequential, s = 0.001    100       16      1       20          0
                         1000      198     1       249         0
                         10000     fail    fail    1486        0
Fixed Ratio, r = N/2     100       1       1       17          0
                         1000      fail    fail    3           0
                         10000     fail    fail    994         0
Fixed Diff, d = N/2      100       13      1       23          0
                         1000      fail    fail    227         0
                         10000     fail    fail    737         0
Random [0, 1]            1000      271     0       249         0
Random [0, 10]           1000      fail    fail    261         0
Fixed, c = 10            1000      152     0       0           0
                         10000     1070    fail    0           0
                         1000000   1277    fail    0           0
Repeating, 10 - 12       300       17      0       74          0
                         3000      709     1       128         0
                         30000     1998    fail    128         0
Image1                   2073600   fail    fail    1037        1
Image2                   2073600   1761    391     1701        12
Increased Precision
[Chart: Speedup for All Architectures - Naive: Scalar 1, SSE 3.96, AVX 7.05, Altivec 3.96, NEON 2.6; Kahan: Scalar 0.24, SSE 1, AVX 1.97, Altivec 1, NEON 0.99; Iterative: Scalar 0.16, SSE 0.63, AVX 0.89, Altivec 0.58, NEON 0.58; Upcast: Scalar 1, SSE 22.37, AVX 21.58]
Cascading Average
[Diagram: pairwise averaging - 0 1 2 3 4 5 6 7 → 0.5 2.5 4.5 6.5 → 1.5 5.5 → 3.5]
• Pairwise operations
• No accumulator = no overflow
• Rounding errors from input variation
• ≤ 2N operations
Cascading Average
function Cascading Average(Array)
    n = length(Array)
    if n == 1 then
        return Array[0]
    else if n == 2 then
        return ( Array[0] + Array[1] ) / 2
    else
        return ( Cascading Average( Array[1:n/2] ) + Cascading Average( Array[n/2:n] ) ) / 2
    end if
end function
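A C++ sketch of the recursion, operating on a pointer and length, with float standing in for half (cascading_average is our name, not from the slides):

```cpp
#include <cassert>
#include <cstddef>

// Pairwise (cascading) average: average each half recursively, then
// average the two results. No single accumulator grows with N, so
// overflow is avoided. As in the slide's pseudocode, the result is
// the exact mean only when n is a power of two; otherwise the two
// halves receive unequal weight.
float cascading_average(const float* a, std::size_t n) {
    if (n == 1) return a[0];
    if (n == 2) return (a[0] + a[1]) / 2.0f;
    std::size_t half = n / 2;
    return (cascading_average(a, half) +
            cascading_average(a + half, n - half)) / 2.0f;
}
```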
Cascading Average - Results
Input                    N         Naive   Kahan   Iterative   Upcast   Cascade
Sequential, s = 1        100       9       1       0           0        0
                         1000      fail    fail    0           0        0
                         10000     fail    fail    993         0        1
Sequential, s = 0.001    100       16      1       20          0        2
                         1000      198     1       249         0        4
                         10000     fail    fail    1486        0        4
Fixed Ratio, r = N/2     100       1       1       17          0        0
                         1000      fail    fail    3           0        0
                         10000     fail    fail    994         0        1
Fixed Diff, d = N/2      100       13      1       23          0        0
                         1000      fail    fail    227         0        0
                         10000     fail    fail    737         0        0
Random [0, 1]            1000      271     0       249         0        3
Random [0, 10]           1000      fail    fail    261         0        4
Fixed, c = 10            1000      152     0       0           0        0
                         10000     1070    fail    0           0        0
                         1000000   1277    fail    0           0        0
Repeating, 10 - 12       300       17      0       74          0        1
                         3000      709     1       128         0        1
                         30000     1998    fail    128         0        1
Image1                   2073600   fail    fail    1037        1        3
Image2                   2073600   1761    391     1701        12       7
Cascading Average
[Chart: Speedup for All Architectures - Naive: Scalar 1, SSE 3.96, AVX 7.05, Altivec 3.96, NEON 2.6; Kahan: Scalar 0.24, SSE 1, AVX 1.97, Altivec 1, NEON 0.99; Iterative: Scalar 0.16, SSE 0.63, AVX 0.89, Altivec 0.58, NEON 0.58; Upcast: Scalar 1, SSE 22.37, AVX 21.58; Cascading: Scalar 0.28, SSE 1.04, AVX 2.21, Altivec 1.21, NEON 0.9]
Performance - Overall
[Figure: speedup vs. input size (2^10 to 2^28) for Naive, Kahan, Iterative, Cascade, and Upcast, on Scalar, SSE, AVX, NEON, and Altivec]
Analysis
• Sum-then-divide methods overflow easily
• The iterative average method fails for large N
• Kahan sum (with divide when necessary) is the most precise when it works
• The naive method is fast enough to increase precision and still be faster (with SIMD)
• Cascading Average is never a bad choice thanks to its robustness
Conclusion
• Numerical precision is complicated
• Performance requires effort
• Even simple things can need research
Future Works
• Similar study using 8-bit floats
• Similar study with exotic numerical formats
• Mix algorithms intelligently
• Look into other "solved" algorithms under half precision
Full Precision Results
Input                    N         Naive   Kahan   Iterative   Upcast   Cascade
Sequential, s = 1        100       9       1       0           0        0
                         1000      fail    fail    0           0        0
                         10000     fail    fail    993         0        1
Sequential, s = 0.001    100       16      1       20          0        2
                         1000      198     1       249         0        4
                         10000     fail    fail    1486        0        4
Sequential, s = -1       100       9       1       0           0        0
                         1000      fail    fail    0           0        0
                         10000     fail    fail    993         0        1
Fixed Ratio, r = N/2     100       1       1       17          0        0
                         1000      fail    fail    3           0        0
                         10000     fail    fail    994         0        1
Fixed Diff, d = N/2      100       13      1       23          0        0
                         1000      fail    fail    227         0        0
                         10000     fail    fail    737         0        0
Random [0, 1]            1000      271     0       249         0        3
Random [0, 10]           1000      fail    fail    261         0        4
Random [0, 100]          1000      fail    fail    246         0        2
Random [0, 1000]         1000      fail    fail    253         0        3
Fixed, c = 10            1000      152     0       0           0        0
                         10000     1070    fail    0           0        0
                         100000    1259    fail    0           0        0
                         1000000   1277    fail    0           0        0
Repeating, 10 - 12       300       17      0       74          0        1
                         3000      709     1       128         0        1
                         30000     1998    fail    128         0        1
                         300000    1401    fail    128         0        1
Image1                   2073600   fail    fail    1037        1        3
Image2                   2073600   1761    391     1701        12       7