Date posted: 15-Jan-2016
University of Michigan Electrical Engineering and Computer Science
MacroSS: Macro-SIMDization of Streaming Applications
Amir Hormati*, Yoonseo Choi‡, Mark Woh*,
Manjunath Kudlur†, Rodric Rabbah‡, Trevor Mudge*,
Scott Mahlke*
* Advanced Computer Arch. Lab., University of Michigan
† Nvidia Corp.
‡ IBM T.J. Watson Research Center
Importance of SIMD
• Energy and area efficient way to exploit data-level parallelism
• Performance in multimedia and communication apps
• Ubiquitous in modern processors
– Intel: SSE, Larrabee
– IBM: Altivec, Cell SPE
– ARM: Neon
[Figure: three blocks, each labeled Control Unit, Functional Units, and Cache]
Stream Computing
• Prevalent in embedded, desktop and server systems
• Many optimizations for mapping and scheduling applications to parallel architectures
• Retargetability is a big plus in streaming languages
• Task, pipeline, and data-level parallelism are mapped onto core-level parallelism
• Data-level parallelism on SIMD engines is not utilized
Traditional Vectorization on Streaming Applications
[Figure: bar chart of Speedup (x), axis from 0 to 3.5, for ICC + Auto Vectorize on AudioBeam, BeamFormer, DCT, FFT, FM Radio, Matrix Multiply, Matrix Multiply Block, Bitonic Sort, FilterBank, MP3 Decoder, and Average]
Why are SIMD engines under-utilized?
• Finding data-level parallelism suitable for SIMD engines
• Proper data-alignment
• Complicated compiler optimization and transformations
• Wide variety of SIMD standards
In this work…
• Macro-level SIMDization techniques for streaming languages.
• MacroSS compiler for StreamIt language
• Hardware-based buffer optimizations for packing/unpacking operations
• Evaluation of MacroSS on Intel Core i7
StreamIt
• Main constructs:
– Filter: encapsulates computation
  • Stateful
  • Stateless
– Pipeline: expresses pipeline parallelism
– Splitjoin: expresses task/data-level parallelism
• Exposes different types of parallelism
• Scheduling and rate-matching are needed
[Figure: example stream graph with its pipeline, filter, and splitjoin constructs labeled]
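The rate-matching step mentioned above can be illustrated with a small sketch. It assumes a linear pipeline whose filters declare static pop and push rates, and it computes the smallest firing counts for which every buffer is balanced over one steady state. The function name `repetition_vector` and the list-of-tuples encoding are illustrative choices, not part of StreamIt.

```python
from fractions import Fraction
from math import lcm


def repetition_vector(rates):
    """Smallest steady-state firing counts for a linear pipeline.

    rates: list of (pop, push) tuples, one per filter, in pipeline order.
    """
    reps = [Fraction(1)]
    for (_, push), (pop, _) in zip(rates, rates[1:]):
        # Balance equation for one buffer:
        #   reps[producer] * push == reps[consumer] * pop
        reps.append(reps[-1] * push / pop)
    # Clear the denominators to obtain the smallest integer solution.
    scale = lcm(*(r.denominator for r in reps))
    return [int(r * scale) for r in reps]


# D (pop=2, push=2) feeding E (pop=3, push=4):
# D must fire 3 times for every 2 firings of E.
print(repetition_vector([(2, 2), (3, 4)]))
```

With these rates, three firings of D produce six elements and two firings of E consume six, so the buffer between them returns to its initial occupancy after each steady state.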
Macro SIMDization
• SIMDization at graph level
• Tunes the graph based on the target system
– SIMD standards
– Wide/Narrow SIMD
• Actor SIMDization:
– Single-Actor
– Vertical
– Horizontal
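One way the graph could be tuned for a given SIMD width is sketched below. The sketch assumes (the slide does not state this) that every actor selected for SIMDization must fire a multiple of the SIMD width times per steady state, so the whole schedule is multiplied by the smallest factor that achieves this. `scale_schedule` and its arguments are hypothetical names.

```python
from math import gcd, lcm


def scale_schedule(reps, simd_actors, simd_width):
    """Scale a steady-state schedule so SIMDized actors fill whole vectors.

    reps: dict mapping actor name to its firing count per steady state.
    simd_actors: names of the actors chosen for SIMDization.
    simd_width: number of lanes in one SIMD register.
    """
    factor = 1
    for name in simd_actors:
        # Smallest multiplier that makes reps[name] a multiple of simd_width.
        needed = simd_width // gcd(reps[name], simd_width)
        factor = lcm(factor, needed)
    # The whole schedule is scaled uniformly so buffers stay balanced.
    return {name: count * factor for name, count in reps.items()}


# Narrow (4-lane) and wide (8-lane) targets for the same graph.
print(scale_schedule({"D": 3, "E": 2}, ["D", "E"], 4))
print(scale_schedule({"D": 3, "E": 2}, ["D", "E"], 8))
```

Scaling every actor by the same factor keeps the production and consumption rates matched, which is why the sketch never scales actors individually.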
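Single-Actor SIMDization Overview
[Figure: four ways to run eight firings of actor E, shown in panels titled Serial Execution, Execution Reordering, Ideal Vectorization, and Realistic Vectorization; the scalar actor is labeled E (8) and the vectorized actor Ev]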
Single-Actor SIMDization

Scalar actor E (labeled E (8) and E (4) on the slide):

    x0 = pop();
    x1 = pop();
    x2 = pop();
    result[0] = x1 * cos(x0) + x2;
    result[1] = x0 * cos(x1) + x2;
    result[2] = x1 * sin(x0) + x2;
    result[3] = x0 * sin(x1) + x2;
    for (i : 0 to 3)
        push(result[i]);

SIMDized actor EV (1):

    x0_v.{3} = peek(9);
    x0_v.{2} = peek(6);
    x0_v.{1} = peek(3);
    x0_v.{0} = pop();

    x1_v.{3} = peek(9);
    x1_v.{2} = peek(6);
    x1_v.{1} = peek(3);
    x1_v.{0} = pop();

    x2_v.{3} = peek(9);
    x2_v.{2} = peek(6);
    x2_v.{1} = peek(3);
    x2_v.{0} = pop();

    result_v[0] = x1_v * cos(x0_v) + x2_v;
    result_v[1] = x0_v * cos(x1_v) + x2_v;
    result_v[2] = x1_v * sin(x0_v) + x2_v;
    result_v[3] = x0_v * sin(x1_v) + x2_v;

    for (i : 0 to 3) {
        rpush(result_v[i].{3}, 12);
        rpush(result_v[i].{2}, 8);
        rpush(result_v[i].{1}, 4);
        push(result_v[i].{0});
    }

• Only stateless actors
• Scalar buffer accesses
• Strided pushes and pops
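The transformation on this slide can be checked with a small executable model. The sketch below assumes a 4-lane SIMD unit and represents the input and output tapes as Python lists; lane k of each vector holds the data of scalar firing k, which is what the strided `peek` and `rpush` offsets express. The helper names (`work`, `run_scalar`, `run_simdized`) are illustrative and do not come from the MacroSS compiler.

```python
import math

SIMD_WIDTH = 4    # assumed lane count (for example 4 x 32-bit in 128 bits)
POP, PUSH = 3, 4  # rates of actor E


def work(x0, x1, x2):
    """Scalar body of actor E: three inputs, four outputs."""
    return [x1 * math.cos(x0) + x2,
            x0 * math.cos(x1) + x2,
            x1 * math.sin(x0) + x2,
            x0 * math.sin(x1) + x2]


def run_scalar(tape):
    """Fire E once per group of POP consecutive input elements."""
    out = []
    for f in range(len(tape) // POP):
        out.extend(work(*tape[f * POP:(f + 1) * POP]))
    return out


def run_simdized(tape):
    """Each vector firing covers SIMD_WIDTH consecutive scalar firings."""
    out = [0.0] * (len(tape) // POP * PUSH)
    for v in range(len(tape) // (POP * SIMD_WIDTH)):
        in_base = v * POP * SIMD_WIDTH
        out_base = v * PUSH * SIMD_WIDTH
        # Strided gather: lane k of input j sits at offset j + k * POP.
        x_v = [[tape[in_base + j + k * POP] for k in range(SIMD_WIDTH)]
               for j in range(POP)]
        # Lane-wise computation; a real backend would use vector instructions.
        res_v = [work(x_v[0][k], x_v[1][k], x_v[2][k])
                 for k in range(SIMD_WIDTH)]
        # Strided scatter: result i of lane k goes to offset i + k * PUSH.
        for k in range(SIMD_WIDTH):
            for i in range(PUSH):
                out[out_base + i + k * PUSH] = res_v[k][i]
    return out


tape = [0.1 * i for i in range(24)]  # enough input for 8 scalar firings
assert run_simdized(tape) == run_scalar(tape)
```

The final assertion shows that two vector firings leave the output tape in exactly the state that eight scalar firings would, so neighboring actors do not need to change.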
Why Scalar Buffers?
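[Figure: a 24-element buffer between producer D (pop=2, push=2, 12 executions) and consumer E (pop=3, push=4, 8 executions), drawn as 128-bit rows of four elements. Stored in scalar order the rows are 0 1 2 3, 4 5 6 7, and so on up to 20 21 22 23. Three vector executions of D would write the rows 0 2 4 6, 1 3 5 7, 8 10 12 14, 9 11 13 15, 16 18 20 22, 17 19 21 23, while two vector executions of E would need the rows 0 3 6 9, 1 4 7 10, 2 5 8 11, 12 15 18 21, 13 16 19 22, 14 17 20 23. A question mark indicates the mismatch between the two layouts.]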
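The mismatch in the figure can be reproduced with a few lines of code. The sketch below assumes a 4-lane SIMD unit and lists, for each vector access of a SIMDized actor, which scalar-order elements the four lanes touch. `vector_accesses` is a hypothetical helper written for this illustration.

```python
SIMD_WIDTH = 4


def vector_accesses(rate, total_elems, simd_width=SIMD_WIDTH):
    """Scalar-order element indices covered by each vector access.

    rate: elements pushed (or popped) per scalar firing.
    total_elems: number of elements moved through the buffer.
    """
    groups = []
    per_vector_firing = rate * simd_width
    for base in range(0, total_elems, per_vector_firing):
        for j in range(rate):
            # Lane k handles scalar firing k, whose j-th element
            # lives at offset j + k * rate.
            groups.append([base + j + lane * rate
                           for lane in range(simd_width)])
    return groups


d_writes = vector_accesses(rate=2, total_elems=24)  # D: push=2
e_reads = vector_accesses(rate=3, total_elems=24)   # E: pop=3

for row in d_writes:
    print("D writes", row)
for row in e_reads:
    print("E reads ", row)
```

None of the rows D would write as a vector matches a row E wants to read as a vector, which is why the buffer is kept in scalar order and accessed with strided pushes and pops.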
Vertical SIMDization
[Figure: actors D (pop=2, push=2, 12 executions, D0 to D11) and E (pop=3, push=4, 8 executions, E0 to E7) are fused into a single actor 3D 2E (pop=6, push=8, 4 executions). Each fused execution groups three firings of D with two firings of E: D0 D1 D2 with E0 E1, D3 D4 D5 with E2 E3, D6 D7 D8 with E4 E5, and D9 D10 D11 with E6 E7.]
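The rates of the fused actor in the figure follow from the rates of D and E. A minimal sketch, assuming a producer that feeds a consumer directly and static rates on both; `fuse` is a hypothetical helper name.

```python
from math import gcd


def fuse(producer, consumer):
    """Fuse two adjacent actors given as (pop, push) tuples.

    Returns (producer firings, consumer firings, (fused pop, fused push))
    for one firing of the fused actor.
    """
    p_pop, p_push = producer
    c_pop, c_push = consumer
    # Elements exchanged inside one fused firing: lcm(p_push, c_pop).
    traffic = p_push * c_pop // gcd(p_push, c_pop)
    n_producer = traffic // p_push
    n_consumer = traffic // c_pop
    return n_producer, n_consumer, (n_producer * p_pop, n_consumer * c_push)


n_d, n_e, fused_rates = fuse((2, 2), (3, 4))
print(n_d, n_e, fused_rates)  # firings of D, firings of E, fused (pop, push)
```

For D and E this gives three firings of D and two of E per fused firing, with pop=6 and push=8. Twelve firings of D then correspond to four fused firings, which fills one 4-lane vector firing.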
Horizontal SIMDization
• Find isomorphic actors in split/join structures
• The isomorphic actors are merged into one vectorized actor
• Actors can be either stateful or stateless
[Figure: a Source feeds a Splitter, which fans out to n parallel branches A1 B1 C1 through An Bn Cn; a Joiner collects the branch outputs and feeds a Sink]
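A small model shows why merging works even for stateful actors. The sketch below takes isomorphic to mean that the actors run the same code and differ only in constants and state, and it assumes a round-robin splitter and joiner with weight 1. Each lane of the merged actor holds the constant and the state of one original actor. All names here are illustrative.

```python
def make_actor(gain):
    """A stateful scalar actor: running sum scaled by a per-actor constant."""
    state = {"acc": 0.0}

    def fire(x):
        state["acc"] += x
        return gain * state["acc"]

    return fire


class MergedActor:
    """One vectorized actor; lane k stands in for the k-th original actor."""

    def __init__(self, gains):
        self.gains = list(gains)
        self.acc = [0.0] * len(self.gains)

    def fire(self, xs):
        # Lane-wise update; a real backend would use vector instructions.
        self.acc = [a + x for a, x in zip(self.acc, xs)]
        return [g * a for g, a in zip(self.gains, self.acc)]


def run_splitjoin(gains, stream):
    """Round-robin splitter, n scalar actors, round-robin joiner."""
    actors = [make_actor(g) for g in gains]
    n = len(actors)
    return [actors[i % n](x) for i, x in enumerate(stream)]


def run_merged(gains, stream):
    """The same computation with the n actors merged into one."""
    merged = MergedActor(gains)
    n = len(gains)
    out = []
    for base in range(0, len(stream), n):
        out.extend(merged.fire(stream[base:base + n]))
    return out


gains = [1.0, 2.0, 3.0, 4.0]
stream = [float(i) for i in range(16)]
assert run_merged(gains, stream) == run_splitjoin(gains, stream)
```

Because each lane keeps its own accumulator, the lanes never interact, so state does not prevent this form of SIMDization the way it does when consecutive firings of one actor are packed into a vector.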
[Figure: an example stream graph shown at three stages, with the steps labeled Horizontal SIMDization, Vertical SIMDization, and Single-Actor SIMDization. The original graph contains A (pop=n, push=8), a Splitter (4, 4, 4, 4), four copies each of B (B0 to B3, pop=12, push=3) and C (C0 to C3, pop=1, push=1), a Joiner (1, 1, 1, 1), and the actors D (pop=2, push=2), E (pop=3, push=4), F (pop=4, push=1), G (pop=2, push=8), and H (pop=8, push=n), each annotated with a repetition count. In the transformed graphs the B and C copies are merged between an HSplitter (4) and an HJoiner (1), D and E are fused into 3D 2E (pop=6, push=8), and peek rates appear on G (peek=4, pop=2, push=8) and H (peek=8, pop=8, push=n). Question marks indicate the remaining candidate actors.]
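Streaming Address Generation
[Figure: a 24-element buffer between producer D (push=3, 8 executions) and consumer E (pop=2, 12 executions), shown in two layouts. Scalar Buffer: elements stored in order, with rows 0 1 2 3, 4 5 6 7, and so on up to 20 21 22 23. Vector Buffer: elements stored in the order of D's vector pushes, with rows 0 3 6 9, 1 4 7 10, 2 5 8 11, 12 15 18 21, 13 16 19 22, 14 17 20 23.]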
• Area overhead less than 1% on Core i7.
• Critical path: two 16-bit adds and one 64-bit add.
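The index translation suggested by the vector-buffer figure can be modeled in software. This is a sketch of the mapping only, not of the hardware design: it assumes a 4-lane SIMD unit and a producer that writes the buffer with vector pushes, and it computes where a given scalar-order element ends up. `sagu_address` and `vector_layout` are hypothetical names.

```python
def sagu_address(i, rate, simd_width=4):
    """Slot of scalar-order element i in a vector-written buffer.

    rate: push rate of the SIMDized producer (elements per scalar firing).
    """
    block_size = rate * simd_width          # elements per vector firing
    block, within = divmod(i, block_size)
    lane, row = divmod(within, rate)        # which firing, which element
    return block * block_size + row * simd_width + lane


def vector_layout(rate, total_elems, simd_width=4):
    """Buffer contents after the producer's vector pushes."""
    layout = []
    for base in range(0, total_elems, rate * simd_width):
        for j in range(rate):
            layout.extend(base + j + lane * rate
                          for lane in range(simd_width))
    return layout


layout = vector_layout(rate=3, total_elems=24)
print(layout[:12])  # first three rows of the vector buffer

# A scalar consumer can read the elements in their original order.
assert all(layout[sagu_address(i, rate=3)] == i for i in range(24))
```

With such a translation the producer can keep its vector stores while a consumer with a different rate still sees the elements in scalar order, which removes the need for explicit packing and unpacking code in this model.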
Traditional vs. Macro SIMDization
| | Traditional SIMDization | Macro-SIMDization |
|---|---|---|
| Applicability | Any | Streaming |
| Adjust the schedule | | x |
| Tune streaming graph | | x |
| Identify isomorphic actors | | x |
| Easily retargetable | | x |
| Complexity of optimizations and transformations | High | Low |
Experimental Setup
[Figure: toolchain flow, in order: Streaming Program, Frontend Compiler, Backend Compiler, C Code, Host Compiler, Intel Core i7]
• Frontend: MIT StreamIt compiler
• Backend: MacroSS
• Host compiler: ICC 11.1 compiles the C/C++ code
• Target: Core i7 with SSE4
Macro-SIMDization vs. Traditional
[Figure: bar chart of Speedup (x), axis from 0 to 3.5, comparing ICC + Auto Vectorize, ICC + Macro SIMD, and ICC + Macro SIMD + Autovectorize on AudioBeam, BeamFormer, DCT, FFT, FM Radio, Matrix Multiply, Matrix Multiply Block, Bitonic Sort, FilterBank, MP3 Decoder, and Average]
Benefits of SAGU
[Figure: bar chart of % Improvement, axis from 0 to 25, on AudioBeam, BeamFormer, DCT, FFT, FM Radio, Matrix Multiply, Matrix Multiply Block, Bitonic Sort, FilterBank, MP3 Decoder, and Average]
Conclusion
• Streaming is prevalent in all computing domains.
• Applying traditional SIMDization on streaming applications fails to utilize SIMD engines.
• Macro-SIMDization is done at a higher level.
• MacroSS outperforms traditional SIMDization techniques by 54%.
Questions and Comments
Macro-SIMDization vs. Traditional
[Figure: bar chart of Speedup (x), axis from 0 to 4, comparing GCC + Auto Vectorize, GCC + Macro SIMD, and GCC + Macro SIMD + Autovectorize on AudioBeam, BeamFormer, DCT, FFT, FM Radio, Matrix Multiply, Matrix Multiply Block, Bitonic Sort, FilterBank, MP3 Decoder, and Average]
SAGU Implementation
• Area overhead less than 1% on Core i7.
• Critical path: two 16-bit adds and one 64-bit add.
• Minor ISA modifications are needed.
SIMD + Multi-core Scheduling
• How to schedule for a heterogeneous SIMD system?
• SIMDization reduces memory/bus traffic
• Exploit SIMD parallelism before Core-level parallelism.
• Is this the best we can do?
Multicore + Macro-SIMDization
[Figure: bar chart of Speedup (x), axis from 0 to 4.5, comparing 2 Cores, 4 Cores, 2 Cores + Macro SIMD, and 4 Cores + Macro SIMD on AudioBeam, BeamFormer, DCT, FFT, FM Radio, Matrix Multiply, Matrix Multiply Block, Bitonic Sort, FilterBank, MP3 Decoder, and Average]