![Page 1: Bottlenecks of SIMD Haibin Wang Wei tong. Paper Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements One IEEE.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649f155503460f94c29c47/html5/thumbnails/1.jpg)
Bottlenecks of SIMD
Haibin Wang
Wei Tong
![Page 2: Bottlenecks of SIMD Haibin Wang Wei tong. Paper Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements One IEEE.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649f155503460f94c29c47/html5/thumbnails/2.jpg)
Paper
"Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements"
IEEE Transactions on Computers, vol. 52, no. 8, August 2003
Deepu Talla, Member, IEEE, Lizy Kurian John, Senior Member, IEEE, and Doug Burger, Member, IEEE
![Page 3: Bottlenecks of SIMD Haibin Wang Wei tong. Paper Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements One IEEE.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649f155503460f94c29c47/html5/thumbnails/3.jpg)
Outline
Introduction
Bottlenecks Analysis
MediaBreeze Architecture
Summary
![Page 4: Bottlenecks of SIMD Haibin Wang Wei tong. Paper Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements One IEEE.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649f155503460f94c29c47/html5/thumbnails/4.jpg)
Introduction
Multimedia SIMD extensions are widely used to speed up media processing, but they are not used very efficiently.
75 to 85 percent of the dynamic instructions in the processor instruction stream are supporting instructions rather than core computation.
![Page 5: Bottlenecks of SIMD Haibin Wang Wei tong. Paper Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements One IEEE.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649f155503460f94c29c47/html5/thumbnails/5.jpg)
Introduction
The bottlenecks are caused by the loop structure and the memory access patterns of media programs.
So instead of exploiting more data-level parallelism, the paper focuses on improving the efficiency of the instructions supporting the core computation.
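The reasoning behind this focus can be illustrated with an Amdahl-style estimate. This is only a rough sketch: it assumes every dynamic instruction costs the same, which a real superscalar processor does not guarantee, and the function name `speedup_bound` and the 8x core speedup are illustrative choices, not figures from the paper.

```python
def speedup_bound(overhead_fraction: float, core_speedup: float) -> float:
    """Amdahl-style bound: only the core computation gets faster.

    overhead_fraction: share of dynamic instructions that are supporting.
    core_speedup: factor by which the core computation is accelerated.
    """
    core_fraction = 1.0 - overhead_fraction
    return 1.0 / (overhead_fraction + core_fraction / core_speedup)


# Overhead fractions from the slide: 75 and 85 percent.
for f in (0.75, 0.85):
    with_simd = speedup_bound(f, 8.0)   # assumed 8x faster core computation
    ceiling = 1.0 / f                   # limit for an infinitely fast core
    print(f"overhead={f:.2f} speedup={with_simd:.3f} ceiling={ceiling:.3f}")
```

Under these assumptions the overall gain stays below about 1.33x at 75 percent overhead and below about 1.18x at 85 percent, no matter how much data-level parallelism is added to the core computation.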
![Page 6: Bottlenecks of SIMD Haibin Wang Wei tong. Paper Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements One IEEE.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649f155503460f94c29c47/html5/thumbnails/6.jpg)
Introduction
This paper makes two major contributions:
First, it targets the supporting instructions, rather than the core computation, as the way to improve SIMD performance, which is a new angle.
Second, it proposes the MediaBreeze architecture as a way to reduce or eliminate supporting instructions.
![Page 7: Bottlenecks of SIMD Haibin Wang Wei tong. Paper Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements One IEEE.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649f155503460f94c29c47/html5/thumbnails/7.jpg)
Nested Loop
![Page 8: Bottlenecks of SIMD Haibin Wang Wei tong. Paper Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements One IEEE.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649f155503460f94c29c47/html5/thumbnails/8.jpg)
The analysis of loop architecture
The sub-blocks are very small, so the available DLP is limited and many supporting instructions are needed for each one.
Every block is processed by five nested loops, which waste a lot of time on branches.
The data must be reorganized before SIMD can be applied.
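A small counting model makes the branch overhead concrete. The loop nest below is a hypothetical stand-in for the one on the slide (the original figure is not reproduced here): it assumes a 16x16 image, 8x8 blocks, one loop-closing branch per iteration of every loop, and one core operation per innermost iteration.

```python
def count_loop_work(height, width, block=8, simd_width=1):
    """Count loop-closing branches and core operations for a blocked loop nest.

    Hypothetical model: one branch per iteration of every loop, and one core
    (multiply-accumulate) operation per innermost iteration.
    """
    branches = 0
    core_ops = 0
    for by in range(0, height, block):              # loop 1: block rows
        for bx in range(0, width, block):           # loop 2: block columns
            for i in range(block):                  # loop 3: rows in the block
                for j in range(block):              # loop 4: columns in the block
                    for k in range(0, block, simd_width):  # loop 5: inner product
                        core_ops += 1               # one scalar or SIMD operation
                        branches += 1
                    branches += 1
                branches += 1
            branches += 1
        branches += 1
    return branches, core_ops


scalar = count_loop_work(16, 16)                 # scalar inner loop
simd = count_loop_work(16, 16, simd_width=8)     # inner loop done by one 8-way op
print(scalar, simd)
```

In this model, 8-way SIMD cuts the core operations from 2048 to 256, but 550 branches remain, so branches end up outnumbering the SIMD operations.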
![Page 9: Bottlenecks of SIMD Haibin Wang Wei tong. Paper Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements One IEEE.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649f155503460f94c29c47/html5/thumbnails/9.jpg)
Access patterns
![Page 10: Bottlenecks of SIMD Haibin Wang Wei tong. Paper Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements One IEEE.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649f155503460f94c29c47/html5/thumbnails/10.jpg)
Access patterns
The addressing sequences are complex and make up a large part of the program, so many supporting instructions are needed to generate them.
Using general-purpose instruction sets to generate multiple addressing sequences is not very efficient.
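As an illustration of why such sequences are costly, the sketch below generates the addresses of one block inside a row-major image. The layout, the function name `block_addresses`, and its parameters are assumptions made for this example, not the access pattern from the slide's figure.

```python
def block_addresses(base, row_stride, bx, by, block=8, elem_size=1):
    """Byte addresses of the block at block coordinates (bx, by).

    Assumes a row-major image whose rows are row_stride elements apart.
    """
    addrs = []
    for i in range(block):
        # Each new row needs a multiply and several adds before any load.
        row_start = base + ((by * block + i) * row_stride + bx * block) * elem_size
        for j in range(block):
            addrs.append(row_start + j * elem_size)
    return addrs


# A 2x2 block in the second block column of a 16-element-wide image.
print(block_addresses(base=0, row_stride=16, bx=1, by=0, block=2))
```

The resulting sequence is not a single constant stride: it jumps at every row boundary, and on a general-purpose instruction set each of those jumps is computed with explicit arithmetic instructions, separately for every data stream.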
![Page 11: Bottlenecks of SIMD Haibin Wang Wei tong. Paper Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements One IEEE.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649f155503460f94c29c47/html5/thumbnails/11.jpg)
The overhead instructions
Address generation: address calculation
Address transformation: data movement, data reorganization
Loads and stores: memory
Branches: control transfer, for-loop
![Page 12: Bottlenecks of SIMD Haibin Wang Wei tong. Paper Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements One IEEE.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649f155503460f94c29c47/html5/thumbnails/12.jpg)
![Page 13: Bottlenecks of SIMD Haibin Wang Wei tong. Paper Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements One IEEE.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649f155503460f94c29c47/html5/thumbnails/13.jpg)
Architecture
![Page 14: Bottlenecks of SIMD Haibin Wang Wei tong. Paper Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements One IEEE.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649f155503460f94c29c47/html5/thumbnails/14.jpg)
Instruction Structure
![Page 15: Bottlenecks of SIMD Haibin Wang Wei tong. Paper Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements One IEEE.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649f155503460f94c29c47/html5/thumbnails/15.jpg)
Breeze Instruction Mapping of 1D-DCT
![Page 16: Bottlenecks of SIMD Haibin Wang Wei tong. Paper Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements One IEEE.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649f155503460f94c29c47/html5/thumbnails/16.jpg)
Full Map
five branches,
three loads and one store,
four address value generations (one on each stream, with each address generation representing multiple RISC instructions),
one SIMD operation (2-way to 16-way parallelism depending on each data element size),
one accumulation of SIMD result and one SIMD reduction operation,
four SIMD data reorganization (pack/unpack, permute, etc.) operations, and
shifting and saturation of SIMD results.
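The list can be tallied with a small record type. This is only a bookkeeping sketch: the class and field names are hypothetical, it is not the bit-level format from the Instruction Structure slide, and shifting and saturation are left out because the slide gives no count for them.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BreezeCapacity:
    """Operation counts listed on the slide (field names are hypothetical)."""
    branches: int = 5
    loads: int = 3
    stores: int = 1
    address_generations: int = 4   # each stands for multiple RISC instructions
    simd_operations: int = 1
    accumulations: int = 1
    reductions: int = 1
    data_reorganizations: int = 4

    def total_operations(self) -> int:
        # Sum every counted field; shift and saturate are not included.
        return sum(getattr(self, f.name) for f in fields(self))


print(BreezeCapacity().total_operations())
```

The total of 20 is a lower bound on the equivalent RISC instruction count, since each address generation represents several RISC instructions.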
![Page 17: Bottlenecks of SIMD Haibin Wang Wei tong. Paper Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements One IEEE.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649f155503460f94c29c47/html5/thumbnails/17.jpg)
Performance Evaluation
cfa, dct, motest, scale, G711, decrypt, Aud, jpeg, ijpeg
![Page 18: Bottlenecks of SIMD Haibin Wang Wei tong. Paper Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements One IEEE.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649f155503460f94c29c47/html5/thumbnails/18.jpg)
Any improvement?
Why not higher efficiency in cfa?
Memory latency! Solution?
Prefetch!
![Page 19: Bottlenecks of SIMD Haibin Wang Wei tong. Paper Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements One IEEE.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649f155503460f94c29c47/html5/thumbnails/19.jpg)
Evaluation
Advantages:
Eliminates or reduces overhead instructions.
Much better than a normal SIMD extension.
Costs 0.3% of processor area and less than 1% of total power consumption.
Drawbacks:
Complicated instructions.
Who will design a compiler for this?