The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and...
![Page 1: The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.](https://reader036.fdocuments.us/reader036/viewer/2022062413/5a4d1b8f7f8b9ab0599c0acf/html5/thumbnails/1.jpg)
The Imagine Stream Processor
Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany
Presenter: Lu Hao
![Page 2: The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.](https://reader036.fdocuments.us/reader036/viewer/2022062413/5a4d1b8f7f8b9ab0599c0acf/html5/thumbnails/2.jpg)
Contents
Stream processor
Imagine Architecture
Example: FFT application
Experimental result
Conclusion
![Page 3: The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.](https://reader036.fdocuments.us/reader036/viewer/2022062413/5a4d1b8f7f8b9ab0599c0acf/html5/thumbnails/3.jpg)
Motivation of stream processor
Media-processing applications, such as 3-D polygon rendering and MPEG-2 encoding, are becoming an increasingly dominant portion of computing workloads today.
Properties of media-processing applications:
Real-time performance constraints and high arithmetic intensity require parallel solutions
Inherently contain a large amount of data parallelism
Providing large numbers of ALUs to operate on data in parallel is relatively inexpensive.
Current programmable solutions cannot scale to support this many ALUs.
Both providing instructions and transferring data at the necessary rates are problematic. For example, a 48-ALU single-chip processor must issue up to 48 instructions/cycle and provide up to 144 words/cycle of data bandwidth to operate at peak rate.
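The peak-rate figures follow from simple arithmetic. The short sketch below reproduces them, assuming each ALU operation reads two operand words and writes one result word. That three-words-per-operation figure is an assumption consistent with 144/48; the slide does not state it, and the function name is hypothetical.

```python
# Assumption (not stated on the slide): every ALU operation moves three words,
# two source operands in and one result out.
WORDS_PER_OP = 3


def peak_requirements(num_alus, words_per_op=WORDS_PER_OP):
    """Return (instructions/cycle, words/cycle) needed to keep every ALU busy."""
    instructions_per_cycle = num_alus            # one operation per ALU per cycle
    words_per_cycle = num_alus * words_per_op    # operand and result traffic
    return instructions_per_cycle, words_per_cycle


instructions, words = peak_requirements(48)
print(instructions, words)  # 48 144
```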
![Page 4: The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.](https://reader036.fdocuments.us/reader036/viewer/2022062413/5a4d1b8f7f8b9ab0599c0acf/html5/thumbnails/4.jpg)
What is a stream processor
Usually SIMD
Allows some applications to more easily exploit a limited form of parallel processing
Uses the stream programming model to expose parallelism as well as producer-consumer locality
Can use multiple computational units
![Page 5: The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.](https://reader036.fdocuments.us/reader036/viewer/2022062413/5a4d1b8f7f8b9ab0599c0acf/html5/thumbnails/5.jpg)
The Imagine Processor
Imagine is a programmable stream processor and is a hardware implementation of the stream model.
Imagine is designed to be a stream coprocessor for a general purpose processor that acts as the host.
The programming model organizes the computation in an application into a sequence of arithmetic kernels, and organizes the data-flow into a series of data streams.
On a variety of realistic applications, Imagine can sustain up to 50 instructions per cycle, and up to 15 GOPS of arithmetic bandwidth.
Load-store architecture for streams (SRF)
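As a rough illustration of the programming model described on this slide, the sketch below treats a stream as a sequence of records and a kernel as a function that consumes one stream and produces another. The kernel names `scale` and `offset` and the helper `run_pipeline` are hypothetical. This is plain Python, not Imagine's actual programming interface.

```python
def scale(stream, factor=2.0):
    """Hypothetical kernel: multiply every stream element by a constant."""
    for x in stream:
        yield x * factor


def offset(stream, delta=1.0):
    """Hypothetical kernel: add a constant to every stream element."""
    for x in stream:
        yield x + delta


def run_pipeline(stream, kernels):
    """Chain kernels so each one consumes the stream produced by the previous one."""
    for kernel in kernels:
        stream = kernel(stream)
    return list(stream)


result = run_pipeline([1.0, 2.0, 3.0], [scale, offset])
print(result)  # [3.0, 5.0, 7.0]
```

Each kernel consumes what the previous kernel produced, which is the producer-consumer relationship the stream model is meant to expose.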
![Page 6: The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.](https://reader036.fdocuments.us/reader036/viewer/2022062413/5a4d1b8f7f8b9ab0599c0acf/html5/thumbnails/6.jpg)
Contents
Stream processor
Imagine Architecture
Example: FFT application
Experimental result
Conclusion
![Page 7: The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.](https://reader036.fdocuments.us/reader036/viewer/2022062413/5a4d1b8f7f8b9ab0599c0acf/html5/thumbnails/7.jpg)
Architecture of Imagine
32 KW stream register file (SRF)
The microcontroller keeps track of the program counter as it broadcasts each VLIW instruction to all eight clusters in a SIMD manner.
Each ALU cluster: six ALUs and 304 registers in several local register files (LRFs).
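The SIMD control scheme on this slide can be sketched as a toy simulation: a single program counter steps through the instruction sequence, and each instruction is applied to the data held by all eight clusters. The sketch simplifies heavily. Each "instruction" is modeled as one Python function rather than a VLIW word with a slot per ALU, and `run_simd` is a hypothetical name.

```python
NUM_CLUSTERS = 8        # from the slide
ALUS_PER_CLUSTER = 6    # from the slide


def run_simd(program, stream):
    """Apply one instruction sequence to a stream, eight elements at a time."""
    results = []
    for base in range(0, len(stream), NUM_CLUSTERS):
        # Each cluster holds one element of the current batch.
        cluster_data = stream[base:base + NUM_CLUSTERS]
        # A single program counter is shared by all clusters.
        for pc in range(len(program)):
            instruction = program[pc]
            # Broadcast: every cluster executes the same instruction on its own data.
            cluster_data = [instruction(value) for value in cluster_data]
        results.extend(cluster_data)
    return results


program = [lambda x: x + 1, lambda x: x * 3]
print(NUM_CLUSTERS * ALUS_PER_CLUSTER)       # 48
print(run_simd(program, list(range(10))))    # [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]
```

Eight clusters of six ALUs give the 48 ALUs used in the motivating example.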
![Page 8: The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.](https://reader036.fdocuments.us/reader036/viewer/2022062413/5a4d1b8f7f8b9ab0599c0acf/html5/thumbnails/8.jpg)
Architecture of Imagine
The SRF
![Page 9: The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.](https://reader036.fdocuments.us/reader036/viewer/2022062413/5a4d1b8f7f8b9ab0599c0acf/html5/thumbnails/9.jpg)
The SRF
Clusters <---> SRF: data that needs to be passed from kernel to kernel
SRF <---> DRAM: part of truly global data structures
All stream operands originate in the SRF and stream results are stored back to the SRF.
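A minimal sketch of this load-store discipline, under simplifying assumptions: DRAM and the SRF are modeled as dictionaries of named streams, kernels read and write only the SRF, and a counter tracks words moved to or from DRAM. The class and method names are hypothetical and capacity limits are ignored.

```python
class StreamMachine:
    """Toy model: kernels touch only the SRF; DRAM is reached by load and store."""

    def __init__(self, dram):
        self.dram = {name: list(words) for name, words in dram.items()}
        self.srf = {}
        self.dram_words = 0  # words transferred between DRAM and the SRF

    def load(self, name):
        """Copy a stream from DRAM into the SRF."""
        self.srf[name] = list(self.dram[name])
        self.dram_words += len(self.srf[name])

    def store(self, name):
        """Copy a stream from the SRF back to DRAM."""
        self.dram[name] = list(self.srf[name])
        self.dram_words += len(self.srf[name])

    def run_kernel(self, kernel, src, dst):
        """Read the operand stream from the SRF and write the result stream to the SRF."""
        self.srf[dst] = [kernel(x) for x in self.srf[src]]


machine = StreamMachine({"in": [1, 2, 3, 4]})
machine.load("in")
machine.run_kernel(lambda x: x * x, "in", "tmp")   # intermediate stream stays in the SRF
machine.run_kernel(lambda x: x + 1, "tmp", "out")
machine.store("out")
print(machine.dram["out"], machine.dram_words)  # [2, 5, 10, 17] 8
```

The intermediate stream `tmp` never reaches DRAM, so only the input load and the output store count toward memory traffic.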
![Page 10: The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.](https://reader036.fdocuments.us/reader036/viewer/2022062413/5a4d1b8f7f8b9ab0599c0acf/html5/thumbnails/10.jpg)
Irregular stream locality converted to reuse through memory
![Page 11: The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.](https://reader036.fdocuments.us/reader036/viewer/2022062413/5a4d1b8f7f8b9ab0599c0acf/html5/thumbnails/11.jpg)
Irregular producer-consumer locality captured at the SRF
![Page 12: The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.](https://reader036.fdocuments.us/reader036/viewer/2022062413/5a4d1b8f7f8b9ab0599c0acf/html5/thumbnails/12.jpg)
Data distribution
![Page 13: The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.](https://reader036.fdocuments.us/reader036/viewer/2022062413/5a4d1b8f7f8b9ab0599c0acf/html5/thumbnails/13.jpg)
Data distribution result
![Page 14: The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.](https://reader036.fdocuments.us/reader036/viewer/2022062413/5a4d1b8f7f8b9ab0599c0acf/html5/thumbnails/14.jpg)
Architecture of Imagine
The ALU cluster
![Page 15: The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.](https://reader036.fdocuments.us/reader036/viewer/2022062413/5a4d1b8f7f8b9ab0599c0acf/html5/thumbnails/15.jpg)
The ALU cluster
256 x 32-bit register file
![Page 16: The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.](https://reader036.fdocuments.us/reader036/viewer/2022062413/5a4d1b8f7f8b9ab0599c0acf/html5/thumbnails/16.jpg)
Contents
Stream processor
Imagine Architecture
Example: FFT application
Experimental result
Conclusion
![Page 17: The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.](https://reader036.fdocuments.us/reader036/viewer/2022062413/5a4d1b8f7f8b9ab0599c0acf/html5/thumbnails/17.jpg)
Example: mapping of a 1024-point radix-2 FFT to the stream model
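The mapping itself appears only in the slide image, so the sketch below is a generic textbook iterative radix-2 decimation-in-time FFT, not necessarily the mapping shown on the slide. It is arranged so that each butterfly stage acts like a kernel that consumes one stream and produces the next. A 1024-point transform needs log2(1024) = 10 such stages. All function names are hypothetical.

```python
import cmath


def bit_reverse_permute(x):
    """Reorder the input so the in-order butterfly stages yield natural-order output."""
    n = len(x)
    bits = n.bit_length() - 1  # n is assumed to be a power of two
    out = [0j] * n
    for i in range(n):
        rev = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[rev] = x[i]
    return out


def butterfly_stage(data, half):
    """One stage: combine elements `half` apart within blocks of size 2 * half."""
    out = list(data)
    for start in range(0, len(data), 2 * half):
        for k in range(half):
            twiddle = cmath.exp(-2j * cmath.pi * k / (2 * half))
            a = data[start + k]
            b = data[start + k + half] * twiddle
            out[start + k] = a + b
            out[start + k + half] = a - b
    return out


def fft_radix2(x):
    """Return (spectrum, number_of_stages) for a power-of-two-length input."""
    data = bit_reverse_permute(x)
    half, stages = 1, 0
    while half < len(data):
        data = butterfly_stage(data, half)  # each stage consumes the previous stream
        half *= 2
        stages += 1
    return data, stages


signal = [complex(i % 7, 0) for i in range(1024)]
spectrum, stages = fft_radix2(signal)
print(stages)  # 10
```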
![Page 18: The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.](https://reader036.fdocuments.us/reader036/viewer/2022062413/5a4d1b8f7f8b9ab0599c0acf/html5/thumbnails/18.jpg)
Contents
Stream processor
Imagine Architecture
Example: FFT application
Experimental result
Conclusion
![Page 19: The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.](https://reader036.fdocuments.us/reader036/viewer/2022062413/5a4d1b8f7f8b9ab0599c0acf/html5/thumbnails/19.jpg)
Experimental Result
Speedup of 8 clusters over 1 cluster
![Page 20: The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.](https://reader036.fdocuments.us/reader036/viewer/2022062413/5a4d1b8f7f8b9ab0599c0acf/html5/thumbnails/20.jpg)
Contents
Stream processor
Imagine Architecture
Example: FFT application
Experimental result
Conclusion
![Page 21: The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.](https://reader036.fdocuments.us/reader036/viewer/2022062413/5a4d1b8f7f8b9ab0599c0acf/html5/thumbnails/21.jpg)
Conclusion
Stream processors are suitable for media-processing applications
Imagine exploits the data-level parallelism (DLP) in streams by executing a kernel on eight successive stream elements in parallel (one on each cluster).
SRF
ALU clusters
Application example: 1024-point FFT
![Page 22: The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.](https://reader036.fdocuments.us/reader036/viewer/2022062413/5a4d1b8f7f8b9ab0599c0acf/html5/thumbnails/22.jpg)
Thanks!
Questions?