Comparison of Multimedia SIMD, GPUs and Vector Architectures (Data Parallelism – Hennessy Section 4.4)
By Harsh Prasad (2008CS50210)
05-Apr-12, CSL718
Introduction
● A common way to increase parallelism among instructions is to exploit data parallelism among independent iterations of a loop
● SIMD architectures can exploit significant data-level parallelism for:
○ matrix-oriented scientific computing
○ media-oriented image and sound processing
● SIMD is more energy efficient than MIMD
● Three variants of SIMD parallelism:
○ Vector architectures
○ SIMD extensions
○ Graphics Processing Units (GPUs)
● All three architectures are designed to execute data-level parallel programs
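The independent loop iterations mentioned above can be made concrete with the classic DAXPY loop (y = a·x + y): every iteration touches different elements, so a SIMD or vector machine can execute many at once. A minimal, purely illustrative Python sketch:

```python
def daxpy(a, x, y):
    # Scalar loop: one element per iteration. Each iteration is
    # independent, which is exactly the property SIMD hardware exploits
    # by executing many of them with one instruction.
    return [a * xi + yi for xi, yi in zip(x, y)]

print(daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```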
Vector Architectures
● Read sets of data elements into “vector registers”
● Operate on those registers
● Disperse the results back into memory
● Example: VMIPS
● Improvements:
○ multiple lanes
○ gather-scatter memory addressing
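Gather-scatter addressing lets a vector machine handle sparse data: an index vector selects arbitrary memory locations for a vector load or store. A hedged sketch in plain Python (function names are illustrative, not VMIPS mnemonics, though VMIPS provides indexed load/store instructions for this purpose):

```python
def gather(memory, indices):
    # Indexed vector load: fetch memory[indices[i]] into a vector register.
    return [memory[i] for i in indices]

def scatter(memory, indices, values):
    # Indexed vector store: write values[i] to memory[indices[i]].
    for i, v in zip(indices, values):
        memory[i] = v

mem = [0, 10, 20, 30, 40, 50]
vreg = gather(mem, [5, 1, 3])    # -> [50, 10, 30]
scatter(mem, [0, 2], [99, 88])   # mem becomes [99, 10, 88, 30, 40, 50]
```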
[Figure: Basic structure of a vector register architecture (Vector MIPS). Components: (1) vector registers, each holding MVL elements of 64 bits, where MVL is the maximum vector length; (2) pipelined vector functional units; (3) vector load-store units (LSUs) backed by multi-banked memory for bandwidth and latency hiding; (4) vector control registers, including VLR (vector length register) and VM (vector mask register).]
SIMD Extensions
● Media applications operate on data types narrower than the native word size
● Limitations compared to vector instructions:
○ the number of data operands is encoded into the opcode
○ no sophisticated addressing modes (strided, scatter-gather)
○ no mask registers
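The core multimedia-SIMD idea is to pack several narrow operands (e.g. four 8-bit values) into one native-width word and operate on all of them with a single instruction, with no carries crossing lane boundaries. A purely illustrative Python simulation (real hardware would use extensions such as SSE or NEON):

```python
def packed_add8(a, b):
    # Add four 8-bit lanes held in one 32-bit word. Each lane wraps
    # independently; no carry propagates between lanes.
    result = 0
    for lane in range(4):
        shift = 8 * lane
        la = (a >> shift) & 0xFF
        lb = (b >> shift) & 0xFF
        result |= ((la + lb) & 0xFF) << shift
    return result

# Each byte is added independently:
print(hex(packed_add8(0x01020304, 0x10203040)))  # 0x11223344
```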
Graphics Processing Units
● Offer higher potential performance than traditional multicore computers
● Heterogeneous execution model:
○ the CPU is the host, the GPU is the device
● NVIDIA developed a C-like programming language for the GPU
● All forms of GPU parallelism are unified as the CUDA (Compute Unified Device Architecture) thread
● The programming model is “Single Instruction, Multiple Thread” (SIMT)
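In the SIMT model, each CUDA thread computes its own global index from its block and thread IDs and processes one element. A Python simulation of that structure (in CUDA C this would be a `__global__` kernel indexing with `blockIdx.x * blockDim.x + threadIdx.x`):

```python
def simt_add(x, y, block_dim, grid_dim):
    # Simulate a grid of thread blocks, each with block_dim threads,
    # performing element-wise vector addition.
    n = len(x)
    out = [0] * n
    for block_idx in range(grid_dim):           # blocks run independently
        for thread_idx in range(block_dim):     # threads within a block
            i = block_idx * block_dim + thread_idx  # global element index
            if i < n:                           # guard against overrun
                out[i] = x[i] + y[i]
    return out

print(simt_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], block_dim=2, grid_dim=3))
# [11, 22, 33, 44, 55]
```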
Comparison: Vector Architectures and GPUs
● A GPU has many more lanes, so GPU chimes are shorter
● Mask registers: the vector compiler manages them explicitly in software; the GPU handles them implicitly, using branch synchronization markers and an internal stack to save, complement, and restore masks
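Both styles of mask handling come down to the same effect: instead of branching per element, the hardware computes a mask and applies each side of the conditional only where its mask bit is set, complementing the mask for the else-path. A purely illustrative sketch, not real ISA behavior:

```python
def masked_where(cond, then_vals, else_vals):
    # Vector mask register: one bit per element.
    mask = list(cond)
    # If-path: write then_vals only where the mask is set.
    out = [t if m else 0 for m, t in zip(mask, then_vals)]
    # Else-path: complement the mask and write else_vals there.
    inv = [not m for m in mask]
    out = [e if m else o for m, e, o in zip(inv, else_vals, out)]
    return out

# Element-wise: x[i] = a[i] if a[i] > 0 else b[i]
a = [3, -1, 4, -2]
b = [7, 8, 9, 6]
print(masked_where([ai > 0 for ai in a], a, b))  # [3, 8, 4, 6]
```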
[Figure: a vector processor and a multithreaded SIMD processor of a GPU, side by side]
● The scalar processor supplies scalar operands for scalar-vector operations and increments addresses for unit-stride and non-unit-stride memory accesses
● The GPU has one PC per SIMD thread and hardware support for multithreading
● Multi-banked memory ensures high memory bandwidth
● A VMIPS register holds the entire vector; on a GPU, the vector is spread across the registers of the SIMD lanes
● Memory latency is hidden by paying the latency once per load/store instruction in a vector architecture; a GPU hides it using multithreading.
● The conditional-branch mechanism of a GPU handles the strip-mining problem of vector architectures by iterating the loop until all the SIMD lanes reach the loop bound.
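Strip-mining, for reference, is how a vector machine processes a loop longer than its registers: it iterates in chunks of at most MVL elements, setting the vector-length register (VLR) for the final partial strip. An illustrative sketch with an assumed MVL of 4:

```python
MVL = 4  # maximum vector length (illustrative value)

def strip_mined_scale(a, x):
    # Process x in strips of at most MVL elements, as a vector
    # machine would; the inner loop stands in for one vector instruction.
    n = len(x)
    out = [0.0] * n
    start = 0
    while start < n:
        vl = min(MVL, n - start)            # set VLR for this strip
        for i in range(start, start + vl):  # one "vector operation"
            out[i] = a * x[i]
        start += vl
    return out

print(strip_mined_scale(3.0, [1, 2, 3, 4, 5, 6, 7]))
# [3.0, 6.0, 9.0, 12.0, 15.0, 18.0, 21.0]
```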
Comparison: Multimedia SIMD Computers and GPUs
● In a GPU, the scalar processor and the multimedia SIMD instructions are separated by an I/O bus, with separate main memories.
● Also, multimedia SIMD instructions do not support scatter-gather memory accesses.
● In short, GPUs are multithreaded SIMD processors with more lanes, more processors, and better hardware support for multithreading.
Thank You