Comparison of Multimedia SIMD, GPUs and Vector


Comparison of Multimedia SIMD, GPUs and Vector Architectures (Data Parallelism – Hennessy Section 4.4)

By: Harsh Prasad (2008CS50210)

05-Apr-12 CSL718

Introduction

● A common way to increase parallelism among instructions is to exploit data parallelism among independent iterations of a loop
● SIMD architectures can exploit significant data-level parallelism for:
○ matrix-oriented scientific computing
○ media-oriented image and sound processing
● SIMD is more energy-efficient than MIMD
● SIMD parallelism takes three forms:
○ Vector architectures
○ SIMD extensions
○ Graphics Processing Units (GPUs)
● These architectures are designed to execute data-level parallel programs


Vector Architectures

● Read sets of data elements from memory into “vector registers”
● Operate on those registers
● Disperse the results back into memory
● Example: VMIPS
● Improvements:
○ Multiple lanes
○ Gather-scatter memory addressing


Basic Structure of a Vector Register Architecture (Vector MIPS)

[Figure: VMIPS block diagram, with four labelled parts:
(1) vector registers, each holding MVL elements of 64 bits (MVL = maximum vector length);
(2) pipelined vector functional units;
(3) vector load-store units (LSUs) backed by a multi-banked memory for bandwidth and latency hiding;
(4) vector control registers — VLR (vector length register) and VM (vector mask register).]

SIMD Extensions

● Media applications operate on data types narrower than the native word size
● Limitations, compared to vector instructions:
○ Number of data operands is encoded into the opcode
○ No sophisticated addressing modes (strided, scatter-gather)
○ No mask registers


Graphics Processing Units

● Offer higher potential performance than traditional multicore computers
● Heterogeneous execution model:
○ the CPU is the host, the GPU is the device
● NVIDIA developed a C-like programming language for the GPU
● All forms of GPU parallelism are unified as the CUDA (Compute Unified Device Architecture) thread
● The programming model is “Single Instruction, Multiple Thread” (SIMT)


Comparison: Vector Architectures and GPUs


● A GPU has many more lanes, so GPU chimes are shorter
● In a vector architecture, the compiler manages the mask register explicitly in software
● A GPU handles masks implicitly, using branch synchronization markers and an internal stack to save, complement and restore them

A Vector Processor vs. a Multithreaded SIMD Processor of a GPU


● The scalar processor supplies scalar operands for scalar-vector operations and increments addresses for unit- and non-unit-stride accesses to memory
● There is one PC per SIMD thread
● High memory bandwidth is ensured


● GPUs have hardware support for multithreading
● A VMIPS register holds the entire vector
● In a GPU, the vector is spread across the registers of the SIMD lanes


● Vector architectures hide memory latency by paying it once per vector load/store instruction; GPUs hide it using multithreading
● The conditional-branch mechanism of GPUs handles the strip-mining problem of vector architectures by iterating the loop until all the SIMD lanes reach the loop bound

Comparison: Multimedia SIMD Computers and GPUs


● Unlike multimedia SIMD instructions, which execute on the scalar processor itself, a GPU is separated from the scalar processor by an I/O bus and has its own separate main memory
● Also, multimedia SIMD instructions do not support scatter-gather memory accesses
● In short, GPUs are multithreaded SIMD processors with more lanes, more processors, and better hardware support for multithreading


Thank You
