8/2/2019 5.1 GPU Architecture
GPU Architecture
CS 6823-003 Spring12 @ ASU
CPU & APU Architecture
CPU Computing
CPU performance is the product of many related advances:
Increased transistor density
Increased transistor performance
Wider data paths
Pipelining
Superscalar execution
Speculative execution
Caching
Chip- and system-level integration
Bandwidth Gravity of Modern Computer Systems
The bandwidth between key components ultimately dictates system performance.
This is especially true for massively parallel systems processing massive amounts of data.
Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases.
Ultimately, performance falls back to what the speeds and feeds dictate.
Components must perform well themselves
and cooperate well too.
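The "speeds and feeds" bound can be made concrete with a little arithmetic. As a minimal sketch (the link speed below is a hypothetical round number, not a measured figure), a streaming kernel can never finish faster than the bytes it moves divided by the available bandwidth:

```c
/* Lower bound on runtime for a bandwidth-bound kernel: a vector
 * copy of n doubles moves 16 bytes per element (8 read + 8
 * written), so at bw bytes/s it needs at least bytes/bw seconds,
 * no matter how fast the cores are or how cleverly they reorder. */
double copy_lower_bound_s(long n_elems, double bw_bytes_per_s) {
    double bytes_moved = 16.0 * (double)n_elems; /* 8 B in + 8 B out */
    return bytes_moved / bw_bytes_per_s;
}
```

Copying 100 million doubles over an 8 GB/s path therefore takes at least 1.6 GB / 8 GB/s = 0.2 s, however aggressively the CPU buffers or caches.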
Superscalar Execution
Superscalar and, by extension, out-of-order execution is one solution that has been included on CPUs for a long time.
Out-of-order scheduling logic requires a substantial area of the CPU die to maintain dependence information and queues of instructions to deal with dynamic schedules throughout the hardware.
The speculative instruction execution necessary to expand the window of out-of-order instructions executing in parallel results in inefficient execution of throwaway work.
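The dependence information the scheduler tracks determines how much parallelism it can find. A minimal sketch of the difference, using two equivalent reductions:

```c
/* A serial dependence chain: every add needs the previous sum,
 * so even a wide out-of-order core can complete only one add of
 * this chain per iteration. */
long chained_sum(const long *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];                  /* s depends on the previous s */
    return s;
}

/* The same reduction with four independent accumulators: the four
 * adds in each iteration carry no dependences on one another, so
 * superscalar/out-of-order hardware can issue them in parallel.
 * Assumes n is a multiple of 4 for brevity. */
long unrolled_sum(const long *a, int n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];     s1 += a[i + 1];
        s2 += a[i + 2]; s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```

Both return the same value; the second version simply exposes instruction-level parallelism that the first hides behind a dependence chain.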
Out-of-order Execution
VLIW
VLIW is a heavily compiler-dependent method for increasing instruction-level parallelism in a processor. VLIW moves the dependence analysis work into the compiler.
SIMD and Vector Processing
SIMD, and its generalization in the vector parallelism approach, improves efficiency by performing the same operation on multiple data elements.
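The pattern looks like this in ordinary code. A sketch, not tied to any particular ISA: with optimization enabled, compilers such as GCC and Clang turn this loop into packed SSE/AVX instructions that process several floats at once.

```c
/* The SIMD pattern: one operation applied to many data elements.
 * restrict promises the compiler the arrays do not overlap, which
 * is what makes auto-vectorization of this loop legal. */
void saxpy(float *restrict y, const float *restrict x,
           float a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];  /* same op on element i of each array */
}
```

One scalar instruction stream thus drives 4- or 8-wide lanes of data, which is exactly the efficiency argument made above: control logic is amortized over many elements.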
Multithreading
Two Threads Scheduled in Time Slice Fashion
Multi-Core Architectures
A large number of threads interleave execution to keep the device busy, whereas each individual thread takes longer to execute than the theoretical minimum.
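How many resident threads are enough? A simplified model with hypothetical round numbers (not tied to any specific device): if each thread computes for c cycles and then stalls for l cycles waiting on memory, the hardware stays fully busy when enough other threads are resident to fill the stall window.

```c
/* Latency hiding by oversubscription: 1 running thread plus
 * ceil(l / c) more to cover its l-cycle stall, assuming the
 * scheduler can switch threads at no cost (an idealization). */
int threads_to_hide_latency(int compute_cycles, int stall_cycles) {
    return 1 + (stall_cycles + compute_cycles - 1) / compute_cycles;
}
```

With 4 compute cycles per 400-cycle memory stall, about 101 resident threads keep the unit busy; each thread finishes later than it would running alone, but aggregate throughput is maximized.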
AMD Bobcat and Bulldozer CPUs
Bobcat (left) follows a traditional approach, mapping functional units to cores, in a low-power design.
Bulldozer (right) combines two cores within a module, offering sharing of some functional units.
The AMD Radeon HD6970 GPU
The device is divided into two halves, where instruction control (scheduling and dispatch) is performed by the wave scheduler for each half.
The 24 16-lane SIMD cores execute four-way VLIW instructions on each SIMD lane and contain private level 1 caches and local data shares (LDS).
CPU and GPU Architectures
Niagara 2 CPU from Sun/Oracle
Relatively similar to the GPU design
E350 "Zacate" AMD APU
Two "Bobcat" low-power x86 cores
Two 8-wide SIMD cores with five-way VLIW units
Connected via a shared bus and a single interface to DRAM
Intel Sandy Bridge
Intel combines four Sandy Bridge x86 cores with an improved version of its embedded graphics processor. The concept is the same as the devices in that category from AMD.
Intel's Core i7 Processor
Four CPU cores with simultaneous multithreading
Made with 45nm process technology
Each chip has 731 million transistors and consumes up to 130 W of thermal design power
Intel's Nehalem
Four-wide superscalar
Out-of-order, speculative execution
Simultaneous multithreading
Multiple branch predictors
On-die power gating
On-die memory controllers
Large caches
Multiple interprocessor interconnects
Today's Intel PC Architecture: Single-Core System
FSB connection between processor and Northbridge (82925X), the Memory Control Hub.
Northbridge handles primary PCIe to video/GPU and DRAM.
PCIe x16 bandwidth at 8 GB/s (4 GB/s each direction).
Southbridge (ICH6RW) handles other peripherals.
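The 8 GB/s figure can be derived from the lane rate. Assuming first-generation PCIe (the generation this chipset era used), each lane signals at 2.5 GT/s with 8b/10b encoding, leaving 2.0 Gb/s = 0.25 GB/s of payload per lane per direction:

```c
/* Gen1 PCIe payload bandwidth: 2.5 GT/s per lane, 8b/10b encoding
 * (8 payload bits per 10 transferred bits), 8 bits per byte. */
double pcie_gen1_GBps(int lanes, int directions) {
    double per_lane = 2.5 * (8.0 / 10.0) / 8.0; /* 0.25 GB/s */
    return per_lane * lanes * directions;
}
/* x16 link: 4.0 GB/s each direction, 8.0 GB/s aggregate */
```

The slide's "8 GB/s" is thus the two-direction aggregate of a x16 Gen1 link, not the bandwidth available to a one-way transfer.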
Today's Intel PC Architecture: Dual-Core System
Bensley platform: the Blackford Memory Control Hub (MCH) is now a PCIe switch that integrates NB/SB functions.
FBD (Fully Buffered DIMMs) allow simultaneous R/W transfers at 10.5 GB/s per DIMM.
PCIe links form the backbone; PCIe device upstream bandwidth is now equal to downstream.
The workstation version has a x16 GPU link via the Greencreek MCH.
Two CPU sockets, with a Dual Independent Bus to the CPUs; each bus is basically an FSB (Front-Side Bus) feeding the CPU at 8.5-10.5 GB/s per socket, compared to the current Front-Side Bus feed of 6.4 GB/s.
PCIe bridges to legacy I/O devices.
Source: http://www.2cpu.com/review.php?id=109
Today's AMD PC Architecture
The AMD HyperTransport Technology bus replaces the Front-Side Bus architecture.
HyperTransport similarities to PCIe:
Packet-based, switching network
Dedicated links for both directions
Shown in a 4-socket configuration, 8 GB/sec per link.
The Northbridge/HyperTransport is on die.
Glueless logic to DDR, DDR2 memory.
PCI-X/PCIe bridges (usually implemented in the Southbridge).
Today's AMD PC Architecture
Torrenza technology
Allows licensing of coherent HyperTransport to 3rd-party manufacturers to make socket-compatible accelerators/co-processors.
Allows 3rd-party PPUs (Physics Processing Units), GPUs, and co-processors to access main system memory directly and coherently.
Could make the accelerator programming model easier to use than, say, the Cell processor, where each SPE cannot directly access main memory.
HyperTransport Feeds and Speeds
Primarily a low-latency direct chip-to-chip interconnect; supports mapping to board-to-board interconnects such as PCIe.
HyperTransport 1.0 Specification: 800 MHz max, 12.8 GB/s aggregate bandwidth (6.4 GB/s each way).
HyperTransport 2.0 Specification: added PCIe mapping; 1.0-1.4 GHz clock, 22.4 GB/s aggregate bandwidth (11.2 GB/s each way).
HyperTransport 3.0 Specification: 1.8-2.6 GHz clock, 41.6 GB/s aggregate bandwidth (20.8 GB/s each way); added AC coupling to extend HyperTransport to long-distance system-to-system interconnect.
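The per-direction figures above follow from the link arithmetic. HyperTransport clocks data on both edges (DDR), so each direction moves clock rate times two transfers times the link width in bytes; the 32-bit (4-byte) maximum link width is assumed here, as the quoted numbers imply:

```c
/* HyperTransport per-direction bandwidth, assuming the 32-bit
 * maximum link width: clock * 2 (double data rate) * 4 bytes. */
double ht_GBps_each_way(double clock_GHz) {
    return clock_GHz * 2.0 * 4.0;
}
/* 0.8 GHz -> 6.4 GB/s (HT 1.0); 2.6 GHz -> 20.8 GB/s (HT 3.0) */
```

Aggregate bandwidth is simply twice the per-direction value, since the two directions use dedicated links.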
Courtesy HyperTransport Consortium. Source: White Paper: AMD HyperTransport Technology-Based System Architecture