8/2/2019 5.1 GPU Architecture
GPU Architecture
CS 6823-003 Spring12 @ ASU
CPU & APU Architecture
CPU Computing
CPU performance is the product of many related advances:
Increased transistor density
Increased transistor performance
Wider data paths
Pipelining
Superscalar execution
Speculative execution
Caching
Chip- and system-level integration
Bandwidth Gravity of Modern Computer Systems
The bandwidth between key components ultimately dictates system performance.
This is especially true for massively parallel systems processing massive amounts of data.
Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases.
Ultimately, performance falls back to what the speeds and feeds dictate.
Components must perform well themselves
and cooperate well too.
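The "speeds and feeds" bound can be made concrete with a little arithmetic. As a minimal sketch (the link speed below is a hypothetical round number, not a measured figure), a streaming kernel can never finish faster than the bytes it moves divided by the available bandwidth:

```c
/* Lower bound on runtime for a bandwidth-bound kernel: a vector
 * copy of n doubles moves 16 bytes per element (8 read + 8
 * written), so at bw bytes/s it needs at least bytes/bw seconds,
 * no matter how fast the cores are or how cleverly they reorder. */
double copy_lower_bound_s(long n_elems, double bw_bytes_per_s) {
    double bytes_moved = 16.0 * (double)n_elems; /* 8 B in + 8 B out */
    return bytes_moved / bw_bytes_per_s;
}
```

Copying 100 million doubles over an 8 GB/s path therefore takes at least 1.6 GB / 8 GB/s = 0.2 s, however aggressively the CPU buffers or caches.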
Superscalar Execution
Superscalar and, by extension, out-of-order execution is one solution that has been included on CPUs for a long time.
Out-of-order scheduling logic requires a substantial area of the CPU die to maintain dependence information and queues of instructions to deal with dynamic schedules throughout the hardware.
The speculative instruction execution necessary to expand the window of out-of-order instructions executing in parallel results in inefficient execution of throwaway work.
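The dependence information the scheduler tracks determines how much parallelism it can find. A minimal sketch of the difference, using two equivalent reductions:

```c
/* A serial dependence chain: every add needs the previous sum,
 * so even a wide out-of-order core can complete only one add of
 * this chain per iteration. */
long chained_sum(const long *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];                  /* s depends on the previous s */
    return s;
}

/* The same reduction with four independent accumulators: the four
 * adds in each iteration carry no dependences on one another, so
 * superscalar/out-of-order hardware can issue them in parallel.
 * Assumes n is a multiple of 4 for brevity. */
long unrolled_sum(const long *a, int n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];     s1 += a[i + 1];
        s2 += a[i + 2]; s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```

Both return the same value; the second version simply exposes instruction-level parallelism that the first hides behind a dependence chain.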
Out-of-order Execution
VLIW
VLIW is a heavily compiler-dependent method for increasing instruction-level parallelism in a processor. VLIW moves the dependence analysis work into the compiler.
SIMD and Vector Processing
SIMD, and its generalization in the vector parallelism approach, improves efficiency by performing the same operation on multiple data elements.
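The pattern looks like this in ordinary code. A sketch, not tied to any particular ISA: with optimization enabled, compilers such as GCC and Clang turn this loop into packed SSE/AVX instructions that process several floats at once.

```c
/* The SIMD pattern: one operation applied to many data elements.
 * restrict promises the compiler the arrays do not overlap, which
 * is what makes auto-vectorization of this loop legal. */
void saxpy(float *restrict y, const float *restrict x,
           float a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];  /* same op on element i of each array */
}
```

One scalar instruction stream thus drives 4- or 8-wide lanes of data, which is exactly the efficiency argument made above: control logic is amortized over many elements.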
Multithreading
Two Threads Scheduled in Time Slice Fashion
Multi-Core Architectures
A large number of threads interleave execution to keep the device busy, whereas each individual thread takes longer to execute than the theoretical minimum.
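How many resident threads are enough? A simplified model with hypothetical round numbers (not tied to any specific device): if each thread computes for c cycles and then stalls for l cycles waiting on memory, the hardware stays fully busy when enough other threads are resident to fill the stall window.

```c
/* Latency hiding by oversubscription: 1 running thread plus
 * ceil(l / c) more to cover its l-cycle stall, assuming the
 * scheduler can switch threads at no cost (an idealization). */
int threads_to_hide_latency(int compute_cycles, int stall_cycles) {
    return 1 + (stall_cycles + compute_cycles - 1) / compute_cycles;
}
```

With 4 compute cycles per 400-cycle memory stall, about 101 resident threads keep the unit busy; each thread finishes later than it would running alone, but aggregate throughput is maximized.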
AMD Bobcat and Bulldozer CPUs
Bobcat (left) follows a traditional approach, mapping functional units to cores, in a low-power design.
Bulldozer (right) combines two cores within a module, offering sharing of some functional units.
The AMD Radeon HD6970 GPU
The device is divided into two halves, where instruction control (scheduling and dispatch) is performed by the wave scheduler for each half.
The 24 16-lane SIMD cores execute four-way VLIW instructions on each SIMD lane and contain private level 1 caches and local data shares (LDS).
CPU and GPU Architectures
Niagara 2 CPU from Sun/Oracle
Relatively similar to the GPU design
E350 "Zacate" AMD APU
Two "Bobcat" low-power x86 cores
Two 8-wide SIMD cores with five-way VLIW units
Connected via a shared bus and a single interface to DRAM
Intel Sandy Bridge
Intel combines four Sandy Bridge x86 cores with an improved version of its embedded graphics processor. The concept is the same as the devices in that category from AMD.
Intel's Core i7 Processor
Four CPU cores with simultaneous multithreading
Made with 45nm process technology
Each chip has 731 million transistors and consumes up to 130 W of thermal design power
Intel's Nehalem
Four-wide superscalar
Out-of-order, speculative execution
Simultaneous multithreading
Multiple branch predictors
On-die power gating
On-die memory controllers
Large caches
Multiple interprocessor interconnects
Today's Intel PC Architecture: Single-Core System
FSB connection between processor and Northbridge (82925X), the Memory Control Hub.
Northbridge handles primary PCIe to video/GPU and DRAM.
PCIe x16 bandwidth at 8 GB/s (4 GB/s each direction).
Southbridge (ICH6RW) handles other peripherals.
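The 8 GB/s figure can be derived from the lane rate. Assuming first-generation PCIe (the generation this chipset era used), each lane signals at 2.5 GT/s with 8b/10b encoding, leaving 2.0 Gb/s = 0.25 GB/s of payload per lane per direction:

```c
/* Gen1 PCIe payload bandwidth: 2.5 GT/s per lane, 8b/10b encoding
 * (8 payload bits per 10 transferred bits), 8 bits per byte. */
double pcie_gen1_GBps(int lanes, int directions) {
    double per_lane = 2.5 * (8.0 / 10.0) / 8.0; /* 0.25 GB/s */
    return per_lane * lanes * directions;
}
/* x16 link: 4.0 GB/s each direction, 8.0 GB/s aggregate */
```

The slide's "8 GB/s" is thus the two-direction aggregate of a x16 Gen1 link, not the bandwidth available to a one-way transfer.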
Today's Intel PC Architecture: Dual-Core System
Bensley platform: the Blackford Memory Control Hub (MCH) is now a PCIe switch that integrates NB/SB functions.
FBD (Fully Buffered DIMMs) allow simultaneous R/W transfers at 10.5 GB/s per DIMM.
PCIe links form the backbone; PCIe device upstream bandwidth is now equal to downstream.
The workstation version has a x16 GPU link via the Greencreek MCH.
Two CPU sockets, with a Dual Independent Bus to the CPUs; each bus is basically an FSB (Front-Side Bus) feeding the CPU at 8.5-10.5 GB/s per socket, compared to the current Front-Side Bus feed of 6.4 GB/s.
PCIe bridges to legacy I/O devices.
Source: http://www.2cpu.com/review.php?id=109
Today's AMD PC Architecture
The AMD HyperTransport Technology bus replaces the Front-Side Bus architecture.
HyperTransport similarities to PCIe:
Packet-based, switching network
Dedicated links for both directions
Shown in a 4-socket configuration, 8 GB/sec per link.
The Northbridge/HyperTransport is on die.
Glueless logic to DDR, DDR2 memory.
PCI-X/PCIe bridges (usually implemented in the Southbridge).
Today's AMD PC Architecture
Torrenza technology
Allows licensing of coherent HyperTransport to 3rd-party manufacturers to make socket-compatible accelerators/co-processors.
Allows 3rd-party PPUs (Physics Processing Units), GPUs, and co-processors to access main system memory directly and coherently.
Could make the accelerator programming model easier to use than, say, the Cell processor, where each SPE cannot directly access main memory.
HyperTransport Feeds and Speeds
Primarily a low-latency direct chip-to-chip interconnect; supports mapping to board-to-board interconnects such as PCIe.
HyperTransport 1.0 Specification: 800 MHz max, 12.8 GB/s aggregate bandwidth (6.4 GB/s each way).
HyperTransport 2.0 Specification: added PCIe mapping; 1.0-1.4 GHz clock, 22.4 GB/s aggregate bandwidth (11.2 GB/s each way).
HyperTransport 3.0 Specification: 1.8-2.6 GHz clock, 41.6 GB/s aggregate bandwidth (20.8 GB/s each way); added AC coupling to extend HyperTransport to long-distance system-to-system interconnect.
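The per-direction figures above follow from the link arithmetic. HyperTransport clocks data on both edges (DDR), so each direction moves clock rate times two transfers times the link width in bytes; the 32-bit (4-byte) maximum link width is assumed here, as the quoted numbers imply:

```c
/* HyperTransport per-direction bandwidth, assuming the 32-bit
 * maximum link width: clock * 2 (double data rate) * 4 bytes. */
double ht_GBps_each_way(double clock_GHz) {
    return clock_GHz * 2.0 * 4.0;
}
/* 0.8 GHz -> 6.4 GB/s (HT 1.0); 2.6 GHz -> 20.8 GB/s (HT 3.0) */
```

Aggregate bandwidth is simply twice the per-direction value, since the two directions use dedicated links.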
Courtesy HyperTransport Consortium. Source: White Paper: AMD HyperTransport Technology-Based System Architecture