A Closer Look at GPUs


DOI:10.1145/1400181.1400197

As the line between GPUs and CPUs begins to blur, it's important to understand what makes GPUs tick.

By Kayvon Fatahalian and Mike Houston

A GAMER WANDERS through a virtual world rendered in near-cinematic detail. Seconds later, the screen fills with a 3D explosion, the result of unseen enemies hiding in physically accurate shadows. Disappointed, the user exits the game and returns to a computer desktop that exhibits the stylish 3D look-and-feel of a modern window manager. Both of these visual experiences require hundreds of gigaflops of computing performance, a demand met by the GPU (graphics processing unit) present in every consumer PC.

The modern GPU is a versatile processor that constitutes an extreme but compelling point in the growing space of multicore parallel computing architectures. These platforms, which include GPUs, the STI Cell Broadband Engine, the Sun UltraSPARC T2, and increasingly multicore x86 systems from Intel and AMD, differentiate themselves from traditional CPU designs by prioritizing high-throughput processing of many parallel operations over the low-latency execution of a single task.

GPUs assemble a large collection of fixed-function and software-programmable processing resources. Impressive statistics, such as ALU (arithmetic logic unit) counts and peak floating-point rates, often emerge during discussions of GPU design. Despite the inherently parallel nature of graphics, however, efficiently mapping common rendering algorithms onto GPU resources is extremely challenging.

The key to high performance lies in strategies that hardware components and their corresponding software interfaces use to keep GPU processing resources busy. GPU designs go to great lengths to obtain high efficiency and conveniently reduce the difficulty programmers face when programming graphics applications. As a result, GPUs deliver high performance and expose an expressive but simple programming interface. This interface remains largely devoid of explicit parallelism or asynchronous execution and has proven to be portable across vendor implementations and generations of GPU designs.

At a time when the shift toward throughput-oriented CPU platforms is prompting alarm about the complexity of parallel programming, understanding key ideas behind the success of GPU computing is valuable not only for developers targeting software for GPU execution, but also for informing the design of new architectures and programming systems for other domains. In this article, we dive under the hood of a modern GPU to look at why interactive rendering is challenging and to explore the solutions GPU architects have devised to meet these challenges.

The Graphics Pipeline

A graphics system generates images that represent views of a virtual scene.



PO (pixel operations). PO uses each fragment's screen position to calculate and apply the fragment's contribution to output image pixel values. PO accounts for a sample's distance from the virtual camera and discards fragments that are blocked from view by surfaces closer to the camera. When fragments from multiple primitives contribute to the value of a single pixel, as is often the case when semi-transparent surfaces overlap, many rendering techniques rely on PO to perform pixel updates in the order defined by the primitives' positions in the PP output stream. All graphics APIs guarantee this behavior, and PO is the only stage where the order of entity processing is specified by the pipeline's definition.
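To make the ordering requirement concrete, the following C sketch models PO for a single pixel under simplified assumptions (the Fragment and Pixel records, the depth test, and the over-style blend are illustrative choices, not a description of any particular GPU): fragments are applied strictly in stream order, which is exactly the order PO must preserve because blending is not commutative.

    #include <stdio.h>

    /* Hypothetical fragment record arriving at PO for one pixel, in the
     * order defined by the PP output stream. */
    typedef struct {
        float depth;       /* distance from the virtual camera */
        float r, g, b;     /* surface color computed by FP     */
        float a;           /* opacity                          */
    } Fragment;

    typedef struct { float r, g, b, depth; } Pixel;

    /* Apply fragments to one pixel strictly in stream order: discard those
     * blocked by a closer surface, blend the rest over the current value.
     * Because blending is not commutative, reordering this loop would
     * change the image; preserving the order is PO's job. */
    static void pixel_ops(Pixel *px, const Fragment *f, int n) {
        for (int i = 0; i < n; i++) {
            if (f[i].depth >= px->depth)          /* depth test failed */
                continue;
            px->r = f[i].a * f[i].r + (1.0f - f[i].a) * px->r;   /* "over" blend */
            px->g = f[i].a * f[i].g + (1.0f - f[i].a) * px->g;
            px->b = f[i].a * f[i].b + (1.0f - f[i].a) * px->b;
            px->depth = f[i].depth;               /* new closest surface */
        }
    }

    int main(void) {
        Pixel px = { 0, 0, 0, 1e30f };            /* background, infinitely far   */
        Fragment frags[] = {
            { 5.0f, 1, 0, 0, 1.0f },              /* opaque red surface           */
            { 2.0f, 0, 0, 1, 0.5f },              /* semi-transparent blue, nearer */
        };
        pixel_ops(&px, frags, 2);
        printf("pixel = (%.2f, %.2f, %.2f)\n", px.r, px.g, px.b);
        return 0;
    }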

Shader Programming

The behavior of application-programmable pipeline stages (VP, PP, FP) is defined by shader functions (or shaders). Graphics programmers express vertex, primitive, and fragment shader functions in high-level shading languages such as NVIDIA's Cg, OpenGL's GLSL, or Microsoft's HLSL. Shader source is compiled into bytecode offline, then transformed into a GPU-specific binary by the graphics driver at runtime.

Shading languages support complex data types and a rich set of control-flow constructs, but they do not contain primitives related to explicit parallel execution. Thus, a shader definition is a C-like function that serially computes output-entity data records from a single input entity. Each function invocation is abstracted as an independent sequence of control that executes in complete isolation from the processing of other stream entities.

As a convenience, in addition to data records from stage input and output streams, shader functions may access (but not modify) large, globally shared data buffers. Prior to pipeline execution, these buffers are initialized to contain shader-specific parameters and textures by the application.
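As an illustration of this model, here is a hypothetical fragment-shader-style function written in plain C rather than a real shading language such as Cg, GLSL, or HLSL. It consumes one input record, reads (but never writes) globally shared parameters and a texture, and produces exactly one output record; every type, field, and helper name is invented for the sketch.

    /* One input fragment record and one output record (fields are illustrative). */
    typedef struct { float u, v; float nx, ny, nz; } FragmentIn;
    typedef struct { float r, g, b, a; } FragmentOut;

    /* Read-only, globally shared data initialized by the application
     * before pipeline execution. */
    typedef struct { float light_dir[3]; float ambient; } ShaderParams;
    typedef struct { int w, h; const float *rgb; } Texture;

    /* Minimal stand-in for the filtered texture lookup that a shading
     * language provides as a built-in (nearest-neighbor, for brevity). */
    static void texture_sample(const Texture *t, float u, float v, float out[3]) {
        int x = (int)(u * (float)(t->w - 1) + 0.5f);
        int y = (int)(v * (float)(t->h - 1) + 0.5f);
        const float *p = t->rgb + 3 * (y * t->w + x);
        out[0] = p[0]; out[1] = p[1]; out[2] = p[2];
    }

    /* The shader: a C-like function that computes one output record from
     * one input record, in isolation from all other stream entities. */
    FragmentOut shade(FragmentIn in, const ShaderParams *p, const Texture *tex) {
        FragmentOut out;
        float base[3];
        texture_sample(tex, in.u, in.v, base);               /* read-only access */
        /* simple diffuse lighting term from the surface normal */
        float ndotl = in.nx * p->light_dir[0] + in.ny * p->light_dir[1]
                    + in.nz * p->light_dir[2];
        if (ndotl < 0.0f) ndotl = 0.0f;
        out.r = base[0] * (p->ambient + ndotl);
        out.g = base[1] * (p->ambient + ndotl);
        out.b = base[2] * (p->ambient + ndotl);
        out.a = 1.0f;
        return out;                          /* exactly one output record */
    }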

Characteristics and Challenges

Graphics pipeline execution is characterized by the following key properties.

This scene is defined by the geometry, orientation, and material properties of object surfaces and the position and characteristics of light sources. A scene view is described by the location of a virtual camera. Graphics systems seek to find the appropriate balance between the conflicting goals of enabling maximum performance and maintaining an expressive but simple interface for describing graphics computations.

Real-time graphics APIs such as Direct3D and OpenGL strike this balance by representing the rendering computation as a graphics processing pipeline that performs operations on four fundamental entities: vertices, primitives, fragments, and pixels. Figure 1 provides a block diagram of a simplified seven-stage graphics pipeline. Data flows between stages in streams of entities. This pipeline contains fixed-function stages (green) implementing API-specified operations and three programmable stages (red) whose behavior is defined by application code. Figure 2 illustrates the operation of key pipeline stages.

VG (vertex generation). Real-time graphics APIs represent surfaces as collections of simple geometric primitives (points, lines, or triangles). Each primitive is defined by a set of vertices. To initiate rendering, the application provides the pipeline's VG stage with a list of vertex descriptors. From this list, VG prefetches vertex data from memory and constructs a stream of vertex data records for subsequent processing. In practice, each record contains the 3D (x,y,z) scene position of the vertex plus additional application-defined parameters such as surface color and normal vector orientation.

VP (vertex processing). The behavior of VP is application programmable. VP operates on each vertex independently and produces exactly one output vertex record from each input record. One of the most important operations of VP execution is computing the 2D output image (screen) projection of the 3D vertex position.
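The projection computation itself is up to the application's vertex shader, but its canonical form is easy to sketch in C: transform the 3D position by a 4x4 model-view-projection matrix supplied as a shader parameter, perform the perspective divide, and map the result to pixel coordinates. The row-major matrix layout and the simple viewport mapping below are assumptions made for the example.

    typedef struct { float x, y; } Screen2D;

    /* m is a 4x4 model-view-projection matrix in row-major order, provided
     * to the shader as a read-only parameter by the application. */
    Screen2D project_vertex(const float m[16], float x, float y, float z,
                            int screen_w, int screen_h) {
        /* homogeneous transform of (x, y, z, 1) */
        float cx = m[0]*x  + m[1]*y  + m[2]*z  + m[3];
        float cy = m[4]*x  + m[5]*y  + m[6]*z  + m[7];
        float cw = m[12]*x + m[13]*y + m[14]*z + m[15];

        /* perspective divide yields normalized device coordinates in [-1, 1] */
        float ndc_x = cx / cw;
        float ndc_y = cy / cw;

        /* viewport transform to pixel coordinates */
        Screen2D s;
        s.x = (ndc_x * 0.5f + 0.5f) * (float)screen_w;
        s.y = (ndc_y * 0.5f + 0.5f) * (float)screen_h;
        return s;
    }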

PG (primitive generation). PG uses vertex topology data provided by the application to group vertices from VP into an ordered stream of primitives (each primitive record is the concatenation of several VP output vertex records). Vertex topology also defines the order of primitives in the output stream.

PP (primitive processing). PP operates independently on each input primitive to produce zero or more output primitives. Thus, the output of PP is a new (potentially longer or shorter) ordered stream of primitives. Like VP, PP operation is application programmable.

FG (fragment generation). FG samples each primitive densely in screen space (this process is called rasterization). Each sample is manifest as a fragment record in the FG output stream. Fragment records contain the output image position of the surface sample, its distance from the virtual camera, as well as values computed via interpolation of the source primitive's vertex parameters.

FP (fragment processing). FP simulates the interaction of light with scene surfaces to determine surface color and opacity at each fragment's sample point. To give surfaces realistic appearances, FP computations make heavy use of filtered lookups into large, parameterized 1D, 2D, or 3D arrays called textures. FP is an application-programmable stage.

[Figure 1: A simplified graphics pipeline. The stages are vertex generation (VG), vertex processing (VP), primitive generation (PG), primitive processing (PP), fragment generation (FG), fragment processing (FP), and pixel operations (PO). Memory buffers hold vertex data, vertex topology, textures, and the output image; fixed-function and application-programmable stages are distinguished in the diagram.]


Opportunities for parallel processing. Graphics presents opportunities for both task- (across pipeline stages) and data- (stages operate independently on stream entities) parallelism, making parallel processing a viable strategy for increasing throughput. Despite abundant potential parallelism, however, the unpredictable cost of shader execution and constraints on the order of PO stage processing introduce dynamic, fine-grained dependencies that complicate parallel implementation throughout the pipeline. Although output image contributions from most fragments can be applied in parallel, those that contribute to the same pixel cannot.

Extreme variations in pipeline load. Although the number of stages and data flows of the graphics pipeline is fixed, the computational and bandwidth requirements of all stages vary significantly depending on the behavior of shader functions and the properties of the scene. For example, primitives that cover large regions of the screen generate many more fragments than vertices. In contrast, many small primitives result in high vertex-processing demands. Applications frequently reconfigure the pipeline to use different shader functions that vary from tens of instructions to a few hundred. For these reasons, over the duration of processing for a single frame, different stages will dominate overall execution, often resulting in bandwidth- and compute-intensive phases of execution. Dynamic load balancing is required to maintain an efficient mapping of the graphics pipeline to a GPU's resources in the face of this variability, and GPUs employ sophisticated heuristics for reallocating execution and on-chip storage resources amongst pipeline stages depending on load.

Fixed-function stages encapsulate difficult-to-parallelize work. Programmable stages are trivially parallelizable by executing shader function logic simultaneously on multiple stream entities. In contrast, the pipeline's nonprogrammable stages involve multiple entity interactions (such as ordering dependencies in PO or vertex grouping in PG) and stateful processing. Isolating this non-data-parallel work into fixed stages allows the GPU's programmable processing components to be highly specialized for data-parallel execution and keeps the shader programming model simple. In addition, the separation enables difficult aspects of the graphics computation to be encapsulated in optimized, fixed-function hardware components.

Mixture of predictable and unpredictable data access. The graphics pipeline rigidly defines inter-stage data flows using streams of entities. This predictability presents opportunities for aggregate prefetching of stream data records and highly specialized hardware management of on-chip storage resources. In contrast, buffer and texture accesses performed by shaders are fine-grained memory operations on dynamically computed addresses, making prefetch difficult. As both forms of data access are critical to maintaining high throughput, shader programming models explicitly differentiate stream from buffer/texture memory accesses, permitting specialized hardware solutions for both types of accesses.

Opportunities for instruction stream sharing. While the shader programming model permits each shader invocation to follow a unique stream of control, in practice, shader execution on nearby stream elements often results in the same dynamic control-flow decisions. As a result, multiple shader invocations can likely share an instruction stream. Although GPUs must accommodate situations where this is not the case, the use of SIMD-style execution to exploit shared control flow across multiple shader invocations is a key optimization in the design of GPU processing cores and is accounted for in algorithms for pipeline scheduling.

Programmable Processing Resources

A large fraction of a GPU's resources exist within programmable processing cores responsible for executing shader functions. While substantial implementation differences exist across vendors and product lines, all modern GPUs maintain high efficiency through the use of multicore designs that employ both hardware multithreading and SIMD (single instruction, multiple data) processing. As shown in the table here, these throughput-computing techniques are not unique to GPUs (top two rows). In comparison with CPUs, however, GPU designs push these ideas to extreme scales.

[Figure 2: Graphics pipeline operations. (a) Six vertices from the VG output stream define two triangles. (b) Following VP and PG, the vertices are grouped into two triangle primitives, p0 and p1. (c) FG samples the two primitives, producing fragments for p0 and p1. (d) FP computes the appearance of the surface at each sample. (e) PO updates the output image with the fragments, accounting for surface visibility. In this example, p1 is closer to the camera than p0, so portions of p0 are occluded by p1.]


Dynamic per-entity control flow is implemented by executing all control paths taken by the shader invocations in the group. SIMD operations that do not apply to all invocations, such as those within shader code conditional or loop blocks, are partially nullified using write-masks. In this implementation, when shader control flow diverges, fewer SIMD ALUs do useful work. Thus, on a chip with width-S SIMD processing, worst-case behavior yields performance equaling 1/S of the chip's peak rate. Fortunately, shader workloads exhibit sufficient levels of instruction stream sharing to justify wide SIMD implementations. Additionally, GPU ISAs contain special instructions that make it possible for shader compilers to transform per-entity control flow into efficient sequences of explicit or implicit SIMD operations.
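A scalar C model of this write-mask scheme (a toy with a made-up SIMD width of 8, rather than the 32 to 64 of real GPUs) shows both how a divergent if/else is executed and why divergence wastes ALU work: every lane steps through both control paths, and the mask determines which lanes commit results.

    #include <stdio.h>

    #define S 8   /* SIMD width of this toy machine */

    /* Compute out[i] = (in[i] >= 0) ? in[i]*2 : -in[i] the way a masked SIMD
     * machine would: issue BOTH branches over all lanes and use write-masks
     * to nullify lanes that a branch does not apply to. */
    void simd_abs_or_double(const float in[S], float out[S]) {
        int mask[S];                      /* 1 where the "then" branch applies */
        int active = 0;
        for (int i = 0; i < S; i++) {
            mask[i] = (in[i] >= 0.0f);
            active += mask[i];
        }
        /* then-branch: occupies all S ALUs, committed only where mask = 1 */
        for (int i = 0; i < S; i++)
            if (mask[i]) out[i] = in[i] * 2.0f;
        /* else-branch: occupies all S ALUs, committed only where mask = 0 */
        for (int i = 0; i < S; i++)
            if (!mask[i]) out[i] = -in[i];

        /* If 'active' is neither 0 nor S, the group diverged: each branch's
         * instructions used all S ALUs but did useful work on only some of
         * them. Worst case, useful throughput drops to 1/S of peak. */
        printf("%d of %d lanes took the 'then' path\n", active, S);
    }

    int main(void) {
        float in[S] = { 1, -2, 3, -4, 5, -6, 7, 8 };
        float out[S];
        simd_abs_or_double(in, out);
        for (int i = 0; i < S; i++) printf("%g ", out[i]);
        printf("\n");
        return 0;
    }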

Hardware Multithreading = High ALU Utilization. Thread stalls pose an additional challenge to high-performance shader execution. Threads stall (or block) when the processor cannot dispatch the next instruction in an instruction stream due to a dependency on an outstanding instruction. High-latency off-chip memory accesses, most notably those generated by texture access operations, cause thread stalls lasting hundreds of cycles (recall that while shader input and output records lend themselves to streaming prefetch, texture accesses do not).

Allowing ALUs to remain idle during the period while a thread is stalled is inefficient. Instead, GPUs maintain more execution contexts on chip than they can simultaneously execute, and they perform instructions from runnable threads when others are stalled. Hardware scheduling logic determines which context(s) to execute in each processor cycle. This technique of overprovisioning cores with thread contexts to hide the latency of thread stalls is called hardware multithreading. GPUs use multithreading as the primary mechanism to hide both memory access and instruction pipeline latencies.

The amount of stall latency a GPU can tolerate via multithreading is dependent on the ratio of hardware thread contexts to the number of threads that are simultaneously executed in a clock (we refer to this ratio as T).

These extensions provide instructions that control the operation of four ALUs (SIMD width of 4). Alternatively, most GPUs realize the benefits of SIMD execution by implicitly sharing an instruction stream across threads with identical PCs. In this implementation, the SIMD width of the machine is not explicitly made visible to the programmer. CPU designers have chosen a SIMD width of four as a balance between providing increased throughput and retaining high single-threaded performance. Characteristics of the shading workload make it beneficial for GPUs to employ significantly wider SIMD processing (widths ranging from 32 to 64) and to support a rich set of operations. It is common for GPUs to support SIMD implementations of reciprocal square root, trigonometric functions, and memory gather/scatter operations.

The efficiency of wide SIMD processing allows GPUs to pack many cores densely with ALUs. For example, the NVIDIA GeForce GTX 280 GPU contains 480 ALUs operating at 1.3 GHz. These ALUs are organized into 30 processing cores and yield a peak rate of 933 GFLOPS. In comparison, a high-end 3 GHz Intel Core 2 Quad CPU contains four cores, each with eight SIMD floating-point ALUs (two 4-width vector instructions per clock), and is capable of, at most, 96 GFLOPS of peak performance.

Recall that a shader function defines processing on a single pipeline entity. GPUs execute multiple invocations of the same shader function in parallel to take advantage of SIMD processing.

Multicore + SIMD Processing = Lots of ALUs. A logical thread of control is realized by a stream of processor instructions that execute within a processor-managed environment, called an execution (or thread) context. This context consists of state such as a program counter, a stack pointer, general-purpose registers, and virtual memory mappings. A single-core processor managing a single execution context can run one thread of control at a time. A multicore processor replicates processing resources (ALUs, control logic, and execution contexts) and organizes them into independent cores. When an application features multiple threads of control, multicore architectures provide increased throughput by executing these instruction streams on each core in parallel. For example, an Intel Core 2 Quad contains four cores and can execute four instruction streams simultaneously. As significant parallelism exists across shader invocations in a graphics pipeline, GPU designs easily push core counts higher.

Even higher performance is possible by populating each core with multiple floating-point ALUs. This is done efficiently through SIMD (single instruction, multiple data) processing, where several ALUs perform the same operation on a different piece of data. SIMD processing amortizes the complexity of decoding an instruction stream and the cost of ALU control structures across multiple ALUs, resulting in both power- and area-efficient chip execution.

The most common implementation of SIMD processing is via explicit short-vector instructions, similar to those provided by the x86 SSE or PowerPC AltiVec ISA extensions.

Table 1. Throughput-computing techniques in representative processors.

Type  Processor                 Cores/chip  ALUs/core(3)  SIMD width  Max T(4)
GPU   AMD Radeon HD 4870            10          80            64         25
      NVIDIA GeForce GTX 280        30           8            32        128
CPU   Intel Core 2 Quad(1)           4           8             4          1
      STI Cell BE(2)                 8           4             4          1
      Sun UltraSPARC T2              8           1             1          4

(1) SSE processing only; does not account for the traditional FPU.
(2) Stream processing (SPE) cores only; does not account for the PPU cores.
(3) 32-bit floating-point operations.
(4) Max T is defined as the maximum ratio of hardware-managed thread execution contexts to simultaneously executable threads (not an absolute count of hardware-managed execution contexts). This ratio is a measure of a processor's ability to automatically hide thread stalls using hardware multithreading.


Support for more thread contexts allows the GPU to hide longer or more frequent stalls. All modern GPUs maintain large numbers of execution contexts on chip to provide maximal memory latency-hiding ability (T reaches 128 in modern GPUs; see the table). This represents a significant departure from CPU designs, which attempt to avoid or minimize stalls primarily using large, low-latency data caches and complicated out-of-order execution logic. Current Intel Core 2 and AMD Phenom processors maintain one thread per core, and even high-end models of Sun's multithreaded UltraSPARC T2 processor manage only four times the number of threads they can simultaneously execute.

Note that in the absence of stalls, the throughput of single- and multithreaded processors is equivalent. Multithreading does not increase the number of processing resources on a chip. Rather, it is a strategy that interleaves execution of multiple threads in order to use existing resources more efficiently (improve throughput). On average, a multithreaded core operating at its peak rate runs each thread 1/T of the time.

To achieve large-scale multithreading, execution contexts must be compact. The number of thread contexts supported by a GPU core is limited by the size of on-chip execution context storage. GPUs require compiled shader binaries to statically declare input and output entity sizes, as well as bounds on temporary storage and scratch registers required for their execution. At runtime, GPUs use these bounds to dynamically partition on-chip storage (including data registers) to support the maximum possible number of threads. As a result, the latency-hiding ability of a GPU is shader dependent. GPUs can manage many thread contexts (and provide maximal latency-hiding ability) when shaders use fewer resources. When shaders require large amounts of storage, the number of execution contexts (and latency-hiding ability) provided by a GPU drops.
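The trade-off is easy to quantify with the numbers used in the accompanying sidebar: a core with a pool of 16 vector registers can hold four contexts of a shader that declares four registers, but only two contexts of a shader that declares eight. A minimal sketch, assuming even partitioning of the register file:

    #include <stdio.h>

    /* Contexts a core can hold when on-chip register storage is divided
     * evenly among resident shader threads (integer division: leftover
     * registers go unused). */
    int contexts_per_core(int register_file_size, int registers_per_shader) {
        return register_file_size / registers_per_shader;
    }

    int main(void) {
        int regfile = 16;      /* vector registers per core (sidebar example) */
        for (int need = 2; need <= 16; need *= 2) {
            int t = contexts_per_core(regfile, need);
            printf("shader declares %2d registers -> %2d resident contexts (T = %d)\n",
                   need, t, t);
        }
        return 0;
    }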

Fixed-Function Processing Resources

A GPU's programmable cores interoperate with a collection of specialized fixed-function processing units that provide high-performance, power-efficient implementations of nonshader stages.

Running a Fragment Shader on a GPU Core

Shader compilation to SIMD (single instruction, multiple data) instruction sequences coupled with dynamic hardware thread scheduling lead to efficient execution of a fragment shader on the simplified single-core GPU shown in Figure A.

- The core executes an instruction from at most one thread each processor clock, but maintains state for four threads on-chip simultaneously (T=4).
- Core threads issue explicit width-32 SIMD vector instructions; 32 ALUs simultaneously execute a vector instruction in a single clock.
- The core contains a pool of 16 general-purpose vector registers (each containing a vector of 32 single-precision floats) partitioned among thread contexts.
- The only source of thread stalls is texture access; these stalls have a maximum latency of 50 cycles.

Shader compilation by the graphics driver produces a GPU binary from high-level fragment shader source. The resulting vector instruction sequence performs 32 invocations of the fragment shader simultaneously by carrying out each invocation in a single lane of the width-32 vectors. The compiled binary requires four vector registers for temporary results and contains 20 arithmetic instructions between each texture access operation.

At runtime, the GPU executes a copy of the shader binary on each of its four thread contexts, as illustrated in Figure B. The core executes T0 (thread 0) until it detects a stall resulting from texture access in cycle 20. While T0 waits for the result of the texturing operation, the core continues to execute its remaining three threads. The result of T0's texture access becomes available in cycle 70. Upon T3's stall in cycle 80, the core immediately resumes T0. Thus, at no point during execution are ALUs left idle.

When executing the shader program for this example, a minimum of four threads is needed to keep core ALUs busy. Each thread operates simultaneously on 32 fragments; thus, 4*32=128 fragments are required for the chip to achieve peak performance. As memory latencies on real GPUs involve hundreds of cycles, modern GPUs must contain support for significantly more threads to sustain high utilization. If we extend our simple GPU to a more realistic size of 16 processing cores and provision each core with storage for 16 execution contexts, then simultaneous processing of 8,192 fragments is needed to approach peak processing rates. Clearly, GPU performance relies heavily on the abundance of parallel shading work.
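The schedule in Figure B can be reproduced with a small round-robin simulation. The sketch below uses the sidebar's illustrative parameters (four contexts, 20 arithmetic instructions between texture accesses, 50-cycle texture latency) and a hypothetical scheduling policy; it reports zero idle ALU cycles with four threads, and reducing NTHREADS makes the stalls visible.

    #include <stdio.h>

    #define NTHREADS 4     /* hardware thread contexts (T = 4)            */
    #define BURST    20    /* arithmetic instructions between texture ops */
    #define LATENCY  50    /* cycles until a texture result comes back    */
    #define CYCLES   400   /* length of the simulation                    */

    int main(void) {
        int ready_at[NTHREADS] = {0};   /* cycle at which each thread is runnable */
        int left[NTHREADS];             /* instructions left in the current burst */
        int cur = 0, idle = 0;

        for (int t = 0; t < NTHREADS; t++) left[t] = BURST;

        for (int cycle = 0; cycle < CYCLES; cycle++) {
            /* If the current thread is stalled, look for another ready one
             * (round-robin): the job of the hardware scheduling logic. */
            int pick = -1;
            for (int k = 0; k < NTHREADS; k++) {
                int t = (cur + k) % NTHREADS;
                if (ready_at[t] <= cycle) { pick = t; break; }
            }
            if (pick < 0) { idle++; continue; }   /* no runnable thread: ALUs idle */

            cur = pick;
            left[cur]--;                          /* issue one vector instruction  */
            if (left[cur] == 0) {                 /* burst done: issue texture op; */
                ready_at[cur] = cycle + 1 + LATENCY;  /* thread stalls until result */
                left[cur] = BURST;
            }
        }
        printf("idle ALU cycles: %d of %d\n", idle, CYCLES);
        return 0;
    }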

[Figure A: Example GPU core, with 32 ALUs executing SIMD operations, a pool of general-purpose vector registers R0 through R15 partitioned among thread contexts, and four thread contexts T0 through T3.]

[Figure B: Thread execution on the example GPU core over cycles 0 through 80. At any given cycle, each thread is executing, ready (not executing), or stalled.]


These components do not simply augment programmable processing; they perform sophisticated operations and constitute an additional hundreds of gigaflops of processing power. Two of the most important operations performed via fixed-function hardware are texture filtering and rasterization (fragment generation).

Texturing is handled almost entirely by fixed-function logic. A texturing operation samples a contiguous 1D, 2D, or 3D signal (a texture) that is discretely represented by a multidimensional array of color values (2D texture data is simply an image). A GPU texture-filtering unit accepts a point within the texture's parameterization (represented by a floating-point tuple, such as {.5,.75}) and loads array values surrounding the coordinate from memory. The values are then filtered to yield a single result that represents the texture's value at the specified coordinate. This value is returned to the calling shader function. Sophisticated texture filtering is required for generating high-quality images. As graphics APIs provide a finite set of filtering kernels, and because filtering kernels are computationally expensive, texture filtering is well suited for fixed-function processing.
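For a 2D texture, the most common filtering kernel is bilinear interpolation over the four texels surrounding the sample coordinate. The C sketch below is a simplified software model of that operation (single channel, edge clamping, no mipmapping or format conversion), not a description of any vendor's filtering unit.

    #include <math.h>

    /* 'tex' is a w-by-h single-channel texture stored row-major.
     * (u, v) is the sample point in the texture's [0,1]x[0,1]
     * parameterization, e.g. {.5, .75}. Returns the filtered value. */
    float texture_bilinear(const float *tex, int w, int h, float u, float v) {
        /* map to texel space, centering samples on texel centers */
        float x = u * (float)w - 0.5f;
        float y = v * (float)h - 0.5f;
        float fx = x - floorf(x);                 /* fractional weights */
        float fy = y - floorf(y);
        int x0 = (int)floorf(x), y0 = (int)floorf(y);
        int x1 = x0 + 1, y1 = y0 + 1;

        /* clamp the 2x2 neighborhood to the texture edges */
        if (x0 < 0) x0 = 0;          if (y0 < 0) y0 = 0;
        if (x0 > w - 1) x0 = w - 1;  if (y0 > h - 1) y0 = h - 1;
        if (x1 > w - 1) x1 = w - 1;  if (y1 > h - 1) y1 = h - 1;

        /* fetch the four surrounding texels... */
        float t00 = tex[y0 * w + x0], t10 = tex[y0 * w + x1];
        float t01 = tex[y1 * w + x0], t11 = tex[y1 * w + x1];

        /* ...and blend them with the fractional weights */
        float top    = t00 * (1.0f - fx) + t10 * fx;
        float bottom = t01 * (1.0f - fx) + t11 * fx;
        return top * (1.0f - fy) + bottom * fy;
    }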

Primitive rasterization in the FG stage is another key pipeline operation currently implemented by fixed-function components. Rasterization involves densely sampling a primitive (at least once per output image pixel) to determine which pixels the primitive overlaps. This process involves computing the location of the surface at each sample point and then generating fragments for all sample points covered by the primitive. Bounding-box computations and hierarchical techniques optimize the rasterization process. Nonetheless, rasterization involves significant computation.
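A minimal software model of triangle rasterization shows where the computation goes: compute the primitive's screen-space bounding box, then test one sample per pixel in that box against the triangle's three edge functions and emit a fragment for each covered sample. The sketch below illustrates the idea only; real fixed-function rasterizers use hierarchical traversal and more careful sampling rules.

    #include <stdio.h>

    typedef struct { float x, y; } Vec2;

    static float min3(float a, float b, float c) { float m = a < b ? a : b; return m < c ? m : c; }
    static float max3(float a, float b, float c) { float m = a > b ? a : b; return m > c ? m : c; }

    /* Edge function: positive when point p lies to the left of edge a->b. */
    static float edge(Vec2 a, Vec2 b, Vec2 p) {
        return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
    }

    /* Rasterize one counter-clockwise triangle, sampling once per pixel at
     * pixel centers; returns the number of fragments generated. */
    static int rasterize(Vec2 v0, Vec2 v1, Vec2 v2, int width, int height) {
        /* bounding-box computation limits how many samples must be tested */
        int x0 = (int)min3(v0.x, v1.x, v2.x), x1 = (int)max3(v0.x, v1.x, v2.x);
        int y0 = (int)min3(v0.y, v1.y, v2.y), y1 = (int)max3(v0.y, v1.y, v2.y);
        if (x0 < 0) x0 = 0;  if (y0 < 0) y0 = 0;
        if (x1 > width - 1) x1 = width - 1;  if (y1 > height - 1) y1 = height - 1;

        int fragments = 0;
        for (int y = y0; y <= y1; y++) {
            for (int x = x0; x <= x1; x++) {
                Vec2 p = { x + 0.5f, y + 0.5f };        /* sample at pixel center */
                /* inside if the sample lies on the same side of all three edges */
                if (edge(v0, v1, p) >= 0 && edge(v1, v2, p) >= 0 && edge(v2, v0, p) >= 0)
                    fragments++;   /* a real FG stage would emit a fragment record */
            }
        }
        return fragments;
    }

    int main(void) {
        Vec2 a = {10, 10}, b = {100, 20}, c = {40, 90};  /* screen-space vertices */
        printf("fragments generated: %d\n", rasterize(a, b, c, 128, 128));
        return 0;
    }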

In addition to the components for texturing and rasterization, GPUs contain dedicated hardware components for operations such as surface visibility determination, output pixel compositing, and data compression/decompression.

The Memory System

Parallel-processing resources place extreme load on a GPU's memory system, which services memory requests from both fixed-function and programmable components. These requests include a mixture of fine-granularity and bulk prefetch operations and may even require real-time guarantees (such as display scan-out).

Recall that a GPU's programmable cores tolerate large memory latencies via hardware multithreading and that interstage stream data accesses can be prefetched. As a result, GPU memory systems are architected to deliver high-bandwidth, rather than low-latency, data access. High throughput is obtained through the use of wide memory buses and specialized GDDR (graphics double data rate) memories that operate most efficiently when memory access granularities are large. Thus, GPU memory controllers must buffer, reorder, and then coalesce large numbers of memory requests to synthesize large operations that make efficient use of the memory system. As an example, the ATI Radeon HD 4870 memory controller manipulates thousands of outstanding requests to deliver 115 GB per second of bandwidth from GDDR5 memories attached to a 256-bit bus.
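The buffer-reorder-coalesce idea can be modeled in a few lines of C. In the toy sketch below, pending request addresses are sorted and requests falling within the same wide memory segment are merged into a single transaction; the 64-byte segment size is an arbitrary example rather than a GDDR parameter.

    #include <stdio.h>
    #include <stdlib.h>

    #define SEGMENT 64u   /* bytes per wide memory transaction (example value) */

    static int by_address(const void *a, const void *b) {
        unsigned long x = *(const unsigned long *)a, y = *(const unsigned long *)b;
        return (x > y) - (x < y);
    }

    /* Reorder pending request addresses and count how many SEGMENT-sized
     * transactions are needed once requests to the same segment are merged. */
    static int coalesce(unsigned long *addr, int n) {
        qsort(addr, n, sizeof(addr[0]), by_address);       /* reorder */
        int transactions = 0;
        unsigned long last_segment = (unsigned long)-1;
        for (int i = 0; i < n; i++) {
            unsigned long seg = addr[i] / SEGMENT;
            if (seg != last_segment) {      /* new segment: one wide transaction */
                transactions++;
                last_segment = seg;
            }                               /* else: merged with the previous one */
        }
        return transactions;
    }

    int main(void) {
        /* fine-grained requests from many shader invocations, in arrival order */
        unsigned long addr[] = { 4096, 72, 4100, 64, 12, 4160, 68, 8, 4104 };
        int n = (int)(sizeof(addr) / sizeof(addr[0]));
        printf("%d requests -> %d memory transactions\n", n, coalesce(addr, n));
        return 0;
    }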

GPU data caches meet different needs from CPU caches. GPUs employ relatively small, read-only caches (no cache coherence) that serve to filter requests destined for the memory controller and to reduce bandwidth requirements placed on main memory. Thus, GPU caches typically serve to amplify total bandwidth to processing units rather than decrease the latency of memory accesses. Interleaved execution of many threads renders large read-write caches inefficient because of severe cache thrashing. Instead, GPUs benefit from small caches that capture spatial locality across simultaneously executed shader invocations. This situation is common, as texture accesses performed while processing fragments in close screen proximity are likely to have overlapping texture-filter support regions.

Although most GPU caches are small, this does not imply that GPUs contain little on-chip storage. Significant amounts of on-chip storage are used to hold entity streams, execution contexts, and thread scratch data.

Pipeline Scheduling and Control

Mapping the entire graphics pipeline efficiently onto GPU resources is a challenging problem that requires dynamic and adaptive techniques.



A unique aspect of GPU computing is that hardware logic assumes a major role in mapping and scheduling computation onto chip resources. GPU hardware scheduling logic extends beyond the thread-scheduling responsibilities discussed in previous sections. GPUs automatically assign computations to threads, clean up after threads complete, size and manage buffers that hold stream data, guarantee ordered processing when needed, and identify and discard unnecessary pipeline work. This logic relies heavily on specific upfront knowledge of graphics workload characteristics.

Conventional thread programming uses operating-system or threading API mechanisms for thread creation, completion, and synchronization on shared structures. Large-scale multithreading coupled with the brevity of shader function execution (at most a few hundred instructions), however, means GPU thread management must be performed entirely by hardware logic.

GPUs minimize thread launch costs by preconfiguring execution contexts to run one of the pipeline's three types of shader functions and reusing the configuration multiple times for shaders of the same type. GPUs prefetch shader input records and launch threads when a shader stage's input stream contains a sufficient number of entities. Similar hardware logic commits records to the output stream buffer upon thread completion. The distribution of execution contexts to shader stages is reprovisioned periodically as pipeline needs change and stream buffers drain or approach capacity.

GPUs leverage upfront knowledge of pipeline entities to identify and skip unnecessary computation. For example, vertices shared by multiple primitives are identified and VP results cached to avoid duplicate vertex processing. GPUs also discard fragments prior to FP when the fragment will not alter the value of any image pixel. Early fragment discard is triggered when a fragment's sample point is occluded by a previously processed surface located closer to the camera.

Another class of hardware optimizations reorganizes fine-grained operations for more efficient processing. For example, rasterization orders fragment generation to maximize the screen proximity of samples. This ordering improves texture cache hit rates, as well as instruction stream sharing across shader invocations. The GPU memory controller also performs automatic reorganization when it reorders memory requests to optimize memory bus and DRAM utilization.

GPUs ensure inter-fragment PO ordering dependencies using hardware logic. Implementations use structures such as post-FP reorder buffers or scoreboards that delay fragment thread launch until the processing of overlapping fragments is complete.

GPU hardware can take responsibility for sophisticated scheduling decisions because the semantics and invariants of the graphics pipeline are known a priori. Hardware implementation enables fine-granularity logic that is informed by precise knowledge of both the graphics pipeline and the underlying GPU implementation. As a result, GPUs are highly efficient at using all available resources. The drawback of this approach is that GPUs execute only those computations for which these invariants and structures are known.

Graphics programming is becoming increasingly versatile. Developers constantly seek to incorporate more sophisticated algorithms and leverage more configurable graphics pipelines. Simultaneously, the growing popularity of GPU-based computing for nongraphics applications has led to new interfaces for accessing GPU resources. Given both of these trends, the extent to which GPU designers can embed a priori knowledge of computations into hardware scheduling logic will inevitably decrease over time.

A major challenge in the evolution of GPU programming involves preserving GPU performance levels and ease of use while increasing the generality and expressiveness of application interfaces. The designs of GPU-compute interfaces, such as NVIDIA's CUDA and AMD's CAL, are evidence of how difficult this challenge is. These frameworks abstract computation as large batch operations that involve many invocations of a kernel function operating in parallel. The resulting computations execute on GPUs efficiently only under conditions of massive data parallelism. Programs that attempt to implement non-data-parallel algorithms perform poorly.
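Conceptually, such a batch operation is a kernel function applied independently at every index of a large domain. The C sketch below expresses that model with a serial loop standing in for the massively parallel launch a framework such as CUDA or CAL performs on the GPU; the kernel, its arguments, and the launch helper are invented for illustration.

    #include <stdio.h>

    /* A "kernel": pure, per-element work, with no dependence on the order
     * in which invocations run. This is the shape of computation that
     * GPU-compute interfaces execute efficiently. */
    static void saxpy_kernel(int i, float a, const float *x, float *y) {
        y[i] = a * x[i] + y[i];
    }

    /* Stand-in for a parallel kernel launch: on a GPU the framework would
     * run these invocations concurrently across thousands of hardware
     * threads; a serial loop expresses the same semantics here. */
    static void launch(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; i++)
            saxpy_kernel(i, a, x, y);
    }

    int main(void) {
        enum { N = 8 };
        float x[N], y[N];
        for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }
        launch(N, 2.0f, x, y);          /* one batch operation over N elements */
        for (int i = 0; i < N; i++) printf("%g ", y[i]);
        printf("\n");
        return 0;
    }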

GPU-compute programming models are simple to use and permit well-written programs to make good use of both GPU programmable cores and (if needed) texturing resources. Programs using these interfaces, however, cannot use powerful fixed-function components of the chip, such as those related to compression, image compositing, or rasterization. Also, when these interfaces are enabled, much of the logic specific to graphics-pipeline scheduling is simply turned off. Thus, current GPU-compute programming frameworks significantly restrict computations so that their structure, as well as their use of chip resources, remains sufficiently simple for GPUs to run these programs in parallel.

GPU and CPU Convergence

The modern graphics processor is a powerful computing platform that resides at the extreme end of the design space of throughput-oriented architectures. A GPU's processing resources and accompanying memory system are heavily optimized to execute large numbers of operations in parallel. In addition, specialization to the graphics domain has enabled the use of fixed-function processing and allowed hardware scheduling of a parallel computation to be practical. With this design, GPUs deliver unsurpassed levels of performance to challenging workloads while maintaining a simple and convenient programming interface for developers.

Today, commodity CPU designs are adopting features common in GPU computing, such as increased core counts and hardware multithreading. At the same time, each generation of GPU evolution adds flexibility to previous high-throughput GPU designs. Given these trends, software developers in many fields are likely to take interest in the extent to which CPU and GPU architectures and, correspondingly, CPU and GPU programming systems, ultimately converge.

Kayvon Fatahalian (kayvonf@gmail.com) and Mike Houston are Ph.D. candidates in computer science in the Computer Graphics Laboratory at Stanford University.

A previous version of this article was published in the March 2008 issue of ACM Queue.

© 2008 ACM 0001-0782/08/1000 $5.00