CS 354 Performance Analysis

Mark Kilgard, University of Texas, April 26, 2012

Description: April 26, 2012; CS 354 Computer Graphics; University of Texas at Austin

Transcript of CS 354 Performance Analysis

Page 1: CS 354 Performance Analysis

CS 354 Performance Analysis

Mark Kilgard
University of Texas
April 26, 2012

Page 2: CS 354 Performance Analysis


Today’s material

In-class quiz: on the acceleration structures lecture
Lecture topic: Graphics Performance Analysis

Page 3: CS 354 Performance Analysis


My Office Hours
Tuesday, before class: Painter (PAI) 5.35, 8:45 a.m. to 9:15 a.m.
Thursday, after class: ACE 6.302, 11:00 a.m. to noon

Randy’s office hours
Monday & Wednesday, 11:00 a.m. to noon, Painter (PAI) 5.33

Page 4: CS 354 Performance Analysis


Last time, this time

Last lecture, we discussed acceleration structures
This lecture: graphics performance analysis

Projects: Project 4, on ray tracing, is on Piazza
Due May 2, 2012. Get started!

Page 5: CS 354 Performance Analysis


Daily Quiz

1. Multiple choice: Which is NOT a bounding volume representation?

a) sphere

b) axis-aligned bounding box

c) object-aligned bounding box

d) bounding graph point

e) convex polyhedron

2. True or False: Placing objects within a uniform grid is easier than placing objects within a KD tree.

3. True or False: Volume rendering can be accelerated by the GPU by drawing blended slices of the volume.

On a sheet of paper:
• Write your EID, name, and date
• Write #1, #2, #3 followed by its answer

Page 6: CS 354 Performance Analysis


Graphics Performance Analysis

Generating synthetic images by computer is compute- and bandwidth-intensive
Achieving interactive rates is key
60 frames/second ≈ real-time interactivity
Worth optimizing: entertainment and intuition are tied to interactivity

How do we think about graphics performance analysis?

Page 7: CS 354 Performance Analysis


Framing Amdahl’s Law

Assume a workload with two parts:
The first part is A%
The second part is B%
Such that A% + B% = 100%

If we have a technique to speed up the second part by N times, but no speedup for the first part, what overall speedup can we expect?

Page 8: CS 354 Performance Analysis


Amdahl’s Equation

Assume A% + B% = 100%. If the un-optimized effort is 100%, the optimized effort should be smaller:

$$\mathit{OptimizedEffort} = A\% + \frac{B\%}{N}$$

Speedup is the ratio of UnoptimizedEffort to OptimizedEffort:

$$\mathit{Speedup} = \frac{100\%}{A\% + \frac{B\%}{N}} = \frac{1}{(1 - B) + \frac{B}{N}}$$
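To make the formula concrete, here is a minimal sketch in C (the function name amdahl_speedup is illustrative, not from the lecture) computing the speedup for a serial fraction a and N-way parallelism on the rest:

```c
#include <stdio.h>

/* Amdahl's Law: the serial fraction a is unaffected; the parallel
 * fraction (1 - a) is sped up N times. */
double amdahl_speedup(double a, double n)
{
    return 1.0 / (a + (1.0 - a) / n);
}

int main(void)
{
    /* 50% serial workload, as in the example a few slides ahead */
    printf("N = 4:  %.3fx\n", amdahl_speedup(0.5, 4.0));   /* 1.600x */
    printf("N = 40: %.3fx\n", amdahl_speedup(0.5, 40.0));  /* 1.951x */
    return 0;
}
```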

Page 9: CS 354 Performance Analysis


Who was Amdahl?

Gene Amdahl
CPU architect for IBM in the 1960s
Helped design IBM’s System/360 mainframe architecture
Left IBM to found the Amdahl computer company, building IBM-compatible mainframes

Why the law? He was evaluating whether or not to invest in parallel processing

Page 10: CS 354 Performance Analysis


Parallelization

Broadly speaking, computer tasks can be broken into two portions:

Sequential sub-tasks
Naturally require steps to be done in a particular order
Examples: text layout, entropy decoding

Parallel sub-tasks
Problem splits into lots of independent chunks of work
Chunks of work can be done by separate processing units simultaneously: parallelization
Examples: tracing rays, shading pixels, transforming vertices

Page 11: CS 354 Performance Analysis


Serial Work Sandwiching Parallel Work

Page 12: CS 354 Performance Analysis


Example of Amdahl’s Law

Say a task is 50% serial and 50% parallel

Consider using 4 parallel processors on the parallel portion
Speedup: 1.6x

Consider using 40 parallel processors on the parallel portion
Speedup: 1.951x

Consider the limit:

$$\lim_{n \to \infty} \frac{1}{.5 + \frac{.5}{n}} = 2$$
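Plugging the 50%-serial workload into the formula makes these numbers explicit (a worked check, not on the original slide):

$$\mathit{Speedup}(4) = \frac{1}{0.5 + \frac{0.5}{4}} = \frac{1}{0.625} = 1.6$$

$$\mathit{Speedup}(40) = \frac{1}{0.5 + \frac{0.5}{40}} = \frac{1}{0.5125} \approx 1.951$$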

Page 13: CS 354 Performance Analysis


Graph of Amdahl’s Law

Page 14: CS 354 Performance Analysis


Pessimism about Parallelism?

Amdahl’s Law can instill pessimism about parallel processing:
If the serial work percentage is high, adding parallel units has low benefit
It assumes a fixed “problem” size, so the workload stays the same size even as parallel execution resources are added

So why do GPUs offer 100’s of cores then?

Page 15: CS 354 Performance Analysis


Gustafson's Law

Observation by John Gustafson: with N parallel units, bigger problems can be attacked

Great example: increasing GPU resolution
Was 640x480 pixels, now 1920x1200
More parallel units means more pixels can be processed simultaneously, supporting rendering resolutions previously unattainable

Problem size improvement:

$$\mathit{problemScale} = A + N(1 - A)$$

where A is the serial fraction of the work.

Page 16: CS 354 Performance Analysis


Example

Say a task is 50% serial and 50% parallel

Consider using 4 parallel processors on the parallel portion
Problem scales up: 2.5x

Consider 100 parallel processors
Problem scales up: 50.5x (both numbers are checked in the sketch below)

Also consider the heterogeneous nature of graphics processing units
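A minimal sketch of Gustafson’s scaling in C (gustafson_scale is an illustrative name, not from the lecture), reproducing both numbers above:

```c
#include <stdio.h>

/* Gustafson's Law: with serial fraction a and N parallel units,
 * the problem size solvable in the same wall-clock time scales
 * as a + N * (1 - a). */
double gustafson_scale(double a, double n)
{
    return a + n * (1.0 - a);
}

int main(void)
{
    printf("N = 4:   %.1fx\n", gustafson_scale(0.5, 4.0));    /* 2.5x  */
    printf("N = 100: %.1fx\n", gustafson_scale(0.5, 100.0));  /* 50.5x */
    return 0;
}
```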

Page 17: CS 354 Performance Analysis


Coherent Work vs. Incoherent Work

Not all parallel work is created equal

Coherent work = “adjacent” chunks of work performing similar operations and memory accesses
Example: camera rays, pixel shading
Allows sharing control of instruction execution
Good for caches

Incoherent work = “adjacent” chunks of work performing dissimilar operations and memory accesses
Examples: reflection, shadow, and refraction rays
Bad for caches

Page 18: CS 354 Performance Analysis


Coherent vs. Incoherent Rays

coherent = camera rays
coherent = light rays
incoherent = reflected rays

Page 19: CS 354 Performance Analysis


Keeping Work Coherent?

How do we keep work coherent?

Pipelines
Be careful: they can introduce latency

Data structures

SPMD (or SIMD) execution
Single Program, Multiple Data, to exploit Single Instruction, Multiple Data (SIMD) units
Bundling “adjacent” work elements helps cache and memory access efficiency

Page 20: CS 354 Performance Analysis


Pipeline Processing

Parallel and naturally coherent

Page 21: CS 354 Performance Analysis


A Simplified Graphics Pipeline

Application
(Application-OpenGL API boundary)
Vertex batching & assembly
Triangle assembly
Triangle clipping
NDC to window space
Triangle rasterization
Fragment shading
Depth testing (against the depth buffer)
Color update (to the framebuffer)

Page 22: CS 354 Performance Analysis


Another View of the Graphics Pipeline

OpenGL 3.3 pipeline (legend: programmable vs. fixed-function stages):

3D Application or Game
OpenGL API (the CPU-GPU boundary)
GPU Front End
Vertex Assembly, fed by Attribute Fetch
Vertex Shader (programmable)
Primitive Assembly
Geometry Program (programmable), with Parameter Buffer Read
Clipping, Setup, and Rasterization
Fragment Shader (programmable), with Texture Fetch
Raster Operations, with Framebuffer Access

Attribute fetches, texture fetches, parameter buffer reads, and framebuffer accesses all go through the Memory Interface

Page 23: CS 354 Performance Analysis


Modeling Pipeline Efficiency

Rate of processing for sequential tasks
Assume three tasks; run time is the sum of each operation’s time: A + B + C

Rate of processing in a pipeline
Assume three tasks, treated as stages; performance is gated by the slowest operation
Three operations in the pipeline: A, B, C; run time per item = max(A, B, C)
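A small illustrative model of the two rates (all names and numbers here are made up for the sketch): processing M items sequentially costs M·(A+B+C), while a three-stage pipeline, once filled, retires one item per max(A, B, C):

```c
#include <stdio.h>

/* Time to push `items` work items through three operations with
 * per-item costs a, b, c (arbitrary time units). */
double sequential_time(int items, double a, double b, double c)
{
    return items * (a + b + c);
}

double pipelined_time(int items, double a, double b, double c)
{
    double slowest = a;
    if (b > slowest) slowest = b;
    if (c > slowest) slowest = c;
    /* One pipeline fill, then one item retires per `slowest`. */
    return (a + b + c) + (items - 1) * slowest;
}

int main(void)
{
    printf("sequential: %.0f\n", sequential_time(1000, 2, 3, 1)); /* 6000 */
    printf("pipelined:  %.0f\n", pipelined_time(1000, 2, 3, 1));  /* 3003 */
    return 0;
}
```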

Page 24: CS 354 Performance Analysis


Hardware Clocks

Heart beat of hardware, measured as a frequency
Hertz (Hz) = cycles per second
Megahertz, gigahertz = million, billion Hz
Faster clocks = faster computation and data transfer

So why not simply raise clocks?
Higher clocks consume more power
Circuits are only rated to a maximum clock speed before becoming unreliable

Page 25: CS 354 Performance Analysis


Clock Domains

A given chip may have multiple clocks running. Three key domains (GPU-centric):

Graphics clock: for fixed-function units
Example uses: rasterization, texture filtering, blending
Optimized for throughput, not latency
Can often instance more units instead of raising clocks

Processor clock: for programmable shader units
Example: shader instruction execution
Generally higher than the graphics clock, because optimized for latency rather than throughput

Memory clock: for talking to external memory
Depends on the speed rating of the external memory

Other domains too: display clock, PCI-Express bus clock
Generally not crucial to rendering performance

Page 26: CS 354 Performance Analysis


3D Pipeline Programmable Domains Run on Unified Hardware

Unified Streaming Processor Array (SPA) architecture means the same capabilities for all domains
Plus tessellation + compute (not shown below)

The pipeline is as before: GPU Front End, Vertex Assembly (Attribute Fetch), Vertex Program, Primitive Assembly, Primitive Program (Parameter Buffer Read), Clipping, Setup, and Rasterization, Fragment Program (Texture Fetch), Raster Operations (Framebuffer Access), all through the Memory Interface

The Vertex, Primitive, and Fragment Programs can be unified hardware!

Page 27: CS 354 Performance Analysis


Memory Bandwidth

Raw memory bandwidth depends on:

Physical clock rate
Example: 3 GHz

Memory bus width
64-bit, 128-bit, 192-bit, 256-bit, 384-bit
Wider buses are faster, but it is more expensive to route all those wires

Signaling rate
Double data rate (DDR) means signals are sent on both the rising and falling clock edges
Often the logical memory clock rate already includes the signaling rate

Computing raw memory bandwidth:

$$\mathit{bandwidth} = \mathit{physicalClock} \times \mathit{signalsPerClock} \times \mathit{busWidth}$$
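A minimal sketch of the bandwidth formula in C (raw_bandwidth is an illustrative name); the usage example previews the GeForce GTX 680 numbers worked on the next slide:

```c
#include <stdio.h>

/* Raw memory bandwidth in bytes/second:
 * physical clock (Hz) x signals per clock x bus width (bits) / 8. */
double raw_bandwidth(double physical_clock_hz,
                     double signals_per_clock,
                     double bus_width_bits)
{
    return physical_clock_hz * signals_per_clock * bus_width_bits / 8.0;
}

int main(void)
{
    /* GeForce GTX 680: the 6 GHz logical memory clock already includes
     * the signaling rate, so signals_per_clock is 1; 256-bit bus. */
    double bytes_per_second = raw_bandwidth(6e9, 1.0, 256.0);
    printf("%.0f GB/s\n", bytes_per_second / 1e9);  /* 192 GB/s */
    return 0;
}
```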

Page 28: CS 354 Performance Analysis


Latency vs. Throughput

Raw bandwidth is reduced by imperfect memory utilization
Unrealistic to expect 100% utilization
GPUs are generally much better at this than CPUs

Trade-off:
Maximizing throughput (utilization) increases latency
Minimizing latency reduces utilization

Page 29: CS 354 Performance Analysis


Computing Bandwidth

Example: GeForce GTX 680
Latest NVIDIA generation: 3.54 billion transistors in a 28 nm process

Memory characteristics:
6 GHz memory clock (includes the signaling rate)
256-bit memory interface = 192 gigabytes/second

6 billion clocks/second × 256 bits/clock × 1 byte/8 bits = 192 GB/s

[GK104 die]

[GeForce GTX 680 board]

Page 30: CS 354 Performance Analysis


GeForce Peak Memory Bandwidth Trends

[Chart: peak memory bandwidth in gigabytes per second (0 to 200) for GeForce2 GTS, GeForce3, GeForce4 Ti4600, GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX; two series, raw bandwidth and effective raw bandwidth with compression, each with an exponential trend line; the 128-bit-interface and 256-bit-interface eras are marked]

Page 31: CS 354 Performance Analysis


Effective GPU Memory Bandwidth

Compression schemes
Lossless depth and color compression (when multisampling)
Lossy texture compression (S3TC / DXTC), typically assuming 4:1 compression

Avoidance of useless work
Early killing of fragments (Z cull)
Avoiding useless blending and texture fetches

Very clever memory controller designs
Combining memory accesses for improved coherency
Caches for texture fetches

Page 32: CS 354 Performance Analysis


Other Metrics

Host bandwidth
Vertex pulling
Vertex transformation
Triangle rasterization and setup
Fragment shading rate
Shader instruction rate
Raster (blending) operation rate
Early Z reject rate

Page 33: CS 354 Performance Analysis


Kepler GeForce GTX 680 High-Level Block Diagram

8 Streaming Multiprocessors (SMX)
1536 CUDA Cores
8 Geometry Units
4 Raster Units
128 Texture Units
32 Raster Operations
256-bit GDDR5 memory

Page 34: CS 354 Performance Analysis


Kepler Streaming Multiprocessor

8 more copies of this

Page 35: CS 354 Performance Analysis


Prior Generation Streaming Multiprocessor (SM)

Multi-processor execution unit (Fermi)
32 scalar processor cores
A warp is a unit of thread execution of up to 32 threads

Two workloads
Graphics: vertex shader, tessellation, geometry shader, fragment shader
Compute

Page 36: CS 354 Performance Analysis


Power Gating

Computer architecture has hit the “power wall”
Low-power operation is at a premium
Battery-powered devices
Thermal constraints
Economic constraints

Power Management (PM) works to reduce power by
Lowering clocks when performance isn’t required
Disabling hardware units, which avoids leakage

Page 37: CS 354 Performance Analysis


Scene Graph Labor

High-level division of scene graph labor: four pipeline stages

App (application)
Code that manipulates/modifies the scene graph in response to user input or other events

Isect (intersection)
Geometric queries such as collision detection or picking

Cull
Traverse the scene graph to find the nodes to be rendered
Best example: eliminate objects out of view
Optimize the ordering of nodes
Sort objects to minimize graphics hardware state changes

Draw
Communicate drawing commands to the hardware
Generally through a graphics API (OpenGL or Direct3D)

Can map well to multi-processor CPU systems (see the sketch below)
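A minimal single-threaded sketch of the four stages (the Scene type and stage functions are hypothetical placeholders, not a real scene graph API); on a multi-processor system, app, cull, and draw can become separate threads working on successive frames:

```c
/* Hypothetical scene graph type and stage functions. */
typedef struct Scene Scene;

extern void app_update(Scene *s);           /* respond to input/events  */
extern void isect_queries(Scene *s);        /* collision/picking checks */
extern void cull_traverse(Scene *s);        /* frustum cull, sort nodes */
extern void draw_dispatch(const Scene *s);  /* issue OpenGL commands    */

void frame_loop(Scene *scene)
{
    for (;;) {
        app_update(scene);     /* App stage   */
        isect_queries(scene);  /* Isect stage */
        cull_traverse(scene);  /* Cull stage  */
        draw_dispatch(scene);  /* Draw stage  */
        /* With multiple CPUs, each stage becomes its own thread,
         * pipelined across frames: draw frame N while culling
         * frame N+1 and running app code for frame N+2. */
    }
}
```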

Page 38: CS 354 Performance Analysis


App-cull-draw Threading

App-cull-draw processing on one CPU core

App-cull-draw processing on multiple CPUs

Page 39: CS 354 Performance Analysis


Scene Graph Profiling

Scene graph should help provide insight into performance

Process statistics: what’s going on? Time stamps
Database statistics: how complex is the scene in any frame?

Page 40: CS 354 Performance Analysis


Example: Depth Complexity Visualization

How many pixels are being rendered?
Pixels can be rasterized by multiple objects
Depth complexity is the average number of times a pixel or color sample is updated per frame

[Image: yellow and black indicate higher depth complexity]
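One classic way to produce such a visualization (a sketch of the general stencil-counting technique, not necessarily how the pictured tool works) is to increment the stencil buffer for every rasterized fragment and read the counts back:

```c
#include <GL/gl.h>

/* Count how many times each pixel is touched in a frame by
 * incrementing stencil on every fragment, regardless of depth. */
void draw_counting_depth_complexity(void (*draw_scene)(void))
{
    glClearStencil(0);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT |
            GL_STENCIL_BUFFER_BIT);

    glEnable(GL_STENCIL_TEST);
    glStencilFunc(GL_ALWAYS, 0, ~0u);        /* never reject fragments */
    glStencilOp(GL_INCR, GL_INCR, GL_INCR);  /* count every fragment   */

    draw_scene();

    glDisable(GL_STENCIL_TEST);
    /* The stencil buffer now holds per-pixel depth complexity; read it
     * back with glReadPixels(..., GL_STENCIL_INDEX, ...) and map the
     * counts to colors for display. */
}
```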

Page 41: CS 354 Performance Analysis


Example: Heads-up Display of Statistics

Process statistics: how long is everything taking?
Database statistics: what is being rendered?

Overlaying statistics on the active scene is often valuable
Dynamic update

Page 42: CS 354 Performance Analysis


Benchmarking

Synthetic benchmarks focus on rendering particular operations in isolation
Example: what is the blended pixel performance?

Application benchmarks try to reflect what a real application would do

Page 43: CS 354 Performance Analysis


Tips for Interactive Performance Analysis

Vary things you can control
Change the window resolution: making it smaller and seeing better performance suggests a per-pixel (fill-rate) bottleneck

Null driver analysis
Skip the actual rendering calls: what if the driver were infinitely fast?

Use occlusion queries to monitor how many samples (pixels) are actually going to be needed (see the sketch below)

Keep data on the GPU
Let the GPU do Direct Memory Access (DMA)
Keep from swapping textures and buffers; easy when multi-gigabyte graphics cards are available
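A minimal occlusion query sketch in C (assuming an OpenGL 1.5+ context and an extension loader where needed; draw_object is a placeholder) that reports how many samples a piece of rendering actually contributed:

```c
#include <GL/gl.h>  /* plus an extension loader on platforms that need one */

extern void draw_object(void);  /* placeholder for the rendering to measure */

GLuint count_samples_passed(void)
{
    GLuint query, samples = 0;

    glGenQueries(1, &query);
    glBeginQuery(GL_SAMPLES_PASSED, query);
    draw_object();
    glEndQuery(GL_SAMPLES_PASSED);

    /* Blocks until the GPU finishes; in production, poll
     * GL_QUERY_RESULT_AVAILABLE instead to avoid the stall. */
    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &samples);
    glDeleteQueries(1, &query);
    return samples;  /* samples (pixels) that survived depth testing */
}
```

If the count comes back zero or tiny, the object’s draw cost is a candidate for skipping or reducing in later frames.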

Page 44: CS 354 Performance Analysis


Next Class

Next lecture: surfaces and programmable tessellation

Reading: none

Project 4 is a simple ray tracer, due Wednesday, May 2, 2012