Stefan Popov High Performance GPU Ray Tracing

Real-time Ray Tracing on GPU with BVH-based Packet Traversal

Stefan Popov, Johannes Günther, Hans-Peter Seidel, Philipp Slusallek

Background
- GPUs attractive for ray tracing
  - High computational power
  - Shading oriented architecture
- GPU ray tracers
  - Carr: the ray engine
  - Purcell: full ray tracing on the GPU, based on grids
  - Ernst: KD trees with parallel stack
  - Carr, Thrane & Simonsen: BVH
  - Foley, Horn, Popov: KD trees, stackless traversal

Motivation
- So far
  - Interactive RT on GPU, but
    - Limited model size
    - No dynamic scene support
- The G80: a new approach to the GPU
  - High performance general purpose processor with graphics extensions
  - PRAM architecture
- BVHs allow for
  - Dynamic/deformable scenes
  - Small memory footprint
- Goal: Recursive ordered traversal of BVH on the G80

GPU Architecture (G80)
- Multi-threaded scalar architecture
  - 12K HW threads
  - Threads cover latencies
    - Off-chip memory ops
    - Instruction dependencies
  - 4 or 16 cycles to issue instr.
- 16 (multi-)cores
  - 8-wide SIMD
  - 128 scalar cores in total
  - Cores process threads in 32 wide SIMD chunks

[Figure: Multi-Core 1 … Multi-Core 16, each with its own chunk pool]


GPU Architecture (G80)
- Scalar register file (8K)
  - Partitioned among running threads
- Shared memory (16KB)
  - On-chip, 0 cycle latency
- On-board memory (768MB)
  - Large latency (~ 200 cycles)
  - R/W from within thread
  - Un-cached
- Read-only L2 cache (128KB)
  - On chip, shared among all

[Figure: Multi-Core 1 … Multi-Core 16 connected to the on-board memory through the L2 cache (128KB)]

Programming the G80
- CUDA
  - C based language with parallel extensions
- GPU utilization at 100% only if
  - Enough threads are present (>> 12K)
  - Every thread uses less than 10 registers and 5 words (32 bit) of shared memory
  - Enough computations per transferred word of data
    - Bandwidth << computational power
  - Adequate memory access pattern to allow read
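
The per-thread limits quoted above can be roughly reproduced with a little arithmetic. The sketch below assumes that the 8K registers and the 16KB of shared memory are per multi-core, and that the 12K hardware threads are spread evenly over the 16 multi-cores. The helper names are hypothetical and not part of CUDA.

```cuda
#include <cstdio>

// Hypothetical helper: resident threads per multi-core when all hardware
// threads are spread evenly over the multi-cores.
__host__ __device__ inline int threadsPerCore(int totalThreads, int cores) {
    return totalThreads / cores;
}

// Hypothetical helper: share of a per-multi-core resource that each resident
// thread can use. Integer division rounds down.
__host__ __device__ inline int budgetPerThread(int resourceUnits, int residentThreads) {
    return resourceUnits / residentThreads;
}

// Prints the budgets for the figures given on the slides.
void printBudgets() {
    const int totalThreads = 12 * 1024;     // "12K HW threads"
    const int cores        = 16;            // 16 multi-cores
    const int registers    = 8 * 1024;      // assumed: 8K registers per multi-core
    const int sharedWords  = 16 * 1024 / 4; // 16KB of shared memory as 32-bit words

    int resident = threadsPerCore(totalThreads, cores);
    printf("resident threads per multi-core: %d\n", resident);
    printf("registers per thread:            %d\n", budgetPerThread(registers, resident));
    printf("shared memory words per thread:  %d\n", budgetPerThread(sharedWords, resident));
}
```

Under these assumptions each multi-core holds 768 threads, which leaves about 10 registers and 5 shared-memory words per thread when the machine is fully populated.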


Performance Bottlenecks
- Efficient per-thread stack implementation
  - Shared memory too small: will limit parallelism
  - On-board memory: uncached
    - Need enough computations between stack ops
- Efficient memory access pattern
  - Use texture caches
    - However, only few words of cache / thread
  - Read successive memory locations in successive threads of a chunk
    - Single roundtrip to memory (read combining)
  - Cover latency with enough computations
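
As an illustration of the access pattern described above, the following minimal sketch lets thread i read word i, so that neighboring threads of a chunk touch neighboring memory locations. The kernel and the host wrapper are hypothetical examples, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Successive threads read successive 32-bit words, so the reads of one
// chunk can be combined into few memory transactions.
__global__ void scaleCoalesced(const float *in, float *out, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // neighbor threads, neighbor words
    if (i < n) out[i] = s * in[i];
}

// Hypothetical host wrapper: copies the input to the GPU, scales it there,
// and returns the result.
std::vector<float> scaleOnGpu(const std::vector<float> &host, float s) {
    int n = (int)host.size();
    std::vector<float> result(n);
    if (n == 0) return result;

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc((void **)&dIn, n * sizeof(float));
    cudaMalloc((void **)&dOut, n * sizeof(float));
    cudaMemcpy(dIn, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int block = 256;
    scaleCoalesced<<<(n + block - 1) / block, block>>>(dIn, dOut, n, s);

    cudaMemcpy(result.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

A strided variant, in which thread i reads word i * stride, would perform the same arithmetic but spread the reads of a chunk over many separate locations, which is the pattern the slide advises against.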

Ray Tracing on the G80
- Map each ray to one thread
  - Enough threads to keep the GPU busy
- Recursive ray tracing
  - Use per-thread stack stored in on-board memory
  - Efficient, since enough computations are present
- But how to do the traversal?
  - Skip pointers (Thrane): no ordered traversal
  - Geometry images (Carr): single mesh only
  - Shared stack traversal

SIMD Packet Traversal of BVH
- Traverse a node with the whole packet
- At an internal node:
  - Intersect all rays with both children and determine traversal order
  - Push far child (if any) on a stack and descend to the near one with the packet
- At a leaf:
  - Intersect all rays with contained geometry
  - Pop next node to visit from the stack

PRAM Basics
- The PRAM model
  - Implicitly synchronized processors (threads)
  - Shared memory between all processors
- Basic PRAM operations
  - Parallel OR in O(1)
  - Parallel reduction in O(log N)

[Figure: parallel OR over per-thread true/false flags; parallel sum as a tree: 11 + 9 = 20, 12 + 32 = 44, 20 + 44 = 64]
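
The two basic operations can be sketched in CUDA with shared memory. This is a minimal sketch rather than the authors' code: it assumes one block of exactly 32 threads stands in for a chunk, and all function names are hypothetical. The OR relies on every participating thread writing the same value to one shared word; the sum halves the number of active threads in each step.

```cuda
#include <cuda_runtime.h>

#define CHUNK 32 // threads per chunk; the block must be launched with this size

// Parallel OR in O(1): every thread whose flag is set writes the same value
// to one shared word, so concurrent writes cannot conflict.
__device__ bool pramOr(bool flag, int *sharedWord) {
    if (threadIdx.x == 0) *sharedWord = 0;
    __syncthreads();
    if (flag) *sharedWord = 1;
    __syncthreads();
    bool result = (*sharedWord != 0);
    __syncthreads(); // the shared word may be reused after this point
    return result;
}

// Parallel sum in O(log N): a tree reduction over CHUNK values.
__device__ int pramSum(int value, int *scratch) {
    scratch[threadIdx.x] = value;
    __syncthreads();
    for (int stride = CHUNK / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) scratch[threadIdx.x] += scratch[threadIdx.x + stride];
        __syncthreads();
    }
    int result = scratch[0];
    __syncthreads(); // scratch may be reused after this point
    return result;
}

// Demo kernel: sums CHUNK values and checks whether any value exceeds 30.
__global__ void pramDemo(const int *values, int *sumOut, int *orOut) {
    __shared__ int word;
    __shared__ int scratch[CHUNK];
    int v = values[threadIdx.x];
    bool any = pramOr(v > 30, &word);
    int sum = pramSum(v, scratch);
    if (threadIdx.x == 0) { *sumOut = sum; *orOut = any ? 1 : 0; }
}

// Hypothetical host helper; hostValues must hold CHUNK integers.
void runPramDemo(const int *hostValues, int *sum, int *anyAbove30) {
    int *dValues, *dSum, *dOr;
    cudaMalloc((void **)&dValues, CHUNK * sizeof(int));
    cudaMalloc((void **)&dSum, sizeof(int));
    cudaMalloc((void **)&dOr, sizeof(int));
    cudaMemcpy(dValues, hostValues, CHUNK * sizeof(int), cudaMemcpyHostToDevice);
    pramDemo<<<1, CHUNK>>>(dValues, dSum, dOr);
    cudaMemcpy(sum, dSum, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(anyAbove30, dOr, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dValues); cudaFree(dSum); cudaFree(dOr);
}
```

With the four values from the figure (11, 9, 12, 32) and the remaining inputs set to zero, the reduction yields 64.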

PRAM Packet Traversal of BVH
- The G80: a PRAM machine on chunk level
- Map packet to chunk, ray to thread
- Threads behave as in the single ray traversal
  - At leaf: Intersect with geometry. Pop next node from stack
  - At node: Decide which children to visit and in what order. Push far child
- Difference: How rays choose which node to visit first
  - Might not be the one they want to visit

PRAM Packet Traversal of BVH
- Choose child traversal order
  - PRAM OR to determine if all rays agree on visiting the same node first
    - The result is stored in shared memory
  - In case of divergence: choose child with more ray candidates
    - Use PRAM SUM on +/- 1 for each thread, -1 left node
    - Look at result’s sign
  - Guarantees synchronous traversal of BVH
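
The voting scheme above can be sketched as follows. The sketch makes several assumptions that are not stated on the slides: a block of 32 threads stands in for a chunk, two shared flags are used for the OR step, a vote of 0 marks a ray with no preference, and ties go to the right child. All names are hypothetical.

```cuda
#include <cuda_runtime.h>

#define CHUNK 32 // rays (threads) per packet; the block must have this size

// vote: -1 = ray wants the left child first, +1 = right child first,
//        0 = no preference (for example, the ray misses both children).
// Returns the decision for the whole packet: -1 (left first) or +1 (right first).
__device__ int chooseFirstChild(int vote, int *orFlags, int *scratch) {
    // PRAM OR in O(1): rays raise a shared flag for the side they prefer.
    if (threadIdx.x < 2) orFlags[threadIdx.x] = 0;
    __syncthreads();
    if (vote < 0) orFlags[0] = 1;
    if (vote > 0) orFlags[1] = 1;
    __syncthreads();
    bool anyLeft = orFlags[0] != 0;
    bool anyRight = orFlags[1] != 0;
    __syncthreads(); // flags may be reused after this point

    // All rays agree: no reduction needed. Every thread sees the same flags,
    // so this branch is taken by the whole chunk or by none of it.
    if (!(anyLeft && anyRight)) return anyLeft ? -1 : +1;

    // Divergence: PRAM SUM in O(log N) over the votes, then look at the sign.
    scratch[threadIdx.x] = vote;
    __syncthreads();
    for (int stride = CHUNK / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) scratch[threadIdx.x] += scratch[threadIdx.x + stride];
        __syncthreads();
    }
    int sum = scratch[0];
    __syncthreads(); // scratch may be reused after this point
    return sum < 0 ? -1 : +1; // assumption: ties go to the right child
}

__global__ void voteDemo(const int *votes, int *decision) {
    __shared__ int orFlags[2];
    __shared__ int scratch[CHUNK];
    int d = chooseFirstChild(votes[threadIdx.x], orFlags, scratch);
    if (threadIdx.x == 0) *decision = d;
}

// Hypothetical host helper; hostVotes must hold CHUNK votes.
int decideOnGpu(const int *hostVotes) {
    int *dVotes, *dDecision, decision = 0;
    cudaMalloc((void **)&dVotes, CHUNK * sizeof(int));
    cudaMalloc((void **)&dDecision, sizeof(int));
    cudaMemcpy(dVotes, hostVotes, CHUNK * sizeof(int), cudaMemcpyHostToDevice);
    voteDemo<<<1, CHUNK>>>(dVotes, dDecision);
    cudaMemcpy(&decision, dDecision, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dVotes); cudaFree(dDecision);
    return decision;
}
```

Because every thread receives the same return value, all rays of the packet descend into the same child, which is what keeps the traversal synchronous.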

PRAM Packet Traversal of BVH
- Stack:
  - Near & far child: the same for all threads => store once
  - Keep stack in shared memory. Only few bits per thread!
  - Only Thread 0 does all stack ops.
- Reading data:
  - All threads work with the same node / triangle
  - Sequential threads bring in sequential words
  - Single load operation. Single round trip to memory
- Implementable in CUDA
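
The shared stack and the cooperative node load can be sketched as below. This is a simplified illustration, not the authors' kernel: it omits all ray intersection tests and always treats the left child as the near one, so it simply visits every node in depth-first order. The node layout (8 words, with the two child indices in words 0 and 1 and -1 marking a leaf), the stack depth, and all names are assumptions made for the example.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define CHUNK 32       // threads per packet; the block must have this size
#define NODE_WORDS 8   // hypothetical node size in 32-bit words
#define STACK_DEPTH 64 // assumed to exceed the depth of the tree

__global__ void traversePacket(const int *nodes, int *visitOrder, int *visitedCount) {
    __shared__ int node[NODE_WORDS];   // the node all threads work on
    __shared__ int stack[STACK_DEPTH]; // one stack for the whole packet
    __shared__ int sp, current;
    int count = 0; // only used by thread 0

    if (threadIdx.x == 0) { sp = 0; current = 0; }
    __syncthreads();

    while (current >= 0) {
        // Sequential threads bring in sequential words of the node.
        if (threadIdx.x < NODE_WORDS)
            node[threadIdx.x] = nodes[current * NODE_WORDS + threadIdx.x];
        __syncthreads();

        // (Ray/box or ray/triangle tests by all threads would go here.)

        // Only thread 0 performs the stack operations.
        if (threadIdx.x == 0) {
            visitOrder[count++] = current;
            int left = node[0], right = node[1];
            if (left < 0) {
                current = (sp > 0) ? stack[--sp] : -1; // leaf: pop or finish
            } else {
                stack[sp++] = right; // push far child
                current = left;      // descend to near child
            }
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) *visitedCount = count;
}

// Hypothetical host helper: returns the order in which nodes were visited.
std::vector<int> traverseOnGpu(const std::vector<int> &nodes) {
    int nodeCount = (int)(nodes.size() / NODE_WORDS);
    if (nodeCount == 0) return std::vector<int>();
    std::vector<int> order(nodeCount);
    int *dNodes, *dOrder, *dCount, count = 0;
    cudaMalloc((void **)&dNodes, nodes.size() * sizeof(int));
    cudaMalloc((void **)&dOrder, nodeCount * sizeof(int));
    cudaMalloc((void **)&dCount, sizeof(int));
    cudaMemcpy(dNodes, nodes.data(), nodes.size() * sizeof(int), cudaMemcpyHostToDevice);
    traversePacket<<<1, CHUNK>>>(dNodes, dOrder, dCount);
    cudaMemcpy(&count, dCount, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(order.data(), dOrder, count * sizeof(int), cudaMemcpyDeviceToHost);
    order.resize(count);
    cudaFree(dNodes); cudaFree(dOrder); cudaFree(dCount);
    return order;
}
```

The barriers after the load and after the stack update ensure that every thread sees the same current node, so the loop condition is uniform across the packet.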

Analysis
- Coherent branch decisions / memory access
- Small footprint of the data structure
  - Can trace models of up to 12 million triangles
- Program becomes compute bound
  - Determined by over/under-clocking the core/memory
- No frustums required
  - Good for secondary rays, bad for primary
  - Can use rasterization for primary rays
- Implicit SIMD: easy shader programming
- Running on a GPU: shading “for free”

Dynamic Scenes
- Update parts / whole BVH and geometry on GPU
- Use GPU for RT and CPU for BVH construction / refitting
- Construct BVH using binning
  - Similar to Wald RT07 / Popov RT06
  - Bin all 3 dimensions using SIMD
    - Results in > 10% better trees
    - Measured as SAH quality, not FPS
    - Speed loss is almost negligible
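
To make the idea of binning all three dimensions concrete, here is a scalar host-side sketch of one binned SAH split search. It is not the implementation from the cited papers and does not use SIMD: the bin count, the cost expression area(L) * count(L) + area(R) * count(R), and all type and function names are assumptions chosen for illustration.

```cuda
#include <algorithm>
#include <cfloat>
#include <vector>

struct Aabb {
    float lo[3], hi[3];
    Aabb() { for (int a = 0; a < 3; ++a) { lo[a] = FLT_MAX; hi[a] = -FLT_MAX; } }
    void grow(const Aabb &b) {
        for (int a = 0; a < 3; ++a) {
            lo[a] = std::min(lo[a], b.lo[a]);
            hi[a] = std::max(hi[a], b.hi[a]);
        }
    }
    float area() const {
        float dx = hi[0] - lo[0], dy = hi[1] - lo[1], dz = hi[2] - lo[2];
        if (dx < 0.0f) return 0.0f; // empty box
        return 2.0f * (dx * dy + dy * dz + dz * dx);
    }
};

struct Split { int axis; int bin; float cost; }; // axis == -1: no split found

const int BINS = 8; // assumed bin count

// Bins primitive boxes by centroid along all three axes and returns the
// candidate plane with the lowest cost area(L)*count(L) + area(R)*count(R).
Split findBestSplit(const std::vector<Aabb> &prims) {
    float cLo[3] = {FLT_MAX, FLT_MAX, FLT_MAX}, cHi[3] = {-FLT_MAX, -FLT_MAX, -FLT_MAX};
    for (const Aabb &p : prims)
        for (int a = 0; a < 3; ++a) {
            float c = 0.5f * (p.lo[a] + p.hi[a]);
            cLo[a] = std::min(cLo[a], c);
            cHi[a] = std::max(cHi[a], c);
        }

    Split best = {-1, -1, FLT_MAX};
    for (int axis = 0; axis < 3; ++axis) {
        float extent = cHi[axis] - cLo[axis];
        if (extent <= 0.0f) continue; // all centroids coincide on this axis

        Aabb binBox[BINS];
        int binCount[BINS] = {0};
        for (const Aabb &p : prims) {
            float c = 0.5f * (p.lo[axis] + p.hi[axis]);
            int b = std::min(BINS - 1, (int)(BINS * (c - cLo[axis]) / extent));
            binBox[b].grow(p);
            binCount[b]++;
        }

        // Sweep from the right to collect the boxes and counts right of each plane.
        Aabb rightBox[BINS], acc;
        int rightCount[BINS] = {0}, n = 0;
        for (int b = BINS - 1; b > 0; --b) {
            acc.grow(binBox[b]); n += binCount[b];
            rightBox[b] = acc; rightCount[b] = n;
        }
        // Sweep from the left and evaluate the plane between bin b-1 and bin b.
        Aabb left;
        int nl = 0;
        for (int b = 1; b < BINS; ++b) {
            left.grow(binBox[b - 1]); nl += binCount[b - 1];
            if (nl == 0 || rightCount[b] == 0) continue;
            float cost = left.area() * nl + rightBox[b].area() * rightCount[b];
            if (cost < best.cost) best = Split{axis, b, cost};
        }
    }
    return best;
}
```

Evaluating all three axes instead of only the longest one triples the binning work per node, which is the trade-off the slide describes as buying better trees at almost no speed loss.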

Conclusions
- New recursive PRAM BVH traversal algorithm
  - Very well suited for the new generation of GPUs
  - No additional pre-computed data required
- First GPU ray tracer to handle large models
  - Previous implementations were limited to < 300K
- Can handle dynamic scenes
  - By using the CPU to update the geometry / BVH

Future Work
- More features
  - Shaders, adaptive anti-aliasing, …
  - Global illumination
- Code optimizations
  - Current implementation uses too many registers

Thank you!

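CUDA Hello World

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

const int N = 4194304;                 // number of array elements
const size_t BYTES = N * sizeof(int);  // buffer size in bytes

// Each thread adds one element of arr2 to the matching element of arr1.
__global__ void addArrays(int *arr1, int *arr2) {
    unsigned t = threadIdx.x + blockIdx.x * blockDim.x;
    arr1[t] += arr2[t];
}

int main() {
    int *inArr1 = (int *)malloc(BYTES), *inArr2 = (int *)malloc(BYTES);
    int *ta1, *ta2;
    cudaMalloc((void **)&ta1, BYTES);
    cudaMalloc((void **)&ta2, BYTES);

    // Halve rand() so that the sum of two values cannot overflow an int.
    for (int i = 0; i < N; i++) { inArr1[i] = rand() / 2; inArr2[i] = rand() / 2; }

    cudaMemcpy(ta1, inArr1, BYTES, cudaMemcpyHostToDevice);
    cudaMemcpy(ta2, inArr2, BYTES, cudaMemcpyHostToDevice);

    // N is a multiple of 512, so the grid covers the arrays exactly.
    addArrays<<<dim3(N / 512, 1, 1), dim3(512, 1, 1)>>>(ta1, ta2);

    cudaMemcpy(inArr1, ta1, BYTES, cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) printf("%d ", inArr1[i]);

    cudaFree(ta1); cudaFree(ta2);
    free(inArr1); free(inArr2);
    return 0;
}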