Interactive Rendering With Coherent Ray Tracing Eurographics 2001 Wald, Slusallek, Benthin, Wagner...

Interactive Rendering With Coherent Ray Tracing Eurographics 2001 Wald, Slusallek, Benthin, Wagner Comp 238, UNC-CH, September 10, 2001 Joshua Stough

Interactive Rendering With Coherent Ray Tracing

Eurographics 2001

Wald, Slusallek, Benthin, Wagner

Comp 238, UNC-CH, September 10, 2001

Joshua Stough

The Gist

• The authors present “a highly optimized implementation of a ray tracer that improves performance by more than an order of magnitude compared to currently available ray tracers … makes better use of computational resources … and better exploits image and object space coherence.”

Organization

• Why Ray Tracing over Rasterization?

• An Optimized Ray Tracing Implementation

– Code structure, Caching, Coherence

– Intersections

– Volume Traversal (Memory Layout, Overhead)

• Performance of the Ray Tracing Engine

Why Ray Tracing Over Raster?

• Automatic Occlusion Culling

• Logarithmic Complexity in number of scene primitives

• Flexible sampling – allows for more effective use of time

• Efficient Shading – “avoids computation for invisible geometry”

• Shader Programming – direct use versus pipeline model

• More Correct Physically – and can use the same approximations

• “Trivially Parallel” – though initial resources required are higher

• Coherence

“Coherence is the key to efficiency.”

• Basic (Recursive Tree) Ray Tracer lacks concern for:

– Modern CPU design – pipeline execution

– Caching to hide low bandwidth and high latency on main memory

• Instead, “pay particular attention to:”

– Caching – efficient/aligned data structures, traversing mechanisms

– Pipelining

– Parallel execution possibilities

• “We show that even today the performance of a software ray tracer on a single PC can challenge dedicated rasterization hardware for complex environments.”

An Optimized Ray Tracing Implementation

• Reducing Code Complexity

• Optimizing cache usage

• Reducing memory bandwidth

• Prefetching Data

• And with SIMD/SSE:

– Ray intersections

– Scene traversal

– Shading

Code Complexity

• Few conditionals, Tight Inner loops

• Axis aligned BSP Tree – iterative algorithm possible

• Triangles only – reduces branches

• Shading less important – performed once per ray, versus 40-50 traversal steps and 5-10 intersection tests

Caching

• Performance bound by bandwidth, not CPU speed

– BSP traversal, low computation to bandwidth ratio

• Every fetch brings in an entire cache line

• Carefully lay out data

– Data together only if used together (geometry vs. shading)

– Separate read-only (preprocessing) data from read-write (mailboxes)

• Hide latency with prefetching

Ray-Triangle Intersection

Compute distance to plane (defined by triangle) along ray

If distance is within current interval for testing (via BSP)

Compute hit point

Project into an axis-aligned plane (the one in which the triangle's projected area is largest, i.e. drop the axis where the normal has its largest component)

Compute barycentric coordinates of the hit point in 2D

Data alignment – two 2D edge equations, plane equation for distance, tag for projection axis = 9 floats + tag. Padded to 48 bytes (memory tradeoff).
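The steps on this slide can be sketched in a few lines of scalar Python. This is only an illustration under stated assumptions, not the authors' SSE code: the names `TriAccel`, `precompute` and `intersect` are hypothetical, the record simply mirrors the "9 floats + tag" layout described above, and the triangle is assumed to be non-degenerate.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class TriAccel(NamedTuple):
    """Hypothetical precomputed record: projection-axis tag plus 9 floats."""
    k: int        # axis dropped by the projection (largest normal component)
    n_u: float    # plane equation, scaled so the k-th normal component is 1
    n_v: float
    n_d: float
    b_u: float    # 2D edge equation giving barycentric coordinate beta
    b_v: float
    b_d: float
    c_u: float    # 2D edge equation giving barycentric coordinate gamma
    c_v: float
    c_d: float


def precompute(a: Sequence[float], b: Sequence[float], c: Sequence[float]) -> TriAccel:
    """Preprocess one triangle (assumed non-degenerate) into a TriAccel."""
    e1 = [b[i] - a[i] for i in range(3)]
    e2 = [c[i] - a[i] for i in range(3)]
    n = [e1[1] * e2[2] - e1[2] * e2[1],
         e1[2] * e2[0] - e1[0] * e2[2],
         e1[0] * e2[1] - e1[1] * e2[0]]
    k = max(range(3), key=lambda i: abs(n[i]))
    u, v = (k + 1) % 3, (k + 2) % 3
    det = n[k]  # equals e1[u]*e2[v] - e1[v]*e2[u], the 2D determinant
    n_u, n_v = n[u] / det, n[v] / det
    n_d = a[k] + n_u * a[u] + n_v * a[v]
    b_u, b_v = e2[v] / det, -e2[u] / det
    c_u, c_v = -e1[v] / det, e1[u] / det
    return TriAccel(k, n_u, n_v, n_d,
                    b_u, b_v, -(b_u * a[u] + b_v * a[v]),
                    c_u, c_v, -(c_u * a[u] + c_v * a[v]))


def intersect(tri: TriAccel, org: Sequence[float], direction: Sequence[float],
              t_near: float, t_far: float) -> Optional[Tuple[float, float, float]]:
    """Return (t, beta, gamma) for a hit inside [t_near, t_far], else None."""
    k = tri.k
    u, v = (k + 1) % 3, (k + 2) % 3
    denom = direction[k] + tri.n_u * direction[u] + tri.n_v * direction[v]
    if denom == 0.0:          # ray parallel to the triangle's plane
        return None
    # Step 1: distance to the plane along the ray.
    t = (tri.n_d - org[k] - tri.n_u * org[u] - tri.n_v * org[v]) / denom
    # Step 2: reject early if outside the current test interval.
    if not (t_near <= t <= t_far):
        return None
    # Steps 3-4: hit point, already projected onto the (u, v) plane.
    hu = org[u] + t * direction[u]
    hv = org[v] + t * direction[v]
    # Step 5: barycentric coordinates from the two 2D edge equations.
    beta = tri.b_u * hu + tri.b_v * hv + tri.b_d
    if beta < 0.0:
        return None
    gamma = tri.c_u * hu + tri.c_v * hv + tri.c_d
    if gamma < 0.0 or beta + gamma > 1.0:
        return None
    return t, beta, gamma


if __name__ == "__main__":
    tri = precompute((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
    print(intersect(tri, (0.25, 0.25, 1.0), (0.0, 0.0, -1.0), 0.0, 10.0))
```

The early exits are ordered from cheapest to most expensive, so most rays that miss are rejected after the distance test alone.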

CPU Cost of Ray-Triangle Test

(CPU cycles)   Bary. C Code   Plücker SSE   Bary. SSE   Speed-Up
Min            78             77            22          3.5
Max            148            123           41          3.7

-41 cycles ~ 20M ray-triangle intersections/sec

-SSE requires bundling four rays at a time.
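The throughput figure follows from dividing the clock rate by the cycle count. The slide does not state the clock rate; the 800 MHz below is an assumption (a Pentium III of that era), so treat this as a plausibility check rather than a measurement.

```python
# Assumption: an 800 MHz CPU. The slide gives only the cycle count.
CLOCK_HZ = 800e6
CYCLES_PER_TEST = 41  # worst-case cost of the SSE barycentric test, from the table

tests_per_second = CLOCK_HZ / CYCLES_PER_TEST
print(f"{tests_per_second / 1e6:.1f} million ray-triangle tests per second")
```

Under that assumption the result is a little under 20 million tests per second, in line with the slide's rounded figure.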

The Bundling of Four Rays at Once

• Better than four Triangles/One Ray

• Requires new Traversal algorithm

• Potential Overhead

• Primary rays versus shadow rays

BSP Traversal

• Before, 2x-3x more time spent than on intersections

• Axis Aligned BSP Tree

– Only 2 binary decisions – efficient in parallel

– Any ray traverses a child node => All four traverse in parallel

• Algorithm

– Maintain current ray segment [near, far]

– Calculate distance to splitting plane

– Three cases

– Update segments and traverse children if necessary
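The traversal loop above can be sketched as follows for a small packet of rays. This is a scalar illustration, not the authors' SSE implementation. The `Leaf` and `Inner` classes and the function `traverse_packet` are hypothetical names, and the sketch assumes that all rays in the packet share the same direction signs, that no direction component along a split axis is zero, and that traversal does not stop early when a hit is found.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple, Union


@dataclass
class Leaf:
    prims: List[str]          # primitive names, for illustration only


@dataclass
class Inner:
    axis: int                 # 0, 1 or 2
    split: float              # position of the axis-aligned splitting plane
    left: "Node"              # child on the low side of the plane
    right: "Node"             # child on the high side of the plane


Node = Union[Leaf, Inner]


def traverse_packet(root: Node,
                    orgs: Sequence[Sequence[float]],
                    dirs: Sequence[Sequence[float]],
                    near: Sequence[float],
                    far: Sequence[float]) -> List[Tuple[List[str], List[bool]]]:
    """Visit leaves front to back; return (leaf prims, per-ray active mask)."""
    n = len(orgs)
    visited = []
    stack = [(root, list(near), list(far))]
    while stack:
        node, near, far = stack.pop()
        while isinstance(node, Inner):
            a = node.axis
            # Assumption: every ray has the same direction sign on this axis,
            # so one ray decides the front-to-back child order for the packet.
            if dirs[0][a] > 0:
                first, second = node.left, node.right
            else:
                first, second = node.right, node.left
            # Distance to the splitting plane, one value per ray.
            d = [(node.split - orgs[i][a]) / dirs[i][a] for i in range(n)]
            active = [i for i in range(n) if near[i] <= far[i]]
            if all(d[i] >= far[i] for i in active):
                node = first                      # case 1: front child only
            elif all(d[i] <= near[i] for i in active):
                node = second                     # case 2: back child only
            else:
                # Case 3: some ray needs both children, so the whole packet
                # visits both; each ray's segment is clipped at the plane.
                stack.append((second, [max(near[i], d[i]) for i in range(n)], far))
                far = [min(far[i], d[i]) for i in range(n)]
                node = first
        # A ray whose clipped segment is empty is masked out in this leaf.
        visited.append((node.prims, [near[i] <= far[i] for i in range(n)]))
    return visited


if __name__ == "__main__":
    tree = Inner(0, 0.5, Leaf(["L"]),
                 Inner(1, 0.5, Leaf(["RB"]), Leaf(["RT"])))
    orgs = [(0.0, 0.1, 0.0), (0.0, 0.2, 0.0), (0.0, 0.2, 0.0), (0.0, 0.45, 0.0)]
    dirs = [(1.0, 0.1, 1.0), (1.0, 0.2, 1.0), (1.0, 0.4, 1.0), (1.0, 0.4, 1.0)]
    for prims, mask in traverse_packet(tree, orgs, dirs, [0.0] * 4, [1.0] * 4):
        print(prims, mask)
```

The per-ray mask is where the potential overhead shows up: a ray that does not need a child is still carried through it with an empty segment, which is cheap for coherent packets and wasteful for incoherent ones.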

BSP Tree Memory Layout

• Caching and Prefetching in mind

• One child-node pointer, node type flag, split coordinate = 8 bytes/node = 4 nodes/cache line.

– Aligned children

– Memory bandwidth reduced by 4x.

• Possible Overhead

– Incoherent rays = high overhead

– Worst case = no worse than normal
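To make the 8-byte node concrete, here is a sketch using Python's `struct` module. The bit assignment is an assumption chosen for illustration: the low two bits of a 32-bit word hold the node type (split axis 0 to 2, or 3 for a leaf) and the remaining bits hold the index of the first child, with the second child stored directly after it. The 32-byte cache line is inferred from the slide's "4 nodes/cache line".

```python
import struct

# Hypothetical encoding: one 32-bit tagged word plus one 32-bit float.
NODE = struct.Struct("<If")
CACHE_LINE_BYTES = 32   # assumption, implied by "8 bytes/node = 4 nodes/cache line"
LEAF = 3                # node type values 0, 1, 2 are the split axes x, y, z


def pack_node(kind: int, first_child: int, split: float = 0.0) -> bytes:
    """Pack node type and first-child index into one word, then the split."""
    return NODE.pack((first_child << 2) | kind, split)


def unpack_node(buf: bytes, index: int):
    """Return (kind, first_child, split) for the node at the given index."""
    word, split = NODE.unpack_from(buf, index * NODE.size)
    return word & 3, word >> 2, split


if __name__ == "__main__":
    # Root splits on x at 0.5; its two children sit side by side at indices 1 and 2.
    nodes = b"".join([
        pack_node(0, 1, 0.5),
        pack_node(LEAF, 0),
        pack_node(LEAF, 0),
    ])
    print(NODE.size, CACHE_LINE_BYTES // NODE.size)
    print(unpack_node(nodes, 0))
```

Storing both children next to each other is what lets a single pointer suffice, and it means that fetching one child usually brings its sibling into the same cache line.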

Performance of the Ray Tracer

Considerations

• 11-15x Performance Increase!

• RTRT on 256MB RAM, others on 1GB!

BUT

• Difference in features

• Others not limited to triangles

• Others did not target performance

Comparison With Raster Hardware