Stream Compaction for Deferred Shading · 2012. 10. 22. · Our Contribution •Shading requests...

Stream Compaction for Deferred Shading

Jared Hoberock*

Victor Lu

Yuntao Jia

John C. Hart

University of Illinois

(*now at NVIDIA)

Our Contribution

• Shading requests from rasterization are spatially coherent

• Less so when shading is deferred until after rasterization

• Shading requests from ray tracers are spatially incoherent

• Neighboring processes need to run completely different shaders

• Shading requests can be deferred and batch processed

• SIMD processing of incoherent shading batches suffers from control flow divergence

• Is it worth clustering shading requests into coherent batches to avoid SIMD divergence?

Previous Work

• Memory Coherence for Out-of-Core Processing

[Pharr et al. 1997]

• Encouraged memory coherence within intersection jobs, whereas we encourage instruction coherence within shading jobs

• Ray-Hierarchy Traversal

• Månsson et al. [2007] measured divergence

• Wald et al. [2007] simulated compaction to avoid divergence

• Dynamic Warp Formation [Fung et al. 2007]

• Local reordering in hardware vs. global reordering in software

• Load Balancing [Aila & Laine 2009]

• Ray tracing is a scheduling problem

Data Parallel Architectures

• MIMD “cores”

• Each core has its own instruction counter

• Cell: 8, GT200: 30, LRB: 32

• SIMD vector processors

• Lanes share the same instruction counter

• Cell: 4, GT200: 8, LRB: 16

• Programmer may see even wider degree of SIMD parallelism

• NVIDIA’s 32-wide “warps”

SIMD Divergence: Conceptual

Test X: TFFT TFTF

Execute A or B: ABBA ABAB

SIMD Divergence: Actual

Mask on X: TFFT TFTF

Execute Both A and B: AAAA AAAA, BBBB BBBB

ABBA ABAB

SIMD Divergence: Measurement

Mask on X: TFFT TFTF

Execute Both A and B: AAAA AAAA, BBBB BBBB

ABBA ABAB

$$\text{Efficiency} = \frac{\text{Useful Work}}{\text{Total Effort}} = \frac{\#A\,|A| + \#B\,|B|}{(\#A + \#B)(|A| + |B|)}$$

min( ,1)
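The efficiency measure can be tried out numerically. The following is a minimal sketch, assuming #A and #B are the numbers of lanes that want shader A and shader B, and |A| and |B| are their execution costs. The function name and the cost values are hypothetical, chosen only for illustration.

```python
def simd_efficiency(num_a, cost_a, num_b, cost_b):
    """Useful work divided by total effort for one divergent SIMD vector.

    Every lane sits through both A and B, so total effort is
    (num_a + num_b) * (cost_a + cost_b); only the lanes that actually
    wanted a shader contribute useful work for it.
    """
    useful = num_a * cost_a + num_b * cost_b
    total = (num_a + num_b) * (cost_a + cost_b)
    return useful / total


# Mask TFFT TFTF: four lanes want A, four want B, equal (assumed) cost.
print(simd_efficiency(4, 1.0, 4, 1.0))    # 0.5

# Hypothetical case: one lane needs a shader ten times as expensive.
print(simd_efficiency(7, 1.0, 1, 10.0))   # 17/88, roughly 0.19
```

Under these assumptions a single lane that needs an expensive shader drags the whole vector down, which is the situation the procedural-shader results later in the talk exercise.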

Shading Efficiency in a Path Tracer

1st Hit: 98% Efficient

2nd Hit: 56% Efficient

3rd Hit: 52% Efficient

4th Hit: 54% Efficient

Recovering Coherence

Test X: TFFT TFTF

Compact: TTTT FFFF

Evaluate A or B: AAAA BBBB

Scatter result: ABBA ABAB

Recovering Coherence

$$\text{Efficiency} = \frac{(A \mid B)}{(A \mid B) + \text{compact}}$$
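The test, compact, evaluate, scatter sequence can be sketched sequentially. This is an illustrative sketch only, not the CUDA implementation from the talk: the function name is hypothetical, and the two shaders are stand-ins that simply return the letters A and B.

```python
def shade_coherently(tests, shade_a, shade_b):
    """Test -> compact -> evaluate -> scatter for a two-way branch."""
    # Compact: stable partition of lane indices, T lanes first, then F lanes.
    order = [i for i, t in enumerate(tests) if t]
    order += [i for i, t in enumerate(tests) if not t]

    # Evaluate: neighbouring slots now run the same shader.
    compacted = [shade_a(i) if tests[i] else shade_b(i) for i in order]

    # Scatter: write each result back to the lane that requested it.
    out = [None] * len(tests)
    for slot, lane in enumerate(order):
        out[lane] = compacted[slot]
    return compacted, out


tests = [c == "T" for c in "TFFTTFTF"]
compacted, out = shade_coherently(tests, lambda i: "A", lambda i: "B")
print("".join(compacted))  # AAAABBBB
print("".join(out))        # ABBAABAB
```

The compaction cost appears only in the denominator of the efficiency expression, so the reordering pays off when evaluation time saved exceeds the time spent partitioning and scattering.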

Stream Compaction

• Compacting disorganized input

1. Select orange token

2. Prefix Sum

3. Scatter

0001 1000 1000 0010

1110 1111 2222 4433

Scan: M*O(N) [Sengupta et al. 2007]

Radix Sort: log(M)*O(N) [Satish et al. 2009]
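The select, prefix sum, scatter steps can be written out as a short sequential sketch. It assumes an exclusive prefix sum, so each selected element's scan value is its output address. It uses the selection flags 0001 1000 1000 0010, and the element values (0 to 15) are placeholders. A GPU version would run the scan and the scatter in parallel, which this sketch does not attempt.

```python
from itertools import accumulate


def compact(items, flags):
    """Scan-based stream compaction: keep items whose flag is 1."""
    inclusive = list(accumulate(flags))
    # Exclusive prefix sum: number of selected items before position i.
    exclusive = [0] + inclusive[:-1]
    out = [None] * (inclusive[-1] if inclusive else 0)
    # Scatter: each selected item goes to the address given by the scan.
    for item, flag, dst in zip(items, flags, exclusive):
        if flag:
            out[dst] = item
    return exclusive, out


flags = [int(c) for c in "0001100010000010"]
exclusive, out = compact(list(range(16)), flags)
print(exclusive)  # [0, 0, 0, 0, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4]
print(out)        # [3, 4, 8, 14]
```

Compacting with one such pass per shader gives the M*O(N) cost quoted for scan, taking M as the number of shaders and N as the number of jobs.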

Shader Scheduling

• Implicit Serialization

• (Big Switch)

• Let hardware schedule

• Explicit Serialization

• Run only jobs w/same shader at a time

• Compact + Imp/Exp Serialization

• Radix Sort + Imp/Exp Serialization

• Local Bitonic Sort + Imp Ser.

• Local to a CUDA thread block

• Global loads coalesce

Implicit serialization (big switch):

forall j in jobs in SIMD do
    switch j do
        case s1:
            execute(s1)
        ...
        case sM:
            execute(sM)

Explicit serialization:

forall s in shaders do
    mask = select(s,jobs)
    forall (m,j) in (mask,jobs) in SIMD do
        if m then execute(s)

Implemented in CUDA on G80-class hardware
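To see why ordering the jobs matters under implicit serialization, a rough cost model can help. This sketch assumes each SIMD group executes once for every distinct shader present among its jobs and that all shaders cost the same. The SIMD width of 4, the shader IDs, and the function name are hypothetical, and Python's `sorted` stands in for the GPU radix sort.

```python
def simd_passes(shader_ids, width):
    """Count shader executions when each SIMD group of `width` jobs
    must run once per distinct shader present in that group."""
    return sum(
        len(set(shader_ids[i:i + width]))
        for i in range(0, len(shader_ids), width)
    )


# Twelve jobs requesting three shaders in interleaved (incoherent) order.
jobs = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]

print(simd_passes(jobs, 4))          # 9: every group contains all three shaders
print(simd_passes(sorted(jobs), 4))  # 3: every group contains a single shader
```

The model ignores the cost of the sort itself, which is the overhead the results weigh against the saved shading time.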

Results

3 simple shaders

19% slower on GX2

5% slower on GTX+

unscheduled scheduled

Results

6 simple shaders

14% faster on GX2

38% faster on GTX+

Results

2 simple & 2 proc. shaders

2.4x faster on GX2

2.7x faster on GTX+

Results

3 simple & 2 proc. shaders

3.2x faster on GX2

3.5x faster on GTX+

Results

11 moderate shaders

23% faster on GX2

51% faster on GTX+

Scaling

                   9800GX2                    9800GTX+
“Cores”            2x16 (we used 16)          16
Processor Clock    1.50 GHz                   1.84 GHz (23%)
Memory Clock       1 GHz                      1.1 GHz (10%)
Bandwidth          64 GB/s                    70.4 GB/s (10%)
Bus Width          2x256 bit (we used 256)    256 bit

• Difference between processor clock scaling and memory bandwidth scaling enhances benefits of shader compaction

• Compaction further leverages increased processor speed

Analysis

• Coherent shading time is always smaller

• But cost of overhead not always worth it for simple cases

• Shader Complexity

• Simple – No improvement, but little penalty

• Procedural – Large improvements

• Implicit versus Explicit Serialization

• With compaction, explicit almost always wins

• Large penalties for explicit with unordered input

• Local compaction was never successful

• Too much local data movement

• Limited working set size

Conclusions

• Global stream compaction is almost always a win

• Surprising positive results for our toy scenes

• Production renderers will require stream compaction to be tractable in large scenes of arbitrary shading complexity

Future work

• Data sensitive scheduling to avoid memory divergence

• Hybrid shader batch approaches

• Scheduling in both space and time

Acknowledgments

• Thanks to Shubho Sengupta & Mark Harris for making their fast CUDA compaction primitives available in CUDPP, and Nathan Bell for GPU radix sort

• This work was funded by Intel & Microsoft as part of the Illinois Universal Parallel Computing Research Center
