Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction...

28
Introduction Method Data management Occlusion computation Results Height field ambient occlusion using CUDA 3.6.2009

Transcript of Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction...

Page 1: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

Height field ambient occlusion using CUDA

3.6.2009

Page 2: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

Outline

1 Introduction

2 MethodIdeaImplementation

3 Data managementNaive solutionsPerformance improvementsPreblend kernel

4 Occlusion computationTheoryKernel

5 Results

Page 3: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

Height fields

Page 4: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

Self occlusion

Page 5: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

Current methods

Marching severaldirections from eachfragment

Sampling several timesalong a direction

Optimizations based onthis approach

Page 6: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

IdeaImplementation

Outline

1 Introduction

2 MethodIdeaImplementation

3 Data managementNaive solutionsPerformance improvementsPreblend kernel

4 Occlusion computationTheoryKernel

5 Results

Page 7: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

IdeaImplementation

Sweeps

We process several (e.g.64) directions (sweeps)

For a height field of n2

(e.g. 1024x1024) a sweepconsists of n · · ·

√2n lines

One direction = infinitelyhigh but thin light source

Page 8: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

IdeaImplementation

Spatial coherence

Each line steps one texelat a time

Keeps a record of previousoccluders

Line-wise coherence

Convex hull subset ofoccluders is enough

3% of occluders requiredin practice

Output: occlusionvector/value on each step

Results for each directionblended together

Page 9: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

IdeaImplementation

Implications

Previously bandwidth limited

New method samples significantly less

Bound to discrete directions, but ideally takes each pixel onthem into account

Does not map to shaders well ⇒ GPGPU

Page 10: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

IdeaImplementation

Outline

1 Introduction

2 MethodIdeaImplementation

3 Data managementNaive solutionsPerformance improvementsPreblend kernel

4 Occlusion computationTheoryKernel

5 Results

Page 11: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

IdeaImplementation

Environment

OpenGL 3.0 (Aug 2008)

CUDA 2.2 (May 2009)

Linux (primary), Windows

Software (CPU) and hardware (GPU) implementations

Page 12: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

IdeaImplementation

Tesla architecture

Global (device) memoryon-board, various accessmodes

Shared memory on-chip,for each MP

64kB, 16 banks,interleaved 32b

30 MPs on GTX280, 8ALUs on a MP

Page 13: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

Naive solutionsPerformance improvementsPreblend kernel

Outline

1 Introduction

2 MethodIdeaImplementation

3 Data managementNaive solutionsPerformance improvementsPreblend kernel

4 Occlusion computationTheoryKernel

5 Results

Page 14: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

Naive solutionsPerformance improvementsPreblend kernel

OpenGL interoperability

Page 15: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

Naive solutionsPerformance improvementsPreblend kernel

OpenGL interoperability

Too much data copying

Not enough threads

We can transfer more sweeps at once to remedy the latter

Page 16: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

Naive solutionsPerformance improvementsPreblend kernel

Blending in CUDA

Write into multiple dests as before

Blend using CUDA

⇒ still memcpies, sampling slower in CUDA than inOpenGL

Page 17: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

Naive solutionsPerformance improvementsPreblend kernel

Outline

1 Introduction

2 MethodIdeaImplementation

3 Data managementNaive solutionsPerformance improvementsPreblend kernel

4 Occlusion computationTheoryKernel

5 Results

Page 18: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

Naive solutionsPerformance improvementsPreblend kernel

Sample src PBO with CUDA

Reduces data passed between OpenGL and CUDA

Need to copy src PBO to CudaArray

CUDA provides 2D bilinear filtering (read-only)

Page 19: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

Naive solutionsPerformance improvementsPreblend kernel

Preblending in CUDA

Cannot write directly,atomic adds are slow

180◦ rotation is trivial

90◦/270◦ needs sharedmem tricks for memcoalescing

Page 20: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

Naive solutionsPerformance improvementsPreblend kernel

Packing texels

CUDA memory coalescingonly works forthread-consecutive 32belements

But we have only singleluminance value for a texel

This calls for bit operations

Heaviest on preblendkernels, which luckily havelow arithmetic density

Page 21: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

Naive solutionsPerformance improvementsPreblend kernel

Outline

1 Introduction

2 MethodIdeaImplementation

3 Data managementNaive solutionsPerformance improvementsPreblend kernel

4 Occlusion computationTheoryKernel

5 Results

Page 22: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

TheoryKernel

Outline

1 Introduction

2 MethodIdeaImplementation

3 Data managementNaive solutionsPerformance improvementsPreblend kernel

4 Occlusion computationTheoryKernel

5 Results

Page 23: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

TheoryKernel

Function of the kernel

Processes one line, reads height values as input

Writes aligned packed values

Keeps a representation of the convex hull

Outputs an occlusion vector

Coalescing vs. length uniformity

Thread block size and shared memory resources

Page 24: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

TheoryKernel

Occluder processing

Keep a vector of occludersin shared mem

Search it backwards(linear/binary)

Use heuristic comparisonoperator

Insert new occluder

Remove remaining(change length indicator)

Return last − current

Page 25: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

TheoryKernel

Outline

1 Introduction

2 MethodIdeaImplementation

3 Data managementNaive solutionsPerformance improvementsPreblend kernel

4 Occlusion computationTheoryKernel

5 Results

Page 26: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

Performance figures

Profilers are immature

But we know that we get 50-80 GB/s device memory rates

And 200-400 Gops/s

Page 27: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

Comparison

M$ published previous state-of-the-art method inEurographics 2008

It scales badly, 1024x1024 @ 2.5fps, 32 directions

Our method achieves 36-42fps, 64 directions

Is texel-precise on sharp edges

Page 28: Height field ambient occlusion using CUDAwili.cc/research/presentations/utu2009.pdf · Introduction Method Data management Occlusion computation Results Theory Kernel Function of

IntroductionMethod

Data managementOcclusion computation

Results

Screen captures