Enhancing and Optimizing the Render Cache
Bruce Walter, Cornell Program of Computer Graphics
George Drettakis, REVES/INRIA Sophia-Antipolis
Donald P. Greenberg, Cornell Program of Computer Graphics
Background
Render Cache
• “Interactive Rendering using the Render Cache”, Rendering Workshop 1999
• Goal
- Interactive Rendering
- Exploit frame-to-frame coherence
- Decouple renderer from display framerate
- Reuse “expensive” rendering results
Background
Goal: Interactive rendering
[Figure panels: Ray tracing, Path tracing]
Background
Modified Visual Feedback Loop
[Diagram labels: user, application, renderer, image, display, asynchronous interface]
Background
Reproject rendered points
[Figure panels: Original view, New view]
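Reprojecting a cached point into a new view can be sketched as below. This is a minimal illustration, assuming a simple pinhole camera; the `Camera` class, its fields, and the `project` function are hypothetical names chosen for this example and are not taken from the Render Cache code.

```python
import math
from dataclasses import dataclass

@dataclass
class Camera:
    # Hypothetical pinhole camera: origin plus an orthonormal basis.
    position: tuple
    right: tuple
    up: tuple
    forward: tuple
    focal: float   # focal length in pixels (assumed parameterisation)
    width: int
    height: int

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def project(cam, point):
    """Return (px, py, depth) for a world-space point, or None if it is
    behind the camera or falls outside the image."""
    d = tuple(p - c for p, c in zip(point, cam.position))
    depth = dot(d, cam.forward)
    if depth <= 0:
        return None  # behind the camera
    px = math.floor(cam.focal * dot(d, cam.right) / depth + cam.width / 2)
    py = math.floor(cam.focal * dot(d, cam.up) / depth + cam.height / 2)
    if 0 <= px < cam.width and 0 <= py < cam.height:
        return px, py, depth
    return None
```

The same cached 3D point is simply projected again with whatever camera the current frame uses, so no new rendering is needed for points that stay visible.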
Background
[Diagram: display process. Stage labels: Project/Z-Buffer, Depth Cull, Interpolate, Sampling, Update Points. Other labels: renderer (x2), image]
Background
Results after each stage
[Figure panels: Projection, Depth cull, Interpolation]
Background
Sampling
[Figure panels: Displayed image, Priority image, Requested pixels]
Related Work
Faster ray engines
• Optimize and parallelize
- E.g., Wald et al.
Hardware-based display
• Mesh-based
- E.g., Tapestry, Holodeck, Tole et al.
• Texture-based
- E.g., Corrective textures
Motivation
Render Cache works well
• Can enable interactive use of higher quality ray-based renderers
… but needs improvement
• Images too small (256x256)
• Gaps often visible during camera motion
• Not fast enough in tracking shading changes
Enhancements
Tiled Z-Buffer
• Better scalability and memory coherence
Larger Interpolation Prefilter
• Can fill larger gaps between points
Predictive Sampling
• Improved quality during camera motion
Point Eviction
• Faster update of shading changes
Enhancements
Code Optimization
• Use of SIMD (MMX/SSE/SSE2)
• Data layout, branch conversions, etc.
Publicly Available
• For evaluation, comparison, or use
- Non-commercial binary release
- URL is in the paper
Memory Coherence
Change from R10K to Pentium 4
• Cache reduced from 4MB to 256KB
• Clock increased from 195MHz to 1.7GHz
- Cache misses much more expensive
Change from 256x256 to 512x512
• Point data ~ 5MB, Image data ~ 3MB
- Much bigger than cache
Projection and Z-Buffer problematic
Projection and Z-Buffer
[Diagram labels: Point Cloud 5MB, Image 3MB]
Random order memory access
- Read/modify/write operation is memory latency limited
Tiled Projection and Z-Buffer
[Diagram labels: Point Cloud 5MB, Tile Buckets 4MB, Image 3MB]
Divide image into tiles
- Tiles sized to fit in cache
Project and bucket sort by tile
Z-Buffer each tile separately
Uses more memory and instructions
- But it is faster (25ms instead of 42ms)
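The sequence of projecting, bucket sorting by tile, and then Z-buffering each tile separately can be sketched as below. This is an illustration only: the function name `tiled_zbuffer`, the tile size of 64, and the tuple layout of the input are assumptions, and the points are taken as already projected to pixel coordinates so that the two-pass structure stays visible.

```python
def tiled_zbuffer(points, width, height, tile=64):
    """points: iterable of (px, py, depth, point_id), already projected.
    Returns (depth_image, id_image) as row-major lists."""
    tiles_x = (width + tile - 1) // tile
    tiles_y = (height + tile - 1) // tile
    buckets = [[] for _ in range(tiles_x * tiles_y)]

    # Pass 1: bucket sort the projected points by the tile they land in.
    # Appending to a bucket is a sequential write, unlike a random
    # read/modify/write into the full image.
    for px, py, depth, pid in points:
        if 0 <= px < width and 0 <= py < height:
            b = (py // tile) * tiles_x + (px // tile)
            buckets[b].append((px, py, depth, pid))

    depth_img = [float("inf")] * (width * height)
    id_img = [-1] * (width * height)

    # Pass 2: Z-buffer one tile at a time, so all depth tests for a
    # bucket touch only the small image region covered by that tile.
    for bucket in buckets:
        for px, py, depth, pid in bucket:
            i = py * width + px
            if depth < depth_img[i]:
                depth_img[i] = depth
                id_img[i] = pid
    return depth_img, id_img
```

Python itself gains nothing from this layout; the point is the access pattern, which on cache-based hardware trades extra memory and instructions for fewer cache misses.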
Interpolation Filters
Larger filters
• Fill larger gaps in point data
• Generally more expensive
• Result in more blurring of the image
The previous Render Cache
• Used a 3x3 weighted filter
- Can only fill very small gaps
- Introduces only a small amount of blurring
Prefilter
Add a larger “backup” filter
• Results used only when 3x3 filter fails
• Uses a uniform 7x7 filter
- Can be computed cheaply
• Can fill in much larger gaps
• Does not affect sampling priorities
• Actually executed first then overwritten
- Hence the name “prefilter”
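The "executed first, then overwritten" ordering can be sketched as below on a single-channel image where `None` marks a pixel with no point. Several details are assumptions made for illustration: the 3x3 weights, the rule that the 3x3 filter "fails" when its window contains no valid pixel, and the function names. The box filter here is written naively for clarity rather than computed cheaply.

```python
def box_filter(img, w, h, radius):
    """Uniform average of the valid (non-None) pixels in a
    (2*radius+1)^2 window; None where the window is empty."""
    out = [None] * (w * h)
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    v = img[yy * w + xx]
                    if v is not None:
                        total += v
                        count += 1
            if count:
                out[y * w + x] = total / count
    return out

# Assumed 3x3 weights; the slides only say the filter is weighted.
WEIGHTS_3X3 = ((1, 2, 1), (2, 4, 2), (1, 2, 1))

def interpolate(img, w, h):
    out = box_filter(img, w, h, 3)  # 7x7 prefilter runs first
    for y in range(h):
        for x in range(w):
            total, wsum = 0.0, 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        v = img[yy * w + xx]
                        if v is not None:
                            k = WEIGHTS_3X3[dy + 1][dx + 1]
                            total += k * v
                            wsum += k
            if wsum:  # 3x3 filter succeeded: overwrite the prefilter result
                out[y * w + x] = total / wsum
    return out
```

Wherever the 3x3 filter has data its sharper result wins, so the extra blur of the 7x7 filter appears only in gaps the small filter could not fill.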
Prefilter
[Figure panels: 3x3 filter only, 7x7 prefilter only, Both filters]
Predictive Sampling
Sampling is purely reactive
• Helps to guide sparse sampling
• Samples returned in later frame
- Problem when large new regions become visible
Predict large gaps ahead of time
• Project using a predicted camera
• Request samples before they are needed
Predictive Sampling
Projection is expensive
• 47% of original render cache cost
Use simplified projection
• No Z-Buffer
- Only need to find regions with no points
• Reduced resolution
- 1/4 width and height (1/16 # of pixels)
• Store only 1 byte per pixel
- Occupancy image fits easily in cache
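A reduced-resolution occupancy image of this kind might look like the sketch below. The names `occupancy_image` and `empty_cells` are hypothetical, and the input is assumed to be pixel coordinates already projected with the predicted camera. At 512x512 with a scale of 4, the occupancy image is 128x128 bytes, or 16 KB.

```python
def occupancy_image(points, width, height, scale=4):
    """points: iterable of (px, py) pixel coordinates obtained by
    projecting with the predicted camera. No depth test is done;
    a coarse cell is marked as soon as any point lands in it."""
    ow = (width + scale - 1) // scale
    oh = (height + scale - 1) // scale
    occ = bytearray(ow * oh)  # 1 byte per coarse pixel, zero-initialised
    for px, py in points:
        if 0 <= px < width and 0 <= py < height:
            occ[(py // scale) * ow + (px // scale)] = 1
    return occ, ow, oh

def empty_cells(occ, ow, oh):
    """Coarse cells with no points: candidates for early sample requests."""
    return [(i % ow, i // ow) for i, v in enumerate(occ) if v == 0]
```

Cells returned by `empty_cells` mark regions that are predicted to have no data, so samples there can be requested before the camera actually arrives.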
Predictive Sampling
[Figure panels: No Prediction, With Prediction]
Example during rapid camera rotation
Algorithm Overview
[Diagram: algorithm pipeline. Stage labels: Update Points, Prediction, Project/Sort, Z-Buffer, Depth Cull, Prefilter, Interpolate, Sampling. Other labels: renderer (x2), image]
Point Eviction
Stale data can be worse than no data
• Points may live a long time at high ratios
- Not enough new samples to overwrite old
• Color change detection already exists
- Enhances sampling in regions of change
- Works by aging nearby points
Evict points beyond an age limit
• Speeds image convergence
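Age-based eviction combined with the existing aging mechanism could be sketched as below. The `CachedPoint` class, the `AGE_PENALTY` value, and the idea of passing in the indices of points near a detected color change are all assumptions made for illustration; the slides do not give the actual data layout or limits.

```python
from dataclasses import dataclass

AGE_PENALTY = 8  # assumed extra aging applied near a detected color change

@dataclass
class CachedPoint:
    color: tuple
    age: int = 0

def end_of_frame(points, max_age, changed=()):
    """Age every point by one frame, age points near a detected color
    change more aggressively, and drop any point past max_age."""
    changed = set(changed)
    survivors = []
    for i, p in enumerate(points):
        p.age += 1
        if i in changed:
            p.age += AGE_PENALTY
        if p.age <= max_age:
            survivors.append(p)
    return survivors
```

Evicted points leave gaps that the sampling stage then refills with fresh samples, rather than leaving stale colors on screen.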
SIMD Optimizations
Utilize MMX/SSE/SSE2 instructions
• Project four points at once
• Process R, G, B channels simultaneously
• Add memory prefetches
- Automatic prefetch works well for linear access
• Convert branches to data dependencies
- Compares set masks of zeroes or ones
- Use boolean operations instead of branches
• Roughly a factor of two total speedup
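The branch-to-mask conversion can be illustrated on a single Z-buffer update as below. This is scalar Python standing in for one 32-bit SIMD lane, so it shows the pattern only and not the speedup; the function names and the use of unsigned 32-bit integer depths are assumptions for the example.

```python
MASK32 = 0xFFFFFFFF

def compare_lt_mask(a, b):
    """All ones if a < b, else all zeroes, mimicking the result of a
    SIMD compare in one 32-bit lane."""
    return -(a < b) & MASK32

def select(mask, if_set, if_clear):
    """Pick if_set where the mask bits are one and if_clear where they
    are zero, using only boolean operations."""
    return (if_set & mask) | (if_clear & ~mask & MASK32)

def zbuffer_update(old_depth, old_id, new_depth, new_id):
    """Branch-free depth test: keep whichever point is nearer."""
    m = compare_lt_mask(new_depth, old_depth)
    return select(m, new_depth, old_depth), select(m, new_id, old_id)
```

Because the result is computed the same way whether the test passes or fails, four lanes can be processed together even when their outcomes differ.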
Results
[Figure panels: Ray trace only (1.8 fps), Render Cache (9 fps)]
Single 1.7GHz processor - rotating camera
Results
Timing: 62.1 ms (up to 16 fps)
• 512x512 image, render cache only
• 1.7GHz Pentium 4 processor
[Chart: time breakdown by stage. Labels: Update Points, Prediction, Project, Z-Buffer, Depth Cull, Prefilter, Filter / Smooth, Sampling]
Scalability with Image Size
[Plot: Frame Size (Pixels), 0 to 1,600,000, against Frame Time (ms), 0 to 350; marked sizes 512x512 and 1200x1200]
Results
Try it for yourself
• Download publicly available binary
- Includes Render Cache and simple Ray Tracer
- Requires a Pentium 4 and Java Web Start
- Free for evaluation and internal use
- http://www.graphics.cornell.edu/research/interactive/rendercache
Demo
The End