GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra
![Page 1: GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra](https://reader036.fdocuments.us/reader036/viewer/2022070516/5873c66c1a28abbc788b7b47/html5/thumbnails/1.jpg)
Antwan Hätälä, Umbra 3 Lead programmer
Boosting your ARM mobile 3D rendering performance with Umbra 3
![Page 2: GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra](https://reader036.fdocuments.us/reader036/viewer/2022070516/5873c66c1a28abbc788b7b47/html5/thumbnails/2.jpg)
INDEX
• Who are we?
• Games
• What is Umbra 3 and occlusion culling
• Bringing our system to the PlayStation 4
• Experiences and benefits
• Lessons learned
![Page 3: GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra](https://reader036.fdocuments.us/reader036/viewer/2022070516/5873c66c1a28abbc788b7b47/html5/thumbnails/3.jpg)
UMBRA SOFTWARE
Occlusion culling middleware for 3D games
Founded in 2007
14 employees
Based in Helsinki, Finland
Support office in Seattle, WA
Same problem – Different solutions
Mo Money – Mo Problems
“Level artists are there to fill the world with content. Integrating Umbra saved us not only artist time but the time to create and maintain an efficient visibility culling solution. Umbra’s support provides us with the solutions and features that we need.”
“Umbra’s technology is playing an important role in the creation of our next universe, by freeing our artists from the burden of manual markups typically associated with polygon soup.”
![Page 4: GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra](https://reader036.fdocuments.us/reader036/viewer/2022070516/5873c66c1a28abbc788b7b47/html5/thumbnails/4.jpg)
Occlusion culling basics
![Page 5: GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra](https://reader036.fdocuments.us/reader036/viewer/2022070516/5873c66c1a28abbc788b7b47/html5/thumbnails/5.jpg)
Occlusion Culling: Why bother?
• Process and render only what's visible
• Improved frame rate and rendering performance
• Allows you to put more detail into levels and create larger levels
![Page 6: GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra](https://reader036.fdocuments.us/reader036/viewer/2022070516/5873c66c1a28abbc788b7b47/html5/thumbnails/6.jpg)
What is Umbra?
![Page 7: GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra](https://reader036.fdocuments.us/reader036/viewer/2022070516/5873c66c1a28abbc788b7b47/html5/thumbnails/7.jpg)
Umbra 3 occluder rasterizer
• Determines visible objects fast to save further work both on CPU and GPU
• Rasterizes automatically generated proprietary occluder models on CPU
• Operates in low resolution, generates conservative (dilated) results
• Rasterization is embarrassingly parallel in nature
  • Parallelize across CPU cores
  • Process multiple pixels/elements in SIMD
• Optimized for SSE, Altivec, Cell and ARM NEON
![Page 8: GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra](https://reader036.fdocuments.us/reader036/viewer/2022070516/5873c66c1a28abbc788b7b47/html5/thumbnails/8.jpg)
NEON overview
• Processing of multiple data elements (2 to 16) in a single instruction
• Separate execution pipeline: can execute in parallel with ARM
• Separate register file: 16 128-bit regs (or 32 64-bit), SP floats or 8-64 bit integers
• Mandatory in Cortex-A8/A12/A15, optional in Cortex-A9
  • For mobile 3D title purposes, it will be there
• Actual cycle counts will vary: 64-bit vs 128-bit, single vs dual issue, latencies
  • For multi-platform, target A9 and enjoy free benefits on more advanced platforms
• Used in one of three ways
  • Inline assembly
  • Compiler intrinsics
  • Compiler auto-vectorization
• Similar to SSE, Altivec but for best performance you need to know your platform
![Page 9: GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra](https://reader036.fdocuments.us/reader036/viewer/2022070516/5873c66c1a28abbc788b7b47/html5/thumbnails/9.jpg)
NEON common best practices
• Collaborate with the compiler, but keep an eye on the output
  • Align your data when possible
  • Inline functions that operate on SIMD values
  • Use __restrict to let the compiler reorder
  • Watch for register spilling
• Schedule enough NEON work, even when it might be redundant
  • Loading data from ARM registers is relatively cheap, storing back is expensive
  • Hide load/store latencies by interleaving with computation (unroll your loops)
• Never interleave VFP instructions with NEON
  • Means pipeline flush, tens of cycles of penalty
  • Watch for "s" register use in compiler output
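The compiler-facing advice above (inline helpers, no-alias promises, unrolled loops with loads grouped ahead of the arithmetic) can be sketched in portable C. This is only an illustration: `scale_bias` and `scale_bias_one` are hypothetical names, C99 `restrict` stands in for the `__restrict` extension named on the slide, and whether a given compiler actually vectorizes the loop has to be checked in its output.

```c
#include <assert.h>
#include <stddef.h>

/* Small helper kept inline so no call overhead separates the SIMD-friendly ops. */
static inline float scale_bias_one(float x, float scale, float bias)
{
    return x * scale + bias;
}

/* restrict promises that dst and src never overlap, so the compiler is free
 * to reorder the loads and stores below (hypothetical example routine). */
void scale_bias(float *restrict dst, const float *restrict src,
                size_t count, float scale, float bias)
{
    assert(dst != NULL && src != NULL);
    assert(count % 4 == 0); /* assumption: caller pads to a multiple of 4 */

    for (size_t i = 0; i < count; i += 4) {
        /* Unrolled by four: all loads are issued before the arithmetic,
         * which gives the scheduler room to hide load latency. */
        const float a = src[i + 0];
        const float b = src[i + 1];
        const float c = src[i + 2];
        const float d = src[i + 3];

        dst[i + 0] = scale_bias_one(a, scale, bias);
        dst[i + 1] = scale_bias_one(b, scale, bias);
        dst[i + 2] = scale_bias_one(c, scale, bias);
        dst[i + 3] = scale_bias_one(d, scale, bias);
    }
}
```

Calling it with overlapping `dst` and `src` would break the `restrict` promise, so the qualifier is only appropriate when the buffers are known to be distinct.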
![Page 10: GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra](https://reader036.fdocuments.us/reader036/viewer/2022070516/5873c66c1a28abbc788b7b47/html5/thumbnails/10.jpg)
NEON optimization tricks
• No penalty from interleaving 2-wide ops with 4-wide ops
  • Cortex-A8/A9 does 64-bit float operations per cycle
  • vget_high_xxx, vget_low_xxx to address quadword halves
• Narrow to 64 bits early
  • 16x4 and 8x8 are also 64 bits, for many operations 32 bits per channel not needed
  • Even if CPU can churn out 128 bits per cycle, savings to be had in result latency etc.
  • Use VMOVN or coupled operation and narrow
• Careful with your constants
  • VMOV and VMVN can encode lots of useful constants
  • Compilers do a good job of constant encoding, but can’t choose the constants for you
• Killer instructions
  • Shift-and-insert: VSRI, VSLI
  • Byte permute by table lookup: VTBL, VTBX
  • Gather load and scatter store: VLD2-4, VST2-4
![Page 11: GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra](https://reader036.fdocuments.us/reader036/viewer/2022070516/5873c66c1a28abbc788b7b47/html5/thumbnails/11.jpg)
NEON optimization example
Example routine: gather sign bits of large array of float values

    function gather_signbits(flt_array):
        let output_bitmap = bitmap of size len(flt_array)
        foreach elem in flt_array at index idx:
            if (elem < 0)
                set_bit(output_bitmap, idx)
            else
                clear_bit(output_bitmap, idx)
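For reference, the pseudocode above might look like this in plain C. It is a scalar baseline sketch, not the shipped routine: the name `gather_signbits_ref`, the 32-bit word size and the choice of putting element `idx` at bit `idx % 32` of word `idx / 32` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar reference: bit idx of the bitmap is 1 when values[idx] < 0.
 * bitmap must hold at least (count + 31) / 32 words (assumed layout). */
void gather_signbits_ref(const float *values, size_t count, uint32_t *bitmap)
{
    assert(values != NULL && bitmap != NULL);

    /* Clearing everything up front replaces the per-element clear_bit. */
    memset(bitmap, 0, ((count + 31) / 32) * sizeof(uint32_t));

    for (size_t idx = 0; idx < count; ++idx) {
        if (values[idx] < 0.0f)
            bitmap[idx / 32] |= (uint32_t)1 << (idx % 32);
    }
}
```

Note that `elem < 0` and "sign bit set" differ for -0.0f and for NaNs with the sign bit set, which matters when comparing this baseline against versions that read the sign bit directly.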
![Page 12: GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra](https://reader036.fdocuments.us/reader036/viewer/2022070516/5873c66c1a28abbc788b7b47/html5/thumbnails/12.jpg)
NEON optimization example: first attempt
• Sufficient unrolling: handle 16 elements in one iteration
  • compare 4 values per instruction
  • bitwise AND for correct bit offsets
  • collapse with vertical OR (pairwise add)

    20: add.w r2, r0, #32
    24: vld1.64 {d28-d29}, [r0 :128]
    28: vld1.64 {d24-d25}, [r2 :128]
    2c: add.w r2, r0, #16
    30: vclt.f32 q14, q14, #0
    34: vld1.64 {d26-d27}, [r2 :128]
    38: add.w r2, r0, #48 ; 0x30
    3c: vclt.f32 q12, q12, #0
    40: vand q14, q8, q14
    44: vld1.64 {d30-d31}, [r2 :128]
    48: vclt.f32 q13, q13, #0
    4c: vand q13, q11, q13
    50: vclt.f32 q15, q15, #0
    54: vand q12, q10, q12
    58: vand q15, q9, q15
    5c: vorr q13, q14, q13
    60: vorr q12, q12, q15
    64: vorr q12, q13, q12
    68: vpadd.i32 d24, d24, d25
    6c: vpadd.i32 d24, d24, d24
    70: vst1.32 {d24[0]}, [r0 :32], r1
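The disassembly above could plausibly come from intrinsics along these lines. This is a sketch under assumptions, not the original source: the function name `signbits16_compare` and the per-lane bit constants are guesses (the slide does not show what q8-q11 hold), and a portable fallback with the same bit layout is included so the routine also builds off ARM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#else
#define SKETCH_HAVE_NEON 0
#endif

/* Returns 16 bits; bit i is set when p[i] < 0 (bit order is an assumption). */
uint32_t signbits16_compare(const float *p)
{
    assert(p != NULL);
#if SKETCH_HAVE_NEON
    /* One bit per lane; shifted copies give each vector its own nibble. */
    static const uint32_t lane_bits[4] = {1u, 2u, 4u, 8u};
    const uint32x4_t bits0 = vld1q_u32(lane_bits);
    const uint32x4_t bits1 = vshlq_n_u32(bits0, 4);
    const uint32x4_t bits2 = vshlq_n_u32(bits0, 8);
    const uint32x4_t bits3 = vshlq_n_u32(bits0, 12);
    const float32x4_t zero = vdupq_n_f32(0.0f);

    /* Compare four floats per instruction; true lanes become all ones,
     * and the AND keeps only the bit that belongs to that lane. */
    const uint32x4_t m0 = vandq_u32(vcltq_f32(vld1q_f32(p + 0), zero), bits0);
    const uint32x4_t m1 = vandq_u32(vcltq_f32(vld1q_f32(p + 4), zero), bits1);
    const uint32x4_t m2 = vandq_u32(vcltq_f32(vld1q_f32(p + 8), zero), bits2);
    const uint32x4_t m3 = vandq_u32(vcltq_f32(vld1q_f32(p + 12), zero), bits3);

    /* Vertical OR merges the four vectors; pairwise adds collapse the lanes. */
    const uint32x4_t merged = vorrq_u32(vorrq_u32(m0, m1), vorrq_u32(m2, m3));
    uint32x2_t sum = vpadd_u32(vget_low_u32(merged), vget_high_u32(merged));
    sum = vpadd_u32(sum, sum);
    return vget_lane_u32(sum, 0);
#else
    /* Portable fallback producing the same bit layout. */
    uint32_t out = 0;
    for (size_t i = 0; i < 16; ++i)
        out |= (uint32_t)(p[i] < 0.0f) << i;
    return out;
#endif
}
```

Four separate mask constants are needed here, one per input vector, which is the cost the next slide's variant reduces.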
![Page 13: GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra](https://reader036.fdocuments.us/reader036/viewer/2022070516/5873c66c1a28abbc788b7b47/html5/thumbnails/13.jpg)
NEON optimization example: shift-and-insert, narrow early
• Compare with zero = shift sign bit
• Can shift and combine simultaneously with VSRI instruction
• Narrow to 16 bits (VMOVN) before proceeding further
  • half the amount of constants

    18: vld1.64 {d18-d19}, [r0 :128]
    1c: add.w r3, r0, #16
    20: adds r1, #4
    22: vshr.u32 q9, q9, #19
    26: vld1.64 {d20-d21}, [r3 :128]
    2a: add.w r3, r0, #32
    2e: vsri.32 q9, q10, #23
    32: vld1.64 {d20-d21}, [r3 :128]
    36: add.w r3, r0, #48 ; 0x30
    3a: vsri.32 q9, q10, #27
    3e: vld1.64 {d20-d21}, [r3 :128]
    42: vsri.32 q9, q10, #31
    46: vmovn.i32 d18, q9
    4a: vand d18, d18, d16
    4e: vshl.u16 d18, d18, d17
    52: vpaddl.u16 d18, d18
    56: vpadd.i32 d18, d18, d18
    5a: vst1.32 {d18[0]}, [r0 :32], r2
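A sketch of how the shift-and-insert sequence above might be written with intrinsics follows. The shift amounts (19, 23, 27, 31) are taken from the disassembly; the mask `0x1111` and the per-lane shifts `{0, 1, 2, 3}` are assumptions, since the contents of d16 and d17 are not shown, and `signbits16_sri` is a hypothetical name. With these constants the element in lane j of vector k ends up at bit 4 * (3 - k) + j, so the first vector lands in the top nibble rather than the bottom one. The non-NEON branch emulates VSRI one lane at a time so the bit arithmetic can be followed and run anywhere.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#else
#define SKETCH_HAVE_NEON 0
/* Raw IEEE-754 bits of a float. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
/* VSRI on one 32-bit lane: keep the top n bits of dst, insert src >> n below. */
static uint32_t sri32(uint32_t dst, uint32_t src, unsigned n)
{
    const uint32_t low_mask = 0xFFFFFFFFu >> n;
    return (dst & ~low_mask) | (src >> n);
}
#endif

/* Packs the sign bits of p[0..15] into 16 bits (layout described above). */
uint32_t signbits16_sri(const float *p)
{
    assert(p != NULL);
#if SKETCH_HAVE_NEON
    static const int16_t lane_shift[4] = {0, 1, 2, 3};

    /* Sign of vector 0 goes to bit 12; each VSRI drops the next sign 4 bits lower. */
    uint32x4_t acc = vshrq_n_u32(vreinterpretq_u32_f32(vld1q_f32(p)), 19);
    acc = vsriq_n_u32(acc, vreinterpretq_u32_f32(vld1q_f32(p + 4)), 23);
    acc = vsriq_n_u32(acc, vreinterpretq_u32_f32(vld1q_f32(p + 8)), 27);
    acc = vsriq_n_u32(acc, vreinterpretq_u32_f32(vld1q_f32(p + 12)), 31);

    /* Narrow early: one 64-bit mask and one 64-bit shift vector suffice. */
    uint16x4_t narrow = vmovn_u32(acc);
    narrow = vand_u16(narrow, vdup_n_u16(0x1111));
    narrow = vshl_u16(narrow, vld1_s16(lane_shift));

    uint32x2_t sum = vpaddl_u16(narrow);
    sum = vpadd_u32(sum, sum);
    return vget_lane_u32(sum, 0);
#else
    uint32_t out = 0;
    for (unsigned lane = 0; lane < 4; ++lane) {
        uint32_t acc = float_bits(p[lane]) >> 19;
        acc = sri32(acc, float_bits(p[4 + lane]), 23);
        acc = sri32(acc, float_bits(p[8 + lane]), 27);
        acc = sri32(acc, float_bits(p[12 + lane]), 31);
        /* Mask away exponent/mantissa leftovers, then place this lane's bits. */
        out += (acc & 0x1111u) << lane;
    }
    return out;
#endif
}
```

Because this variant reads the sign bit directly, it reports -0.0f as negative, unlike a `< 0` comparison.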
![Page 14: GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra](https://reader036.fdocuments.us/reader036/viewer/2022070516/5873c66c1a28abbc788b7b47/html5/thumbnails/14.jpg)
Thank you.
For more on Umbra 3, go to: umbra3.com
Follow us on Twitter @umbrasoftware