High-Performance Physics Solver Design for Next...

69
High-Performance Physics Solver Design for Next Generation Consoles Vangelis Kokkevis Steven Osman Eric Larsen Simulation Technology Group Sony Computer Entertainment America US R&D

Transcript of High-Performance Physics Solver Design for Next...

Page 1: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

High-Performance Physics Solver Design for Next Generation Consoles

Vangelis KokkevisSteven Osman

Eric Larsen

Simulation Technology Group

Sony Computer Entertainment AmericaUS R&D

Page 2: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

This Talk

Optimizing physics simulation on a multi-core architecture.

Focus on CELL architectureVariety of simulation domains

Cloth, Rigid Bodies, Fluids, ParticlesPractical advice based on real case-studiesDemos!

Page 3: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Basic IssuesLooking for opportunities to parallelize processing

High Level – Many independent solvers on multiple coresLow Level – One solver, one/multiple cores

Coding with small memory in mindStreamingBatching up workSoftware Caching

Speeding up processing within each unitSIMD processing, instruction schedulingDouble-buffering

Parallelizing/optimizing existing code

Page 4: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

What is not in this talk?

Details on specific physics algorithmsToo much material for a 1-hour talkWill provide references to techniques

Much insight on non-CELL platformsConcentrate on actual resultsConcepts should be applicable beyond CELL

Page 5: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

The Cell Processor Model

PPU

SPE0

DMA

256K LS

SPU0

SPE0

DMA

256K LS

SPU1

SPE0

DMA

256K LS

SPU2

SPE0

DMA

256K LS

SPU3

DMA

256K LS

SPU4

DMA

256K LS

SPU5

DMA

256K LS

SPU6

DMA

256K LS

SPU7

Main Memory

L1/L2

Page 6: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Physics on CELL

Physics should happen mostly on SPUsThere’s more of them!SPUs have greater bandwidth & performancePPU is busy doing other stuff

PPU

SPE0DMA

256K LS

SPU0

DMA

SPU1

DMA

SPU2

DMA

SPU3

DMA

SPU4

DMA

SPU5

DMA

SPU6

DMA

SPU7

Main Memory

L1/L2

256K LS 256K LS 256K LS

256K LS 256K LS 256K LS 256K LS

Page 7: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

SPU Performance Recipe

Large bandwidth to and from main memoryQuick (1-cycle) LS memory accessSIMD instruction setConcurrent DMA and processingChallenges:

Limited LS size, shared between code and dataRandom accesses of main memory are slow

Page 8: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Cloth Simulation

Page 9: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Cloth Simulation

Cloth mesh simulated as point masses (vertices) connected via distance constraints (edges).

m1m1

m2m2 m3m3

d1d1

d2d2

d3d3Mesh TriangleMesh Triangle

References:T.Jacobsen,Advanced Character Physics, GDC 2001A.Meggs,Taking Real-Time Cloth Beyond Curtains,GDC 2005

Page 10: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Simulation Step

1. Compute external forces, fE,per vertex2. Compute new vertex positions [ Integration ]:

pt+1 = (2pt− pt−1)+ 1

3. Fix edge lengths Adjust vertex positions

4. Correct penetrations with collision geometryAdjust vertex positions

2fE ∗ 1m ∗Δt2

Page 11: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

How many vertices?

How many vertices fit in 256K (less actually)? A lot, surprisingly…

Tips:Look for opportunities to stream dataKeep in LS only data required for each step

Page 12: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Integration Step

Less than 4000 verts in 200K of memoryWe don’t need to keep them all in LSKeep vertex data in main memory and bring it in in blocks

16 + 16 + 16 + 4 = 52 bytes / vertex

pt+1 = (2pt− pt−1)+ 12fE ∗1m ∗Δt2

Page 13: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Streaming Integration

B0 B1 B2 B3

B0 B1 B2 B3

B0 B1 B2 B3

pt

pt−1

fE

B0 B1 B2 B31m

Main Memory

Local Store

Page 14: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Streaming Integration

B0 B1 B2 B3

B0 B1 B2 B3

B0 B1 B2 B3

pt

pt−1

fE

B0 B1 B2 B31m

Local Store

Main Memory

DMA_IN B0

B0 B0 B0 B0

Page 15: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Streaming Integration

B0 B1 B2 B3

B0 B1 B2 B3

B0 B1 B2 B3

pt

pt−1

fE

B0 B1 B2 B31m

Main Memory

Local Store

DMA_IN B0

B0 B0 B0 B0

Process B0

Page 16: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Streaming Integration

B0 B1 B2 B3

B0 B1 B2 B3

B0 B1 B2 B3

pt

pt−1

fE

Process B0 DMA_OUT B0

B0 B1 B2 B31m

Main Memory

Local Store

DMA_IN B0

B0 B0 B0 B0

Page 17: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Streaming Integration

B0 B1 B2 B3

B0 B1 B2 B3

B0 B1 B2 B3

pt

pt−1

fE

Process B0 DMA_OUT B0

B0 B1 B2 B31m

DMA_IN B1

Main Memory

Local Store

DMA_IN B0

B1 B1 B1 B1

Page 18: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Streaming Integration

B0 B1 B2 B3

B0 B1 B2 B3

B0 B1 B2 B3

pt

pt−1

fE

Process B0 DMA_OUT B0

B0 B1 B2 B31m

DMA_IN B1

Main Memory

Local Store

DMA_IN B0

B1 B1 B1 B1

Process B1

Page 19: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Streaming Integration

B0 B1 B2 B3

B0 B1 B2 B3

B0 B1 B2 B3

pt

pt−1

fE

Process B0 DMA_OUT B0

B0 B1 B2 B31m

DMA_IN B1

Main Memory

Local Store

DMA_IN B0

B1 B1 B1 B1

Process B1 DMA_OUT B1

Page 20: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Streaming Integration

B0 B1 B2 B3

B0 B1 B2 B3

B0 B1 B2 B3

pt

pt−1

fE

Process B0 DMA_OUT B0

B0 B1 B2 B31m

DMA_IN B1

Main Memory

Local Store

DMA_IN B0

B1 B1 B1 B1

Process B1 DMA_OUT B1

Page 21: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Double-buffering

Take advantage of concurrent DMA and processing to hide transfer times

Without double-buffering:

Process B0 DMA_OUT B0

DMA_IN B1

DMA_IN B0 Process B1 DMA_OUT

B1…

With double-buffering:Process B0

DMA_IN B1

DMA_IN B0

…DMA_OUT

B0

Process B1

DMA_IN B2

Process B2

DMA_OUT B1

DMA_IN B3

Page 22: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Streaming Data

Streaming is possible when the data access pattern is simple and predictable (e.g. linear)

Number of verts processed per frame depends on processing speed and bandwidth but not LS size

Unfortunately, not every step in the cloth solver can be fully streamed

Fixing edge lengths requires random memory access…

Page 23: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Fixing Edge Lengths

Points coming out of the integration step don’t necessarily satisfy edge distance constraints

struct Edge{

int v1;int v2;float restLen;

}p[v1] p[v2]

p[v1] p[v2]

Vector3 d = p[v2] – p[v1];float len = sqrt(dot(d,d));diff = (len-restLen)/len;p[v1] -= d * 0.5 * diff;p[v2] += d * 0.5 * diff;

Page 24: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Fixing Edge Lengths

An iterative process: Fix one edge at a time by adjusting 2 vertex positionsRequires random access to particle positions arraySolution:

Keep all particle positions in LSStream in edge dataIn 200K we can fit 200KB / 16B > 12K vertices

Page 25: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Rigid Bodies

Our group is currently porting the AGEIATM

PhysXTM SDK to CELLLarge codebase written with a PC architecture in mind

Assumes easy random access to memoryProcesses tasks sequentially (no parallelism)

Interesting example on how to port existing code to a multi-core architecture

Page 26: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Starting the Port

Determine all the stages of the rigid body pipelineLook for stages that are good candidates for parallelizing/optimizingProfile code to make sure we are focusing on the right parts

Page 27: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Constraint Solve

Constraint Solve

BroadphaseCollision Detection

Rigid Body Pipeline

BroadphaseCollision Detection

NarrowphaseCollision Detection

NarrowphaseCollision Detection

Constraint PrepConstraint Prep

IntegrationIntegration

Constraint PrepConstraint Prep

Points of contactbetween bodies

Potentially collidingbody pairs

New body positions

Updated bodyvelocities

Current body positions

Constraint Equations

Page 28: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Rigid Body Pipeline

BroadphaseCollision Detection

BroadphaseCollision Detection

Points of contactbetween bodies

Potentially collidingbody pairs

New body positions

Updated bodyvelocities

Current body positions

Constraint Equations

NP NP NP

CP CP CP

CS CS CS

I I I

Page 29: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Profiling Scenario

Page 30: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Profiling ResultsCumulative Frame Time

0

10000

20000

30000

40000

50000

60000

70000

1 57 113

169

225

281

337

393

449

505

561

617

673

729

785

841

897

953

1009

1065

1121

1177

1233

1289

1345

1401

1457

1513

1569

1625

1681

1737

1793

1849

1905

1961

OtherINTEGRATIONSOLVERCONSTRAINT_PREPNARROWPHASEBROADPHASE

Page 31: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Running on the SPUs

Three steps:1. (PPU) Pre-process

“Gather” operation (extract data from PhysX data structures and pack it in MM)

2. (SPU) ExecuteDMA packed data from MM to LSProcess data and store output in LSDMA output to MM

3. (PPU) Post-process“Scatter” operation (unpack output data and put back inPhysX data structures)

Page 32: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Why Involve the PPU?Required PhysX data is not conveniently packed Data is often not alignedWe need to use PhysX data structures to avoid breaking features we haven’t ported

Solutions:Use list DMAs to bring in dataModify existing code to force alignmentChange PhysX code to work with new data structures

Page 33: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Batching Up Work

Task Description

Task Description

batch inputs/outputs

batch inputs/outputs

Work batchbuffers in MMWork batch

buffers in MM

PhysXdata-structures

PhysXdata-structures

PPUPre-Process

PPUPre-Process

Task Description

Task Description

batch inputs/outputs

batch inputs/outputs

……

SPUExecute

SPUExecute

PPUPost-Process

PPUPost-Process

PhysXdata-structures

PhysXdata-structures

Create work batches for each task

Page 34: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

A

Narrow-phase Collision Detection

Problem:A list of object pairs that may be collidingWant to do contact processing on SPUsPairs list has references to geometry

C

B

(A,C)

(A,B)

(B,C)

Page 35: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Narrow-phase Collision Detection

Data localitySame bodies may be in several pairsGeometry may be instanced for different bodies

SPU memory accessCan only access main memory with DMANo hardware cacheData reuse must be explicit

Page 36: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Software Cache

Idea: make a (read-only) software cacheCache entry is one geometric objectEntries have variable size

Basic operationSPU checks cache for objectIf not in cache, object fetched with DMACache returns a local address for object

Page 37: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Software Cache

Data StructuresTwo entry buffersNew entries appended to “current” buffer

Hash-table used to record and find loaded entries

A

BC

Buffer 0

Next DMA

Buffer 1

Page 38: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Software Cache

Data ReplacementWhen space runs out in a buffer

Overwrite data in second buffer

ConsiderationsDoes not fragment memoryNo searches for free spaceBut does not prefer frequently used data

Page 39: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Software Cache

Hiding the DMA latencyDouble-buffering

Start DMA for un-cached entriesProcess previously DMA’d entries

Process/pre-fetch batchesFetch and compute times vary

Batching may improve balanceDMA-lists useful

One DMA commandMultiple chunks of data gathered

Current Buffer

DF

E

Process

DMA

A

BC

Page 40: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Software Caching

ConclusionsSimple cache is practical

Used for small convex objects in PhysX

Design considerationsTradeoff of cache-logic cycles vs. bandwidth savedPre-fetching important to include

Page 41: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Single SPU Performance

PPU only:PPU only:

Exec Exec

PPU + SPU:PPU + SPU:

PPUPPU

PPUPPU

SPUSPU Exec Exec

FreeFree

SPU Exec < PPU Exec: SIMD + fast mem accessSPU Exec < PPU Exec: SIMD + fast mem access

PreP PostP

Page 42: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Multiple SPU Performance

Pre- and Post- processing times determine how many SPUs can be used effectively

Page 43: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Multiple SPU Performance

11 22 33 44PPUPPU

1 SPU1 SPU

11

22

33

44SS

3 SPUs3 SPUs

11 22 4433

11

22

33

442 SPUs2 SPUs

Page 44: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

PPU vs SPU comparisonsConvex Stack (500 boxes)

0

10000

20000

30000

40000

50000

60000

70000

80000

1 35 69 103

137

171

205

239

273

307

341

375

409

443

477

511

545

579

613

647

681

715

749

783

817

851

885

919

953

987

1021

1055

1089

1123

1157

1191

1225

1259

1293

frame

mic

rose

cond

s PPU-only1-SPU2-SPUs3-SPUs4-SPUs

Page 45: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Duck Demo

One of our first CELL demos (spring 2005)Several interacting physics systems:

Rigid bodies (ducks & boats)Height-field water surfaceCloth with ripping (sails)Particle based fluids (splashes + cups)

Page 46: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Duck Demo (Lots of Ducks)

Page 47: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Duck Demo

Ambitious project with short deadlineEarly PC prototypes of some piecesMost straightforward way to parallelize:

Dedicate one SPU for each subsystemEach piece could be developed and tested individually

Page 48: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Duck Demo Resource AllocationPU – main loop

SPU thread synchronization, draw calls

SPU0 – height field water (<50%)

SPU1 – splashes iso-surface (<50%)

SPU2 – cloth sails for boat 1 (<50%)

SPU3 – cloth sails for boat 2 (<50%)

SPU4 – rigid body collision/response (95%)

HF water

Iso-Surface

Cloth

Cloth

Rigid Bodies

1 frame

Page 49: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Parallelization Recipe

One three-step approach to code parallelization:

1. Find independent components2. Run them side-by-side3. Recursively apply recipe to components

Page 50: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Challenges

Step 1: Find independent componentsWhere do you look?Maybe you need to break apart and overlap your data?

e.g. Broad phase collision detectionMaybe you need to break apart your loop into individual iterations?

e.g. Solving cloth constraints

Page 51: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Broad Phase Collision Detection

600 Objects

200 Objects A 200 Objects B 200 Objects C

200 Objects A 200 Objects Bvs

200 Objects A 200 Objects Cvs

200 Objects C200 Objects B vs

We can execute all three of

these simultaneously

Need to test 600 rigid bodies against each other.

Page 52: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Cloth Solvingfor (i=1 to 5) {

cloth=solve(cloth)}

A B

C

for (i=1 to 5) {solve_on_proc1(a);solve_on_proc2(b);wait_for_all()solve_on_proc1(c);wait_for_all();

}

Page 53: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

…challenges

Step 2: Run them side-by-sideBandwidth and cache issues

Need good data layout to avoid thrashing cache or bus

Processor issuesNeed efficient processor management scheme

What if the job sizes are very different?e.g. a suit of cloth and a separate neck tie

Need further refinement of large jobs, or you only save on the small neck tie time

Page 54: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

…challenges

Step 3: RecurseWhen do you stop?

Overhead of launching smaller jobsSynchronization when a stage is done

e.g. Gather results from all collision detection before solving

But this can go down to the instruction levele.g. Using Structure-of-Arrays, transform four

independent vectors at once

Page 55: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

High Level Parallelization:Duck Demo

Fluid Simulation Fluid Surface Rigid Bodies Cloth Sails

Dependency exists

Fluid Simulation Fluid Surface

Rigid Bodies

Cloth Sails

Cloth Boat 1

Cloth Boat 2

Note that the parts didn’t take an equal amount of time to run. We

could have done better given time!

But cloth was for multiple boats

Page 56: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Broad PhaseCollision Detection

Narrow PhaseCollision Detection

Constraint Solving

Lower Level ParallelizationRigid Body Simulation

600 bodies example

Objects A Objects B Objects C

Broad PhaseCollision Detection

Narrow PhaseCollision Detection

Constraint Solving

Objects A Objects B

Objects CObjects B

Objects A

Objects C

Proc 3Proc 2Proc 1 Proc 3Proc 2Proc 1 Proc 3Proc 2Proc 1

Page 57: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Structure of Arrays

X0 Y0 Z0 W0

X1 Y1 Z1 W1

X2 Y2 Z2 W2

X3 Y3 Z3 W3

X4 Y4 Z4 W4

X5 Y5 Z5 W5

X6 Y6 Z6 W6

X7 Y7 Z7 W7

Data[0]

Data[1]

Data[2]

Data[3]

Data[4]

Data[5]

Data[6]

Data[7]

X0

Y0

Z0

W0

X1

Y1

Z1

W1

X2

Y2

Z2

W2

X3

Y3

Z3

W3

X4

Y4

Z4

W4

X5

Y5

Z5

W5

X6

Y6

Z6

W6

X7

Y7

Z7

W7

Data[0]

Data[1]

Data[2]

Data[3]

Data[4]

Data[5]

Data[6]

Data[7]

Array of Structuresor “AoS”

Structure of Arraysor “SoA”

1 AoS Vector

1 SoA Vector

Bonus!Since W is almost always 0 or 1, we can eliminate it with a clever math library and save 25% memory and bandwidth!

Page 58: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Lowest Level Parallelization:Structure-of-Array processing of Particles

Given:pn(t)=position of particle n at time tvn(t)=velocity of particle n at time t

p1(ti)=p1(ti-1) + v1(ti-1) * dt + 0.5 * G * dt2

p2(ti)=p2(ti-1) + v2(ti-1) * dt + 0.5 * G * dt2

Note they are independent of each otherSo we can run four together using SoA

p{1-4}(ti)=p{1-4}(ti-1) + v{1-4}(ti-1) * dt + 0.5 * G * dt2

Page 59: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Failure CaseGauss Seidel Solver

Consider a simple position-based solver that uses distance constraints. Given:p=current positions of all objectssolve(cn, p) takes p and constraint cn and computes a new p

that satisfies cn

p=solve(c0, p)p=solve(c1, p)

Note that to solve c1, we need the result of c0. Can’t solve c0 and c1 concurrently!

Page 60: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Failure CasePossible Solutions

Generally, it’s you’re out of luck, but…Some cases have very limited dependenciese.g. particle-based cloth solving

Solution: Arrange constraints such that no four adjacent constraints share cloth particles

Consider a different solvere.g. Jacobi solvers don’t use updated values until all constraints have been processed once

But they need more memory (pnew and pcurrent)And may need more iterations to converge

Page 61: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Duck Demo (EyeToy + SPH)

Page 62: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Smoothed Particle Hydrodynamics(SPH) Fluid Simulation

Smoothed-particlesMass distributed around a pointDensity falls to 0 at a radius h

Forces between particles closer than 2h

h

Page 63: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

SPH Fluid SimulationHigh-level parallelism

Put particles in grid cellsProcess on different SPUs(Not used in duck demo)

Low-level parallelismSIMD and dual-issue on SPULarge n per cell may be better

Less grid overheadLoops fast on SPU

Page 64: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

SPH LoopConsider two sets of particles P and Q

E.g., taken from neighbor grid cellsO(n2) problem

Can unroll (e.g., by 4)for (i = 0; i < numP; i++)

for (j = 0; j < numQ; j+=4)Compute force (pi, qj)Compute force (pi, qj+1)Compute force (pi, qj+2)Compute force (pi, qj+3)

Page 65: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

SPH Loop, SoAIdea:

Increase SIMD throughput with structure-of-arraysTranspose and produce combinations

pi

qj

qj+1

x y z

x y z

x y z

xyz

xyz

xyz

xyz

SoA p

SoA q xyz

xyz

xyz

xyzx y z

x y zqj+2

qj+3

Page 66: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

SPH Loop, Software PipelinedAdd software pipelining

Conversion instructions can dual-issue with math

Load[i]

To SoA[i]

Compute[i]

From SoA[i]Store[i]

Compute[i]Load[i+1]

Store [i-1]

To SoA[i+1]

From SoA[i-1]

Pipe 0 Pipe 1

Page 67: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Recap

Finding independence is hard!Across subsystems or within subsystems?Across iterations or within iterations?Data level independence?Instruction level independence?How about “bandwidth level” independence?

Parallelization overheadSometimes running serially wins over overhead of parallelization

Page 68: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Particle Simulation Demo

Page 69: High-Performance Physics Solver Design for Next …twvideo01.ubm-us.net/o1/vault/gdc06/slides/S2714i1.pdf · High-Performance Physics Solver Design for Next Generation Consoles Vangelis

Questions?

http://www.research.scea.com/

Contacts:Vangelis Kokkevis: [email protected] Larsen: [email protected] Osman: [email protected]