HPCCD: Hybrid Parallel Continuous Collision Detection using CPUs and GPUs


Duksu Kim, Jae-Pil Heo, Jaehyuk Huh, John Kim, Sung-eui Yoon

http://sglab.kaist.ac.kr/HPCCD

Collision Detection (CD)

● Collision detection is widely used in various applications
  ● Games
  ● Physically-based simulations
  ● Robotics

[Images: from AION; from HUBO Lab. in KAIST; from “Need for Speed”]

Motivations

● Increasing demands for accurate and fast collision detection for deformable models

● Model complexity continues to grow

[Images: courtesy of Prof. Ko; from Creative Assembly’s “Rome: Total War”]

Goals

● Achieve interactive performance for exact collision detection between large-scale deformable models
  ● E.g., deforming models consisting of tens or hundreds of thousands of triangles

<Cloth-ball, 94K triangles> <Breaking dragon, 252K triangles>

Discrete vs. Continuous

● Discrete collision detection (DCD)
  ● Detect collisions at each frame
  ● Fast, but can miss collisions

[Figure: Frame1 and Frame2; DCD misses collisions that occur between the two frames.]

Discrete vs. Continuous

● Discrete collision detection (DCD)
● Continuous collision detection (CCD)
  ● Identify the first time-of-contact (ToC)
  ● Accurate, but requires a long computation time
  ● Not widely used in interactive applications

[Figure: Frame1 and Frame2, with the first time-of-contact (ToC) between them.]
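The difference between DCD and CCD can be illustrated with a minimal Python sketch. It assumes a one-dimensional point that moves linearly between two frames and a static wall given as an interval. The names `dcd_hits` and `ccd_first_toc` are hypothetical and not part of HPCCD.

```python
def dcd_hits(x0, x1, wall_lo, wall_hi):
    """Discrete test: look only at the positions at Frame1 (x0) and Frame2 (x1)."""
    return any(wall_lo <= x <= wall_hi for x in (x0, x1))


def ccd_first_toc(x0, x1, wall_lo, wall_hi):
    """Continuous test: first t in [0, 1] where x(t) = x0 + t * (x1 - x0) is inside the wall."""
    if wall_lo <= x0 <= wall_hi:
        return 0.0
    velocity = x1 - x0
    if velocity == 0:
        return None
    # The point can only enter through the wall face on its own side.
    face = wall_lo if x0 < wall_lo else wall_hi
    t = (face - x0) / velocity
    return t if 0.0 <= t <= 1.0 else None


# A fast point jumps over a thin wall between the two frames.
missed_by_dcd = not dcd_hits(0.0, 10.0, 4.0, 5.0)
toc = ccd_first_toc(0.0, 10.0, 4.0, 5.0)
```

In this toy case the discrete test reports no collision, while the continuous test returns a time-of-contact inside the frame interval.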

Inter- and Self-Collisions

● Inter-collisions
  ● Collisions between two objects
● Self-collisions
  ● Collisions between two regions of a deformable object
  ● Takes a long computation time to detect

[Image: from Govindaraju’s paper]

Parallel Computing Trends

● Many-core architectures
  ● Multi-core CPU architectures
  ● GPU architectures
● Heterogeneous architectures
  ● Intel’s Larrabee and AMD’s Fusion
● Designing parallel algorithms is important to utilize these parallel architectures

Main Contributions

● A novel, hybrid parallel CCD method
  ● Utilize both multi-core CPUs and GPUs
  ● No locking in the main loop of CD
  ● GPU-based exact CD between two triangles
● High scalability
● Interactive performance

Cloth Benchmark (94K triangles)

● Test machine
  ● A quad-core CPU
  ● Two GPUs
● Results
  ● 10.4X speed-up over using a CPU core
  ● 23ms per frame (43 FPS)

Related Work

● Algorithms specialized on certain types
  ● Rigid objects [Redon et al. 2002]
  ● Articulated bodies [Zhang et al. 2007]
  ● Meshes with fixed topology [Govindaraju et al. 2005, Wong et al. 2005]
● Efficient culling methods
  ● Eliminates redundant elementary tests [Curtis et al. 2008]
  ● Connectivity-based culling [Tang et al. 2008]

Related Work

● GPU-based approaches
  ● Visibility queries [Govindaraju et al. 2005]
  ● Unified GPU-framework for proximity queries [Sud et al. 2006]
● Multi-core CPU-based approaches
  ● Voxel-based collision detection method [Lawlor and Laxmikant 2002]
  ● Front based task decomposition method [Tang et al. 2009]

These approaches use only multi-core CPUs or GPUs and do not provide interactive performance for large-scale models

Outline

● Background on CCD
● Inter-CD based parallel CCD
● GPU-based elementary tests
● System summary
● Results

Bounding Volume Hierarchies (BVHs)

● Organize bounding volumes as a tree
● Leaf nodes have triangles

BVH-based Collision Detection

● BVH traversal
  ● The collision test pair queue starts with the root pair (A,X)
  ● Dequeue a pair and perform a BV overlap test
  ● Refine an overlapping pair into pairs of child nodes, e.g., (B,Y), (B,Z), (C,Y), (C,Z)
  ● Self-CD adds pairs within one BVH, e.g., (B,C) and (Y,Z)
● Elementary tests
  ● At leaf nodes, exact collision tests between triangles are done by solving cubic equations [Provot 1997]

[Figure: two BVHs, A with children B and C, and X with children Y and Z.]
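The queue-based BVH traversal can be sketched in Python as follows. This is a minimal illustration, assuming axis-aligned boxes as bounding volumes and a hypothetical `Node` class. It covers only the traversal between two BVHs: self-CD pairs and the exact elementary test at the leaves (the cubic equation solve) are omitted, so the function returns candidate leaf pairs only.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    lo: tuple                      # minimum corner of the bounding box
    hi: tuple                      # maximum corner of the bounding box
    children: list = field(default_factory=list)


def overlaps(a, b):
    """BV overlap test for axis-aligned boxes of any dimension."""
    return all(a.lo[i] <= b.hi[i] and b.lo[i] <= a.hi[i] for i in range(len(a.lo)))


def traverse(root_a, root_b):
    """Return the leaf pairs whose BVs overlap; each would then get an elementary test."""
    queue = deque([(root_a, root_b)])          # collision test pair queue
    leaf_pairs = []
    while queue:
        a, b = queue.popleft()                 # dequeue
        if not overlaps(a, b):                 # BV overlap test
            continue
        if not a.children and not b.children:
            leaf_pairs.append((a.name, b.name))
            continue
        for ca in a.children or [a]:           # refine into child pairs
            for cb in b.children or [b]:
                queue.append((ca, cb))
    return leaf_pairs


# One-dimensional boxes keep the example small; the values are made up.
B = Node("B", (0.0,), (1.0,))
C = Node("C", (2.0,), (3.0,))
A = Node("A", (0.0,), (3.0,), [B, C])
Y = Node("Y", (0.5,), (1.5,))
Z = Node("Z", (5.0,), (6.0,))
X = Node("X", (0.5,), (6.0,), [Y, Z])
candidates = traverse(A, X)
```

Here the root pair (A,X) overlaps and is refined into four child pairs, of which only (B,Y) survives the overlap test.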

Lazy BVH Update

● Update BVs of visited nodes during BVH traversal

● Improve the performance of CD for large-scale models

[Figure: the BVs of A and X do not overlap, so BV updates are not required for their child nodes B, C, Y, and Z.]
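A lazy BV update can be sketched in Python as below. The slides do not say how a node's BV is recomputed, so this sketch assumes that every node stores the indices of the points it bounds and refits a one-dimensional interval from them on its first visit in a frame. `LazyNode`, `visit`, and the frame stamp are hypothetical names for illustration.

```python
class LazyNode:
    def __init__(self, name, point_ids, children=()):
        self.name = name
        self.point_ids = point_ids      # assumed: indices of the points this node bounds
        self.children = list(children)
        self.stamp = -1                 # frame in which the BV was last refit
        self.bounds = None

    def bv(self, points, frame, refit_log):
        """Return the interval BV, refitting it only if it is stale for this frame."""
        if self.stamp != frame:
            xs = [points[i] for i in self.point_ids]
            self.bounds = (min(xs), max(xs))
            self.stamp = frame
            refit_log.append(self.name)
        return self.bounds


def overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]


def visit(a, b, points, frame, refit_log):
    """Traverse a node pair, updating BVs only for the nodes that are actually visited."""
    if not overlap(a.bv(points, frame, refit_log), b.bv(points, frame, refit_log)):
        return
    if not a.children and not b.children:
        return                          # leaf pair: the elementary test would go here
    for ca in a.children or [a]:
        for cb in b.children or [b]:
            visit(ca, cb, points, frame, refit_log)


points = [0.0, 1.0, 5.0, 6.0]           # made-up deformed positions for one frame
A = LazyNode("A", [0, 1], [LazyNode("B", [0]), LazyNode("C", [1])])
X = LazyNode("X", [2, 3], [LazyNode("Y", [2]), LazyNode("Z", [3])])
refit_log = []
visit(A, X, points, frame=0, refit_log=refit_log)
```

Because A and X do not overlap in this example, only those two nodes are refit and the four child nodes are never touched.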

Outline

● Background on CCD
● Inter-CD based parallel CCD
● GPU-based elementary tests
● System overview
● Results

Naïve Parallel Method

[Figure: Thread 1 and Thread 2 take pairs from one collision test pair queue holding (B,Y), (B,Z), (C,Y), (C,Z), (B,C), (Y,Z) for the BVHs A (children B, C) and X (children Y, Z).]

Low Scalability of Naïve Method

[Plot: improvement (times) vs. number of threads for Ideal and Naïve; the naïve curve is labeled 3.5.]

● Two issues
  ● Locking
  ● Load-balancing

Two Issues

[Figure: Thread 1 and Thread 2 take pairs from the collision test pair queue (B,Y), (B,Z), (C,Y), (C,Z), (B,C), (Y,Z) for the BVHs A (children B, C) and X (children Y, Z).]

● Locking
  ● To apply the lazy BVH update method, locking is required to write the BV updates for nodes
  ● Locking serializes multiple threads and lowers the performance of parallel algorithms

Two Issues

[Figure: collision test pair queues for each thread (Thread 1, Thread 2).]

● Locking
● Load-balancing among threads
  ● High variance of the pair-based workload

Our CPU-based Parallel CCD

● Inter-CD based task decomposition
  ● Leads to a lock-free main loop of the parallel algorithm
  ● Enables efficient dynamic load balancing among threads

Inter-CD based Task Decomposition

● Combine all BVHs of objects into one BVH
● Terminology
  ● Inter-CD(N): Inter-collision detection between two sub-BVHs whose roots are two child nodes of the node N
● We represent all collision detection tasks as a set of Inter-CDs

[Figure: a BVH with nodes N, D, R, E, and F.]

Inter-CD Based Task Decomposition

Inter-CD(N)

Self-CDs = Inter-CD(D) + Inter-CD(E) + Inter-CD(F) + etc.

Disjoint Property

[Figure: for two nodes A and B in separate subtrees, the nodes accessed by Inter-CD(A) and Inter-CD(B) are disjoint.]
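Both the decomposition and the disjoint property can be checked on a small example. The Python sketch below uses a hypothetical four-leaf BVH stored as a child map; the node names and the helpers `leaves`, `inter_cd_pairs`, and `accessed_nodes` are assumptions made for illustration.

```python
from itertools import combinations, product

# Hypothetical BVH: N is the root, and E, F, G, H are the leaves.
CHILDREN = {"N": ["D", "R"], "D": ["E", "F"], "R": ["G", "H"]}


def leaves(n):
    """All leaf nodes in the sub-BVH rooted at n."""
    kids = CHILDREN.get(n)
    if not kids:
        return [n]
    return [leaf for k in kids for leaf in leaves(k)]


def inter_cd_pairs(n):
    """Leaf pairs covered by Inter-CD(n): one leaf under each child of n."""
    left, right = CHILDREN[n]
    return {frozenset(p) for p in product(leaves(left), leaves(right))}


def accessed_nodes(n):
    """Nodes that Inter-CD(n) may touch: everything strictly below n."""
    out = set()
    for k in CHILDREN.get(n, ()):
        out |= {k} | accessed_nodes(k)
    return out


per_task = {n: inter_cd_pairs(n) for n in CHILDREN}
covered = set().union(*per_task.values())
all_pairs = {frozenset(p) for p in combinations(leaves("N"), 2)}
```

In this example the three Inter-CD task units together cover every leaf pair exactly once, and the sibling tasks Inter-CD(D) and Inter-CD(R) access disjoint node sets.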

Inter-CD based Serial CCD

● Initialize: the task unit queue of the thread holds the root node A
● Then, repeatedly
  ● Pop a node n
  ● Push children of the node n
  ● Process Inter-CD(n)
● Always satisfy the disjoint property

[Figure: a thread and its task unit queue for a BVH with nodes A, B, C, D, and E.]
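The serial loop (pop a node n, push its children, process Inter-CD(n)) can be written as a short Python sketch. The tree shape and the `process` callback are assumptions for illustration; a real implementation would run the BVH traversal for Inter-CD(n) where the callback is invoked.

```python
from collections import deque


def serial_inter_cd(children, root, process):
    """Run every Inter-CD task unit of the BVH one after another."""
    queue = deque([root])               # initialize the task unit queue
    while queue:
        n = queue.popleft()             # pop a node n
        kids = children.get(n, ())
        if not kids:
            continue                    # a leaf has no Inter-CD task unit
        queue.extend(kids)              # push children of the node n
        process(n)                      # process Inter-CD(n)


# Hypothetical BVH: A has children B and C, and C has children D and E.
order = []
serial_inter_cd({"A": ["B", "C"], "C": ["D", "E"]}, "A", order.append)
```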

Inter-CD based Parallel CCD

● Initial task assignment

[Figure: a BVH with root n1, children n2 and n3, and the front n4, n5, n6, n7. The front nodes are assigned to the threads T1, T2, T3, T4 and satisfy the disjoint property.]

Each thread performs CCD without any locking while lazily updating BVHs

Inter-CD based Dynamic Task Assignment

[Figure: Thread 1 and Thread 2 each have a task unit queue and a scheduling queue. A request from T1 is placed in a scheduling queue, and the front node of a task unit queue is handed over; the front node is likely to cause the highest computational workload.]

● Require locking for scheduling queues, but no locking in the main loop of CD algorithm
● Low locking overhead for task assignment
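The request-and-hand-over scheme can be illustrated with a deterministic, single-threaded Python simulation of two workers. It is only a sketch under assumptions: the workers advance in lockstep instead of running as real threads, processing Inter-CD(n) is replaced by recording n, and the names `Worker`, `step`, and `simulate` are hypothetical. In a real multi-threaded version only the scheduling queue would need a lock.

```python
from collections import deque


class Worker:
    def __init__(self, name):
        self.name = name
        self.tasks = deque()       # task unit queue, touched only by its owner
        self.requests = deque()    # scheduling queue, written by other workers
        self.done = []             # nodes n whose Inter-CD(n) this worker handled


def step(worker, children):
    """Process one task unit, then answer at most one pending request."""
    if worker.tasks:
        n = worker.tasks.popleft()                   # pop a node n
        worker.tasks.extend(children.get(n, ()))     # push its internal children
        worker.done.append(n)                        # stand-in for Inter-CD(n)
    if worker.requests and worker.tasks:
        idle = worker.requests.popleft()
        idle.tasks.append(worker.tasks.popleft())    # hand over the front node


def simulate(children, root):
    t1, t2 = Worker("T1"), Worker("T2")
    t1.tasks.append(root)
    while t1.tasks or t2.tasks:
        for me, other in ((t1, t2), (t2, t1)):
            if not me.tasks and me not in other.requests:
                other.requests.append(me)            # idle worker asks for work
            step(me, children)
    return t1.done, t2.done


# Hypothetical BVH: internal nodes 1..7 in heap order (children of i are 2i and 2i + 1).
INTERNAL = {i: [c for c in (2 * i, 2 * i + 1) if c <= 7] for i in range(1, 8)}
done_t1, done_t2 = simulate(INTERNAL, 1)
```

In this run the second worker receives node 3, the front node of the first worker's queue, and with it the whole sub-BVH below that node.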

Results of Our CPU-based Parallel CCD

[Plot: improvement (times) vs. number of threads for Ideal, Our method, and Naïve; our method is labeled 7.1 and the naïve method 3.5.]

● Remove locking in the main loop of CD
● Employ efficient dynamic load-balancing based on inter-CD task units

Results of Our CPU-based Parallel CCD

Benchmark                            A core     8 cores
Cloth-ball, 94K triangles            2.8 FPS    17.9 FPS
Breaking dragon, 252K triangles      1.2 FPS    7.7 FPS

We will further improve the performance of CCD by using GPUs



Task Distribution

● CCD consists of BVH update, BVH traversal, and elementary tests
● BVH update and traversal: multi-core CPUs
  ● Random accesses
  ● Branch prediction
  ● Caching
● Elementary tests: GPUs
  ● Solving cubic equations
  ● Streaming processors optimized for floating point operations

Simple Interface between CPUs and GPUs

● Each thread sends information of two triangles to GPUs
● Two issues
  ● Data transfer overhead
  ● Each thread maintains its own GPU context

[Figure: threads T1 to T4 each send a pair of two triangles (24 bytes) to the GPUs.]

Asynchronous Data Transfer

● Asynchronously send geometry during BVH update and traversal
  ● Data transfer time is hidden
● At leaf nodes, only send two indices of triangles

[Figure: threads T1 to T4 send geometry to the GPUs.]

Reduce the Data Size

[Figure: threads T1 to T4 send two indices of triangles (8 bytes) to the GPUs instead of a pair of two triangles (24 bytes).]
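The 8-byte figure is what two 32-bit indices occupy. The short Python sketch below packs such a pair with the standard `struct` module; the 32-bit index width and the little-endian layout are assumptions, and the 24-byte value is simply the figure quoted on the slide.

```python
import struct

# Assumed layout: two unsigned 32-bit triangle indices, little-endian.
INDEX_PAIR = struct.Struct("<II")
TRIANGLE_PAIR_BYTES = 24            # size quoted on the slide for a pair of two triangles

packet = INDEX_PAIR.pack(17, 4242)  # made-up triangle indices
first, second = INDEX_PAIR.unpack(packet)
reduction = TRIANGLE_PAIR_BYTES / INDEX_PAIR.size
```

Under these assumptions each elementary test request shrinks by a factor of three, which is possible because the geometry is already on the GPU.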

Reduce the Device Context Maintenance Overhead

● Use the master-slave model

[Figure: slave threads S1 to S4 put index pairs into the triangle index queue (TIQ), a segmented queue. The master sends them to the GPUs in chunks of 16 Kbytes to 256 Kbytes and receives the results.]



Summary of HPCCD

[Figure: on the multi-core CPUs, the slaves perform Inter-CD based parallel BVH update and traversal and put index pairs into the index pair queue. The master sends geometry and index pairs to the GPUs, which perform the elementary tests and return the results.]



Testing Environment

● Machine
  ● One quad-core CPU (Intel i7 CPU, 3.2 GHz)
  ● Two GPUs (Nvidia GeForce GTX285)
● Run eight CPU threads by using Intel’s hyper-threading technology

Breaking Dragon Benchmark

● 252K triangles

● Results
  ● 12.5X speed-up over using a CPU core
  ● 54ms per frame (19 FPS)

Low-Resolution N-body Bench.

● 34K triangles



● 6.8ms per frame (148 FPS)

High-Resolution N-body Bench.

● 146K triangles



● 54ms per frame (19 FPS)

Results of HPCCD

● As the number of GPUs is increased, we get higher performance

Limitation

● Low scalability for small rigid models

Summary

● A novel, hybrid parallel algorithm
  ● Utilize both multi-core CPUs and GPUs
● High scalability
  ● About 13 times performance improvement for CCD by using a quad-core CPU and two GPUs compared with using a single CPU core
● Interactive performance
  ● Show 19-140 FPS for various deformable models consisting of tens or hundreds of thousands of triangles

The implementation code will be available as OpenCCD library (http://sglab.kaist.ac.kr/OpenCCD)

Future Work

● Design efficient GPU-based BVH update and traversal

● Extend our method to heterogeneous architectures
  ● Intel’s Larrabee
  ● AMD’s Fusion
● Apply our method to other proximity queries
  ● Distance queries
  ● Penetration depth

Acknowledgments

● Members of SGLab. in KAIST
● Anonymous reviewers
● Model contributor
  ● UNC GAMMA Research Group
● Funding agencies
  ● KAIST seed grant
  ● Ministry of Knowledge Economy
  ● Samsung
  ● Microsoft Research Asia
  ● Korea Research Foundation

Any Questions?

Q & A

Thanks

http://sglab.kaist.ac.kr/HPCCD

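Segmented TIQ

● Assign two segments to each thread
  ● Removing lock when slaves access TIQ

[Figure: the TIQ is divided into segments marked E (empty), P (partial), or F (full). The slaves S1 and S2 fill their segments and request an empty segment. The master, with its window over the segments, sends GPU task units to the GPUs and receives the results from the GPUs.]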
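A segmented queue with the E, P, and F states can be sketched in Python as follows. This is a simplified, single-threaded illustration: it gives each slave one segment at a time instead of two, it uses tiny segments, and the class `SegmentedTIQ` and its methods are hypothetical names. The point it shows is that a slave appends only to a segment it owns, so appends themselves need no lock.

```python
EMPTY, PARTIAL, FULL = "E", "P", "F"


class SegmentedTIQ:
    """Toy triangle index queue split into fixed-size segments."""

    def __init__(self, num_segments, capacity):
        self.capacity = capacity
        self.segments = [[] for _ in range(num_segments)]
        self.owner = [None] * num_segments    # slave currently filling each segment

    def state(self, i):
        n = len(self.segments[i])
        return EMPTY if n == 0 else FULL if n == self.capacity else PARTIAL

    def request_empty(self, slave):
        """Give an unowned empty segment to a slave (the step that would need a lock)."""
        for i in range(len(self.segments)):
            if self.owner[i] is None and self.state(i) == EMPTY:
                self.owner[i] = slave
                return i
        raise RuntimeError("no empty segment available")

    def push(self, slave, i, pair):
        """Append an index pair to the slave's own segment; return the segment to use next."""
        assert self.owner[i] == slave
        self.segments[i].append(pair)
        if self.state(i) == FULL:
            self.owner[i] = None              # release the full segment to the master
            return self.request_empty(slave)
        return i

    def take_full(self):
        """Master side: turn every released full segment into a GPU task unit."""
        units = []
        for i in range(len(self.segments)):
            if self.owner[i] is None and self.state(i) == FULL:
                units.append(self.segments[i])
                self.segments[i] = []
        return units


tiq = SegmentedTIQ(num_segments=4, capacity=2)
s1 = tiq.request_empty("S1")
s2 = tiq.request_empty("S2")
s1 = tiq.push("S1", s1, (0, 5))
s1 = tiq.push("S1", s1, (1, 6))               # fills the segment, S1 moves on
s2 = tiq.push("S2", s2, (2, 7))
states_before = [tiq.state(i) for i in range(4)]
gpu_task_units = tiq.take_full()
states_after = [tiq.state(i) for i in range(4)]
```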

Appendix

Segment Size in the TIQ

● A small segment size
  ● Leads to a high communication overhead between master and slaves
● A large segment size
  ● Causes GPUs to be idle at the beginning of the BVH traversal
● In our experiments
  ● 2K entries for a segment show the best performance
  ● But, bigger segments show only minor performance degradation


Load Balancing between CPUs and GPUs

● High-end CPUs and low-end GPUs
  ● GPUs do not cover all elementary tests
● The # of GPU task units > 2(the # of GPUs)
  ● Assume GPUs are busy
  ● Slave threads process half of the elementary tests of a full segment, and reuse the segment


Process High-level Nodes

● Processing one by one
  ● Observation
    ● Most nodes were updated during processing low-level nodes
● Parallelizing an ICTPS
  ● Arbitrarily partition the pairs of BV overlap tests into available threads
  ● Use lock when lazy BV update is needed

[Figure: a BVH with nodes n1 to n7, showing the front, the high-level nodes, and the low-level nodes.]


Analysis of Each Component

6.5x performance improvement by using 8 CPU cores

2.8x and 4.6x performance improvement by using one GPU and two GPUs, respectively, compared with using a CPU core

7.5x performance improvement by using 8 CPU cores


Parallel BVH Update

● Collect 2k nodes by using breadth-first traversal (BFT), where k is the number of threads
● Assign the nodes to threads in a round-robin manner
● Use a dynamic assignment method similar to the one used in traversal
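The first two steps can be sketched in Python as below, assuming the BVH is given as a child map. `collect_nodes` and `round_robin` are hypothetical helper names; the dynamic reassignment step and the update of the nodes above the collected ones are not shown.

```python
from collections import deque


def collect_nodes(children, root, target):
    """Expand the BVH breadth-first until `target` nodes are collected or only leaves remain."""
    frontier = deque([root])
    while len(frontier) < target:
        # The oldest node that still has children is expanded next.
        node = next((n for n in frontier if children.get(n)), None)
        if node is None:
            break
        frontier.remove(node)
        frontier.extend(children[node])
    return list(frontier)


def round_robin(nodes, k):
    """Assign node i to thread i mod k."""
    return [nodes[t::k] for t in range(k)]


# Hypothetical BVH in heap order: children of node i are 2i and 2i + 1.
TREE = {i: [2 * i, 2 * i + 1] for i in range(1, 8)}
k = 2                                       # number of threads
nodes = collect_nodes(TREE, 1, 2 * k)
per_thread = round_robin(nodes, k)
```

With two threads the sketch collects four nodes, so each thread starts with two sub-BVHs to update.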

Without using GPUs

[Plots: improvement (times) vs. number of threads, without using GPUs.]

● High scalability



Processing an Inter-CD task unit

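[Figure: processing an Inter-CD task unit for the BVHs A (children B, C) and X (children Y, Z). Pairs such as (B,Y), (B,Z), and (C,Y) are dequeued, given a BV overlap test, and refined.]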
