HPCCD: Hybrid Parallel Continuous Collision Detection using CPUs and GPUs
Duksu Kim, Jae-Pil Heo, Jaehyuk Huh, John Kim, Sung-eui Yoon
http://sglab.kaist.ac.kr/HPCCD
Collision Detection (CD)
● Collision detection is widely used in various applications
  ● Games
  ● Physically-based simulations
  ● Robotics
[Images: from AION; from HUBO Lab. in KAIST; from “Need for Speed”]
Motivations
● Increasing demands for accurate and fast collision detection for deformable models
● Model complexity continues to grow
[Images: courtesy of Prof. Ko; from Creative Assembly’s “Rome: Total War”]
Goals
● Achieve interactive performance for exact collision detection between large-scale deformable models
  ● E.g., deforming models consisting of tens or hundreds of thousands of triangles
<Cloth-ball, 94K triangles> <Breaking dragon, 252K triangles>
Discrete vs. Continuous
● Discrete collision detection (DCD)
  ● Detect collisions at each frame
  ● Fast, but can miss collisions
[Figure: a collision that occurs between Frame 1 and Frame 2 is missed.]
Discrete vs. Continuous
● Discrete collision detection (DCD)
● Continuous collision detection (CCD)
  ● Identify the first time-of-contact (ToC)
  ● Accurate, but requires a long computation time
  ● Not widely used in interactive applications
[Figure: the first time-of-contact (ToC) between Frame 1 and Frame 2.]
Inter- and Self-Collisions
● Inter-collisions
  ● Collisions between two objects
● Self-collisions
  ● Collisions between two regions of a deformable object
  ● Takes a long computation time to detect
[Image: from Govindaraju’s paper]
Parallel Computing Trends
● Many-core architectures
  ● Multi-core CPU architectures
  ● GPU architectures
● Heterogeneous architectures
  ● Intel’s Larrabee and AMD’s Fusion
● Designing parallel algorithms is important to utilize these parallel architectures
Main Contributions
● A novel, hybrid parallel CCD method
  ● Utilize both multi-core CPUs and GPUs
  ● No locking in the main loop of CD
  ● GPU-based exact CD between two triangles
● High scalability
● Interactive performance
Cloth Benchmark (94K triangles)
● Test machine
  ● A quad-core CPU
  ● Two GPUs
● Results
  ● 10.4x speed-up over using a CPU core
  ● 23 ms per frame (43 FPS)
Related Work
● Algorithms specialized for certain types
  ● Rigid objects [Redon et al. 2002]
  ● Articulated bodies [Zhang et al. 2007]
  ● Meshes with fixed topology [Govindaraju et al. 2005, Wong et al. 2005]
● Efficient culling methods
  ● Eliminate redundant elementary tests [Curtis et al. 2008]
  ● Connectivity-based culling [Tang et al. 2008]
Related Work
● GPU-based approaches
  ● Visibility queries [Govindaraju et al. 2005]
  ● Unified GPU-framework for proximity queries [Sud et al. 2006]
● Multi-core CPU-based approaches
  ● Voxel-based collision detection method [Lawlor and Laxmikant 2002]
  ● Front based task decomposition method [Tang et al. 2009]
● These approaches use only multi-core CPUs or only GPUs, and do not provide interactive performance for large-scale models
Outline
● Background on CCD
● Inter-CD based parallel CCD
● GPU-based elementary tests
● System summary
● Results
Bounding Volume Hierarchies (BVHs)
● Organize bounding volumes as a tree
● Leaf nodes have triangles
BVH-based Collision Detection
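● BVH traversal
[Figure: two BVHs, A with children B and C, and X with children Y and Z. The collision test pair queue holds (A,X); the pair is dequeued and a BV overlap test is run on it.]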
BVH-based Collision Detection
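● BVH traversal
[Figure: the same two BVHs, A with children B and C, and X with children Y and Z. Refining (A,X) puts (B,Y), (B,Z), (C,Y), (C,Z) into the collision test pair queue, and self-CD adds (B,C) and (Y,Z). The pair (B,Y) is dequeued for the next BV overlap test.]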
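The dequeue, overlap test, and refine steps on these slides can be sketched in a few lines of Python. This is only an illustration under assumptions the slides do not state: the bounding volumes are axis-aligned boxes, self-collision pairs such as (B,C) are left out, and `Node`, `overlaps`, `traverse`, and `box` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float, float]


@dataclass
class Node:
    name: str
    lo: Point                      # minimum corner of the AABB (assumed BV type)
    hi: Point                      # maximum corner of the AABB
    children: List["Node"] = field(default_factory=list)


def overlaps(a: Node, b: Node) -> bool:
    """BV overlap test: two AABBs overlap if they overlap on every axis."""
    return all(a.lo[i] <= b.hi[i] and b.lo[i] <= a.hi[i] for i in range(3))


def traverse(root_a: Node, root_b: Node) -> List[Tuple[str, str]]:
    """Return the leaf pairs that still need an elementary test."""
    queue = deque([(root_a, root_b)])      # collision test pair queue
    candidates = []
    while queue:
        a, b = queue.popleft()             # dequeue
        if not overlaps(a, b):             # BV overlap test
            continue                       # culled, nothing below can collide
        if not a.children and not b.children:
            candidates.append((a.name, b.name))
            continue
        # Refine: pair up the children (a leaf is paired as itself).
        for ca in a.children or [a]:
            for cb in b.children or [b]:
                queue.append((ca, cb))
    return candidates


def box(name, x0, x1, children=()):
    """Helper: an AABB spanning [x0, x1] in x and [0, 1] in y and z."""
    return Node(name, (x0, 0.0, 0.0), (x1, 1.0, 1.0), list(children))


A = box("A", 0.0, 3.0, [box("B", 0.0, 1.0), box("C", 2.0, 3.0)])
X = box("X", 0.5, 6.0, [box("Y", 0.5, 1.5), box("Z", 5.0, 6.0)])
result = traverse(A, X)   # only the boxes of B and Y overlap
```

In this toy scene three of the four refined pairs are culled by the BV overlap test, so only (B,Y) reaches the elementary tests.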
BVH-based Collision Detection
● BVH traversal
● Elementary tests
  ● At leaf nodes, exact collision tests between triangles are done by solving cubic equations [Provot 1997]
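As a rough illustration of why a cubic appears: if a vertex p and a triangle (a, b, c) all move linearly during a frame, they are coplanar when (p − a) · ((b − a) × (c − a)) = 0, and that expression is a cubic polynomial in the time t. The sketch below is not the authors' solver. It assumes linear motion over t in [0, 1], brackets the first sign change by sampling and then bisects instead of solving the cubic in closed form (so it can miss tangential contacts), and it omits the follow-up check that the vertex lies inside the triangle. All function names are hypothetical.

```python
def sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def lerp(start, end, t):
    return tuple(start[i] + t * (end[i] - start[i]) for i in range(3))


def coplanarity(t, p, a, b, c):
    """Cubic in t; each of p, a, b, c is a (start, end) pair of positions."""
    pt, at, bt, ct = (lerp(s, e, t) for s, e in (p, a, b, c))
    return dot(sub(pt, at), cross(sub(bt, at), sub(ct, at)))


def first_coplanar_time(p, a, b, c, samples=64, iters=50):
    """First t in [0, 1] where the four points become coplanar, or None."""
    prev_t, prev_f = 0.0, coplanarity(0.0, p, a, b, c)
    if prev_f == 0.0:
        return 0.0
    for k in range(1, samples + 1):
        t = k / samples
        f = coplanarity(t, p, a, b, c)
        if f == 0.0:
            return t
        if prev_f * f < 0.0:               # sign change: a root is bracketed
            lo, hi, f_lo = prev_t, t, prev_f
            for _ in range(iters):         # bisection
                mid = 0.5 * (lo + hi)
                f_mid = coplanarity(mid, p, a, b, c)
                if f_lo * f_mid <= 0.0:
                    hi = mid
                else:
                    lo, f_lo = mid, f_mid
            return 0.5 * (lo + hi)
        prev_t, prev_f = t, f
    return None


# A static triangle in the plane z = 0 and a vertex falling through it.
tri_a = ((0.0, 0.0, 0.0), (0.0, 0.0, 0.0))
tri_b = ((1.0, 0.0, 0.0), (1.0, 0.0, 0.0))
tri_c = ((0.0, 1.0, 0.0), (0.0, 1.0, 0.0))
falling = ((0.2, 0.2, 1.0), (0.2, 0.2, -2.0))   # z(t) = 1 - 3t
rising = ((0.2, 0.2, 1.0), (0.2, 0.2, 2.0))     # never reaches z = 0
t_hit = first_coplanar_time(falling, tri_a, tri_b, tri_c)
t_miss = first_coplanar_time(rising, tri_a, tri_b, tri_c)
```

Each such test is independent floating-point arithmetic on a handful of vertices.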
Lazy BVH Update
● Update BVs of visited nodes during BVH traversal
● Improve the performance of CD for large-scale models
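[Figure: two BVHs, A with children B and C, and X with children Y and Z. When the BVs of A and X do not overlap, BV updates are not required for B, C, Y, and Z.]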
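A minimal sketch of the lazy update idea, assuming each node can refit its BV from the vertices it covers and remembers the frame for which the BV is valid. Bounding volumes are 1D intervals here to keep the example short, and `LazyNode`, `collide`, and the refit log are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Interval = Tuple[float, float]


@dataclass
class LazyNode:
    name: str
    vertex_ids: List[int]                  # vertices covered by this subtree
    children: List["LazyNode"] = field(default_factory=list)
    _bv: Optional[Interval] = None
    _frame: int = -1                       # frame for which _bv is valid

    def bv(self, positions, frame, log):
        """Refit the BV only on the first visit in the current frame."""
        if self._frame != frame:
            xs = [positions[i] for i in self.vertex_ids]
            self._bv = (min(xs), max(xs))
            self._frame = frame
            log.append(self.name)          # record that a refit happened
        return self._bv


def overlap(u: Interval, v: Interval) -> bool:
    return u[0] <= v[1] and v[0] <= u[1]


def collide(a, b, positions, frame, log):
    """Traverse two BVHs, updating BVs lazily as nodes are visited."""
    if not overlap(a.bv(positions, frame, log), b.bv(positions, frame, log)):
        return []
    if not a.children and not b.children:
        return [(a.name, b.name)]
    return [pair
            for ca in (a.children or [a])
            for cb in (b.children or [b])
            for pair in collide(ca, cb, positions, frame, log)]


positions = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
A = LazyNode("A", [0, 1, 2, 3], [LazyNode("B", [0, 1]), LazyNode("C", [2, 3])])
X = LazyNode("X", [4, 5, 6, 7], [LazyNode("Y", [4, 5]), LazyNode("Z", [6, 7])])
refits = []
pairs = collide(A, X, positions, frame=1, log=refits)
```

Because A and X do not overlap in this frame, only their two BVs are refit and the four child nodes are never touched.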
Naïve Parallel Method
[Figure: two BVHs, A with children B and C, and X with children Y and Z. Thread 1 and Thread 2 share one collision test pair queue holding (B,Y), (B,Z), (C,Y), (C,Z), (B,C), (Y,Z).]
Low Scalability of Naïve Method
[Plot: improvement (times) versus number of threads for the Ideal and Naïve curves; data label 3.5.]
● Two issues
  ● Locking
  ● Load-balancing
Two Issues
[Figure: the same two BVHs and shared collision test pair queue, processed by Thread 1 and Thread 2, with node Y marked.]
● Locking
  ● To apply the lazy BVH update method, locking is required when threads write BV updates to nodes
  ● Locking serializes multiple threads and lowers the performance of parallel algorithms
Two Issues
[Figure: Thread 1 and Thread 2, each with its own collision test pair queue.]
● Locking
● Load-balancing among threads
  ● High variance of the pair-based workload
Our CPU-based Parallel CCD
● Inter-CD based task decomposition
  ● Leads to a parallel algorithm whose main loop is lock-free
  ● Enables efficient dynamic load balancing among threads
Inter-CD based Task Decomposition
● Combine all BVHs of objects into one BVH
● Terminology
  ● Inter-CD(N): Inter-collision detection between two sub-BVHs whose roots are two child nodes of the node N
● We represent all collision detection tasks as a set of Inter-CDs
[Figure: a BVH rooted at N whose children are D and R; E and F are nodes below them.]
Inter-CD Based Task Decomposition
● Inter-CD(N)
● Self-CDs = Inter-CD(D) + Inter-CD(E) + Inter-CD(F) + etc.
[Figure: the BVH rooted at N, with its child D.]
Disjoint Property
[Figure: two sub-BVHs rooted at A and B.]
● The nodes accessed by Inter-CD(A) and Inter-CD(B) are disjoint
Inter-CD based Serial CCD
[Figure: a BVH with nodes A, B, C, D, and E; one thread processes the task unit queue.]
● Initialize the task unit queue with the root node A
● Pop a node n
● Push children of the node n
● Process Inter-CD(n)
● The task units always satisfy the disjoint property
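The pop, push, and process loop of the serial version can be sketched as follows. To keep the example small it assumes a binary BVH and does no BV culling, so Inter-CD(n) simply lists every leaf pair between the two children of n. `Node`, `inter_cd`, and `serial_ccd` are hypothetical names.

```python
from collections import deque
from itertools import combinations


class Node:
    def __init__(self, name, left=None, right=None):
        self.name, self.left, self.right = name, left, right

    def leaves(self):
        if self.left is None:
            return [self.name]
        return self.left.leaves() + self.right.leaves()


def inter_cd(n):
    """Leaf pairs between the two sub-BVHs rooted at n's children (no culling)."""
    return [(a, b) for a in n.left.leaves() for b in n.right.leaves()]


def serial_ccd(root):
    queue = deque([root])                  # initialize the task unit queue
    pairs = []
    while queue:
        n = queue.popleft()                # pop a node n
        if n.left is None:
            continue                       # a leaf has no Inter-CD task
        queue.extend([n.left, n.right])    # push children of the node n
        pairs.extend(inter_cd(n))          # process Inter-CD(n)
    return pairs


# Root A with children B and C; the leaves t1..t4 stand for triangles.
B = Node("B", Node("t1"), Node("t2"))
C = Node("C", Node("t3"), Node("t4"))
A = Node("A", B, C)
all_pairs = serial_ccd(A)
expected = list(combinations(["t1", "t2", "t3", "t4"], 2))
```

In this small tree the Inter-CDs of A, B, and C together produce every pair of leaves exactly once, which illustrates how a set of Inter-CDs can represent all collision detection tasks.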
Inter-CD based Parallel CCD
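● Initial task assignment
[Figure: a BVH with root n1, children n2 and n3, and the front nodes n4, n5, n6, and n7 assigned to threads T1, T2, T3, and T4. The front nodes satisfy the disjoint property.]
● Each thread performs CCD without any locking while lazily updating BVHs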
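A small sketch of the initial assignment, assuming the front is found by expanding the BVH breadth-first until there is one node per thread (the slides show the front but not how it is computed). It builds a complete binary tree with heap-style names n1..n15, gives one front node to each of four threads, and records which nodes each thread could touch. `build`, `front_nodes`, and `touched` are hypothetical names.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class Node:
    def __init__(self, name):
        self.name, self.left, self.right = name, None, None


def build(i=1, last=15):
    """Complete binary tree where node i has children 2i and 2i + 1."""
    n = Node(f"n{i}")
    if 2 * i + 1 <= last:
        n.left, n.right = build(2 * i, last), build(2 * i + 1, last)
    return n


def front_nodes(root, k):
    """Expand breadth-first until at least k nodes are on the front."""
    front, high = deque([root]), []
    while len(front) < k and any(n.left is not None for n in front):
        n = front.popleft()
        if n.left is None:
            front.append(n)                # a leaf stays on the front
            continue
        high.append(n)                     # high-level node above the front
        front.extend([n.left, n.right])
    return list(front), high


def touched(n):
    """Names of all nodes in the sub-BVH rooted at n."""
    if n is None:
        return set()
    return {n.name} | touched(n.left) | touched(n.right)


root = build()
front, high = front_nodes(root, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    sets = list(pool.map(touched, front))  # one front node per thread
```

The four sets share no node, which is why each thread can lazily update BVs inside its own sub-BVH without locks. The nodes in `high` (n1, n2, n3) lie above the front and are not covered by this assignment.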
Inter-CD based Dynamic Task Assignment
[Figure: Thread 1 and Thread 2, each with a task unit queue and a scheduling queue; a request from T1 is placed in a scheduling queue.]
● The front node of a task unit queue is likely to cause the highest computational workload
● Require locking for scheduling queues, but no locking in the main loop of CD algorithm
  ● Low locking overhead for task assignment
Results of Our CPU-based Parallel CCD
[Plot: improvement (times) versus number of threads for the Ideal, Our method, and Naïve curves; data labels 7.1 and 3.5.]
● Remove locking in the main loop of CD
● Employ efficient dynamic load-balancing based on inter-CD task units
Results of Our CPU-based Parallel CCD
<Cloth-ball, 94K triangles>: 2.8 FPS with a core, 17.9 FPS with 8 cores
<Breaking dragon, 252K triangles>: 1.2 FPS with a core, 7.7 FPS with 8 cores
● We will further improve the performance of CCD by using GPUs
Task Distribution
● CCD consists of BVH update, BVH traversal, and elementary tests
● Multi-core CPUs: BVH update and traversal
  ● Random accesses
  ● Branch prediction
  ● Caching
● GPUs: Elementary tests
  ● Solving cubic equations
  ● Streaming processors optimized for floating-point operations
Simple Interface between CPUs and GPUs
● Each thread sends information of two triangles to GPUs
● Two issues
  ● Data transfer overhead
  ● Each thread maintains its own GPU context
[Figure: threads T1 to T4 each send a pair of two triangles (24 bytes) to the GPUs.]
Asynchronous Data Transfer
● Asynchronously send geometry during BVH update and traversal
  ● Data transfer time is hidden
● At leaf nodes, only send two indices of triangles
[Figure: geometry is sent to the GPUs while threads T1 to T4 work.]
Reduce the Data Size
[Figure: threads T1 to T4 send two indices of triangles (8 bytes) to the GPUs instead of a pair of two triangles (24 bytes).]
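The two sizes can be reproduced with Python's `struct` module under one assumption that the slides do not state: every value is a 32-bit unsigned integer, so a pair of triangles referenced through vertex indices takes six values and a pair of triangle indices takes two. The layouts below are illustrative, not the authors' actual data format.

```python
import struct

# Assumption: a triangle is referenced by three 32-bit vertex indices, so a
# pair of triangles needs six of them. The slides only give the byte counts.
TRIANGLE_PAIR = struct.Struct("<6I")       # 6 x 4 bytes = 24 bytes
INDEX_PAIR = struct.Struct("<2I")          # 2 x 4 bytes = 8 bytes

full = TRIANGLE_PAIR.pack(10, 11, 12, 40, 41, 42)
compact = INDEX_PAIR.pack(3, 17)           # triangle 3 versus triangle 17
sizes = (len(full), len(compact))
decoded = INDEX_PAIR.unpack(compact)
```

With the geometry already on the GPU, each elementary test then costs a third of the per-pair payload on the CPU-to-GPU link.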
Reduce the Device Context Maintenance Overhead
● Use the master-slave model
[Figure: slave threads S1 to S4 put index pairs into a segmented triangle index queue (TIQ). The master sends chunks of 16 Kbytes to 256 Kbytes of index pairs to the GPUs and receives the results.]
Summary of HPCCD
[Figure: on the multi-core CPUs, slave threads run the inter-CD based parallel BVH update and traversal and put index pairs into the index pair queue. The master sends geometry and index pairs to the GPUs, which run the elementary tests and return the results.]
Testing Environment
● Machine
  ● One quad-core CPU (Intel i7 CPU, 3.2 GHz)
  ● Two GPUs (Nvidia GeForce GTX 285)
● Run eight CPU threads by using Intel’s Hyper-Threading Technology
Breaking Dragon Benchmark
● 252K triangles
● Results
  ● 12.5x speed-up over using a CPU core
  ● 54 ms per frame (19 FPS)
Low-Resolution N-body Bench.
● 34K triangles
● Results
  ● 11.4x speed-up over using a CPU core
  ● 6.8 ms per frame (148 FPS)
High-Resolution N-body Bench.
● 146K triangles
● Results
  ● 13.6x speed-up over using a CPU core
  ● 54 ms per frame (19 FPS)
Results of HPCCD
● Performance increases as the number of GPUs increases
Limitation
● Low scalability for small rigid models
Summary
● A novel, hybrid parallel algorithm
  ● Utilize both multi-core CPUs and GPUs
● High scalability
  ● About 13 times performance improvement for CCD by using a quad-core CPU and two GPUs compared with using a single CPU core
● Interactive performance
  ● Show 19-140 FPS for various deformable models consisting of tens or hundreds of thousands of triangles
● The implementation code will be available as the OpenCCD library (http://sglab.kaist.ac.kr/OpenCCD)
Future Work
● Design efficient GPU-based BVH update and traversal
● Extend our method to heterogeneous architectures
  ● Intel’s Larrabee
  ● AMD’s Fusion
● Apply our method to other proximity queries
  ● Distance queries
  ● Penetration depth
Acknowledgments
● Members of SGLab. in KAIST
● Anonymous reviewers
● Model contributor
  ● UNC GAMMA Research Group
● Funding agencies
  ● KAIST seed grant
  ● Ministry of Knowledge Economy
  ● Samsung
  ● Microsoft Research Asia
  ● Korea Research Foundation
Any Questions?
Q & A
Thanks
http://sglab.kaist.ac.kr/HPCCD
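Segmented TIQ
● Assign two segments to each thread
  ● Removes locking when slaves access the TIQ
[Figure: slaves S1 and S2 fill segments of the TIQ, and each segment is marked E (empty), P (partial), or F (full). A slave requests an empty segment when needed. The master moves a window over the segments, sends GPU task units to the GPU, and receives the results from the GPU.]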
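The segment states can be illustrated with a single-threaded simulation. This sketch assumes that each slave owns exactly two fixed-capacity segments and that the master takes a full segment as one GPU task unit. It does not model real concurrency (a threaded version would need atomic state flags), and `Segment`, `SlaveQueue`, and `master_collect` are hypothetical names.

```python
from enum import Enum


class State(Enum):
    EMPTY = "E"
    PARTIAL = "P"
    FULL = "F"


class Segment:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pairs = []

    @property
    def state(self):
        if not self.pairs:
            return State.EMPTY
        if len(self.pairs) == self.capacity:
            return State.FULL
        return State.PARTIAL


class SlaveQueue:
    """Two segments owned by one slave thread (assumed layout)."""

    def __init__(self, capacity):
        self.segments = [Segment(capacity), Segment(capacity)]
        self.current = 0

    def push(self, pair):
        seg = self.segments[self.current]
        if seg.state is State.FULL:
            other = 1 - self.current
            if self.segments[other].state is not State.EMPTY:
                return False               # no empty segment available yet
            self.current = other           # switch to the empty segment
            seg = self.segments[other]
        seg.pairs.append(pair)
        return True


def master_collect(queues):
    """Master turns every full segment into a GPU task unit and empties it."""
    units = []
    for q in queues:
        for seg in q.segments:
            if seg.state is State.FULL:
                units.append(seg.pairs)
                seg.pairs = []
    return units


s1 = SlaveQueue(capacity=2)
for pair in [(0, 1), (2, 3), (4, 5)]:
    s1.push(pair)
before = [seg.state.value for seg in s1.segments]   # one full, one partial
units = master_collect([s1])
after = [seg.state.value for seg in s1.segments]    # full segment was emptied
```

Since a slave only ever writes to its own two segments, slaves never contend with each other for a segment, which is the property the slide highlights.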
Appendix
Segment Size in the TIQ
● A small segment size
  ● Leads to a high communication overhead between master and slaves
● A large segment size
  ● Causes GPUs to be idle at the beginning of the BVH traversal
● In our experiments
  ● 2K entries per segment show the best performance
  ● But larger segments show only minor performance degradation
Appendix
Load Balancing between CPUs and GPUs
● High-end CPUs and low-end GPUs
  ● GPUs do not cover all elementary tests
● The # of GPU task units > 2 × (the # of GPUs)
  ● Assume GPUs are busy
  ● Slave threads process half of the elementary tests of a full segment, and reuse the segment
Appendix
Process High-level Nodes
● Processing one by one
● Observation
  ● Most nodes were updated during processing low-level nodes
● Parallelizing an ICTPS
  ● Arbitrarily partition the pairs of BV overlap tests into available threads
  ● Use lock when lazy BV update is needed
[Figure: a BVH with root n1, children n2 and n3, and front nodes n4, n5, n6, and n7. Nodes above the front are high-level nodes; nodes at and below the front are low-level nodes.]
Appendix
Analysis of Each Component
● 6.5x performance improvement by using 8 CPU cores
● 2.8x and 4.6x performance improvement by using one GPU and two GPUs, respectively, compared with using a CPU core
● 7.5x performance improvement by using 8 CPU cores
Appendix
Parallel BVH Update
● Collect 2k nodes by using BFT (breadth-first traversal), where k is the number of threads
● Assign the nodes to threads in a round-robin manner
● Use a dynamic assignment method similar to the one used in traversal
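The first two steps can be sketched as follows, reading "2k" as two nodes per thread (an assumption about the notation). The BVH is given as a plain child table with heap-style integer node ids, which is a hypothetical layout chosen for brevity, and the dynamic reassignment step is not modeled.

```python
from collections import deque


def collect_nodes(children, root, count):
    """Breadth-first traversal that stops once `count` nodes are collected."""
    frontier, done = deque([root]), []
    while frontier and len(frontier) + len(done) < count:
        n = frontier.popleft()
        kids = children.get(n, [])
        if kids:
            frontier.extend(kids)          # replace the node by its children
        else:
            done.append(n)                 # a leaf cannot be split further
    return done + list(frontier)


def round_robin(nodes, k):
    """Thread i gets nodes i, i + k, i + 2k, ..."""
    return [nodes[i::k] for i in range(k)]


# Complete binary tree: node i has children 2i and 2i + 1 (ids 1..31).
children = {i: [2 * i, 2 * i + 1] for i in range(1, 16)}
k = 2                                      # number of threads
nodes = collect_nodes(children, root=1, count=2 * k)
assignment = round_robin(nodes, k)
```

Collecting more nodes than threads gives each thread several smaller sub-BVHs, so uneven subtree sizes tend to average out before any dynamic reassignment is needed.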
Appendix
Without using GPUs
● High scalability
[Plots: improvement (times) versus number of threads.]
Appendix
Processing an Inter-CD task unit
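[Figure: two sub-BVHs, A with children B and C, and X with children Y and Z. The collision test pair queue holds (B,Y), (B,Z), (C,Y), (C,Z); the pair (B,Y) is dequeued for a BV overlap test and refined if the BVs overlap.]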