On Dynamic Load Balancing on Graphics Processors
Daniel Cederman and Philippas Tsigas
Chalmers University of Technology
Overview
• Motivation
• Methods
• Experimental evaluation
• Conclusion
The problem setting
[Figure: a block of work split into tasks, with the labels Offline and Online.]
Static Load Balancing
[Figure, built up over five slides: four processors, four tasks, and four subtasks.]
Dynamic Load Balancing
[Figure: four processors, four tasks, and four subtasks.]
Task sharing
[Flowchart: each worker loops over the steps Work done? → Try to get task → Got task? (No, retry) → Perform task → New tasks? (No, continue) → Add task, and ends at Done. The shared Task Set offers the operations Check condition, Acquire Task, and Add Task.]
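The task-sharing loop can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the `Task` type, the splitting rule, and the names `TaskSet` and `run_worker` are hypothetical, and the task set is a plain unsynchronized vector used by a single worker, so an empty set is taken to mean that the work is done.

```cpp
#include <cassert>
#include <vector>

// Hypothetical task: an integer size; a task larger than 1 splits in two.
struct Task { int size; };

// Minimal task set with the operations named in the flowchart.
// Not thread-safe: a set shared between thread blocks would need one of
// the synchronization schemes covered on the later slides.
class TaskSet {
public:
    void add_task(Task t) { tasks_.push_back(t); }
    bool acquire_task(Task& out) {
        if (tasks_.empty()) return false;
        out = tasks_.back();
        tasks_.pop_back();
        return true;
    }
    bool empty() const { return tasks_.empty(); }
private:
    std::vector<Task> tasks_;
};

// Runs the loop and returns the number of tasks performed.
int run_worker(TaskSet& set) {
    int performed = 0;
    Task t;
    while (set.acquire_task(t)) {      // "Try to get task" / "Got task?"
        assert(t.size >= 1);
        ++performed;                   // "Perform task"
        if (t.size > 1) {              // "New tasks?"
            set.add_task({t.size / 2});            // "Add task"
            set.add_task({t.size - t.size / 2});
        }
    }
    return performed;                  // "Done": the set is empty
}
```

With several workers, an empty set no longer implies that all work is done, because another worker may still add tasks; that is why the flowchart has a separate "Work done?" check and a retry path.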
System Model
• CUDA
• Global Memory
• Gather and scatter
• Compare-And-Swap
• Fetch-And-Inc
• Multiprocessors
• Maximum number of concurrent thread blocks
[Figure: three multiprocessors, each with three thread blocks, and a shared Global Memory.]
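The semantics of the two atomic primitives can be illustrated with host-side C++ atomics. This is a sketch, not CUDA code: the wrapper names `compare_and_swap` and `fetch_and_inc` are hypothetical, and the closest CUDA device functions are assumed to be `atomicCAS` and `atomicAdd`.

```cpp
#include <atomic>
#include <cassert>

// Compare-And-Swap: store `desired` only if the word still holds `expected`.
// Returns the value observed before the operation.
inline int compare_and_swap(std::atomic<int>& word, int expected, int desired) {
    // On failure, compare_exchange_strong writes the observed value into
    // `expected`; on success `expected` already equals the old value.
    word.compare_exchange_strong(expected, desired);
    return expected;
}

// Fetch-And-Inc: add one and return the previous value.
inline int fetch_and_inc(std::atomic<int>& word) {
    return word.fetch_add(1);
}
```

A caller detects a successful swap by comparing the returned value with the `expected` argument it passed in.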
Synchronization
• Blocking: uses mutual exclusion so that only one process at a time can access the object.
• Lock-free: multiple processes can access the object concurrently. At least one operation in a set of concurrent operations finishes in a finite number of its own steps.
• Wait-free: multiple processes can access the object concurrently. Every operation finishes in a finite number of its own steps.
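A typical lock-free operation is a compare-and-swap retry loop. The sketch below is an illustrative example chosen here, not something taken from the talk: `atomic_max` (a hypothetical name) raises a shared value to at least `value`.

```cpp
#include <atomic>
#include <cassert>

// Lock-free "store the maximum". The loop retries only when the CAS fails,
// and a failed strong CAS means the stored value changed in the meantime,
// i.e. some other thread's update went through. An individual call may
// retry repeatedly under contention, so the operation is lock-free but
// not wait-free.
void atomic_max(std::atomic<int>& target, int value) {
    int seen = target.load();
    while (seen < value && !target.compare_exchange_strong(seen, value)) {
        // compare_exchange_strong refreshed `seen`; the condition is re-checked.
    }
}
```

By contrast, a single fetch-and-increment has no retry loop at all, which is what makes it attractive as a building block on hardware that provides it.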
Load Balancing Methods
• Blocking Task Queue
• Non-blocking Task Queue
• Task Stealing
• Static Task List
Blocking queue
[Figure, animated over five slides: thread blocks TB 1, TB 2, ..., TB n share one queue with Head and Tail pointers and a flag labeled Free; a task T1 appears in the queue from the third slide on.]
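A blocking queue of this kind can be sketched as a bounded FIFO guarded by a spin lock. The code below is a host-side C++ illustration under assumptions: the class name, the `int` task payload, and the fixed capacity are hypothetical, and the lock flag is assumed to correspond to the Free label in the figure.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Bounded FIFO queue protected by a spin lock built from compare-and-swap.
template <std::size_t Capacity>
class BlockingQueue {
public:
    bool enqueue(int task) {
        lock();
        bool ok = (tail_ - head_) < Capacity;      // room left?
        if (ok) { slots_[tail_ % Capacity] = task; ++tail_; }
        unlock();
        return ok;
    }
    bool dequeue(int& task) {
        lock();
        bool ok = head_ != tail_;                  // anything queued?
        if (ok) { task = slots_[head_ % Capacity]; ++head_; }
        unlock();
        return ok;
    }
private:
    void lock() {
        bool expected = false;                     // false means "free"
        while (!locked_.compare_exchange_weak(expected, true)) {
            expected = false;                      // the failed CAS overwrote it
        }
    }
    void unlock() { locked_.store(false); }

    std::atomic<bool> locked_{false};
    std::size_t head_ = 0;   // index of the next task to dequeue
    std::size_t tail_ = 0;   // index of the next free slot
    int slots_[Capacity];
};
```

Every operation, including a dequeue on an empty queue, has to take the lock first, so all thread blocks serialize on one memory word.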
Non-blocking Queue
[Figure, animated over six slides: thread blocks TB 1, TB 2, ..., TB n operate on a queue with Head and Tail pointers that holds tasks T1 to T4; a fifth task T5 is added in the last two slides.]
Reference: P. Tsigas and Y. Zhang, A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems [SPAA 01]
Task stealing
[Figure, animated over seven slides: thread blocks TB 1, TB 2, ..., TB n and two groups of tasks, T1 and T3 T2. Tasks T4 and T5 are added next to T1 and removed again in reverse order; then T1 and T3 are removed, leaving T2.]
Reference: N. S. Arora, R. D. Blumofe, and C. G. Plaxton, Thread Scheduling for Multiprogrammed Multiprocessors [SPAA 98]
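The stealing policy can be sketched as one task deque per worker: the owner adds and removes tasks at one end, and an idle worker takes a task from the other end of someone else's deque. The code below is a simplification under stated assumptions: it guards each deque with a spin lock for brevity, whereas the deque of Arora et al. is lock-free, and the names `WorkerDeque` and `get_task` as well as the round-robin victim order are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// One deque per worker (thread block), guarded by a spin lock for brevity.
class WorkerDeque {
public:
    void push(int task) { lock(); tasks_.push_back(task); unlock(); }
    // The owner takes the most recently added task.
    bool pop(int& task) {
        lock();
        bool ok = !tasks_.empty();
        if (ok) { task = tasks_.back(); tasks_.pop_back(); }
        unlock();
        return ok;
    }
    // A thief takes the oldest task from the opposite end.
    bool steal(int& task) {
        lock();
        bool ok = !tasks_.empty();
        if (ok) { task = tasks_.front(); tasks_.pop_front(); }
        unlock();
        return ok;
    }
private:
    void lock() { while (busy_.exchange(true)) {} }
    void unlock() { busy_.store(false); }
    std::atomic<bool> busy_{false};
    std::deque<int> tasks_;
};

// Try the worker's own deque first, then every other deque in turn.
bool get_task(std::vector<WorkerDeque>& deques, std::size_t self, int& task) {
    if (deques[self].pop(task)) return true;
    for (std::size_t i = 1; i < deques.size(); ++i) {
        std::size_t victim = (self + i) % deques.size();
        if (deques[victim].steal(task)) return true;
    }
    return false;
}
```

Because owners and thieves work at opposite ends, they only contend when a deque is nearly empty, which is the property the lock-free version exploits.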
Static Task List
[Figure, animated over six slides: a list labeled In holds tasks T1 to T4; thread blocks TB 1 to TB 4 are added, then a list labeled Out, and then new tasks T5, T6, and T7 appear one at a time.]
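One reading of the static task list is a two-array scheme: each thread block processes one entry of the In list, new tasks are appended to the Out list through an atomic counter, and the Out list becomes the In list of the next pass. The sketch below follows that reading under assumptions: it runs the blocks of a pass sequentially on the host, and the names `OutList` and `run_passes`, the capacity parameter, and the depth-based workload are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Output list: a preallocated array plus an atomic cursor. Concurrent
// producers reserve distinct slots with fetch_add.
class OutList {
public:
    explicit OutList(std::size_t capacity) : slots_(capacity), count_(0) {}
    void append(int task) {
        std::size_t slot = count_.fetch_add(1);
        assert(slot < slots_.size());   // capacity must be large enough
        slots_[slot] = task;
    }
    // Returns the appended tasks; call only after the pass has finished.
    std::vector<int> take() {
        slots_.resize(count_.load());
        return std::move(slots_);
    }
private:
    std::vector<int> slots_;
    std::atomic<std::size_t> count_;
};

// Hypothetical workload: a task of depth d > 0 creates two tasks of depth d - 1.
// Returns the total number of tasks performed.
int run_passes(std::vector<int> in, std::size_t capacity) {
    int performed = 0;
    while (!in.empty()) {
        OutList out(capacity);
        // On a GPU, each iteration would be one thread block of a kernel launch.
        for (std::size_t block = 0; block < in.size(); ++block) {
            ++performed;
            if (in[block] > 0) {
                out.append(in[block] - 1);
                out.append(in[block] - 1);
            }
        }
        in = out.take();   // the Out list becomes the next In list
    }
    return performed;
}
```

The only synchronization inside a pass is the single fetch-and-increment per appended task; the price is one pass (kernel launch) per generation of tasks.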
Octree Partitioning
• Bandwidth bound
Four-in-a-row
• Computation intensive
Graphics Processors
8800GT
• 14 Multiprocessors
• 57 GB/sec bandwidth
9600GT
• 8 Multiprocessors
• 57 GB/sec bandwidth
Blocking Queue – Octree/9600GT
[Figure: Time (ms) plotted against Threads (16 to 128) and Blocks (16 to 128).]
Blocking Queue – Octree/8800GT
[Figure: Time (ms) plotted against Threads (16 to 128) and Blocks (16 to 128).]
Blocking Queue – Four-in-a-row
[Figure: Time (ms) plotted against Threads (16 to 128) and Blocks (16 to 128).]
Non-blocking Queue – Octree/9600GT
[Figure: Time (ms) plotted against Threads (16 to 128) and Blocks (16 to 128).]
Non-blocking Queue – Octree/8800GT
[Figure: Time (ms) plotted against Threads (16 to 128) and Blocks (16 to 128).]
Non-blocking Queue – Four-in-a-row
[Figure: Time (ms) plotted against Threads (16 to 128) and Blocks (16 to 128).]
Task stealing – Octree/9600GT
[Figure: Time (ms) plotted against Threads (16 to 128) and Blocks (16 to 128).]
Task stealing – Octree/8800GT
[Figure: Time (ms) plotted against Threads (16 to 128) and Blocks (16 to 128).]
Task stealing – Four-in-a-row
[Figure: Time (ms) plotted against Threads (16 to 128) and Blocks (16 to 128).]
Static List
[Figure: Time (ms) plotted against Threads/Block (8 to 128) for three series: Octree 9600GT, Octree 8800GTS, and Four-in-a-row.]
Octree Comparison
[Figure: Time (ms) plotted against Particles (thousands, 100 to 500) for four series: Blocking Queue, Non-Blocking Queue, Static List, and Work Stealing.]
Previous work
• Korch M., Rauber T., A comparison of task pools for dynamic load balancing of irregular algorithms, Concurrency and Computation: Practice & Experience, 16, 2003
• Heirich A., Arvo J., A competitive analysis of load balancing strategies for parallel ray tracing, Journal of Supercomputing, 12, 1998
• Foley T., Sugerman J., KD-tree acceleration structures for a GPU raytracer, Graphics Hardware 2005
Conclusion
• Synchronization plays a significant role in dynamic load balancing
• Lock-free data structures and synchronization scale well and look promising for general-purpose GPU programming as well
• Locks perform poorly
• It is good that operations such as CAS and FAA (fetch-and-add) have been introduced in the new GPUs
• Work stealing could outperform static load balancing
Thank you!
http://www.cs.chalmers.se/~dcs