Parallel Rendering on Hybrid Multi-GPU Clusters

Stefan Eilemann, Blue Brain Project
May 2012

Page 2: Parallel Rendering

• Based on Equalizer and Collage
• Standard framework for parallel rendering
• Per-process threads: main, receive, command, image transmit
• Per-GPU threads: render, transfer/asynchronous readback

[Figure: per-node thread diagram: Collage receive, command, and transmit threads, plus one render and one transfer thread per GPU]
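The thread layout above can be sketched as follows. This is a minimal illustration with hypothetical names, not the Equalizer/Collage API: four per-process service threads plus one render and one transfer thread per GPU.

```python
import threading

# Per-process service threads and per-GPU worker roles, as on the slide.
PROCESS_THREADS = ("main", "receive", "command", "transmit")
GPU_THREADS = ("render", "transfer")

def thread_layout(num_gpus):
    """Return the names of all threads one node process would run."""
    names = list(PROCESS_THREADS)
    for gpu in range(num_gpus):
        names += [f"gpu{gpu}.{role}" for role in GPU_THREADS]
    return names

def spawn(names, worker):
    """Start one thread per name and wait for all of them to finish."""
    threads = [threading.Thread(target=worker, args=(n,), name=n) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

ran = []
lock = threading.Lock()

def record(name):
    with lock:
        ran.append(name)

layout = thread_layout(3)   # three GPUs per node, as on the benchmark machines
spawn(layout, record)
```

With three GPUs this yields ten threads per process: four service threads plus a render/transfer pair per GPU.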

Page 3: Hybrid Multi-GPU Clusters

• Mixed shared/distributed memory
• NUMA topology within each node
• Cost-effective
• We’re back in ’95, and got a cluster

[Figure: node architecture: two six-core processors, each with local RAM; three GPUs and the network interfaces attached via the NUMA topology]

Page 4: Parallel Rendering on Hybrid Multi-GPU Clusters€¦ · May 2012 Blue Brain Project - Stefan Eilemann Hybrid Multi-GPU Clusters 3 • 13 nodes, 2x Xeon X5690, 6 cores, 3.47GHz •

• 13 nodes,2x Xeon X5690,6 cores, 3.47GHz

• 2x 12GB RAM• 3x GTX580,

3GB RAM• 10 GBit ethernet• Used 11 nodes

May 2012 Blue Brain Project - Stefan Eilemann

Benchmark Hardware

4

Page 5: Benchmark Software

• Synthetic: eqPly
– PLY renderer using a kd-tree
– 4x David 1mm, >200 MTris
– Realistic camera path
• Real-world: RTNeuron
– Visualizes neocortical column simulations
– Almost worst-case data structure
– Transparency, LOD, CUDA-based culling

Page 6: RTNeuron Sort-Last

• Round-robin decomposition
– Better load balance
– RGB + depth compositing
– No transparency
• Spatial decomposition
– kd-tree with #GPU leaves, clip planes
– Compact regions
– RGBA compositing
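The two decompositions above can be contrasted in a few lines. This is a sketch, not RTNeuron's implementation, assuming each primitive is reduced to a scalar position along one axis:

```python
def round_robin(items, num_gpus):
    """Assign item i to GPU i % num_gpus: balanced, but spatially scattered."""
    return [items[g::num_gpus] for g in range(num_gpus)]

def spatial(items, num_gpus):
    """Sort by position and cut into num_gpus contiguous ranges, like a
    kd-tree with one leaf per GPU: compact regions, but load may vary."""
    ordered = sorted(items)
    n = len(ordered)
    bounds = [round(g * n / num_gpus) for g in range(num_gpus + 1)]
    return [ordered[bounds[g]:bounds[g + 1]] for g in range(num_gpus)]

positions = [0.8, 0.1, 0.5, 0.9, 0.3, 0.2]   # toy primitive positions
rr = round_robin(positions, 3)
sp = spatial(positions, 3)
```

Round-robin gives every GPU a statistically similar share of the data, but fragments depth order, which is why it needs RGB + depth compositing and cannot handle transparency; the spatial split keeps compact regions that can be composited in depth order with plain RGBA.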

Page 7: RTNeuron Sort-Last

[Figure: "DB, Full Cortical Column": frames per second vs. number of GPUs (3 to 33) for the spatial and round-robin decompositions, with the speedup due to the optimization (0% to 70%) on the secondary axis]

Page 8: Thread Placement

• Readback penalty
– GPU 1 -> Processor 1: ~250 MPx/s
– GPU 1 -> Processor 2: ~120 MPx/s

[Figure: node architecture diagram as on page 3, showing the readback paths from GPU 1 to both processors]
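The penalty can be made concrete with a quick calculation. The readback rates are the ones quoted above; the viewport size is a hypothetical example:

```python
# Time to read back one frame at a hypothetical 2560x1440 viewport, using
# the measured rates: ~250 MPx/s to the GPU's own processor, ~120 MPx/s
# across the NUMA boundary to the other one.

pixels = 2560 * 1440                 # ~3.7 MPx per frame
local_ms = pixels / 250e6 * 1e3      # readback to the local processor
remote_ms = pixels / 120e6 * 1e3     # readback to the remote processor

# Crossing the NUMA boundary roughly doubles the per-frame readback time.
penalty = remote_ms / local_ms
```

At ~15 ms vs. ~31 ms per frame, a misplaced readback thread alone can consume the entire frame budget of an interactive application.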

Page 9: Thread Placement

• Automatic thread affinity
– Render and readback threads to the GPU’s ‘processor’
– IO threads to the primary network interface’s ‘processor’
• Based on the hwloc library and the NV-CONTROL X extension

[Figure: node architecture diagram with the main, recv, cmd, xmit, draw, and read threads placed on specific cores]
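The placement policy above can be expressed as a simple mapping from thread role to core set. This is a sketch with a hand-written topology, not the hwloc API; the device-to-socket attachment is assumed for illustration (in the real system hwloc discovers it, and the NV-CONTROL X extension maps an OpenGL display to the physical GPU):

```python
# Hypothetical per-node topology: socket -> core set, and the socket each
# device hangs off (all three GPUs and the primary NIC on socket 0 here).
CORES = {0: {0, 1, 2, 3, 4, 5}, 1: {6, 7, 8, 9, 10, 11}}
DEVICE_SOCKET = {"gpu0": 0, "gpu1": 0, "gpu2": 0, "nic0": 0, "nic1": 1}

def affinity(thread, gpu=None):
    """Return the core set a thread of the given role should be bound to."""
    if thread in ("render", "readback"):
        return CORES[DEVICE_SOCKET[gpu]]      # near the GPU's processor
    if thread in ("receive", "transmit"):
        return CORES[DEVICE_SOCKET["nic0"]]   # near the primary NIC
    return CORES[0] | CORES[1]                # other threads may float

render_mask = affinity("render", "gpu1")
xmit_mask = affinity("transmit")
```

A real implementation would pass these masks to hwloc's thread-binding call instead of returning them.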

Page 10: Parallel Rendering on Hybrid Multi-GPU Clusters€¦ · May 2012 Blue Brain Project - Stefan Eilemann Hybrid Multi-GPU Clusters 3 • 13 nodes, 2x Xeon X5690, 6 cores, 3.47GHz •

linear ROI off ROI on correct incorrect linear ROI off ROI on DB AFF DB bad AFF speedup speedup speedup improvement3915212733

10.1655 10.1655 10.4781 10.4899 10.4652 6.6491 6.6491 6.64151 4.99638 4.96776 0% 3% -1.14 0.5830.4965 24.9284 26.2072 26.5807 26.0064 19.9473 14.783 14.9979 6.07068 6.0736 2% 5% 14.54 -0.0550.8275 37.3315 38.4213 38.6116 37.8874 33.2455 19.8859 20.1192 6.501 6.52079 2% 3% 11.73 -0.3071.1585 45.4641 47.3652 47.2156 45.3518 46.5437 12.6358 11.2892 7.43733 7.5014 4% 4% -106.57 -0.8591.4895 48.9431 54.1541 54.4186 52.4529 59.8419 9.3947 9.66107 4% 11% 28.35

111.8205 47.5766 58.5061 59.2201 55.5736 73.1401 8.06721 7.19165 7% 23% -108.53

0

12

24

36

48

60

3 9 15 21 27 330%

1%

3%

4%

6%

7%2D, Thread Affinity, 4xDavid

Fram

es p

er S

econ

d

Number of GPUs

Spee

dup

due

to o

ptim

izatio

n

incorrectcorrectspeedup

0

10

20

30

40

50

60

3 9 15 21 27 330%

4%

8%

13%

17%

21%

25%2D, Region of Interest , 4xDavid

Fram

es p

er S

econ

d

Number of GPUs

Spee

dup

due

to o

ptim

izatio

n

ROI offROI onspeedup

15%

12%

9%

6%

3%

0

4

8

12

16

20

3 9 15 21 27 33

DB, Region of Interest , 4xDavid

Fram

es p

er S

econ

d

Number of GPUs

ROI onROI offspeeduplinear

5%

4%

3%

2%

1%

Sp

May 2012 Blue Brain Project - Stefan Eilemann

Thread Placement

10

Page 11: Asynchronous Readback

• Pipeline the GPU->CPU transfer with the next frame
• One additional, lazy transfer thread per GPU
• Extension of the compression plugin API

[Figure: frame timelines: synchronously, draw and readback alternate in the render thread before the transmit thread compresses and sends; asynchronously, the render thread only starts each readback, the download thread finishes and compresses it, and the transmit thread sends to the destination node]
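The asynchronous pipeline above can be modelled with three threads handing frames through queues. This is a sketch of the structure only; the stage functions stand in for the GL readback and the compression-plugin calls:

```python
import queue
import threading

def pipeline(frames):
    """Render thread starts each readback, download thread finishes and
    compresses it, transmit thread sends; stages overlap across frames."""
    to_download, to_send, sent = queue.Queue(), queue.Queue(), []

    def render():
        for f in frames:
            to_download.put(f"frame{f}:rb-started")    # draw + start readback
        to_download.put(None)                          # end-of-stream marker

    def download():
        while (item := to_download.get()) is not None:
            to_send.put(item + ":compressed")          # finish RB + compress
        to_send.put(None)

    def transmit():
        while (item := to_send.get()) is not None:
            sent.append(item + ":sent")                # to destination node

    threads = [threading.Thread(target=t) for t in (render, download, transmit)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sent

out = pipeline(range(3))
```

The render thread never blocks on a finished readback, which is exactly what lets the GPU start drawing frame n+1 while frame n is still being transferred and compressed.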

Page 12: Parallel Rendering on Hybrid Multi-GPU Clusters€¦ · May 2012 Blue Brain Project - Stefan Eilemann Hybrid Multi-GPU Clusters 3 • 13 nodes, 2x Xeon X5690, 6 cores, 3.47GHz •

linear synchronous asynchronous correct incorrect DB DB ROI DB AFF DB bad AFF DB async improvement speedup improvement improvement3915212733

10.1655 10.1655 10.9995 10.2424 10.1572 6.6491 6.64151 4.60805 4.96776 5.35856 4.19 8% -0.11 -7.2430.4965 24.9284 27.815 25.6116 25.0967 14.783 14.9979 5.64881 6.0736 9.78578 10.26 12% 1.45 -6.9950.8275 37.3315 40.6761 38.1282 36.7865 19.8859 20.1192 6.16123 6.52079 7.56609 18.24 9% 1.17 -5.5171.1585 45.4641 49.2602 45.3566 44.2131 12.6358 11.2892 7.05779 7.5014 7.99894 12.93 8% -10.66 -5.9191.4895 48.9431 54.5398 48.9431 47.7336 9.3947 9.66107 7.65548 7.7944 12.67 11% 2.84

111.8205 47.5766 60.6196 47.5766 44.5481 8.06721 7.19165 6.08177 6.50634 33.99 27% -10.85

0

10

20

30

40

50

3 9 15 21 27 33

2D, Thread Affinity, 4xDavid

Fram

es p

er S

econ

d

Number of GPUs

incorrectcorrectimprovementlinear

10%

8%

6%

4%

2%

0

10

20

30

40

50

60

70

3 9 15 21 27 330%

4%

9%

13%

17%

21%

26%

30%2D, Asynchronous Readback , 4xDavid

Fram

es p

er S

econ

d

Number of GPUs

Spee

dup

due

to o

ptim

izatio

n

synchronousasynchronousspeedup

0

4

8

12

16

20

3 9 15 21 27 33

DB, Region of Interest , 4xDavid

Fram

e pe

r Sec

ond

Category TitleDB DB ROI improvement

15%

12%

9%

6%

3%

May 2012 Blue Brain Project - Stefan Eilemann

Asynchronous Readback

12

Page 13: Parallel Rendering on Hybrid Multi-GPU Clusters€¦ · May 2012 Blue Brain Project - Stefan Eilemann Hybrid Multi-GPU Clusters 3 • 13 nodes, 2x Xeon X5690, 6 cores, 3.47GHz •

• Reduce pixel data during compositing• Optimize 2D load-balancer

– refined load grid– less oscillation

May 2012 Blue Brain Project - Stefan Eilemann

Region of Interest

13
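The core of the region-of-interest idea is to find the bounding box of the pixels a GPU actually rendered, so that readback and compositing only touch that region. A minimal sketch over a toy alpha mask (hypothetical helper, not the Equalizer implementation):

```python
def region_of_interest(alpha):
    """Return (x, y, w, h) of the non-zero pixels, or None if none are set."""
    rows = [y for y, row in enumerate(alpha) if any(row)]
    cols = [x for x in range(len(alpha[0])) if any(row[x] for row in alpha)]
    if not rows:
        return None
    return (min(cols), min(rows),
            max(cols) - min(cols) + 1, max(rows) - min(rows) + 1)

frame = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 1, 0, 0],
         [0, 0, 0, 0]]

roi = region_of_interest(frame)   # only a 2x2 block of the 4x4 frame moves
```

The same per-region information also feeds the refined load grid of the 2D load-balancer mentioned above.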

Page 14: Parallel Rendering on Hybrid Multi-GPU Clusters€¦ · May 2012 Blue Brain Project - Stefan Eilemann Hybrid Multi-GPU Clusters 3 • 13 nodes, 2x Xeon X5690, 6 cores, 3.47GHz •

0

1

2

3

4

5

6

3 9 15 21 27 33-2.0%

-1.3%

-0.7%

0%

0.7%

1.3%

2.0%

2.7%

3.3%

4.0%Round-Robin DB, Region of Interest, Full Cortical Column

Fram

es p

er S

econ

d

Number of GPUs

spee

dup

due

to o

ptim

izatio

n

ROI offROI onspeedup

May 2012 Blue Brain Project - Stefan Eilemann

Region of Interest

14

Page 15: Parallel Rendering on Hybrid Multi-GPU Clusters€¦ · May 2012 Blue Brain Project - Stefan Eilemann Hybrid Multi-GPU Clusters 3 • 13 nodes, 2x Xeon X5690, 6 cores, 3.47GHz •

0

1

2

3

4

5

6

7

3 9 15 21 27 33-5%

0%

5%

10%

15%

20%

25%

30%Spatial DB, Region of Interest, Full Cortical Column

Fram

es p

er S

econ

d

Number of GPUs

spee

dup

due

to o

ptim

izatio

n

ROI offROI onspeedup

May 2012 Blue Brain Project - Stefan Eilemann

Region of Interest

15

Page 16: Parallel Rendering on Hybrid Multi-GPU Clusters€¦ · May 2012 Blue Brain Project - Stefan Eilemann Hybrid Multi-GPU Clusters 3 • 13 nodes, 2x Xeon X5690, 6 cores, 3.47GHz •

linear ROI off ROI on correct incorrect linear ROI off ROI on DB AFF DB bad AFF speedup speedup speedup improvement3915212733

10.1655 10.1655 10.4781 10.2424 10.1572 6.6491 6.6491 6.64151 4.99638 4.96776 1% 3% -1.14 0.5830.4965 24.9284 26.2072 25.6116 25.0967 19.9473 14.783 14.9979 6.07068 6.0736 2% 5% 14.54 -0.0550.8275 37.3315 38.4213 38.1282 36.7865 33.2455 19.8859 20.1192 6.501 6.52079 4% 3% 11.73 -0.3071.1585 45.4641 47.3652 45.3566 44.2131 46.5437 12.6358 11.2892 7.43733 7.5014 3% 4% -106.57 -0.8591.4895 48.9431 54.1541 48.9431 47.7336 59.8419 9.3947 9.66107 3% 11% 28.35

111.8205 47.5766 58.5061 47.5766 44.5481 73.1401 8.06721 7.19165 7% 23% -108.53

0

10

20

30

40

50

3 9 15 21 27 330%

1%

3%

4%

6%

7%2D, Thread Affinity, 4xDavid

Fram

es p

er S

econ

d

Number of GPUs

Spee

dup

due

to o

ptim

izatio

n

incorrectcorrectspeedup

0

10

20

30

40

50

60

3 9 15 21 27 330%

4%

8%

13%

17%

21%

25%2D, Region of Interest , 4xDavid

Fram

es p

er S

econ

d

Number of GPUs

Spee

dup

due

to o

ptim

izatio

n

ROI offROI onspeedup

15%

12%

9%

6%

3%

0

4

8

12

16

20

3 9 15 21 27 33

DB, Region of Interest , 4xDavid

Fram

es p

er S

econ

d

Number of GPUs

ROI onROI offspeeduplinear

5%

4%

3%

2%

1%

Sp

May 2012 Blue Brain Project - Stefan Eilemann

Region of Interest

16

Page 17: Parallel Rendering on Hybrid Multi-GPU Clusters€¦ · May 2012 Blue Brain Project - Stefan Eilemann Hybrid Multi-GPU Clusters 3 • 13 nodes, 2x Xeon X5690, 6 cores, 3.47GHz •

May 2012 Blue Brain Project - Stefan Eilemann

Multi-Thread vs Multi-Process

17

• Multi-process ‘MPI mode’– Increased memory usage, especially for sort-first– Increased inter-node communication cost

• Multi-threaded– Driver overhead– Memory bandwidth contention for sort-first

Page 18: Parallel Rendering on Hybrid Multi-GPU Clusters€¦ · May 2012 Blue Brain Project - Stefan Eilemann Hybrid Multi-GPU Clusters 3 • 13 nodes, 2x Xeon X5690, 6 cores, 3.47GHz •

linear Multithreaded Multiprocess speedup391521273339

10.464 10.464 10.7337 3%31.392 26.4688 27.4492 4%52.32 38.7724 40.427 4%

73.248 47.5555 49.3087 4%94.176 54.6644 55.2898 1%

115.104 57.188 60.8846 6%136.032

0

10

20

30

40

50

60

3 9 15 21 27 330%

1%

2%

4%

5%

6%

7%2D, Multi-Process , 4xDavid

Fram

es p

er S

econ

d

Number of GPUs

Spee

dup

due

to o

ptim

izatio

n

MultithreadedMultiprocessspeedup

May 2012 Blue Brain Project - Stefan Eilemann

Multi-Thread vs Multi-Process

18

Page 19: Parallel Rendering on Hybrid Multi-GPU Clusters€¦ · May 2012 Blue Brain Project - Stefan Eilemann Hybrid Multi-GPU Clusters 3 • 13 nodes, 2x Xeon X5690, 6 cores, 3.47GHz •

May 2012 Blue Brain Project - Stefan Eilemann

glFinish

19

linear ROI off ROI on correct incorrect linear ROI off ROI on DB AFF DB bad AFF improvement improvement improvement improvement3915212733

10.394 10.394 10.3644 10.2424 10.1572 6.6491 6.6491 6.64151 4.99638 4.96776 8.39 -2.85 -1.14 0.5831.182 25.9273 26.1 25.6116 25.0967 19.9473 14.783 14.9979 6.07068 6.0736 20.52 6.66 14.54 -0.05

51.97 37.7265 38.0058 38.1282 36.7865 33.2455 19.8859 20.1192 6.501 6.52079 36.47 7.40 11.73 -0.3072.758 45.9399 45.7622 45.3566 44.2131 46.5437 12.6358 11.2892 7.43733 7.5014 25.86 -3.87 -106.57 -0.8593.546 52.1795 51.8578 48.4795 47.7336 59.8419 9.3947 9.66107 15.63 -6.17 28.35

114.334 54.7485 54.1058 39.5859 44.5481 73.1401 8.06721 7.19165 -111.39 -11.74 -108.53

0

10

20

30

40

50

3 9 15 21 27 33

2D, Thread Affinity, 4xDavid

Fram

es p

er S

econ

d

Number of GPUs

incorrectcorrectimprovementlinear

5%

4%

3%

2%

1%

0

10

20

30

40

50

60

3 9 15 21 27 33

2D, Region of Interest , 4xDavid

Fram

es p

er S

econ

d

Number of GPUs

ROI offROI onimprovementlinear

40%

30%

20%

10%

0

4

8

12

16

20

3 9 15 21 27 33

DB, Region of Interest , 4xDavid

Fram

es p

er S

econ

d

Number of GPUs

ROI onROI offimprovementlinear

5%

4%

3%

2%

1%

linear synchronous asynchronous improvement3915212733

6.6491 6.6491 7.69532 15.73476109519.9473 14.783 16.5024 11.63092741733.2455 19.8859 18.8186 -5.36711941646.5437 12.6358 13.228 4.686683866559.8419 9.3947 9.40126 0.069826604473.1401 8.06721 8.36264 3.6621087092

0

5

10

15

20

3 9 15 21 27 33

direct send, asynchronous readback, 4xDavid

Fram

es p

er S

econ

d

Number of GPUs

synchronousasynchronousimprovementlinear

linear with finish without finish speedup3915212733

6.80103 6.80103 4.60805 48%20.40309 14.6088 5.64881 159%34.00515 18.3673 6.16123 198%47.60721 10.8445 7.05779 54%61.20927 9.31828 7.65548 22%74.81133 7.86052 6.08177 29%

0

4

8

12

16

20

3 9 15 21 27 330%

40%

80%

120%

160%

200%DB Direct Send, 4xDavid

Fram

es p

er S

econ

d

Number of GPUs

with finishwithout finishspeedup

linear synchronous asynchronous speedup3915212733

6.6491 6.6491 7.69532 16%19.9473 14.783 16.5024 12%33.2455 18.1 18.8186 4%46.5437 12.6358 13.228 5%59.8419 9.3947 9.40126 0%73.1401 8.06721 8.36264 4%

0

4

8

12

16

20

3 9 15 21 27 330%

3%

6%

10%

13%

16%DB Direct Send, asynchronous readback, 4xDavid

Fram

es p

er S

econ

d

Number of GPUs

synchronousasynchronousspeedup
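The speedup column on this slide is simply the FPS ratio minus one; a quick check against three of the with/without-glFinish rows:

```python
def speedup(fps_opt, fps_base):
    """Relative gain of the optimized rate over the baseline rate."""
    return fps_opt / fps_base - 1

# (GPUs, FPS with glFinish, FPS without glFinish) from this slide's data.
rows = [(3, 6.80103, 4.60805), (9, 14.6088, 5.64881), (15, 18.3673, 6.16123)]
gains = {gpus: round(speedup(with_f, without_f) * 100)
         for gpus, with_f, without_f in rows}
```

This reproduces the 48%, 159%, and 198% figures reported for 3, 9, and 15 GPUs.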

Page 20: Conclusions

• Order of importance:
– glFinish
– Async readback (2D) or ROI (DB)
– Thread placement
• The user shouldn’t need to care
• Time-consuming to implement all of them

Page 21: Future Work

• RDMA support and benchmarking
• RTNeuron view frustum culling improvements
• Subpixel FSAA compounds for RTNeuron
– Improve visual quality, not performance
• Asynchronous uploads

Page 22: Acknowledgements

• Blue Brain Project, EPFL; CeSViMa, UPM; Visualization and MultiMedia Lab, University of Zürich
• Digital Michelangelo Project, Stanford 3D Scanning Repository
• Swiss National Science Foundation grant 200020-129525
• Spanish Ministry of Science and Innovation grant TIN2010-21289-C02-01/02
• http://www.open-mpi.org/projects/hwloc/
• http://github.com/Eyescale/Equalizer/