Munich, 09-11-2018
Pedro Valero-Lara, Iván Martínez-Pérez, Antonio J. Peña, Xavier Martorell, Raül Sirvent, and Jesús Labarta
www.bsc.es
Simulating the Behavior of the Human Brain on NVIDIA GPUs
(Human Brain Project)
H2020 FET Flagship Project
● Accelerate the fields of neuroscience, computing, and brain-related medicine
● 8 different sub-projects
  ➔ Sub-Project 7: High Performance Analytics and Computing
● WP 7.5: Providing support for the migration of simulation codes to hybrid and/or accelerator-enabled architectures
● 86x10⁹ (86 billion) neurons
  ➔ ~80,000 Volta GPUs
● Steps:
  ➔ Neuron generation → once at the very beginning
  ➔ Solving voltage capacitance
  ➔ Synapses (spiking) → communication
Human Brain Project (HBP)
Hines Method
● Ax = b
  ➔ where A is a Hines matrix (3 vectors)
  ➔ Similar to a tridiagonal system (Thomas method)
  ➔ 8xN operations
  ➔ Vector p → branches (parent indices)
Solving Voltage Capacitance – Hines Method
void hines_solver(double *a, double *b, double *d, double *rhs, int *p, int cell_size)
{
    double factor;
    // backward sweep: eliminate each node into its parent p[i]
    for (int i = cell_size - 1; i > 0; --i) {
        factor = a[i] / d[i];
        d[p[i]] -= factor * b[i];
        rhs[p[i]] -= factor * rhs[i];
    }
    rhs[0] /= d[0];
    // forward sweep: back-substitution from the root down
    for (int i = 1; i < cell_size; ++i) {
        rhs[i] -= b[i] * rhs[p[i]];
        rhs[i] /= d[i];
    }
}
cuHinesBatch
● Saturate the GPU with a high number of neurons
  ➔ 1 thread per neuron
  ➔ No synchronizations
  ➔ No atomic operations
● Data layouts
  ➔ Flat
    ● No coalescing
  ➔ Full-Interleaved
    ● Coalescing
    ● Big jumps in memory
  ➔ Block-Interleaved
    ● Coalescing
    ● Small jumps in memory
Implementation of cuHinesBatch
Test Case: Flat
● K80 NVIDIA GPU (Kepler)
  ➔ 4992 CUDA cores
  ➔ 24 GB GDDR5
● Input (Hines matrices)
  ➔ 300 elements and 2 branches
  ➔ Double precision
  ➔ BatchSize: 512; 5,120; 51,200; 512,000
● Setting
  ➔ cudaFuncSetCacheConfig(LBM, cudaFuncCachePreferL1)
  ➔ numactl --interleave=all
  ➔ -O3 -openmp -arch=compute_35
Performance of cuHinesBatch: Flat
Test Case: Full-Interleaved
● K80 NVIDIA GPU (Kepler)
  ➔ 4992 CUDA cores
  ➔ 24 GB GDDR5
● Input (Hines matrices)
  ➔ 300 elements and 2 branches
  ➔ Double precision
  ➔ BatchSize: 512; 5,120; 51,200; 512,000
● Setting
  ➔ cudaFuncSetCacheConfig(LBM, cudaFuncCachePreferL1)
  ➔ numactl --interleave=all
  ➔ -O3 -openmp -arch=compute_35
Performance of cuHinesBatch: Full-Interleaved
Test Case: Block-Interleaved
● K80 NVIDIA GPU (Kepler)
  ➔ 4992 CUDA cores
  ➔ 24 GB GDDR5
● Input (Hines matrices)
  ➔ 300 elements and 2 branches
  ➔ Double precision
  ➔ BatchSize = 512,000
● Setting
  ➔ cudaFuncSetCacheConfig(LBM, cudaFuncCachePreferL1)
  ➔ numactl --interleave=all
  ➔ -O3 -openmp -arch=compute_35
Performance of cuHinesBatch: Block-Interleaved
Test Case: Real Neurons
● K80 NVIDIA GPU (Kepler)
  ➔ 4992 CUDA cores
  ➔ 24 GB GDDR5
● Input (Hines matrices)
  ➔ 6 different morphologies
    ● Small, medium, and big
    ● Low (10%) and high (50%) #branches
    ● http://www.neuromorpho.org/
  ➔ BatchSize: 256; 2,560; 25,600; 256,000
● Setting
  ➔ Block-Interleaved
  ➔ cudaFuncSetCacheConfig(LBM, cudaFuncCachePreferL1)
  ➔ numactl --interleave=all
  ➔ -O3 -openmp -arch=compute_35
Performance of cuHinesBatch on Real Neurons
Test Case: Pascal
● P100 NVIDIA GPU (Pascal)
  ➔ 3584 CUDA cores
  ➔ 16 GB HBM2
● Input (Hines matrices)
  ➔ Medium size
  ➔ Low (10%) #branches
  ➔ BatchSize = 256,000
● Setting
  ➔ Full-Interleaved
  ➔ numactl --interleave=all
  ➔ -O3 -openmp -arch=compute_62
● NVPROF
  ➔ High occupancy (99.5%)
  ➔ High bandwidth (500 GB/s)
  ➔ No memory issues
Performance of cuHinesBatch on Pascal
Test Case: cuThomasBatch
● 1 logical GPU (K40) of a K80 NVIDIA GPU (Kepler)
  ➔ 2496 CUDA cores
  ➔ 16 GB GDDR5
● Input (Hines matrices)
  ➔ System size: 64; 128; 256; 512 → BatchSize: 256; 2,560; 25,600; 256,000
  ➔ System size: 1,024; 2,048; 4,096; 8,192 → BatchSize: 20; 200; 2,000; 20,000
● Setting
  ➔ cusparseDgtsvStridedBatch
  ➔ cuThomasBatch
● Results (cuThomasBatch vs. cuSPARSE)
  ➔ 1.2-2.8x faster
  ➔ 4x more precise
  ➔ 2x less memory occupancy
Performance of cuHinesBatch: cuThomasBatch
Test Case: Multi-Morphology
● 1 logical GPU (K40) of a K80 NVIDIA GPU (Kepler)
  ➔ 2496 CUDA cores
  ➔ 16 GB GDDR5
● Input (Hines matrices)
  ➔ Different morphologies
    ● Mono-morphology → same size
    ● Multi-morphology → different sizes
  ➔ System size: 1,024; 2,048; 4,096; 8,192
  ➔ BatchSize = 25,600
  ➔ 10% and 50% of #branches
● Setting
  ➔ Full-Interleaved
  ➔ Padding
Performance of cuHinesBatch: Multi-Morphology
cuHinesBatch
● High performance (50x faster than sequential CPU)
● Good scaling even with a very high number of neurons
  ➔ 1 thread per neuron (Hines system)
  ➔ Full-Interleaved data layout
  ➔ Faster than using one CUDA block per system
cuThomasBatch
● Data layout transformation (from flat to full-interleaved)
  ➔ Once at the very beginning of the simulation
● Performance drop for multi-morphology batches
  ➔ 2 approaches: cuThomasBatch per segment, cusparseDgtsvStridedBatch per segment
Conclusions & Future Work
Pedro Valero-Lara, Iván Martínez-Pérez, Antonio J. Peña, Xavier Martorell, Raül Sirvent, and Jesús Labarta: cuHinesBatch: Solving Multiple Hines Systems on GPUs (Human Brain Project). ICCS 2017: 566-575
Pedro Valero-Lara, Iván Martínez-Pérez, Raül Sirvent, Xavier Martorell, and Antonio J. Peña: NVIDIA GPUs Scalability to Solve Multiple (Batch) Tridiagonal Systems: Implementation of cuThomasBatch. PPAM 2017
cuHinesBatch repository: https://pm.bsc.es/gitlab/imartin1/cuHinesBatch
Acknowledgements: This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 720270
(HBP SGA1), from the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII (TIN2015-65316-P) and the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Programació i Entorns d’Execució Paral·lels (2014-SGR-1051). We thank the support of NVIDIA through the BSC/UPC NVIDIA GPU Center of Excellence. Antonio J. Peña is cofinanced by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva fellowship number IJCI-2015-23266.
References & Acknowledgments