Munich, 09-11-2018
Pedro Valero-Lara, Iván Martínez-Pérez, Antonio J. Peña, Xavier Martorell, Raül Sirvent, and Jesús Labarta
www.bsc.es
Simulating the Behavior of the Human Brain on NVIDIA GPUs
(Human Brain Project)
H2020 FET Flagship Project
● Accelerate the fields of neuroscience, computing, and brain-related medicine
● 8 different sub-projects
  ➔ Sub-Project 7: High Performance Analytics and Computing
● WP 7.5: Providing support for the migration of simulation codes to hybrid and/or accelerator-enabled architectures
● 86x10⁹ (86 billion) neurons
  ➔ ~80,000 Volta GPUs
● Steps:
  ➔ Neuron generation → once at the very beginning
  ➔ Solving voltage capacitance
  ➔ Synapses (spiking) → communication
Human Brain Project (HBP)
Hines Method
● Ax = b
  ➔ where A is a Hines matrix (3 vectors)
  ➔ Similar to a tridiagonal system (Thomas method)
  ➔ 8xN operations
  ➔ Vector p → branches (parent indices)
Solving Voltage Capacitance – Hines Method
void hines_solver(double *a, double *b, double *d, double *rhs, int *p, int cell_size)
{
    double factor;
    // backward sweep: eliminate each node into its parent p[i]
    for (int i = cell_size - 1; i > 0; --i) {
        factor = a[i] / d[i];
        d[p[i]] -= factor * b[i];
        rhs[p[i]] -= factor * rhs[i];
    }
    rhs[0] /= d[0];
    // forward sweep: back-substitution from the root down
    for (int i = 1; i < cell_size; ++i) {
        rhs[i] -= b[i] * rhs[p[i]];
        rhs[i] /= d[i];
    }
}
cuHinesBatch
● Saturate the GPU with a high number of neurons
  ➔ 1 thread per neuron
  ➔ No synchronizations
  ➔ No atomic operations
● Data layouts
  ➔ Flat
    ● No coalescing
  ➔ Full-Interleaved
    ● Coalescing
    ● Big jumps in memory
  ➔ Block-Interleaved
    ● Coalescing
    ● Small jumps in memory
Implementation of cuHinesBatch
Test Case: Flat
● K80 NVIDIA GPU (Kepler)
  ➔ 4992 CUDA cores
  ➔ 24 GB GDDR5
● Input (Hines matrices)
  ➔ 300 elements and 2 branches
  ➔ Double precision
  ➔ BatchSize: 512; 5,120; 51,200; 512,000
● Setting
  ➔ cudaFuncSetCacheConfig(LBM, cudaFuncCachePreferL1)
  ➔ numactl --interleave=all
  ➔ -O3 -openmp -arch=compute_35
Performance of cuHinesBatch: Flat
Test Case: Full-Interleaved
● K80 NVIDIA GPU (Kepler)
  ➔ 4992 CUDA cores
  ➔ 24 GB GDDR5
● Input (Hines matrices)
  ➔ 300 elements and 2 branches
  ➔ Double precision
  ➔ BatchSize: 512; 5,120; 51,200; 512,000
● Setting
  ➔ cudaFuncSetCacheConfig(LBM, cudaFuncCachePreferL1)
  ➔ numactl --interleave=all
  ➔ -O3 -openmp -arch=compute_35
Performance of cuHinesBatch: Full-Interleaved
Test Case: Block-Interleaved
● K80 NVIDIA GPU (Kepler)
  ➔ 4992 CUDA cores
  ➔ 24 GB GDDR5
● Input (Hines matrices)
  ➔ 300 elements and 2 branches
  ➔ Double precision
  ➔ BatchSize = 512,000
● Setting
  ➔ cudaFuncSetCacheConfig(LBM, cudaFuncCachePreferL1)
  ➔ numactl --interleave=all
  ➔ -O3 -openmp -arch=compute_35
Performance of cuHinesBatch: Block-Interleaved
Test Case: Real Neurons
● K80 NVIDIA GPU (Kepler)
  ➔ 4992 CUDA cores
  ➔ 24 GB GDDR5
● Input (Hines matrices)
  ➔ 6 different morphologies
    ● Small, medium, and big
    ● Low (10%) and high (50%) #branches
    ● http://www.neuromorpho.org/
  ➔ BatchSize: 256; 2,560; 25,600; 256,000
● Setting
  ➔ Block-Interleaved
  ➔ cudaFuncSetCacheConfig(LBM, cudaFuncCachePreferL1)
  ➔ numactl --interleave=all
  ➔ -O3 -openmp -arch=compute_35
Performance of cuHinesBatch on Real Neurons
Test Case: Pascal
● P100 NVIDIA GPU (Pascal)
  ➔ 3584 CUDA cores
  ➔ 16 GB HBM2
● Input (Hines matrices)
  ➔ Medium size
  ➔ Low (10%) #branches
  ➔ BatchSize = 256,000
● Setting
  ➔ Full-Interleaved
  ➔ numactl --interleave=all
  ➔ -O3 -openmp -arch=compute_62
● NVPROF
  ➔ High occupancy (99.5%)
  ➔ High bandwidth (500 GB/s)
  ➔ No memory issues
Performance of cuHinesBatch on Pascal
Test Case: cuThomasBatch
● 1 logical GPU (K40) of a K80 NVIDIA GPU (Kepler)
  ➔ 2496 CUDA cores
  ➔ 16 GB GDDR5
● Input (Hines matrices)
  ➔ System size: 64; 128; 256; 512 → BatchSize: 256; 2,560; 25,600; 256,000
  ➔ System size: 1,024; 2,048; 4,096; 8,192 → BatchSize: 20; 200; 2,000; 20,000
● Setting
  ➔ cusparseDgtsvStridedBatch
  ➔ cuThomasBatch
● Results (cuThomasBatch vs. cuSPARSE)
  ➔ 1.2-2.8x faster
  ➔ 4x more precise
  ➔ 2x less memory occupancy
Performance of cuHinesBatch: cuThomasBatch
Test Case: Multi-Morphology
● 1 logical GPU (K40) of a K80 NVIDIA GPU (Kepler)
  ➔ 2496 CUDA cores
  ➔ 16 GB GDDR5
● Input (Hines matrices)
  ➔ Different morphologies
    ● Mono-morphology → same size
    ● Multi-morphology → different sizes
  ➔ System size: 1,024; 2,048; 4,096; 8,192
  ➔ BatchSize = 25,600
  ➔ 10% and 50% of #branches
● Setting
  ➔ Full-Interleaved
  ➔ Padding
Performance of cuHinesBatch: Multi-Morphology
cuHinesBatch
● High performance (50x faster than sequential CPU)
● Good scaling even with a very high number of neurons
  ➔ 1 thread per neuron (Hines system)
  ➔ Full-Interleaved data layout
  ➔ Faster than using one CUDA block per system
cuThomasBatch
● Data layout transformation (from flat to full-interleaved)
  ➔ Once at the very beginning of the simulation
● Performance drop for multi-morphology batches
  ➔ 2 approaches: cuThomasBatch per segment, cusparseDgtsvStridedBatch per segment
Conclusions & Future Work
Pedro Valero-Lara, Iván Martínez-Pérez, Antonio J. Peña, Xavier Martorell, Raül Sirvent, and Jesús Labarta: cuHinesBatch: Solving Multiple Hines Systems on GPUs (Human Brain Project). ICCS 2017: 566-575
Pedro Valero-Lara, Iván Martínez-Pérez, Raül Sirvent, Xavier Martorell, and Antonio J. Peña: NVIDIA GPUs Scalability to Solve Multiple (Batch) Tridiagonal Systems: Implementation of cuThomasBatch. PPAM 2017
cuHinesBatch repository: https://pm.bsc.es/gitlab/imartin1/cuHinesBatch
Acknowledgements: This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 720270
(HBP SGA1), from the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII (TIN2015-65316-P) and the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Programació i Entorns d’Execució Paral·lels (2014-SGR-1051). We thank the support of NVIDIA through the BSC/UPC NVIDIA GPU Center of Excellence. Antonio J. Peña is cofinanced by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva fellowship number IJCI-2015-23266.
References & Acknowledgments