GPU and Supercomputer
--Shaowei Su
--Jianbo Yuan
Outline:
❖ GPU
❖ GPGPU and CUDA
❖ CPU + GPU
❖ Supercomputer (Tianhe-1A)
❖ Latest Trends
GPU
❖ Graphics Processing Unit (GPU)
➢ Computational requirements are large
➢ Parallelism is substantial
➢ Throughput is more important than latency
❖ Evolution:
➢ "Real-time graphics performance needed to render complex, high-resolution 3D scenes."
John Nickolls, William J. Dally, "The GPU Computing Era", IEEE Micro, vol. 30, no. 2, pp. 56-69, March/April 2010, doi:10.1109/MM.2010.41
GPU Development
John Nickolls, William J. Dally, "The GPU Computing Era", IEEE Micro, vol. 30, no. 2, pp. 56-69, March/April 2010, doi:10.1109/MM.2010.41
CPU and GPU
CUDA Toolkit Documentation: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
Outline:
❖ GPU
❖ GPGPU and CUDA
❖ CPU + GPU
❖ Supercomputer (Tianhe-1A)
❖ Latest Trends
GPGPU
❖ General-Purpose GPU (GPGPU):
➢ The GPU as a compelling alternative to traditional processors in high-performance computing
➢ Reasons:
■ More cores
■ Massive parallelism
➢ Applications: deep learning, ...
GPGPU Computing Architecture
❖ Fermi:
➢ 3.0 billion transistors
➢ 512 CUDA cores
■ 16 streaming multiprocessors (SMs), each with 32 cores
➢ 64-bit unified addressing, caching hierarchy
➢ Error-Correcting Code (ECC) memory protection
➢ Supports CUDA C, C++, OpenCL, DirectCompute
Fermi Architecture
John Nickolls, William J. Dally, "The GPU Computing Era", IEEE Micro, vol. 30, no. 2, pp. 56-69, March/April 2010, doi:10.1109/MM.2010.41
Fermi SM
❖ 32 cores
❖ 16 load/store units
❖ Four special function units (SFUs)
❖ L1 cache, configurable as either:
➢ 16 KB shared memory and 48 KB L1 cache, or
➢ 48 KB shared memory and 16 KB L1 cache
❖ 128-KB register file
❖ Instruction cache
❖ Two multithreaded warp schedulers
❖ Instruction dispatch units
Processing Flow
❖ Copy data from main memory to GPU memory
❖ CPU instructs the GPU to start processing
❖ GPU executes in parallel on each core
❖ Copy the result from GPU memory back to main memory
CUDA
❖ Compute Unified Device Architecture (CUDA):
➢ Hardware and software coprocessing architecture
➢ Enables NVIDIA GPUs to execute C, C++, Fortran, OpenCL, DirectCompute
➢ Introduced with the GeForce 8800 GTX in 2006
Thread Hierarchy
❖ Threads
➢ Private local memory
❖ Thread Blocks
➢ Block shared memory
❖ Grids of Blocks
➢ Global memory
John Nickolls, William J. Dally, "The GPU Computing Era", IEEE Micro, vol. 30, no. 2, pp. 56-69, March/April 2010, doi:10.1109/MM.2010.41
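A minimal sketch of how the three levels map to code (the kernel name, block size, and launch shape are illustrative assumptions, not from the slides): each thread derives a global index from its block and thread coordinates, each block stages data in __shared__ memory, and the grid as a whole reads and writes global memory.

__global__ void scale(float* data, int n){
    __shared__ float tile[256];                     // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if(i < n) tile[threadIdx.x] = data[i];          // global -> shared
    __syncthreads();                                // barrier: whole block reaches here
    if(i < n) data[i] = 2.0f * tile[threadIdx.x];   // shared -> global
}
// Launched as a grid of blocks, e.g.: scale<<<(n + 255)/256, 256>>>(devData, n);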
Blocks to Warps
John Nickolls, William J. Dally, "The GPU Computing Era", IEEE Micro, vol. 30, no. 2, pp. 56-69, March/April 2010, doi:10.1109/MM.2010.41
Program in CUDA
Program in CUDA
❖ Task:
➢ Input:
u = [1 2 3 4]
➢ Compute:
u[i] = u[i]*u[i] + u[n-i-1]
➢ Output:
u = [5 7 11 17]
int main(){
    float u[4] = {1,2,3,4};
    float tmp[4];
    int i;
    // Snapshot u so every update reads the original values
    for(i = 0; i < 4; i++){
        tmp[i] = u[i];
    }
    for(i = 0; i < 4; i++){
        u[i] = tmp[i]*tmp[i] + tmp[4-i-1];
    }
    return 0;
}
Program in CUDA
❖ __global__:
➢ declares a kernel function
➢ launched as <<<NumBlocks, ThreadsPerBlock>>>
❖ __shared__:
➢ block shared memory
❖ __syncthreads():
➢ barrier for threads within the same block
➢ all global and shared memory accesses made by the threads prior to __syncthreads() are visible to the block
__global__ void example(float* u){
    int i = threadIdx.x;
    __shared__ float tmp[4];  // one element per thread in the block
    tmp[i] = u[i];
    __syncthreads();          // make every tmp[i] visible to the whole block
    u[i] = tmp[i]*tmp[i] + tmp[3-i];
}
int main(){
    float hostU[4] = {1,2,3,4};
    float* devU;
    size_t size = sizeof(float)*4;
    cudaMalloc(&devU, size);            // allocate device memory
    cudaMemcpy(devU, hostU, size,
               cudaMemcpyHostToDevice); // host -> device
    example<<<1,4>>>(devU);             // 1 block of 4 threads
    cudaMemcpy(hostU, devU, size,
               cudaMemcpyDeviceToHost); // device -> host
    cudaFree(devU);
    return 0;
}
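To check the result, one could print hostU after the final cudaMemcpy; with the input above it should read [5 7 11 17], matching the CPU version on the previous slide.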
Outline:
❖ GPU
❖ GPGPU and CUDA
❖ CPU + GPU
❖ Supercomputer (Tianhe-1A)
❖ Latest Trends
CPU + GPU coprocessing
● Why cooperate?
-- CPU: optimized for sequential code
-- GPU: optimized for throughput
CPU + GPU coprocessing
● How to cooperate?
-- (SW) Coprocessing programming: CUDA, ...
-- (HW) GPU-accelerated systems
Issues
❖ Unbalanced workloads across CPUs and GPUs
❖ Low-bandwidth communication between CPUs and GPUs
Two-level dynamic adaptive mapping
❖ Idea: divide the workload with respect to measured performance
➢ among CPUs and GPUs
➢ among cores on one CPU
Two-level dynamic adaptive mapping
❖ First level: map computations to CPUs and GPUs
Step 1: Initialize the fraction of workload assigned to the CPU and GPU according to their peak performance;
Step 2: After one iteration, update the measured CPU and GPU performance;
Step 3: Update the fraction of workload assigned to the CPU and GPU according to their performance in the previous iteration.
Two-level dynamic adaptive mapping
❖ Second level: map computations to cores within a CPU (a sketch of the update rule follows below)
Step 1: Initialize the fraction for each CPU core equally;
Step 2: After one iteration, update each core's measured performance;
Step 3: Update each core's fraction according to its performance in the previous iteration.
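A minimal sketch of the repartitioning rule used at both levels, assuming "performance" is measured as work completed per unit time in the previous iteration (the function and variable names are illustrative, not from the paper):

#include <stddef.h>

// Hypothetical helper: given each worker's measured performance from the
// last iteration, assign it a proportional fraction of the next iteration's
// workload. The same rule applies across {CPU, GPU} and across CPU cores.
void update_fractions(const double* perf, double* fraction, size_t n){
    double total = 0.0;
    for(size_t i = 0; i < n; i++)
        total += perf[i];
    for(size_t i = 0; i < n; i++)
        fraction[i] = perf[i] / total;  // faster workers get more work
}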
Issues
❖ Unbalanced workloads across CPUs and GPUs
❖ Low-bandwidth communication between CPUs and GPUs
Software pipelining
❖ CPU & GPU communication
❖ PCI-E 2.0: 4-8 GB/s
❖ CPU -> PCI-E (from pageable memory): hundreds of MB/s
Software pipelining
CPU & GPU communication:
❖ Pinned memory (memory that is never swapped out to secondary storage)
❖ A limited resource: too much pinned memory decreases host performance
Software pipelining
❖ Methodology:
Assume N tasks, where every task contains three phases:
task = {input, exe, output}
with corresponding times:
Tt = {Ti, Texe, To}
If the pipeline is full:
Tall = Ti + To + N * Texe
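A minimal sketch of this pipeline on an NVIDIA GPU, assuming CUDA streams plus pinned host memory (which, as the previous slide notes, is what allows asynchronous copies) so the input copy of one chunk overlaps the execution of another; the kernel body and chunk geometry are placeholder assumptions, not from the paper:

#include <cuda_runtime.h>
#include <string.h>

// Placeholder kernel standing in for the real per-chunk work (Texe)
__global__ void process(float* buf, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n) buf[i] = buf[i] * buf[i];
}

void pipeline(float* hostIn, int nChunks, int chunkElems){
    size_t bytes = chunkElems * sizeof(float);
    float *pinned, *dev;
    // Staging everything in pinned memory is for brevity; real codes pin less
    cudaMallocHost((void**)&pinned, nChunks * bytes);
    cudaMalloc((void**)&dev, 2 * bytes);   // double buffer on the device
    memcpy(pinned, hostIn, nChunks * bytes);

    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    for(int i = 0; i < nChunks; i++){
        int s = i % 2;  // alternate streams and device buffers
        // Ti of chunk i overlaps Texe of chunk i-1 in the other stream
        cudaMemcpyAsync(dev + s*chunkElems, pinned + i*chunkElems,
                        bytes, cudaMemcpyHostToDevice, stream[s]);
        process<<<(chunkElems + 255)/256, 256, 0, stream[s]>>>(
            dev + s*chunkElems, chunkElems);
        // To of chunk i, ordered after the kernel within stream s
        cudaMemcpyAsync(pinned + i*chunkElems, dev + s*chunkElems,
                        bytes, cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();
    memcpy(hostIn, pinned, nChunks * bytes);

    cudaStreamDestroy(stream[0]); cudaStreamDestroy(stream[1]);
    cudaFree(dev); cudaFreeHost(pinned);
}

With the pipeline full, the copies hide behind kernel execution, approaching the Tall = Ti + To + N * Texe bound above.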
Testing results
❖ Single node:
DGEMM: C = alpha*A*B + beta*C
Tianhe-1: 80 cabinets * 32 nodes, each node = two quad-core Xeon processors + 1 ATI GPU
CPU: DGEMM implemented in the Intel Math Kernel Library (MKL)
ACML-GPU: AMD Core Math Library for Graphic Processors
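For comparison on the NVIDIA side, the same operation is a single library call in cuBLAS (a sketch assuming square n x n column-major matrices already resident on the GPU; the wrapper name is illustrative):

#include <cublas_v2.h>

// C = alpha*A*B + beta*C on the device, via the vendor BLAS
void dgemm_gpu(const double* dA, const double* dB, double* dC, int n){
    cublasHandle_t handle;
    cublasCreate(&handle);
    double alpha = 1.0, beta = 1.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
    cublasDestroy(handle);
}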
Outline:
❖ GPU
❖ GPGPU and CUDA
❖ CPU + GPU
❖ Supercomputer (Tianhe-1A)
❖ Latest Trends
GPU-accelerated supercomputer
Tianhe-1A:
Peak performance: 4.7 petaflops (Linpack performance: 2.6 petaflops)
Ranked #1 on the Top500 list in November 2010 (17th as of April 2015)
Tianhe-1A overview
112 compute racks, 8 service racks, 6 communication racks, and 14 I/O racks
Tianhe-1A overview
❖ 7168 compute nodes (64 nodes per rack); each node = two Intel CPUs + one NVIDIA GPU;
❖ 1024 service nodes; each node = two FT-1000 CPUs;
❖ In total: 14336 Intel Xeon X5670 CPUs, 2048 FT-1000 CPUs, and 7168 NVIDIA Tesla M2050 GPUs;
❖ 7168 nodes * (11.733 GFLOPS per core * 6 CPU cores * 2 sockets + 515.2 GFLOPS per GPU) = 4702.21 double-precision TFLOPS;
❖ Fat-tree network (NRC + NIC, bidirectional bandwidth of 160 Gbps) for data transfer
Interconnection topology
❖ First layer: 480 switching boards distributed across 120 racks (computation and service racks);
❖ Second layer: 11 fully connected 384-port switches
High-radix routing chip (NRC)
❖ High-radix router:
http://pages.cs.wisc.edu/~isca2005/slides/07A-02.PDF
The router's total bandwidth B is divided among its 2k input and output channels, so the channel bandwidth is b = B / 2k;
the number of hops H under uniform traffic is at least 2 log_k N;
t_r is the delay at each router.
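Combining these quantities gives the standard zero-load latency expression (consistent with the cited slides; L here denotes packet length and is an added symbol, not from the deck):

T = H \cdot t_r + L / b = 2 \log_k N \cdot t_r + 2kL / B

Raising the radix k shortens the path (fewer hops, smaller H * t_r) but thins each channel (larger L / b); a good radix balances the two terms.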
High-radix routing chip (NRC)
❖ NRC: a 16*16 switch chip;
❖ XBAR = 4*4 crossbar;
I/O storage system
❖ Lustre file system (a centralized parallel file system):
-- Metadata Server: stores namespace metadata (filenames, directories, permissions);
-- Object Storage Server: stores file data;
-- Central storage: secure, efficient, flexible
Outline:
❖ GPU
❖ GPGPU and CUDA
❖ CPU + GPU
❖ Supercomputer (Tianhe-1A)
❖ Latest Trends
Latest trends
❖ AMD Accelerated Processing Unit (Fusion)
-- x86 CPU cores and GPU on the same silicon die
-- high-speed block transfer engines
Latest trends
❖ Tianhe-2: neo-heterogeneous architecture
➢ Intel Xeon Phi: same ISA as the host CPUs, with a unified programming model
➢ Streamlines the development and optimization process
Latest trends
❖ Top500:
#1: Tianhe-2;
#2: Titan;
…
#17: Tianhe-1A;
…
Latest trends
❖ End of Tianhe:
http://www.wsj.com/articles/u-s-agencies-block-technology-exports-for-supercomputer-in-china-1428561987
References
❖ Xiangke Liao, Liquan Xiao, "MilkyWay-2 supercomputer: system and application", 2014
❖ John Kim, Stanford University, http://pages.cs.wisc.edu/~isca2005/slides/07A-02.PDF
❖ Canqun Yang, Feng Wang, "Adaptive optimization for petascale heterogeneous CPU/GPU computing", IEEE, 2010
❖ Min Xie, Yutong Lu, "Tianhe-1A interconnection and message-passing services", 2012
References
❖ J. Nickolls, W. J. Dally, "The GPU Computing Era", IEEE Micro, 2010
❖ Xuejun Yang, Xiangke Liao, "The TianHe-1A Supercomputer: Its Hardware and Software", 2011
❖ http://en.wikipedia.org/wiki/Xeon_Phi
❖ http://www.infocellar.com/networks/ip/hop-count.htm
❖ http://en.wikipedia.org/wiki/CUDA_Pinned_memory
❖ http://www.systemfabricworks.com/products/system-fabric-works-storage-solutions/lustre-hadoop