
GPU and Supercomputer

--Shaowei Su

--Jianbo Yuan

Outline:

❖ GPU

❖ GPGPU and CUDA

❖ CPU + GPU

❖ SuperComputer (TianHe 1A)

❖ Latest Trends


GPU

❖ Graphics Processing Unit (GPU)
➢ Computational requirements are large

➢ Parallelism is substantial

➢ Throughput matters more than latency

❖ Evolution:
➢ “Real-time graphics performance needed to render complex, high-resolution 3D scenes.”

John Nickolls, William J. Dally, "The GPU Computing Era", IEEE Micro, vol. 30, no. 2, pp. 56-69, March/April 2010, doi:10.1109/MM.2010.41


GPU Development

John Nickolls, William J. Dally, "The GPU Computing Era", IEEE Micro, vol. 30, no. 2, pp. 56-69, March/April 2010, doi:10.1109/MM.2010.41

CPU and GPU

CUDA Toolkit documentation: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

Outline:

❖ GPU

❖ GPGPU and CUDA

❖ CPU + GPU

❖ SuperComputer (TianHe 1A)

❖ Latest Trends

GPGPU

❖ General-Purpose GPU:
➢ The GPU as a compelling alternative to traditional processors in high-performance computing

➢ Reason:

■ More cores

■ Massive parallelism

➢ Applications: Deep Learning, …


GPGPU Computing Architecture

❖ Fermi:
➢ 3.0 billion transistors

➢ 512 CUDA cores

■ 16 streaming multiprocessors (SM), each with 32 cores

➢ 64-bit unified addressing, caching hierarchy

➢ Error-Correcting Code (ECC) memory protection

➢ Supports CUDA C, C++, OpenCL, DirectCompute
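Several of these features can be checked at runtime. A minimal sketch (not from the slides) using the standard CUDA runtime API to query an attached device:

#include <stdio.h>
#include <cuda_runtime.h>

int main(){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                          // query device 0
    printf("SMs: %d\n", prop.multiProcessorCount);              // 16 on a full Fermi
    printf("ECC enabled: %d\n", prop.ECCEnabled);
    printf("Unified addressing: %d\n", prop.unifiedAddressing);
    return 0;
}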

Fermi Architecture

John Nickolls, William J. Dally, "The GPU Computing Era", IEEE Micro, vol. 30, no. 2, pp. 56-69, March/April 2010, doi:10.1109/MM.2010.41

Fermi SM
❖ 32 cores
❖ 16 load/store units
❖ Four special function units (SFUs)
❖ 64-KB L1 cache / shared memory, configurable per kernel (see the sketch below):
➢ 16 KB shared memory and 48 KB L1 cache, or
➢ 48 KB shared memory and 16 KB L1 cache
❖ 128-KB register file
❖ Instruction cache
❖ Two multithreaded warp schedulers with instruction dispatch units
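The shared-memory/L1 split is selectable from host code. A minimal sketch, assuming a hypothetical kernel named myKernel:

// Request the 48KB-shared/16KB-L1 configuration for one kernel;
// cudaFuncCachePreferL1 requests the opposite split.
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);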

Processing Flow
❖ Copy data from main memory to GPU memory
❖ CPU instructs the GPU to begin processing
❖ GPU executes in parallel on each core
❖ Copy the results from GPU memory back to main memory
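As a minimal host-side sketch of these four steps (the kernel name and problem size are illustrative; a full worked example appears later in the deck):

int n = 1024;                                       // hypothetical problem size
size_t size = n * sizeof(float);
float *h = (float*)malloc(size), *d;
cudaMalloc(&d, size);                               // space in GPU memory
cudaMemcpy(d, h, size, cudaMemcpyHostToDevice);     // 1. main mem -> GPU mem
kernel<<<n / 256, 256>>>(d, n);                     // 2-3. CPU launches; GPU runs in parallel
cudaMemcpy(h, d, size, cudaMemcpyDeviceToHost);     // 4. GPU mem -> main mem
cudaFree(d); free(h);                               // 'kernel' is a hypothetical __global__ function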


CUDA

❖ Compute Unified Device Architecture:
➢ Hardware and software coprocessing architecture
➢ Enables NVIDIA GPUs to execute C, C++, Fortran, OpenCL, DirectCompute
➢ Introduced with the GeForce 8800 GTX in 2006

Thread Hierarchy

❖ Threads

➢ Private local memory

❖ Thread Blocks

➢ Block shared memory

❖ Grids of Blocks

➢ Global memory

John Nickolls, William J. Dally, "The GPU Computing Era", IEEE Micro, vol. 30, no. 2, pp. 56-69, March/April 2010, doi:10.1109/MM.2010.41
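A brief sketch of how this hierarchy appears in kernel code (vecAdd and its parameters are illustrative, not from the slides): each thread computes its grid-wide index from its block and thread coordinates.

__global__ void vecAdd(const float* a, const float* b, float* c, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // position within the grid
    if (i < n) c[i] = a[i] + b[i];                  // one element per thread
}
// Launch: a grid of enough 256-thread blocks to cover n elements:
// vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);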

Blocks to Warps

John Nickolls, William J. Dally, "The GPU Computing Era", IEEE Micro, vol. 30, no. 2, pp. 56-69, March/April 2010, doi:10.1109/MM.2010.41


Program in CUDA

❖ Task:

➢ Input:

u = [1 2 3 4]

➢ Compute:

u[i] = u[i]*u[i] + u[n-i-1]

➢ Output:

u = [5 7 11 17]

int main(){
    float u[4] = {1, 2, 3, 4};
    float tmp[4];
    int i;
    // copy u first so every update reads the original values
    for(i = 0; i < 4; i++){
        tmp[i] = u[i];
    }
    for(i = 0; i < 4; i++){
        u[i] = tmp[i]*tmp[i] + tmp[4-i-1];
    }
    return 0;
}

Program in CUDA

❖ __global__:

➢ kernel function

➢ <<<NumBlocks, ThreadsPerBlock>>>

❖ __shared__:

➢ Block shared memory

❖ __syncthreads():
➢ barrier for threads within the same block
➢ all global and shared memory accesses made by the threads prior to __syncthreads() are visible to the whole block

__global__ void example(float* u){
    int i = threadIdx.x;
    __shared__ float tmp[4];   // block-shared scratch copy of u
    tmp[i] = u[i];
    __syncthreads();           // all writes to tmp[] now visible to the block
    u[i] = tmp[i]*tmp[i] + tmp[3-i];
}

int main(){
    float hostU[4] = {1, 2, 3, 4};
    float* devU;
    size_t size = sizeof(float) * 4;
    cudaMalloc(&devU, size);                                // device buffer
    cudaMemcpy(devU, hostU, size, cudaMemcpyHostToDevice);  // host -> device
    example<<<1, 4>>>(devU);                                // 1 block of 4 threads
    cudaMemcpy(hostU, devU, size, cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(devU);
    return 0;
}
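One caveat the slides omit: CUDA API calls return status codes, and robust code should check them. A minimal sketch applied to the copy above:

cudaError_t err = cudaMemcpy(devU, hostU, size, cudaMemcpyHostToDevice);
if (err != cudaSuccess)
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));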

Outline:

❖ GPU

❖ GPGPU and CUDA

❖ CPU + GPU

❖ SuperComputer (TianHe 1A)

❖ Latest Trends

CPU + GPU coprocessing

● Why cooperate?
-- CPU: optimized for sequential code
-- GPU: optimized for throughput


CPU + GPU coprocessing

● How to cooperate?
-- (SW) coprocessing programming: CUDA, …
-- (HW) GPU-accelerated systems

Issues

❖ Unbalanced workloads across CPUs and GPUs
❖ Low-bandwidth communication between CPUs and GPUs

Two-level dynamic adaptive mapping

❖ Idea: divide the workload according to measured performance

➢ among CPUs and GPUs

➢ among cores on one CPU

Two-level dynamic adaptive mapping

❖ First level: map computations to CPUs and GPUs
Step 1: Initialize the fraction of the workload assigned to the CPU and the GPU according to their peak performance;
Step 2: After one iteration, update both the CPU and GPU performance measurements;
Step 3: Update the fraction of the workload assigned to the CPU and the GPU according to their performance in the previous iteration.


Two-level dynamic adaptive mapping

❖ Second level: map computations to the cores within a CPU
Step 1: Initialize the fraction assigned to each CPU core equally;
Step 2: After one iteration, update each core's performance measurement;
Step 3: Update the fraction assigned to each core according to its performance in the previous iteration (see the sketch below).
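A minimal sketch of the update rule behind Steps 2-3 at both levels, assuming performance is measured as work completed per second in the previous iteration (names are illustrative):

// Each worker's new share is its fraction of total measured performance.
void update_fractions(const double* perf, double* frac, int nworkers){
    double total = 0.0;
    for (int i = 0; i < nworkers; i++) total += perf[i];
    for (int i = 0; i < nworkers; i++) frac[i] = perf[i] / total;
}
// Level 1: workers = {CPU, GPU}; Level 2: workers = the cores of one CPU.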

Issues

❖ Unbalanced workloads across CPUs and GPUs
❖ Low-bandwidth communication between CPUs and GPUs

Software pipelining

❖ CPUs & GPUs communication
❖ PCI-E 2.0: 4-8 GB/s
❖ CPU -> PCI-E: hundreds of MB/s

Software pipelining

CPUs & GPUs communication:
❖ Pinned memory (memory never swapped out to secondary storage)
❖ A limited resource: too much pinned memory decreases host performance


Software pipelining

❖ Methodology:
Assume N tasks, where every task contains three phases:
task = {input, exe, output}
with corresponding times:
T = {T_in, T_exe, T_out}
If the pipeline is full:
T_all = T_in + T_out + N * T_exe
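In CUDA terms, such a pipeline is typically built from streams plus pinned host memory, so the input and output copies of one task overlap the execution of another. A hedged sketch (the kernel, task count, and sizes are illustrative, not necessarily the paper's implementation):

#define NTASKS 8
#define CHUNK  (1 << 20)   // elements per task

__global__ void work(float* p);                     // hypothetical per-task kernel

float *h, *d;
cudaMallocHost(&h, NTASKS * CHUNK * sizeof(float)); // pinned host memory
cudaMalloc(&d, NTASKS * CHUNK * sizeof(float));
cudaStream_t s[NTASKS];
for (int t = 0; t < NTASKS; t++) {
    cudaStreamCreate(&s[t]);
    float *hp = h + t * CHUNK, *dp = d + t * CHUNK;
    cudaMemcpyAsync(dp, hp, CHUNK * sizeof(float),
                    cudaMemcpyHostToDevice, s[t]);  // input phase
    work<<<CHUNK / 256, 256, 0, s[t]>>>(dp);        // exe phase
    cudaMemcpyAsync(hp, dp, CHUNK * sizeof(float),
                    cudaMemcpyDeviceToHost, s[t]);  // output phase
}
cudaDeviceSynchronize();                            // drain the pipeline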


Testing results

❖ Single node:

DGEMM: C = alpha*A*B + beta*C
Tianhe-1: 80 cabinets * 32 nodes * (two quad-core Xeon processors + 1 ATI GPU)
CPU: DGEMM implemented in the Intel Math Kernel Library (MKL)
ACML-GPU: AMD Core Math Library for Graphic Processors
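The measurements above used MKL on the CPU and ACML-GPU on the ATI GPU. For orientation only, the equivalent call on NVIDIA hardware is cuBLAS's cublasDgemm; a minimal sketch for n x n column-major matrices already resident on the GPU (the function name is illustrative, and a cublasHandle_t must be created elsewhere):

#include <cublas_v2.h>

void dgemm_gpu(cublasHandle_t h, int n,
               const double* dA, const double* dB, double* dC){
    double alpha = 1.0, beta = 1.0;                 // C = alpha*A*B + beta*C
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
}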


Outline:

❖ GPU

❖ GPGPU and CUDA

❖ CPU + GPU

❖ SuperComputer (TianHe 1A)

❖ Latest Trends

GPU-accelerated supercomputer
TianHe-1A:
Peak performance: 4.7 petaflops (Linpack performance: 2.6 petaflops)
Ranked first on the Top500 list in Nov 2010 (17th today)

TianHe1A overview

112 compute racks, 8 service racks, 6 communication racks, and 14 I/O racks

TianHe1A overview

❖ 7168 compute nodes (64 nodes per rack), each node = two Intel CPUs + one NVIDIA GPU;
❖ 1024 service nodes, each node = two FT-1000 CPUs;
❖ In total: 14336 Intel Xeon X5670 CPUs, 2048 FT-1000 CPUs, and 7168 NVIDIA Tesla M2050 GPUs;
❖ 7168 nodes * (11.733 GFLOPS per core * 6 CPU cores * 2 sockets + 515.2 GFLOPS per GPU) = 4702.21 double-precision TFLOPS (see the check below);
❖ Fat-tree network (NRC + NIC, bidirectional bandwidth of 160 Gbps) for data transfer
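The quoted peak follows directly from the per-component numbers; a quick check:

// Per-node peak: two 6-core Xeon X5670s plus one Tesla M2050
double per_node_gflops = 11.733 * 6 * 2 + 515.2;        // ~656.0 GFLOPS
double total_tflops = 7168 * per_node_gflops / 1000.0;  // ~4702.2 TFLOPS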


Interconnection topology

❖ First layer: 480 switching boards distributed in 120 racks (computation and service racks);
❖ Second layer: 11 fully connected 384-port switches

High-radix routing chip (NRC)
❖ High-radix router:
http://pages.cs.wisc.edu/~isca2005/slides/07A-02.PDF

❖ Total bandwidth B is divided among 2k input and output channels, so the per-channel bandwidth is b = B / 2k;
❖ The number of hops H under uniform traffic is at least 2 log_k N;
❖ tr is the delay at each router.
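A hedged worked example of the radix trade-off (the numbers are illustrative, not from the talk): for N = 4096 nodes,

H(k = 16) = 2 * log_16(4096) = 2 * 3 = 6 hops
H(k = 64) = 2 * log_64(4096) = 2 * 2 = 4 hops

so raising the radix cuts the hop count, and with it the header latency H * tr, at the cost of thinner channels b = B / 2k.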



High-radix routing chip (NRC)
❖ NRC: a 16*16 switch chip, built from 4*4 crossbars (XBARs)

I/O storage system

❖ Lustre file system (a centralized parallel file system):
-- Metadata Server (MDS): stores namespace metadata (filenames, directories, permissions);
-- Object Storage Server (OSS): stores file data;
-- Central storage: secure, efficient, flexible

Outline:

❖ GPU

❖ GPGPU and CUDA

❖ CPU + GPU

❖ SuperComputer (TianHe 1A)

❖ Latest Trends

Latest trends

❖ AMD Accelerated Processing Unit (Fusion)
-- x86 CPU cores and a GPU on the same silicon die
-- high-speed block transfer engines


Latest trends

❖ Tianhe-2: neo-heterogeneous architecture
➢ Intel Xeon Phi: same ISA as the host CPU, with a unified programming model
➢ streamlines the development and optimization process

Latest trends

❖ Top500:
#1: Tianhe-2
#2: Titan
#17: Tianhe-1A

Latest trends

❖ End of Tianhe:
http://www.wsj.com/articles/u-s-agencies-block-technology-exports-for-supercomputer-in-china-1428561987

References

❖ Xiangke Liao, Liquan Xiao, "MilkyWay-2 supercomputer: system and application", 2014
❖ http://pages.cs.wisc.edu/~isca2005/slides/07A-02.PDF, John Kim, Stanford University
❖ Canqun Yang, Feng Wang, "Adaptive optimization for petascale heterogeneous CPU/GPU computing", IEEE, 2010
❖ Min Xie, Yutong Lu, "Tianhe-1A interconnection and message-passing services", 2012


References

❖ J. Nickolls, "The GPU Computing Era", 2010
❖ Xuejun Yang, Xiangke Liao, "The Tianhe-1A Supercomputer: its Hardware and Software", 2011
❖ http://en.wikipedia.org/wiki/Xeon_Phi
❖ http://www.infocellar.com/networks/ip/hop-count.htm
❖ http://en.wikipedia.org/wiki/CUDA_Pinned_memory
❖ http://www.systemfabricworks.com/products/system-fabric-works-storage-solutions/lustre-hadoop