© NVIDIA Corporation 2011
Supercomputing with NVIDIA GPUs
HPCN Workshop, May 2011
Axel Koehler, NVIDIA
NVIDIA Introduction and HPC Evolution of GPUs
Public company based in Santa Clara, CA | ~$4B revenue | ~6000 employees
Founded in 1993, with its primary business in the semiconductor industry
Products for graphics in workstations, notebooks, mobile devices, etc.
Began R&D of GPUs for HPC in 2004; released the first Tesla products and CUDA in 2007
Development of the GPU as a co-processing accelerator for x86 CPUs
HPC Evolution of GPUs: 3 Generations of Tesla in 3 Years
2004: Began strategic investments in the GPU as an HPC co-processor
2006: G80, the first GPU with built-in compute features, 128 cores; CUDA SDK beta
2007: Tesla 8-series based on G80, 128 cores; CUDA 1.0, 1.1
2008: Tesla 10-series based on GT200, 240 cores; CUDA 2.0, 2.3
2009: Tesla 20-series, code-named "Fermi", up to 512 cores; CUDA SDK 3.0
Tesla GPUs Power 3 of Top 5 Supercomputers
#1 Tianhe-1A: 7168 Tesla GPUs, 2.5 PFLOPS
#3 Nebulae: 4650 Tesla GPUs, 1.2 PFLOPS
#4 Tsubame 2.0: 4224 Tesla GPUs, 1.194 PFLOPS
"We not only created the world's fastest computer, but also implemented a heterogeneous computing architecture incorporating CPU and GPU; this is a new innovation." (Premier Wen Jiabao, public comments acknowledging Tianhe-1A)
3 of Top 5 Supercomputers
[Chart: power consumption (megawatts, 0–8) and performance (gigaflops, 0–3000) for Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II, and Tera 100]
GPU Computing Today, By the Numbers:
200+ million CUDA-capable GPUs
600,000+ CUDA Toolkit downloads
100,000+ active GPU computing developers
8,000 members in the Parallel Nsight developer program
362 universities teaching CUDA worldwide
11 CUDA Centers of Excellence worldwide
Wide Adoption of Tesla GPUs
Oil and Gas: Reverse Time Migration, Kirchhoff Time Migration, Reservoir Simulation
Edu/Research: Astrophysics, Molecular Dynamics, Weather/Climate Modeling
Government: Signal Processing, Satellite Imaging, Video Analytics, Synthetic Aperture Radar
Life Sciences: Bio-chemistry, Bio-informatics, Material Science, Sequence Analysis, Genomics
Finance: Risk Analytics, Monte Carlo, Options Pricing, Insurance Modeling
Manufacturing: Structural Mechanics, Computational Fluid Dynamics, Machine Vision, Electromagnetics
MATLAB Makes GPUs More Accessible
Bridges the spectrum from the scientist/practitioner (domain expertise) to the developer/computer scientist (computational expertise)
MATLAB benefits:
• Faster time to discovery
• Empowers the scientist/practitioner
• No need for programming expertise
• No custom tools
• Automated application deployment
Language integration: CUDA C/C++ and high-level technical computing languages
1 million+ MATLAB licensees
GPU Progress – CAE ISV Software
GPU status categories (each product was mapped to one on the original slide): Available Today | Product in 2011 | Product Evaluation | Research Evaluation
Structural Mechanics: ANSYS Mechanical, AFEA, Abaqus/Standard, LS-DYNA implicit, Marc, MD Nastran, RADIOSS implicit, PAM-CRASH implicit, NX Nastran, RecurDyn, LS-DYNA, Abaqus/Explicit, RADIOSS, PAM-CRASH
Fluid Dynamics: AcuSolve, Moldflow, Culises (OpenFOAM), Particleworks, CFD-ACE+, Abaqus/CFD, FloEFD, STAR-CCM+, ANSYS CFD (FLUENT+CFX), CFD++, LS-DYNA CFD
Electromagnetics: Nexxim, EMPro, CST MS, XFdtd, SEMCAD X, Xpatch, HFSS, Maxwell
The "Fermi" Architecture: The Soul of a Supercomputer in the Body of a GPU
3 billion transistors
Over 2× the cores (512 total)
8× the peak double-precision performance
ECC
L1 and L2 caches
~2× memory bandwidth (GDDR5)
Up to 1 Terabyte of GPU memory
Concurrent kernels
Hardware support for C++
[Block diagram: six DRAM interfaces, host interface, GigaThread scheduler, and shared L2 cache surrounding the SM array]
Tesla Data Center & Workstation GPU Solutions
Integrated CPU-GPU servers & blades: Tesla M-series GPUs (M2090 | M2070 | M2050)
Workstations with 2 to 4 Tesla GPUs: Tesla C-series GPUs (C2070 | C2050)

                              M2090       M2070       M2050
Cores                         512         448         448
Memory                        6 GB        6 GB        3 GB
Memory bandwidth (ECC off)    177.6 GB/s  148.8 GB/s  148.8 GB/s
Peak single precision (GFLOPS) 1331       1030        1030
Peak double precision (GFLOPS) 665        515         515

                              C2070       C2050
Cores                         448         448
Memory                        6 GB        3 GB
Memory bandwidth (ECC off)    144 GB/s    144 GB/s
Peak single precision (GFLOPS) 1030       1030
Peak double precision (GFLOPS) 515        515
CUDA GPU Roadmap
[Chart: double-precision GFLOPS per watt (0–16) versus year, with Tesla (2007), Fermi (2009), Kepler (2011), and Maxwell (2013)]
NVIDIA Developer Eco-System
Languages: C, C++, Fortran, OpenCL, DirectCompute, Java, Python
GPU compilers / parallelizing compilers: PGI Accelerator, CAPS HMPP, mCUDA, OpenMP
Libraries: BLAS, FFT, LAPACK, NPP, video/imaging, GPULib
GPGPU consultants & training: ANEO, GPU Tech
Debuggers & profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight for Visual Studio, Allinea, TotalView, VampirTrace
Numerical packages: MATLAB, Mathematica, NI LabVIEW, pyCUDA
Cluster tools: Bright Cluster Manager, Platform LSF/Symphony, Altair PBS Pro, Torque, GridEngine
OEM solutions + cloud platform providers: Amazon EC2, Peer 1
CUDA 4.0: Highlights
Easier Parallel Application Porting:
• Share GPUs across multiple threads
• Single-thread access to all GPUs
• No-copy pinning of system memory
• New CUDA C/C++ features
• Thrust templated primitives library
• NPP image/video processing library
• Layered textures
New & Improved Developer Tools:
• Automated performance analysis
• C++ debugging
• GPU binary disassembler
• cuda-gdb for MacOS
Faster Multi-GPU Programming:
• Unified Virtual Addressing
• NVIDIA GPUDirect™ v2.0
• Peer-to-peer access
• Peer-to-peer transfers
• GPU-accelerated MPI
C++ Templatized Algorithms & Data Structures (Thrust)
Powerful open-source C++ parallel algorithms & data structures
Similar to the C++ Standard Template Library (STL)
Automatically chooses the fastest code path at compile time
Divides work between GPUs and multi-core CPUs
Parallel sorting at 5× to 100× faster
Data structures: thrust::device_vector, thrust::host_vector, thrust::device_ptr, etc.
Algorithms: thrust::sort, thrust::reduce, thrust::exclusive_scan, etc.
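The STL-like containers and algorithms above compose naturally. A minimal sketch, assuming the Thrust headers that ship with the CUDA Toolkit (4.0 and later) and an nvcc build; the data values are illustrative:

```cuda
// Sort and reduce a small array on the GPU with Thrust.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <cstdio>

int main() {
    // Fill a host-side container, mirroring std::vector usage.
    thrust::host_vector<int> h(4);
    h[0] = 3; h[1] = 1; h[2] = 4; h[3] = 1;

    // Assignment performs the host-to-device transfer.
    thrust::device_vector<int> d = h;

    // Parallel sort and reduction run on the device.
    thrust::sort(d.begin(), d.end());
    int sum = thrust::reduce(d.begin(), d.end(), 0);

    // Copy the sorted data back to the host and print it.
    thrust::copy(d.begin(), d.end(), h.begin());
    printf("sorted: %d %d %d %d, sum: %d\n", h[0], h[1], h[2], h[3], sum);
    return 0;
}
```

Because Thrust selects backends at compile time, the same source can also target a multi-core CPU backend instead of the GPU.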
Unified Virtual Addressing: Easier to Program with a Single Address Space
[Diagram: without UVA, system memory and each GPU's memory are separate address spaces (each spanning 0x0000–0xFFFF); with UVA, CPU and GPU memories share one address space across PCIe]
Unified Virtual Addressing
One address space for all CPU and GPU memory
Determine the physical memory location from the pointer value
Enables libraries to simplify their interfaces (e.g., cudaMemcpy)
Supported on Tesla 20-series and other Fermi GPUs; 64-bit applications on Linux and Windows TCC
Before UVA, separate options for each permutation: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice
With UVA, one function handles all cases: cudaMemcpyDefault (the data location becomes an implementation detail)
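A minimal sketch of the cudaMemcpyDefault usage described above, assuming a 64-bit CUDA 4.0+ application running on a Fermi-class GPU; the buffer size is illustrative:

```cuda
// With UVA, the runtime infers each pointer's location, so one
// cudaMemcpy call covers all copy directions.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t n = 1024, bytes = n * sizeof(float);
    float *host_buf, *dev_buf;

    cudaMallocHost(&host_buf, bytes);  // pinned host memory
    cudaMalloc(&dev_buf, bytes);       // device memory

    for (size_t i = 0; i < n; ++i) host_buf[i] = (float)i;

    // Direction is inferred from the pointer values, not spelled out.
    cudaMemcpy(dev_buf, host_buf, bytes, cudaMemcpyDefault);  // host -> device
    cudaMemcpy(host_buf, dev_buf, bytes, cudaMemcpyDefault);  // device -> host

    printf("last element: %f\n", host_buf[n - 1]);
    cudaFree(dev_buf);
    cudaFreeHost(host_buf);
    return 0;
}
```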
NVIDIA GPUDirect™: Towards Eliminating the CPU Bottleneck
Version 1.0, for applications that communicate over a network:
• Direct access to GPU memory for 3rd-party devices
• Eliminates unnecessary system memory copies & CPU overhead
• Supported by Mellanox and QLogic
• Up to 30% improvement in communication performance
Version 2.0, for applications that communicate within a node:
• Peer-to-peer memory access, transfers & synchronization
• Less code, higher programmer productivity
Details @ http://www.nvidia.com/object/software-for-tesla-products.html
GPUDirect v2.0: Peer-to-Peer Communication
[Diagram: direct access, where GPU0 issues loads/stores into GPU1 memory, and direct transfers, where cudaMemcpy() moves data between GPU0 and GPU1 memories over PCIe]
GPUDirect v2.0: Peer-to-Peer Communication
Direct communication between GPUs: faster (no system-memory copy overhead) and more convenient multi-GPU programming
Direct transfers: copy from GPU 0 memory to GPU 1 memory; works transparently with UVA
Direct access: GPU 0 reads or writes GPU 1 memory (load/store)
Supported only on Tesla 20-series (Fermi); 64-bit applications on Linux and Windows TCC
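The direct-transfer path above can be sketched as follows, assuming a 64-bit CUDA 4.0+ build with two P2P-capable Fermi GPUs on the same PCIe root; the buffer size is illustrative:

```cuda
// Peer-to-peer copy between two GPUs without staging in system memory.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);
    if (!can_access) {
        printf("P2P not supported between GPU 0 and GPU 1\n");
        return 1;
    }

    const size_t bytes = 1 << 20;
    float *buf0, *buf1;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);  // let GPU 0 access GPU 1 memory
    cudaMalloc(&buf0, bytes);

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // Direct device-to-device transfer over PCIe.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);

    // With UVA, a plain cudaMemcpy with cudaMemcpyDefault also works here.
    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```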
Echelon: NVIDIA's Extreme-Scale Computing Project
Power is THE Problem
1. Data movement dominates power
2. Optimize the storage hierarchy
3. Tailor memory to the application
Applications with Hierarchical Reuse Want a Deep Storage Hierarchy
[Diagram: 16 processors, each with its own L1, sharing four L2s above an L3 and an L4]
Applications with Plateaus Want a Shallow Storage Hierarchy
[Diagram: 16 processors with L1s and four L2s connected directly by a network-on-chip (NoC), with no deeper cache levels]
Configurable Memory Can Do Both, at the Same Time
Flat hierarchy for large working sets
Deep hierarchy for reuse
"Shared" memory for explicit management
Cache memory for unpredictable sharing
[Diagram: 16 processor/L1 pairs and SRAM banks joined by a NoC, with the SRAM configurable as either style of hierarchy]
Echelon Architecture
Lane: DFMAs, 20 GFLOPS
SM: 8 lanes, 160 GFLOPS (8 processors behind a switch with an L1 cache)
Chip: 128 SMs, 20.48 TFLOPS + 8 latency processors; 1024 SRAM banks of 256 KB each; NoC, memory controllers, network interface
Node MCM: 20 TFLOPS + 256 GB; GPU chip (20 TF DP, 256 MB on-chip), DRAM stacks, NV memory; 1.4 TB/s DRAM bandwidth, 150 GB/s network bandwidth
Echelon System Sketch
Software stack: self-aware OS, self-aware runtime, locality-aware compiler & autotuner
Echelon system: 400 cabinets, 1 EF, 15 MW
Cabinet 0 (C0): 16 modules, 2.6 PF, 205 TB/s, 32 TB
Module 0 (M0) through M15: 8 nodes, 160 TF, 12.8 TB/s, 2 TB
Node 0 (N0) through N7: 20 TF, 1.6 TB/s, 256 GB
Processor chip (PC): SM0 through SM127 (each with lanes L0 through L7 and cores C0 through C7), NoC, memory controller, NIC, L2 banks L2_0 through L2_1023, DRAM cubes, NVRAM
High-radix router module (RM) with link controllers LC0 through LC7; Dragonfly interconnect (optical fiber)
GPU Computing is the Future
1. GPU computing is #1 today: on the Top500, and dominant on the Green500
2. GPU computing enables exascale at reasonable power
3. The GPU is the computer: a general-purpose computing engine, not just an accelerator
4. The real challenge is software
Supercomputing with NVIDIA GPUs
HPCN Workshop, May 2011
Axel Koehler, NVIDIA