Supercomputing at 1/10th the Cost - Artificial Intelligence Computing … · 2012-05-31 · CUDA...
2
Mainstream Applications Going Parallel
CUDA Accelerates Adobe Mercury Playback Engine
- Amazingly fluid, real-time video editing
- Quick preview of real-time edits and effects
- Realistic preview of final content
- Faster encoding
3
GPU Innovation Accelerating
[Chart: NVIDIA R&D Budget, $ millions, 2000 to 2008, scale 0 to 900]
4
5
Max Planck Institute
National Center for Supercomputing Applications
Tokyo Institute of Technology
CSIRO
6
Dawning Nebulae
Second Fastest Supercomputer in the World
1.27 Petaflops
4640 Tesla GPUs
7
8
[Map: Tesla GPU deployments worldwide; legend distinguishes existing from prospective deployments]
Fermi Lab; CSIRO (256 GPUs); NCSA (384 GPUs); Chinese Academy of Sciences (2000+ GPUs); Daresbury Lab; PNNL (256 GPUs); National Taiwan Univ; Max Planck Institute; Argonne Lab; Harvard; Oxford; Jefferson Labs; Georgia Tech; TACC; Delaware; OSC; Maryland; Johns Hopkins; WestGrid; Stanford; UNC; NERSC; IIT Delhi; CEA; Tokyo Tech (680 GPUs); Peking University; Univ of Science & Tech; Tsinghua University; NCHC; Curtin University; Swinburne University; Riken (220 GPUs); Osaka; Nagasaki; KISTI; Anna Univ; IIT Madras; Indian Institute of Science; NIT Calicut; Dept of Space; LRDE; Indian Inst of Tropical Meteorology; Nizhegorodsky University; Kazan Univ; St. Petersburg University; Institute of Physics; Aarhus; Norwegian Univ of S & T; Braunschweig; Copenhagen; Oak Ridge; Wisconsin; VaTech; Cambridge; Groningen; SNU; Yonsei; Utah; Berkeley
9
NVIDIA Tesla 20-Series Products
Data Center and Workstation
10
GPU Computing
CPU + GPU Co-Processing
CPU: 4 cores, 48 GigaFlops (DP)
GPU: 515 GigaFlops (DP)
11
NVIDIA Developer Eco-System
Languages & APIs: C, C++, Fortran, OpenCL, DirectCompute, Java, Python
Debuggers & Profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight for Visual Studio, Allinea, TotalView
Numerical Packages: MATLAB, Mathematica, NI LabVIEW, pyCUDA
GPU Compilers / Parallelizing Compilers: PGI Accelerator, CAPS HMPP, mCUDA, OpenMP
Libraries: BLAS, FFT, LAPACK, NPP, Video/Imaging, GPULib
OEM Solution Providers; GPGPU Consultants & Training: ANEO, GPU Tech
12
146x: Medical Imaging (U of Utah)
36x: Molecular Dynamics (U of Illinois, Urbana)
18x: Video Transcoding (Elemental Tech)
50x: MATLAB Computing (AccelerEyes)
100x: Astrophysics (RIKEN)
149x: Financial Simulation (Oxford)
47x: Linear Algebra (Universidad Jaime)
20x: 3D Ultrasound (Techniscan)
130x: Quantum Chemistry (U of Illinois, Urbana)
30x: Gene Sequencing (U of Maryland)
Typical speedups: 50x - 150x
13
Tesla Data Center & Workstation GPU Solutions
Workstations: 2 to 4 Tesla GPUs, integrated CPU-GPU
Servers & Blades: OEM CPU server + Tesla S-series 1U
Tesla S-series 1U Systems: S2050, S1070
Tesla M-series GPUs: M2070, M2050, M1060
Tesla C-series GPUs: C2070, C2050, C1060
14
Soul of a Supercomputer
[Chart: Linpack Teraflops per rack, scale 0 to 25: Nehalem 4-core CPU, Westmere 6-core CPU, Tesla 10-series, Tesla 20-series]
15
Heterogeneous = 2x Performance / Watt
[Chart: Gigaflops/watt, x86 CPU vs. GPU]
16
8x Higher Linpack
[Charts: CPU server vs. GPU-CPU server]
Performance: 80.1 vs. 656.1 Gflops
Performance / $: 11 vs. 60 Gflops / $K
Performance / watt: 146 vs. 656 Gflops / kW
CPU 1U Server: 2x Intel Xeon X5550 (Nehalem) 2.66 GHz, 48 GB memory, $7K, 0.55 kW
GPU-CPU 1U Server: 2x Tesla C2050 + 2x Intel Xeon X5550, 48 GB memory, $11K, 1.0 kW
17
37 TeraFlop System (Top 150 System)
4.5x Lower Power & Cooling Costs: $117K vs. $524K power savings every year
7x Less Space Required: 2 racks of GPU+CPUs vs. 15 racks of CPUs
5x Lower Cost: $740K vs. $3.8M
As per November 2009 Top 500 List
18
What if Every Supercomputer Had Fermi?
[Chart: Linpack Teraflops of the Top 500 supercomputers (Nov 2009), with GPU-equivalent systems overlaid]
Top 150: 150 GPUs, 37 TeraFlops, $740K
Top 100: 225 GPUs, 55 TeraFlops, $1.1M
Top 50: 450 GPUs, 110 TeraFlops, $2.2M
19
Successful Customers
GPU vs CPU Improvements:
Performance / Watt: 18x - 27x | 12x - 17x
Performance / Space: 20x - 31x | 15x - 20x
Performance / Cost: 15x - 20x | 10x - 12x
Oil & Gas ISVs: 20+ oil & gas companies porting to CUDA
20
Successful Customers
Finance ISVs: 10+ banks porting to CUDA (several unannounced)
Derivative Pricing using SciFinance (Basket Equity-Linked Structured Note):
- Intel Xeon (2.6 GHz): 31.05 secs
- 1 Tesla C1060: 0.41 secs (77x)
- 2 Tesla C1060s: 0.25 secs (124x)
UnRisk
21
Defense / Federal Agencies Opportunities
Defense contractors, federal agencies, defense services: speedups 10x - 50x
GIS: Manifold, PCI Geomatics, DigitalGlobe
Signal Processing: GPU VSIPL
MATLAB: GPU plugin available
UAV video analysis: MotionDSP Ikena
Virtual Prototyping: RealityServer
Surveillance, Cryptography: software available
22
Tesla Bio WorkBench : Bio-Chemistry & Bio-Informatics
Applications: MUMmerGPU, LAMMPS, CUDA-BLASTP, TeraChem, CUDA-EC, CUDA-MEME, Hex (Docking)
Community: technical papers, discussion forums, benchmarks & configurations, downloads, documentation
Platforms: Tesla Personal Supercomputer, Tesla GPU Clusters
23
Increasing Number of CUDA Applications
Roadmap through 2010 Q2 - Q4; legend marks released, announced, and unannounced products.
- Tools & Libraries: CUDA C/C++, PGI Fortran, MathWorks MATLAB, Mathematica, Nsight Visual Studio IDE, CULA LAPACK library, Allinea debugger, TotalView debugger, NPP Performance Primitives, NAG RNGs, Thrust C++ template library, OpenCurrent CFD/PDE library, Jacket MATLAB plugin, PGI Accelerator enhancements, CAPS HMPP enhancements, Platform Cluster Management, Bright Cluster Management
- Oil & Gas: seismic analysis (ffA, HeadWave, Geostar), seismic interpretation, Seismic City, Acceleware, reservoir simulation (two products)
- Bio-Chemistry: AMBER, GROMACS, GROMOS, HOOMD, LAMMPS, NAMD, VMD, TeraChem, BigDFT, ABINIT, other popular MD codes, quantum chemistry codes
- Bio-Informatics: CUDA-BLASTP, GPU-HMMER, MUMmerGPU, MEME, CUDA-EC (short-read sequence analysis), Hex protein docking
- Video & Rendering: mental ray with iray, 3D CAD SW with iray, MainConcept video encoder, OptiX ray tracing engine, Elemental Video Live, Fraunhofer JPEG2K
- Finance: Scicomp SciFinance, risk analysis products, credit risk analysis ISV, NumeriX counterparty risk, trading platform ISV
- EDA: Agilent ADS Spice simulator, SPICE simulator, Verilog simulator, lithography products; electro-magnetics: Agilent, CST, Remcom, SPEAG
- CAE: MSC MARC, MSC Nastran, Autodesk Moldflow, Moldex3D, Acusim AcuSolve CFD, structural mechanics ISV, several other CAE ISVs
24
Bloomberg: Bond Pricing
28x Lower Cost: $144K vs. $4 Million
42x Lower Space: 48 GPUs vs. 2000 CPUs
38x Lower Power Cost: $31K / year vs. $1.2 Million / year
25
Oil & Gas: Seismic Processing
Equal Performance: 32 Tesla S1070s vs. 2000 CPU servers
20x Lower Cost: ~$400K vs. ~$8M
27x Lower Power: 45 kW vs. 1200 kW
31x Less Space
26
Finding a Better Shampoo
Equal Performance: Tesla PSC vs. 32 CPU servers
Power: 1 kWatt vs. 21 kWatts
19x Lower Power & Cooling Cost: $2K vs. $37K
13x Lower System Cost: $7K vs. $128K
No Data Center Required
27
Reducing Radiation in CT
28
Flow Cytometry : Finding Cancer Cells
Flow Cytometer
29
Post-Katrina Hurricane: What Google Showed
30
What was really there : Digital Globe
31
Explosive Growth in CUDA Developers
Teaching Parallel Programming using NVIDIA GPUs
200,000 CUDA Developer Downloads
32
1000+ Applications and Papers using NVIDIA GPUs
10 GPGPU Books using NVIDIA GPUs
NVIDIA’s Commitment to GPU Computing
33
Programming GPUs
34
Doing GPGPU Right: Combination of Hardware and Software
GPU Computing Applications
Languages: C, C++, Fortran, Java and Python, OpenCL, DirectCompute
(OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.)
Libraries and Middleware: cuFFT, cuBLAS, CULA LAPACK, NPP & cuDPP, Video; PhysX (physics); OptiX (ray tracing); mental ray / iray (rendering); RealityServer (3D web services)
CUDA Parallel Computing Architecture
NVIDIA GPU
35
CUDA C/C++ Continuous Innovation
Timeline: 2007 - 2010
CUDA Toolkit 1.0 (July 07): C compiler, C extensions, single precision, BLAS, FFT, SDK with 40 examples
CUDA Toolkit 1.1 (Nov 07): Win XP 64, atomics support, multi-GPU support
CUDA Toolkit 2.0 (Aug 08): double precision, compiler optimizations, Vista 32/64, Mac OS X, 3D textures, HW interpolation
CUDA Visual Profiler 2.2; cuda-gdb HW debugger
CUDA Toolkit 2.3 (July 09): DP FFT, 16-32 conversion intrinsics, performance enhancements
CUDA Toolkit 3.0 (Mar 10): C++ inheritance, Fermi arch support, tools updates, driver / runtime interop
Parallel Nsight Beta
36
Compiling C for CUDA Applications
Serial C application (key kernel highlighted):

void serial_function(...) {
    ...
}
void other_function(int ...) {
    ...
}
void saxpy_serial(float ...) {
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
int main() {
    ...
    saxpy_serial(...);
    ...
}

Flow: modify the key kernels into parallel CUDA code. NVCC (Open64) compiles the CUDA kernels into CUDA object files; the CPU compiler compiles the rest of the C application into CPU object files; the linker combines both into a single CPU-GPU executable.
37
C for CUDA : C with a few keywords
Standard C Code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C Code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
38
CUDA Programming Effort / Performance
Source : MIT CUDA Course
39
Parallel Nsight for Visual Studio
Visual Profiler for Linux
cuda-gdb for Linux
40
Targeting Multiple Platforms with CUDA
CUDA C / C++
NVCC (NVIDIA CUDA Toolkit) -> NVIDIA GPUs
MCUDA (CUDA to multi-core) -> multi-core CPUs
Ocelot (PTX to multi-core) -> multi-core CPUs
Swan (CUDA to OpenCL) -> other GPUs
MCUDA: http://impact.crhc.illinois.edu/mcuda.php
Ocelot: http://code.google.com/p/gpuocelot/
Swan: http://www.multiscalelab.org/swan
41
Performance Benchmarks
42
8x Higher Linpack
[Charts: CPU server vs. GPU-CPU server]
Performance: 80.1 vs. 656.1 Gflops (8x)
Performance / $: 11 vs. 60 Gflops / $K (5x)
Performance / watt: 146 vs. 656 Gflops / kW (4.5x)
CPU 1U Server: 2x Intel Xeon X5550 (Nehalem) 2.66 GHz, 48 GB memory, $7K, 0.55 kW
GPU-CPU 1U Server: 2x Tesla C2050 + 2x Intel Xeon X5550, 48 GB memory, $11K, 1.0 kW
43
Performance Summary
[Charts: GPU speed-up vs. CPU, preliminary data]
- MIDG: Discontinuous Galerkin solvers for PDEs
- AMBER Molecular Dynamics (mixed precision)
- Lock Exchange Problem, OpenCurrent
- Radix Sort, CUDA SDK
- OpenEye ROCS Virtual Drug Screening
44
Standard FFT Library: cuFFT 3.1
[Charts: single- and double-precision FFT Gflops vs. transform size]
cuFFT 3.1: NVIDIA Tesla C1060, Tesla C2050 (Fermi); MKL 10.2.4.32: Quad-Core Intel Xeon 5550, 2.67 GHz
45
Standard BLAS Library: cuBLAS 3.1
[Charts: SGEMM and DGEMM Gflops vs. matrix size, 256x256 to 8192x8192]
cuBLAS 3.1: NVIDIA Tesla C1060, Tesla C2050 (Fermi); MKL 10.2.4.32: Quad-Core Intel Xeon 5550, 2.67 GHz
46
Matrix Size for Best cuBLAS Performance
[Charts: SGEMM Gflops at multiples of 80 (80x80 up to 8080x8080, scale 0 to 600); DGEMM Gflops at multiples of 48 (48x48 up to 8112x8112, scale 0 to 250)]
cuBLAS 3.1: NVIDIA Tesla C1060, Tesla C2050 (Fermi); MKL 10.2.4.32: Quad-Core Intel Xeon 5550, 2.67 GHz
47
CULA 1.3 LAPACK Library from EM Photonics
[Charts: double-precision GFLOPS vs. matrix size (2048 to 10240) on Core i7-920, Tesla C1060, Tesla C2050]
- LU Decomposition (DGETRF): partial pivoting; operation count estimated as 0.66*N^3
- Singular Value Decomposition (DGESVD): left & right singular vectors; operation count estimated as 21*N^3
- QR Decomposition (DGEQRF): Householder method; operation count estimated as 1.33*N^3
Data courtesy: EM Photonics
48
Sparse Matrix-Vector Multiplication (SpMV)
[Charts: double- and single-precision SpMV Gflops on Tesla C2050 (ECC off), Tesla C2050 (ECC on), Tesla C1060, and Intel Xeon 5550]
SpMV: CUDA 3.0, Tesla C1060 and Tesla C2050; MKL 10.2: Intel Xeon 5550, 2.67 GHz. Preliminary data.
49
Tesla GPU Computing Products
50
Fermi: The Computational GPU
Performance: 7x the double precision of CPUs; IEEE 754-2008 SP & DP floating point
Flexibility
51
NVIDIA Tesla GPU Computing Products
Tesla M2070 / M2050 (server module, 1 T20 GPU): 1030 GFlops SP, 515 GFlops DP, 6 GB / 3 GB memory, 148.4 GB/s
Tesla M1060 (server module, 1 T10 GPU): 933 GFlops SP, 78 GFlops DP, 4 GB memory, 102 GB/s
Tesla S2050 (1U system, 4 T20 GPUs): 4120 GFlops SP, 2060 GFlops DP, 12 GB memory, 148.4 GB/s
Tesla S1070 (1U system, 4 T10 GPUs): 4140 GFlops SP, 346 GFlops DP, 16 GB memory (4 GB / GPU), 102 GB/s
Tesla C2070 / C2050 (workstation board, 1 T20 GPU): 1030 GFlops SP, 515 GFlops DP, 6 GB / 3 GB memory, 144 GB/s
Tesla C1060 (workstation board, 1 T10 GPU): 933 GFlops SP, 78 GFlops DP, 4 GB memory, 102 GB/s
52
"In testing our key applications, the Tesla GPUs delivered speed-ups that we had never seen before, sometimes even orders of magnitude." Satoshi Matsuoka, Professor, Tokyo Institute of Technology
"Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs." Jack Dongarra, Professor, University of Tennessee; Author of Linpack
"I believe history will record Fermi as a significant milestone." Dave Patterson, Director, Parallel Computing Research Laboratory, U.C. Berkeley; Co-Author of Computer Architecture: A Quantitative Approach
53
Mass market adoption
Applications
Scaling