Product Availability Update
Transcript of Product Availability Update
1
Product Availability Update

Product     Inventory                           Lead time for big orders   Notes
C1060       200 units                           8 weeks                    Build to order
M1060       500 units                           8 weeks                    Build to order
S1070-400   50 units                            10 weeks                   Build to order
S1070-500   25 units + 75 being built           10 weeks                   Build to order
M2050       Shipping now; building 20K for Q2   8 weeks                    Sold out through mid-July
S2050       Shipping now; building 200 for Q2   8 weeks                    Sold out through mid-July
C2050       2000 units                          8 weeks                    Will maintain inventory
M2070       Sept 2010; get PO in now to get priority
C2070       Sept-Oct 2010; get PO in now to get priority
M2070-Q     Oct 2010
Parallel Processing on GPUs with the Fermi Architecture
Arnaldo Tavares, Tesla Sales Manager for Latin America
2
Quadro or Tesla?

QUADRO™:
- Computer Aided Design (e.g. CATIA, SolidWorks, Siemens NX)
- 3D Modeling / Animation (e.g. 3ds Max, Maya, Softimage)
- Video Editing / FX (e.g. Adobe CS5, Avid)

TESLA™:
- Numerical Analytics (e.g. MATLAB, Mathematica)
- Computational Biology (e.g. AMBER, NAMD, VMD)
- Computer Aided Engineering (e.g. ANSYS, SIMULIA/ABAQUS)
3
GPU Computing
CPU + GPU Co-Processing
- CPU: 4 cores, 48 GigaFlops (double precision)
- GPU: 448 cores, 515 GigaFlops (double precision)
(Average efficiency in Linpack: 50%)
4
Application speedups with GPU acceleration:
- 146X: Medical Imaging, U of Utah
- 36X: Molecular Dynamics, U of Illinois, Urbana
- 18X: Video Transcoding, Elemental Tech
- 50X: MATLAB Computing, AccelerEyes
- 100X: Astrophysics, RIKEN
- 149X: Financial Simulation, Oxford
- 47X: Linear Algebra, Universidad Jaime
- 20X: 3D Ultrasound, Techniscan
- 130X: Quantum Chemistry, U of Illinois, Urbana
- 30X: Gene Sequencing, U of Maryland
Overall: 50x to 150x speedups
5
Increasing Number of Professional CUDA Apps
Legend: Available Now / Announced / Future

Tools: CUDA C/C++, PGI CUDA Fortran, PGI CUDA x86, PGI Accelerators, CAPS HMPP, Thrust C++ Template Library, MATLAB, AccelerEyes Jacket (MATLAB), EM Photonics CULAPACK, Parallel Nsight (Visual Studio IDE), TotalView Debugger, Allinea DDT Debugger, TauCUDA Performance Tools, ParaTools VampirTrace, NVIDIA NPP Performance Primitives, NVIDIA Video Libraries, Bright Cluster Manager, Platform LSF Cluster Manager

Libraries: CUDA FFT, CUDA BLAS, RNG & SPARSE CUDA Libraries, MAGMA (LAPACK), Wolfram Mathematica

Oil & Gas: StoneRidge RTM, Headwave Suite, Acceleware RTM Solver, GeoStar Seismic Suite, ffA SVI Pro, OpenGeoSolutions OpenSEIS, Paradigm RTM, Paradigm SKUA, Seismic City RTM, Tsunami RTM, VSG Open Inventor, Panorama Tech

Bio-Chemistry: TeraChem, BigDFT, ABINIT, VMD, Acellera ACEMD, AMBER, DL-POLY, GROMACS, HOOMD, LAMMPS, NAMD, GAMESS, CP2K, OpenEye ROCS, HEX Protein Docking, PIPER Docking

Bio-Informatics: CUDA-BLASTP, CUDA-EC, CUDA-MEME, CUDA-SW++ (Smith-Waterman), GPU-HMMER, MUMmerGPU

CAE: ANSYS Mechanical, ACUSIM AcuSolve 1.8, Autodesk Moldflow, Prometech Particleworks, Remcom XFdtd 7.0, MSC.Software Marc 2010.2, LSTC LS-DYNA 971, FluiDyna OpenFOAM, Metacomp CFD++
6
Increasing Number of Professional CUDA Apps
Legend: Available Now / Announced / Future

Medical: Siemens 4D Ultrasound, Digisens, Useful Progress

Rendering: NVIDIA OptiX (SDK), mental images iray (OEM), Bunkspeed Shot (iray), Lightworks Artisan, Autodesk 3ds Max, Refractive Software Octane, Works Zebra Zeany, Chaos Group V-Ray GPU, Cebas finalRender, Random Control Arion, Caustic Graphics, Weta Digital PantaRay, ILM Plume

Video: Adobe Premiere Pro CS5, Elemental Video, MotionDSP Ikena Video, Fraunhofer JPEG2000, Cinnafilm Pixel Strings, Assimilate SCRATCH, The Foundry Kronos, TDVision TDVCodec, ARRI (various apps), Black Magic Da Vinci, MainConcept CUDA Encoder, GenArts Sapphire, Digital Anarchy Photo

Finance: Murex MACS, Numerix Risk, RMS Risk Management Solutions, SciComp SciFinance, Hanweck Options Analytics, NAG RNG, Aquimin AlphaVision

EDA: Synopsys TCAD, SPEAG SEMCAD X, Agilent EMPro 2010, Agilent ADS SPICE, CST Microwave, Acceleware FDTD Solver, Acceleware EM Solution, Rocketick Verilog Sim, Gauda OPC

Other: Schrodinger Core Hopping, Manifold GIS, Dalsa Machine Vision, MVTec Machine Vision
7
3 of the Top 5 Supercomputers
[Chart: performance (Gigaflops) and power (Megawatts) for Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II, and Tera 1000.]
9
What if Every Supercomputer Had Fermi?
[Chart: Linpack Teraflops for systems on the Top 500 Supercomputers list (Nov 2009), including Oak Ridge National Laboratory, Lawrence Livermore National Laboratory, IDRIS, and several named and unnamed network, semiconductor, geoscience, hosting, and IT-service companies.]
- 150 GPUs, 37 TeraFlops, $740K: Top 150
- 225 GPUs, 55 TeraFlops, $1.1M: Top 100
- 450 GPUs, 110 TeraFlops, $2.2M: Top 50
10
Hybrid ExaScale Trajectory
- 2008: 1 TFLOP at 7.5 kilowatts
- 2010: 1.27 PFLOPS at 2.55 megawatts
- 2017*: 2 EFLOPS at 10 megawatts (the projection implies roughly a 400x gain in energy efficiency over 2010, from about 0.5 to 200 GFlops per watt)
* This is a projection based on Moore's law and does not represent a committed roadmap.
11
Tesla Roadmap
12
The March of the GPUs
[Two charts, 2007-2012: peak memory bandwidth (GBytes/sec, up to 250) and peak double-precision floating point (GFlops/sec, up to 1200) for NVIDIA GPUs (T10, T20, T20A; ECC off) versus x86 CPUs (Nehalem 3 GHz, Westmere 3 GHz, 8-core Sandy Bridge 3 GHz).]
13
Project Denver
14
Expected Tesla Roadmap with Project Denver
15
Workstation / Data Center Solutions
- Workstations: up to 4x Tesla C2050/70 GPUs
- Integrated CPU-GPU server: 2x Tesla M2050/70 GPUs in 1U
- OEM CPU server + Tesla S2050/70: 4 Tesla GPUs in 2U
16
Tesla C-Series Workstation GPUs

                          Tesla C2050                    Tesla C2070
Processor                 Tesla 20-series GPU            Tesla 20-series GPU
Number of cores           448                            448
Caches                    64 KB L1 cache + shared memory per 32 cores; 768 KB L2 cache
Peak floating point       1030 Gigaflops (single); 515 Gigaflops (double)
GPU memory                3 GB (2.625 GB with ECC on)    6 GB (5.25 GB with ECC on)
Memory bandwidth          144 GB/s (GDDR5)               144 GB/s (GDDR5)
System I/O                PCIe x16 Gen2                  PCIe x16 Gen2
Power                     238 W (max)                    238 W (max)
Availability              Shipping now                   Shipping now
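The peak numbers follow from the core count and clock (the 1.15 GHz shader clock is a published C2050/C2070 spec, not shown on this slide): 448 cores x 2 flops per FMA x 1.15 GHz is roughly 1030 single-precision Gigaflops, and double precision runs at half that rate, 515 Gigaflops.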
17
How is the GPU Used?
Basic building block: the "Streaming Multiprocessor" (SM)
SIMD: "Single Instruction, Multiple Data"
The same instruction is issued to all cores, but each core operates on different data
"SIMD at the SM level, MIMD at the GPU chip level"
Source: Presentation from Felipe A. Cruz, Nagasaki University
18
The Use of GPUs and Bottleneck Analysis
Source: Presentation from Takayuki Aoki, Tokyo Institute of Technology
19
The Fermi Architecture
- 3 billion transistors
- 16 Streaming Multiprocessors (SMs)
- 6 x 64-bit memory partitions = 384-bit memory interface
- Host interface: connects the GPU to the CPU via PCI-Express
- GigaThread global scheduler: distributes thread blocks to the SM thread schedulers
20
SM Architecture
[Diagram: instruction cache; dual scheduler/dispatch; register file; 32 CUDA cores; 16 load/store units; 4 special function units; interconnect network; 64 KB configurable shared memory/L1 cache; uniform cache.]
- 32 CUDA cores per SM (512 total)
- 16 load/store units: source and destination addresses calculated for 16 threads per clock
- 4 special function units (sine, cosine, square root, etc.)
- 64 KB of RAM for shared memory and L1 cache (configurable)
- Dual warp scheduler
21
Dual Warp Scheduler
- 1 warp = 32 parallel threads
- 2 warps issued and executed concurrently
- Each warp goes to 16 CUDA cores
- Most instructions can be dual-issued (exception: double-precision instructions)
- The dual-issue model allows near-peak hardware performance
22
CUDA Core Architecture
[Diagram: the SM block diagram from the previous slide, with a single CUDA core expanded to show its dispatch port, operand collector, FP unit, INT unit, and result queue.]
- New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
- Newly designed integer ALU optimized for 64-bit and extended-precision operations
- Fused multiply-add (FMA) instruction for both 32-bit single and 64-bit double precision
23
Fused Multiply-Add Instruction (FMA)
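Stated as a formula (standard IEEE 754-2008 behavior, added here for reference): an FMA rounds once, after the exact product and sum, while a separate multiply-then-add (MAD) rounds twice:

    FMA(a, b, c) = rn(a × b + c)
    MAD(a, b, c) = rn(rn(a × b) + c)

The single rounding preserves the full intermediate product, which improves accuracy in dot products, iterative refinement, and correctly rounded division and square root.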
24
GigaThread™ Hardware Thread Scheduler (HTS)
- Hierarchically manages thousands of simultaneously active threads
- 10x faster application context switching (each program receives a time slice of processing resources)
- Concurrent kernel execution
25
GigaThread Hardware Thread Scheduler: Concurrent Kernel Execution + Faster Context Switch
[Diagram: time runs downward; with serial kernel execution, kernels 1 through 5 run one after another; with parallel kernel execution, small kernels share the GPU and run concurrently.]
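As a concrete illustration (a minimal sketch, not from the deck; kernel and variable names are made up): two independent kernels launched into different CUDA streams become eligible for concurrent execution on Fermi.

#include <cuda_runtime.h>

// Two independent kernels (hypothetical examples).
__global__ void kernel_a(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}
__global__ void kernel_b(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0f;
}

int main(void) {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    // Work in different streams has no ordering dependency, so the
    // GigaThread scheduler is free to run the kernels concurrently.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    const int threads = 256, blocks = (n + threads - 1) / threads;
    kernel_a<<<blocks, threads, 0, s0>>>(x, n);
    kernel_b<<<blocks, threads, 0, s1>>>(y, n);

    cudaDeviceSynchronize();   // wait for both streams to finish
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(x);
    cudaFree(y);
    return 0;
}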
26
GigaThread Streaming Data Transfer (SDT) Engine
- Dual DMA engines
- Simultaneous CPU-to-GPU and GPU-to-CPU data transfer
- Fully overlapped with CPU and GPU processing time
[Activity snapshot diagram: CPU activity, SDT0 transfers, GPU kernels 0 through 3, and SDT1 transfers all proceeding in parallel.]
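A minimal sketch of what the dual DMA engines enable (not from the deck; names are illustrative): with page-locked host memory and two streams, the upload of one chunk can overlap the compute and download of the other.

#include <cuda_runtime.h>

__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 0.5f;
}

int main(void) {
    const int n = 1 << 22, half = n / 2;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));  // pinned memory, required for async copies
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    // Two chunks in two streams: while chunk 0 computes, chunk 1's
    // host-to-device copy can run on one DMA engine while the
    // device-to-host copies flow back on the other.
    for (int c = 0; c < 2; ++c) {
        float *hp = h + c * half;
        float *dp = d + c * half;
        cudaMemcpyAsync(dp, hp, half * sizeof(float), cudaMemcpyHostToDevice, s[c]);
        scale<<<(half + 255) / 256, 256, 0, s[c]>>>(dp, half);
        cudaMemcpyAsync(hp, dp, half * sizeof(float), cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}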
27
Cached Memory Hierarchy
- First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory
- Shared memory / L1 cache per SM (64 KB): improves bandwidth and reduces latency
- Unified L2 cache (768 KB): fast, coherent data sharing across all cores in the GPU
- Global memory (up to 6 GB)
28
CUDA: Compute Unified Device Architecture
- NVIDIA's parallel computing architecture
- A software development platform aimed at the GPU architecture
[Stack diagram, bottom to top: (1) CUDA parallel compute engines inside the GPU; (2) CUDA support in the kernel-level driver; (3) the CUDA driver, exposing the PTX ISA; (4) device-level APIs: the OpenCL driver for applications using OpenCL C, DirectX 11 Compute for applications using HLSL, and the CUDA Driver API for applications using C for CUDA; (5) language integration: the C runtime for CUDA, for applications using C, C++, Fortran, Java, Python, ...]
29
Thread Hierarchy
- Kernels (simple C programs) are executed by threads
- Threads are grouped into blocks
- Threads in a block can synchronize execution
- Blocks are grouped into a grid
- Blocks are independent (they must be able to execute in any order)
Source: Presentation from Felipe A. Cruz, Nagasaki University
30
Memory and Hardware Hierarchy
- Threads access registers; CUDA cores execute threads
- Threads within a block can share data/results via shared memory; streaming multiprocessors (SMs) execute blocks
- Grids use global memory for result sharing (after kernel-wide global synchronization); the GPU executes grids
Source: Presentation from Felipe A. Cruz, Nagasaki University
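A minimal sketch tying the three levels together (not from the deck; names are illustrative): per-thread values live in registers, a block cooperates through shared memory, and each block's result lands in global memory.

#include <cuda_runtime.h>

__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float tile[256];               // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;         // per-thread register
    tile[threadIdx.x] = v;
    __syncthreads();                          // block-level synchronization

    // Tree reduction: threads within the block share partial sums.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];            // per-grid global memory
}

// Launch with 256 threads per block to match the tile size:
// block_sum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);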
31
Full View of the Hierarchy Model

CUDA       Hardware Level   Memory Access
Thread     CUDA Core        Registers
Block      SM               Shared Memory
Grid       GPU              Global Memory
Device     Node             Host Memory
32
IDs and Dimensions
[Diagram: a device running Grid 1, a 3x2 arrangement of blocks; Block (1,1) is expanded into a 5x3 arrangement of threads.]
- Threads: 3D IDs, unique within a block
- Blocks: 2D IDs, unique within a grid
- Dimensions are set at launch time and can be unique for each grid
- Built-in variables: threadIdx, blockIdx, blockDim, gridDim
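A short sketch using these built-in variables (not from the deck; the image-brightening kernel is a made-up example): each thread derives a unique 2D coordinate from its block and thread IDs.

__global__ void brighten(unsigned char *img, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < w && y < h) {
        int v = img[y * w + x] + 10;
        img[y * w + x] = (unsigned char)(v > 255 ? 255 : v);
    }
}

// Launch with a 2D grid of 2D blocks covering the image:
// dim3 block(16, 16);
// dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
// brighten<<<grid, block>>>(d_img, w, h);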
33
Compiling C for CUDA Applications

Serial C application (before porting):

void serial_function(...) { ... }
void other_function(int ...) { ... }
void saxpy_serial(float ...) {
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
void main() {
    float x;
    saxpy_serial(...);
    ...
}

[Flow: the application is modified into parallel CUDA code; NVCC (Open64) compiles the key C-for-CUDA kernels into CUDA object files, the CPU compiler compiles the rest of the C application into CPU object files, and the linker combines both into a single CPU-GPU executable.]
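In practice this whole flow is driven by one compiler-driver command (a typical invocation, assuming the mixed source file is named saxpy.cu; the file name is illustrative):

nvcc saxpy.cu -o saxpy

nvcc separates device code from host code, compiles the kernels itself, hands the host code to the system C compiler, and links everything into one CPU-GPU executable.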
34
C for CUDA: C with a few keywords

Standard C code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C for CUDA code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
35
Software Programming
Source: Presentation from Andreas Klöckner, NYU
43
CUDA C/C++ Leadership

Timeline, July 07 to Mar 10:
- CUDA Toolkit 1.0: C compiler, C extensions, single precision, BLAS, FFT, SDK with 40 examples
- CUDA Toolkit 1.1: Win XP 64, atomics support, multi-GPU support
- CUDA Toolkit 2.0: double precision, compiler optimizations, Vista 32/64, Mac OSX, 3D textures, HW interpolation
- CUDA Visual Profiler 2.2
- CUDA Toolkit 2.3: DP FFT, 16-32 conversion intrinsics, performance enhancements
- cuda-gdb HW debugger
- Parallel Nsight Beta
- CUDA Toolkit 3.0: C++ inheritance, Fermi architecture support, tools updates, driver/runtime interop
44
Why should I choose Tesla over consumer cards?

Features:
- 4x higher double precision (on 20-series): higher performance for scientific CUDA applications
- ECC, only on Tesla and Quadro (on 20-series): data reliability inside the GPU and on DRAM memories
- Bi-directional PCI-E communication (Tesla has dual DMA engines, GeForce has only one): higher performance for CUDA applications by overlapping communication and computation
- Larger memory for larger data sets (3 GB and 6 GB products): higher performance on a wide range of applications (medical, oil & gas, manufacturing, FEA, CAE)
- Cluster management software tools available on Tesla only: needed for GPU monitoring and job scheduling in data center deployments
- TCC (Tesla Compute Cluster) driver for Windows, supported only on Tesla: higher performance for CUDA applications due to lower kernel launch overhead; TCC adds support for RDP and Services
- Integrated OEM workstations and servers: trusted, reliable systems built for Tesla products
- Professional ISVs certify CUDA applications only on Tesla: bug reproduction, support, and feature requests for Tesla only

Quality & Warranty:
- 2 to 4 days of stress testing and memory burn-in for reliability; added margin in memory and core clocks; built for 24/7 computing in data center and workstation environments
- Manufactured and guaranteed by NVIDIA: no changes in key components like GPU and memory without notice; always the same clocks for known, reliable performance
- 3-year warranty from HP: reliable, long-life products

Support & Lifecycle:
- Enterprise support with higher priority for CUDA bugs and requests: ability to influence the CUDA and GPU roadmap, with early access to feature requests
- 18-24 months of availability plus a 6-month EOL notice: reliable product supply