New Technology in High Performance Computing Bae WonSoung Ph.D. TAO Computing inc....
-
Upload
hilary-walker -
Category
Documents
-
view
215 -
download
1
Transcript of New Technology in High Performance Computing Bae WonSoung Ph.D. TAO Computing inc....
1.Introduction hardware acceleration technology (Hybrid computing)
2. CSX600 Acceleration technology parallelism based on SIMD architecture
3. CSX600 Performance
4. Approach to Blade system etc
5. Application Performance, Research & Development
6. Case Study & Prospect
7. Discussion
Contents
Introduction
Hybrid computing 의 대두 현재의 cpu 의 기술의 한계 (chip 의 직접화의 한계 ) multi-core 기술
산술연산기능상대적 비율의 감소 범용 ( 개인용 ) 위주의 개발 특화
CPU operation Acceleratedoperation + Hybrid Computing
Hybrid computing
Acceleration & Hybrid computing
Acceleration & Hybrid computing technology
1. FPGA (Field Programmable Gate Array)
xilinx, Altera (SGI)
2. GPU( Graphic Processor Unit)
ATI (AMD) CTM project , NVIDIA (CUDA/Tesla)
3. Game Processor
Cell, Mercury
4. SIMD processor ( Accelerated coprocessor)
CSX600
FPGAs work best for bit and integer data types
• Excellent for bit-twiddling, like cryptography• Fast for integer manipulations, like genomic
algorithms for pattern matching• Marginal for 32-bit floating-point; have to do over
200 at once to compete with current general purpose chips
• Poor at 64-bit floating-point… don’t use them as a supplementary FLOPS unit• “Programming” is really circuit design, though tools are
making this easier.• Compare speed against meticulously coded assembler on node, not casual coding on node.• Where FPGAs make the most sense is in creating instructions very unlike those provided by the node instruction set.• FPGAs can cost more than an entire server!
Where GPUs can help with HPC applications
• Single-precision calculations where answer quality is less important than raw speed– Seismic exploration– Some types of Quantum Chromodynamics
• Graphics-type calculations, obviously, for visualization and result display
• Only 32-bit, and non-IEEE rounding degrades accuracy cumulatively• Can use over 200 watts, multiple slots• Often only have a few megabytes of local store (frame buffer architecture)• Cheap hardware but very expensive software (the kind you create yourself)• Game processors don’t match endian-ness of node CPUs
Performance via limited power dissipation
CSX600 Acceleration Technology
• Dual ClearSpeed CSX600 coprocessors• 96 GFLOPS “theoretical” Peak• R∞ ≈ 75 GFLOPS for 64-bit matrix multiply (DGEMM) calls
– Hardware does also support 32-bit floating point and integer calculations
• 133 MHz PCI-X 2/3rds length (8”) form factor and PCI-e(8X) form factor• 1GByte of memory on the board• Drivers today for Linux (RedHat and Suse) and
Windows(XP,sever2003(32/64 bit) wccs2003)• Low power; 25 Watts typical• Multiple boards can be used together for greater performance
X620 board e620 board
CSX600 Configuration
CSX600 Architecture
Introduction to PEs (processing Elements)
CSX600 Performance Data
64-bit Function Operations per Second (Billions)
0.0
0.5
1.0
1.5
2.0
2.5
Sqrt InvSqrt Exp Ln Cos Sin SinCos Inv
Function name
2.6 GHz dual-core Opteron
3 GHz dual-core Woodcrest
ClearSpeed Advance card
Typical speedup of ~8X over the fastest x86 processors, because math functions stay in the local memory on the card
Doubling host-to-card bandwidth has minor effect because of I/O overlap.A zero-latency connection would not visibly affect either curve!
2000 4000 6000 8000 10000
10
20
30
40
50
60
70
PCI-X
PCIe
Matrix size
GFLOPS
CSX600 DGEMM Performance
CSX600 LINPACK Performance
System SpecificationLinpack Result (GFLOPS)
Elapsed Time
4 nodes (16GB) w/o Advance boards 136.048.4 minutes
4 nodes (16GB) w/ 2 x Advance boards each
364.218.4 minutes
1 node (16GB) w/o Advance boards 34.0
1 node (16GB) w/ 2 x Advance boards
90.1
Note: Previously published Linpack results for similar single node systems were 34.9 GFLOPS for the standard node and 93 GFLOPS for an accelerated node with two ClearSpeed Advance boards. The variations are a result of small differences between system configurations and problem sizes used during the benchmark runs.
FFT performance
Approach to blade systems or etc
Example of a possible 1U server
Rack width
(19 in.)
6.1 in. card
2.66 GHzDual-socketquad-core x86plus 16 GB memory, 4x InfiniBand.
• Two ClearSpeed PCIe cards on risers
• 1 or 2 GB DRAM/card• Enclosure supplies
25W, cooling per card• Total power draw: 450
W
– 277 peak GFLOPS (64-bit)– ~225 DGEMM GFLOPS sustained with MKL+new DGEMM– ~170 LINPACK GFLOPS sustained– 18 or 20 GB DRAM on host: 2 or 4 GB on CS cards, 16 GB on
x86 host
Server and workstation installations
• Can be installed in standard servers and workstations– E.g. HP DL380 takes 2 boards
• Some 4U servers could take 6-8 boards• Potential for a PCI backplane chassis to take
anything from 12 to 19 boards in 4U– A 19-board system could be under 750W in 4U– Could put 9 of these in a 42U rack -> 171 boards– If each board 10X a fast x86 core, equivalent to
1,710 cores in a single rack but for only 6.8kW of power
Blade installations
• Can be installed in blades and via expansion units– E.g. IBM PEU2 compatible with HS21 takes 2
boards– HP blade case using customizing expansion
unit
Case installation
A cluster of four nodes 136 GFLOPS Hardware configuration
• Intel®Xeon® 5160 (Woodcrest) dual core processors 4 nodes (16CPU) • consuming 1,940 watts
• Two Advance accelerator boards in each server
• Cluster performance increase over 364 GFLOPS • Adding only 200 watts
TOP500 (November 1996)
•Center for Computational Science at the University Of Tsukuba in Japan•2048 CPU Hitachi system 368.2 GFLOPS
HP DL380 takes 2 boards
Conventional Way
• 32 kwatts X3+ α
• 10x3+α sq. ft.
• ~$500,000 X3+ α =~$1500,000+ α
• 2.2TFLOPS(lower)
• Reconstruction facility Expense
ClearSpeed Way
Increasing capacity to 2.2 TFLOPS
Clustering system 구축
Clustering system in CSX600
Titech 500.org ranking
• Announced on Monday 9th of October 2006: Tokyo Tech have accelerated their Linux supercomputer, TSUBAME, from 38 TFLOPS to 47 TFLOPS with 360 ClearSpeed Advance boards– An increase in performance of 24%, but for just a
1% increase in power consumption – 10,368 AMD Opteron cores with just 360
ClearSpeed Advance boards– #9 in November 2006 Top500– 1st accelerated system in the Top500
Professor Matsuoka standing beside TSUBAME at Tokyo Tech
Application performance & R&D
Structural Analysis
ElectromagneticModeling
Radar Cross-Section
Ab initio Computational Chemistry
Global Illumination Graphics
LINPACK speed correlates with many real applications
• Dense matrix-matrix kernels: order N 3 ops on order N 2 data
– CFD ,CAE by boundary element and Green’s function methods
• N-body interactions: order N 2 ops on order N data
– CFD, CAE with high mean-free-path
• Some sparse matrix operations: order NB 2 ops on order NB data where B is the average matrix band size
– CFD,CAE with finite element methods
• Time-space marching: order N 4 ops on order N 3 data
– CFD,CAE with finite difference methods; data must reside on board
• Fourier transforms: order N log N ops on order N data, with other processing to increase data re-use
– CFD, CAE by spectral methods or pseudo spectral methods
Application area (CAE CFD)
Ax = b
N equations N unknowns
N iterations
ClearSpeed accelerates the DGEMM kernel of equation solving that takes over 90% of the time.
DGEMM
DGEMM
DGEMM
DGEMM
Volume = 1⁄3 N3
multiply-adds
Solving N equations takes order N 3 work
Accelerating sparse solvers: ANSYS & LS-DYNAAccelerating sparse solvers: ANSYS & LS-DYNA
• Potentially pure plug-and-play• No added license fee• Demands ClearSpeed’s 64-bit
precision and speed• Enabled by recent DGEMM
improvements; still needs symmetric ATA modification
• Could enable some Computational Fluid Dynamics acceleration (codes based on finite elements)50,000 dense
equationsAccelerator can solveat over 50 GFLOPS
becomes…
3.6x net applicationacceleration
Non-solverSolver setup
DGEMMon x86
host Non-solverSolver setup
DGEMMwith ClearSpeed
10x
10 million degrees of freedom (sparse)
Matlab /Mathematica DGEMM performance
NEW DGEMM 이용 62.16 GFLOPs in Matlab2006a xeon system
• AMBER (porting complete) accelerate 4~10x대표적인 molecular dynamics 해석 및 시뮬레이션프로그램 분자레벨의 동역학 모의실험에 많이 이용되고 있으며 생물분야에서도 단백질구조체에 대한 거동해석 및 모의실험에도 많이 이용 . AMBER 의 경우는 상용이고 GROMACS 의 경우는 프리웨어임
분자동력학 분석 및 모의실험을 수행하는 프로그램 성격상 많은 대단위 행렬계산이 필요 병렬연산에 관한 부분까지 포함 하여 많은 부분이수치적연산에 의존
Amber acceleration
AMBER module
HostAdvance X620
Speedup
Gen. Born 1: 83.5 min. 24.6 min. 3.4x speedup
Gen. Born 2: 84.6 min. 23.5 min. 3.6x speedup
Gen. Born 6: 37.9 min. 4.0min. 9.4x speedup Host: 2.8GHz Pentium 4 EMT64, OS: RHEL4-64, CSXL: version 2.50
Amber acceleration
Monte Carlo methods exploit high local bandwidth
• Monte Carlo methods are ideal for ClearSpeed acceleration:– High regularity and locality of the algorithm– Very high compute to I/O ratio– Very good scalability to high degrees of parallelism– Needs 64-bit
• Excellent results for parallelization– Achieving 10X performance per Advance card vs. highly
optimized code on the fastest x86 CPUs available today– Maintains high precision required by the computations
True 64-bit IEEE 754 floating point throughout– 25 W per card typical when card is computing
• ClearSpeed has a Monte Carlo example code, available in source form for evaluation
CSX600 Monte Carlo Improvement I
Monte Carlo scale like the NAS “EP” benchmark
CSX600 Monte Carlo Improvement II
• 1 CPU, no acceleration : 400M samples, 60 seconds • 1 Advance board : 400M samples, 2.9 seconds, 20x
speed up • 2 Advance boards : 400M samples, 1.5 seconds, 40x
speed up• 4 Advance boards : 400M samples, 0.8 seconds, 79x
speed up
Why do Monte Carlo apps need 64-bit?• Accuracy increases as the
square root of the number of trials, so five-decimal accuracy takes 10 billion trials.
• But, when you sum many similar values, you start to scrape off all the significant digits.
• 64-bit summation needed to get a single-precision result!
Single precision:1.0000x108 + 1= 1.0000x108
Double precision:1.0000x108 + 1 = 1.00000001x108
CSX600 Monte Carlo Improvement III
CSX600 Financial Application
Black-Scholes analytic pricing formula
Binomial method
Monte Carlo
Finite difference method
Broadie-Glasserman random tree method
Overall speed up gain range from 2x~ 70X
Precision double
Customizing
CSX600 지원 software 개발 환경
Windows series
CSX600 Summary & discussion
• For acceleration of numerically-intensive codes, such as – Math functions (RNGs, sin, cos, log, exp, sqrt etc.) – Standard libraries (Level 3 BLAS, LAPACK, FFTs)
• ~70 GFLOPS sustained from a ClearSpeed Advance™ board• Accelerated functions are callable from C/C++ or Fortran• ~25 watts per single-slot board
– Performance measured in GFLOPS per watt rather than MFLOPS per watt
• Multiple ClearSpeed Advance™ boards can be used together for even higher performance and compute density
• Current ClearSpeed board is mature product in production systems at top500 site
• ClearSpeed can deliver:– > 100 TFLOPS for June top500 and – > 1 PFLOPS for November 2006 top500