GPU Supercomputing
N.D. Hari Dass
Indian Institute of Science, Bangalore
Poornaprajna Institute, Bangalore
Saturday, August 22, 2009
Supercomputing in Old Stone Age
• Long, long ago, supercomputers had to be specially built.
• They required large memory blocks - expensive!!
• The interconnects were proprietary - also expensive, though with great performance!
• Additional features like large-scale vector processing.
Supercomputing in New Stone Age
• The idea was to use off-the-shelf desktops without monitors, connected by networks with as high bandwidth and as low latency as possible.
• Distribute the memory
• The era of clusters
KABRU – The Massive Cluster at IMSc
Supermicro Twin - 2 Nodes in 1U
• The 1U Twin™ is Supermicro's innovatively designed 1U rack-mount system for increasing computing density, saving cost, and reducing energy and space requirements.
• Supports dual Xeon dual/quad-core CPUs (up to 16 cores in 1U, up to 672 cores in a 42U rack).
• A 1U Twin system contains two independent, symmetric motherboards!!
Twin Motherboards
Supermicro Twin - Specifications
• Supports up to two Intel® Xeon® 51xx, 52xx, 53xx & 54xx processors per node; 1600/1333/1066 MHz system bus
• Supports up to 64 GB memory per node: DDR2-667/800 (1.8V/1.5V) FBDIMMs (1.5V FBDIMMs consume less power and generate less heat)
• Available with GbE / DDR InfiniBand / 10 Gb Ethernet
• PCI-Express x16 expansion slot
• High-efficiency shared power supply (93% efficiency)
Supermicro Blade
• 90% cable reduction: results in better airflow & better cooling
• Easier and faster to deploy & troubleshoot
• Common, shared, redundant and high-efficiency power supply (90%-93% efficiency)
• 7U blade chassis
• Can accommodate 10 dual-processor or quad-processor blades
• Up to 160 cores per 7U, or 960 cores per 42U rack (using quad-processor blades)
• Up to 32 GB / 64 GB memory per dual-/quad-processor blade
• DDR InfiniBand available as an option
Clusters: Then & Now
                 2003      Now: 1U    Now: Twin    Now: Blade
No. of CPUs      164       20         20           20
Rack space       82U       10U        5U           7U
Power            25 KW     4 KW       3.85 KW      3.85 KW
1U Twin vs Blade

                     1U Twin                        Blade
Space per node       More compact (0.5U)            0.7U
Cost                 Cheaper                        More expensive
Expansion            Std. PCI-Express               Mezzanine
Power supply         Not redundant                  Redundant
Cabling              A mess                         Lesser/neater
Some of the problems..
• Slow PCI slot performance
• Memory access bottlenecks
Core Incompetence?
Cores        Memory     Time      Latency     Bandwidth
1 (single)   493 MB     81.2 s    1.936 µs    --
2            246.5 MB   43.1 s    2.06 µs     788 MB/s
4            129 MB     33.3 s    3.18 µs     492 MB/s
8 (1-D)      70.4 MB    32.2 s    6.15 µs     173 MB/s
8 (3-D)      61.7 MB    31.6 s    6.03 µs     414 MB/s

Intel 2×Quad Core @ 2.8 GHz
Core Incompetence?
AMD 2×Quad Core

Cores   Memory   Time      Latency
1       492 MB   147 s     3.5 µs
2       246 MB   72.32 s   3.448 µs
4       129 MB   47.8 s    4.56 µs
8       70 MB    29.3 s    5.6 µs
Intel Nehalem
• This architecture has largely overcome the front-side bus (FSB) bottlenecks.
• The scaling from 1 to 2 and from 2 to 4 cores is excellent.
• The scaling from 4 to 8 cores is good, though not as good as in the case of AMD.
• But the overall performance of Nehalem is better than that of AMD.
Speed - Memory Issue
• As the number of cores goes up the CPU performance (theoretical peak) increases.
• KABRU: 4.8 GFlops/CPU
• Intel Quad Core: 50 GFlops/CPU
• It becomes harder to maintain the ratio of 'memory to performance'.
• Issues with increasing memory: different chipset, power consumption, ...
GPU Based Supercomputing
• A single Tesla C1060 card has a claimed peak performance of 1 Teraflop in single precision!
• Four such cards can sit in a single 1U box.
• The cost of such a GPGPU supercomputer is about 5 lakh rupees.
• Nearly 4 times as fast as KABRU, but costing 50 times less!
• Power consumption is about 800 W - 40 times less; no air-conditioning/infrastructure needed.
A Tesla C1060 Card
4 Tesla In 1U
Issues with GPUs
• Codes should have a high degree of data parallelism.
• Available dedicated memory is rather low - even for Tesla C1060 cards it is only 4 GB per card.
• Double-precision performance is much poorer than single-precision performance - a factor of 12 lower!!
• This is due to the register structure - an improvement by a factor of 3 is being talked about.
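What "data parallelism" means in practice: every GPU thread applies the same operation to its own array element. A minimal sketch (an illustrative single-precision SAXPY kernel, not code from this talk):

```cuda
// Single-precision y = a*x + y: one thread per element.
// The same instruction stream over different data -- the
// pattern GPUs execute well (and in the fast precision).
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Launch with enough 256-thread blocks to cover n elements:
//   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```

A loop with dependencies between iterations could not be split up this way, which is why codes without such parallelism gain little.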
Issues with GPUs
• If the code is a mixture of single and double precision, with the volume of the latter around 10%, it is still OK.
• Exploiting the host CPUs is an option.
• Transfers between CPU and GPU go through PCI-Express x16 Gen 2.0.
• The transfer speed is nowhere near that between, say, CPU and cache.
• It is often better to perform a fresh calculation instead of fetching processed data.
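One standard way to soften the slow PCIe link, and to exploit the host CPU at the same time, is to overlap transfers with kernel work using CUDA streams. A sketch (the kernel `process`, the helper `host_work`, and the chunking are hypothetical; `h_in`/`h_out` must be pinned host memory, e.g. from cudaMallocHost, for the async copies to overlap):

```cuda
cudaStream_t s[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&s[i]);

// Pipeline: while one chunk is being copied over PCIe,
// another chunk is being computed on the GPU.
for (int c = 0; c < nChunks; ++c) {
    cudaStream_t st = s[c % 2];
    float *d = d_buf + c * chunk;
    cudaMemcpyAsync(d, h_in + c * chunk, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, st);
    process<<<(chunk + 255) / 256, 256, 0, st>>>(d, chunk);
    cudaMemcpyAsync(h_out + c * chunk, d, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, st);
}

host_work();                // the host CPU does its own share meanwhile
cudaDeviceSynchronize();    // wait for all streams to drain
```

When even overlapped transfers are too slow, the slide's last point applies: recompute the data on the GPU rather than ship it across the bus.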
Issues with GPUs
• One has to code using a new 'language' - CUDA in the case of NVIDIA cards.
• Not really a problem for moderate-sized codes, but it can be an issue for large codes.
• Requires dexterous management of CPU and GPU resources.
• But considering the phenomenal performance improvements that are being talked about, it is worth the trouble!!
• Intel Larrabee??
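To give a flavour of the 'new language' and of the CPU/GPU bookkeeping it demands, here is a complete minimal CUDA program (an illustrative sketch, compiled with nvcc): allocate on the device, copy in, launch, copy back, free.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: scale every element of x by a, one thread per element.
__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *h = new float[n];              // host-side buffer
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, n * sizeof(float));    // device-side buffer
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(d, n, 3.0f);   // kernel launch

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);          // 1.0 * 3.0 = 3.0

    cudaFree(d);
    delete[] h;
    return 0;
}
```

Every buffer exists twice (host and device) and every transfer is explicit - this is the resource management the slide refers to, and it grows with the size of the code.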