GPU Supercomputing
N.D. Hari Dass
Indian Institute of Science, Bangalore
Poornaprajna Institute, Bangalore
Saturday, August 22, 2009
Supercomputing in Old Stone Age
• Long, long ago, supercomputers had to be specially built.
• They required large memory blocks - expensive!!
• The interconnects were proprietary - also expensive, though with great performance!
• Additional features like large-scale vector processing.
Supercomputing in New Stone Age
• The idea was to use off-the-shelf desktops without monitors, connected by networks with as high bandwidth and as low latency as possible.
• Distribute the memory
• The era of clusters
KABRU – The Massive Cluster at IMSc
Supermicro Twin - 2 Nodes in 1U
• The 1U Twin™ is Supermicro's innovatively designed 1U rack-mount system for increasing computing density, saving cost, and reducing energy and space requirements.
• Supports dual Xeon dual/quad-core CPUs (up to 16 cores in 1U, up to 672 cores in a 42U rack).
• A 1U Twin system contains two independent, symmetric motherboards!!
Twin Motherboards
Supermicro Twin - Specifications
• Supports up to two Intel® Xeon® 51xx, 52xx, 53xx & 54xx processors per node; 1600/1333/1066 MHz system bus
• Supports up to 64 GB memory per node: DDR2-667/800 (1.8V/1.5V) FBDIMMs (1.5V FBDIMMs consume less power and generate less heat)
• Available with GbE / DDR InfiniBand / 10 Gb Ethernet
• PCI-Express x16 expansion slot
• High-efficiency shared power supply (93% efficiency)
Supermicro Blade
• 90% cable reduction: results in better airflow & better cooling
• Easier and faster to deploy & troubleshoot
• Common, shared, redundant and high-efficiency power supply (90%-93% efficiency)
• 7U blade chassis
• Can accommodate 10 dual-processor or quad-processor blades
• Up to 160 cores per 7U, or 960 cores per 42U rack (using quad-processor blades)
• Up to 32 GB / 64 GB memory per dual-/quad-processor blade
• DDR InfiniBand available as an option
Clusters: Then & Now
                 2003      Now: 1U    Now: Twin    Now: Blade
No. of CPUs      164       20         20           20
Rack space       82U       10U        5U           7U
Power            25 KW     4 KW       3.85 KW      3.85 KW
1U Twin vs Blade

                     1U Twin                        Blade
Space per node       More compact (0.5U)            0.7U
Cost                 Cheaper                        More expensive
Expansion            Std. PCI-Express               Mezzanine
Power supply         Not redundant                  Redundant
Cabling              A mess                         Lesser/neater
Some of the problems..
• Slow PCI slot performance
• Memory access bottlenecks
Core Incompetence?
Cores        Memory     Time      Latency     Bandwidth
1 (single)   493 MB     81.2 s    1.936 µs    --
2            246.5 MB   43.1 s    2.06 µs     788 MB/s
4            129 MB     33.3 s    3.18 µs     492 MB/s
8 (1-D)      70.4 MB    32.2 s    6.15 µs     173 MB/s
8 (3-D)      61.7 MB    31.6 s    6.03 µs     414 MB/s

Intel 2×Quad Core @ 2.8 GHz
Core Incompetence?
AMD 2×Quad Core

Cores   Memory   Time      Latency
1       492 MB   147 s     3.5 µs
2       246 MB   72.32 s   3.448 µs
4       129 MB   47.8 s    4.56 µs
8       70 MB    29.3 s    5.6 µs
Intel Nehalem
• This architecture has largely overcome the front-side bus (FSB) bottlenecks.
• The scaling from 1 to 2 and from 2 to 4 cores is excellent.
• The scaling from 4 to 8 cores is good, though not as good as in the case of AMD.
• But the overall performance of Nehalem is better than that of AMD.
Speed - Memory Issue
• As the number of cores goes up the CPU performance (theoretical peak) increases.
• KABRU: 4.8 GFlops/CPU
• Intel Quad Core: 50 GFlops/CPU
• It becomes harder to maintain the ratio of 'memory to performance'.
• Issues with increasing memory: different chipset, power consumption, ...
GPU Based Supercomputing
• A single Tesla C1060 card has a claimed peak performance of 1 Teraflop in single precision!
• Four such cards can sit in a single 1U box.
• The cost of such a GPGPU supercomputer is about 5 lakh rupees.
• Nearly 4 times as fast as KABRU, but costing 50 times less!
• Power consumption is about 800 W - 40 times less; no air-conditioning/infrastructure needed.
A Tesla C1060 Card
4 Tesla In 1U
Issues with GPUs
• Codes should have a high degree of data parallelism.
• Available dedicated memory is rather low - even for Tesla C1060 cards it is only 4 GB per card.
• Double-precision performance is much poorer than single-precision performance - a factor of 12 lower!!
• This is due to the register structure - an improvement by a factor of 3 is being talked about.
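What "data parallelism" means in practice: every GPU thread applies the same operation to its own array element. A minimal sketch (an illustrative single-precision SAXPY kernel, not code from this talk):

```cuda
// Single-precision y = a*x + y: one thread per element.
// The same instruction stream over different data -- the
// pattern GPUs execute well (and in the fast precision).
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Launch with enough 256-thread blocks to cover n elements:
//   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```

A loop with dependencies between iterations could not be split up this way, which is why codes without such parallelism gain little.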
Issues with GPUs
• If the code is a mixture of single and double precision, with the volume of the latter around 10%, it is still OK.
• Exploiting the host CPUs is an option.
• Transfers between CPU and GPU go through PCI-Express x16 Gen 2.0.
• The transfer speed is nowhere near that between, say, CPU and cache.
• It is often better to perform a fresh calculation instead of fetching processed data.
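One standard way to soften the slow PCIe link, and to exploit the host CPU at the same time, is to overlap transfers with kernel work using CUDA streams. A sketch (the kernel `process`, the helper `host_work`, and the chunking are hypothetical; `h_in`/`h_out` must be pinned host memory, e.g. from cudaMallocHost, for the async copies to overlap):

```cuda
cudaStream_t s[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&s[i]);

// Pipeline: while one chunk is being copied over PCIe,
// another chunk is being computed on the GPU.
for (int c = 0; c < nChunks; ++c) {
    cudaStream_t st = s[c % 2];
    float *d = d_buf + c * chunk;
    cudaMemcpyAsync(d, h_in + c * chunk, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, st);
    process<<<(chunk + 255) / 256, 256, 0, st>>>(d, chunk);
    cudaMemcpyAsync(h_out + c * chunk, d, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, st);
}

host_work();                // the host CPU does its own share meanwhile
cudaDeviceSynchronize();    // wait for all streams to drain
```

When even overlapped transfers are too slow, the slide's last point applies: recompute the data on the GPU rather than ship it across the bus.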
Issues with GPUs
• One has to code using a new 'language' - CUDA in the case of NVIDIA cards.
• Not really a problem for moderate-sized codes, but it can be an issue for large codes.
• Requires dexterous management of CPU and GPU resources.
• But considering the phenomenal performance improvements that are being talked about, it is worth the trouble!!
• Intel Larrabee??
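To give a flavour of the 'new language' and of the CPU/GPU bookkeeping it demands, here is a complete minimal CUDA program (an illustrative sketch, compiled with nvcc): allocate on the device, copy in, launch, copy back, free.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: scale every element of x by a, one thread per element.
__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *h = new float[n];              // host-side buffer
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, n * sizeof(float));    // device-side buffer
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(d, n, 3.0f);   // kernel launch

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);          // 1.0 * 3.0 = 3.0

    cudaFree(d);
    delete[] h;
    return 0;
}
```

Every buffer exists twice (host and device) and every transfer is explicit - this is the resource management the slide refers to, and it grows with the size of the code.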