Building a Teraflop Supercomputer for India

GPU Supercomputing N.D. Hari Dass Indian Institute of Science, Bangalore Poornaprajna Institute, Bangalore Saturday, August 22, 2009


GPU Supercomputing

N.D. Hari Dass

Indian Institute of Science, Bangalore

Poornaprajna Institute, Bangalore

Saturday, August 22, 2009


Supercomputing in Old Stone Age

• Long long ago Supercomputers had to be specially built.

• They required large memory blocks - expensive!!

• The interconnects were proprietary - also expensive, though with great performance!

• Additional features like large scale vector processing.


Supercomputing in New Stone Age

• The idea was to use off the shelf desktops without monitors, connect them with networks with as high bandwidth and as low latency as possible.

• Distribute the memory

• Era of Clusters


KABRU – The Massive Cluster at IMSc


Supermicro Twin - 2 Nodes in 1U

[Figure: 1U Twin chassis, with Node 1 and Node 2 labeled]

1U Twin™ is Supermicro's innovative 1U rack-mount system, designed to increase computing density, save cost, and reduce energy and space requirements. It supports dual Xeon dual/quad-core CPUs (up to 16 cores in 1U, up to 672 cores in a 42U rack).

The 1U Twin system contains two independent symmetric motherboards!!!
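The density figures quoted on this slide follow from simple multiplication. A minimal sketch, assuming quad-core Xeons in both sockets of both nodes and a rack filled entirely with 1U Twin units (the function name and defaults are illustrative, not from the talk):

```python
def cores_per_rack(rack_units=42, nodes_per_unit=2,
                   sockets_per_node=2, cores_per_socket=4):
    """Return (cores per 1U, cores per rack) for a rack full of 1U Twin boxes."""
    cores_per_unit = nodes_per_unit * sockets_per_node * cores_per_socket
    return cores_per_unit, rack_units * cores_per_unit


per_unit, per_rack = cores_per_rack()
print(per_unit, per_rack)  # prints: 16 672
```

A real rack would likely give up a few units to switches and power distribution, so 672 is an upper bound.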


Twin Motherboards


Supermicro Twin - Specifications

• Supports up to two Intel® Xeon® 51xx, 52xx, 53xx & 54xx processors per node; 1600/1333/1066MHz system bus

• Supports up to 64GB memory per node; DDR2-667/800 (1.8V/1.5V) FBDIMMs (1.5V FBDIMMs consume less power and generate less heat)

• Available with GbE/DDR IB/10Gb Ethernet

• PCI-Express x16 expansion slot

• High-efficiency shared power supply (93% efficiency)


Supermicro Blade

• 90% cable reduction results in better airflow & better cooling

• Easier and faster to deploy & troubleshoot

• Common, shared, redundant and high-efficiency power supply (90%-93% efficiency)

• 7U Blade chassis

• Can accommodate 10 dual-processor or quad-processor blades

• Up to 160 cores per 7U or 960 cores per 42U rack (using quad-processor blades)

• Up to 32GB/64GB memory per dual/quad-processor blade

• DDR Infiniband available as option


Clusters: Then & Now

               2003      NOW
                         1U        TWIN      BLADE
No. of CPU     164       20        20        20
Rack Space     82U       10U       5U        7U
Watts          25KW      4KW       3.85KW    3.85KW
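To compare the four columns on an equal footing, one can normalise rack space and power by CPU count. The sketch below only divides the numbers from the table; the names `CLUSTERS` and `per_cpu` are illustrative. It does not account for how much more each modern CPU can do than a 2003 CPU.

```python
# Figures copied from the table: (number of CPUs, rack units, watts).
CLUSTERS = {
    "2003": (164, 82, 25_000),
    "1U": (20, 10, 4_000),
    "Twin": (20, 5, 3_850),
    "Blade": (20, 7, 3_850),
}


def per_cpu(cpus, rack_units, watts):
    """Return (rack units per CPU, watts per CPU)."""
    return rack_units / cpus, watts / cpus


for name, row in CLUSTERS.items():
    units, watts = per_cpu(*row)
    print(f"{name:6s} {units:.2f} U/CPU  {watts:.1f} W/CPU")
```

Per CPU, the Twin halves the rack space of a conventional 1U server, while the power drawn per CPU is not lower than in 2003; the gain comes from what each CPU delivers.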


Twin-U Vs Blade

Twin 1U                            Blades
More compact/less space (0.5U)     0.7U
Cheaper                            Expensive
Std. PCI-Express expansion         Mezzanine expansion
Power supply not redundant         Redundant power
Cabling is a mess                  Lesser/neater cabling


Some of the problems..

• Slow PCI slot performance

• Memory access bottlenecks


Core Incompetence?


Configuration    MB       s       µs       MB/s
Single           493      81.2    1.936    --
2 Cores          246.5    43.1    2.06     788
4 Cores          129      33.3    3.18     492
8 Cores 1-D      70.4     32.2    6.15     173
8 Cores 3-D      61.7     31.6    6.03     414

Intel 2xQuad Core @ 2.8 GHz


Core Incompetence?


AMD 2xQuad 2111 GHz

Configuration    MB      s        µs
1 Core           492     147      3.5
2 Cores          246     72.32    3.448
4 Cores          129     47.8     4.56
8 Cores          70      29.3     5.6
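The "Core Incompetence?" point becomes clearer when the timings are turned into speedup and parallel efficiency. This is a rough sketch, assuming the seconds column in both tables is the run time of the same total problem split over more cores; the dictionaries and the `scaling` helper are illustrative names, and the Intel 8-core entry uses the 1-D row.

```python
def scaling(times):
    """Map core count -> (speedup, parallel efficiency) relative to 1 core."""
    t1 = times[1]
    return {n: (t1 / t, t1 / (t * n)) for n, t in times.items()}


# Run times in seconds, read off the Intel and AMD tables.
INTEL = {1: 81.2, 2: 43.1, 4: 33.3, 8: 32.2}
AMD = {1: 147.0, 2: 72.32, 4: 47.8, 8: 29.3}

for name, times in (("Intel", INTEL), ("AMD", AMD)):
    for cores, (speedup, efficiency) in scaling(times).items():
        print(f"{name:5s} {cores} cores: speedup {speedup:.2f}, "
              f"efficiency {efficiency:.0%}")
```

Under that assumption the Intel box gains almost nothing from 4 to 8 cores (a speedup of roughly 2.5 on 8 cores), while the AMD box starts slower but keeps scaling to about 5 on 8 cores.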


Intel Nehalem

• This architecture has significantly overcome the FSB bottlenecks.

• The scaling from 1 to 2 and from 2 to 4 cores is excellent.

• The scaling from 4 to 8 is good, though not as good as in the case of AMD.

• But the overall performance of Nehalem is better than that of AMD.


Speed - Memory Issue

• As the number of cores goes up the CPU performance (theoretical peak) increases.

• KABRU: 4.8 GFlops/CPU

• Intel Quad Core: 50 GFlops/CPU

• It becomes harder to maintain the ratio of ‘Memory to Performance’.

• Issues with increasing memory: different chipset, power consumption, ...


GPU Based Supercomputing

• A single Tesla C1060 card has a claimed peak performance of 1 Teraflops in single precision!

• Four such cards can sit in a single 1U box

• Cost of such GPGPU supercomputers is about 5 lakh rupees.

• Nearly 4 times as fast as Kabru but costing 50 times less!

• Power consumption about 800 W - 40 times less; no airconditioning/infrastructure
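The ratios on this slide can be unpacked into the figures they imply for Kabru and into a per-unit-of-speed advantage. This is a back-of-envelope sketch using only the slide's round numbers; the constant and function names are illustrative, and the comparison is between claimed peak figures, not measured performance.

```python
GPU_BOX_COST_LAKH = 5    # "about 5 lakh rupees"
GPU_BOX_POWER_W = 800    # "about 800 W"
SPEED_RATIO = 4          # "nearly 4 times as fast as Kabru"
COST_RATIO = 50          # "costing 50 times less"
POWER_RATIO = 40         # "40 times less"


def implied_kabru():
    """Return (cost in lakh rupees, power in kW) implied for Kabru."""
    cost_lakh = GPU_BOX_COST_LAKH * COST_RATIO
    power_kw = GPU_BOX_POWER_W * POWER_RATIO / 1000
    return cost_lakh, power_kw


def advantage_per_unit_speed():
    """Return (cost advantage, power advantage) per unit of peak speed."""
    return SPEED_RATIO * COST_RATIO, SPEED_RATIO * POWER_RATIO


print(implied_kabru())             # (250, 32.0)
print(advantage_per_unit_speed())  # (200, 160)
```

Taken at face value, that is roughly 200 times cheaper and 160 times less power per unit of peak speed.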


A Tesla C1060 Card


4 Tesla In 1U


Issues with GPUs

• Codes should have a high degree of data parallelism.

• Available dedicated memory is rather low - even for Tesla C1060 cards it is 4 GB per card.

• Double precision performance is much poorer than single precision performance - a factor of 12 lower!!

• Due to register structure - an improvement by a factor of 3 is being talked about.


Issues with GPUs

• If the code is a mixture of single and double precision, with the volume of the latter around 10%, it is still OK.

• Exploiting the host CPUs is an option.

• Transfers between CPU and GPU go through the PCI x16 Gen 2.0 technology.

• Transfer speed is nowhere near that between, say, CPU & cache.

• Often better to perform a fresh calculation instead of fetching processed data
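The "10% double precision is still OK" remark can be checked with a simple weighted estimate. This is a minimal sketch, assuming the code is compute-bound, that single and double precision work run one after the other, and that double precision is 12 times slower, as quoted earlier in the talk; the function name is illustrative.

```python
def mixed_precision_slowdown(double_fraction, dp_penalty=12.0):
    """Run time relative to an all-single-precision run.

    double_fraction: share of the arithmetic done in double precision (0..1).
    dp_penalty: how many times slower double precision is than single.
    """
    if not 0.0 <= double_fraction <= 1.0:
        raise ValueError("double_fraction must be between 0 and 1")
    return (1.0 - double_fraction) + double_fraction * dp_penalty


for fraction in (0.0, 0.1, 0.5, 1.0):
    slowdown = mixed_precision_slowdown(fraction)
    print(f"{fraction:.0%} double: {slowdown:.1f}x the all-single run time")
```

With 10% of the work in double precision the run takes about 2.1 times as long as an all-single run, so roughly half of the single precision peak survives; at 50% the penalty is already 6.5 times.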


Issues with GPUs

• Have to code using a new ‘language’ - CUDA in the case of NVIDIA cards

• Not really a problem for moderate sized codes but can be an issue for large codes

• Requires a dexterous management of CPU and GPU resources

• But considering the phenomenal performance improvements that are being talked about, worth the trouble!!

• Intel Larrabee ??
