Paving the Road to Exascale Computing
Gilad Shainer, VP Marketing – February 2014
Leading Supplier of End-to-End Interconnect Solutions
• MXM – Mellanox Messaging Acceleration
• FCA – Fabric Collectives Acceleration
Management
• UFM – Unified Fabric Management
Storage and Data
• VSA – Storage Accelerator (iSCSI)
• UDA – Unstructured Data Accelerator
Comprehensive End-to-End Software Accelerators and Management
Host/Fabric Software | ICs | Switches/Gateways | Adapter Cards | Cables/Modules | Metro/WAN
Comprehensive End-to-End InfiniBand and Ethernet Portfolio
Mellanox InfiniBand Paves the Road to Exascale Computing
Accelerating Half of the World’s Petascale Systems
Mellanox Connected Petascale System Examples
NASA Ames Research Center – Pleiades
• 20K InfiniBand nodes
• Mellanox end-to-end FDR and QDR InfiniBand
• Supports a variety of scientific and engineering projects: coupled atmosphere-ocean models, future space vehicle design, large-scale dark matter halos and galaxy evolution
• Asian Monsoon Water Cycle High-Resolution Climate Simulations
Helping to Make the World a Better Place
SANGER – Sequence Analysis and Genomics Research
• Genomic analysis for pediatric cancer patients
• Challenge: an individual patient’s RNA analysis took 7 days; goal: reduce it to 5 days
• InfiniBand reduced the RNA-sequence data analysis time per patient to only 1 hour!
Fast interconnect for fighting pediatric cancer
Business Success Depends on Fast Interconnect
• Real-time fraud detection: 13 million financial transactions per day, 4 billion database inserts
• 235 supermarkets across 8 states, USA: reacting to customers’ needs in real time, reducing data queries from 20 minutes to 20 seconds
• Microsoft Bing Maps: accuracy, details, fast response – 10X higher performance, 50% CAPEX reduction
• Tier-1 Fortune 100 company, Web 2.0 application: 97% reduction in database recovery time – from 7 days to 4 hours!
InfiniBand Enables Lowest Application Cost in the Cloud (Examples)
• Microsoft Windows Azure: 90.2% cloud efficiency, 33% lower cost per application
• Cloud application performance improved up to 10X
• 3X increase in VMs per physical server with consolidation of network and storage I/O, 32% lower cost per application
• 694% higher network performance
InfiniBand’s Unsurpassed System Efficiency
• TOP500 systems listed according to their efficiency
• InfiniBand is the key element responsible for the highest system efficiency
• Mellanox delivers efficiencies of up to 96% with InfiniBand
FDR InfiniBand Delivers Highest Return on Investment
[Application benchmark charts – higher is better]
Source: HPC Advisory Council
Technology Roadmap
[Roadmap graphic, 2000 to 2020: InfiniBand speeds rising from 10Gb/s through 20Gb/s, 40Gb/s, 56Gb/s and 100Gb/s toward 200Gb/s, spanning the Terascale, Petascale and Mega Supercomputers eras on the way to Exascale; milestone systems include Virginia Tech (Apple), 3rd on the TOP500 in 2003, and “Roadrunner” (Mellanox Connected), 1st on the TOP500]
Architectural Foundation for Exascale Computing
Connect-IB Interconnect Adapter
Mellanox Connect-IB – The World’s Fastest Adapter
• The 7th generation of Mellanox interconnect adapters
• World’s first 100Gb/s interconnect adapter (dual-port FDR 56Gb/s InfiniBand)
• Delivers 137 million messages per second – 4X higher than the competition
• Supports the new innovative InfiniBand scalable transport – Dynamically Connected
Connect-IB Provides Highest Interconnect Throughput
Source: Prof. DK Panda
Gain Your Performance Leadership With Connect-IB Adapters
[Charts: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size, 4 bytes to 1 MByte – higher is better. Peak values:]
• ConnectX2-PCIe2-QDR: 3385 MB/s unidirectional, 6521 MB/s bidirectional
• ConnectX3-PCIe3-FDR: 6343 MB/s unidirectional, 11643 MB/s bidirectional
• Sandy-ConnectIB-DualFDR: 12485 MB/s unidirectional, 21025 MB/s bidirectional
• Ivy-ConnectIB-DualFDR: 12810 MB/s unidirectional, 24727 MB/s bidirectional
Memory Scalability
[Chart: host memory consumption (MB) for connection state, log scale from 1 to 1,000,000,000 MB, across cluster sizes from 8 nodes to 100K nodes, for successive adapter/transport generations: InfiniHost with RC (2002), InfiniHost-III with SRQ (2005), ConnectX with XRC (2008), Connect-IB with DCT (2012)]
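To make the transport-scalability difference concrete, the back-of-the-envelope sketch below (not from the slide) compares per-host connection memory for a fully connected Reliable Connection (RC) transport against a fixed pool of Dynamically Connected (DCT) objects. The per-QP size, processes per node and DCT pool size are illustrative assumptions, not measured Mellanox figures.

```c
#include <stdio.h>

/* Illustrative assumptions (not vendor figures):
 *  - each reliable-connected QP costs ~4 KB of host memory
 *  - 16 MPI processes per host
 *  - DCT keeps only a small, fixed pool of DC objects per process,
 *    independent of cluster size                                    */
#define QP_BYTES        (4 * 1024)
#define PROCS_PER_NODE  16
#define DCT_OBJECTS     32   /* fixed pool per process */

static double rc_mem_mb(long nodes)
{
    /* RC: every process keeps one QP per remote process */
    long peers = nodes * PROCS_PER_NODE - 1;
    return (double)PROCS_PER_NODE * peers * QP_BYTES / (1024.0 * 1024.0);
}

static double dct_mem_mb(long nodes)
{
    (void)nodes;  /* the DCT pool does not grow with the cluster */
    return (double)PROCS_PER_NODE * DCT_OBJECTS * QP_BYTES / (1024.0 * 1024.0);
}

int main(void)
{
    long sizes[] = {8, 2000, 10000, 100000};
    printf("%10s %16s %16s\n", "nodes", "RC (MB/host)", "DCT (MB/host)");
    for (int i = 0; i < 4; i++)
        printf("%10ld %16.1f %16.1f\n",
               sizes[i], rc_mem_mb(sizes[i]), dct_mem_mb(sizes[i]));
    return 0;
}
```

With these illustrative numbers, RC state grows from under 10 MB per host at 8 nodes to roughly 100 GB per host at 100K nodes, while the DCT pool stays around 2 MB regardless of scale – the scaling behavior the chart above conveys.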
Accelerator and GPU Offloads
GPUDirect 1.0
[Diagrams: transmit and receive data paths through CPU, chipset, GPU memory, system memory and the InfiniBand adapter. Without GPUDirect, data is copied through system memory twice (steps 1 and 2); with GPUDirect 1.0, the GPU and InfiniBand drivers share a single system-memory buffer (step 1).]
GPUDirect RDMA
[Diagrams: transmit and receive paths with GPUDirect 1.0 vs. GPUDirect RDMA. With GPUDirect 1.0, data still passes through a shared buffer in system memory; with GPUDirect RDMA, the InfiniBand adapter accesses GPU memory directly over PCIe, bypassing system memory.]
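The usage model that GPUDirect RDMA accelerates is CUDA-aware MPI: ranks hand device pointers directly to MPI calls and the library moves the data, using GPUDirect RDMA when the hardware and drivers support it. The sketch below assumes a CUDA-aware MPI build (for example MVAPICH2-GDR); the buffer size and tag are arbitrary, and it expects at least two ranks.

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define N (1 << 20)   /* 1M floats – arbitrary message size */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *d_buf;
    cudaMalloc((void **)&d_buf, N * sizeof(float));  /* GPU memory */

    if (rank == 0) {
        cudaMemset(d_buf, 0, N * sizeof(float));
        /* With a CUDA-aware MPI, the device pointer is passed directly;
         * GPUDirect RDMA lets the adapter read GPU memory without
         * staging the data through host memory. */
        MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d floats into GPU memory\n", N);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```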
Performance of MVAPICH2 with GPUDirect RDMA
[Charts: GPU-GPU internode MPI latency (lower is better) and bandwidth (higher is better), message sizes 1 byte to 4 KBytes]
• 67% lower latency – 5.49 usec
• 5X increase in throughput
Source: Prof. DK Panda
Remote GPU Access through rCUDA
[Diagram: GPU as a Service. Client side: the CUDA application runs on top of the rCUDA library and the network interface. Server side (GPU servers): the rCUDA daemon runs on top of the CUDA driver and runtime. Each client CPU sees virtual GPUs backed by the pool of physical GPUs.]
rCUDA provides remote access from every node to any GPU in the system
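As a rough illustration of the model (an assumption-laden sketch, not rCUDA documentation): the program below is ordinary CUDA runtime code; under rCUDA the same API calls are intercepted by the client-side rCUDA library and executed by the rCUDA daemon on a remote GPU server, so the application source does not change.

```c
#include <cuda_runtime.h>
#include <stdio.h>

/* Plain CUDA runtime code – nothing here is rCUDA-specific; rCUDA
 * forwards these calls over the interconnect to a GPU server. */
__global__ void scale(float *v, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= a;
}

int main(void)
{
    const int n = 1024;
    float h[1024], *d;

    for (int i = 0; i < n; i++) h[i] = (float)i;

    cudaMalloc((void **)&d, n * sizeof(float));   /* a remote GPU under rCUDA */
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    printf("h[10] = %f\n", h[10]);   /* expect 20.0 */
    return 0;
}
```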
rCUDA Performance Comparison
Solutions for MPI/SHMEM/PGAS
Fabric Collective Accelerations
Collective Operation Challenges at Large Scale
• Collective algorithms are not topology aware and can be inefficient
• Congestion due to many-to-many communications
• Slow nodes and OS jitter affect scalability and increase variability
[Chart: ideal vs. actual collective performance]
Mellanox Collectives Acceleration Components
CORE-Direct
• US Department of Energy (DOE) funded project – ORNL and Mellanox
• Adapter-based hardware offloading for collective operations
• Includes floating-point capability on the adapter for data reductions
• CORE-Direct API is exposed through the Mellanox drivers
FCA
• FCA is a software plug-in package that integrates into available MPIs
• Provides scalable, topology-aware collective operations
• Utilizes powerful InfiniBand multicast and QoS capabilities
• Integrates CORE-Direct collective hardware offloads
The Effects of System Noise on Applications Performance
• Minimizing the impact of system noise on applications – critical for scalability
[Chart: ideal vs. with system noise vs. with CORE-Direct offload]
CORE-Direct Enables Computation and Communication Overlap
• Provides support for overlapping computation and communication
[Chart: synchronous execution vs. CORE-Direct asynchronous execution]
Nonblocking Alltoall (Overlap-Wait) Benchmark
CORE-Direct offload allows the Alltoall benchmark to run with almost 100% of CPU time available for computation, as sketched below.
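For reference, here is a minimal sketch (not the actual benchmark code) of the overlap-wait pattern using the standard MPI-3 nonblocking MPI_Ialltoall, which implementations integrated with FCA/CORE-Direct can offload to the adapter; the message size and the dummy compute loop are arbitrary placeholders.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Overlap-wait pattern: post a nonblocking Alltoall, do useful
 * computation while the collective progresses, then wait.  With
 * CORE-Direct the collective is progressed by the adapter, so
 * nearly all CPU time remains available for the compute phase. */
int main(int argc, char **argv)
{
    int size, count = 4096;   /* elements per peer – arbitrary choice */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendbuf = malloc((size_t)size * count * sizeof(double));
    double *recvbuf = malloc((size_t)size * count * sizeof(double));
    for (long i = 0; i < (long)size * count; i++)
        sendbuf[i] = (double)i;

    double acc = 0.0;
    for (int iter = 0; iter < 100; iter++) {
        MPI_Request req;
        MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                      recvbuf, count, MPI_DOUBLE,
                      MPI_COMM_WORLD, &req);

        /* placeholder compute overlapped with the collective */
        for (int i = 0; i < 1000000; i++)
            acc += (double)i * 1e-9;

        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    printf("compute result: %f\n", acc);
    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```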
Summary
The Only Provider of End-to-End 40/56Gb/s Solutions
From Data Center to Metro and WAN
X86, ARM and Power based Compute and Storage Platforms
The Interconnect Provider For 10Gb/s and Beyond
Host/Fabric Software | ICs | Switches/Gateways | Adapter Cards | Cables/Modules | Metro/WAN
Comprehensive End-to-End InfiniBand and Ethernet Portfolio
Thank You