Maximize Clusters' Performance: The Next Level of Clustering
Michael Kagan, VP Architecture, Mellanox Technologies
Agenda
• The vision
• The interconnect challenge
• Industry solution
• Products and roadmap
The Vision – Clustered Grid
Deliver Services to Clients – From equipment warehouse to service provider
[Chart: cluster market penetration by quarter, 2003Q1–2006Q2 – share of clustered vs. non-clustered systems; resource-pool illustration]
Clustering – the grid backbone technology
Dual-socket server platform trends
[Chart: single-node compute power and total I/O bandwidth (Gbit) on log scales – 2000: 1 GHz, 1 core; today: 2 GHz, 4 cores; 2010: 4 GHz, 2–16 cores]
IO Performance Growth
• More applications per server and I/O
• More traffic per I/O with server I/O consolidation
• I/O capacity per server dictated by the most demanding apps
• 10Gb/s+ connectivity for each server
Multi Core CPUs
SAN Adoption
Shared Resources
Multi-core CPUs mandating 10Gb/s+ connectivity
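As a rough illustration of why this trend points to 10Gb/s+ links (my arithmetic, not from the slide, assuming per-node I/O demand scales roughly with the aggregate compute capability in the chart above, and taking the upper end of the 2010 projection):

$$\frac{2010\ \text{node}}{2000\ \text{node}} \approx \frac{4\,\text{GHz} \times 16\ \text{cores}}{1\,\text{GHz} \times 1\ \text{core}} = 64\times$$

so a dual-socket node that got by on a few hundred Mb/s of I/O in 2000 needs on the order of 10 Gb/s and beyond by the end of the decade.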
Typical Deployments With GigE and FC
• 6-8 I/O adapters per server (bandwidth, functions)
• Not feasible for blade servers – limited PCI slots and backplane traces
• High CAPEX and OPEX – more ports per server
• Management challenge – multiple networks, multiple management domains
[Diagram: two physical servers, each running multiple virtual machines with virtual NICs and virtual HBAs behind a software virtual switch (network and SCSI/file system), multiple GigE NICs and FC HBAs per server, connected through Ethernet and FC blade switches in the chassis to the external LAN/WAN and SAN infrastructure]
Software-based IOV – the traditional approach does not scale
I/O Overloaded?
IO delivery challenges:
• High bandwidth, low latency
• Scalability
• Low power consumption
• Virtualization, dynamic load balancing
• Agility, quicker business results
Server and storage I/O takes on a new role
IO Infrastructure – Requirements (Source: Intel/Cisco, August 2005)

Network Layer: Data (layer 2)
Data Center requirement                | Ethernet | InfiniBand
Network convergence (virtual networks) | No       | Yes
Quality of Service                     | No       | Yes
Lossless network                       | No       | Yes
Congestion management                  | No       | Yes
Optimal routing                        | No       | Yes

Network Layer: Transport (layer 4)
Data Center requirement                | TCP      | InfiniBand
Scalability to 100Gbit                 | Hard     | Easy
Virtual interface support              | No       | Yes
Point-to-multipoint communication      | No       | Yes
QoS-aware interface to client          | No       | Yes
High Availability features             | Poor     | Good

Legacy interconnect does not fit Grid requirements
InfiniBand – The Grid Interconnect
• Top performance at lowest price: defined for low-cost implementation; up to 120Gbit port speed
• Scalable: tens of thousands of nodes; multi-core servers
• Low CPU overhead: RDMA and transport offload
• Service-oriented I/O: Quality of Service, virtualization, I/O consolidation
• Established software ecosystem
[Chart: InfiniBand performance roadmap, 2004-2008 – node-to-node and switch-to-switch link speeds (Gigabits per second) scaling from 10 up to 120 Gb/s, versus 1GigE, 10GigE and Fibre Channel]
Industry standard for grid interconnect
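As a sanity check on the roadmap figures above (standard InfiniBand link arithmetic, not taken from the slide), port speed is link width times the per-lane signaling rate:

$$4X\ \text{DDR} = 4 \times 5\ \text{Gb/s} = 20\ \text{Gb/s}, \quad 12X\ \text{DDR} = 12 \times 5\ \text{Gb/s} = 60\ \text{Gb/s}, \quad 12X\ \text{QDR} = 12 \times 10\ \text{Gb/s} = 120\ \text{Gb/s}$$

which matches the 20/60/120 Gb/s points quoted in this deck; with 8b/10b encoding the usable data rate is 80% of each figure (e.g. 16 Gb/s on a 20 Gb/s 4X DDR port).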
Delivering Service Oriented I/O
• End-to-end Quality of Service: congestion control at the source, resource allocation
• I/O consolidation: multiple traffic types, up to 40% power savings
• Optimal path management: packet drop prevention, scaling at wire speed
• Dedicated virtual machine services: virtual machine partitioning, near-native performance
Provisioned IO services on a single wire
(Traffic types on that wire: clustering, communications, storage, management)
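One way the end-to-end QoS above is exposed to software is the InfiniBand service level (SL), which the subnet manager maps onto virtual lanes. Below is a minimal sketch (mine, not from the slides) in C using the OpenFabrics verbs API; the parameter values are illustrative and the surrounding connection setup is omitted.

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Move an RC queue pair to Ready-to-Receive with an explicit service level.
     * The SL chosen here is what the fabric uses to place this channel's
     * traffic on a virtual lane with its provisioned bandwidth share. */
    static int set_channel_qos(struct ibv_qp *qp, uint16_t remote_lid,
                               uint32_t remote_qpn, uint32_t remote_psn,
                               uint8_t service_level)
    {
        struct ibv_qp_attr attr = {
            .qp_state           = IBV_QPS_RTR,
            .path_mtu           = IBV_MTU_2048,
            .dest_qp_num        = remote_qpn,
            .rq_psn             = remote_psn,
            .max_dest_rd_atomic = 1,
            .min_rnr_timer      = 12,
            .ah_attr = {
                .dlid     = remote_lid,
                .sl       = service_level,  /* e.g. one SL for storage, another for clustering */
                .port_num = 1,
            },
        };
        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                             IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                             IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
    }

Giving, say, storage and clustering traffic different SLs is what lets the four traffic types share one wire without starving each other.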
Channel I/O Virtualization on Server
• Hardware-based I/O virtualization: isolation and protection per VM; DMA remapping / virtual address translation; resource provisioning; hypervisor offload (switching, traffic steering)
• Supports current and future servers: programmable; IOTLB* for future I/O MMU chipsets; Intel VT-d IOV, PCI-SIG IOV
• Better resource utilization: frees up CPU through hypervisor offload; enables significantly more VMs per CPU
• Native OS performance for VMs: eliminates VMM overheads
[Diagram: a physical server whose InfiniBand HCA exposes dedicated IO channels per virtual machine – each VM runs its own HCA driver, DMA remapping translates VM addresses to memory, and the Virtual Machine Monitor keeps only the offloaded VMM functions behind the port]
* I/O Translation Look-aside Buffer
Channel IO – deliver IO services to the consumer
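To make the channel idea concrete, here is a minimal sketch (mine, not taken from the slides) of how a consumer, for example a guest VM's driver or a user-space process, obtains its own protected I/O channel through the standard OpenFabrics verbs API; the device choice, buffer size and queue depths are illustrative and error handling is reduced to bare checks.

    #include <stdio.h>
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) {
            fprintf(stderr, "no InfiniBand devices found\n");
            return 1;
        }

        struct ibv_context *ctx = ibv_open_device(devs[0]);        /* open the HCA */
        struct ibv_pd *pd = ibv_alloc_pd(ctx);                     /* per-consumer protection domain */
        struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0); /* completion queue */

        /* Register a buffer that the HCA may DMA into or out of for this consumer only */
        void *buf = malloc(4096);
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,
                                       IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);

        /* The queue pair is the consumer's private send/receive channel;
         * isolation between channels is enforced by the HCA hardware */
        struct ibv_qp_init_attr qp_attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .cap     = { .max_send_wr = 16, .max_recv_wr = 16,
                         .max_send_sge = 1, .max_recv_sge = 1 },
            .qp_type = IBV_QPT_RC,   /* reliable connected transport, offloaded in hardware */
        };
        struct ibv_qp *qp = ibv_create_qp(pd, &qp_attr);

        printf("channel ready: QP number 0x%x, rkey 0x%x\n", qp->qp_num, mr->rkey);

        /* Connection establishment (ibv_modify_qp through INIT/RTR/RTS) and the
         * actual RDMA operations are omitted from this sketch. */
        ibv_destroy_qp(qp);
        ibv_dereg_mr(mr);
        free(buf);
        ibv_destroy_cq(cq);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }

Because each channel carries its own protection domain, memory keys and address translations, the HCA can safely give a VM direct hardware access, which is exactly the hypervisor-offload point made above.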
Cluster as a Pool of Resources
[Diagram: a virtualization engine / middleware layer pooling VMs that run applications on top of the VMMs across the cluster]
Each VM (“container”) represents an available compute resource
Applications are run in any available container
Storage is disassociated from the platform
Applications become readily transportable
Unconstrained service delivery
Clusters – Deliver Service to Consumer
Server-centric view: servers interconnected by networks
Service-centric view: applications are provisioned from a pool of resources
Interconnect – service delivery enabler
“Global architecture”
Location-agnostic solution – from microns to miles
[Demo (source: Obsidian): 950 MB/s sustained between Washington DC and Los Angeles over OC-192c – more than 3,500 km each way, 66 millisecond roundtrip delay – with InfiniBand-attached VMs and applications bridged through a gateway (GW) at each site]
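A quick back-of-the-envelope check on those numbers (my arithmetic, not from the slide): the bandwidth-delay product at the OC-192c line rate over a 66 ms roundtrip is

$$9.95\ \text{Gb/s} \times 0.066\ \text{s} \approx 0.66\ \text{Gb} \approx 82\ \text{MB}$$

so roughly 80 MB of data has to be in flight at any instant to keep the pipe full; this is why range-extension gateways such as the ones in this demo need deep buffering so that InfiniBand's credit-based, link-level flow control does not stall over the distance.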
Grid interconnect building blocks available today
Mellanox Technologies
A global leader in semiconductor solutions for server, storage and embedded connectivity; leading provider of low-latency and high-bandwidth InfiniBand solutions:
• Up to 20Gb/s NIC, up to 60Gb/s switch
• 1.7M ports shipped (Dec 2006)
• 2.25us latency in production, ~1us latency in 2007
• 3W power consumption per HCA port
Efficient and scalable I/O:
• 4500-node cluster in production
• 10K+ node clusters being installed
Price-performance-power leader; converges clustering, communications and storage solutions
Product Roadmap
[Roadmap chart, 2000-2007: 1st through 4th generation adapters and switches, with total bandwidths of 20/40, 40/80, 40, 160 and 480/960 Gb/s across the products; the 4th-generation Gen2 adapter provides two 10/20/40 Gb/s InfiniBand or two 1/10 Gb/s Ethernet ports]
Consistent execution on the roadmap
Interconnect: A Competitive Advantage
End-users by segment:
• Enterprise Data Centers – clustered database, customer relationship management, eCommerce and retail, financial, Web services
• High-Performance Computing – biosciences and geosciences, computer-automated engineering, digital content creation, electronic design automation, government and defense
• Embedded – communications, computing and storage aggregation, industrial, medical, military
InfiniBand and Ethernet platforms: servers and blades, embedded, switches, storage
Complete system solutions for all market segments
Leading Customers / Growing Markets
[Logo chart: hardware OEMs and software partners serving end-users across enterprise data centers, high-performance computing and embedded markets, spanning servers, computers, storage, switches and embedded platforms]
InfiniBand Software Support
• Industry-wide development of a standard SW stack
• Supported by all Linux distributions
• WHQL certification granted by Microsoft
Full IO solution on all major OS distributions
HPC – the Early InfiniBand Adopters
Growth rate from June 06 to Nov 06:
• InfiniBand: +105%
• Myrinet: -10%
• GigE: -16%
105% growth from June 2006; 173% growth from Nov 2005
[Chart: Top500 interconnect trends – number of clusters using InfiniBand, Myrinet and GigE on the Jun-05, Nov-05, Jun-06 and Nov-06 lists]
InfiniBand – the only growing interconnect
Top500 Interconnect Placement Nov 06
InfiniBand is the preferred high-performance interconnect:
• Connecting the most powerful clusters
• The best price/performance connectivity for Petascale clusters
[Chart: Top500 interconnect placement – number of clusters using InfiniBand, Myrinet and GigE within each ranking band (1-100, 101-200, 201-300, 301-400, 401-500)]
InfiniBand – the dominant HPC interconnect
Top500 Statistics
[Chart: Top500 multi-core clusters percentage, Nov 2005 through June 2007 (0-50%)]
[Chart: Top500 interconnect penetration – number of clusters versus years since introduction (years 1-8), Ethernet versus InfiniBand]
• Multi-core will dominate the list in 2007 (native multi-core)
• InfiniBand adoption is faster than Ethernet's: Ethernet year 1 – June 1996; InfiniBand year 1 – June 2003
The fastest growth in top500 interconnect history
Mellanox Superior Application Performance
Market                | Application             | InfiniBand advantage
Mathematical Modeling | Wolfram gridMathematica | 57% better than GigE
Digital Media         | Autodesk                | 470% better than GigE
Financial             | Wombat                  | 120% better than GigE
Fluid Dynamics        | CD-Adapco STAR-CD       | 29% better than Myrinet
Fluid Dynamics        | Exa PowerFLOW           | 24% better than Myrinet
Fluid Dynamics        | Fluent                  | 145%-1400% better than GigE, 15% better than Qlogic
Crash                 | ESI PAM-CRASH           | 300% better than GigE
Database              | Oracle                  | 300% better than GigE
Oil and Gas           | Schlumberger Eclipse    | 55% better than Myrinet
Automotive            | LSTC LS-DYNA            | 26% better than Qlogic, 85% better than Myrinet, 115% better than GigE
Platforms may vary; details are available in the following slides. Mellanox InfiniBand wins on real applications.
Personal Supercomputers (PSC) – from Top500 to technical computing
Maximum Performance At Your Fingertips
Driving supercomputing to the masses:
• Maximum performance, minimum cost
• Easy to use, turnkey cluster
• Fits into a “cubicle” environment
• Standard power, quiet operation
• Ability to scale efficiently
Cluster waterfall from compute room to desktop
InfiniBand Storage Interconnect
Native InfiniBand storage servers:
• Optimal performance, power savings and TCO
• No gateway bottlenecks
• Ultimate scalability
• Service-oriented I/O for consolidation
Storage controller clustering and failover solution; used to connect to the storage disk arrays
[Diagram: an InfiniBand cluster reaching native InfiniBand storage through InfiniBand switches, with file access over NFS, CIFS, FTP and HTTP; InfiniBand also serves as the storage controller clustering solution and as the storage shelf interconnect]
Storage servers and backend clustering
InfiniBand Storage Solutions
• InfiniBand backend clustering and failover
• Native InfiniBand block storage systems
• Native InfiniBand clustered file systems
• Native InfiniBand clustered file storage software
• Native InfiniBand visualization storage systems
• Native InfiniBand block storage software
• Native InfiniBand solid state storage
Multiple storage vendors deploy InfiniBand today
IB Storage = Best Price/Performance
TPC-H – industry standard benchmark:
• Actual database transaction profiling
• Results must be qualified and approved
InfiniBand storage delivers the best price/performance in the 1TB class:
• 8 x 4-way dual-core AMD Opteron servers
• 37 x PANTA Systems storage array modules
• 518 x 250GB 7200rpm SATA HDDs – 129,500 GB of total storage
Turnkey server/storage solution for data warehousing applications
InfiniBand solutions win TPC-H 3rd year in a row
Cluster Evaluation Center
Essential platform for customer evaluation:
• Latest and greatest InfiniBand hardware and software
• InfiniBand-based storage, NFS over RDMA
• AMD- and Intel-based platforms
Available clusters:
• Mellanox cluster center: Intel quad-core, AMD dual-core rev E
• Colfax International: AMD dual-core rev F
Available for customers’ evaluation:
• Development, testing and evaluation
• Free of charge
InfiniBand clusters available for evaluation
Summary
InfiniBand is a clustering interconnect of choice:
• The only growing interconnect on the Top500
• Dominates the top part of the Top500
InfiniBand supports the major industry trends:
• Industry standard, supported by major OEMs and ISVs
• Leading cost/performance solutions
• Multi-core and multi-node scalability
• Service-oriented IO delivery
• Compute and storage services on a single wire
• Transition from equipment warehouse to service provider
• Clusters – the utility computing backbone
InfiniBand clusters available for customers’ evaluation:
• Free of charge
• Multiple platforms
InfiniBand – the standard cluster interconnect
Real-life Applications
Schlumberger ECLIPSE – million-cell model: HP DL145 2.6GHz servers, single CPU; OS: SUSE 9
[Chart: elapsed runtime (seconds) at 16, 32 and 64 servers, Myrinet vs. InfiniBand – InfiniBand is 55% more efficient; lower is better]
CFD – CD-Adapco STAR-CD
[Chart: STAR-CD V3240/V3260, A-class benchmark on an 8-node cluster – execution time (sec), Mellanox vs. Myrinet; lower is better]
InfiniBand wins on real applications
CFD – Fluent
• Great scaling from small to large clusters
• Multi-core environments demand InfiniBand
[Chart: FLUENT 6.3 beta, FL5L3 case – Fluent performance rating versus number of cores (up to ~140), Mellanox InfiniBand vs. Gigabit Ethernet; higher is better]
[Chart: Fluent 6.3, FL5L3 case – performance rating versus CPU cores, Qlogic vs. Mellanox; higher is better]
[Chart: FLUENT performance rating, FL5L3 case, 16-node cluster – Mellanox InfiniBand vs. Gigabit Ethernet on single-core Xeon 3.4GHz and dual-core Woodcrest 3GHz, with labeled gains of 47% and 132%]
Highest performance, best scalability
PAM-CRASH
Crash – ESI PAM-CRASH
Bavarian car-to-car model: 1.1M elements, 145,000 cycles
[Chart: elapsed time (sec) at 16, 32 and 64 CPUs, InfiniBand vs. GigE; lower is better]
“Gigabit Ethernet becomes ineffective with cluster size growth while Mellanox InfiniBand allows continued scalable speed up”
Highest performance, best scaling