Dolphin Wulfkit and Scali software: The Supercomputer interconnect
Summation Enterprises Pvt. Ltd. – Preferred System Integrators since 1991.
Amal D’[email protected]
Agenda
• Dolphin Wulfkit hardware
• Scali software / some commercial benchmarks
• Summation profile
Interconnect Technologies
[Figure: design space for different technologies, plotted by distance, bandwidth and latency across the WAN, LAN, I/O, memory and processor domains (ATM, Ethernet, Myrinet, cLan, FibreChannel, SCSI, proprietary busses, cache); application areas and requirements position a cluster interconnect between network and bus]
Cluster Interconnect Requirements
Dolphin SCI Technology
Interconnect impact on cluster performance
Some real-world examples from the Top500 May 2004 list
• Intel, Bangalore cluster: 574 Xeon 2.4 GHz CPUs, GigE interconnect. Rpeak: 2755 GFLOPs, Rmax: 1196 GFLOPs, efficiency: 43%
• Kabru, IMSc, Chennai: 288 Xeon 2.4 GHz CPUs, Wulfkit 3D interconnect. Rpeak: 1382 GFLOPs, Rmax: 1002 GFLOPs, efficiency: 72%
• Simply put, Kabru delivers 1002/1196 ≈ 84% of the bigger cluster's measured performance with HALF the number of CPUs!
Commodity interconnect limitations
• Cluster performance depends primarily on two factors: bandwidth and latency
• Gigabit Ethernet: speed limited to 1000 Mbit/s (approx. 80 MBytes/s in the real world). This limit is fixed irrespective of processor power
• With increasing processor speeds, latency (the time taken to move data from one node to another) plays an ever larger role in cluster performance
• Gigabit typically gives an internode latency of 120~150 microseconds, so CPUs in a node are often idling while they wait for data from another node (see the sketch after this list)
• In any switch-based architecture, the switch is a single point of failure: if the switch goes down, so does the cluster.
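To first order, moving an n-byte message costs t = latency + n/bandwidth, which is why latency dominates for the small messages typical of tightly coupled parallel codes. A minimal sketch of that model in C, plugging in the figures quoted in these slides (125 µs is taken from within the quoted 120~150 µs Gigabit range; the formula itself is the usual back-of-the-envelope approximation, not a vendor specification):

```c
#include <stdio.h>

/* First-order message cost: t = latency + n / bandwidth.
 * Bandwidth in MBytes/s is numerically bytes per microsecond,
 * so the whole calculation stays in microseconds. */
static double xfer_us(double lat_us, double bw_mb_s, double n_bytes)
{
    return lat_us + n_bytes / bw_mb_s;
}

int main(void)
{
    /* Slide figures: GigE ~125 us latency, ~80 MBytes/s sustained;
     * Wulfkit SCI ~5 us latency, 260 MBytes/s on Xeon. */
    const double sizes[] = { 64, 1024, 65536, 1048576 };
    for (int i = 0; i < 4; i++)
        printf("%8.0f B: GigE %9.1f us   SCI %8.1f us\n", sizes[i],
               xfer_us(125.0, 80.0, sizes[i]),
               xfer_us(5.0, 260.0, sizes[i]));
    return 0;
}
```

For a 64-byte message the wire time is negligible and the 25x latency gap is essentially the whole transfer time; bandwidth only becomes the dominant term at hundreds of kilobytes.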
Dolphin Wulfkit advantages
• Internode bandwidth: 260 MBytes/s on Xeon (over three times faster than Gigabit)
• Latency: under 5 microseconds (over TWENTY-FIVE times quicker than Gigabit)
• Matrix-type internode connections: no switch, hence no single point of failure
• Cards can be moved across processor generations, which protects the investment
Dolphin Wulfkit advantages (contd.)
• Linear scalability: e.g. adding 8 nodes to a 16-node cluster involves known, fixed costs: eight nodes and eight Dolphin SCI cards. With any switch-based architecture there are additional issues, such as unused ports on the switch. For Gigabit, one has to throw away the 16-port switch and buy a 32-port switch (see the cost sketch after this list)
• Real-world performance on par with, or better than, proprietary interconnects such as Memory Channel (HP) and NUMAlink (SGI), at cost-effective price points
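The scalability argument in the first bullet as a tiny cost sketch (the power-of-two switch sizing is an illustrative assumption, not a statement about any vendor's product line):

```c
#include <stdio.h>

/* Incremental hardware needed to grow a cluster from 16 to 24 nodes,
 * per the example above. */
static int switch_ports(int nodes)      /* smallest switch that fits */
{
    int p = 8;
    while (p < nodes)
        p *= 2;                         /* assumed power-of-two sizes */
    return p;
}

int main(void)
{
    const int before = 16, after = 24, added = after - before;

    /* Switchless SCI torus: growth cost is strictly per node. */
    printf("torus:    +%d nodes, +%d SCI cards\n", added, added);

    /* Switched fabric: the old switch may need replacing outright. */
    int old_sw = switch_ports(before), new_sw = switch_ports(after);
    printf("switched: +%d nodes, +%d NICs", added, added);
    if (new_sw > old_sw)
        printf(", replace %d-port switch with a %d-port switch",
               old_sw, new_sw);
    printf("\n");
    return 0;
}
```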
Wulfkit: The Supercomputer Interconnect
• Wulfkit is based on the Scalable Coherent Interface (SCI); the ANSI/IEEE 1596-1992 standard defines a point-to-point interface and a set of packet protocols.
• Wulfkit is not a networking technology but a purpose-designed cluster interconnect.
• The SCI interface has two unidirectional links that operate concurrently.
• Bus-imitating protocol with packet-based handshake protocols and guaranteed data delivery.
• Up to 667 MBytes/s internode bandwidth.
PCI-SCI Adapter Card (1 slot, 2 dimensions)
[Diagram: 2D adapter card with a PSB PCI-SCI bridge and two LC link controllers, one per SCI ring]
• SCI adapters (64-bit, 66 MHz)
• PCI/SCI adapter (D335): D330 card with LC3 daughter card; supports 2 SCI ring connections; switching over B-Link; used for WulfKit 2D clusters; PCI 64/66
• D339: 2-slot version
System Interconnect
• High-performance interconnect: torus topology, IEEE/ANSI std. 1596 SCI, 667 MBytes/s per segment/ring, shared address space
• Maintenance and LAN interconnect: 100 Mbit/s Ethernet (out-of-band monitoring)
System Architecture
[Diagram: 4x4 2D torus SCI cluster: a control node (frontend) runs the server daemon and GUI, remote workstations attach over TCP/IP sockets, and each compute node runs a node daemon on a PSB66/LC-3 PCI adapter]
3D Torus topology (for more than 64~72 nodes)
Linköping University - NSC - SCI Clusters
• Monolith: 200 nodes, 2x Xeon 2.2 GHz, 3D SCI
• INGVAR: 32 nodes, 2x AMD 900 MHz, 2D SCI
• Otto: 48 nodes, 2x P4 2.26 GHz, 2D SCI
• Commercial cluster under installation: 40 nodes, 2x Xeon, 2D SCI
• Total: 320 SCI nodes
Also in Sweden: Umeå University, 120 Athlon nodes
The difference is in the software...
http://www.scali.com
MPI Connect middleware and MPI Manage cluster setup/management tools
Scali Software Platform
• Scali MPI Manage – cluster installation/management
• Scali MPI Connect – high-performance MPI libraries
Scali MPI Connect
• Fault tolerant
• High bandwidth
• Low latency
• Multi-thread safe
• Simultaneous inter-/intra-node operation
• UNIX command line replicated
• Exact message size option
• Manual/debugger mode for selected processes
• Explicit host specification
• Job queuing – PBS, DQS, LSF, CCS, NQS, Maui
• Conformance to MPI-1.2 verified through 1665 MPI tests
Scali MPI Manage features
• System installation and configuration
• System administration
• System monitoring, alarms and event automation
• Work load management
• Hardware management
• Heterogeneous cluster support
Fault Tolerance
• 2D torus topology: more routing options
• With the XY routing algorithm, if node 33 fails, the nodes on 33's ringlets become unavailable: the cluster is fractured under the current routing setting
[Diagram: 4x4 torus, nodes 11-44, with node 33 failed]
Fault Tolerance
• Scali advanced routing algorithm: from the Turn Model family of routing algorithms
• All nodes but the failed one can be utilised as one big partition (see the sketch below)
[Diagram: the same 4x4 torus rerouted around failed node 33]
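A sketch of the fracture argument: under XY (dimension-ordered) routing, every packet first travels along the source's row ringlet and then along the destination's column ringlet, and a dead node takes both of its ringlets out of service. The toy model below counts how many live node pairs keep a route on the slide's 4x4 torus with node 33 down; it encodes the textbook XY rule, not Scali's actual routing code:

```c
#include <stdio.h>

#define N  4          /* 4x4 torus, nodes 11..44 on the slide */
#define FX 2          /* failed node "33", zero-based (2,2)   */
#define FY 2

/* XY routing: resolve the x coordinate along the source's row
 * ringlet, then the y coordinate along the destination's column
 * ringlet.  Any ringlet containing the dead node is unusable. */
static int xy_route_ok(int sx, int sy, int dx, int dy)
{
    if (sx != dx && sy == FY) return 0;  /* needs the dead row ringlet    */
    if (sy != dy && dx == FX) return 0;  /* needs the dead column ringlet */
    return 1;
}

int main(void)
{
    int ok = 0, total = 0;
    for (int sx = 0; sx < N; sx++) for (int sy = 0; sy < N; sy++)
        for (int dx = 0; dx < N; dx++) for (int dy = 0; dy < N; dy++) {
            if (sx == FX && sy == FY) continue;  /* skip the dead node */
            if (dx == FX && dy == FY) continue;
            if (sx == dx && sy == dy) continue;  /* skip self-sends */
            total++;
            ok += xy_route_ok(sx, sy, dx, dy);
        }
    printf("XY routing: %d of %d live node pairs still connected\n",
           ok, total);
    return 0;
}
```

Turn Model algorithms restore full connectivity by permitting a restricted set of additional turns: enough adaptivity to steer around the failure, but not enough to create routing deadlock, which is how all remaining nodes stay usable as one partition.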
Scali MPI Manage GUI
[Screenshot]
Monitoring (ctd.)
[Screenshot]
System Monitoring
• Resource monitoring: CPU, memory, disk
• Hardware monitoring: temperature, fan speed
• Operator alarms on selected parameters at specified thresholds
SCI vs. Myrinet 2000: Ping-Pong comparison
[Chart: bandwidth in MBytes/s (0-200) vs. message length from 0 bytes to 16 MBytes for M2K and SCI, with a secondary "% faster" axis (-20% to 180%)]
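For context, the measurement behind a chart like this is a simple ping-pong: rank 0 bounces a message off rank 1 and derives one-way latency and bandwidth from the averaged round-trip time. A minimal sketch using standard MPI calls (illustrative only, not Scali's benchmark source):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Ping-pong between ranks 0 and 1; run with two ranks on two nodes,
 * e.g.: mpirun -np 2 ./pingpong */
int main(int argc, char **argv)
{
    int rank, reps = 1000;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int n = 1; n <= 4 << 20; n *= 4) {   /* 1 B .. 4 MB */
        char *buf = malloc(n);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++) {
            if (rank == 0) {
                MPI_Send(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {
                MPI_Recv(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        /* Half the averaged round trip approximates one-way time. */
        double half = (MPI_Wtime() - t0) / reps / 2;
        if (rank == 0)
            printf("%9d B  %10.2f us  %8.2f MBytes/s\n",
                   n, half * 1e6, n / half / 1e6);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}
```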
Itanium vs. Cray T3E: Bandwidth
[Chart]
Itanium vs. Cray T3E: Latency
[Chart]
Some Reference Customers
• Max Planck Institut für Plasmaphysik, Germany
• University of Alberta, Canada
• University of Manitoba, Canada
• Etnus Software, USA
• Oracle Inc., USA
• University of Florida, USA
• deCODE Genetics, Iceland
• Uni-Heidelberg, Germany
• GMD, Germany
• Uni-Giessen, Germany
• Uni-Hannover, Germany
• Uni-Düsseldorf, Germany
• Linux NetworX, USA
• Magmasoft AG, Germany
• University of Umeå, Sweden
• University of Linköping, Sweden
• PGS Inc., USA
• US Naval Air, USA
• Spacetec/Tromsø Satellite Station, Norway
• Norwegian Defense Research Establishment
• Parallab, Norway
• Paderborn Parallel Computing Center, Germany
• Fujitsu Siemens Computers, Germany
• Spacebel, Belgium
• Aerospatiale, France
• Fraunhofer Gesellschaft, Germany
• Lockheed Martin TDS, USA
• University of Geneva, Switzerland
• University of Oslo, Norway
• Uni-C, Denmark
• University of Lund, Sweden
• University of Aachen, Germany
• DNV, Norway
• DaimlerChrysler, Germany
• AEA Technology, Germany
• BMW AG, Germany
• Audi AG, Germany
• University of New Mexico, USA
Some More Reference Customers
• Rolls Royce Ltd., UK
• Norsk Hydro, Norway
• NGU, Norway
• University of Santa Cruz, USA
• Jodrell Bank Observatory, UK
• NTT, Japan
• CEA, France
• Ford/Visteon, Germany
• ABB AG, Germany
• National Technical University of Athens, Greece
• Medasys Digital Systems, France
• PDG Linagora S.A., France
• Workstations UK, Ltd., England
• Bull S.A., France
• The Norwegian Meteorological Institute, Norway
• Nanco Data AB, Sweden
• Aspen Systems Inc., USA
• Atipa Linux Solution Inc., USA
• California Institute of Technology, USA
• Compaq Computer Corporation Inc., USA
• Fermilab, USA
• Ford Motor Company Inc., USA
• General Dynamics Inc., USA
• Intel Corporation Inc., USA
• Iowa State University, USA
• Los Alamos National Laboratory, USA
• Penguin Computing Inc., USA
• Times N Systems Inc., USA
• University of Alberta, Canada
• Monash University, Australia
• University of Southern Mississippi, USA
• Jacusiel Acuna Ltda., Chile
• University of Copenhagen, Denmark
• Caton Sistemas Alternativos, Spain
• Mapcon Geografical Inform, Sweden
• Fujitsu Software Corporation, USA
• City Team OY, Finland
• Falcon Computers, Finland
• Link Masters Ltd., Holland
• MIT, USA
• Paralogic Inc., USA
• Sandia National Laboratory, USA
• Sicorp Inc., USA
• University of Delaware, USA
• Western Scientific Inc., USA
• Group of Parallel and Distr. Processing, Brazil
Application Benchmarks
With Dolphin SCI and Scali MPI
NAS Parallel Benchmarks (16 CPUs / 8 nodes)
[Chart: NPB 2.3 performance relative to MPICH (60%-240%) for the BT, CG, EP, FT, IS, LU, MG and SP kernels, comparing MPICH, ScaMPI/SCI, ScaMPI/tcpip and ScaMPI/DET2]
Magma (16 CPUs / 8 nodes)
[Chart: Magma performance in jobs/day (0-65) for MPICH, ScaMPI/SCI, ScaMPI/tcp and ScaMPI/det]
Eclipse (16 CPUs / 8 nodes)
[Chart: Eclipse300 performance in jobs/day (0-100) for MPICH, ScaMPI/SCI, ScaMPI/tcpip, ScaMPI/DET and ScaMPI/DET2]
FEKO: Parallel Speedup
[Chart]
Acusolve (16 CPUs / 8 nodes)
[Chart: Acusolve performance in jobs/day (60-72) for MPICH, ScaMPI/SCI, ScaMPI/tcpip and ScaMPI/DET]
Visage (16 CPUs / 8 nodes)
[Chart: Visage performance in jobs/day (200-350) for ScaMPI/SCI, ScaMPI/tcpip, ScaMPI/DET and ScaMPI/DET2]
CFD scaling, mm5: linear to 400 CPUs
[Chart: mm5 t3a data set, performance in MFLOPs vs. CPU count from 2 to 400]
Scaling: Fluent, Linköping cluster
[Chart: Fluent performance on a 64-million-cell case, jobs/day (0-70) at 32, 64 and 128 CPUs]
Dolphin Software
• All Dolphin software is free open source (GPL or LGPL)
• SISCI
• SCI-SOCKET: low-latency socket library; TCP and UDP replacement; user- and kernel-level support; release 2.0 available (see the sketch after this list)
• SCI-MPICH (RWTH Aachen): MPICH 1.2 and some MPICH 2 features; new release in preparation, beta available
• SCI Interconnect Manager: automatic failover recovery; no single point of failure in 2D and 3D networks
• Other: SCI Reflective Memory, Scali MPI, Linux Labs SCI Cluster (Cray-compatible shmem) and Clugres PostgreSQL, MandrakeSoft Clustering HPC solution, Xprime X1 Database Performance Cluster for Microsoft SQL Servers, ClusterFrame from Qlusters, SunCluster 3.1 (Oracle 9i), MySQL Cluster
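The point of a socket replacement library such as SCI-SOCKET is that ordinary TCP code picks up SCI latency without modification. A minimal sketch of such ordinary code, using nothing beyond standard BSD sockets (the port number is arbitrary, and how the replacement library interposes on the socket calls is deployment-specific and not shown):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* One-byte TCP round-trip timer.  Start "./rtt server" on one node,
 * then "./rtt <server-ip>" on another.  The binary contains nothing
 * interconnect-specific, which is exactly the point. */
#define PORT 5555
#define REPS 10000

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s server|<server-ip>\n", argv[0]);
        return 1;
    }
    char byte = 'x';
    int one = 1;
    struct sockaddr_in a = { .sin_family = AF_INET,
                             .sin_port   = htons(PORT) };
    int s = socket(AF_INET, SOCK_STREAM, 0);

    if (strcmp(argv[1], "server") == 0) {
        a.sin_addr.s_addr = INADDR_ANY;
        bind(s, (struct sockaddr *)&a, sizeof a);
        listen(s, 1);
        int c = accept(s, NULL, NULL);
        setsockopt(c, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
        while (recv(c, &byte, 1, 0) == 1)   /* echo every byte back */
            send(c, &byte, 1, 0);
        close(c);
    } else {
        inet_pton(AF_INET, argv[1], &a.sin_addr);
        connect(s, (struct sockaddr *)&a, sizeof a);
        /* Disable Nagle so one-byte pings are not coalesced. */
        setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (int i = 0; i < REPS; i++) {
            send(s, &byte, 1, 0);
            recv(s, &byte, 1, 0);
        }
        gettimeofday(&t1, NULL);
        double us = (t1.tv_sec - t0.tv_sec) * 1e6
                  + (t1.tv_usec - t0.tv_usec);
        printf("avg round trip: %.1f us\n", us / REPS);
    }
    close(s);
    return 0;
}
```

Running the same binary over plain Ethernet and over the SCI-backed sockets makes the printed round-trip time a direct comparison.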
Summation Enterprises Pvt. Ltd.
Brief Company Profile
• Our expertise: clustering for High Performance Technical Computing, clustering for High Availability, terabyte storage solutions, SANs
• O.S. skills: Linux (Alpha 64-bit; x86 32- and 64-bit), Solaris (SPARC and x86), Tru64 UNIX, Windows NT/2K/2003 and the QNX Realtime O.S.
Summation milestones
• Working with Linux since 1996
• First in India to deploy/support 64-bit Alpha Linux workstations (1999)
• First in India to spec, deploy and support a 26-processor Alpha Linux cluster (2001)
• Only company in India to have worked with Gigabit, SCI and Myrinet interconnects
• Involved with the design, setup and support of many of the largest HPTC clusters in India
Exclusive Distributors / System Integrators in India
• Dolphin Interconnect AS, Norway – SCI interconnect for supercomputer performance
• Scali AS, Norway – cluster management tools
• Absoft, Inc., USA – FORTRAN development tools
• Steeleye Inc., USA – High Availability clustering and Disaster Recovery solutions for Windows & Linux; Summation is the sole distributor, consulting services & technical support partner for Steeleye in India
Partnering with Industry Leaders
• Sun Microsystems, Inc. – focus on Education & Research segments; High Performance Technical Computing; Grid Computing Initiative with Sun Grid Engine (SGE/SGEE); HPTC Competency Centre
Wulfkit / HPTC users
• Institute of Mathematical Sciences, Chennai – 144-node dual Xeon Wulfkit 3D cluster; 9-node dual Xeon Wulfkit 2D cluster; 9-node dual Xeon Ethernet cluster; 1.4 TB RAID storage
• Bhabha Atomic Research Centre, Mumbai – 64-node dual Xeon Wulfkit 2D cluster; 40-node P4 Wulfkit 3D cluster; Alpha servers / Linux OpenGL workstations / rackmount servers
• Harish-Chandra Research Institute, Allahabad – 42-node dual Xeon Wulfkit cluster; 1.1 TB RAID storage
Wulfkit / HPTC users (contd.)
• Intel Technology India Pvt. Ltd., Bangalore – 8-node dual Xeon Wulfkit clusters (ten nos.)
• NCRA (TIFR), Pune – 4-node Wulfkit 2D cluster
• Bharat Forge Ltd., Pune – 9-node dual Xeon Wulfkit 2D cluster
• Indian Rare Earths Ltd., Mumbai – 26-processor Alpha Linux cluster with RAID storage
• Tata Institute of Fundamental Research, Mumbai – RISC/UNIX servers, 4-node Xeon cluster
• Centre for Advanced Technology, Indore – Alpha / Sun workstations