Voltaire - Reducing the Runtime of Collective Communications


Description

Presented at ISC '10 Birds of a Feather Session

Transcript of Voltaire - Reducing the Runtime of Collective Communications

  • 1. Reducing the Runtime of Collective Communications, ISC'10 Birds of a Feather Session, June 3, 2010. © 2010 Voltaire Inc.

  • 2. Agenda: Scalability Challenges for Group Communication; Voltaire Fabric Collective Accelerator (FCA), Yaron Haviv, CTO, Voltaire; Customer Experience: University of Braunschweig, Josef Schüle.

  • 3. About Voltaire (NASDAQ: VOLT): leading provider of scale-out data center fabrics, used by more than 30% of Fortune 100 companies, with hundreds of installations of over 1,000 servers. Addressing the challenges of HPC, virtualized data centers and clouds: more than half of TOP500 InfiniBand sites; InfiniBand and 10GbE scale-out fabrics; an end-to-end scale-out fabric product line.

  • 4. MPI Collectives: collective operations = group communication (All-to-All, One-to-All, All-to-One). Synchronous by nature, so they consume many wait cycles on large clusters. Popular examples: Reduce, Allreduce, Barrier, Bcast, Gather, Allgather. [Chart: collective operations as a percentage of MPI job runtime for ANSYS FLUENT, SAGE, CPMD, LSTC LS-DYNA, CD-adapco STAR-CD and Dacapo.] Your cluster might be spending half its time on idle collective cycles.

  • 5. Collective Example, Allreduce: the concept is to perform a specific operation on the arguments contributed by all processes and distribute the result to all processes. [Diagram: Allreduce with the SUM operation on a 4-node cluster; every rank ends up holding the same total, 144.] (A minimal MPI_Allreduce sketch follows the transcript.)

  • 6. Now try running it on a petascale machine: dozens of core switches (3 hops), hundreds of edge switches (1 hop), tens of thousands of cores. A single operation takes more than 3,000 usec. Not scalable.

  • 7. The Challenge, Collective Operations Scalability: grouping algorithms are unaware of the topology and inefficient; network congestion arises from all-to-all communication; slow nodes and OS involvement impair scalability and predictability (expected vs. actual behavior). The more powerful servers get (GPUs, more cores), the poorer collectives scale in the fabric.

  • 8. The Voltaire InfiniBand Fabric, Equipped for the Challenge: Grid Director switches (fabric processing power) plus the Unified Fabric Manager (UFM, a topology-aware orchestrator). Fabric computing in use to address the collective challenge.

  • 9. Introducing the Voltaire Fabric Collective Accelerator: Grid Director switches, with collective operations offloaded to switch CPUs; FCA Manager / UFM, providing a topology-based collective tree, a separate virtual network for result distribution (IB multicast), and integration with job schedulers; FCA Agent, with inter-core processing localized and optimized. Breakthrough performance with no additional hardware.

  • 10. Efficient Collectives with FCA: 1. pre-configuration; 2. inter-core processing; 3. first-tier offload; 4. second-tier offload (result at root); 5. result distribution (single message); 6. Allreduce on 100K cores in 25 usec. [Diagram: partial sums aggregated per node, per first-tier switch, and at the root, then the result multicast back to all ranks.]
  • 11. UFM Integrated with Job Schedulers, Matching Jobs Automatically: a job submitted in the scheduler is created in UFM with QoS, routing, placement, collectives, and application-level monitoring. A fabric-wide policy is pushed to match application requirements, with optimization and measurements.

  • 12. FCA Benefits, Slashing Job Runtime: [Chart: IMB Allreduce on 2,048 cores; Open MPI takes more than 3,000 usec, while FCA-Allreduce and FCA-Barrier complete roughly 180x faster.] (A latency-timing sketch in the spirit of this benchmark follows the transcript.)

  • 13. Extreme performance improvement on raw collectives: scaling follows the number of switch hops, not the number of nodes (O(log_18 N)). As the process count increases, the percentage of time spent in MPI collectives increases. Enabling capability computing on HPC clusters. (A log-depth reduction sketch follows the transcript.)

  • 14. Additional Benefits: simple and fully integrated, with no changes to the application required. Tolerance to a higher oversubscription (blocking) ratio: same performance at lower cost. Enables use of non-blocking collectives, part of future MPI implementations, and FCA guarantees no computation power penalty (see the non-blocking sketch after the transcript). Reduced fabric congestion: avoids interference with other jobs.

  • 15. Customer Experience: University of Braunschweig, June 3, 2010.

  • 16. About the University of Braunschweig: founded in 1745; 120 institutes with ca. 2,900 employees; ca. 13,000 students. Main fields of research: mobility and transport (road, rail, air and space); biological and biotechnological research; digital television.

  • 17. System Configuration, newest installation: node type NEC HPC 1812Rb-2; CPU: 2 x Intel X5550; memory: 6 x 2 GB; IB: 1 x InfiniHost DDR onboard. 186 nodes, 24 nodes per DDR switch, 12 QDR links to tier-2 switches (non-blocking). Software: CentOS 5.4; Open MPI 1.4.1; FCA 1.0_RC3 rev 2760; UFM 2.3 RC7; switch firmware 3.0.629. [Diagram: two-tier fabric with 4 x QDR uplinks and 24 x DDR node links per edge switch.]

  • 18. FCA Performance, a Real Cluster Example with 2,048 Ranks: [Chart: collective latency in usec versus number of ranks (16 ranks per node), comparing ompi-Allreduce and ompi-Barrier with FCA-Allreduce and FCA-Barrier; FCA is about 180x faster, cutting roughly 4,000 usec down to tens of usec.]

  • 19. Real Application Results, OpenFOAM: an open-source CFD solver produced by a commercial company, OpenCFD, and used by many leading automotive companies. [Chart: OpenFOAM CFD aerodynamic benchmark on 64 cores, runtime in seconds for Open MPI 1.4.1 versus Open MPI 1.4.1 + FCA; about 41% better with FCA.] Similar benefits are expected for several other applications, e.g. DL_POLY (molecular dynamics).

  • 20. Voltaire Fabric Collective Accelerator, Summary: a fully integrated fabric computing offload, combining SW and HW in a single solution, offloading blocking computational tasks, with algorithms leveraging the topology for computation (trees). Extreme MPI performance and scalability: capability computing on commodity clusters; two orders of magnitude (hundred-times) faster collective runtime; scaling by number of hops, not number of nodes; variation eliminated, consistent results. Transparent to the application: plug and play, no code changes needed. Accelerate your fabric!

  • 21. Q&A.
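
The Allreduce operation described in slide 5 maps directly onto the standard MPI_Allreduce call. The listing below is a minimal sketch in C, not taken from the slides: each rank contributes one integer, the values are summed, and every rank receives the same result (with 8 ranks the sum is 36, unlike the 144 shown in the slide's diagram).

    /* Minimal illustration of the Allreduce concept from slides 4-5:
     * every rank contributes a value, the values are summed, and every
     * rank receives the same result. Plain MPI, nothing FCA-specific. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank contributes (rank + 1); with 8 ranks the sum is 36. */
        int local = rank + 1;
        int global = 0;

        /* Reduce with SUM and distribute the result to all ranks. */
        MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        printf("rank %d of %d: local=%d, allreduce sum=%d\n",
               rank, size, local, global);

        MPI_Finalize();
        return 0;
    }

Compile with an MPI wrapper compiler (e.g. mpicc) and launch with mpirun; whether the reduction runs in software or is offloaded by FCA is transparent to this code, which is the "no code changes" point of slide 20.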
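
Slide 13's claim that collective time should scale with the number of hops rather than the number of nodes rests on log-depth algorithms. The sketch below is a textbook recursive-doubling Allreduce built from point-to-point messages, assuming a power-of-two number of ranks; it illustrates the O(log N) step count only, and is not FCA's topology-aware tree, which additionally offloads the partial reductions to switch CPUs.

    /* Generic log-depth allreduce (recursive doubling) built from
     * point-to-point messages. After log2(size) exchange steps every
     * rank holds the full sum. Assumes size is a power of two. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double value = (double)(rank + 1);   /* this rank's contribution */

        /* In step k, each rank swaps its partial sum with the rank whose
         * id differs in bit k, then adds the received value. */
        for (int mask = 1; mask < size; mask <<= 1) {
            int partner = rank ^ mask;
            double recvd;
            MPI_Sendrecv(&value, 1, MPI_DOUBLE, partner, 0,
                         &recvd, 1, MPI_DOUBLE, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            value += recvd;
        }

        printf("rank %d: sum = %g\n", rank, value);

        MPI_Finalize();
        return 0;
    }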
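
The latency figures quoted in slides 12 and 18 come from the Intel MPI Benchmarks (IMB). A rough sketch of that style of measurement, timing many back-to-back 8-byte Allreduce operations and reporting the average, is shown below; IMB itself adds warm-up iterations and sweeps message sizes, so this is only illustrative.

    /* Rough sketch of the measurement behind the IMB Allreduce numbers
     * on slides 12 and 18: time a loop of Allreduce calls and report
     * the average latency per operation. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iters = 1000;
        double local = (double)rank, global = 0.0;

        MPI_Barrier(MPI_COMM_WORLD);          /* start all ranks together */
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; ++i)
            MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("average Allreduce latency: %.1f usec\n",
                   (t1 - t0) / iters * 1e6);

        MPI_Finalize();
        return 0;
    }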
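
Slide 14 mentions non-blocking collectives as part of future MPI implementations; they were later standardized in MPI-3 (e.g. MPI_Iallreduce), so the sketch below requires an MPI-3 library. It shows the overlap pattern the slide alludes to: start the reduction, compute while it progresses, then wait for the result. The useful_work function is a hypothetical stand-in for application computation.

    /* Sketch of the non-blocking collective pattern referenced on
     * slide 14. Requires MPI-3 for MPI_Iallreduce. */
    #include <mpi.h>
    #include <stdio.h>

    static double useful_work(double x)
    {
        /* Hypothetical stand-in for computation that overlaps with
         * the in-flight collective. */
        for (int i = 0; i < 1000000; ++i)
            x = x * 0.999999 + 1.0;
        return x;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = (double)rank + 1.0, global = 0.0;
        MPI_Request req;

        /* Start the reduction without blocking. */
        MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        double w = useful_work(local);       /* overlap computation */

        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* result needed from here */

        if (rank == 0)
            printf("sum=%g (work=%g)\n", global, w);

        MPI_Finalize();
        return 0;
    }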