![Page 1: A brief overview with emphasis on cluster performance Eric Lantz (elantz@microsoft.com)elantz@microsoft.com Lead Program Manager, HPC Team Microsoft Corp.](https://reader036.fdocuments.us/reader036/viewer/2022062321/56649dc65503460f94abaead/html5/thumbnails/1.jpg)
A brief overview with emphasis on cluster performance
Eric Lantz (elantz@microsoft.com), Lead Program Manager, HPC Team, Microsoft Corp.
Fab Tillier, Developer, HPC Team, Microsoft Corp.
A Brief Overview of this second release from Microsoft’s HPC team.
Some Applicable Market Data
• IDC Cluster Study (113 sites, 303 clusters, 29/35/36 GIA split)
  • Industry self-reports average of 85 nodes per cluster
  • When needing more computing power: ~50% buy a new cluster, ~50% add nodes to existing cluster
  • When purchasing:
    • 61% buy direct from vendor, 67% have integration from vendor
    • 51% use a standard benchmark in purchase decision
    • Premium paid for lower network latency as well as power and cooling solutions
• Applications Study (IDC Cluster Study, IDC App Study (250 codes, 112 vendors, 11 countries) - Visits)
  • Application usage
    • Apps use 4-128 CPUs and are majority in-house developed
    • Majority multi-threaded
    • Only 15% use whole cluster
    • In practice 82% are run at 32 processors or below
  • Excel running in parallel is an application of broad interest
• Top challenges for implementing clusters:
  • Facility issues with power and cooling
  • System management capability
  • Complexity implementing parallel algorithms
  • Interconnect latency
  • Complexity of system purchase and deployment
Sources: 2006 IDC Cluster Study, HECMS, 2006 Microsoft HEWS Study
Markets Addressed by HPCS2008
Key HPC Server 2008 Features
• Systems Management
  • New admin console based on System Center UI framework integrates every aspect of cluster management
  • Monitoring heat map allows viewing cluster status at-a-glance
  • High availability for multiple head nodes
  • Improved compute node provisioning using Windows Deployment Services
  • Built-in system diagnostics and cluster reporting
• Job Scheduling
  • Integration with the Windows Communication Foundation, allowing SOA application developers to harness the power of parallel computing offered by HPC solutions
  • Job scheduling granularity at processor core, processor socket, and compute node levels
  • Support for Open Grid Forum’s HPC-Basic Profile interface
• Networking and MPI
  • Network Direct, providing dramatic RDMA network performance improvements for MPI applications
  • Improved Network Configuration Wizard
  • New shared memory MS-MPI implementation for multicore servers
  • MS-MPI integrated with Event Tracing for Windows and Open Trace Format translation
• Storage
  • Improved iSCSI SAN and Server Message Block (SMB) v2 support in Windows Server 2008
  • New parallel file system support and vendor partnerships for clusters with high-performance storage needs
  • New memory cache vendor partnerships
End-To-End Approach To Performance
• Multi-Core is Key
  • Big improvements in MS-MPI shared memory communications
• NetworkDirect
  • A new RDMA networking interface built for speed and stability
• Devs can't tune what they can't see
  • MS-MPI integrated with Event Tracing for Windows
• Perf takes a village
  • Partnering for perf
• Regular Top500 runs
  • Performed by the HPCS2008 product team on a permanent, scale-testing cluster
Multi-Core is Key
Big improvements in MS-MPI shared memory communications
• MS-MPI automatically routes between:
  • Shared Memory: between processes on a single [multi-proc] node
  • Network: TCP, RDMA (WinsockDirect, NetworkDirect)
• MS-MPIv1 monitored incoming shmem traffic by aggressively polling [for low latency], which caused:
  • Erratic latency measurements
  • High CPU utilization
• MS-MPIv2 uses entirely new shmem approach
  • Direct process-to-process copy to increase shm throughput.
  • Advanced algorithms to get the best shm latency while keeping CPU utilization low.
[Figure: Prelim shmem results]
NetworkDirect
A new RDMA networking interface built for speed and stability
• Priorities
  • Equal to hardware-optimized stacks for MPI micro-benchmarks
  • Focus on MPI-only solution for CCSv2
  • Verbs-based design for close fit with native, high-perf networking interfaces
  • Coordinated w/ Win Networking team’s long-term plans
• Implementation
  • MS-MPIv2 capable of 4 networking paths:
    • Shared Memory between processors on a motherboard
    • TCP/IP Stack (“normal” Ethernet)
    • Winsock Direct for sockets-based RDMA
    • New NetworkDirect interface
  • HPC team partnering with networking IHVs to develop/distribute drivers for this new interface
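The four paths listed above can be pictured as a simple selection rule. The sketch below is only an illustration: the slides say MS-MPI routes automatically but do not say how it ranks the network paths, so the preference order, the `Rank` type, and the `choose_path` function are all hypothetical.

```python
from dataclasses import dataclass

# ASSUMPTION: this preference order is made up for illustration. The slides
# name the four paths but do not state how MS-MPI chooses among them.
NETWORK_PREFERENCE = ["NetworkDirect", "WinsockDirect", "TCP"]


@dataclass(frozen=True)
class Rank:
    """Hypothetical stand-in for an MPI process and the node it runs on."""
    rank_id: int
    node: str


def choose_path(src: Rank, dst: Rank, available: set) -> str:
    """Pick a transport for traffic from src to dst.

    Processes on the same node use shared memory; otherwise the first
    available network path in NETWORK_PREFERENCE is used.
    """
    if src.node == dst.node:
        return "SharedMemory"
    for path in NETWORK_PREFERENCE:
        if path in available:
            return path
    raise RuntimeError("no network path available between nodes")


if __name__ == "__main__":
    a, b, c = Rank(0, "node1"), Rank(1, "node1"), Rank(2, "node2")
    print(choose_path(a, b, {"TCP"}))                   # same node
    print(choose_path(a, c, {"TCP", "NetworkDirect"}))  # RDMA-capable fabric
    print(choose_path(a, c, {"TCP"}))                   # plain Ethernet
```

The point of the sketch is that the application never names a transport; the choice depends only on where the two processes run and which providers are installed.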
[Figure: networking stack diagram with User Mode and Kernel Mode regions. Labeled boxes: MPI App, Socket-Based App, (ISV) App, MS-MPI, Windows Sockets (Winsock + WSD), WinSock Direct Provider, NetworkDirect Provider, User Mode Access Layer, TCP, IP, NDIS, Mini-port Driver, Hardware Driver, Networking Hardware. Paths: TCP/Ethernet Networking, Kernel By-Pass, RDMA Networking. Legend: OS Component, CCP Component, IHV Component.]
Devs can't tune what they can't see
MS-MPI integrated with Event Tracing for Windows
• Single, time-correlated log of: OS, driver, MPI, and app events
• CCS-specific additions
  • High-precision CPU clock correction
  • Log consolidation from multiple compute nodes into a single record of parallel app execution
• Dual purpose:
  • Performance Analysis
  • Application Trouble-Shooting
• Trace Data Display
  • Visual Studio & Windows ETW tools
  • !Soon! Vampir Viewer for Windows
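To make the "clock correction" and "log consolidation" bullets concrete, here is a minimal sketch of the idea: shift each node's timestamps by a per-node offset, then merge the per-node logs into one time-ordered record. This is not the MS-MPI or ETW implementation; the `consolidate` function, the event tuples, and the fixed per-node offsets are assumptions made for illustration.

```python
import heapq


def consolidate(per_node_logs, clock_offsets):
    """Merge per-node event logs into one time-ordered list.

    per_node_logs: {node: [(local_timestamp_us, event), ...]}, each list
        already sorted by its node-local timestamp.
    clock_offsets: {node: microseconds to add so the node's clock lines up
        with a shared reference clock}. How the offsets are measured is
        outside this sketch.
    """
    def corrected(node, events):
        offset = clock_offsets.get(node, 0)
        for timestamp, event in events:
            yield (timestamp + offset, node, event)

    streams = [corrected(node, events) for node, events in per_node_logs.items()]
    # heapq.merge performs a lazy k-way merge of already-sorted streams.
    return list(heapq.merge(*streams))


if __name__ == "__main__":
    logs = {
        "node1": [(10_000, "MPI_Send enter"), (10_004, "MPI_Send exit")],
        "node2": [(9_501, "MPI_Recv enter"), (9_506, "MPI_Recv exit")],
    }
    # Hypothetical: node2's clock runs 500 microseconds behind the reference.
    offsets = {"node1": 0, "node2": 500}
    for timestamp, node, event in consolidate(logs, offsets):
        print(timestamp, node, event)
```

Without the offset, node2's receive would appear to start before node1's send; with it, the merged log shows the events interleaved in a plausible order.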
[Figure: trace workflow diagram. Labels: mpiexec.exe (-trace args), logman.exe, Trace settings (mpitrace.mof), MS-MPI, Windows ETW Infrastructure, Trace Log File, Convert to text, Live feed, multiple Trace Log Files.]
Perf takes a village (Partnering for perf)
• Networking Hardware vendors
  • NetworkDirect design review
  • NetworkDirect & WinsockDirect provider development
• Windows Core Networking Team
• Commercial Software Vendors
  • Win64 best practices
  • MPI usage patterns
  • Collaborative performance tuning
  • 3 ISVs and counting
• 4 benchmarking centers online
  • IBM, HP, Dell, SGI
Regular Top500 runs
• MS HPC team just completed a 3rd entry to the Top500 list
  • Using our dev/test scale cluster (Rainier)
  • Currently #116 on Top500
  • Best efficiency of any Clovertown with SDR IB (77.1%)
• Learnings incorporated into white papers & CCS product
Configuration:
• 260 Dell Blade Servers
• 1 Head node
• 256 compute nodes
• 1 IIS server
• 1 File Server
• App/MPI: Infiniband
• Private: Gb-E
• Public: Gb-E
Location:
• Microsoft Tukwila Data center (22 miles from Redmond campus)
• Each compute node has two quad-core Intel 5320 Clovertown, 1.86GHz, 8GB RAM
• Total
• 2080 Cores
• 2+TB RAM
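As a rough worked example of what the 77.1% efficiency figure means, the sketch below computes a theoretical peak for the 256 compute nodes and the sustained Linpack rate that the efficiency implies. The node count, socket and core counts, clock speed, and efficiency come from the slides; the value of 4 double-precision floating-point operations per core per cycle is an assumption about this processor generation and is not stated in the deck.

```python
# Figures taken from the slides.
COMPUTE_NODES = 256
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 4
CLOCK_GHZ = 1.86
EFFICIENCY = 0.771  # reported Linpack efficiency (77.1%)

# ASSUMPTION: 4 double-precision flops per core per cycle. Not in the slides.
FLOPS_PER_CYCLE = 4


def compute_cores() -> int:
    """Cores on compute nodes only (head, IIS, and file servers excluded)."""
    return COMPUTE_NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET


def peak_gflops() -> float:
    """Theoretical peak: cores x clock (GHz) x flops per cycle."""
    return compute_cores() * CLOCK_GHZ * FLOPS_PER_CYCLE


def sustained_gflops() -> float:
    """Sustained rate implied by the reported efficiency."""
    return peak_gflops() * EFFICIENCY


if __name__ == "__main__":
    print("compute cores:", compute_cores())  # 2048
    print("peak GFLOPS: %.2f" % peak_gflops())
    print("implied sustained GFLOPS: %.2f" % sustained_gflops())
```

The 2048 compute cores differ from the 2080 total on the slide because the total counts all 260 servers at 8 cores each, including the servers that are not compute nodes.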
What is Network Direct?
• What Verbs should look like for Windows:
  • Service Provider Interface (SPI)
    • Verbs Specifications are not APIs!
  • Aligned with industry-standard Verbs
    • Some changes for simplicity
    • Some changes for convergence of IB and iWARP
  • Windows-centric design
    • Leverage Windows asynchronous I/O capabilities
ND Resources
Resources Explained
| Resource | Description |
| --- | --- |
| Provider | Represents the IHV driver |
| Adapter | Represents an RDMA NIC; container for all other resources |
| Completion Queue (CQ) | Used to get I/O results |
| Endpoint (EP) | Used to initiate I/O; used to establish and manage connections |
| Memory Registration (MR) | Make buffers accessible to HW for local access |
| Memory Window (MW) | Make buffers accessible for remote access |
ND to Verbs Resource Mapping
| Network Direct | Verbs |
| --- | --- |
| Provider | N/A |
| Adapter | HCA/RNIC |
| Completion Queue (CQ) | Completion Queue (CQ) |
| Endpoint (EP) | Queue Pair (QP) |
| Memory Registration (MR) | Memory Region (MR) |
| Memory Window (MW) | Memory Window (MW) |
ND SPI Traits
• Explicit resource management
  • Application manages memory registrations
  • Application manages CQ to Endpoint bindings
• Only asynchronous data transfers
  • Initiate requests on an Endpoint
  • Get request results from the associated CQ
• Application can use event driven and/or polling I/O model
  • Leverage Win32 asynchronous I/O for event driven operation
  • No kernel transitions for polling mode
• “Simple” Memory Management Model
  • Memory Registrations are used for local access
  • Memory Windows are used for remote access
• IP Addressing
  • No proprietary address management required
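The asynchronous model described above (initiate a request on an Endpoint, collect its result later from the bound Completion Queue) can be shown with a toy, in-memory model. This is not the Network Direct SPI: the class names, method names, and the `progress` step that stands in for the hardware are all hypothetical, chosen only to show the shape of the polling pattern.

```python
from collections import deque


class CompletionQueue:
    """Toy completion queue: holds results until the application polls."""

    def __init__(self):
        self._results = deque()

    def _complete(self, request_id, status):
        self._results.append((request_id, status))

    def poll(self, max_results=1):
        """Return up to max_results completions without blocking."""
        out = []
        while self._results and len(out) < max_results:
            out.append(self._results.popleft())
        return out


class Endpoint:
    """Toy endpoint: the application binds it to a CQ explicitly."""

    def __init__(self, cq):
        self.cq = cq
        self._pending = deque()
        self._next_id = 0

    def send(self, buffer):
        """Queue a transfer and return at once; the result arrives via the CQ."""
        request_id = self._next_id
        self._next_id += 1
        self._pending.append((request_id, bytes(buffer)))
        return request_id

    def progress(self):
        """Hypothetical stand-in for the adapter finishing outstanding work."""
        while self._pending:
            request_id, _payload = self._pending.popleft()
            self.cq._complete(request_id, "ok")


if __name__ == "__main__":
    cq = CompletionQueue()
    ep = Endpoint(cq)
    first = ep.send(b"hello")
    second = ep.send(b"world")
    print(cq.poll())               # nothing has completed yet
    ep.progress()
    print(cq.poll(max_results=8))  # both requests, in submission order
```

In this model `send` never reports success or failure itself, which mirrors the "only asynchronous data transfers" trait: the caller learns the outcome only by checking the completion queue.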
ND SPI Model
• Collection of COM interfaces
  • No COM runtime dependency
  • Use the interface model only
  • Follows model adopted by the UMDF
• Thread-less providers
  • No callbacks
• Aligned with industry standard Verbs
  • Facilitates IHV adoption
Why COM Interfaces?
• Well understood programming model
• Easily extensible via IUnknown::QueryInterface
  • Allows retrieving any interface supported by an object
• Object oriented
• C/C++ language independent
  • Callers and providers can be independently implemented in C or C++ without impact on one another
• Interfaces support native code syntax - no wrappers
Asynchronous Operations
• Win32 Overlapped operations used for:
  • Memory Registration
  • CQ Notification
  • Connection Management
• Client controls threading and completion mechanism
  • I/O Completion Port or GetOverlappedResult
• Simpler for kernel drivers to support
  • IoCompleteRequest – I/O manager handles the rest.
• Microsoft HPC web site - HPC Server 2008 (beta) Available Now!!
  • http://www.microsoft.com/hpc
• Network Direct SPI documentation, header and test executables
  • In the HPC Server 2008 (beta) SDK
  • http://www.microsoft.com/hpc
• Microsoft HPC Community Site
  • http://windowshpc.net/default.aspx
• Argonne National Lab’s MPI website
  • http://www-unix.mcs.anl.gov/mpi/
• CCS 2003 Performance Tuning Whitepaper
  • http://www.microsoft.com/downloads/details.aspx?FamilyID=40cd8152-f89d-4abf-ab1c-a467e180cce4&DisplayLang=en
  • Or go to http://www.microsoft.com/downloads and search for CCS Performance
Socrates software boosts performance by 30% on Microsoft cluster to achieve 77.1% overall cluster efficiency
Performance improvement was demonstrated with exactly the same hardware and is attributed to:
• Improved networking performance of MS-MPI’s NetworkDirect interface
• Entirely new MS-MPI implementation for shared memory communications
• Tools and scripts to optimize process placement and tune the Linpack parameters for this 256-node, 2048-processor cluster
• Windows Server 2008 improvements in querying completion port status
• Use of Visual Studio’s Profile Guided Optimization (POGO) on the Linpack, MS-MPI, and the ND provider binaries