Fermi Cluster for Real-Time Hyperspectral Scene Generation
Gary McMillian, Ph.D. Crossfield Technology LLC
9390 Research Blvd, Suite I200 Austin, TX 78759-7366
(512)795-0220 x151 [email protected]
AF SBIR Program, Donald Snyder, III Program Manager
Funding provided by Frank Carlen, Multi-Spectral Test
System Architecture & Approach
• Scenes generated by heterogeneous processors, then transported over InfiniBand to the projector(s) using the RDMA protocol for high throughput and low latency
• Network interfaces aggregate data from multiple heterogeneous processors in high-speed frame buffers
• Contents of frame buffers output to projector through FPGA Mezzanine Card (FMC) interface
• IEEE 1588 Precision Time Protocol (PTP) provides global time synchronization
• Heterogeneous processors and projector network interfaces scale independently
7/20/11 Crossfield Technology LLC 2
Scalable System Architecture
[Figure: scalable system architecture. Processor nodes (CPU/GPU) connect through an InfiniBand switch to network interface adapters; each network interface drives a projector / HWIL over DVI, LVDS or fiber.]
HWIL Simulation System
[Figure: HWIL simulation system block diagram. Blocks and labels:
– 1U-4U Heterogeneous Processor: two CPUs with DDR3 SDRAM linked by QPI, PCIe bridges, GPUs with GDDR5 SDRAM, network adapters on PCIe x8, SSD
– InfiniBand Switch (36-648 ports)
– 1U Crossfield Network Interface: network adapter on PCIe x8, CPU with DDR3 SDRAM, FPGA with DDR3 SDRAM, FMC, PHY
– Projector / HWIL, with user-definable frame synch/request
– IEEE 1588 PTP Server + Ethernet]
Approximate link bandwidths:
• QuickPath Interconnect (QPI): ~100 Gbps
• PCI Express x8: ~32 Gbps (x16: ~64 Gbps)
• DDR3 SDRAM: ~85 Gbps/ch x 3 ch
• GDDR5 SDRAM: ~192 Gbps/ch x 6 ch
• QDR InfiniBand: ~32 Gbps
• VITA 57.1 / FMC: ~100 Gbps SERDES + 120 Gbps LVDS I/O
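The approximate bandwidths listed for this diagram can be reproduced from basic link parameters. The following is a minimal Python sketch, not part of the original slides. The signaling rates and encoding overheads (5 GT/s per PCIe Gen 2 lane and 10 Gbps per QDR InfiniBand lane with 8b/10b encoding, a 4x InfiniBand link, 64-bit DDR3-1333 channels, 16 data bits per QPI transfer at 6.4 GT/s) are assumptions taken from the public interface specifications, and the helper name `payload_gbps` is hypothetical.

```python
def payload_gbps(gigatransfers_per_s, width_bits, efficiency=1.0):
    """Payload bandwidth in Gbps: transfer rate x link width x encoding efficiency."""
    return gigatransfers_per_s * width_bits * efficiency


ENC_8B10B = 8 / 10  # 8b/10b line coding carries 8 payload bits per 10 line bits

# All values are per direction and ignore protocol overhead (assumption).
bandwidth_gbps = {
    "QPI (6.4 GT/s, 16 data bits)": payload_gbps(6.4, 16),
    "PCIe Gen 2 x8": payload_gbps(5.0, 8, ENC_8B10B),
    "PCIe Gen 2 x16": payload_gbps(5.0, 16, ENC_8B10B),
    "DDR3-1333 channel (64-bit)": payload_gbps(1.333, 64),
    "QDR InfiniBand 4x": payload_gbps(10.0, 4, ENC_8B10B),
}

# Six GDDR5 channels at ~192 Gbps each (from the slide), converted to GB/s.
gddr5_total_gbytes = 192 * 6 / 8

for name, gbps in bandwidth_gbps.items():
    print(f"{name}: ~{gbps:.0f} Gbps")
print(f"GDDR5 total: {gddr5_total_gbytes:.0f} GB/s")
```

Under these assumptions the GDDR5 total works out to 144 GB/s, which agrees with the Tesla C2070 memory bandwidth quoted later in the deck.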
REAL-TIME HIGH PERFORMANCE COMPUTER (HPC)
Real-Time HPC Requirements
• Deterministic & Synchronous
– Synthesized images complete & ready at HWIL frame rate
• High Floating-Point Performance
– Implement physics-based algorithms
• High Bandwidth
– Inter-processor communications for data exchange
– Stream high-resolution images to projector at high frame rates
• High Memory Capacity & Performance
– Processor memory – code, model parameters, data
– Non-volatile storage – code, model parameters, data, logging
Intel Xeon Processor Roadmap
Sandy Bridge Microarchitecture
• 32 nm process, 4-8 cores
• 40 lanes PCI Express Gen 3.0
• 4 channels DDR3-1600
Westmere Microarchitecture
• 32 nm process, 6 cores
• 40 lanes PCI Express Gen 2.0
• 3 channels DDR3-1333
Nvidia CUDA GPU Roadmap
21 SEP 2010
• Kepler – To be released sometime in 2011, 28 nm process. Estimated performance of 4-6 DP GFLOPS/W
• Maxwell – To be released sometime in 2013, 22 nm process. Estimated performance of 15-16 DP GFLOPS/W
Nvidia Tesla (Fermi Architecture)
• CUDA™ Programming Environment
– C/C++, Fortran, OpenCL, Java, Python or DirectX Compute
• GIGATHREAD™ Engine
– 515 GFLOPS double precision
– 1030 GFLOPS single precision
• PARALLEL DATACACHE™ Technology
– 3 - 6 GB GDDR5 memory
– 384-bit bus
– ECC option
• GPUDirect™ with InfiniBand
• PCI Express 2.0 (16 lanes)
– Two DMA engines for bi-directional data transfer
[Figure: Tesla C2050/C2070 and M2050/M2070]
Nvidia Tesla Comparison
Metric | Tesla C2070 | Tesla M2070 | Tesla M2090
Peak double precision floating point performance | 515 GFLOPS | 515 GFLOPS | 665 GFLOPS
Peak single precision floating point performance | 1030 GFLOPS | 1030 GFLOPS | 1331 GFLOPS
CUDA cores | 448 | 448 | 512
Memory size (GDDR5) | 6 GB | 6 GB | 6 GB
Memory bandwidth (ECC off) | 144 GB/s | 150 GB/s | 177 GB/s
Thermal Design Power (TDP) | 247 W | 225 W | 250 W
Retail price | $2300 | ~$2300 | ~$3500
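To relate the comparison above to the DP GFLOPS/W estimates on the GPU roadmap slide, the peak double-precision rating can be divided by the power rating. This is a rough Python sketch using only the figures in the table. Peak ratings over rated power are an upper bound rather than a measured efficiency, and the variable names are illustrative.

```python
# Peak double-precision GFLOPS and rated power (W), copied from the comparison table.
tesla = {
    "C2070": {"dp_gflops": 515, "tdp_w": 247},
    "M2070": {"dp_gflops": 515, "tdp_w": 225},
    "M2090": {"dp_gflops": 665, "tdp_w": 250},
}

dp_gflops_per_watt = {
    name: spec["dp_gflops"] / spec["tdp_w"] for name, spec in tesla.items()
}

for name, value in dp_gflops_per_watt.items():
    print(f"Tesla {name}: {value:.2f} DP GFLOPS/W")
```

By this measure the three Fermi boards land between roughly 2.1 and 2.7 DP GFLOPS/W, below the 4-6 DP GFLOPS/W estimated for Kepler.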
InfiniBand Roadmap
SDR - Single Data Rate
DDR - Double Data Rate
QDR - Quad Data Rate
FDR - Fourteen Data Rate
EDR - Enhanced Data Rate
HDR - High Data Rate
NDR - Next Data Rate
Mellanox ConnectX-2 Network Adapters
• Nvidia GPUDirect™
– InfiniBand adapter and Nvidia GPU share CPU memory region
• OpenFabrics Enterprise Distribution (OFED) Software
• Bandwidth
– 10G Ethernet
– 10/20/40G InfiniBand
– PCIe 2.0 (8 lanes)
• Performance
– 1 µs ping latency
– 50M MPI messages/s
• Protocol Support
– Remote Direct Memory Access (RDMA)
– OpenMPI, OSU MVAPICH, HPMPI, Intel MPI, MS MPI, Scali MPI
– TCP/UDP, IPoIB, SDP, RDS
– SRP, iSER, NFS RDMA, FCoIB, FCoE
Mellanox IS5200 InfiniBand Switch
• Non-blocking, full bisectional bandwidth
• 100-300 ns latency
• Up to 216 QSFP ports
– 17.28 Tb/s aggregate throughput
• 9U cabinet
– 6 spine modules
– 12 leaf modules
• 1 kW
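The aggregate throughput figure is consistent with the port count if each port is counted at the 40 Gb/s signaling rate in both directions. That reading is an assumption, since the slide does not state how the total is derived. A short Python check:

```python
PORTS = 216
PORT_RATE_GBPS = 40  # 40G InfiniBand signaling rate per direction
DIRECTIONS = 2       # assumption: the aggregate counts transmit and receive

aggregate_tbps = PORTS * PORT_RATE_GBPS * DIRECTIONS / 1000
print(f"{aggregate_tbps} Tb/s aggregate")
```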
Remote Direct Memory Access (RDMA)
• Remote Direct Memory Access enables data to be transferred from one processor’s memory to another processor’s memory across a network, without significantly involving either operating system
• RDMA supports zero-copy data transfers by enabling the network adapter to transfer data directly to or from application memory, eliminating the need to copy data between application memory and data buffers in the operating system kernel
• RDMA defines READ, WRITE and SEND/RECEIVE operations
• RDMA adapters support thousands of concurrent transactions using work queues
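The points above can be illustrated with a toy, in-process model of a one-sided RDMA WRITE. It is not a real verbs API and moves no data over a network. The class and method names (`ToyAdapter`, `register`, `post_write`, `process`) are hypothetical and only mirror the concepts of registered memory, work queues and completions.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MemoryRegion:
    """A registered application buffer that a peer may address by key."""
    buf: bytearray
    rkey: int


@dataclass
class ToyAdapter:
    """Hypothetical stand-in for an RDMA adapter; for illustration only."""
    regions: dict = field(default_factory=dict)       # rkey -> MemoryRegion
    work_queue: deque = field(default_factory=deque)  # posted work requests
    completions: list = field(default_factory=list)   # ids of finished requests

    def register(self, size, rkey):
        """Register a buffer so that a remote adapter can write into it."""
        region = MemoryRegion(bytearray(size), rkey)
        self.regions[rkey] = region
        return region

    def post_write(self, wr_id, local, remote, rkey, offset):
        """Queue a one-sided WRITE; the call returns before any data moves."""
        self.work_queue.append((wr_id, memoryview(local), remote, rkey, offset))

    def process(self):
        """Drain the work queue, as adapter hardware would do asynchronously."""
        while self.work_queue:
            wr_id, src, remote, rkey, offset = self.work_queue.popleft()
            dst = memoryview(remote.regions[rkey].buf)
            # Data lands directly in the target's application buffer,
            # with no intermediate staging buffer in this model.
            dst[offset:offset + len(src)] = src
            self.completions.append(wr_id)


sender, receiver = ToyAdapter(), ToyAdapter()
frame_buffer = receiver.register(16, rkey=0x1234)
sender.post_write(wr_id=1, local=b"frame", remote=receiver, rkey=0x1234, offset=4)
sender.process()
print(bytes(frame_buffer.buf[4:9]), sender.completions)
```

The receiver never runs code to accept the data, which corresponds to the one-sided nature of WRITE. In SEND/RECEIVE the target would first post a receive request.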
OpenFabrics Alliance (OFA) Open Source
[Figure: OpenFabrics software stack diagram. Layers, top to bottom:
– Application Level (Apps & Access Methods for using OF Stack): IP Based App Access, Sockets Based Access, Various MPIs, Block Storage Access, Clustered DB Access, Access to File Systems
– User APIs (User Space): User Level MAD API, Open SM, Diag Tools, UDAPL, SDP Lib, OpenFabrics User Level Verbs / API (InfiniBand, iWARP R-NIC), with kernel bypass
– Upper Layer Protocol (Kernel Space): IPoIB, SDP, SRP, iSER, RDS, NFS-RDMA RPC, Cluster File Sys
– Mid-Layer: SA Client, MAD, SMA, Connection Manager, Connection Manager Abstraction (CMA), OpenFabrics Kernel Level Verbs / API (InfiniBand, iWARP R-NIC)
– Provider: Hardware Specific Driver (one each for InfiniBand and iWARP)
– Hardware: InfiniBand HCA, iWARP R-NIC
Key: Common, InfiniBand, iWARP]
SA Subnet Administrator
MAD Management Datagram
SMA Subnet Manager Agent
PMA Performance Manager Agent
IPoIB IP over InfiniBand
SDP Sockets Direct Protocol
SRP SCSI RDMA Protocol (Initiator)
iSER iSCSI RDMA Protocol (Initiator)
RDS Reliable Datagram Service
UDAPL User Direct Access Programming Lib
HCA Host Channel Adapter
R-NIC RDMA NIC
GPU Server Options
• 1U server
– Dual Xeon 5600 processors & 5520 chipsets
– Three 16-lane + one 8-lane PCIe slots
– Supports 1-3 M2090 + 1-2 IB HCA
• 2U server
– Dual Xeon 5600 processors & 5520 chipsets
– Four 16-lane + two 8-lane PCIe slots (PLX 8647 switch)
– Supports 1-4 M2090 + 1-2 IB HCA
• 4U server
– Dual Xeon 5600 processors & 5520 chipsets
– Eight 16-lane PCIe slots (4 PLX 8647 switches)
– Supports 4-7 C2070 + 1-4 IB HCA
HPC System Configuration
• 4U Servers (64 + 1)
– Dual 6-core, 2.66 GHz Intel Xeon 5650 (Westmere) CPUs
– Dual Intel 5520 (Tylersburg-36D) IOH with 6.4 GT/s QPI
• Four 16-lane PCI Express Gen 2 slots
– Six 8 GB DDR3-1333 DIMMs (48 GB)
– Four Nvidia Tesla C2070 (Fermi) GPUs
– One Mellanox 40G InfiniBand Host Channel Adapter
– One 300 GB, 10K RPM disk drive
• Mellanox 40G InfiniBand Switch (216 ports max)
• Symmetricom IEEE 1588 PTP Master Clock
• APC Smart-UPS RT 6000VA (18) – 76 kW
• 42U Racks (9)
*65 nodes x 1.4 kW/node = 91 kW
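The footnote's load estimate can be set against the UPS capacity listed for this configuration. This is a small Python sketch using only the numbers on the slide. It works in integer watts to avoid floating-point rounding, and the variable names are illustrative.

```python
NODES = 65               # 64 + 1 servers
NODE_POWER_W = 1400      # 1.4 kW per node, from the footnote
UPS_CAPACITY_W = 76_000  # 18 x APC Smart-UPS RT 6000VA, as stated on the slide

load_w = NODES * NODE_POWER_W
headroom_w = UPS_CAPACITY_W - load_w

print(f"Load: {load_w / 1000:.0f} kW")
print(f"UPS capacity: {UPS_CAPACITY_W / 1000:.0f} kW")
print(f"Headroom: {headroom_w / 1000:.0f} kW")
```

With these figures the estimated load exceeds the listed UPS capacity, which may be one reason the later configurations specify a 100 kW UPS.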
Advanced HPC System Configuration
• 2U Servers (64 + 1)
– Dual 6-core, 2.66 GHz Intel Xeon 5650 (Westmere) CPUs
– Dual Intel 5520 (Tylersburg-36D) IOH with 6.4 GT/s QPI
• Four 16-lane + two 8-lane PCI Express Gen 2 slots (with switch)
– Six 8 GB DDR3-1333 DIMMs (48 GB)
– Three Nvidia Tesla M2090 (Fermi) GPUs
– Two Mellanox 40G InfiniBand Host Channel Adapters
– One 250 GB SSD (solid state disk)
• Mellanox 40G InfiniBand Switch (216 ports max)
• Symmetricom IEEE 1588 PTP Master Clock
• APC Symmetra PX SY100K100F UPS - 100 kW
• 42U Racks (4+1)
Future HPC System Configuration
• 2U Servers (64 + 1)
– Dual 8-core, 2.3 GHz Intel Xeon E5-2600 (Sandy Bridge) CPUs
• Four 16-lane + two 8-lane PCI Express Gen 3 slots (with switch)
– Eight 8 GB DDR3-1600 DIMMs (64 GB)
– Three Nvidia Tesla M2090 (Fermi) GPUs
– Two Mellanox 56G InfiniBand Host Channel Adapters
– One 250 GB SSD (solid state disk)
• Mellanox 56G InfiniBand Switch (648 ports max)
• Symmetricom IEEE 1588 PTP Master Clock
• APC Symmetra PX SY100K100F UPS - 100 kW
• 42U Racks (4+1)
IEEE 1588 Precision Time Protocol
• IEEE 1588-2008 Precision Time Protocol (PTP) Version 2 overcomes network and application latency and jitter through hardware time stamping at the physical layer of the network.
• IEEE 1588-2008 provides time transfer accuracy in the sub-microsecond range, a significant improvement in time synchronization accuracy over Network Time Protocol (NTP).
• The Symmetricom XLi Grandmaster is IEEE 1588-2008 PTP V2 compliant and time stamps PTP packets with a time stamp accuracy of 50 ns to UTC. Measured synchronization accuracy at a PTP client has been shown to be as good as a 17 ns offset from the XLi Grandmaster. Operating at 100BaseT line speed with deep time stamp packet buffers, the XLi Grandmaster can support thousands of 1588 clients.
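For context, a PTP client estimates its clock offset from four timestamps exchanged with the master (the delay request-response mechanism). The Python sketch below shows the standard arithmetic under the usual assumption of a symmetric network path. The function name and the example timestamps are invented for illustration and are not measurements from this system.

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Estimate slave clock offset and one-way path delay, in the units of the inputs.

    t1: master sends Sync (master clock)
    t2: slave receives Sync (slave clock)
    t3: slave sends Delay_Req (slave clock)
    t4: master receives Delay_Req (master clock)
    Assumes the path delay is the same in both directions.
    """
    master_to_slave = t2 - t1  # path delay + offset
    slave_to_master = t4 - t3  # path delay - offset
    offset = (master_to_slave - slave_to_master) // 2
    delay = (master_to_slave + slave_to_master) // 2
    return offset, delay


# Hypothetical hardware timestamps in nanoseconds.
offset_ns, delay_ns = ptp_offset_and_delay(
    t1=1_000_000, t2=1_000_517, t3=1_002_000, t4=1_002_483
)
print(f"offset = {offset_ns} ns, path delay = {delay_ns} ns")
```

Because the offset is computed from timestamp differences, any jitter in when the timestamps are taken feeds directly into the result, which is why hardware time stamping at the physical layer matters.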
Uninterruptible Power Supply (UPS)
• APC Symmetra PX 100kW
• Scalable to 100kW/100kVA
• 208V 3PH 332A Service
APC Symmetra PX Performance
HPC Performance
Metric | Node | System
Cores – CPU/GPU | 12/1536 | 768/98304
CPU SP FP Performance | 128 GFLOPS | 8 TFLOPS
CPU DP FP Performance | 64 GFLOPS | 4 TFLOPS
GPU SP FP Performance | 3990 GFLOPS | 255 TFLOPS
GPU DP FP Performance | 1995 GFLOPS | 128 TFLOPS
Main Memory Size | 48 GB | 3 TB
Main Memory BW | 64 GB/s | 4 TB/s
Disk Size | 250 GB | 16 TB
Disk IOPS (4 KB) | 20K | 1.28M
Disk R/W BW | 500/315 MB/s | 32/20 GB/s
Network BW | 50 Gb/s | 3.2 Tb/s
Power | 1.5 kW | 100 kW
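The System figures appear to be the Node figures scaled by 64 nodes, and the per-node GPU figures match three Tesla M2090 boards. The Python sketch below checks that reading. The factor of 64 (leaving out the "+1" server) is an inference from the numbers rather than something the slide states, and the dictionary keys are illustrative.

```python
NODES = 64         # inferred: the System column matches 64 nodes
GPUS_PER_NODE = 3  # three Tesla M2090 per node (Advanced configuration)
M2090 = {"cuda_cores": 512, "sp_gflops": 1331, "dp_gflops": 665}

# Per-node GPU totals derived from the per-board ratings.
node_gpu_cores = GPUS_PER_NODE * M2090["cuda_cores"]
node_gpu_sp = GPUS_PER_NODE * M2090["sp_gflops"]  # 3993, listed as 3990
node_gpu_dp = GPUS_PER_NODE * M2090["dp_gflops"]

# Node-column values as listed on the slide.
node = {
    "cpu_cores": 12,
    "gpu_cores": 1536,
    "gpu_sp_gflops": 3990,
    "gpu_dp_gflops": 1995,
    "memory_gb": 48,
    "network_gbps": 50,
}
system = {key: value * NODES for key, value in node.items()}

print(f"GPU cores per node: {node_gpu_cores}")
print(f"System GPU SP: {system['gpu_sp_gflops'] / 1000:.0f} TFLOPS")
print(f"System GPU DP: {system['gpu_dp_gflops'] / 1000:.0f} TFLOPS")
print(f"System memory: {system['memory_gb'] / 1024:.0f} TB")
print(f"System network: {system['network_gbps'] / 1000} Tb/s")
```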
HPC Procurement Schedule
• Breadboard Performance Evaluation 15 JUL
• Finalize HPC Configuration 15 JUL
– # Fermi Processors (4 -> 3)
– # IB Adapters (1 -> 2)
– UPS (100 kW), Server (4U -> 2U), SSD
• Request Final Vendor Quotes 1 AUG
• HPC Vendor Selection
– Issue HPC System Purchase Order OCT 31
• HPC System Integration & Test by Vendor
– 6-12 week delivery ARO
• Installation DEC 31
– Prepare electrical supply for UPS
REAL-TIME LINUX
Real-Time Operating System (RTOS)
• Requirements
– No dropped frames during simulation run
– Support Nvidia’s CUDA
– Support InfiniBand Adapter with GPUDirect™
– Support Precision Time Protocol (PTP) IEEE 1588
• Candidate RTOSes
– Concurrent Computer RedHawk
– Red Hat MRG (Messaging, Real-Time, Grid)
Interrupt Dispatch Latency*
*Ravi Malhotra, “Real-Time Performance on Linux-based Systems,” 2011 Freescale Technology Forum
Real-Time Support on Linux*
• Traditionally, Linux is not a real-time operating system
– Designed for server throughput performance rather than embedded systems latency
– Scheduling latencies can be unbounded
– Big kernel lock and other mechanisms (softIRQ) typically end up blocking real-time critical tasks
– Processes cannot be pre-empted while executing system calls
*Ravi Malhotra, “Real-Time Performance on Linux-based Systems,” 2011 Freescale Technology Forum
Sources of Latency & How RT Patch Helps*
*Ravi Malhotra, “Real-Time Performance on Linux-based Systems,” 2011 Freescale Technology Forum
HPC PERFORMANCE MODEL
Hyperformix Workbench Performance Model
Workbench Model Steps
The application consists of 9 steps that comprise the generation and transfer of a frame:
1. Projector requests frame (provides state data)
2. CPU sets up the Frame Generation Process
3. CPU writes task data to CPU Memory (DDR3 SDRAM)
4. CPU tasks the GPU to synthesize the Frame
5. GPU reads the task data from CPU memory
6. GPU synthesizes the Frame
7. GPU transfers the frame data to CPU memory
8. CPU tasks the InfiniBand Network Adapters to transfer the frame to the Crossfield Network Interface via the InfiniBand Switch
9. Network Adapters transfer the frame to FPGA memory using RDMA Protocol
Hyperformix Workbench Performance Model
Workbench Model Results
Application Steps | Response (µs)
Application.Step_1_Frame_Request_from_Projector.response | 1.151
Application.Step_2_and_3_Setup_Process_and_write_data_to_memory.response | 0.1923
Application.Step_4_CPU_tasks_GPU.response | 0.1923
Application.Step_5_GPU_reads_data_from_CPU_Memory.response | 0.4148
Application.Step_6_GPU_synthesizes_Frame_first_transfer.response | 1000
Application.Step_7_GPU_xfers_Frame_to_CPU_memory.response | 917.7
Application.Step_8_CPU_tasks_Network_Adapter_to_transfer_Frame_to_NI.response | 0.1682
Application.Step_9_Network_Adapter_xfer_frame_to_NI_FPGA_Memory.response | 2259
Application.Main_RT_App.All_Steps_transfer_RT_2 | 4179
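The total response time equals the sum of the individual step responses, which suggests the model runs the steps strictly in sequence. The Python sketch below reproduces the sum and derives the frame rate that a fully serial pipeline would imply. The frame-rate figure is an inference that assumes no overlap between frames, and the short step labels are paraphrased.

```python
# Step response times in microseconds, from the Workbench model results.
step_response_us = {
    "1 frame request from projector": 1.151,
    "2-3 setup and write task data": 0.1923,
    "4 CPU tasks GPU": 0.1923,
    "5 GPU reads task data": 0.4148,
    "6 GPU synthesizes frame": 1000,
    "7 GPU transfers frame to CPU memory": 917.7,
    "8 CPU tasks network adapter": 0.1682,
    "9 network adapter transfers frame to FPGA memory": 2259,
}

total_us = sum(step_response_us.values())
serial_frame_rate_hz = 1e6 / total_us  # assumes one frame at a time, no overlap
slowest_step = max(step_response_us, key=step_response_us.get)

print(f"Total: {total_us:.0f} µs")
print(f"Serial frame rate: {serial_frame_rate_hz:.0f} Hz")
print(f"Slowest step: {slowest_step}")
```

Steps 6, 7 and 9 account for nearly all of the total, so any overlap of synthesis and transfer between consecutive frames would raise the achievable frame rate above this serial estimate.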
PROJECTOR INTERFACE
Projector Interfaces
FPGA Mezzanine Cards (FMC)
1. Two Dual DVI
2. Parallel Fiber Optic Ports (8-10)
3. Digital Micromirror Device (DMD) Interface
– All modules provide 2 User Definable I/Os, e.g.
• HWIL Synchronization Signal
• Output Next Frame