A Quantitative Analysis of Runtime System Software for Asynchronous Multi-Tasking Computation
Thomas Sterling, M. Brodowicz, M. Anderson, L. D'Alessandro, D. Kogler
RoMoL 2016 Workshop
Center for Research in Extreme Scale Technologies
School of Informatics and Computing
Indiana University
March 17, 2016
Introduction
• Performance modeling and parameters
• Execution model: ParalleX
• HPX-5 runtime system software
• Applications
• Micro operation measurements
• Architecture implications
• When we were here before
Performance Factors - SLOWER
• Starvation
– Insufficiency of concurrency of work
– Impacts scalability and latency hiding
– Affects programmability
• Latency
– Time-measured distance for remote access and services
– Impacts efficiency
• Overhead
– Critical-time additional work to manage tasks & resources
– Impacts efficiency and granularity for scalability
• Waiting for contention resolution
– Delays due to simultaneous access requests to shared physical or logical resources
P = e(L,O,W) * S(s) * a(r) * U(E)
P – performance (ops)
e – efficiency (0 < e < 1)
s – application’s average parallelism
a – availability (0 < a < 1)
U – normalization factor/compute unit
E – watts per average compute unit
r – reliability (0 < r < 1)
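As a small worked illustration of how the factors combine multiplicatively, the sketch below evaluates P from factor values that have already been computed. The slide does not give the functional forms of e(L,O,W), S(s), a(r), or U(E), so none are assumed here; the function name `performance` and the sample numbers are hypothetical.

```python
def performance(e, S, a, U):
    """Combine already-evaluated factors as P = e * S * a * U.

    e: efficiency, must satisfy 0 < e < 1 (depends on L, O, W)
    S: scaling term derived from the application's average parallelism s
    a: availability, must satisfy 0 < a < 1 (depends on reliability r)
    U: normalization factor per compute unit (depends on E)
    """
    for name, value in (("e", e), ("a", a)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must lie strictly between 0 and 1")
    return e * S * a * U


# Hypothetical sample values, chosen only to show the arithmetic.
p = performance(e=0.5, S=64.0, a=0.99, U=1.0)
print(p)
```

Because the model is a pure product, halving the efficiency halves P regardless of how much parallelism is available.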
Performance Model, Full Example System
• Example system:
– 2 nodes
– 2 cores per node
– 2 memory banks per node
• Accounts for:
– Functional unit workload
– Memory workload/latency
– Network overhead/latency
– Context switch overhead
– Lightweight task management (red regions can have one active task at a time)
– Memory contention (green regions allow only a single memory access at a time)
– Network contention (blue region represents bandwidth cap)
– NUMA affinity of cores
• Assumes:
– Balanced workload
– Homogeneous system
– Flat network
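The intuition behind comparing blocking and non-blocking execution under such a model can be sketched with a much simpler closed-form estimate. This is not the model from the slides: it ignores contention, NUMA affinity, and the network, and assumes each task alternates a fixed amount of work with one remote access. All function names and the sample work and overhead values are hypothetical; only the latency of 8192 reg-ops and 64 tasks per core echo the deck.

```python
def blocking_time_per_task(work, latency):
    # A blocking core sits idle for the full latency of every remote access.
    return work + latency


def nonblocking_time_per_task(work, latency, overhead, tasks_per_core):
    # Each switch to another ready task costs `overhead`.
    busy = work + overhead
    # With too few tasks, the core still waits for the first task to return.
    round_trip = (work + overhead + latency) / tasks_per_core
    return max(busy, round_trip)


def gain(work, latency, overhead, tasks_per_core):
    """Ratio of blocking to non-blocking time per task (toy estimate)."""
    return blocking_time_per_task(work, latency) / nonblocking_time_per_task(
        work, latency, overhead, tasks_per_core
    )


# All quantities in the same unit (e.g. reg-ops); work and overhead are made up.
for overhead in (16, 128, 1024):
    print(overhead, round(gain(work=128, latency=8192, overhead=overhead,
                               tasks_per_core=64), 2))
```

In this toy estimate the gain shrinks as overhead grows, and with a single task per core the non-blocking variant is slightly slower than blocking because it pays the overhead without hiding any latency.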
Modeling the full example system
[Figure: Performance gain of non-blocking programs over blocking programs with varying core counts (memory contention) and overheads. Gain with respect to cores per node and overhead; latency of 8192 reg-ops, 64 tasks per core. Axis ticks: 1, 2, 4, 8, 16, 32; performance gain axis: 0 to 70.]
ParalleX Execution Model
ParalleX Computation Complex
-- Runtime Aware -- Logically Active -- Physically Active
HPX-5 Runtime Software Architecture
[Figure: layered diagram of the HPX-5 software stack. Labels:]
• APPLICATION LAYER: Lulesh, LibPXGL, N-Body, FMM
• GLOBAL ADDRESS SPACE: PGAS, AGAS
• PARCELS
• PROCESSES
• LCOs
• SCHEDULER: Worker threads
• NETWORK LAYER: ISIR, PWC
• OPERATING SYSTEM
• HARDWARE: Cores, NETWORK
Courtesy of Jayashree Candadai, IU
HPX: Distinguishing Features (1)
• Derived within the conceptual context of an execution model
• Derived within the context of the SLOWER performance model
• Global Name Space and active global address space
• ParalleX Processes
– Span and share multiple hardware nodes
– Hierarchical (nested)
– First-class objects
– Support user and node OS requests for global services
– Support data decomposition
• Message-driven computation with continuations
– Does not always return results to parent thread but migrates continuations
– Percolation for moving work to resources such as GPUs
• Compute complexes extend beyond typical threads
HPX: Distinguishing Features (2)
• Embedded data-structure (graphs) control objects
– Resolves arbitration of simultaneous use requests
• Distributed parallel control state through dynamic graphs of continuations (futures and dataflow)
– e.g., graph vertex/links insertion or deletion
• Copy semantics through Distributed Control Operations (DCO)
– Distributed arbitration of access conflicts to structure elements
– Graph structure changes
• Suspended (Depleted) threads (compute complexes) serve as control objects to build continuation graphs
– Planning
– Search spaces
• Responds to OS service requests for multi-node capabilities
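The idea of control state expressed through futures and dataflow can be illustrated with a minimal single-process sketch. It is not the HPX-5 API: the `Future` class and `dataflow` helper are hypothetical names, there is no distribution or locking, and continuations run immediately in the caller's context.

```python
class Future:
    """Single-assignment value that runs registered continuations once set."""

    def __init__(self):
        self._is_set = False
        self._value = None
        self._continuations = []

    def then(self, fn):
        # Run immediately if the value is already available, else defer.
        if self._is_set:
            fn(self._value)
        else:
            self._continuations.append(fn)

    def set(self, value):
        if self._is_set:
            raise RuntimeError("future already set")
        self._is_set, self._value = True, value
        pending, self._continuations = self._continuations, []
        for fn in pending:
            fn(value)


def dataflow(fn, *inputs):
    """Return a Future set to fn(*values) once every input Future is set."""
    out = Future()
    if not inputs:
        out.set(fn())
        return out
    values = [None] * len(inputs)
    remaining = [len(inputs)]

    def make_slot(index):
        def on_value(value):
            values[index] = value
            remaining[0] -= 1
            if remaining[0] == 0:
                out.set(fn(*values))
        return on_value

    for index, future in enumerate(inputs):
        future.then(make_slot(index))
    return out


a, b = Future(), Future()
total = dataflow(lambda x, y: x + y, a, b)
results = []
total.then(results.append)
a.set(2)   # nothing fires yet: b is still unset
b.set(3)   # last input arrives, so the sum is computed and delivered
```

No thread waits on `a` or `b` here; the addition is triggered by the arrival of the last input, which is the event-driven behavior the slide attributes to continuation graphs.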
LULESH
• Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics evolving a Sedov blast wave
• Developed at the ASCR Co-design center at Livermore National Lab
• Contains three types of communication patterns: face adjacent, 26 neighbor, and 13 neighbor communications each timestep
• LULESH is a static, regular computation – very well suited for MPI
Red indicates computation
White indicates communication
LULESH on a small cluster
LULESH on a large cluster
SpMV in HPCG
SpMV for parcels and memget
Wavelet Adaptive Multiresolution
Courtesy of Matt Anderson, IU
AMR in HPX-5
• Model of a short gamma ray burst
• Shows a fluid which is initially confined to a small high density high pressure region.
• Explodes creating spherical shock wave
• Colors indicate the density of the fluid
• 2nd inward moving wave collides at center then reflects back out
• 6 levels of refinement
Application: Adaptive Mesh Refinement (AMR) for astrophysics simulations
From in vivo to in silico neuroscience
(source: Christof Koch and Idan Segev, Methods in Neuronal Modeling: From Ions to Networks)
From characterized neuron to compartmental model
From compartmental model to neural circuits
A Hodgkin-Huxley simulation of 3.1M neurons (represented as points for simplicity)
mouse brain: 80M neurons; human brain: 100B neurons
Blue Brain Project @ EPFL
Testing results
Hardware specification:
• Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
• 1 thread per core, 8 cores per socket, 2 sockets, 2 NUMA nodes;
• 128 GB RAM; Cache 20480 KB;
Single-node execution time for one simulation time step*
Input data: 537 biologically inspired cells from layer 5
(* HPX 1.2, preliminary results)
Bruno Magalhaes @ EPFL
Kelvin-Helmholtz Instability
Courtesy of Matt Anderson, IU
Time Required to Check if Memory Address is Local or Remote in HPX-5
Chart courtesy of Daniel Kogler, IU
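For orientation, a local-or-remote check in a statically partitioned global address space can be as cheap as a shift and a compare. The sketch below assumes a hypothetical 64-bit address layout with the owning rank in the upper 16 bits; the bit widths and function names are illustrative and not taken from HPX-5.

```python
# Hypothetical layout: [ 16-bit owner rank | 48-bit offset within that rank ]
RANK_BITS = 16
OFFSET_BITS = 48
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def make_global_address(rank, offset):
    """Pack an owner rank and a local offset into one global address."""
    if not 0 <= rank < (1 << RANK_BITS):
        raise ValueError("rank out of range")
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset out of range")
    return (rank << OFFSET_BITS) | offset


def owner_rank(global_address):
    return global_address >> OFFSET_BITS


def is_local(global_address, my_rank):
    """Address predicate: True if the address is owned by this rank."""
    return owner_rank(global_address) == my_rank


gva = make_global_address(rank=3, offset=0x1000)
print(is_local(gva, my_rank=3), is_local(gva, my_rank=2))
```

A fixed encoding like this only works when data never moves; an address space in which blocks can migrate would presumably need a lookup of the current owner instead, which is one reason the cost of this predicate is worth measuring.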
Time Required to Perform a Context Switch Between Lightweight Threads in HPX-5
Chart courtesy of Daniel Kogler, IU
Time Required to Create a New Lightweight Thread in HPX-5
Chart courtesy of Daniel Kogler, IU
Instructions Issued During Creation of a New Lightweight Thread in HPX-5
Chart courtesy of Daniel Kogler, IU
Time Required to Acquire, Prepare, and Put a Parcel into the Send Queue (8 Byte payload, no contention, jemalloc allocator)
Chart courtesy of Daniel Kogler, IU
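The three steps named in the chart title (acquire, prepare, enqueue) can be pictured with a small sketch. Everything here is hypothetical: the `Parcel` fields, the free-list `ParcelPool`, and the `send` helper are illustrative stand-ins, not HPX-5 data structures or calls.

```python
import struct
from collections import deque
from dataclasses import dataclass


@dataclass
class Parcel:
    target: int = 0       # global address the parcel is sent to
    action: str = ""      # name of the action to run at the target
    payload: bytes = b""  # argument data carried with the parcel


class ParcelPool:
    """Free list so that acquiring a parcel usually avoids a fresh allocation."""

    def __init__(self):
        self._free = []

    def acquire(self):
        return self._free.pop() if self._free else Parcel()

    def release(self, parcel):
        parcel.target, parcel.action, parcel.payload = 0, "", b""
        self._free.append(parcel)


def send(pool, send_queue, target, action, payload):
    parcel = pool.acquire()      # step 1: acquire
    parcel.target = target       # step 2: prepare
    parcel.action = action
    parcel.payload = payload
    send_queue.append(parcel)    # step 3: put into the send queue
    return parcel


pool = ParcelPool()
send_queue = deque()
# 8-byte payload, matching the payload size used in the measurement.
p = send(pool, send_queue, target=0x1000, action="increment",
         payload=struct.pack("<q", 42))
```

Separating acquisition from preparation mirrors the way the deck reports the acquire cost on its own in the following chart.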
Time Required to Acquire a Parcel (8 Byte payload, no contention, jemalloc allocator)
Chart courtesy of Daniel Kogler, IU
Phase Plot of LULESH Execution on Cori Demonstrating Dynamic Work Assignment
Chart courtesy of Daniel Kogler, IU
Time Spent on Useful Computation vs. Latency and Runtime Overheads During LULESH Execution
Chart courtesy of Daniel Kogler, IU
Architecture Features
• Compute Complexes/Threads
– Thread creation/termination
– Thread context switching
• Parcels
– Parcel synthesis and launch
– Parcel assimilation and thread instantiation
– Lightweight messaging (e.g., Data Vortex)
• AGAS
– Address predicate: local or remote
– GAS translation
• Synchronization
– Futures
– Dataflow
– Single assignment thread registers
Embedded Memory Processor
[Figure: block diagram of the embedded memory processor. Labels: core, cache (repeated), Execution Resources, Parcel Handler, OS, DRAM, Input Queue, Thread State Processor, Thread Create, Thread Destroy, Thread Scheduler, DRAM Interface, Parcel Events, Output Queue, HT/PCIe Interface, Thread Pool Metadata, Thread State Scratchpad, Request Decode]
Architecture Frameworks
• FPGA
– Synchronization
– Thread contexts, switching
– User thread external preemption
– Parcels management
• Data Vortex
– Lightweight messaging
– Contention elimination through decentralized routing
• Micron PIM
– Merges logic with DRAM cells
• Neo-digital – post von Neumann era
– Change ALU utilization to high availability for permeated
– Lightweight Cores
– Continuum Computer Architecture