A Quantitative Analysis of Runtime System Software for Asynchronous Multi-Tasking Computation

Thomas Sterling, M. Brodowicz, M. Anderson, L. D'Alessandro, D. Kogler

RoMoL 2016 Workshop

Center for Research in Extreme Scale Technologies

School of Informatics and Computing

Indiana University

March 17, 2016

Introduction

• Performance modeling and parameters
• Execution model: ParalleX
• HPX-5 runtime system software
• Applications
• Micro operation measurements
• Architecture implications
• When we were here before

Performance Factors - SLOWER
• Starvation
– Insufficiency of concurrency of work
– Impacts scalability and latency hiding
– Affects programmability
• Latency
– Time-measured distance for remote access and services
– Impacts efficiency
• Overhead
– Critical-time additional work to manage tasks & resources
– Impacts efficiency and granularity for scalability
• Waiting for contention resolution
– Delays due to simultaneous access requests to shared physical or logical resources

P = e(L,O,W) * S(s) * a(r) * U(E)

• P – performance (ops)
• e – efficiency (0 < e < 1)
• s – application’s average parallelism
• a – availability (0 < a < 1)
• U – normalization factor/compute unit
• E – watts per average compute unit
• r – reliability (0 < r < 1)
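The product form of the model can be tried out numerically with a minimal sketch. The slides do not give the functional forms of e(L,O,W), S(s), a(r), or U(E), so the sketch assumes each factor has already been evaluated and is passed in as a plain number. The function name `performance` and the sample values are hypothetical and chosen only for illustration.

```python
def performance(e: float, S: float, a: float, U: float) -> float:
    """Evaluate P = e * S * a * U from already-evaluated factor values.

    e -- efficiency, the value of e(L, O, W); must satisfy 0 < e < 1
    S -- value of the S(s) term (assumed positive; its form is not given)
    a -- availability, the value of a(r); must satisfy 0 < a < 1
    U -- normalization factor per compute unit, the value of U(E)
    """
    if not 0.0 < e < 1.0:
        raise ValueError("efficiency e must satisfy 0 < e < 1")
    if not 0.0 < a < 1.0:
        raise ValueError("availability a must satisfy 0 < a < 1")
    if S <= 0.0 or U <= 0.0:
        raise ValueError("S and U are assumed to be positive")
    return e * S * a * U


if __name__ == "__main__":
    # Hypothetical factor values, not taken from the slides.
    print(performance(e=0.5, S=64.0, a=0.9, U=2.0))
```

Because the model is a pure product, halving the efficiency term halves P regardless of the other factors, which is why the overhead, latency, and waiting terms inside e matter at every scale.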

Performance Model, Full Example System

• Example system:
– 2 nodes
– 2 cores per node
– 2 memory banks per node
• Accounts for:
– Functional unit workload
– Memory workload/latency
– Network overhead/latency
– Context switch overhead
– Lightweight task management (red regions can have one active task at a time)
– Memory contention (green regions allow only a single memory access at a time)
– Network contention (blue region represents bandwidth cap)
– NUMA affinity of cores
• Assumes:
– Balanced workload
– Homogeneous system
– Flat network

Modeling the full example system

[Figure: Performance Gain of Non-Blocking Programs over Blocking Programs with Varying Core Counts (Memory Contention) and Overheads. Panel title: Gain with Respect to Cores per Node and Overhead; Latency of 8192 reg-ops, 64 Tasks per Core. Axis ticks: 1, 2, 4, 8, 16, 32; performance gain from 0 to 70.]
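To show why a non-blocking program can outperform a blocking one when latency is long, here is a toy single-core model. It is not the authors' model and does not reproduce the chart: it ignores memory and network contention, and it assumes each task performs `work` reg-ops followed by one remote access. Only the latency of 8192 reg-ops and the 64 tasks per core come from the slide. The `work` and `overhead` values are hypothetical.

```python
def blocking_time(tasks: int, work: int, latency: int) -> int:
    """Core idles for the full latency after every task's remote access."""
    return tasks * (work + latency)


def nonblocking_time(tasks: int, work: int, latency: int, overhead: int) -> int:
    """Core pays `overhead` to switch tasks instead of waiting.

    All remote accesses overlap with later work, so only the latency of
    the final task's access is exposed.
    """
    return tasks * (work + overhead) + latency


def gain(tasks: int, work: int, latency: int, overhead: int) -> float:
    """Speedup of the non-blocking schedule over the blocking one."""
    return blocking_time(tasks, work, latency) / nonblocking_time(
        tasks, work, latency, overhead
    )


if __name__ == "__main__":
    # 64 tasks per core and 8192 reg-op latency are from the slide;
    # work=128 and the overhead sweep are assumptions for illustration.
    for overhead in (0, 32, 128, 512):
        print(overhead, round(gain(64, 128, 8192, overhead), 2))
```

In this toy model the gain shrinks as the per-switch overhead grows, which mirrors the qualitative role of overhead in the SLOWER factors.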

ParalleX Execution Model

[Figure: ParalleX Computation Complex. Legend: Runtime Aware, Logically Active, Physically Active.]

HPX-5 Runtime Software Architecture

[Figure: HPX-5 runtime software architecture. Labeled components: Application Layer (Lulesh, LibPXGL, N-Body, FMM); Global Address Space (PGAS, AGAS); Parcels; Processes; LCOs; Scheduler (worker threads); Network Layer (ISIR, PWC); Operating System; Hardware (cores, network).]

Courtesy of Jayashree Candadai, IU

HPX: Distinguishing Features (1)
• Derived within the conceptual context of an execution model
• Derived within the context of the SLOWER performance model
• Global Name Space and active global address space
• ParalleX Processes
– Span and share multiple hardware nodes
– Hierarchical (nested)
– First-class objects
– Support user and node OS requests for global services
– Supports data decomposition
• Message-driven computation with continuations
– Does not always return results to parent thread but migrates continuations
– Percolation for moving work to resources such as GPUs
• Compute complexes extend beyond typical threads
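Message-driven computation with migrating continuations can be illustrated with a small single-process simulation. This is a sketch under stated assumptions, not the HPX-5 API: the `Parcel` class, its fields, and the `run` loop are hypothetical names. A parcel carries an action and a payload to a target locality, and its result is forwarded to a continuation parcel instead of being returned to the sender.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Parcel:
    target: int                              # hypothetical locality id
    action: Callable[[int], int]             # work to run at the target
    payload: int                             # argument carried by the parcel
    continuation: Optional["Parcel"] = None  # where the result goes next


def run(parcels: List[Parcel]) -> List[Tuple[int, int]]:
    """Process parcels in FIFO order; return (target, result) pairs."""
    queue = deque(parcels)
    log = []
    while queue:
        parcel = queue.popleft()
        result = parcel.action(parcel.payload)
        log.append((parcel.target, result))
        if parcel.continuation is not None:
            # The result travels onward; nothing returns to the sender.
            nxt = parcel.continuation
            nxt.payload = result
            queue.append(nxt)
    return log


if __name__ == "__main__":
    add_one = Parcel(target=2, action=lambda x: x + 1, payload=0)
    square = Parcel(target=1, action=lambda x: x * x, payload=7,
                    continuation=add_one)
    print(run([square]))
```

The sender of `square` never sees the value 49; control moves with the data to locality 2, which is the behavior the bullet on migrating continuations describes.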

HPX: Distinguishing Features (2)
• Embedded data-structure (graphs) control objects
– Resolves arbitration of simultaneous use requests
• Distributed parallel control state through dynamic graphs of continuations (futures and dataflow)
– e.g., graph vertex/links insertion or deletion
• Copy semantics through Distributed Control Operations (DCO)
– Distributed arbitration of access conflicts to structure elements
– Graph structure changes
• Suspended (Depleted) threads (compute complexes) serve as control objects to build continuation graphs
– Planning
– Search spaces
• Responds to OS service requests for multi-node capabilities
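The dataflow style of control mentioned above can be sketched as a control object that fires a continuation once all of its inputs have arrived. This is a single-threaded illustration only. The class name `AndGate` and its methods are hypothetical and are not taken from HPX-5, and a real runtime would need atomic updates and distributed placement.

```python
from typing import Any, Callable, List


class AndGate:
    """Fire `on_fire` with all collected values once `n_inputs` have been set."""

    def __init__(self, n_inputs: int, on_fire: Callable[[List[Any]], None]):
        if n_inputs < 1:
            raise ValueError("n_inputs must be at least 1")
        self._remaining = n_inputs
        self._values: List[Any] = []
        self._on_fire = on_fire
        self.fired = False

    def set(self, value: Any) -> None:
        """Deliver one input; the last delivery triggers the continuation."""
        if self.fired:
            raise RuntimeError("gate has already fired")
        self._values.append(value)
        self._remaining -= 1
        if self._remaining == 0:
            self.fired = True
            self._on_fire(self._values)


if __name__ == "__main__":
    gate = AndGate(2, lambda vals: print("sum =", sum(vals)))
    gate.set(3)   # nothing happens yet
    gate.set(4)   # second input arrives, continuation runs
```

No thread blocks while waiting on the gate: the continuation is attached to the data dependency itself, so work proceeds as soon as the last input is set.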

LULESH
• Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics evolving a Sedov blast wave
• Developed at the ASCR Co-design center at Livermore National Lab
• Contains three types of communication patterns: face adjacent, 26 neighbor, and 13 neighbor communications each timestep
• LULESH is a static, regular computation – very well suited for MPI

Red indicates computation
White indicates communication

LULESH on a small cluster

LULESH on a large cluster

SpMV in HPCG

SpMV for parcels and memget

Wavelet Adaptive Multiresolution

Courtesy of Matt Anderson, IU

AMR in HPX-5

• Model of a short gamma ray burst

• Shows a fluid that is initially confined to a small, high-density, high-pressure region

• Explodes, creating a spherical shock wave

• Colors indicate the density of the fluid

• A second, inward-moving wave collides at the center, then reflects back out

• 6 levels of refinement

Application: Adaptive Mesh Refinement (AMR) for Astrophysics simulations

From in vivo to in silico neuroscience

(Source: Christof Koch and Idan Segev, Methods in Neuronal Modeling: From Ions to Networks)

From characterized neuron to compartmental model

From compartmental model to neural circuits

A Hodgkin-Huxley simulation of 3.1M neurons (represented as points for simplicity)

mouse brain: 80M neurons; human brain: 100B neurons

Blue Brain Project @ EPFL

Testing results

Hardware specification:

• Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

• 1 thread per core, 8 cores per socket, 2 sockets, 2 NUMA nodes;

• 128 GB RAM; Cache 20480 KB;

Single-node execution time for one simulation time step*

Input data: 537 biologically inspired cells from layer 5

(* HPX 1.2, preliminary results)

Bruno Magalhaes @ EPFL

Kelvin-Helmholtz Instability

Courtesy of Matt Anderson, IU

Time Required to Check if Memory Address is Local or Remote in HPX5

Chart courtesy of Daniel Kogler, IU

Time Required to Perform a Context Switch Between Lightweight Threads in HPX5


Time Required to Create a New Lightweight Thread in HPX5


Instructions Issued During Creation of a New Lightweight Thread in HPX5


Time Required to Acquire, Prepare, and Put a Parcel into the Send Queue (8 Byte payload, no contention, jemalloc allocator)


Time Required to Acquire a Parcel (8 Byte payload, no contention, jemalloc allocator)


Phase Plot of LULESH Execution on Cori Demonstrating Dynamic Work Assignment


Time Spent on Useful Computation vs. Latency and Runtime Overheads During LULESH Execution


Architecture Features

• Compute Complexes/Threads
– Thread creation/termination
– Thread context switching
• Parcels
– Parcel synthesis and launch
– Parcel assimilation and thread instantiation
– Lightweight messaging (e.g., Data Vortex)
• AGAS
– Address predicate: local or remote
– GAS translation
• Synchronization
– Futures
– Dataflow
– Single assignment thread registers
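The two address operations listed under AGAS (the local-or-remote predicate and global address translation) can be made concrete with a minimal sketch. It assumes a static block-cyclic layout in the PGAS style, which is simpler than AGAS because blocks never move, and it is not HPX-5's actual address encoding. The function names and the `block_size` and `n_ranks` parameters are hypothetical.

```python
from typing import Tuple


def home_rank(global_addr: int, block_size: int, n_ranks: int) -> int:
    """Rank that owns the block containing `global_addr` (cyclic layout)."""
    return (global_addr // block_size) % n_ranks


def is_local(global_addr: int, my_rank: int,
             block_size: int, n_ranks: int) -> bool:
    """Address predicate: does this rank own the address?"""
    return home_rank(global_addr, block_size, n_ranks) == my_rank


def translate(global_addr: int, block_size: int,
              n_ranks: int) -> Tuple[int, int]:
    """Map a global address to (owning rank, offset in that rank's memory)."""
    block = global_addr // block_size
    local_block = block // n_ranks        # index among the owner's blocks
    offset = local_block * block_size + global_addr % block_size
    return block % n_ranks, offset


if __name__ == "__main__":
    # Hypothetical configuration: 4 ranks, 4096-byte blocks.
    addr = 4 * 4096 + 7
    print(is_local(addr, my_rank=0, block_size=4096, n_ranks=4))
    print(translate(addr, block_size=4096, n_ranks=4))
```

With a fixed layout the predicate is a few integer operations. An active global address space that lets blocks migrate would need a lookup instead, which is one reason the cost of this check is worth measuring and a candidate for hardware support.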

Embedded Memory Processor

[Figure: Embedded Memory Processor block diagram. Labeled components: cores with caches, Execution Resources, Parcel Handler, OS, DRAM, Input Queue, Thread State Processor, Thread Create, Thread Destroy, Thread Scheduler, DRAM Interface, Parcel Events, Output Queue, HT/PCIe Interface, Thread Pool Metadata, Thread State Scratchpad, Request Decode.]

Architecture Frameworks
• FPGA
– Synchronization
– Thread contexts, switching
– User thread external preemption
– Parcels management
• Data Vortex
– Lightweight messaging
– Contention elimination through decentralized routing
• Micron PIM
– Merges logic with DRAM cells
• Neo-digital – post von Neumann era
– Change ALU utilization to high availability for permeated
– Lightweight Cores
– Continuum Computer Architecture
