A Quantitative Analysis of Runtime System Software for Asynchronous Multi-Tasking Computation

Thomas Sterling
M. Brodowicz, M. Anderson, L. D'Alessandro, D. Kogler

RoMoL 2016 Workshop
Center for Research in Extreme Scale Technologies
School of Informatics and Computing
Indiana University
March 17, 2016

Source: romol2016.bsc.es/_media/1.2.romol_sterling_5.pdf

Introduction

• Performance modeling and parameters
• Execution model: ParalleX
• HPX-5 runtime system software
• Applications
• Micro operation measurements
• Architecture implications
• When we were here before

Performance Factors - SLOWER

• Starvation
– Insufficiency of concurrency of work
– Impacts scalability and latency hiding
– Affects programmability

• Latency
– Time-measured distance for remote access and services
– Impacts efficiency

• Overhead
– Critical-time additional work to manage tasks & resources
– Impacts efficiency and granularity for scalability

• Waiting for contention resolution
– Delays due to simultaneous access requests to shared physical or logical resources

P = e(L,O,W) * S(s) * a(r) * U(E)

P – performance (ops)
e – efficiency (0 < e < 1)
s – application’s average parallelism
a – availability (0 < a < 1)
U – normalization factor/compute unit
E – watts per average compute unit
r – reliability (0 < r < 1)
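The product form of the model can be sketched in a few lines of Python. The slide does not give the functional forms of e(L,O,W), S(s), a(r), or U(E), so this sketch takes already-evaluated factor values as inputs. The function name, the positivity check on S and U, and all numbers below are assumptions made for illustration only.

```python
def slower_performance(e: float, S: float, a: float, U: float) -> float:
    """Return P = e * S * a * U for already-evaluated factor values.

    The ranges for e and a come from the slide; requiring S and U to be
    positive is an assumption of this sketch.
    """
    if not 0.0 < e < 1.0:
        raise ValueError("efficiency e must satisfy 0 < e < 1")
    if not 0.0 < a < 1.0:
        raise ValueError("availability a must satisfy 0 < a < 1")
    if S <= 0.0 or U <= 0.0:
        raise ValueError("S and U must be positive")
    return e * S * a * U


# Hypothetical numbers: raising efficiency from 0.5 to 0.8 with the other
# factors held fixed scales P by the same ratio, because the model is a product.
baseline = slower_performance(e=0.5, S=64.0, a=0.99, U=1.0e9)
improved = slower_performance(e=0.8, S=64.0, a=0.99, U=1.0e9)
gain = improved / baseline
```

Because the model is multiplicative, a relative improvement in any one factor carries through to P unchanged, which is why reducing latency, overhead, and waiting (the arguments of e) matters as much as adding parallelism.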

Performance Model, Full Example System

• Example system:
– 2 nodes
– 2 cores per node
– 2 memory banks per node

• Accounts for:
– Functional unit workload
– Memory workload/latency
– Network overhead/latency
– Context switch overhead
– Lightweight task management (red regions can have one active task at a time)
– Memory contention (green regions allow only a single memory access at a time)
– Network contention (blue region represents bandwidth cap)
– NUMA affinity of cores

• Assumes:
– Balanced workload
– Homogeneous system
– Flat network

Modeling the full example system

Gain with Respect to Cores per Node and Overhead; Latency of 8192 reg-ops, 64 Tasks per Core

[Figure: Performance Gain of Non-Blocking Programs over Blocking Programs with Varying Core Counts (Memory Contention) and Overheads. Vertical axis: Performance Gain, 0 to 70. Tick values recovered from the chart: 1, 2, 4, 8, 16, 32.]

ParalleX Execution Model

ParalleX Computation Complex

[Figure legend: Runtime Aware; Logically Active; Physically Active]

HPX-5 Runtime Software Architecture

[Figure: layered architecture diagram. Components as labeled in the diagram:]

• Application layer: LULESH, LibPXGL, N-Body, FMM
• Global address space: PGAS, AGAS
• Parcels, Processes, LCOs
• Scheduler: worker threads
• Network layer: ISIR, PWC
• Operating system
• Hardware: cores, network

Courtesy of Jayashree Candadai, IU

HPX: Distinguishing Features (1)

• Derived within the conceptual context of an execution model
• Derived within the context of the SLOWER performance model
• Global Name Space and active global address space
• ParalleX Processes
– Span and share multiple hardware nodes
– Hierarchical (nested)
– First-class objects
– Support user and node OS requests for global services
– Support data decomposition
• Message-driven computation with continuations
– Does not always return results to parent thread but migrates continuations
– Percolation for moving work to resources such as GPUs
• Compute complexes extend beyond typical threads
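Message-driven computation with continuations can be illustrated with a toy, single-process sketch. It is not the HPX-5 API: the `Parcel` class, its fields, and the `run` loop are hypothetical names chosen for this example, and a dictionary stands in for the global address space. The point shown is that the result of an action is forwarded to a continuation rather than returned to the sender.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Parcel:
    target: str                               # hypothetical global name of the target data
    action: Callable[[Any, Any], Any]         # work to run where the target lives
    payload: Any
    continuation: Optional["Parcel"] = None   # where the result goes next


def run(parcels, memory):
    """Drain the parcel queue; each result is forwarded to the continuation."""
    queue = deque(parcels)
    while queue:
        p = queue.popleft()
        result = p.action(memory[p.target], p.payload)
        if p.continuation is not None:
            nxt = p.continuation
            # The result becomes the payload of the continuation parcel;
            # nothing flows back to whoever sent the original parcel.
            queue.append(Parcel(nxt.target, nxt.action, result, nxt.continuation))


memory = {"x": 10, "log": []}
store = Parcel("log", lambda log, value: log.append(value), None)
add = Parcel("x", lambda x, delta: x + delta, 5, continuation=store)
run([add], memory)
```

After `run` finishes, `memory["log"]` holds the value computed at `x`. In a distributed runtime the two targets could live on different nodes, and the work would move to the data at each step.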

HPX: Distinguishing Features (2)

• Embedded data-structure (graphs) control objects
– Resolves arbitration of simultaneous use requests
• Distributed parallel control state through dynamic graphs of continuations (futures and dataflow)
– e.g., graph vertex/links insertion or deletion
• Copy semantics through Distributed Control Operations (DCO)
– Distributed arbitration of access conflicts to structure elements
– Graph structure changes
• Suspended (Depleted) threads (compute complexes) serve as control objects to build continuation graphs
– Planning
– Search spaces
• Responds to OS service requests for multi-node capabilities
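The futures and dataflow control objects mentioned above can be sketched as follows. This is a minimal single-threaded illustration, not the HPX-5 LCO interface: the `Future` class and the `dataflow` helper are hypothetical names, and real control objects would also handle concurrency and remote access.

```python
class Future:
    """Single-assignment value that runs waiting continuations once set."""

    def __init__(self):
        self._set = False
        self._value = None
        self._waiters = []

    def then(self, continuation):
        # Run immediately if the value is already there, otherwise wait.
        if self._set:
            continuation(self._value)
        else:
            self._waiters.append(continuation)

    def set(self, value):
        if self._set:
            raise RuntimeError("future already set")
        self._set, self._value = True, value
        waiters, self._waiters = self._waiters, []
        for k in waiters:
            k(value)


def dataflow(fn, *inputs):
    """Return a future that is set to fn(*values) once every input is set."""
    out = Future()
    if not inputs:
        out.set(fn())
        return out
    values = [None] * len(inputs)
    remaining = [len(inputs)]

    def arrive(i, v):
        values[i] = v
        remaining[0] -= 1
        if remaining[0] == 0:
            out.set(fn(*values))

    for i, f in enumerate(inputs):
        f.then(lambda v, i=i: arrive(i, v))
    return out


a, b = Future(), Future()
total = dataflow(lambda x, y: x + y, a, b)
seen = []
total.then(seen.append)
b.set(2)    # one input is still missing, so nothing fires yet
a.set(40)   # the last input arrives and the dependent continuation runs
```

Chaining `dataflow` nodes in this way builds the kind of continuation graph the slide describes: control state lives in the pending futures rather than in a blocked parent thread.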

LULESH

• Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics evolving a Sedov blast wave
• Developed at the ASCR Co-design center at Livermore National Lab
• Contains three types of communication patterns: face adjacent, 26 neighbor, and 13 neighbor communications each timestep
• LULESH is a static, regular computation – very well suited for MPI

Red indicates computation
White indicates communication
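The neighbor counts in the communication patterns follow from simple geometry if one assumes a 3D Cartesian block decomposition, which the slide does not state explicitly. The sketch below enumerates the offsets around one subdomain. Treating the 13-neighbor pattern as one direction from each opposite pair is an assumption made for illustration, not a description of the LULESH source.

```python
from itertools import product

# All offsets to adjacent subdomains in a 3x3x3 neighborhood, excluding self.
offsets = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]

# Classify by how many coordinates differ from the center.
face = [o for o in offsets if sum(map(abs, o)) == 1]     # share a face
edge = [o for o in offsets if sum(map(abs, o)) == 2]     # share an edge
corner = [o for o in offsets if sum(map(abs, o)) == 3]   # share a corner

# Assumed reading of the 13-neighbor pattern: keep one offset from each
# opposite pair (lexicographic comparison picks the one whose first
# nonzero component is positive).
half = [o for o in offsets if o > (0, 0, 0)]

counts = {
    "all": len(offsets),
    "face": len(face),
    "edge": len(edge),
    "corner": len(corner),
    "half": len(half),
}
```

The 26 neighbors split into 6 face, 12 edge, and 8 corner neighbors, and the half set contains 13 offsets, none of which is the negation of another.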

LULESH on a small cluster

LULESH on a large cluster

SpMV in HPCG

SpMV for parcels and memget

Wavelet Adaptive Multiresolution

Courtesy of Matt Anderson, IU

AMR in HPX-5

• Model of a short gamma ray burst

• Shows a fluid that is initially confined to a small, high-density, high-pressure region

• Explodes, creating a spherical shock wave

• Colors indicate the density of the fluid

• A second, inward-moving wave collides at the center, then reflects back out

• 6 levels of refinement

Application: Adaptive Mesh Refinement (AMR) for Astrophysics Simulations

From in vivo to in silico neuroscience

(source: Christof Koch and Idan Segev, Methods in Neuronal Modeling: From Ions to Networks)

From characterized neuron to compartmental model

From compartmental model to neural circuits

A Hodgkin-Huxley simulation of 3.1M neurons (represented as points for simplicity).

mouse brain: 80M neurons; human brain: 100B neurons

Blue Brain Project @ EPFL
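For readers unfamiliar with the Hodgkin-Huxley model named on this slide, the sketch below integrates a single compartment with forward Euler. It is far simpler than the multi-compartment, biologically detailed cells of the simulation described here. The parameters are the standard textbook squid-axon values, not values taken from the slides, and the function names are chosen for this example.

```python
import math

# Standard textbook squid-axon parameters (an assumption, not from the slides).
C_M = 1.0                               # membrane capacitance, uF/cm^2
G_NA, G_K, G_L = 120.0, 36.0, 0.3       # maximal conductances, mS/cm^2
E_NA, E_K, E_L = 50.0, -77.0, -54.387   # reversal potentials, mV


def _ratio(x, y):
    """x / (1 - exp(-x / y)), using its limit y where x is close to 0."""
    return y if abs(x) < 1e-7 else x / (1.0 - math.exp(-x / y))


def rates(v):
    """Return (alpha, beta) pairs for gates m, h, n at membrane voltage v (mV)."""
    a_m = 0.1 * _ratio(v + 40.0, 10.0)
    b_m = 4.0 * math.exp(-(v + 65.0) / 18.0)
    a_h = 0.07 * math.exp(-(v + 65.0) / 20.0)
    b_h = 1.0 / (1.0 + math.exp(-(v + 35.0) / 10.0))
    a_n = 0.01 * _ratio(v + 55.0, 10.0)
    b_n = 0.125 * math.exp(-(v + 65.0) / 80.0)
    return (a_m, b_m), (a_h, b_h), (a_n, b_n)


def simulate(i_ext, t_stop=50.0, dt=0.01, v0=-65.0):
    """Forward-Euler voltage trace (mV) for a constant current i_ext (uA/cm^2)."""
    v = v0
    # Start each gate at its steady-state value for the initial voltage.
    m, h, n = (a / (a + b) for a, b in rates(v))
    trace = []
    for _ in range(int(round(t_stop / dt))):
        (a_m, b_m), (a_h, b_h), (a_n, b_n) = rates(v)
        i_ion = (G_NA * m ** 3 * h * (v - E_NA)
                 + G_K * n ** 4 * (v - E_K)
                 + G_L * (v - E_L))
        dv = (i_ext - i_ion) / C_M
        m += dt * (a_m * (1.0 - m) - b_m * m)
        h += dt * (a_h * (1.0 - h) - b_h * h)
        n += dt * (a_n * (1.0 - n) - b_n * n)
        v += dt * dv
        trace.append(v)
    return trace


resting = simulate(i_ext=0.0)    # no input current: stays near rest
driven = simulate(i_ext=10.0)    # constant input current: produces spikes
```

Each neuron in a network simulation repeats an update of this kind for many compartments at every time step, which is what makes the per-step execution time a natural measurement for a runtime system.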

Testing results

Hardware specification:

• Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

• 1 thread per core, 8 cores per socket, 2 sockets, 2 NUMA nodes;

• 128 GB RAM; Cache 20480 KB;

Single-node execution time for one simulation time step*

Input data: 537 biologically inspired cells from layer 5

(* HPX 1.2, preliminary results)

Bruno Magalhaes @ EPFL

Kelvin-Helmholtz Instability

Courtesy of Matt Anderson, IU

Time Required to Check if Memory Address is Local or Remote in HPX5

Chart courtesy of Daniel Kogler, IU
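To make the measured operation concrete, here is one way a local-or-remote check could work in a partitioned global address space. The bit layout, constants, and function names are hypothetical and are not HPX-5's actual address format. The sketch assumes a fixed mapping in which the high bits of a global address name the owning locality.

```python
RANK_BITS = 16      # hypothetical: high bits hold the owning locality's rank
OFFSET_BITS = 48    # hypothetical: low bits hold the offset within that locality
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def make_gva(rank: int, offset: int) -> int:
    """Pack a locality rank and a local offset into one global address."""
    if not 0 <= rank < (1 << RANK_BITS):
        raise ValueError("rank out of range")
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset out of range")
    return (rank << OFFSET_BITS) | offset


def owner(gva: int) -> int:
    """Extract the owning locality's rank from a global address."""
    return gva >> OFFSET_BITS


def is_local(gva: int, here: int) -> bool:
    """The predicate being timed: does this address live on this locality?"""
    return owner(gva) == here


gva = make_gva(rank=3, offset=0x1000)
local_on_3 = is_local(gva, here=3)
local_on_2 = is_local(gva, here=2)
```

With a fixed mapping like this, the check is a shift and a compare. An active address space that lets data move between localities would presumably need a table lookup instead, which is one reason the cost of this predicate is worth measuring.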

Time Required to Perform a Context Switch Between Lightweight Threads in HPX5

Chart courtesy of Daniel Kogler, IU
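The idea of switching between lightweight threads can be illustrated with Python generators and a round-robin ready queue. This shows cooperative scheduling in concept only. A native runtime switches by saving and restoring execution state, which generators do not model, and the function names here are chosen for this example.

```python
from collections import deque


def worker(name, steps, log):
    """A lightweight thread: do one unit of work, then yield the core."""
    for i in range(steps):
        log.append((name, i))
        yield  # cooperative switch point


def run_round_robin(threads):
    """Resume each ready thread in turn; return how many times a thread yielded."""
    ready = deque(threads)
    switches = 0
    while ready:
        t = ready.popleft()
        try:
            next(t)
        except StopIteration:
            continue          # thread finished; drop it from the queue
        ready.append(t)       # thread yielded; put it back in line
        switches += 1
    return switches


log = []
switches = run_round_robin([worker("a", 2, log), worker("b", 2, log)])
```

The log interleaves the two workers step by step, and every yield is a point at which the scheduler pays the context-switch cost that the chart measures.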

Time Required to Create a New Lightweight Thread in HPX5

Chart courtesy of Daniel Kogler, IU

Instructions Issued During Creation of a New Lightweight Thread in HPX5

Chart courtesy of Daniel Kogler, IU

Time Required to Acquire, Prepare, and Put a Parcel into the Send Queue (8 Byte payload, no contention, jemalloc allocator)

Chart courtesy of Daniel Kogler, IU

Time Required to Acquire a Parcel (8 Byte payload, no contention, jemalloc allocator)

Chart courtesy of Daniel Kogler, IU

Phase Plot of LULESH Execution on Cori Demonstrating Dynamic Work Assignment

Chart courtesy of Daniel Kogler, IU

Time Spent on Useful Computation vs. Latency and Runtime Overheads During LULESH Execution

Chart courtesy of Daniel Kogler, IU

Architecture Features

• Compute Complexes/Threads
– Thread creation/termination
– Thread context switching
• Parcels
– Parcel synthesis and launch
– Parcel assimilation and thread instantiation
– Lightweight messaging (e.g., Data Vortex)
• AGAS
– Address predicate: local or remote
– GAS translation
• Synchronization
– Futures
– Dataflow
– Single assignment thread registers

Embedded Memory Processor

[Figure: block diagram. Labeled blocks: cores with caches, Execution Resources, Parcel Handler, OS, DRAM, Input Queue, Thread State Processor, Thread Create, Thread Destroy, Thread Scheduler, DRAM Interface, Parcel Events, Output Queue, HT/PCIe Interface, Thread Pool Metadata, Thread State Scratchpad, Request Decode]

Architecture Frameworks

• FPGA
– Synchronization
– Thread contexts, switching
– User thread external preemption
– Parcels management
• Data Vortex
– Lightweight messaging
– Contention elimination through decentralized routing
• Micron PIM
– Merges logic with DRAM cells
• Neo-digital – post von Neumann era
– Change ALU utilization to high availability for permeated
– Lightweight Cores
– Continuum Computer Architecture
