A Multidimensional Distributed Array Abstraction for PGAS (HPCC'16)

www.dash-project.org

Tobias Fuchs, fuchst@nm.ifi.lmu.de
Ludwig-Maximilians-Universität München, MNM-Team

DASH - Overview

DASH is a C++ template library that offers

– distributed data structures and parallel algorithms

– a complete PGAS (partitioned global address space) programming system without a custom (pre-)compiler

PGAS Terminology – SHMEM Analogy

Unit: The individual participants in a DASH program, usually full OS processes.

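[Figure: memory of Unit 0, Unit 1, ..., Unit N-1, split into a shared and a private part. Shared: dash::Array a(1000), distributed across the units (elements 0..9, 10..19, ..., ..999), and dash::Shared s. Private: per-unit variables such as int a; int b; int c;]

Shared data: managed by DASH in a virtual global address space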

Private data: managed by regular C/C++ mechanisms

DASH Project Structure

Partner          Phase I (2013-2015)                  Phase II (2016-2018)
LMU Munich       Project lead, C++ template library   Project lead, C++ template library, data dock
TU Dresden       Libraries and interfaces, tools      Smart data structures, resilience
HLRS Stuttgart   DART runtime                         DART runtime
KIT Karlsruhe    Application studies                  -
IHR Stuttgart    -                                    Smart deployment, application studies

DASH - Partitioned Global Address Space

Data affinity

– data has a well-defined owner but can be accessed by any unit

– data locality important for performance

– support for the owner computes execution model

DASH:

– unified access to local and remote data in global memory space

– and explicit views on local memory space

DASH Distributed Data Structures Overview

Container         Description                            Data distribution
Array<T>          1D array                               static, configurable
NArray<T, N>      N-dim. array                           static, configurable
Shared<T>         Shared scalar                          fixed (at 0)
Directory<T> (*)  Variable-size, locally indexed array   manual, load-balanced
List<T>           Variable-size linked list              dynamic, load-balanced
Map<T>            Variable-size associative map          dynamic, balanced by hash function

(*) Under construction

Multidimensional Data Distribution (1)

dash::Pattern<N> specifies N-dim. data distribution

– Blocked, cyclic, and block-cyclic in multiple dimensions

Pattern<2>(20, 15): extent in first and second dimension

Distribution in first and second dimension:
(BLOCKED, NONE)
(NONE, BLOCKCYCLIC(2))
(BLOCKED, BLOCKCYCLIC(3))
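To make the three distribution kinds concrete, here is a minimal standalone sketch of how an element index could be mapped to its owning unit. It does not use DASH: `DistKind`, `Dist`, `owner_1d`, and `owner_2d` are hypothetical names, and the sketch assumes that blocked distribution uses a block size of ceil(extent / units) and that units form a grid numbered in row-major order. The real dash::Pattern may resolve these details differently.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the distribution specifiers named on the slide;
// these are NOT the DASH types.
enum class DistKind { None, Blocked, Cyclic, BlockCyclic };

struct Dist {
    DistKind kind;
    std::size_t blocksize;  // only used for BlockCyclic
};

// Position (along one dimension of the unit grid) of the unit that owns
// element `i` of a dimension with `extent` elements spread over `nunits` units.
std::size_t owner_1d(Dist d, std::size_t i, std::size_t extent, std::size_t nunits) {
    assert(nunits > 0 && i < extent);
    switch (d.kind) {
    case DistKind::None:
        return 0;  // dimension is not distributed
    case DistKind::Blocked: {
        // Assumption: one contiguous block per unit, block size rounded up.
        const std::size_t bs = (extent + nunits - 1) / nunits;
        return i / bs;
    }
    case DistKind::Cyclic:
        return i % nunits;  // single elements dealt out round-robin
    case DistKind::BlockCyclic:
        assert(d.blocksize > 0);
        return (i / d.blocksize) % nunits;  // fixed-size blocks dealt out round-robin
    }
    return 0;
}

// Owner of element (row, col) of a rows x cols index space when the units
// form a urows x ucols grid. Assumption: units are numbered row-major.
std::size_t owner_2d(Dist drow, Dist dcol,
                     std::size_t row, std::size_t col,
                     std::size_t rows, std::size_t cols,
                     std::size_t urows, std::size_t ucols) {
    return owner_1d(drow, row, rows, urows) * ucols
         + owner_1d(dcol, col, cols, ucols);
}
```

Under these assumptions, a 20 x 15 index space distributed (BLOCKED, NONE) over a 4 x 1 unit grid places rows 5..9 on unit 1, since each unit receives a block of five rows.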

Multidimensional Data Distribution (2)

Example: tiled and tile-shifted data distribution

(TILE(4), TILE(3))
ShiftTilePattern<2>(32, 24)

TilePattern<2, COL_MAJOR>(20, 15)
(TILE(5), TILE(5))

Multidimensional Views

Lightweight Multidimensional Views

// 8x8 2D array
dash::NArray<int, 2> mat(8,8);

// linear access using iterators
dash::distance(mat.begin(), mat.end()) == 64

// create 2x5 region view
auto reg = mat.cols(2,5).rows(3,2);

// region can be used just like 2D array
cout << reg[1][2] << endl; // '7'

dash::distance(reg.begin(), reg.end()) == 10
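The idea of a lightweight view can be sketched without DASH: a view stores only offsets and extents and forwards element access to the underlying storage, so creating it copies no data. The `Region` type and the `whole` helper below are hypothetical. The sketch assumes row-major storage and reads `cols(2,5)` and `rows(3,2)` as (offset, extent) pairs, which is inferred from the "2x5 region" comment on the slide rather than from the DASH documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view on a rectangular region of a row-major matrix.
struct Region {
    const std::vector<int>* data;  // underlying storage, not owned
    std::size_t stride;            // number of columns of the full matrix
    std::size_t row_off, col_off;  // position of the region in the full matrix
    std::size_t nrows, ncols;      // extents of the region

    // Restrict to `extent` rows starting at `offset` (relative to this view).
    Region rows(std::size_t offset, std::size_t extent) const {
        assert(offset + extent <= nrows);
        Region r = *this;
        r.row_off += offset;
        r.nrows = extent;
        return r;
    }

    // Restrict to `extent` columns starting at `offset` (relative to this view).
    Region cols(std::size_t offset, std::size_t extent) const {
        assert(offset + extent <= ncols);
        Region r = *this;
        r.col_off += offset;
        r.ncols = extent;
        return r;
    }

    // Element (i, j) in region-local coordinates.
    int at(std::size_t i, std::size_t j) const {
        assert(i < nrows && j < ncols);
        return (*data)[(row_off + i) * stride + (col_off + j)];
    }

    // Number of elements in the region.
    std::size_t size() const { return nrows * ncols; }
};

// View covering the complete nrows x ncols matrix.
Region whole(const std::vector<int>& data, std::size_t nrows, std::size_t ncols) {
    assert(data.size() == nrows * ncols);
    return Region{&data, ncols, 0, 0, nrows, ncols};
}
```

With this sketch, `whole(m, 8, 8).cols(2, 5).rows(3, 2)` yields a region of 10 elements, and `at(1, 2)` reads the element at row 4, column 4 of the full matrix. The sketch uses `at(i, j)` instead of the `reg[i][j]` syntax to stay short.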

Lightweight Multidimensional Views

– Local and block views

dash::NArray<int, 2> mat(80000,17000);

// view to block element range
auto block = mat.block(3,2);
// use view as Cartesian space:
auto elem = block[30][20];

dash::NArray<int, 2> mat(80000,17000);

if (dash::myid() == 1) {
  // view to local element range
  auto local_elems = mat.local;
  // use view as sequential range:
  for (auto elem : local_elems) { … }
}

Multidimensional Iterator Ranges

– Global iterators provide access to the underlying view of their index space

– DASH global iterators on multi-dimensional regions can still be passed to standard library algorithms

auto r = mat.sub(0, { 4,7 })   // rows
            .sub(1, { 2,7 });  // cols

auto r_view = r.begin().view();
r_view.extents() == { 3,5 }
r_view.offsets() == { 4,2 }

// DASH algorithms use n-dim. view:
dash::summa(r.begin(), r.end(), …);
// multidimensional iterators still
// are sequential:
std::for_each(r.begin(), r.end(), …);
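The extents and offsets shown above follow from reading each `sub` argument as a half-open index range, and a sequential iterator over such a range only needs a mapping from a linear position to coordinates. The sketch below illustrates both points in plain C++. `View2D` and `whole_matrix` are hypothetical names, the 8 x 8 matrix size used in the usage note is an assumption, and row-major traversal order is assumed as well.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical 2D index-space view: where it starts and how large it is.
struct View2D {
    std::array<std::size_t, 2> offsets;
    std::array<std::size_t, 2> extents;

    // Restrict dimension `dim` to the half-open range [begin, end),
    // given relative to this view.
    View2D sub(std::size_t dim, std::size_t begin, std::size_t end) const {
        assert(dim < 2 && begin <= end && end <= extents[dim]);
        View2D v = *this;
        v.offsets[dim] += begin;
        v.extents[dim] = end - begin;
        return v;
    }

    // Global coordinates of the k-th element in sequential traversal.
    // Assumption: traversal is row-major (last dimension runs fastest).
    std::array<std::size_t, 2> coords(std::size_t k) const {
        assert(k < extents[0] * extents[1]);
        return {offsets[0] + k / extents[1], offsets[1] + k % extents[1]};
    }
};

// View on a complete rows x cols index space.
View2D whole_matrix(std::size_t rows, std::size_t cols) {
    return View2D{{0, 0}, {rows, cols}};
}
```

For an 8 x 8 index space, `whole_matrix(8, 8).sub(0, 4, 7).sub(1, 2, 7)` has extents { 3,5 } and offsets { 4,2 }, matching the values on the slide, and position 5 of the sequential traversal maps to coordinates (5, 2).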

Multidimensional Iterator Ranges

– Global iterators provide access to the data distribution pattern of their iteration space

auto r = mat.sub(0, { 4,7 })   // rows
            .sub(1, { 2,7 });  // cols

auto r_pattern = r.begin().pattern();
r_pattern.blocksize() == 4
r_pattern.blocks() == 16
r_pattern.blocks_at(dash::myid()) == 4

DASH Algorithms

Growing number of DASH equivalents to STL algorithms:

Examples of STL algorithms ported to DASH (these also work for multidimensional ranges):

dash::GlobIter<T> dash::fill(GlobIter<T> begin, GlobIter<T> end, T val);

- dash::fill         range[i] <- val
- dash::generate     range[i] <- func()
- dash::for_each     func(range[i])
- dash::transform    range[i] = func(range2[i])
- dash::accumulate   sum(range[i]) (0<=i<=n-1)
- dash::min_element  min(range[i]) (0<=i<=n-1)
- dash::copy         range[i] <- range2[i]
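For comparison, the element-wise semantics listed above can be exercised with the standard library counterparts on a purely local range. The sketch below uses only `std::` algorithms, not the `dash::` versions, which operate on global iterator ranges. `Summary` and `stl_counterparts` are hypothetical names introduced for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Results collected from the individual algorithm calls.
struct Summary {
    int sum;
    int min;
    std::size_t visited;
    std::vector<int> doubled;
    std::vector<int> copied;
};

Summary stl_counterparts(std::size_t n) {
    assert(n > 0);  // min_element needs a non-empty range
    std::vector<int> range(n);

    // fill: range[i] <- val
    std::fill(range.begin(), range.end(), 1);

    // generate: range[i] <- func(), here func() yields 0, 1, 2, ...
    int next = 0;
    std::generate(range.begin(), range.end(), [&next] { return next++; });

    // for_each: func(range[i]), here func counts the visited elements
    std::size_t visited = 0;
    std::for_each(range.begin(), range.end(), [&visited](int) { ++visited; });

    // transform: doubled[i] = func(range[i])
    std::vector<int> doubled(n);
    std::transform(range.begin(), range.end(), doubled.begin(),
                   [](int x) { return 2 * x; });

    // accumulate: sum of range[i] for 0 <= i <= n-1
    const int sum = std::accumulate(range.begin(), range.end(), 0);

    // min_element: minimum of range[i] for 0 <= i <= n-1
    const int min = *std::min_element(range.begin(), range.end());

    // copy: copied[i] <- doubled[i]
    std::vector<int> copied(n);
    std::copy(doubled.begin(), doubled.end(), copied.begin());

    return Summary{sum, min, visited, doubled, copied};
}
```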

Asynchronous Copying for Latency Hiding

Asynchronous Operations

– Async. algorithm interface: dash::copy_async()

– Launch policy: dash::launch::async (in upcoming DASH release 0.3.0)

std::vector<int> lcopy(block.size());
// starts async. copy of global range to local memory
// … via algorithm interface:
auto fut = dash::copy_async(block.begin(), block.end(),
                            lcopy.begin());
// … or via launch policy:
auto fut = dash::copy(dash::launch::async,
                      block.begin(), block.end(),
                      lcopy.begin());

overlapping_computation();
auto copy_end = fut.get(); // blocks until copy received
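The same start, overlap, and wait structure can be tried out locally with `std::async` and `std::future` from the standard library. This is only an analogy: `copy_async_sketch` and `copy_with_overlap` are hypothetical helpers, the copy runs on a second thread instead of over the network, and nothing here reflects how DASH implements `dash::copy_async`.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <vector>

// Hypothetical local stand-in for an asynchronous copy: runs std::copy on
// another thread and returns a future holding the end of the output range.
template <class InputIt, class OutputIt>
std::future<OutputIt> copy_async_sketch(InputIt first, InputIt last, OutputIt dest) {
    return std::async(std::launch::async,
                      [first, last, dest] { return std::copy(first, last, dest); });
}

std::vector<int> copy_with_overlap(const std::vector<int>& block) {
    std::vector<int> lcopy(block.size());

    // Start the copy; it proceeds in the background.
    auto fut = copy_async_sketch(block.begin(), block.end(), lcopy.begin());

    // Overlapping computation: must not touch lcopy while the copy is running.
    long checksum = 0;
    for (int x : block) checksum += x;
    (void)checksum;

    // Blocks until the copy has completed; only now is lcopy safe to read.
    auto copy_end = fut.get();
    assert(copy_end == lcopy.end());
    (void)copy_end;
    return lcopy;
}
```

The point of the pattern is that the destination buffer is read only after `get()` returns, so the work done in between hides the transfer latency.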

Case Study: S(R)UMMA Algorithm

Block matrix-matrix multiplication with prefetching

while (!done) {
  blk_a = matrixA.local.block(k); …
  blk_b = matrixB.local.block(k); …
  // prefetch
  auto get_a = dash::copy_async(blk_a.begin(), blk_a.end(), lblk_a_get);
  auto get_b = dash::copy_async(blk_b.begin(), blk_b.end(), lblk_b_get);
  // local DGEMM
  dash::multiply(lblk_a_comp, lblk_b_comp, lblk_c_comp);
  // wait for transfer to finish
  get_a.wait(); get_b.wait();
  // swap buffers
  swap(lblk_a_get, lblk_a_comp); swap(lblk_b_get, lblk_b_comp);
}
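The prefetching loop is a double-buffering scheme: while one pair of buffers is used for the multiplication, the next blocks are fetched into a second pair, and the pairs are swapped once the transfer has finished. The single-process sketch below imitates this with `std::async`. It is not the SUMMA implementation from DASH: all function names are hypothetical, "fetching" is a plain memory copy of a panel, and the matrices are assumed to be square with an extent divisible by the panel width.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

using Matrix = std::vector<double>;  // row-major storage

// Copy columns [k0, k0 + kb) of the n x n matrix a into buf (n x kb).
void fetch_a_panel(const Matrix& a, std::size_t n, std::size_t k0,
                   std::size_t kb, Matrix& buf) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < kb; ++k)
            buf[i * kb + k] = a[i * n + k0 + k];
}

// Copy rows [k0, k0 + kb) of the n x n matrix b into buf (kb x n).
void fetch_b_panel(const Matrix& b, std::size_t n, std::size_t k0,
                   std::size_t kb, Matrix& buf) {
    for (std::size_t k = 0; k < kb; ++k)
        for (std::size_t j = 0; j < n; ++j)
            buf[k * n + j] = b[(k0 + k) * n + j];
}

// c += ap * bp, where ap is n x kb and bp is kb x n.
void multiply_add(const Matrix& ap, const Matrix& bp, std::size_t n,
                  std::size_t kb, Matrix& c) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < kb; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += ap[i * kb + k] * bp[k * n + j];
}

Matrix multiply_prefetched(const Matrix& a, const Matrix& b,
                           std::size_t n, std::size_t kb) {
    assert(kb > 0 && n % kb == 0 && a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0);
    Matrix a_comp(n * kb), b_comp(kb * n);  // buffers used for computation
    Matrix a_get(n * kb), b_get(kb * n);    // buffers being filled

    // The first panels have to be fetched before any computation can start.
    fetch_a_panel(a, n, 0, kb, a_comp);
    fetch_b_panel(b, n, 0, kb, b_comp);

    for (std::size_t k0 = 0; k0 < n; k0 += kb) {
        const std::size_t next = k0 + kb;
        std::future<void> get_a, get_b;
        if (next < n) {
            // prefetch the next panels into the "get" buffers
            get_a = std::async(std::launch::async,
                               [&, next] { fetch_a_panel(a, n, next, kb, a_get); });
            get_b = std::async(std::launch::async,
                               [&, next] { fetch_b_panel(b, n, next, kb, b_get); });
        }
        // local multiplication on the "comp" buffers
        multiply_add(a_comp, b_comp, n, kb, c);
        if (next < n) {
            // wait for the transfer to finish, then swap buffers
            get_a.wait();
            get_b.wait();
            std::swap(a_get, a_comp);
            std::swap(b_get, b_comp);
        }
    }
    return c;
}
```

The computation never reads a buffer that a background task is writing, which is what makes the overlap safe. A production implementation would replace `multiply_add` with a tuned DGEMM call.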

DISCLAIMER

This code is simplified for brevity. Get the real source code here:

https://github.com/dash-project/dash/blob/development/dash/include/dash/algorithm/SUMMA.h

Schedules block transmissions to minimize network congestion

Local submatrix multiplication using DGEMM from serial Intel MKL

DASH vs. DGEMM: Intel MKL, PLASMA

DASH vs. PDGEMM: ScaLAPACK

But acing individual benchmarks is not the actual point.

Most important:

– the NArray concept allows intuitive design of efficient algorithms

– we achieved portable, robust efficiency on different hardware and system environments

Summary

NArray Concept

– Views simplify design of efficient algorithms

– First-class support for locality-based operations

– Complies with existing C++ standard library concepts

DASH algorithms on n-dim. ranges

– SUMMA case study: straightforward, compact implementation

– Leveraged portable efficiency of Intel MKL

– Beats the (P)DGEMM performance of Intel MKL, PLASMA, and ScaLAPACK

– Robust scalability in a variety of node-level and highly distributed benchmark scenarios

http://www.dash-project.org/
http://github.com/dash-project/

Acknowledgements

DASH on GitHub: https://github.com/dash-project/dash/

Funding

The DASH Team

T. Fuchs (LMU), R. Kowalewski (LMU), D. Hünich (TUD), A. Knüpfer (TUD), J. Gracia (HLRS), C. Glass (HLRS), H. Zhou (HLRS), K. Idrees (HLRS), J. Schuchart (HLRS), F. Mößbauer (LMU), K. Fürlinger (LMU)
