Post on 14-Jan-2017
A Multidimensional Distributed Array Abstraction for PGAS
www.dash-project.org
Tobias Fuchs
fuchst@nm.ifi.lmu.de
Ludwig-Maximilians-Universität München, MNM-Team
DASH - Overview
DASH is a C++ template library that offers
– distributed data structures and parallel algorithms
– a complete PGAS (partitioned global address space) programming system without a custom (pre-)compiler
PGAS Terminology – SHMEM Analogy
Unit: The individual participants in a DASH program, usually full OS processes.
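[Figure: memory of Unit 0, Unit 1, …, Unit N-1, split into a Shared and a Private part. Shared: dash::Array a(1000), with index ranges 0..9, 10..19, …, ..999 spread across the units, and dash::Shared s. Private: per-unit variables such as int a; int b; int c;]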
Shared data: managed by DASH in a virtual global address space
Private data: managed by regular C/C++ mechanisms
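The relation between a global index and its owning unit can be illustrated without DASH itself. The following is a minimal sketch in plain C++, assuming a blocked distribution in which every unit owns one contiguous chunk; `BlockedLayout`, `owner`, and `local_offset` are hypothetical names chosen for illustration and are not DASH API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of a blocked 1D distribution (not DASH API):
// each unit owns one contiguous chunk of ceil(size / units) elements,
// the last chunk may be shorter.
struct BlockedLayout {
    std::size_t size;   // number of global elements
    std::size_t units;  // number of units (processes)

    std::size_t block_size() const {
        assert(units > 0);
        return (size + units - 1) / units;
    }
    // Unit that owns the element at global index g.
    std::size_t owner(std::size_t g) const {
        assert(g < size);
        return g / block_size();
    }
    // Position of global index g inside its owner's local memory.
    std::size_t local_offset(std::size_t g) const {
        assert(g < size);
        return g % block_size();
    }
};
```

With 1000 elements and 100 units, this model places indices 0..9 on unit 0 and 10..19 on unit 1, which matches the index ranges shown in the figure.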
DASH Project Structure
Partner          Phase I (2013-2015)                  Phase II (2016-2018)
LMU Munich       Project lead, C++ template library   Project lead, C++ template library, data dock
TU Dresden       Libraries and interfaces, tools      Smart data structures, resilience
HLRS Stuttgart   DART runtime                         DART runtime
KIT Karlsruhe    Application studies                  -
IHR Stuttgart    -                                    Smart deployment, Application studies
DASH - Partitioned Global Address Space
Data affinity
– data has well-defined owner but can be accessed by any unit
– data locality important for performance
– support for the owner computes execution model
DASH:
– unified access to local and remote data in global memory space
– and explicit views on local memory space
DASH Distributed Data Structures Overview
Container         Description                            Data distribution
Array<T>          1D array                               static, configurable
NArray<T, N>      N-dim. array                           static, configurable
Shared<T>         Shared scalar                          fixed (at 0)
Directory(*)<T>   Variable-size, locally indexed array   manual, load-balanced
List<T>           Variable-size linked list              dynamic, load-balanced
Map<T>            Variable-size associative map          dynamic, balanced by hash function
(*) Under construction
Multidimensional Data Distribution (1)
dash::Pattern<N> specifies N-dim. data distribution
– Blocked, cyclic, and block-cyclic in multiple dimensions
Pattern<2>(20, 15): extent in first and second dimension
(BLOCKED, NONE), (NONE, BLOCKCYCLIC(2)), (BLOCKED, BLOCKCYCLIC(3)): distribution in first and second dimension
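How the distribution specifiers differ is easiest to see in a single dimension. The sketch below is plain C++ and only models the usual textbook meaning of blocked, cyclic, and block-cyclic distribution; `Dist`, `DimSpec`, and `unit_of` are hypothetical names, and the exact rules DASH applies (for example for uneven block sizes) are not taken from the slides.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the distribution specifiers (not DASH API).
enum class Dist { NONE, BLOCKED, CYCLIC, BLOCKCYCLIC };

struct DimSpec {
    Dist        kind;
    std::size_t block;  // block size, only used by BLOCKCYCLIC
};

// Unit coordinate, within one dimension of the unit grid, owning index i.
// extent: number of elements in this dimension
// units:  number of units assigned to this dimension
std::size_t unit_of(DimSpec spec, std::size_t i,
                    std::size_t extent, std::size_t units) {
    assert(i < extent && units > 0);
    switch (spec.kind) {
    case Dist::NONE:     // dimension is not distributed
        return 0;
    case Dist::BLOCKED:  // one contiguous chunk per unit
        return i / ((extent + units - 1) / units);
    case Dist::CYCLIC:   // single elements dealt round-robin
        return i % units;
    case Dist::BLOCKCYCLIC:  // blocks of fixed size dealt round-robin
        assert(spec.block > 0);
        return (i / spec.block) % units;
    }
    return 0;  // unreachable, keeps compilers quiet
}
```

A two-dimensional owner follows by applying `unit_of` once per dimension and combining the two unit coordinates in the unit grid.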
Multidimensional Data Distribution (2)
Example: tiled and tile-shifted data distribution
ShiftTilePattern<2>(32, 24) with (TILE(4), TILE(3))
TilePattern<2, COL_MAJOR>(20, 15) with (TILE(5), TILE(5))
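The index arithmetic behind a tiled layout can be sketched in a few lines of plain C++. Everything here is an assumption made for illustration: `Tiling2D` is a hypothetical type, the sketch uses row-major order for simplicity (the slide's TilePattern example is COL_MAJOR), and `unit_shifted` shows only one plausible reading of "tile-shifted", not the definition used by ShiftTilePattern.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 2D tiling model; none of these names are DASH API.
struct Tiling2D {
    std::size_t rows, cols;      // global extents
    std::size_t tile_r, tile_c;  // tile extents, assumed to divide rows/cols

    std::size_t tiles_per_row() const { return cols / tile_c; }

    // Tile coordinates of element (r, c).
    std::size_t tile_row(std::size_t r) const { assert(r < rows); return r / tile_r; }
    std::size_t tile_col(std::size_t c) const { assert(c < cols); return c / tile_c; }

    // Row-major offset of (r, c) inside its tile; a tile is stored contiguously.
    std::size_t offset_in_tile(std::size_t r, std::size_t c) const {
        return (r % tile_r) * tile_c + (c % tile_c);
    }
    // Plain tiling: tiles dealt to units round-robin in row-major tile order.
    std::size_t unit_tiled(std::size_t r, std::size_t c, std::size_t units) const {
        return (tile_row(r) * tiles_per_row() + tile_col(c)) % units;
    }
    // Assumed "shifted" variant: each tile row starts one unit later, so
    // tiles on the same anti-diagonal share an owner.
    std::size_t unit_shifted(std::size_t r, std::size_t c, std::size_t units) const {
        return (tile_row(r) + tile_col(c)) % units;
    }
};
```

For a 20 x 15 matrix with 5 x 5 tiles, element (7, 12) lies in tile (1, 2) at offset 12 within that tile.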
Multidimensional Views
Lightweight Multidimensional Views
// 8x8 2D array
dash::NArray<int, 2> mat(8,8);
// linear access using iterators
dash::distance(mat.begin(), mat.end()) == 64
// create 2x5 region view
auto reg = mat.cols(2,5).rows(3,2);
// region can be used just like 2D array
cout << reg[1][2] << endl; // '7'
dash::distance(reg.begin(), reg.end()) == 10
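A view is lightweight because it stores only offsets and extents, never elements. The following minimal sketch in plain C++ models that idea for a row-major matrix held in a `std::vector`; `RegionView`, `whole`, and `at` are hypothetical names, and the `cols(first, n)` / `rows(first, n)` argument order is assumed from the 2x5 example above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major 2D region view (not DASH API). It keeps a pointer
// to the parent storage plus offsets and extents; creating it copies nothing.
class RegionView {
public:
    RegionView(std::vector<int>& data, std::size_t parent_cols,
               std::size_t row_off, std::size_t col_off,
               std::size_t n_rows, std::size_t n_cols)
        : data_(&data), parent_cols_(parent_cols),
          row_off_(row_off), col_off_(col_off),
          n_rows_(n_rows), n_cols_(n_cols) {}

    // Narrow to n columns starting at column `first` of this view.
    RegionView cols(std::size_t first, std::size_t n) const {
        assert(first + n <= n_cols_);
        return RegionView(*data_, parent_cols_, row_off_, col_off_ + first,
                          n_rows_, n);
    }
    // Narrow to n rows starting at row `first` of this view.
    RegionView rows(std::size_t first, std::size_t n) const {
        assert(first + n <= n_rows_);
        return RegionView(*data_, parent_cols_, row_off_ + first, col_off_,
                          n, n_cols_);
    }
    // Element access in view coordinates, translated to parent storage.
    int& at(std::size_t r, std::size_t c) const {
        assert(r < n_rows_ && c < n_cols_);
        return (*data_)[(row_off_ + r) * parent_cols_ + (col_off_ + c)];
    }
    std::size_t size() const { return n_rows_ * n_cols_; }

private:
    std::vector<int>* data_;
    std::size_t parent_cols_, row_off_, col_off_, n_rows_, n_cols_;
};

// View covering a whole rows x cols matrix stored in `data`.
RegionView whole(std::vector<int>& data, std::size_t rows, std::size_t cols) {
    assert(data.size() == rows * cols);
    return RegionView(data, cols, 0, 0, rows, cols);
}
```

In this model, `whole(m, 8, 8).cols(2, 5).rows(3, 2)` yields a view of 10 elements, and writes through `at` land in the parent storage.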
Lightweight Multidimensional Views
– Local and block views
dash::NArray<int, 2> mat(80000,17000);
// view to block element range
auto block = mat.block(3,2);
// use view as Cartesian space:
auto elem = block[30][20];
if (dash::myid() == 1) {
    // view to local element range
    auto local_elems = mat.local;
    // use view as sequential range:
    for (auto elem : local_elems) { … }
}
Multidimensional Iterator Ranges
– Global iterators provide access to the underlying view of their index space
– DASH global iterators on multi-dimensional regions can still be passed to standard library algorithms
auto r = mat.sub(0, { 4,7 })   // rows
            .sub(1, { 2,7 });  // cols
auto r_view = r.begin().view();
r_view.extents() == { 3,5 }
r_view.offsets() == { 4,2 }
// DASH algorithms use n-dim. view:
dash::summa(r.begin(), r.end(), …);
// multidimensional iterators still
// are sequential:
std::for_each(r.begin(), r.end(), …);
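That an iterator over a rectangular region can still look sequential to the standard library can be shown with a small stand-alone model. The sketch below is plain C++ with hypothetical names (`SubRangeIter`, `SubRange`, `sum_region`); it assumes half-open index intervals, which is consistent with the extents { 3,5 } reported for { 4,7 } and { 2,7 } above, and it says nothing about how DASH implements its global iterators.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <vector>

// Hypothetical iterator over a rectangular sub-range of a row-major matrix.
// It walks the region row by row, so algorithms see a plain sequence.
class SubRangeIter {
public:
    using iterator_category = std::forward_iterator_tag;
    using value_type        = int;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const int*;
    using reference         = const int&;

    SubRangeIter() = default;
    SubRangeIter(const std::vector<int>* data, std::size_t parent_cols,
                 std::size_t row_off, std::size_t col_off,
                 std::size_t n_cols, std::size_t pos)
        : data_(data), parent_cols_(parent_cols), row_off_(row_off),
          col_off_(col_off), n_cols_(n_cols), pos_(pos) {}

    reference operator*() const {
        assert(data_ != nullptr);
        std::size_t r = row_off_ + pos_ / n_cols_;  // row in parent matrix
        std::size_t c = col_off_ + pos_ % n_cols_;  // column in parent matrix
        return (*data_)[r * parent_cols_ + c];
    }
    SubRangeIter& operator++() { ++pos_; return *this; }
    SubRangeIter operator++(int) { SubRangeIter t = *this; ++pos_; return t; }
    bool operator==(const SubRangeIter& o) const { return pos_ == o.pos_; }
    bool operator!=(const SubRangeIter& o) const { return pos_ != o.pos_; }

private:
    const std::vector<int>* data_ = nullptr;
    std::size_t parent_cols_ = 0, row_off_ = 0, col_off_ = 0;
    std::size_t n_cols_ = 1, pos_ = 0;
};

// Rectangular region given by half-open row and column intervals.
struct SubRange {
    const std::vector<int>* data;
    std::size_t parent_cols;
    std::size_t row_begin, row_end;
    std::size_t col_begin, col_end;

    std::size_t n_rows() const { return row_end - row_begin; }
    std::size_t n_cols() const { return col_end - col_begin; }
    SubRangeIter begin() const {
        return SubRangeIter(data, parent_cols, row_begin, col_begin, n_cols(), 0);
    }
    SubRangeIter end() const {
        return SubRangeIter(data, parent_cols, row_begin, col_begin, n_cols(),
                            n_rows() * n_cols());
    }
};

// A standard algorithm applied unchanged to the linearized region.
long sum_region(const SubRange& r) {
    return std::accumulate(r.begin(), r.end(), 0L);
}
```

The iterator carries the view's offsets and extents with it, which is the property an n-dimensional algorithm would need in order to recover the region's shape from a begin/end pair.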
– Global iterators provide access to the data distribution pattern of their iteration space
auto r = mat.sub(0, { 4,7 })   // rows
            .sub(1, { 2,7 });  // cols
auto r_pattern = r.begin().pattern();
r_pattern.blocksize() == 4
r_pattern.blocks() == 16
r_pattern.blocks_at(dash::myid()) == 4
DASH Algorithms
Growing number of DASH equivalents to STL algorithms:
dash::GlobIter<T> dash::fill(GlobIter<T> begin, GlobIter<T> end, T val);
Examples of STL algorithms ported to DASH, also work for multidimensional ranges:
- dash::fill          range[i] <- val
- dash::generate      range[i] <- func()
- dash::for_each      func(range[i])
- dash::transform     range[i] = func(range2[i])
- dash::accumulate    sum(range[i]) (0<=i<=n-1)
- dash::min_element   min(range[i]) (0<=i<=n-1)
- dash::copy          range[i] <- range2[i]
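A distributed counterpart of an STL algorithm typically splits into a local step per unit and a reduction across units. The sketch below models that structure for a minimum search in plain, sequential C++; `LocalParts`, `GlobalPos`, and `min_element_sketch` are hypothetical names, the per-unit loop stands in for work that would run in parallel, and the actual implementation of dash::min_element may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a distributed range: one vector per unit.
using LocalParts = std::vector<std::vector<int>>;

// Position of an element as (owning unit, offset in that unit's part).
struct GlobalPos {
    std::size_t unit;
    std::size_t offset;
};

// Each unit scans only its own part (owner computes); the per-unit
// candidates are then reduced to a single global minimum.
GlobalPos min_element_sketch(const LocalParts& parts) {
    bool found = false;
    GlobalPos best{0, 0};
    for (std::size_t u = 0; u < parts.size(); ++u) {
        const std::vector<int>& local = parts[u];
        if (local.empty()) continue;  // unit owns no elements
        // Local step: in a real run, every unit does this concurrently.
        auto it = std::min_element(local.begin(), local.end());
        std::size_t off = static_cast<std::size_t>(it - local.begin());
        // Reduction step: keep the first smallest candidate seen so far.
        if (!found || *it < parts[best.unit][best.offset]) {
            best = GlobalPos{u, off};
            found = true;
        }
    }
    assert(found && "range must not be empty");
    return best;
}
```

Only one candidate per unit takes part in the reduction, so the communication volume is independent of the range size.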
Asynchronous Copying for Latency Hiding
Asynchronous Operations
– Async. algorithm interface: dash::copy_async()
– Launch policy: dash::launch::async (in upcoming DASH release 0.3.0)
std::vector<int> lcopy(block.size());
// starts async. copy of global range to local memory
// … via algorithm interface:
auto fut = dash::copy_async(block.begin(), block.end(),
                            lcopy.begin());
// … or via launch policy (use one of the two forms):
auto fut = dash::copy(dash::launch::async,
                      block.begin(), block.end(),
                      lcopy.begin());
overlapping_computation();
auto copy_end = fut.get(); // blocks until copy received
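The same start, overlap, then wait pattern can be tried with the C++ standard library alone. The sketch below uses `std::async` and `std::future` as a stand-in for a one-sided transfer; `copy_async_sketch` and `overlap_demo` are hypothetical names, and a thread-based copy is only an analogy for what a PGAS runtime does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical stand-in for an asynchronous copy: runs std::copy on another
// thread and returns a future holding the end of the written range.
std::future<std::vector<int>::iterator>
copy_async_sketch(const std::vector<int>& src, std::vector<int>& dst) {
    assert(dst.size() >= src.size());
    return std::async(std::launch::async, [&src, &dst] {
        return std::copy(src.begin(), src.end(), dst.begin());
    });
}

// Starts the copy, does unrelated work, then blocks until the copy is done.
long overlap_demo(const std::vector<int>& remote,
                  std::vector<int>& local_copy) {
    auto fut = copy_async_sketch(remote, local_copy);
    long work = 0;
    for (int i = 1; i <= 1000; ++i) work += i;  // overlapping computation
    auto copy_end = fut.get();                  // blocks until copy finished
    assert(copy_end ==
           local_copy.begin() + static_cast<std::ptrdiff_t>(remote.size()));
    return work;
}
```

The destination buffer must not be read before `fut.get()` returns; until then the copy may still be in flight.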
Block matrix-matrix multiplication with prefetching
while (!done) {
    blk_a = matrixA.local.block(k); …
    blk_b = matrixB.local.block(k); …
    // prefetch
    auto get_a = dash::copy_async(blk_a.begin(), blk_a.end(), lblk_a_get);
    auto get_b = dash::copy_async(blk_b.begin(), blk_b.end(), lblk_b_get);
    // local DGEMM
    dash::multiply(lblk_a_comp, lblk_b_comp, lblk_c_comp);
    // wait for transfer to finish
    get_a.wait(); get_b.wait();
    // swap buffers
    swap(lblk_a_get, lblk_a_comp);
    swap(lblk_b_get, lblk_b_comp);
}
Case Study: S(R)UMMA Algorithm
DISCLAIMER
This code is simplified for brevity. Get the real source code here:
https://github.com/dash-project/dash/blob/development/dash/include/dash/algorithm/SUMMA.h
Schedules block transmissions to minimize network congestion
Local submatrix multiplication using DGEMM from serial Intel MKL
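The double-buffering scheme of the loop can be reproduced in a self-contained form. The sketch below is plain C++ and makes several assumptions for illustration: blocks are small dense square matrices, `fetch_async` imitates a remote get with a copy on another thread, and a naive triple loop stands in for the DGEMM call. All names are hypothetical, and this is not the SUMMA implementation from the DASH repository.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

using Block = std::vector<double>;  // dense n x n block, row-major

// c += a * b for n x n blocks; naive stand-in for a tuned DGEMM call.
void multiply_add(const Block& a, const Block& b, Block& c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

// Hypothetical asynchronous "get" of a block: a copy on another thread.
std::future<void> fetch_async(const Block& remote, Block& local) {
    return std::async(std::launch::async,
                      [&remote, &local] { local = remote; });
}

// Computes C = sum over k of A[k] * B[k], prefetching the blocks for
// step k+1 while the blocks for step k are being multiplied.
Block multiply_prefetched(const std::vector<Block>& A,
                          const std::vector<Block>& B, std::size_t n) {
    assert(A.size() == B.size() && !A.empty());
    Block a_comp = A[0], b_comp = B[0];  // buffers used for computation
    Block a_get(n * n), b_get(n * n);    // buffers being filled
    Block c(n * n, 0.0);
    for (std::size_t k = 0; k < A.size(); ++k) {
        const bool more = k + 1 < A.size();
        std::future<void> get_a, get_b;
        if (more) {  // prefetch next pair of blocks
            get_a = fetch_async(A[k + 1], a_get);
            get_b = fetch_async(B[k + 1], b_get);
        }
        multiply_add(a_comp, b_comp, c, n);  // local multiplication
        if (more) {
            get_a.wait();  // wait for transfer to finish
            get_b.wait();
            std::swap(a_get, a_comp);  // swap buffers
            std::swap(b_get, b_comp);
        }
    }
    return c;
}
```

The computation only ever reads the `_comp` buffers while the transfers write the `_get` buffers, so no synchronization beyond the two `wait()` calls is needed.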
DASH vs. DGEMM: Intel MKL, PLASMA
DASH vs. PDGEMM: ScaLAPACK
Good.
But acing individual benchmarks is not the actual point.
Most important:
– the NArray concept allows intuitive design of efficient algorithms
– we achieved portable, robust efficiency on different hardware and system environments
Summary
NArray Concept
– Views simplify design of efficient algorithms
– First-class support for locality-based operations
– Complies with existing C++ standard library concepts
DASH algorithms on n-dim. ranges
– SUMMA case study: straightforward, compact implementation
– Leveraged portable efficiency of Intel MKL
– Outperforms (P)DGEMM of Intel MKL, PLASMA, and ScaLAPACK
– Robust scalability in a variety of node-level and highly distributed benchmark scenarios
http://www.dash-project.org/
http://github.com/dash-project/
Acknowledgements
DASH on GitHub: https://github.com/dash-project/dash/
Funding
The DASH Team
T. Fuchs (LMU), R. Kowalewski (LMU), D. Hünich (TUD), A. Knüpfer (TUD), J. Gracia (HLRS), C. Glass (HLRS), H. Zhou (HLRS), K. Idrees (HLRS), J. Schuchart (HLRS), F. Mößbauer (LMU), K. Fürlinger (LMU)