Software Abstractions for Parallel Hardware

Software Abstractions for Parallel Architectures Joel Falcou LRI - CNRS - INRIA HDR Thesis Defense 12/01/2014

Transcript of Software Abstractions for Parallel Hardware

Page 1: Software Abstractions for Parallel Hardware

Software Abstractions for Parallel Architectures

Joel Falcou

LRI - CNRS - INRIA

HDR Thesis Defense, 12/01/2014

Page 2: Software Abstractions for Parallel Hardware

The Paradigm Change in Science

From Experiments to Simulations

- Simulation is now an integral part of the Scientific Method
- Scientific Computing enables larger, faster, more accurate research
- Fast simulation is time travel: scientific results are now more readily available

Local Galaxy Cluster Simulation - Illustris project

Computing is first and foremost a mainstream science tool

2 of 41

Page 3: Software Abstractions for Parallel Hardware

The Paradigm Change in Science

The Parallel Hell

- Heat Wall: growing cores instead of GHz
- Hierarchical and heterogeneous parallel systems are the norm
- The Free Lunch is over: hardware complexity rises faster than average developer skills


The real challenge in HPC is the Expressiveness/Efficiency War

2 of 41

Page 4: Software Abstractions for Parallel Hardware

The Expressiveness/Efficiency War

[Chart: Performance vs Expressiveness across three eras -
Single Core Era: C/Fortran, C++, Java;
Multi-Core/SIMD Era: Sequential, Threads, SIMD;
Heterogeneous Era: Sequential, SIMD, Threads, GPU, Phi, Distributed]

As parallel systems complexity grows, the expressiveness gap turns into an ocean

3 of 41

Page 5: Software Abstractions for Parallel Hardware

Designing tools for Scientific Computing

Objectives

1. Be non-disruptive

2. Enable domain-driven optimizations

3. Provide an intuitive API for the user

4. Support a wide architectural landscape

5. Be efficient

Our Approach

- Design tools as C++ libraries (1)
- Design these libraries as Domain Specific Embedded Languages (DSELs) (2+3)
- Use Parallel Programming Abstractions as parallel components (4)
- Use Generative Programming to deliver performance (5)

4 of 41


Page 7: Software Abstractions for Parallel Hardware

Talk Layout

Introduction

Abstractions & Efficiency

Experimental Results

Conclusion

5 of 41

Page 8: Software Abstractions for Parallel Hardware

Why Parallel Programming Models?

Limits of regular tools

- Unstructured parallelism is error-prone
- Low-level parallel tools are non-composable
- Both contribute to the Expressiveness Gap

Available Models

- Performance-centric: PRAM, LogP, BSP
- Pattern-centric: Futures, Skeletons
- Data-centric: HTA, PGAS

6 of 41


Page 11: Software Abstractions for Parallel Hardware

Bulk Synchronous Parallelism [Valiant, McColl 90]

Principles

- Machine model
- Execution model
- Analytic cost model

[Diagram: BSP execution model - during each superstep, processors P0..P3 compute (cost Wmax), communicate (cost h.g), then synchronize at a barrier (cost L) before superstep T+1]

7 of 41

Page 12: Software Abstractions for Parallel Hardware

Bulk Synchronous Parallelism [Valiant, McColl 90]

Advantages

- Simple set of primitives
- Implementable on any kind of hardware
- Possibility to reason about BSP programs
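Sequentially sketched, a superstep is local computation, communication of results, then a global barrier. A minimal illustration in plain C++ (hypothetical code, not the BSP++ API), where joining the worker threads plays the role of the barrier:

```cpp
#include <cassert>
#include <numeric>
#include <thread>
#include <vector>

// Minimal BSP-style sketch: each "processor" runs the superstep's local
// computation on its slice of the data, the join acts as the global
// barrier, and the next superstep reads the values "communicated"
// through the shared vector of partial results.
long bsp_sum(const std::vector<int>& data, int p)
{
    std::vector<long> partial(p, 0);

    // Superstep T: local computation + communication (write partials)
    std::vector<std::thread> procs;
    for (int id = 0; id < p; ++id)
        procs.emplace_back([&, id] {
            for (std::size_t i = id; i < data.size(); i += p)
                partial[id] += data[i];
        });
    for (auto& t : procs) t.join();   // barrier: end of superstep T

    // Superstep T+1: combine the communicated partial results
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

Because each superstep's cost is explicit (local work, communication volume, barrier), programs written this way can be reasoned about with the analytic cost model above.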


7 of 41

Page 13: Software Abstractions for Parallel Hardware

Parallel Skeletons [Cole 89]

Principles

- There are patterns in parallel applications
- Those patterns can be generalized as Skeletons
- Applications are assembled as combinations of such patterns

Functional point of view

- Skeletons are Higher-Order Functions
- Skeletons support a compositional semantics
- Applications become compositions of stateless functions
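This functional reading can be made concrete with ordinary C++; a minimal sequential sketch (the names map_skel and fold_skel are illustrative, not a library API):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Sequential semantics of two classic skeletons: a parallel version
// keeps exactly this interface and changes only the execution strategy.
template <typename T, typename F>
std::vector<T> map_skel(const std::vector<T>& in, F f)
{
    std::vector<T> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(), f);
    return out;
}

template <typename T, typename F>
T fold_skel(const std::vector<T>& in, T init, F f)
{
    return std::accumulate(in.begin(), in.end(), init, f);
}
```

Since skeletons are higher-order functions over stateless arguments, they compose: fold_skel(map_skel(v, sq), 0, plus) is a sum of squares, and the composition itself can be parallelized as a whole.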

8 of 41

Page 14: Software Abstractions for Parallel Hardware

Parallel Skeletons [Cole 89]

Principles

- There are patterns in parallel applications
- Those patterns can be generalized as Skeletons
- Applications are assembled as combinations of such patterns

Classical Skeletons

- Data-parallel: map, fold, scan
- Task-parallel: par, pipe, farm
- More complex: distributable homomorphism, divide & conquer, …

8 of 41

Page 15: Software Abstractions for Parallel Hardware

Relevance to our Objectives

Why use Parallel Skeletons?

- Write code independent of parallel programming minutiae
- Composability supports hierarchical architectures
- Code is scalable and easy to maintain

Why use BSP?

- The cost model guides development
- Few primitives mean a low intellectual burden
- A good medium for developing skeletons

How to ensure performance of those models’ implementations?

9 of 41

Page 16: Software Abstractions for Parallel Hardware

Domain Specific Embedded Languages

Domain Specific Languages

- Non-Turing-complete declarative languages
- Solve a single class of problems
- Express what to do instead of how to do it
- E.g. SQL, MATLAB, …

From DSL to DSEL [Abrahams 2004]

- A DSL incorporates domain-specific notation, constructs, and abstractions as fundamental design considerations
- A Domain Specific Embedded Language (DSEL) is simply a library that meets the same criteria
- Generative Programming is one way to design such libraries

10 of 41

Page 17: Software Abstractions for Parallel Hardware

Generative Programming [Eisenecker 97]

[Diagram: a Translator (the Generative Component) combines a Domain-Specific Application Description with Parametric Sub-components to produce a Concrete Application]

11 of 41

Page 18: Software Abstractions for Parallel Hardware

Meta-programming as a Tool

Definition: Meta-programming is the writing of computer programs that analyze, transform, and generate other programs (or themselves) as their data.

Meta-programmable Languages

- MetaOCaml: runtime code generation via code quoting
- Template Haskell: compile-time code generation via templates
- C++: compile-time code generation via templates

C++ meta-programming

- Relies on the Turing-complete C++ template sub-language
- Handles types and integral constants at compile time
- Classes and functions act as code quoting
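A classic minimal example of this compile-time sub-language: the factorial "program" runs entirely during template instantiation, leaving only a constant in the generated code.

```cpp
#include <cassert>

// Compile-time factorial: recursion happens during template
// instantiation; only the final constant survives in the binary.
template <unsigned N>
struct Factorial
{
    static constexpr unsigned value = N * Factorial<N - 1>::value;
};

template <>
struct Factorial<0>        // base case, selected by specialization
{
    static constexpr unsigned value = 1;
};

// Checked by the compiler itself, before any code runs:
static_assert(Factorial<5>::value == 120, "evaluated at compile time");
```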

12 of 41

Page 19: Software Abstractions for Parallel Hardware

The Expression Templates Idiom

Principles

- Relies on extensive operator overloading
- Carries semantic information around code fragments
- Introduces DSLs without disrupting the dev. chain

matrix x(h,w), a(h,w), b(h,w);
x = cos(a) + (b*a);

builds the expression type

expr< assign, expr<matrix&>
    , expr< plus
          , expr<cos, expr<matrix&>>
          , expr<multiplies, expr<matrix&>, expr<matrix&>>
          >
    >(x, a, b);

[AST: = at the root, x on the left, cos(a) + (b*a) on the right; arbitrary transforms are applied on the meta-AST]

which is evaluated as

#pragma omp parallel for
for(int j=0; j<h; ++j)
  for(int i=0; i<w; ++i)
    x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) );

General Principles of Expression Templates
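A minimal expression-templates sketch (far simpler than NT2 or Boost.Proto, all names illustrative): operator+ builds a lightweight AST node instead of computing, and evaluation happens in one fused loop inside operator=:

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

struct Expr {};   // tag: only our expression types take part in operator+

// AST node for an addition: stores references, computes nothing yet.
template <typename L, typename R>
struct Add : Expr
{
    const L& l; const R& r;
    Add(const L& l_, const R& r_) : l(l_), r(r_) {}
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec : Expr
{
    std::vector<double> d;
    explicit Vec(std::size_t n) : d(n) {}
    double  operator[](std::size_t i) const { return d[i]; }
    double& operator[](std::size_t i)       { return d[i]; }

    // Assignment walks the AST element-wise: one loop, no temporaries.
    template <typename E,
              typename = std::enable_if_t<std::is_base_of_v<Expr, E>>>
    Vec& operator=(const E& e)
    {
        for (std::size_t i = 0; i < d.size(); ++i) d[i] = e[i];
        return *this;
    }
};

// Overloaded + returns an AST node; constrained so built-in + is untouched.
template <typename L, typename R,
          typename = std::enable_if_t<std::is_base_of_v<Expr, L> &&
                                      std::is_base_of_v<Expr, R>>>
Add<L, R> operator+(const L& l, const R& r) { return Add<L, R>(l, r); }
```

Here x = a + b + c builds the type Add<Add<Vec,Vec>,Vec> at compile time; nothing is computed until operator= runs its single loop, which is exactly where a transform such as the OpenMP version above can be injected.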

13 of 41

Page 20: Software Abstractions for Parallel Hardware

The Expression Templates Idiom

Advantages

- Generic implementations become self-aware of optimizations
- The API abstraction level is arbitrarily high
- Accessible through high-level tools like Boost.Proto


13 of 41

Page 21: Software Abstractions for Parallel Hardware

Our Contributions

Our Strategy

- Applies DSEL generation techniques to parallel programming
- Maintains a low cost of abstraction through meta-programming
- Maintains the abstraction level via modern library design

Our contributions

Tools    Pub.      Scope                 Applications
Quaff    ParCo'06  MPI Skeletons         Real-time 3D reconstruction
SkellPU  PACT'08   Skeletons on Cell BE  Real-time image processing
BSP++    IJPP'12   MPI/OpenMP BSP        Bioinformatics, model checking
NT2      JPDC'14   Data-parallel MATLAB  Fluid dynamics, vision

14 of 41

Page 22: Software Abstractions for Parallel Hardware

Example of BSP++ Application
Khaled Hamidouche, PhD 2008-2011, in collab. with Univ. Brasilia

BSP Smith & Waterman

- S&W computes DNA sequence alignments
- The BSP++ implementation was written once and ran on 7 different hardware platforms
- Efficiency of 95+% even on a 6000-core supercomputer

Platform              MaxSize     # Elements      Speedup  GCUPs
cluster (MPI)         1,072,950   128 cores       73x      6.53
cluster (MPI/OpenMP)  1,072,950   128 cores       116x     10.41
OpenMP                1,072,950   16 cores        16x      0.40
Cell BE               85,603      8 SPEs          -        0.14
cluster of Cell BEs   85,603      24 SPEs (8:24)  2.8x     0.37
Hopper (MPI)          5,303,436   3072 cores      260x     3.09
Hopper (MPI+OpenMP)   24,894,269  6144 cores      5664x    15.5

15 of 41

Page 23: Software Abstractions for Parallel Hardware

Second Look at our Contributions

Development Limitations

- DSELs are mostly tied to the domain model
- Architecture support is often an afterthought
- Extensibility is difficult, as many refactorings are required per architecture
- Example: no proper way to support GPUs with these implementation techniques

Proposed Method

- Extends Generative Programming to take the architecture into account
- Provides an architecture-description DSEL
- Integrates this description into the code generation process

16 of 41


Page 25: Software Abstractions for Parallel Hardware

Architecture Aware Generative Programming

17 of 41

Page 26: Software Abstractions for Parallel Hardware

Software refactoring

Tools       Issues                     Changes
Quaff       Raw skeleton API           Re-engineered as part of NT2
SkellPU     Too architecture-specific  Re-engineered as part of NT2
BSP++       Integration issues         Integrate hybrid code generation
NT2         Not easily extendable      Integrate Quaff skeleton models
Boost.SIMD  -                          Side product of the NT2 restructuring

Conclusion

- Skeletons are fine as parallel middleware
- Model-based abstractions are not high-level enough
- For low-level architectures, the simplest model is often the best

18 of 41

Page 27: Software Abstractions for Parallel Hardware

Boost.SIMD
Pierre Estérie, PhD 2010-2014

Principles

- Provides a simple C++ API over SIMD extensions
- Supports all Intel, PPC, and ARM instruction sets
- Fully integrates with modern C++ idioms

Sparse Tridiagonal Solver - collaboration with M. Baboulin and Y. Wang

19 of 41

Page 28: Software Abstractions for Parallel Hardware

Talk Layout

Introduction

Abstractions & Efficiency

Experimental Results

Conclusion

20 of 41

Page 29: Software Abstractions for Parallel Hardware

The Numerical Template Toolbox
Pierre Estérie, PhD 2010-2014

NT2 as a Scientific Computing Library

- Provides a simple, MATLAB-like interface for users
- Provides high-performance computing entities and primitives
- Is easily extendable

Components

- Uses Boost.SIMD for in-core optimizations
- Uses recursive parallel skeletons
- Supports task parallelism through Futures

21 of 41

Page 30: Software Abstractions for Parallel Hardware

The Numerical Template Toolbox

Principles

- table<T,S> is a simple, multidimensional array object that exactly mimics MATLAB array behavior and functionality
- 500+ functions usable directly on tables or on any scalar value, as in MATLAB

How it works

- Take a .m file, copy it to a .cpp file
- Add #include <nt2/nt2.hpp> and make cosmetic changes
- Compile the file and link with libnt2.a

22 of 41


Page 34: Software Abstractions for Parallel Hardware

NT2 - From MATLAB to C++

MATLAB code:

A1  = 1:1000;
A2  = A1 + randn(size(A1));
X   = lu(A1*A1');
rms = sqrt( sum(sqr(A1(:) - A2(:))) / numel(A1) );

NT2 code:

table<double> A1 = _(1., 1000.);
table<double> A2 = A1 + randn(size(A1));
table<double> X  = lu( mtimes(A1, trans(A1)) );
double rms = sqrt( sum(sqr(A1(_) - A2(_))) / numel(A1) );

23 of 41

Page 35: Software Abstractions for Parallel Hardware

Parallel Skeletons extraction process

A = B / sum(C+D);

[AST of A = B / sum(C+D): the sum(C+D) subtree is recognized as a fold skeleton; the division and assignment form a transform skeleton]

24 of 41

Page 36: Software Abstractions for Parallel Hardware

Parallel Skeletons extraction process

A = B / sum(C+D);

[AST split at the fold boundary: tmp = sum(C+D) becomes a fold skeleton, then A = B / tmp becomes a transform skeleton]

25 of 41

Page 37: Software Abstractions for Parallel Hardware

From data to task parallelism
Antoine Tran Tan, PhD 2012-2015

Limits of the fork-join model

- Synchronization costs due to implicit barriers
- Under-exploitation of potential parallelism
- Poor data locality and no inter-statement optimization

Skeletons from the Future

- Adapt current skeletons for taskification
- Use Futures (or HPX) to automatically pipeline
- Derive a dependency graph between statements
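The idea can be sketched with std::async: each statement becomes a task holding futures of the statements it depends on, so the dependency graph rather than program order drives execution (illustrative only; the actual work uses the NT2 skeleton runtime or HPX, not this exact code):

```cpp
#include <cassert>
#include <future>

// Two independent "statements" run as concurrent tasks; the third
// consumes their futures, so it starts as soon as its actual inputs
// are ready - no global barrier between statements.
int pipelined()
{
    auto t1 = std::async(std::launch::async, [] { return 6; });  // tmp1 = ...
    auto t2 = std::async(std::launch::async, [] { return 7; });  // tmp2 = ...

    // Dependent statement: waits only on the values it really needs.
    auto t3 = std::async(std::launch::async,
                         [&] { return t1.get() * t2.get(); });   // A = tmp1*tmp2
    return t3.get();
}
```

Compared with fork-join, nothing forces t1 and t2 to finish together: a statement depending only on t1 could already be running while t2 is still computing.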

26 of 41


Page 39: Software Abstractions for Parallel Hardware

Parallel Skeletons extraction process - Take 2

A = B / sum(C+D);

[The extracted statements: tmp = sum(C+D) as a fold skeleton and A = B / tmp as a transform skeleton, now schedulable as separate tasks]

27 of 41

Page 40: Software Abstractions for Parallel Hardware

Parallel Skeletons extraction process - Take 2

A = B / sum(C+D);

[Diagram: each statement is split column-wise into tasks - fold+SIMD workers compute tmp(k) from C(:,k) + D(:,k), transform+SIMD workers compute A(:,k) = B(:,k) / tmp(k); OpenMP spawners dispatch the workers, and the tmp(k) dependencies pipeline the two statements]

28 of 41

Page 41: Software Abstractions for Parallel Hardware

Motion Detection
Lacassagne et al., ICIP 2009

- Sigma-Delta algorithm based on background subtraction
- Uses a local Gaussian model of lightness variation to detect motion
- Challenge: very low arithmetic density
- Challenge: integer-based implementation with a small value range

29 of 41

Page 42: Software Abstractions for Parallel Hardware

Motion Detection

table<char> sigma_delta( table<char>& background
                       , table<char> const& frame
                       , table<char>& variance )
{
  // Estimate Raw Movement
  background = selinc( background < frame
                     , seldec( background > frame, background ) );

  table<char> diff = dist(background, frame);

  // Compute Local Variance
  table<char> sig3 = muls(diff, 3);
  variance = if_else( diff != 0
                    , selinc( variance < sig3
                            , seldec( variance > sig3, variance ) )
                    , variance );

  // Generate Movement Label
  return if_zero_else_one( diff < variance );
}

30 of 41

Page 43: Software Abstractions for Parallel Hardware

Motion Detection

[Chart: cycles/element for 512x512 and 1024x1024 images - configurations: SCALAR, HALF CORE, FULL CORE, SIMD, JRTIP2008, SIMD + HALF CORE, SIMD + FULL CORE; speedups over scalar range from x2.1 up to x18]

31 of 41

Page 44: Software Abstractions for Parallel Hardware

Black and Scholes Option Pricing

NT2 code:

table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta, table<float> const& ra
                         , table<float> const& va )
{
  table<float> da = sqrt(Ta);
  table<float> d1 = (log(Sa/Xa) + (sqr(va)*0.5f + ra)*Ta) / (va*da);
  table<float> d2 = d1 - va*da;

  return Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2);
}

32 of 41

Page 45: Software Abstractions for Parallel Hardware

Black and Scholes Option Pricing

NT2 code with loop fusion:

table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta, table<float> const& ra
                         , table<float> const& va )
{
  // Preallocate temporary tables
  table<float> da(extent(Ta)), d1(extent(Ta)), d2(extent(Ta)), R(extent(Ta));

  // tie merges the loop nests and increases cache locality
  tie(da, d1, d2, R) = tie( sqrt(Ta)
                          , (log(Sa/Xa) + (sqr(va)*0.5f + ra)*Ta) / (va*da)
                          , d1 - va*da
                          , Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2) );
  return R;
}
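What tie buys is one fused traversal: each element flows through the whole chain once instead of four passes with temporaries. A scalar sketch of the loop this effectively generates (illustrative: the ra and va tables become scalars r and v here, and normcdf is assumed to be the standard normal CDF, written with std::erfc):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical scalar rendition of the fused Black & Scholes kernel:
// one pass over the data, all four statements chained per element.
std::vector<float> blackscholes_fused(const std::vector<float>& Sa,
                                      const std::vector<float>& Xa,
                                      const std::vector<float>& Ta,
                                      float r, float v)
{
    // Normal CDF via the complementary error function (assumption).
    auto normcdf = [](float x) {
        return 0.5f * std::erfc(-x / std::sqrt(2.f));
    };

    std::vector<float> R(Sa.size());
    for (std::size_t i = 0; i < Sa.size(); ++i) {   // single fused loop
        float da = std::sqrt(Ta[i]);
        float d1 = (std::log(Sa[i]/Xa[i]) + (v*v*0.5f + r)*Ta[i]) / (v*da);
        float d2 = d1 - v*da;
        R[i] = Sa[i]*normcdf(d1) - Xa[i]*std::exp(-r*Ta[i])*normcdf(d2);
    }
    return R;
}
```

Each intermediate (da, d1, d2) now lives in a register instead of a full-size temporary table, which is where the cache-locality gain comes from.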

32 of 41

Page 46: Software Abstractions for Parallel Hardware

Black and Scholes Option Pricing - Performance

[Chart: cycles/value for 1,000,000 elements - speedups over scalar: SSE2 x1.89, AVX2 x2.91, SSE2 + 4 cores x5.58, AVX2 + 4 cores x6.30]

33 of 41

Page 47: Software Abstractions for Parallel Hardware

Black and Scholes Option Pricing - Performance with loop fusion/futurization

[Chart: cycles/value for 1,000,000 elements - speedups over scalar: SSE2 x2.27, AVX2 x4.13, SSE2 + 4 cores x8.05, AVX2 + 4 cores x11.12]

34 of 41

Page 48: Software Abstractions for Parallel Hardware

LU Decomposition - Algorithm

[Diagram: tiled LU decomposition over blocks A00..A22, proceeding in 7 steps with the kernels DGETRF, DGESSM, DTSTRF and DSSSSM, the trailing submatrix being updated recursively]

35 of 41

Page 49: Software Abstractions for Parallel Hardware

LU Decomposition - Performance

[Chart: median GFLOPS vs number of cores (up to 48) for an 8000x8000 LU decomposition - NT2 vs Intel MKL]

36 of 41

Page 50: Software Abstractions for Parallel Hardware

Talk Layout

Introduction

Abstractions & Efficiency

Experimental Results

Conclusion

37 of 41

Page 51: Software Abstractions for Parallel Hardware

Conclusion

Parallel Computing for Scientists

- Software libraries built as generic and generative components can solve a large chunk of parallelism-related problems while being easy to use
- Like regular languages, DSELs need information about the hardware system
- Integrating hardware descriptions as generic components increases tool portability and re-targetability

Our Achievements

- A new method for parallel software development
- Efficient libraries working on a large subset of hardware
- High performance across a wide application spectrum

38 of 41


Page 53: Software Abstractions for Parallel Hardware

Works in Progress

Application to Accelerators

- Exploration of proper skeleton implementations on GPUs
- Adaptation of the Future-based code generator
- In progress with Ian Masliah's PhD thesis

Parallelism within C++

- SIMD as part of the standard library
- Proposal N3571 for standard SIMD computation
- Interoperability with the current parallel model of C++

39 of 41

Page 54: Software Abstractions for Parallel Hardware

Perspectives

DSELs as a C++ first-class idiom

- Build partial evaluation into the language
- Ease the transition between regular and meta C++
- Mid-term prospect: MetaOCaml-like quoting for C++

DSELs and compilers relationship

- C++ DSELs hit a limit in their applicability
- Compilers often lack the high-level information needed for proper optimization
- Mid-term prospect: hybrid library/compiler approaches for DSELs

40 of 41

Page 55: Software Abstractions for Parallel Hardware

Thanks for your attention