
Page 1: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication

Portable, usable, and efficient sparse matrix–vector multiplication

Albert-Jan Yzelman, Parallel Computing and Big Data

Huawei Technologies France

30th of November, 2016

A. N. Yzelman

Page 2: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication

Outline

1 Shared-memory architectures

2 (Shared-memory) parallel programming models

3 Application to sparse computing

A. N. Yzelman

Page 3: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Shared-memory architectures

Shared-memory architectures

1 Shared-memory architectures

2 (Shared-memory) parallel programming models

3 Application to sparse computing

A. N. Yzelman

Page 4: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Shared-memory architectures

Moore’s Law (Waldrop, Nature vol. 530, 2016)

A. N. Yzelman

Page 5: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Shared-memory architectures

Hardware trends: core count

Transistor counts continue to double every two years (at least until 2021, cf. Waldrop), yet clock speeds have stalled. Why?

Graphics from Ron Maltiel, Maltiel Consulting (http://www.maltiel-consulting.com/ISSCC-2013-High-Performance-Digital-Trends.html)

M. W. Waldrop, “The chips are down for Moore’s Law”; Nature, Vol. 530, pp. 144–147, 2016.

A. N. Yzelman


Page 7: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Shared-memory architectures

Hardware trends: bandwidth

CPU speeds stall, but Moore’s Law now translates to an increasing amount of cores per die, i.e., the effective flop rate of processors still rises.

But what about bandwidth?

Technology   Year          Bandwidth     #cores
EDO          1970s         27 Mbyte/s    1
SDRAM        early 1990s   53 Mbyte/s    1
RDRAM        mid 1990s     1.2 Gbyte/s   1
DDR          2000          1.6 Gbyte/s   1
DDR2         2003          3.2 Gbyte/s   2
DDR3         2007          6.4 Gbyte/s   4
DDR3         2013          11 Gbyte/s    8
DDR4         2015          25 Gbyte/s    14

The effective bandwidth per core has, at best, stalled...

A. N. Yzelman

Page 8: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Shared-memory architectures

Arithmetic intensity

[Roofline plot: attainable GFLOP/s versus arithmetic intensity (FLOP/byte); performance is capped by the peak memory bandwidth at low intensities and by the peak floating-point rate (with vectorization) at high intensities.]

If your computation has enough work per data element, it is compute bound; otherwise, it is bandwidth bound.

If you are bandwidth bound, reducing your memory footprint, e.g., by compression, directly results in faster execution.

(Image courtesy of Prof. Wim Vanroose, UA)
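As a rough illustration (not from the slides), the roofline bound for an SpMV-like kernel can be evaluated in a few lines of C++; the peak numbers are placeholders, and the intensity assumes roughly two flops per twelve bytes streamed for a CRS-like format.

    #include <algorithm>
    #include <cstdio>

    int main() {
        const double peak_flops = 500e9;      // hypothetical peak compute, FLOP/s
        const double peak_bw    = 50e9;       // hypothetical peak memory bandwidth, byte/s
        const double intensity  = 2.0 / 12.0; // ~2 flops per ~12 bytes for CRS-like SpMV
        const double attainable = std::min( peak_flops, intensity * peak_bw );
        std::printf( "attainable: %.1f GFLOP/s (machine balance: %.2f FLOP/byte)\n",
                     attainable / 1e9, peak_flops / peak_bw );
        return 0;
    }

With these example numbers the memory roof dominates: SpMV sits far left of the ridge point, so compression directly buys performance.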

A. N. Yzelman

Page 9: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Shared-memory architectures

Multi-socket architectures: NUMA

Each socket has local main memory where access is fast. Between sockets, access is slower.

[Diagram: multiple CPUs, each attached to its own local memory.]

Access times to shared memory depend on the physical location, leading to non-uniform memory access (NUMA).

A. N. Yzelman

Page 10: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Shared-memory architectures

Dealing with NUMA: distribution types

Implicit distribution, centralised local allocation:

If each processor moves data to the same single memory element, the bandwidth is limited by that of a single memory controller.

A. N. Yzelman

Page 11: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Shared-memory architectures

Dealing with NUMA: distribution types

Implicit distribution, centralised interleaved allocation:

If each processor moves data from all memory elements, the bandwidth multiplies if accesses are uniformly random.

A. N. Yzelman

Page 12: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Shared-memory architectures

Dealing with NUMA: distribution types

Explicit distribution, distributed local allocation:

If each processor moves data from and to its own unique memory element, the bandwidth multiplies.
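On Linux, one common way to realise such an explicit, distributed placement in shared memory is first-touch allocation: a page is mapped onto the NUMA node of the thread that first writes to it. A minimal sketch (not the slides' code), assuming threads are pinned to sockets, e.g. via OMP_PROC_BIND:

    #include <cstddef>

    // Allocate without initialising (pages are not yet mapped to a node), then
    // initialise in parallel: under the first-touch policy each page ends up on
    // the NUMA node of the thread that writes it first, which should be the
    // thread that later uses that part of the vector.
    double * numa_aware_alloc( const std::size_t n ) {
        double * const x = new double[ n ];
        #pragma omp parallel for schedule(static)
        for( std::ptrdiff_t i = 0; i < static_cast< std::ptrdiff_t >( n ); ++i ) {
            x[ i ] = 0.0;   // the first touch decides the placement
        }
        return x;
    }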

A. N. Yzelman

Page 13: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Shared-memory architectures

Caches

Divide the main memory (RAM) into stripes (lines) of size L_S.

[Diagram: main memory (RAM), divided into lines that map onto the cache.]

The i-th line in RAM is mapped to the cache line i mod L, where L is the number of available cache lines. Can we do better?
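A toy illustration (not from the slides) of this direct-mapped i mod L placement, before considering smarter policies:

    #include <cstdio>

    int main() {
        const unsigned long LS = 64;               // cache line size in bytes (example value)
        const unsigned long L  = 8;                // number of cache lines (example value)
        const unsigned long address = 4096 + 72;   // some byte address
        const unsigned long ram_line   = address / LS;  // the i-th line in RAM
        const unsigned long cache_line = ram_line % L;  // mapped to line i mod L
        std::printf( "byte %lu lies in RAM line %lu, which maps to cache line %lu\n",
                     address, ram_line, cache_line );
        return 0;
    }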

A. N. Yzelman

Page 14: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Shared-memory architectures

Caches

A smarter cache follows a pre-defined policy instead; for instance, the ‘Least Recently Used’ (LRU) policy:

After requesting x1, . . . , x4 the LRU stack holds [x4 x3 x2 x1]; requesting x2 moves it to the front, giving [x2 x4 x3 x1]; requesting x5 then evicts the least recently used element x1, giving [x5 x2 x4 x3].
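A minimal sketch of such an LRU stack of capacity S (illustrative only, not the slides' code):

    #include <cstddef>
    #include <list>
    #include <unordered_map>

    // Tracks the S most recently requested elements; a request moves the element
    // to the front of the stack and evicts the least recently used one on overflow.
    class LRUStack {
        std::size_t capacity;
        std::list< int > stack;                                    // front = most recently used
        std::unordered_map< int, std::list< int >::iterator > pos;
    public:
        explicit LRUStack( const std::size_t S ) : capacity( S ) {}
        bool request( const int x ) {                              // returns true on a cache hit
            const auto it = pos.find( x );
            if( it != pos.end() ) {
                stack.splice( stack.begin(), stack, it->second );  // hit: move to the front
                return true;
            }
            if( stack.size() == capacity ) {                       // miss on a full stack:
                pos.erase( stack.back() );                         //   evict the LRU element
                stack.pop_back();
            }
            stack.push_front( x );
            pos[ x ] = stack.begin();
            return false;
        }
    };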

A. N. Yzelman

Page 15: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Shared-memory architectures

Caches

Realistic caches combine modulo-mapping and the LRU policy:

[Diagram: lines of main memory (RAM) map, via modulo mapping, onto the cache, which is split into subcaches that are each managed as an LRU stack.]

k is the number of subcaches; there are L/k LRU stacks.

A. N. Yzelman

Page 16: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Shared-memory architectures

Caches

Realistic caches are used within multi-level memory hierarchies:

[Diagram: a CPU with an L1 and an L2 cache in front of main memory (RAM).]

Intel Core2 (Q6600):       L1: 32kB, k = 8;     L2: 4MB, k = 16;      L3: -
AMD Phenom II (945e):      S = 64kB, k = 2;     S = 512kB, k = 8;     S = 6MB, k = 48
Intel Westmere (E7-2830):  S = 256kB, k = 8;    S = 2MB, k = 8;       S = 24MB, k = 24

A. N. Yzelman

Page 17: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Shared-memory architectures

Caches and multiplication

Dense matrix–vector multiplication

    [ a00 a01 a02 a03 ]   [ x0 ]   [ y0 ]
    [ a10 a11 a12 a13 ] · [ x1 ] = [ y1 ]
    [ a20 a21 a22 a23 ]   [ x2 ]   [ y2 ]
    [ a30 a31 a32 a33 ]   [ x3 ]   [ y3 ]

Example with LRU caching and S = 4; the cache contents after successive accesses evolve as

    {x0} ⇒ {a00, x0} ⇒ {y0, a00, x0} ⇒ {x1, y0, a00, x0} ⇒ {a01, x1, y0, a00, x0} ⇒ {y0, a01, x1, a00, x0}

A. N. Yzelman


Page 23: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Shared-memory architectures

Caches and multiplication: NUMA again

When k and L are larger, we can still predict:

lower elements from x are evicted while processing the first row. This causes O(n) cache misses on each of the remaining m − 1 rows.

Fix:

stop processing a row before an element from x would be evicted; first continue with the next rows.

This results in column-wise ‘stripes’ of the dense A:

[Figure: A drawn as a sequence of column-wise stripes.]

A. N. Yzelman


Page 25: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Shared-memory architectures

Caches and multiplication: NUMA again

[Figure: A drawn as a sequence of column-wise stripes.]

But now:

elements from the vector y can be prematurely evicted; O(m) cache misses on each block of columns. (Already much better!)

Fix:

stop processing before an element from y is evicted; first do the remaining column blocks.

This is cache-aware blocking.
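As an illustration of the resulting loop structure (a sketch, not the slides' implementation), cache-aware blocking of the dense multiplication becomes a doubly blocked loop, with block sizes BI and BJ tuned so that the touched parts of y and x stay in cache:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // y += A * x for a dense, row-major m x n matrix A, processed in blocks of
    // BI rows by BJ columns: each column block reuses a small part of x, and
    // each row block within it reuses a small part of y.
    void blocked_mxv( const std::vector< double > &A, const std::vector< double > &x,
                      std::vector< double > &y, const std::size_t m, const std::size_t n,
                      const std::size_t BI, const std::size_t BJ ) {
        for( std::size_t jb = 0; jb < n; jb += BJ ) {            // column block (part of x)
            for( std::size_t ib = 0; ib < m; ib += BI ) {        // row block (part of y)
                const std::size_t iend = std::min( ib + BI, m );
                const std::size_t jend = std::min( jb + BJ, n );
                for( std::size_t i = ib; i < iend; ++i ) {
                    double acc = y[ i ];
                    for( std::size_t j = jb; j < jend; ++j ) {
                        acc += A[ i * n + j ] * x[ j ];
                    }
                    y[ i ] = acc;
                }
            }
        }
    }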

A. N. Yzelman


Page 27: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Shared-memory architectures

Caches and multicore

Most architectures employ shared caches.

[Diagram: four cores, each with a private 64kB L1 and 512kB L2 cache, sharing a 6MB L3 cache behind the system interface.]

In BSP: (4, 3 GHz, l, g). Is this a correct view?

A. N. Yzelman


Page 29: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Shared-memory architectures

Caches and multicore: NUMA

[Diagram: four cores, each with a private 32kB L1 cache, split over two 4MB L2 caches that connect through the system interface.]

In BSP: (4, 2.4 GHz, l, g), but with non-uniform memory access!

A. N. Yzelman

Page 30: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

(Shared-memory) parallel programming models

1 Shared-memory architectures

2 (Shared-memory) parallel programming models

3 Application to sparse computing

A. N. Yzelman

Page 31: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

Shared-memory programming intro

Suppose x and y are in a shared memory. We calculate an inner product in parallel, using the cyclic distribution.

Input: s, the current processor ID; p, the total number of processors (threads); n, the size of the input vectors.

Output: x^T y

Shared-memory SPMD program with ‘double α;’ globally allocated:

    α = 0.0
    for i = s to n step p
        α += x_i · y_i
    return α

Data race! (for n = p = 2, the output can be x_0·y_0, x_1·y_1, or x_0·y_0 + x_1·y_1)

A. N. Yzelman


Page 33: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

Shared-memory programming intro

Suppose x and y are in a shared memory. We calculate an inner product in parallel, using the cyclic distribution.

Input: s, the current processor ID; p, the total number of processors (threads); n, the size of the input vectors.

Output: x^T y

Shared-memory SPMD program with ‘double α[p];’ globally allocated:

    for i = s to n step p
        α_s += x_i · y_i
    synchronise
    return ∑_{i=0}^{p−1} α_i

False sharing! (processors access and update the same cache lines)

A. N. Yzelman



Page 36: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

Shared-memory programming intro

Suppose x and y are in a shared memory. We calculate an inner product in parallel, using the cyclic distribution.

Input: s, the current processor ID; p, the total number of processors (threads); n, the size of the input vectors.

Output: x^T y

Shared-memory SPMD program with ‘double α[8p];’ globally allocated:

    for i = s to n step p
        α_{8s} += x_i · y_i
    synchronise
    return ∑_{i=0}^{p−1} α_{8i}

Inefficient cache use!
(All threads access virtually all cache lines associated with x, y; Θ(p·n) data movement)

A. N. Yzelman

Page 37: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

Shared-memory programming intro

Suppose x and y are in a shared memory. We calculate an inner product in parallel, using the cyclic distribution.

Input: s, the current processor ID; p, the total number of processors (threads); n, the size of the input vectors.

Output: x^T y

Shared-memory SPMD program with ‘double α[8p];’ globally allocated:

    for i = s·⌈n/p⌉ to (s + 1)·⌈n/p⌉
        α_{8s} += x_i · y_i
    synchronise
    return ∑_{i=0}^{p−1} α_{8i}

(Now inefficiency only at boundaries; O(n + p − 1) data movement)
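A sketch of this final variant in C++ (illustrative only, using std::thread rather than the slides' SPMD framework): each thread handles a contiguous block of the vectors and writes its partial sum to a padded slot, so no two threads share a cache line.

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Blocked distribution with padded partial sums: thread s handles the range
    // [ s*ceil(n/p), (s+1)*ceil(n/p) ) and writes its result to alpha[8*s], so
    // that partial sums of different threads lie on different cache lines
    // (8 doubles = 64 bytes).
    double parallel_dot( const std::vector< double > &x, const std::vector< double > &y,
                         const unsigned p ) {
        const std::size_t n = x.size();
        const std::size_t block = ( n + p - 1 ) / p;          // ceil( n / p )
        std::vector< double > alpha( 8 * p, 0.0 );
        std::vector< std::thread > threads;
        for( unsigned s = 0; s < p; ++s ) {
            threads.emplace_back( [ &, s ]() {
                const std::size_t begin = s * block;
                const std::size_t end   = std::min( ( s + 1 ) * block, n );
                double local = 0.0;
                for( std::size_t i = begin; i < end; ++i ) {
                    local += x[ i ] * y[ i ];
                }
                alpha[ 8 * s ] = local;
            } );
        }
        for( auto &t : threads ) { t.join(); }                // 'synchronise'
        double result = 0.0;
        for( unsigned s = 0; s < p; ++s ) { result += alpha[ 8 * s ]; }
        return result;
    }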

A. N. Yzelman


Page 39: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

Speedup

Definition (Speedup)

Let Tseq be the sequential running time required for solving a problem. Let Tp be the running time of a parallel algorithm using p processes, solving the same problem. Then the speedup is given by

S = Tseq/Tp.

Scalable in time:

Ideally, S = p; if we are lucky, S > p; realistically, 1 ≤ S < p; if we do very badly, S < 1.

A. N. Yzelman


Page 41: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

Maximum attainable speedup

Consider a graph G = (V, E) of a given algorithm, e.g.,

Nodes correspond to data, edges indicate which data is combined to generate a certain output.

Question: If we had an infinite number of processors, how fast would we be able to run the algorithm shown on the right?

Answer: Tseq = |V| = 9, while the critical path length T∞ equals 4.

The maximum speedup hence is:

Tseq/T∞ = 9/4.

[DAG figure: the input a feeds f(a) and g(a), f(a) feeds h(f(a)), and so on; 9 nodes in total, with a critical path of length 4.]

A. N. Yzelman

Page 42: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

What is parallelism?

Definition (Parallelism)

The parallelism of a given algorithm is its maximum attainable speedup:

Tseq/T∞.

T∞ is known as the critical path length or the algorithmic span.

This leads to a theoretical upper bound on speedup:

S = Tseq/Tp ≤ Tseq/T∞.

This type of analysis forms the basis of

fine-grained parallel computation.

A. N. Yzelman

Page 43: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

Fine-grained parallel computing

Decompose a problem into many small tasks that run concurrently (as much as possible). A run-time scheduler assigns tasks to processes.

What is small? Grain-size.

Performance model? Parallelism.

Algorithms can be implemented as graphs, explicitly or implicitly:

Intel: Threading Building Blocks (TBB),

OpenMP,

Intel / MIT / Cilk Arts: Cilk,

Google: Pregel,

. . .

By contrast, BSP computing is coarse-grained.

A. N. Yzelman

Page 44: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

Cilk

Only two parallel programming primitives:

1 (binary) fork, and

2 (binary) join.

A. N. Yzelman

Page 45: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

Cilk

Example: calculate x4 from xn = xn−2 + xn−1 given x0 = x1 = 1:

    #include <stdio.h>
    #include <cilk/cilk.h>

    int f( int n ) {
        if( n == 0 || n == 1 ) return 1;
        int x1 = cilk_spawn f( n-1 ); // fork
        int x2 = cilk_spawn f( n-2 ); // fork
        cilk_sync;                    // join
        return x1 + x2;
    }

    int main() {
        int x4 = f( 4 );
        printf( "x_4 = %d\n", x4 );
        return 0;
    }

A. N. Yzelman

Page 46: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

Cilk

Spawned function calls are assigned to one of the available processes by the Cilk run-time scheduler.

The Cilk scheduler guarantees, under some assumptions on the determinism of the algorithm, that the parallel run time Tp is bounded by

O(T1/p + T∞).

Not all run-time schedulers have such guarantees, and Cilk is but one of the many fine-grained parallel programming frameworks.

A. N. Yzelman

Page 47: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

Is parallelism the way to go?

Example:

Consider the naive Θ(n²) Fourier transformation; its span is Θ(log n), so its parallelism is Θ(n²/log n). Lots of parallelism!

The FFT formulation has work Θ(n log n), also with span Θ(log n), resulting in Θ(n log n / log n) = Θ(n) parallelism. Less parallelism...

Which would you prefer?

A. N. Yzelman


Page 49: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

Is parallelism the way to go?

Is there a difference between considering Tseq or T1?

Yes(!)

There may be multiple sequential algorithms to solve the same problem. When comparing, always compare to the best. For parallel sorting:

S_{odd-even sort} = T_{seq}^{qsort} / T_p^{odd-even sort}.

A. N. Yzelman

Page 50: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

Is parallelism the way to go?

Are there any other issues that may be overlooked when focusing solely on parallelism?

A. N. Yzelman

Page 51: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

Overheads

Definition (Overhead)

The overhead of parallel computation is any extra effort expended over the original amount of work Tseq:

To = pTp − Tseq.

The parallel computation time can be expressed in To:

Tp = (Tseq + To) / p.

Cilk: To = pT∞.

BSP: To = p·(∑_{i=0}^{N−1} h_i·g + l) + p·∑_{i=0}^{N−1} max_s w_i^(s) − Tseq.
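As a small illustration (not from the slides), To can be computed from per-superstep measurements by combining the standard BSP cost Tp = ∑_i (max_s w_i^(s) + h_i·g + l) with the definition To = p·Tp − Tseq above; all names below are placeholders.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // w[i][s]: work of process s in superstep i; h[i]: h-relation of superstep i.
    double bsp_overhead( const std::vector< std::vector< double > > &w,
                         const std::vector< double > &h,
                         const double g, const double l, const double T_seq ) {
        const std::size_t N = h.size();
        const std::size_t p = w.empty() ? 0 : w[ 0 ].size();
        double T_p = 0.0;
        for( std::size_t i = 0; i < N; ++i ) {
            const double w_max = *std::max_element( w[ i ].begin(), w[ i ].end() );
            T_p += w_max + h[ i ] * g + l;                       // BSP cost of superstep i
        }
        return static_cast< double >( p ) * T_p - T_seq;         // T_o = p*T_p - T_seq
    }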

Data movement, latency costs, and extra computations on top of the bare minimum required should be modelled. This enables parallel algorithm design, instead of simply enabling parallel execution.

A. N. Yzelman

Page 52: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

MapReduce

Most parallel programming models only consider algorithmic structure: fork-join, parallel for, dataflows, etc. One other is MapReduce, which operates on key-value pairs from

K × V,

with K a set of possible keys and V a set of values. MapReduce defines two operations:

mapper: K × V → P(K × V );

reducer: K × P(V) → V.

The mapper applies to all key-value pairs; embarrassingly parallel.

The reducer applies to a single key only; requires global communication.

Shuffling refers to the sorting-by-keys prior to the reduction stage.
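A toy illustration (not from the slides) of the two signatures, computing word counts in C++; a real framework would distribute the map, shuffle, and reduce phases across machines.

    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    using KV = std::pair< std::string, int >;

    // mapper: K x V -> P( K x V ); here: emit (word, 1) for every word in a line.
    std::vector< KV > mapper( const std::string & /*key*/, const std::string &line ) {
        std::vector< KV > out;
        std::istringstream words( line );
        std::string word;
        while( words >> word ) { out.push_back( { word, 1 } ); }
        return out;
    }

    // reducer: K x P(V) -> V; here: sum all counts that were shuffled to one key.
    int reducer( const std::string & /*key*/, const std::vector< int > &values ) {
        int sum = 0;
        for( const int v : values ) { sum += v; }
        return sum;
    }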

Page 53: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

Pregel

Consider a graph G = (V, E). Graph algorithms may be phrased as follows (a sketch follows after the list):

For each vertex v ∈ V, a thread executes a user-defined SPMD algorithm;

each algorithm consists of successive local compute phases and global communication phases;

during a communication phase, a vertex v can only send messages to N(v), where N(v) is the set of neighbouring vertices of v; i.e., N(v) = {w ∈ V | {v, w} ∈ E}.
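A rough sketch of that vertex-centric structure in C++ (illustrative only; real Pregel-like systems add combiners, halting votes, and distribution over machines):

    #include <cstddef>
    #include <vector>

    struct Vertex {
        double value;
        std::vector< std::size_t > neighbours;   // N(v)
    };

    // One superstep: every vertex consumes the messages received during the
    // previous superstep, updates its local value, and sends new messages to
    // its neighbours N(v) only; the swap at the end acts as the global barrier.
    void superstep( std::vector< Vertex > &graph,
                    std::vector< std::vector< double > > &inbox ) {
        std::vector< std::vector< double > > outbox( graph.size() );
        for( std::size_t v = 0; v < graph.size(); ++v ) {          // local compute phase
            double sum = 0.0;
            for( const double msg : inbox[ v ] ) { sum += msg; }
            graph[ v ].value = sum;
            for( const std::size_t w : graph[ v ].neighbours ) {   // communication phase,
                outbox[ w ].push_back( graph[ v ].value );         //   restricted to N(v)
            }
        }
        inbox.swap( outbox );
    }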

MapReduce and Pregel are variants of the BSP algorithm model,

a type of fine-grained BSP.

A. N. Yzelman

Page 54: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

Spark

High-level language for large-scale computing: resilient, scalable, huge uptake; expressive and easy to use:

    scala> val A = sc.textFile( "A.txt" ).map( x => x.split( " " ) match {
             case Array( a, b, c ) => ( a.toInt - 1, b.toInt - 1, c.toDouble )
           } ).groupBy( x => x._1 );
    A: org.apache.spark.rdd.RDD[ (Int, Iterable[(Int, Int, Double)]) ] = ShuffledRDD[8] ...

    scala>

Spark is implemented in Scala, runs on the JVM, relies on serialisation, and commonly uses HDFS for distributed and resilient storage.

Ref.: Zaharia, M. (2013). An architecture for fast and general data processing on large clusters. Dissertation, UCB.

A. N. Yzelman

Page 55: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

Spark

Concepts:

RDDs are fine-grained data distributed by hashing;

Transformations (map, filter, groupBy) are lazy operators;

DAGs thus formed are resolved by actions: reduce, collect, ...

Computations are offloaded as close to the data as possible;

all-to-all data shuffles for communication required by actions.

A. N. Yzelman

Page 56: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

Bridging HPC and Big Data

Platforms like Spark essentially do PRAM simulation:

automatic mode vs. direct mode; ease-of-use vs. performance

Ref.: Valiant, L. G. (1990). A bridging model for parallel computation. Communications of the ACM, 33(8).

Both are scalable as long as the shuffle remains well balanced:

“First, [..] we show that RDDs can emulate any distributed system, and will do so efficiently as long as the system tolerates some network latency [..] because, once augmented with fast data sharing, MapReduce can emulate the BSP model of parallel computing, with the main drawback being the latency of each MapReduce step.” – from the PhD thesis introducing Spark, page 4.

Ref.: Zaharia, M. (2013). An architecture for fast and general data processing on large clusters. Dissertation, UCB.

A. N. Yzelman

Page 57: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

Multi-BSP

BSP itself is evolving as well:

Multi-BSP computer = p ( subcomputers or processors ) +

M bytes of local memory +

an interconnect

A total of 4L parameters: (p0, g0, l0,M0, . . . , pL−1, gL−1, lL−1,ML−1).

memory-aware,

non-uniform!

However,

harder to prove optimality(?)

L. G. Valiant, A bridging model for multi-core computing, 2008, 2011.

A. N. Yzelman


Page 60: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models

Multi-BSP

An example with L = 4 quadlets (p, g, l, M):

A. N. Yzelman

Page 61: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Application to sparse computing

Application to sparse computing

1 Shared-memory architectures

2 (Shared-memory) parallel programming models

3 Application to sparse computing

A. N. Yzelman

Page 62: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Application to sparse computing

Problem setting

Given a sparse m × n matrix A, and corresponding vectors x , y .

How to calculate y = Ax as fast as possible?

How to make the code usable for the 99%?

Figure: Wikipedia link matrix (’07) with on average ≈ 12.6 nonzeroes per row.

A. N. Yzelman

Page 63: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Application to sparse computing

Problem setting

Shared-memory central obstacles for SpMV multiplication:

inefficient cache use,

limited memory bandwidth, and

non-uniform memory access (NUMA).

Distributed-memory:

inefficient network use.

Shared-memory and distributed-memory share their objectives:

cache misses == communication volume

Ref.: Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods, by A. N. Yzelman & Rob H. Bisseling, SIAM Journal of Scientific Computation 31(4), pp. 3128-3154 (2009).

A. N. Yzelman


Page 65: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Application to sparse computing

Inefficient cache use

Visualisation of the SpMV multiplication Ax = y with nonzeroes processed in row-major order:

Accesses on the input vector are completely unpredictable.

A. N. Yzelman

Page 66: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Application to sparse computing

Enhanced cache use: nonzero reorderings

Blocking to cache subvectors, and cache-oblivious traversals.

Other approaches: no blocking (Haase et al.), Morton Z-curves and bisection (Martone et al.), Z-curve within blocks (Buluc et al.), composition of low-level blocking (Vuduc et al.), ...

Ref.: Yzelman and Roose, “High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication”, IEEE Transactions on Parallel and Distributed Systems, doi: 10.1109/TPDS.2013.31 (2014).

A. N. Yzelman

Page 67: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Application to sparse computing

Enhanced cache use: nonzero reorderings

Blocking to cache subvectors, and cache-oblivious traversals.

Sequential SpMV multiplication on the Wikipedia ’07 link matrix: 345 (CRS), 203 (Hilbert), 245 (blocked Hilbert) ms/mul.

Ref.: Yzelman and Roose, “High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication”, IEEE Transactions on Parallel and Distributed Systems, doi: 10.1109/TPDS.2013.31 (2014).

A. N. Yzelman

Page 68: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Application to sparse computing

Enhanced cache use: matrix permutations


(Upper bound on) the number of cache misses: ∑_i (λ_i − 1)

Ref.: Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods, by A. N. Yzelman & Rob H. Bisseling, SIAM Journal of Scientific Computation 31(4), pp. 3128-3154 (2009).

A. N. Yzelman

Page 69: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Application to sparse computing

Enhanced cache use: matrix permutations

cache misses ≤ ∑_i (λ_i − 1) = communication volume

Lengauer, T. (1990). Combinatorial algorithms for integrated circuit layout. Springer Science & Business Media.

Catalyurek, U. V., & Aykanat, C. (1999). Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication. IEEE Transactions on Parallel and Distributed Systems, 10(7), 673-693.

Catalyurek, U. V., & Aykanat, C. (2001). A Fine-Grain Hypergraph Model for 2D Decomposition of Sparse Matrices. In IPDPS (Vol. 1, p. 118).

Vastenhouw, B., & Bisseling, R. H. (2005). A two-dimensional data distribution method for parallel sparse matrix-vector multiplication. SIAM Review, 47(1), 67-95.

Bisseling, R. H., & Meesen, W. (2005). Communication balancing in parallel sparse matrix-vector multiplication. Electronic Transactions on Numerical Analysis, 21, 47-65.

Should we program shared-memory as though it were distributed?

A. N. Yzelman

Page 70: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Application to sparse computing

Enhanced cache use: matrix permutations

Practical gains:

Figure: the Stanford link matrix (left) and its 20-part reordering (right).

Sequential execution using CRS on Stanford:

18.99 (original), 9.92 (1D), 9.35 (2D) ms/mul.

Ref.: Two-dimensional cache-oblivious sparse matrix-vector multiplication, by A. N. Yzelman & Rob H. Bisseling, Parallel Computing 37(12), pp. 806-819 (2011).

A. N. Yzelman

Page 71: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Application to sparse computing

Bandwidth

Exploiting sparsity through computation using only nonzeroes:

    i = ( 0, 0, 1, 1, 2, 2, 2, 3 )
    j = ( 0, 4, 2, 4, 1, 3, 5, 2 )
    v = ( a00, a04, . . . , a32 )

    for k = 0 to nz − 1
        y_{i_k} := y_{i_k} + v_k · x_{j_k}

The coordinate (COO) format: two flops versus five data words.

COO requires Θ(3·nz) storage; CRS lowers this to Θ(2·nz + m), and hence has a higher arithmetic intensity.
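For reference, both kernels in C++ (a sketch; the array contents would follow the example above, and CRS compresses the row indices into m + 1 row pointers):

    #include <cstddef>
    #include <vector>

    // COO: y[ i[k] ] += v[k] * x[ j[k] ] for every nonzero k; three arrays of length nz.
    void spmv_coo( const std::vector< std::size_t > &i, const std::vector< std::size_t > &j,
                   const std::vector< double > &v,
                   const std::vector< double > &x, std::vector< double > &y ) {
        for( std::size_t k = 0; k < v.size(); ++k ) {
            y[ i[ k ] ] += v[ k ] * x[ j[ k ] ];
        }
    }

    // CRS: row_start has length m+1, leaving Theta(2*nz + m) storage in total.
    void spmv_crs( const std::vector< std::size_t > &row_start,
                   const std::vector< std::size_t > &j, const std::vector< double > &v,
                   const std::vector< double > &x, std::vector< double > &y ) {
        for( std::size_t row = 0; row + 1 < row_start.size(); ++row ) {
            for( std::size_t k = row_start[ row ]; k < row_start[ row + 1 ]; ++k ) {
                y[ row ] += v[ k ] * x[ j[ k ] ];
            }
        }
    }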

A. N. Yzelman

Page 72: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Application to sparse computing

Efficient bandwidth use

A =

    [ 4 1 3 0
      0 0 2 3
      1 0 0 2
      7 0 1 1 ]

Bi-directional incremental CRS (BICRS) stores A as:

    V  = [ 7 1 4 1 2 3 3 2 1 1 ]
    ∆J = [ 0 4 4 1 5 4 5 4 3 1 ]
    ∆I = [ 3 -1 -2 1 -1 1 1 1 ]

Storage requirements, allowing arbitrary traversals: Θ(2·nz + row jumps + 1).

Ref.: Yzelman and Bisseling, “A cache-oblivious sparse matrix–vector multiplication scheme based on the Hilbert curve”, Progress in Industrial Mathematics at ECMI 2010, pp. 627-634 (2012).

A. N. Yzelman

Page 73: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Application to sparse computing

Efficient bandwidth use

With BICRS you can, distributed or not,

vectorise,

compress,

do blocking,

have arbitrary nonzero or block orders.

Optimised BICRS takes at most 2·nz + m of memory.

Ref.: Buluc, Fineman, Frigo, Gilbert, Leiserson (2009). Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures (pp. 233-244). ACM.

Ref.: Yzelman and Bisseling (2009). Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods. In SIAM Journal of Scientific Computation 31(4), pp. 3128-3154.

Ref.: Yzelman and Bisseling (2012). A cache-oblivious sparse matrix–vector multiplication scheme based on the Hilbert curve. In Progress in Industrial Mathematics at ECMI 2010, pp. 627-634.

Ref.: Yzelman and Roose (2014). High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication. In IEEE Transactions on Parallel and Distributed Systems, doi: 10.1109/TPDS.2013.31.

Ref.: Yzelman, A. N. (2015). Generalised vectorisation for sparse matrix: vector multiplication. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms. ACM.

A. N. Yzelman

Page 74: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Application to sparse computing

One-dimensional data placement

Coarse-grain row-wise distribution, compressed, cache-optimised:

explicit allocation of separate matrix parts per core,

explicit allocation of the output vector on the various sockets,

interleaved allocation of the input vector.

Ref.: Yzelman and Roose, “High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication”, IEEE Transactions on Parallel and Distributed Systems, doi: 10.1109/TPDS.2013.31 (2014).

A. N. Yzelman

Page 75: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Application to sparse computing

Two-dimensional data placement

Distribute row- and column-wise (individual nonzeroes):

all data allocation is explicit; inter-process communication is minimised by partitioning; this incurs the cost of partitioning.

Ref.: Yzelman and Roose, High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication, IEEE Trans. Parallel and Distributed Systems, doi:10.1109/TPDS.2013.31 (2014).

Ref.: Yzelman, Bisseling, Roose, and Meerbergen, MulticoreBSP for C: a high-performance library for shared-memory parallel programming, Intl. J. Parallel Programming, doi:10.1007/s10766-013-0262-9 (2014).

A. N. Yzelman

Page 76: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Application to sparse computing

Results

Sequential CRS on Wikipedia ’07: 472 ms/mul. With 40 threads and BICRS: 21.3 (1D), 20.7 (2D) ms/mul. Speedup: ≈ 22x.

Average speedup on six large matrices:

                                  2 x 6    4 x 10    8 x 8
  –, 1D fine-grained, CRS*          4.6       6.8      6.2
  Hilbert, Blocking, 1D, BICRS*     5.4      19.2     24.6
  Hilbert, Blocking, 2D, BICRS†      −       21.3     30.8

†: uses an updated test set. (Added for reference versus a good 2D algorithm.)

As NUMA increases, interleaved and 1D algorithms lose efficiency.

*: Yzelman and Roose, High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication, IEEE Trans. Parallel and Distributed Systems, doi:10.1109/TPDS.2013.31 (2014).
†: Yzelman, Bisseling, Roose, and Meerbergen, MulticoreBSP for C: a high-performance library for shared-memory parallel programming, Intl. J. Parallel Programming, doi:10.1007/s10766-013-0262-9 (2014).

A. N. Yzelman


Page 78: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Application to sparse computing

Usability

All of the previous is available as free software:

http://albert-jan.yzelman.net/software.php#SL

However, there are problems when integrating it into existing codes:

SPMD (PThreads, MPI, ...) vs. others (OpenMP, Cilk, ...).

Globally allocated vectors versus explicit data allocation.

Conversion between matrix data formats.

A. N. Yzelman


Page 81: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Application to sparse computing

Usability

Wish list:

Performance and scalability.

Portable codes and/or APIs: GPUs, x86, ARM, phones, ...

Out-of-core, streaming capabilities, dynamic updates.

User-defined overloaded operations on user-defined data.

Ease of use!

GraphBLAS.org

Interoperability (PThreads + Cilk, MPI + OpenMP, DSLs!)

A. N. Yzelman


Page 86: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Application to sparse computing

Bridging HPC and Big Data

Wanted: a bridge between Big Data and HPC

Our take:

Spark I/O via native RDDs and native Scala interfaces;

Rely on serialisation and the JNI to switch to C;

Intercept Spark’s execution model to switch to SPMD;

Set up and enable inter-process RDMA communications.

    def createMatrix(
        rdd: RDD[ (Int, Iterable[Int], Iterable[Double]) ],
        P: Int
    ): SparseMatrix = ...

    def multiply(
        sc: org.apache.spark.SparkContext,
        x: DenseVector, A: SparseMatrix, y: DenseVector
    ): DenseVector = ... //--- this function calls 1D or 2D SpMVs ---//

    def toRDD(
        sc: org.apache.spark.SparkContext,
        x: DenseVector
    ): RDD[ (Int, Double) ] = ...

A. N. Yzelman


Page 88: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Application to sparse computing

DataBSP

We have a shared-memory prototype. Preliminary results:

SpMM multiply, SpMV multiply, and basic vector operations;

one machine learning application.

Cage15, n = 5 154 859, nz = 99 199 551. Using the 1D method:

Note: this is ongoing work; performance and functionality improvements are forthcoming.

A. N. Yzelman

Page 89: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication > Application to sparse computing

Conclusions and future work

Needed for current algorithms:

faster partitioning to enable scalable 2D sparse computations,

integration in practical and extensible libraries (GraphBLAS),

making them interoperable with common use scenarios.

Extend application areas further:

sparse power kernels,

symmetric matrix support,

graph and sparse tensor computations,

support various hardware and execution platforms (Hadoop?).

Thank you!

The basic SpMV multiplication codes are free:

http://albert-jan.yzelman.net/software#SL

A. N. Yzelman

Page 90: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication

Backup slides

A. N. Yzelman

Page 91: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication

Results: cross platform

Cross-platform results over 24 matrices:

                       Structured    Unstructured    Average
  Intel Xeon Phi           21.6           8.7          15.2
  2x Ivy Bridge CPU        23.5          14.6          19.0
  NVIDIA K20X GPU          16.7          13.3          15.0

No one solution fits all.

If we must, some generalising statements:

Large structured matrices: GPUs.

Large unstructured matrices: CPUs or GPUs.

Smaller matrices: Xeon Phi or CPUs.

Ref.: Yzelman, A. N. (2015). Generalised vectorisation for sparse matrix: vector multiplication. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms. ACM.

A. N. Yzelman

Page 92: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication

BSP sparse matrix–vector multiplication

Variables A_s, x_s, y_s are local versions of the global variables A, x, y, distributed according to π_A, π_x, π_y.

 1: for j | ∃ a_ij ≠ 0 ∈ A_s and π_x(j) ≠ s do
 2:     get x_{π_x(j), j}
 3: end for
 4: sync    {execute fan-out}
 5: y_s = A_s x_s    {local multiplication stage}
 6: for i | ∃ a_ij ∈ A_s and π_y(i) ≠ s do
 7:     send (i, y_{s,i}) to π_y(i)
 8: end for
 9: sync    {execute fan-in}
10: for all (i, α) received do
11:     add α to y_{s,i}
12: end for

Rob H. Bisseling, “Parallel Scientific Computation”, Oxford Press, 2004.

A. N. Yzelman


Page 97: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication

Multi-BSP SpMV multiplication

SPMD-style Multi-BSP SpMV multiplication:

define process 0 at level −1 as the Multi-BSP root.

let process s at level k have parent t at level k − 1.

define (A−1,0, x−1,0, y−1,0) = (A, x , y), the original input.

variables A_{k,s}, x_{k,s}, y_{k,s} are local versions of A_{k−1,t}, x_{k−1,t}, y_{k−1,t};

{A, x, y}_{k−1,t} was distributed into a number of parts,

where that number of parts (≥ p_{k−1}) is such that all {A, x, y}_{k,s} fit into M_k bytes.

1: do
2:     for j = 0 to p step p
3:         get {A}_{k,j} from parent
4:     down
5: while( up )

Mandatory input data movement only.

A. N. Yzelman


Page 101: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication

Multi-BSP SpMV multiplication

SPMD-style Multi-BSP SpMV multiplication:

 1: do
 2:     . . .
 3:     for j = 0 to p step p
 4:         get {A, x, y}_{k,j} from parent
 5:     . . .
 6:     if( not down )
 7:         compute y_{k,j} = A_{k,j} x_{k,j}    {only executed on leafs}
 8:     . . .
 9:     put y_{k,j} into parent
10:     . . .
11: while( up )

Mandatory and mixed mandatory/overhead data movement. Minimal required work only.

A. N. Yzelman

Page 102: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication

Multi-BSP SpMV multiplication

SPMD-style Multi-BSP SpMV multiplication:

 1: do
 2:     ∀j, get separator x_{k,j} and initialise y_{k,j} iff j mod p = s
 3:     for j = 0 to p step p
 4:         get {A, x, y}_{k,j} from parent
 5:     sync
 6:     if( not down )
 7:         compute y_{k,j} = A_{k,j} x_{k,j}    {only executed on leafs}
 8:         perform fan-in on separator y_{k,j}
 9:         put y_{k,j} into parent
10:     sync
11:     put y_{k,j} into parent and sync
12: while( up )

Mandatory costs plus overhead. Split vectors: the local {x, y}_s versus their separator counterparts.

A. N. Yzelman

Page 103: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication

Flat partitioning for Multi-BSP

Can we reuse existing partitioning techniques?

1 Partition A = A_0 ∪ . . . ∪ A_{p−1} with p = ∏_{l=0}^{L−1} p_l?

No: A_s, x_s, y_s may not fit in M_{L−1}.

2 Find the minimal k to partition A into, s.t. each {A, x, y}_i fits into M_{L−1}? Very similar to previous work!

Y. and Bisseling, “Cache-oblivious sparse matrix–vector multiplication by using sparse matrix partitioning”, SISC, 2009.
Y. and Bisseling, “Two-dimensional cache-oblivious sparse matrix–vector multiplication”, Parallel Computing, 2011.

3 Hierarchical partitioning? A = A_0 ∪ . . . ∪ A_{k_0}, A_i = A_{i,0} ∪ . . . ∪ A_{i,k_1}, etc.; solves the assignment issue.

However, none of these take the different g_l into account!
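Option 2 above amounts to a memory-fit computation. The following naive C++ sketch (illustrative names, perfect balance assumed, separator growth and per-part index overhead ignored) returns the smallest number of parts whose idealised CRS footprint fits into M_{L−1}; it addresses only the memory constraint and, as noted above, says nothing about the different g_l.

#include <cstddef>

// Smallest number of parts such that an idealised, perfectly balanced part of
// {A, x, y} fits into M_last bytes; assumes CRS storage (one value and one
// column index per nonzero, m+1 row pointers) plus the two vectors.
std::size_t minimal_parts(std::size_t m, std::size_t n, std::size_t nnz,
                          std::size_t M_last)            // M_last = M_{L-1} > 0
{
    const std::size_t total =
        nnz * (sizeof(double) + sizeof(std::size_t)) +   // values + column indices
        (m + 1) * sizeof(std::size_t) +                  // row pointers
        (m + n) * sizeof(double);                        // y and x
    std::size_t k = 1;
    while ((total + k - 1) / k > M_last)                 // ceil(total / k) <= M_{L-1}
        ++k;
    return k;
}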

A. N. Yzelman



Page 108: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication

Hierarchical partitioning

If g_0 < 2g_1, greedy hierarchical partitioning is suboptimal.

           Upper level   Lower level
Fan-out    6g_0          0
Fan-in     2g_0          0

Total: 8g_0

A. N. Yzelman


Page 110: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication

Hierarchical partitioning

If g_0 < 2g_1, greedy hierarchical partitioning is suboptimal.

           Upper level   Lower level
Fan-out    0             6g_1
Fan-in     4g_0          2g_1

Total: 4g_0 + 8g_1
Previous: 8g_0
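As a worked check of the claim above, compare the two totals, assuming that a greedy hierarchical partitioner picks the smaller upper-level cut first and hence ends up with 4g_0 + 8g_1:

8g_0 < 4g_0 + 8g_1  ⟺  4g_0 < 8g_1  ⟺  g_0 < 2g_1,

so exactly when g_0 < 2g_1 the greedy choice is beaten by the partitioning that pays 8g_0 at the upper level and nothing below.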

A. N. Yzelman


Page 112: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication

Multi-BSP aware partitioning

Slightly modified V-cycle:

1 coarsen

2 recurse or randomly partition
3 do k steps of HKLFM

  calculate gains taking g_0, . . . , g_{L−1} into account

4 refine

Claim: if g_0 > g_1 > g_2 > . . ., then HKLFM is a local operation.

By enumeration of all possibilities (L = 2). At level-1 refinement:

suppose we move a nonzero a_{ij} from A_{s_1,s_2} to A_{t_1,t_2} with s_1 ≠ t_1:

a_{ij} ∈ A_{s_1,s_2}, a_{ij} ∉ A_{s_1}: gain is g_1 − g_0 or 2(g_1 − g_0).

a_{ij} ∉ A_{s_1,s_2}: gain is 0, g_1 − g_0, or 2(g_1 − g_0).

Hence it suffices to perform HKLFM steps on each level separately.
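The level-aware gain used in step 3 of the V-cycle can be written down compactly. The following C++ sketch (illustrative names, not part of Mondriaan or any existing partitioner) weighs the change in communicated words at each level by that level's g_l, which is all the HKLFM moves above need.

#include <cstddef>
#include <vector>

// Gain of one candidate HKLFM move: the reduction in words communicated at
// each level, weighted by that level's gap g_l. `before` and `after` hold the
// per-level word counts attributable to the nonzero under consideration.
double move_gain(const std::vector<double>& g,      // g_0, ..., g_{L-1}
                 const std::vector<long>&  before,  // words per level before the move
                 const std::vector<long>&  after)   // words per level after the move
{
    double gain = 0.0;
    for (std::size_t l = 0; l < g.size(); ++l)
        gain += g[l] * static_cast<double>(before[l] - after[l]);  // > 0 means cheaper
    return gain;
}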

A. N. Yzelman


Page 114: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication

Summary

Differences from flat BSP:

different notion of load balance: parts must fit into local memory.

non-uniform communication costs, which imply different partitioning techniques.

non-uniform data locality... with fine-grained distribution.

A. N. Yzelman


Page 116: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication

How does it compare?

ANSI C++11, parallelisation using std::thread,

implementation relies on shared-memory cache coherency

Mondriaan 4.0, medium-grain, symmetric doubly bordered block-diagonal (BBD) reordering

Global arrays without blocking, nonzero reordering, compression.
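Before the numbers, a minimal sketch of the SPMD launch implied by the bullets above: one std::thread per core, each multiplying its own row-wise block of A against the globally shared x. Pinning, NUMA-aware allocation, the Mondriaan reordering and the compressed (BICRS) storage are left out, and all names are illustrative rather than the actual benchmark code.

#include <cstddef>
#include <thread>
#include <vector>

// The local rows of A owned by one thread, in CRS form; first_row is the
// offset of this block in the global output vector y. Blocks are assumed to
// cover disjoint row ranges, so the threads never write the same y entry.
struct CRSBlock {
    std::vector<std::size_t> row_start, col;
    std::vector<double> val;
    std::size_t first_row;
};

void spmd_spmv(const std::vector<CRSBlock>& blocks,
               const std::vector<double>& x, std::vector<double>& y)
{
    std::vector<std::thread> workers;
    for (const CRSBlock& b : blocks)
        workers.emplace_back([&b, &x, &y] {          // one SPMD process per block
            for (std::size_t i = 0; i + 1 < b.row_start.size(); ++i) {
                double sum = 0.0;
                for (std::size_t nz = b.row_start[i]; nz < b.row_start[i + 1]; ++nz)
                    sum += b.val[nz] * x[b.col[nz]]; // x is shared, read-only
                y[b.first_row + i] = sum;            // disjoint rows: no data race
            }
        });
    for (std::thread& t : workers) t.join();
}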

machine   matrix       original   p = 1   p = max   Optimal
2x8       G3 circuit       33.3     26.7     10.5      2.77
2x8       FS1              83.5     65.3     22.0     10.3
2x8       cage15          523      387       77.1     29.8
2x10      G3 circuit       22.7     16.9      9.77     1.73
2x10      FS1              83.5     65.3     22.0      7.56
2x10      cage15          341      233       54.7     23.4

all numbers are in ms.

Y. and Bisseling, Cache-oblivious sparse matrix–vector multiplication, SISC 2009

Y. and Roose, High-level strategies for sparse matrix–vector multiplication, IEEE TPDS 2014

A. N. Yzelman

Page 117: Portable, usable, and efficient sparse matrix vector ...

Usable sparse matrix–vector multiplication

Vectorised BICRS

Incorporating vectorisation into the compressed data structure (a scalar BICRS traversal is sketched after the references below):

Ref.: Yzelman, A. N. & Roose, D. (2014). Sparse matrix computations on multi-core systems. In Intel European Exascale Labs report 2013, pp. 24–29, Intel.
Ref.: Yzelman, A. N. (2015). Generalised vectorisation for sparse matrix–vector multiplication. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms, ACM.
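For reference, a scalar sketch of a BICRS-style traversal; the array names and the exact overflow convention are an illustrative reconstruction (signed column-index increments, with an increment that pushes the running column index past n signalling that the next row increment must be consumed), not the code of the papers above. The vectorised variant, roughly speaking, processes such increments in fixed-length blocks matching the SIMD width.

#include <cstddef>
#include <vector>

// Scalar traversal of a BICRS-like structure under the convention stated
// above: col_inc holds one increment per nonzero, row_inc holds the starting
// row followed by one increment per row jump.
void bicrs_spmv(std::size_t n, std::size_t nnz,
                const std::vector<double>&         val,      // nnz values
                const std::vector<std::ptrdiff_t>& col_inc,  // nnz column increments
                const std::vector<std::ptrdiff_t>& row_inc,  // starting row + row jumps
                const std::vector<double>& x, std::vector<double>& y)
{
    std::ptrdiff_t i = row_inc[0];   // current row
    std::ptrdiff_t j = 0;            // running column index
    std::size_t    r = 1;            // next row increment to consume
    for (std::size_t k = 0; k < nnz; ++k) {
        j += col_inc[k];
        while (j >= static_cast<std::ptrdiff_t>(n)) {   // overflow encodes a row jump
            j -= static_cast<std::ptrdiff_t>(n);
            i += row_inc[r++];
        }
        y[static_cast<std::size_t>(i)] += val[k] * x[static_cast<std::size_t>(j)];
    }
}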

A. N. Yzelman