
Page 1:

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Roshan Dathathri, Chandan Reddy, Thejas Ramashekar, Uday Bondhugula

Department of Computer Science and Automation

Indian Institute of Science

{roshan,chandan.g,thejas,uday}@csa.iisc.ernet.in

September 11, 2013

Multicore Computing Lab (CSA, IISc) Automatic Data Movement September 11, 2013 1 / 61

Page 2:

Parallelizing code for distributed-memory architectures

OpenMP code for shared-memory systems:

    for (i = 1; i <= N; i++) {
      #pragma omp parallel for
      for (j = 1; j <= N; j++) {
        <computation>
      }
    }

MPI code for distributed-memory systems:

    for (i = 1; i <= N; i++) {
      set_of_j_s = dist(1, N, processor_id);
      for each j in set_of_j_s {
        <computation>
      }
      <communication>
    }

Explicit communication is required between:

devices in a heterogeneous system with CPUs and multiple GPUs.

nodes in a distributed-memory cluster.

Hence, distributed-memory architectures are tedious to program by hand.
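The MPI sketch above leaves the distribution helper abstract. A minimal sketch of what a block-wise dist(1, N, processor_id) might compute (the helper name comes from the slide's pseudocode; the block policy and the explicit num_procs parameter are assumptions for illustration):

```python
def dist(lb, ub, proc_id, num_procs):
    """Hypothetical block distribution: return the contiguous chunk of
    iterations [lb, ub] assigned to processor proc_id (0-based)."""
    n = ub - lb + 1                              # total iterations
    chunk = (n + num_procs - 1) // num_procs     # ceiling division
    start = lb + proc_id * chunk
    end = min(start + chunk - 1, ub)
    return range(start, end + 1) if start <= ub else range(0)

# Every iteration in [1, N] is owned by exactly one processor.
N, P = 10, 4
owned = [list(dist(1, N, p, P)) for p in range(P)]
```

With N = 10 and 4 processors this assigns chunks of at most 3 iterations, the last processor getting the remainder.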

Page 3:

Affine loop nests

Arbitrarily nested loops with affine bounds and affine accesses. These form the compute-intensive core of scientific computations such as:

stencil-style computations, linear algebra kernels, and alternating direction implicit (ADI) integrations.

Can be analyzed by the polyhedral model.

Page 4:

Example iteration space

[Figure: iteration space on axes i and j, with dependences (1,0), (0,1), and (1,1)]
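These three dependence vectors arise, for instance, from a loop body in which each point reads its west, north, and north-west neighbors. A minimal sketch (the recurrence itself is an assumed example, not taken from the talk):

```python
# A 2-D recurrence whose flow dependences are exactly (1,0), (0,1), (1,1):
# A[i][j] reads A[i-1][j], A[i][j-1], and A[i-1][j-1].
N = 4
A = [[1 if i == 0 or j == 0 else 0 for j in range(N)] for i in range(N)]
for i in range(1, N):
    for j in range(1, N):
        A[i][j] = A[i-1][j] + A[i][j-1] + A[i-1][j-1]
```

Any legal parallelization must respect all three dependence vectors, which is what makes tile-to-tile communication necessary.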

Page 5:

Example iteration space

[Figure: tiled iteration space on axes i and j, with dependences (1,0), (0,1), and (1,1)]

Page 6:

Example iteration space

[Figure: tiled iteration space with dependences (1,0), (0,1), and (1,1); tiles grouped into parallel phases]

Page 7:

Automatic Data Movement

For affine loop nests: statically determine the data to be transferred between compute devices,

with the goal of moving only those values that need to be moved to preserve program semantics.

Generate data movement code that is: parametric in problem-size symbols and the number of compute devices, and valid for any computation placement.

Page 8:

Communication is parameterized on a tile

A tile represents an iteration of the innermost distributed loop.

May or may not be the result of loop tiling.

A tile is executed atomically by a compute device.

Page 9:

Existing flow-out (FO) scheme

[Figure: tiled iteration space with dependences (1,0), (0,1), and (1,1); the flow-out set of a tile is highlighted]

Flow-out set:

The values that need to be communicated to other tiles.

Union of the per-dependence flow-out sets of all RAW dependences.
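On a 1-D iteration space with constant-distance RAW dependences and fixed-size tiles (all simplifying assumptions for illustration), the per-tile flow-out set can be sketched as:

```python
def flow_out(tile_id, T, N, deps):
    """Iterations in tile [tile_id*T, tile_id*T + T) whose value flows
    to an iteration in some other tile (union over RAW dependences)."""
    lo, hi = tile_id * T, min((tile_id + 1) * T, N)
    out = set()
    for d in deps:
        for s in range(lo, hi):
            t = s + d
            if t < N and not (lo <= t < hi):   # inter-tile edge only
                out.add(s)
    return out

# With tile size 4 and dependence distances 1 and 2, only the last
# two iterations of tile 0 flow out.
fo = flow_out(0, 4, 16, deps=[1, 2])
```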

Page 10:

Existing flow-out (FO) scheme

[Figure: tiled iteration space with dependences (1,0), (0,1), and (1,1); the flow-out set of a tile is highlighted]

Receiving tiles:

The set of tiles that require the flow-out set.

Page 11:

Existing flow-out (FO) scheme

[Figure: tiled iteration space with dependences (1,0), (0,1), and (1,1); the flow-out set and the communication it induces]

Not all elements in the flow-out set are required by every receiving tile.

The scheme only ensures that each receiver requires at least one element of the communicated set.

It could therefore transfer unnecessary elements.

Page 12:

Our first scheme

Motivation:

All elements in the communicated data should be required by the receiver.

Key idea:

Determine the data that needs to be sent from one tile to another, parameterized on a sending tile and a receiving tile.

Page 13:

Flow-in (FI) set

[Figure: tiled iteration space with dependences (1,0), (0,1), and (1,1); the flow-in set of a tile is highlighted]

Flow-in set:

The values that need to be received from other tiles.

Union of the per-dependence flow-in sets of all RAW dependences.

Page 14:

Flow-out intersection flow-in (FOIFI) scheme

[Figure: tiled iteration space with dependences (1,0), (0,1), and (1,1); the flow set between a sending and a receiving tile is highlighted]

Flow set:

Parameterized on two tiles.

The values that need to be communicated from a sending tile to a receiving tile.

The intersection of the flow-out set of the sending tile and the flow-in set of the receiving tile.
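Under the same simplified 1-D model (constant dependence distances, fixed tile size; assumptions for illustration), FOIFI's flow set is exactly the intersection described above:

```python
def flow_out(tile, T, N, deps):
    # Sources in `tile` whose dependent target lies in another tile.
    lo, hi = tile * T, min((tile + 1) * T, N)
    return {s for d in deps for s in range(lo, hi)
            if s + d < N and not (lo <= s + d < hi)}

def flow_in(tile, T, N, deps):
    # Sources outside `tile` whose value some iteration of `tile` reads.
    lo, hi = tile * T, min((tile + 1) * T, N)
    return {t - d for d in deps for t in range(lo, hi)
            if t - d >= 0 and not (lo <= t - d < hi)}

def flow_set(sender, receiver, T, N, deps):
    # FOIFI: values sent from `sender` to `receiver` are the
    # intersection of the sender's flow-out and the receiver's flow-in.
    return flow_out(sender, T, N, deps) & flow_in(receiver, T, N, deps)

fs = flow_set(0, 1, 4, 16, deps=[1, 2])
```

With distances 1 and 2, tile 0 sends its last two iterations to tile 1 and nothing to tile 2, so the intersection prunes receivers that the plain flow-out set would over-approximate.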

Page 15:

Flow-out intersection flow-in (FOIFI) scheme

[Figure: tiled iteration space with dependences (1,0), (0,1), and (1,1); the flow set to another receiving tile is highlighted]

Flow set:

Parameterized on two tiles.

The values that need to be communicated from a sending tile to a receiving tile.

The intersection of the flow-out set of the sending tile and the flow-in set of the receiving tile.

Page 16:

Flow-out intersection flow-in (FOIFI) scheme

[Figure: tiled iteration space with dependences (1,0), (0,1), and (1,1); the flow set to yet another receiving tile is highlighted]

Flow set:

Parameterized on two tiles.

The values that need to be communicated from a sending tile to a receiving tile.

The intersection of the flow-out set of the sending tile and the flow-in set of the receiving tile.

Page 17:

Flow-out intersection flow-in (FOIFI) scheme

[Figure: tiled iteration space with dependences (1,0), (0,1), and (1,1); flow sets and the communication they induce]

Precise communication when each receiving tile is executed by a different compute device.

Could lead to huge duplication when multiple receiving tiles are executed by the same compute device.

Page 18:

Comparison with virtual processor based schemes

Some existing schemes use a virtual-to-physical processor mapping to handle symbolic problem sizes and numbers of compute devices.

Tiles can be considered as virtual processors in FOIFI.

FOIFI has less redundant communication than prior works that use virtual processors since it:

uses exact dataflow information, and

combines data to be moved due to multiple dependences.

Page 19:

Our main scheme

Motivation:

Partition the communication set such that all elements within each partition are required by all receivers of that partition.

Key idea:

Partition the dependences in a particular way, and determine communication sets and their receivers based on those partitions.

Page 20:

Flow-out partitioning (FOP) scheme

[Figure: tiled iteration space with dependences (1,0), (0,1), and (1,1); the flow-out partitions of a tile are highlighted]

Source-distinct partitioning of dependences partitions the dependences such that:

all dependences in a partition communicate the same set of values, and

any two dependences in different partitions communicate disjoint sets of values.

Determine the communication set and receiving tiles for each partition.

Page 21:

Flow-out partitioning (FOP) scheme

[Figure: tiled iteration space with dependences (1,0), (0,1), and (1,1); the flow-out partitions of a tile are highlighted]

The communication sets of different partitions are disjoint.

The union of the communication sets of all partitions yields the flow-out set.

Hence, the flow-out set of a tile is partitioned.

Page 22:

Source-distinct partitioning of dependences

[Figure: tiled iteration space with dependence (1,1); partitions of dependences]

Initially, each dependence is:

restricted to those constraints which are inter-tile, and

put in a separate partition.

Page 23:

Source-distinct partitioning of dependences

[Figure: tiled iteration space with dependence (1,0); partitions of dependences]

Initially, each dependence is:

restricted to those constraints which are inter-tile, and

put in a separate partition.

Page 24:

Source-distinct partitioning of dependences

[Figure: tiled iteration space with dependence (0,1); partitions of dependences]

Initially, each dependence is:

restricted to those constraints which are inter-tile, and

put in a separate partition.

Page 25:

Source-distinct partitioning of dependences

[Figure: tiled iteration space with dependences (1,0) and (1,1); partitions of dependences]

For all pairs of dependences in two partitions:

Find the source iterations that access the same region of data (source-identical).

Get new dependences by restricting the original dependences to the source-identical iterations.

Subtract out the new dependences from the original dependences.

The set of new dependences formed is a new partition.
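The refinement above can be sketched by representing each dependence simply as the set of source iterations it communicates (a stand-in for the data regions its sources write; the set representation and the pairwise loop are illustrative assumptions, not the polyhedral implementation):

```python
def source_distinct_partition(deps):
    """Refine dependences until any two partitions have disjoint
    source sets: common (source-identical) iterations are subtracted
    from both partitions and placed in a new partition of their own."""
    parts = [set(d) for d in deps if d]   # one partition per dependence
    i = 0
    while i < len(parts):                 # appended partitions are
        j = i + 1                         # themselves refined later
        while j < len(parts):
            common = parts[i] & parts[j]  # source-identical iterations
            if common:
                parts[i] -= common        # restrict and subtract out
                parts[j] -= common
                parts.append(common)      # the new partition
            j += 1
        i += 1
    return [p for p in parts if p]        # drop emptied partitions

parts = source_distinct_partition([{1, 2, 3}, {2, 3, 4}, {3, 5}])
```

Each subtraction strictly shrinks the total element count across partitions, so the refinement reaches a fixed point where no new partitions can be formed, matching the stopping condition on the next slide.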

Page 26:

Source-distinct partitioning of dependences

[Figure: tiled iteration space with dependences (1,0) and (1,1); partitions of dependences]

For all pairs of dependences in two partitions:

Find the source iterations that access the same region of data (source-identical).

Get new dependences by restricting the original dependences to the source-identical iterations.

Subtract out the new dependences from the original dependences.

The set of new dependences formed is a new partition.

Page 27:

Source-distinct partitioning of dependences

[Figure: tiled iteration space with dependences (0,1) and (1,1); partitions of dependences]

For all pairs of dependences in two partitions:

Find the source iterations that access the same region of data (source-identical).

Get new dependences by restricting the original dependences to the source-identical iterations.

Subtract out the new dependences from the original dependences.

The set of new dependences formed is a new partition.

Page 28:

Source-distinct partitioning of dependences

[Figure: tiled iteration space with dependences (0,1) and (1,1); partitions of dependences]

For all pairs of dependences in two partitions:

Find the source iterations that access the same region of data (source-identical).

Get new dependences by restricting the original dependences to the source-identical iterations.

Subtract out the new dependences from the original dependences.

The set of new dependences formed is a new partition.

Page 29:

Source-distinct partitioning of dependences

[Figure: tiled iteration space with dependences (1,0), (0,1), and (1,1); partitions of dependences]

For all pairs of dependences in two partitions:

Find the source iterations that access the same region of data (source-identical).

Get new dependences by restricting the original dependences to the source-identical iterations.

Subtract out the new dependences from the original dependences.

The set of new dependences formed is a new partition.

Page 30:

Source-distinct partitioning of dependences

[Figure: tiled iteration space with dependences (1,0), (0,1), and (1,1); partitions of dependences]

For all pairs of dependences in two partitions:

Find the source iterations that access the same region of data (source-identical).

Get new dependences by restricting the original dependences to the source-identical iterations.

Subtract out the new dependences from the original dependences.

The set of new dependences formed is a new partition.

Page 31:

Source-distinct partitioning of dependences

[Figure: tiled iteration space with dependences (1,0), (0,1), and (1,1); partitions of dependences]

Stop when no new partitions can be formed.

Page 32:

Flow-out partitioning (FOP) scheme: at runtime

[Figure: tiled iteration space with dependences (1,0), (0,1), and (1,1); the flow-out partitions of a tile are highlighted]

For each partition and tile executed, one of these is chosen:

multicast-pack: the partitioned communication set from this tile is copied to the buffers of its receivers.

unicast-pack: the partitioned communication set from this tile to a receiving tile is copied to the buffer of that receiver.

unicast-pack is chosen only if each receiving tile is executed by a different receiver.
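A runtime sketch of this choice (the placement map, list-based buffers, and the simplification that both modes pack the same values are illustrative assumptions; in the actual scheme, unicast-pack copies the per-receiver flow set):

```python
def pack(partition_values, receiving_tiles, device_of, buffers):
    """Copy a flow-out partition into per-device send buffers.
    unicast-pack: one copy per receiving tile, chosen only when every
    receiving tile runs on a different device;
    multicast-pack: otherwise, one copy per receiving device."""
    devices = [device_of[t] for t in receiving_tiles]
    if len(set(devices)) == len(devices):      # all receivers distinct
        mode = "unicast-pack"
        for t in receiving_tiles:
            buffers[device_of[t]].extend(partition_values)
    else:
        mode = "multicast-pack"
        for dev in set(devices):               # one copy per device
            buffers[dev].extend(partition_values)
    return mode

buffers = {0: [], 1: []}
# Two receiving tiles on the same device: multicast-pack, one copy.
mode = pack([7, 8], receiving_tiles=[2, 3], device_of={2: 1, 3: 1},
            buffers=buffers)
```

When the two receiving tiles land on the same device, only one copy of the partition reaches that device's buffer, which is exactly the duplication FOIFI would incur and FOP avoids.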

Page 33:

Flow-out partitioning (FOP) scheme

[Figure: tiled iteration space with dependences (1,0), (0,1), and (1,1); flow-out partitions and the communication they induce]

Reduces the granularity at which receivers are determined.

Reduces the granularity at which the conditions to choose between multicast-pack and unicast-pack are applied.

Minimizes communication of both duplicate and unnecessary elements.

Page 34:

Another example - dependences

[Figure: 3-D iteration space on axes k, i, and j, with the plane j = k + 1; Dependence1 and Dependence2 shown]

Let (k, i, j) be the source iteration and (k′, i′, j′) the target iteration.

Dependence1: k′ = k + 1, i′ = i, j′ = j.

Dependence2: k′ = k + 1, i′ = i, j = k + 1.

Page 35:

Another example - FO scheme

[Figure: tiled 3-D iteration space with Dependence1 and Dependence2; the flow-out set is highlighted]

Page 36:

Another example - FOIFI scheme

[Figure: tiled 3-D iteration space with Dependence1 and Dependence2; the flow set is highlighted]

Page 37:

Another example - FOIFI scheme

[Figure: tiled 3-D iteration space with Dependence2; the flow set is highlighted]

Page 38:

Another example - FOIFI scheme

[Figure: tiled 3-D iteration space with Dependence2; the flow set is highlighted]

Page 39:

Another example - FOP scheme

[Figure: tiled 3-D iteration space with Dependence1 and Dependence2; a flow-out partition is highlighted]

Page 40:

Another example - FOP scheme

[Figure: tiled 3-D iteration space with Dependence1; a flow-out partition is highlighted]

Page 41:

Implementation

As part of the PLUTO framework.

Input is sequential C code, which is tiled and parallelized using the PLUTO algorithm.

Data movement code is automatically generated using our scheme.

Page 42:

Implementation - distributed-memory systems

Code for distributed-memory systems is automatically generated using existing techniques.

Asynchronous MPI primitives are used to communicate between nodes in a distributed-memory system.

Page 43:

Implementation - heterogeneous systems

For heterogeneous systems, the host CPU acts both as a compute device and as the orchestrator of data movement between compute devices, while the GPU acts only as a compute device.

The OpenCL functions clEnqueueReadBufferRect() and clEnqueueWriteBufferRect() are used for data movement in heterogeneous systems.

Page 44:

Experimental evaluation: distributed-memory cluster

32-node InfiniBand cluster.

Each node consists of two quad-core Intel Xeon E5430 2.66 GHz processors.

The cluster uses MVAPICH2-1.8 as the MPI implementation.

Page 45:

Benchmarks

Floyd-Warshall (floyd).

LU decomposition (lu).

Alternating Direction Implicit solver (adi).

2-D Finite-Difference Time-Domain kernel (fdtd-2d).

2-D heat equation (heat-2d).

3-D heat equation (heat-3d).

The first four are from the PolyBench/C 3.2 suite, while heat-2d and heat-3d are widely used stencil computations.

Page 46:

Comparison of FOP, FOIFI and FO

All three schemes use the same parallelizing transformation, and hence the same frequency of communication.

They differ only in the communication volume.

Comparing execution times therefore directly compares their efficiency.

Page 47:

Comparison of FOP with FO

Communication volume reduced by a factor of 1.4× to 63.5×.

The reduction in communication volume translates to significant speedup, except for heat-2d.

Speedup of up to 15.9×.

Mean speedup of 1.55×.

Page 48:

Comparison of FOP with FOIFI

Similar behavior for stencil-style codes. For floyd and lu:

Communication volume reduced by a factor of 1.5× to 31.8×. Speedup of up to 1.84×.

Mean speedup of 1.11×.

Page 49:

OMPD - OpenMP to MPI

Takes OpenMP code as input and generates MPI code.

Primarily a runtime dataflow analysis technique. Handles only those affine loop nests which have a repetitive communication pattern.

Communication should not vary based on the outer sequential loop.

Cannot handle floyd, lu, or time-tiled (outer sequential dimension tiled) stencil-style codes.

Page 50:

Comparison of FOP with OMPD

For heat-2d and heat-3d: significant speedup over OMPD. The computation time is much lower. Better load balance and locality due to advanced transformations. OMPD cannot handle such transformed code.

For adi: significant speedup over OMPD. Same volume of communication. Better performance due to loop tiling. Lower runtime overhead.

Mean speedup of 3.06×.

Page 51:

Unified Parallel C (UPC)

A unified programming model for both shared-memory and distributed-memory systems. All benchmarks were manually ported to UPC.

Data is shared only if it may be accessed remotely. UPC-specific optimizations were applied: localized array accesses, block copy, and one-sided communication.

Page 52:

Comparison of FOP with UPC

For lu, heat-2d, and heat-3d: significant speedup over UPC. Better load balance and locality due to advanced transformations. Difficult to manually write such transformed code. The UPC model is not suitable when the same data element could be written by different nodes in different parallel phases.

Page 53:

Comparison of FOP with UPC

For lu, heat-2d, and heat-3d: significant speedup over UPC. Better load balance and locality due to advanced transformations. Difficult to manually write such transformed code. The UPC model is not suitable when the same data element could be written by different nodes in different parallel phases.

For adi: significant speedup over UPC. Same computation time and communication volume. Data to be communicated is not contiguous in memory. UPC incurs huge runtime overhead for such multiple shared-memory requests to non-contiguous data.

Page 54:

Comparison of FOP with UPC

For lu, heat-2d, and heat-3d: significant speedup over UPC. Better load balance and locality due to advanced transformations. Difficult to manually write such transformed code. The UPC model is not suitable when the same data element could be written by different nodes in different parallel phases.

For adi: significant speedup over UPC. Same computation time and communication volume. Data to be communicated is not contiguous in memory. UPC incurs huge runtime overhead for such multiple shared-memory requests to non-contiguous data.

For fdtd-2d and floyd: UPC performs slightly better. Same computation time and communication volume. Data to be communicated is contiguous in memory. UPC has no additional runtime overhead.

Page 55:

Comparison of FOP with UPC

For lu, heat-2d and heat-3d, signiVcant speedup over UPC.Better load balance and locality due to advanced transformations.DiXcult to manually write such transformed code.UPC model is not suitable when the same data element could be writtenby diUerent nodes in diUerent parallel phases.

For adi: significant speedup over UPC.
  Same computation time and communication volume.
  Data to be communicated is not contiguous in memory.
  UPC incurs huge runtime overhead for such multiple shared-memory requests to non-contiguous data.

For fdtd-2d and floyd: UPC performs slightly better.
  Same computation time and communication volume.
  Data to be communicated is contiguous in memory.
  UPC has no additional runtime overhead.

Mean speedup of 2.19×.


Results: distributed-memory cluster

Figure: FOP – strong scaling on distributed-memory cluster (speedup vs. number of nodes, 1 to 32, for floyd, fdtd-2d, heat-3d, lu, heat-2d and adi). [plot omitted]

Figure: floyd – speedup of FOP, FOIFI, FO and hand-optimized UPC code over seq on distributed-memory cluster (1 to 32 nodes). [plot omitted]

For the transformations and computation placement chosen, FOP achieves the minimum communication volume.


Experimental evaluation: heterogeneous systems

Intel-NVIDIA system:

Intel Xeon multicore server consisting of 12 Xeon E5645 cores.

4 NVIDIA Tesla C2050 graphics processors connected on the PCI Express bus.

NVIDIA driver version 304.64 supporting OpenCL 1.1.


Comparison of FOP with FO

Communication volume reduced by a factor of 11× to 83×.

Communication volume reduction translates to significant speedup.

Speedup of up to 3.47×.

Mean speedup of 1.53×.


Results: heterogeneous systems

Figure: FOP – strong scaling on the Intel-NVIDIA system (speedup vs. device combination: 1 GPU, 1 CPU+1 GPU, 2 GPUs, 4 GPUs; benchmarks floyd, fdtd-2d, heat-3d, lu and heat-2d). [plot omitted]

For the transformations and computation placement chosen, FOP achieves the minimum communication volume.


Acknowledgements


Conclusions

The framework we propose frees programmers from the burden of moving data.

Partitioning of dependences enables precise determination of the data to be moved.

Our tool is the first to parallelize affine loop nests for a combination of CPUs and GPUs while providing precision of data movement at the granularity of array elements.

Our techniques, if implemented in compilers, will be able to provide OpenMP-like programmer productivity for distributed-memory and heterogeneous architectures.

Publicly available: http://pluto-compiler.sourceforge.net/


Results: distributed-memory cluster

Mean speedup of FOP over FO is 1.55×.

Mean speedup of FOP over OMPD is 3.06×.

Mean speedup of FOP over UPC is 2.19×.


Results: heterogeneous systems

Mean speedup of FOP over FO is 1.53×.

