Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was...

82
1 Task-Based Parallel Programming in Legion Alex Aiken Stanford University & SLAC ECP Legion Tutorial, February 2018

Transcript of Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was...

Page 1: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

1

Task-Based Parallel Programming in Legion

Alex Aiken Stanford University & SLAC

ECP Legion Tutorial, February 2018

Page 2: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

2

Acknowledgments

  The Legion project is joint work between Stanford, Los Alamos National Lab, NVIDIA, and SLAC.

  Funding has come from many sources, but particularly the DOE and the leadership class facilities.

  This research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy’s Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation’exascale computing imperative.

ECP Legion Tutorial, February 2018

Page 3: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

3

Tutorial Materials

The slides, example program, and performance profiles are at:

http://theory.stanford.edu/~aiken/ecp

ECP Legion Tutorial, February 2018

Page 4: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

4

OVERVIEW

ECP Legion Tutorial, February 2018

Page 5: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

5

Legion & Regent

 Legion is   a C++ runtime   a programming model

 Regent is a programming language   For the Legion programming model   Current implementation is embedded in Lua   Has an optimizing compiler

 This tutorial focuses on Regent

ECP Legion Tutorial, February 2018

Page 6: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

6

Why Use Legion?

 Easy access to GPUs   Simplifies programming complex hardware

 Easy control over data   Partitioning, placement and layout in memory

 Automated scheduling and latency hiding   Asynchronous tasking   Throughput oriented

 Performance portability

ECP Legion Tutorial, February 2018

Page 7: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

7

Regent Stack

Regent Language and

compiler

Legion High-level runtime

Realm Low-level runtime

Lua Host language

ECP Legion Tutorial, February 2018

Page 8: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

8

Regent in Lua

 Embedded in Lua   Popular scripting language in the graphics community

 Excellent interoperation with C   And with other languages

 Python-ish syntax   For both Lua and Regent

ECP Legion Tutorial, February 2018

Page 9: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

9

PAGERANK

ECP Legion Tutorial, February 2018

Page 10: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

10

PageRank

 Today’s example

  Input: A directed graph.  Output: The rank of each node

  A measure of a node’s importance   E.g., used for ranking web search results

  Web pages are nodes   Hyperlinks are edges

ECP Legion Tutorial, February 2018

Page 11: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

11

PageRank Equation

ECP Legion Tutorial, February 2018

rank(n) = (1 – α)/N + α × Σ p ε pred(n). rank(p)/|succs(p)|

Page 12: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

12

TASKS

ECP Legion Tutorial, February 2018

Page 13: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

13

The PageRank Task

task pagerank(nodes: region(…), edges: region(…), pr_old: region(…), pr_new:region(…), alpha: float)

{ }

Tasks are the unit of parallel execution.

Logical regions are (typed) collections Logical: no implied layout no implied location

ECP Legion Tutorial, February 2018

Page 14: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

14

The PageRank Task

task pagerank(nodes: region(…), edges: region(…), pr_old: region(…), pr_new:region(…), alpha: float)

{ }

ECP Legion Tutorial, February 2018

Page 15: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

15

The PageRank Task

task pagerank(nodes: region(…), edges: region(…), pr_old: region(…), pr_new:region(…), alpha: float) where reads(nodes, edges, pr_old), writes(pr_new) { }

Privileges declare how a task will use its region arguments.

ECP Legion Tutorial, February 2018

Page 16: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

16

The PageRank Task task pagerank(nodes: region(…), edges: region(…), pr_old: region(…), pr_new:region(…), alpha: float) where reads(nodes, edges, pr_old), writes(pr_new) { … for n in nodes do … score = 0 for e in left, right do -- indices of predecessor edges of n … score = score + pr_old[edges[e].src] end … score = (1 – alpha) / num_nodes + alpha * score pr_new[n] = score / out_degree end }

ECP Legion Tutorial, February 2018

Page 17: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

17

REGIONS

ECP Legion Tutorial, February 2018

Page 18: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

18

Regions

 A region is a (typed) collection

 Regions are the cross product of   An index space   A field space

src dst 0 a x 1 b x 2 c x 3 a y

The region’s index space The region’s field space

The edges region is constructed so that edges are grouped by dst node

EDGES

ECP Legion Tutorial, February 2018

Page 19: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

19

Discussion

 Regions are the way to organize large data collections in Regent

 Regions can be   Structured (e.g., like arrays)   Unstructured (e.g., pointer data structures)

 Any number of fields

 Built-in support for multidimensional index spaces

ECP Legion Tutorial, February 2018

Page 20: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

20

Nodes & Edges

EDGES

NODES

Nodes have two fields: out_degree and index

ECP Legion Tutorial, February 2018

Page 21: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

21

Nodes & Edges

EDGES

NODES

A node’s index field points just beyond its last predecessor edge.

ECP Legion Tutorial, February 2018

Page 22: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

22

PAGERANK TASK

ECP Legion Tutorial, February 2018

Page 23: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

23

PARTITIONING

ECP Legion Tutorial, February 2018

Page 24: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

24

Partitioning

 To enable parallelism on a region, partition it into smaller pieces

  And then run a task on each piece

 Partitioning is built in to Legion/Regent   A rich set of partitioning primitives

 Use the primitives to build partitioning algorithms

ECP Legion Tutorial, February 2018

Page 25: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

25

Equal Partitions

 One commonly used primitive is to split a region into a number of (nearly) equal size subregions

num_pieces = ispace(int1d, 4) r = region(ispace(int1d, 12), int64) p = partition(equal, r, num_pieces)

ECP Legion Tutorial, February 2018

Page 26: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

26

Region Trees

0

r

1 2 3

The Legion runtime knows and uses the region tree to manage mapping and automate parallelism and data movement.

ECP Legion Tutorial, February 2018

Page 27: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

27

Partitions

•  Partitions are first class objects

•  An array of the subregions formed by a partition

p = partition(equal, r, num_pieces)

p[0] p[1] p[2] p[3]

ECP Legion Tutorial, February 2018

Page 28: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

28

Discussion

•  Partitioning and region creation are dynamic   Can be done at any time   Regions and partitions are first class values

 Regions trees can be any depth Subregions can be partitioned, too

 Regions can be partitioned in multiple ways   A program can define multiple views of its data

 Defining regions/partitions does not materialize them   Gives names to subsets of the data   Actual computations access physical instances of regions

ECP Legion Tutorial, February 2018

Page 29: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

29

Partitioning Strategy for PageRank

•  Use edge partitioning •  Approximately equal number of edges per subregion •  Better than node partitioning if nodes can have very

different out degrees

•  But •  Keep all predecessor edges of a given node in the same

subregion

•  So •  Calculate the range of edges for each subregion •  Partition the edges by range •  Partition the nodes compatibly with the edge partition

ECP Legion Tutorial, February 2018

Page 30: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

30

Nodes & Edges

EDGES

NODES

a node’s index field points just beyond its last predecessor edge

ECP Legion Tutorial, February 2018

Page 31: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

31

First Step: Calculate Edge Ranges

task init_partition( edge_range : region(ispace(int1d),rect1d), edges : region(ispace(int1d), EdgeStruct), avg_num_edges : E_ID, num_parts: int) where writes(node_range, edge_range), reads(nodes) do … for p = 0, num_parts do right_bound = min(avg_num_edges * (p + 1), total_num_edges) var my_dst: V_ID = edges[right_bound].dst -- extend the right bound to the last edge of the current node while (right_bound < total_num_edges) do var next_dst : V_ID = edges[right_bound+1].dst if (my_dst<next_dst) then break end right_bound = right_bound + 1 end edge_range[p] = {left_bound, right_bound} end

ECP Legion Tutorial, February 2018

Page 32: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

32

DEPENDENT PARTITIONING

ECP Legion Tutorial, February 2018

Page 33: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

33

Partitioning, Revisited

 Why do we want to partition data?   For parallelism   We will launch many tasks over many subregions

 A problem   We often need to partition multiple data structures in a consistent way   E.g., given that we have partitioned the nodes a particular way, that will dictate the desired partitioning of the edges

ECP Legion Tutorial, February 2018

Page 34: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

34

Dependent Partitioning

 Distinguish two kinds of partitions

  Independent partitions   Computed from the parent region, using, e.g.,

  partition(equals, … )

 Dependent partitions   Computed using another partition

ECP Legion Tutorial, February 2018

Page 35: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

35

Dependent Partitioning Operations

  Image   Use the image of a partition to define a new partition   E.g., the image of a field   E.g., or a range of values

 Preimage   Use the pre-image of a field in a partition …

 Set operations   Form new partitions using the intersection, union, and set difference of other partitions   Not illustrated today

ECP Legion Tutorial, February 2018

Page 36: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

36

LR1

Image

  Computes elements reachable via a field lookup

  Can be applied to index space or another partition

  Computation is distributed based on location of data

IS1

s1 s2 s3

IS1

IS2

s1 s2 s3

source partition

pointer field

IS2

destination index space

ECP Legion Tutorial, February 2018

Page 37: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

37

Preimage

  Inverse of image   Computes elements that

reach a given subspace   Preserves disjointness

  Multiple dependent partitioning operations can be combined

  Capture complex task access patterns

IS1

s1 s2 s3

IS2

IS2

s1 s2 s3

source partition pointer field

IS1

destination index space

ECP Legion Tutorial, February 2018

Page 38: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

38

Dependent Partitioning in PageRank

 The use of dependent partitioning in PageRank is simple

 Define a partition of the edges   Using the computed edge ranges

 Then define a partition of the nodes using the destination node of each edge

ECP Legion Tutorial, February 2018

Page 39: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

39

Picture (Reminder)

EDGES

NODES

ECP Legion Tutorial, February 2018

Page 40: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

40

Picture

EDGES

NODES

= dst field of edge

0:1 2:4 5:7 8:9 EDGE_RANGE

ECP Legion Tutorial, February 2018

Page 41: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

41

Picture

EDGES

NODES

= dst field of edge

0:1 2:4 5:7 8:9 EDGE_RANGE

ECP Legion Tutorial, February 2018

Page 42: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

42

Picture

EDGES

NODES

= dst field of edge

0:1 2:4 5:7 8:9 EDGE_RANGE

ECP Legion Tutorial, February 2018

Page 43: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

43

Picture

EDGES

NODES

= dst field of edge

0:1 2:4 5:7 8:9 EDGE_RANGE

ECP Legion Tutorial, February 2018

Page 44: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

44

PAGERANK MAIN

ECP Legion Tutorial, February 2018

Page 45: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

45

PARALLELISM

ECP Legion Tutorial, February 2018

Page 46: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

46

Program Semantics

 A Legion program is a sequence of task launches

 The runtime analyzes the tasks for interference   Tasks with conflicting accesses to the same data   Non-interfering tasks can execute in parallel   Interfering tasks are serialized

 Guarantees sequential semantics   Program result is as if it had executed sequentially   Very useful for debugging at scale

ECP Legion Tutorial, February 2018

Page 47: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

47

Task Graphs

 When Legion discovers interfering tasks an edge is added to the task graph recording the dependency

 Three wavefronts:

  The runtime building the task graph   The application executing the graph   The runtime collecting resources from finished tasks

ECP Legion Tutorial, February 2018

Page 48: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

48

Parallel Loop from PageRank

for p in part do pagerank(part_nodes[p], part_edges[p], pr_score0, part_score1[p], … ) end

The different calls to pagerank don’t interefere. Why? Only part_score1[] is written and it is a disjoint partition.

Note the use of different views on to the data. We use both the entire pr_score0[] region and subregions of part_score1[].

ECP Legion Tutorial, February 2018

Page 49: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

49

MAPPING

ECP Legion Tutorial, February 2018

Page 50: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

50

Mapping Interface  Application selects:

  Where tasks run   Where regions are placed

 Mapping computed dynamically

 Decouple correctness from performance

50

t1

t2

t3

t4t5

rc

rw

rw1 rw2

rn

rn1 rn2

$

$

$

$

N U M A

N U M A

FB

D R A M

x86

CUDA

x86

x86

x86

ECP Legion Tutorial, February 2018

Page 51: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

51

Mapping

 Mapping is the process of assigning resources to Regent/Legion programs

 Conceptually   Assign a processor to each task

  The task will execute in its entirety on that processor

  Assign a memory to each region argument

 And many other things!

 There is a default mapper with reasonable heuristics   Just another mapper, but a generic one

ECP Legion Tutorial, February 2018

Page 52: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

52

Mapping Interface

 At the Legion level, mapping is an API   A set of callbacks   Each is called at a particular point in a task’s lifetime   To write mappers, need to know this sequence of stages

 Regent has a mapping DSL   Concise, easy to use   Compiles to the Legion mapping API   Currently supports only static mappings

ECP Legion Tutorial, February 2018

Page 53: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

53

High-Level Overview of Mapping

 An instance of the Legion runtime runs on each node

 When a task is launched on the local runtime:   The mapper picks a processor for the task   The mapper picks memories for the region arguments   … and other things as well …

ECP Legion Tutorial, February 2018

Page 54: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

54

New Concepts

 There are a number of concepts at the mapping level that don’t exist in Regent

 Machine models  Variants  Physical Instances

ECP Legion Tutorial, February 2018

Page 55: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

55

Machine Model

 To pick concrete processors & memories, the runtime must know:

 How many processors/memories there are   And of what kinds

 And where the processors/memories are   At least relative to each other

 A machine model is written once for each machine

ECP Legion Tutorial, February 2018

Page 56: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

56

Components of a Machine Model

  Processors   LOC   TOC   PROC_SET   UTILITY   IO

  Memories   GLOBAL   SYSTEM   RDMA   FRAME_BUFFER   ZERO_COPY   DISK   HDF5

ECP Legion Tutorial, February 2018

Page 57: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

57

Affinities

  Processor -> Memory   Which memories are attached to a processor

  Memory -> Memory

  Which memories have channels between them

  Memory -> Processor   All processors attached to a memory

  Affinities are provided as a list   (proc,mem) and (mem,mem) pairs   Also include bandwidth and latency information

ECP Legion Tutorial, February 2018

Page 58: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

58

Task Variants

 A task can have multiple variants   Different implementations of the same task   Multiple variants can be registered with the runtime   Variants can have associated constraints

 Examples   A variant for LOC   Another variant for TOC   Variants for different data layouts

ECP Legion Tutorial, February 2018

Page 59: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

59

Variants in Regent

 Place immediately before a task declaration   __demand(__cuda)

 Causes both CPU and GPU task variants to be produced

 And the default mapper always prefers to pick a GPU variant if possible

ECP Legion Tutorial, February 2018

Page 60: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

60

Physical Instances

 A region is a logical name for data

 A physical instance is a copy of that data   For some set of fields

 There can be 0, 1 or many physical instances of a specific field of a region at any time

ECP Legion Tutorial, February 2018

Page 61: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

61

Physical Instances

  Can be valid or invalid   Is the data current or not?

  Live in a specific memory

  Have a specific layout   Column major, row major, blocked, struct-of-arrays, array-of-

structs, …

  Are allocated explicitly by the mapper

  Are deallocated by the runtime   Garbage collected

ECP Legion Tutorial, February 2018

Page 62: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

62

A Word About Physical Instances

  Many physical instances of a region can exist simultaneously   Including different versions of the same data

  A task writing version 0 to disk   A task reading version 5   A task writing version 6

  The current version!   A task scheduled to read version 6   A task scheduled to write version 7   A (meta)task scheduled to deallocate version 6   …

ECP Legion Tutorial, February 2018

Page 63: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

63

Layout Constraints

 Tasks can have layout constraints on physical instances

  “This task requires data in row major order”

 Constraints are just that   Don’t specify an exact layout   Multiple instances may satisfy the constraints

ECP Legion Tutorial, February 2018

Page 64: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

64

Summary

 Mapping   Selects processors for tasks   Selects memories for physical instances

  Satisfying region requirements of tasks

 Many options   Default mapper does reasonable things   But any sufficiently complex program will need some customization

ECP Legion Tutorial, February 2018

Page 65: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

65

PAGERANK MAPPER

ECP Legion Tutorial, February 2018

Page 66: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

66

PROFILING

ECP Legion Tutorial, February 2018

Page 67: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

67

Legion Prof

 A tool for showing performance timeline   Each processor is a timeline   Each operation is a time interval   Different kinds of operations have different colors

 White space = idle time   Want to understand why there is white space

ECP Legion Tutorial, February 2018

Page 68: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

68

Example Profiles from PageRank

1 node, 8 cpus pagerank/run_pr.sh --program baseline --cpus 8 --nodes 1 --gpus 0

1 node, 1 GPU pagerank/run_pr.sh --program baseline --cpus 4 --nodes 1 --gpus 1

1 node, 2 GPUs pagerank/run_pr.sh --program baseline --cpus 4 --nodes 1 --gpus 2

1 node, 4 GPUS

pagerank/run_pr.sh --program baseline --cpus 4 --nodes 1 --gpus 4

ECP Legion Tutorial, February 2018

Page 69: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

69

Performance Results

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

1Node 2Nodes 4Nodes

Runtim

epe

riteratio

n(secon

ds)

PageRankPerformancewithDifferentMappingConfigurations

1GPU/node(GPUMemory)

2GPU/node(GPUMemory)

4GPU/node(GPUMemory)

1GPU/node(Zero-CopyMemory)

2GPU/node(Zero-CopyMemory)

4GPU/node(Zero-CopyMemory)

ECP Legion Tutorial, February 2018

Page 70: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

70

OTHER APPLICATIONS

ECP Legion Tutorial, February 2018

Page 71: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

71

S3D: Combustion Simulation  Simulates chemical reactions

  DME (30 species)   Heptane (52 species)   PRF (116 species)

 Two parts   Physics

  Nearest neighbor communication   Data parallel

  Chemistry   Local   Complex task parallelism

  Large working sets/task 5

Planning the science simulation

•  Recent 3D simulation on Jaguar was used to extrapolate and plan a target Titan simulation

•  Planned simulation will have more grid points and/or larger chemistry

•  Will need a month on 12,000 hybrid nodes of Titan

Figure 5: Computational domain and grid to be used for simulations of the CRF HCCI engine.

Figure 6: Reaction and diffusion structures for OH radical during the third stage thermal explosion of a high-pressureDME fueled autoignition process.Recent 3D DNS of auto-ignition with 30-species

DME chemistry (Bansal et al. 2011)

ECP Legion Tutorial, February 2018

Page 72: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

72

Mapping for Heptane 483

4 AMD Interlagos

Integer cores for Legion Runtime

8 AMD Interlagos FP

cores for application

NVIDIA Kepler K20

Dynamic Analysis for (rhsf+2) Clean-up/meta tasks

ECP Legion Tutorial, February 2018

Page 73: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

73

Mapping for Heptane 963  Handle larger problem sizes per node

  Higher computation-to-communication ratios   More power efficient

 Different mapping   Limited by size of GPU framebuffer

 Legion analysis is independent of problem size   Larger tasks -> fewer runtime cores

ECP Legion Tutorial, February 2018

Page 74: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

74

Weak Scaling: PRF on Titan

3X

7X

ECP Legion Tutorial, February 2018

Page 75: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

75

Fast Graph Analytics

 Conventional wisdom:   Graph processing has trouble taking advantage of distributed memory

 High performance graph processing systems are dominated by shared-memory CPU-based systems

 Observation   GPUs provide higher memory bandwidth than CPUs   Can avoid communication by careful placement of data in the memory hierarchy

ECP Legion Tutorial, February 2018

Page 76: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

76

Fast Graph Processing [VLDB’18]

Performance comparison on a single GPU (lower is better).

Performance comparison among different graph processing frameworks (lower is better).

Competitive with state-of-the-art single-GPU graph processing engines.

Orders of magnitude speedup compared to state-of-the-art distributed/shared memory CPU systems.

ECP Legion Tutorial, February 2018

Page 77: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

77

  In CNNs, data is commonly organized as 4D tensors.   tensor = [image, height, width, channel]

 Existing tools parallelize the image dimension.

 Motivation   Explore other parallelizable dimensions   Allow each layer to be parallelized differently

Convolutional Neural Networks [ICML18]

ECP Legion Tutorial, February 2018

Page 78: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

78

Results

Figure 1: Training throughput (images/second) on 16 GPUs.

Figure 2: Data transfers in each step on 16 GPUs with a minibatch size of

512. ECP Legion Tutorial, February 2018

Page 79: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

79 ECP Legion Tutorial, February 2018

Page 80: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

80

Separating Concerns

 Current practice entangles functionality, scheduling, and mapping

  Consider a code written in MPI + OpenMP + CUDA

 Alternative   Specify functionality and dependencies first   Then focus on mapping and scheduling for a machine   A lot of the benefits of Legion flow from this design

ECP Legion Tutorial, February 2018

Page 81: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

81

Programmer Productivity

  In the end, it’s all about productivity

 How much work is needed to achieve a desired level of performance?

 Legion philosophy   More expressive data model   Requires more initial work from the programmer   But makes later stages easier & more flexible

  Easy to try different partitioning strategies   Easy to explore alternative mappings

ECP Legion Tutorial, February 2018

Page 82: Task-Based Parallel Programming in Legiontheory.stanford.edu/~aiken/ecp/ecp.pdfThis research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S.

82

Legion

Legion website: http://legion.stanford.edu

ECP Legion Tutorial, February 2018