Legion: Programming Distributed Heterogeneous ...

Post on 24-Sep-2020

Tasks are the unit of computation in Legion
- Tasks name the logical regions and fields they will access at dispatch
- Each logical region and field is annotated with a privilege: read-only, read-write, or reduce
- Tasks can recursively launch arbitrary sub-tasks for nested parallelism

Tasks are issued in program order; parallelism is inferred from non-interference
- Sub-region non-interference: accessing disjoint sub-regions
- Field non-interference: accessing disjoint sets of fields
- Privilege non-interference: performing non-interfering accesses (e.g. both read-only)
The Legion runtime operates similarly to a hardware out-of-order processor
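The three non-interference tests above can be sketched as a toy model. This is not the Legion API: the `Requirement` class, the use of plain sets of entry indices to stand in for sub-regions, and the field names `voltage` and `charge` are all assumptions made for illustration. Treating two reductions with the same operator as non-interfering is likewise an assumption that goes beyond the read-only example on the slide.

```python
from dataclasses import dataclass

READ_ONLY, READ_WRITE, REDUCE = "read-only", "read-write", "reduce"


@dataclass(frozen=True)
class Requirement:
    """One region argument of a task (hypothetical stand-in, not a Legion type)."""
    entries: frozenset   # set of entry indices standing in for a (sub-)region
    fields: frozenset    # set of field names
    privilege: str
    redop: str = ""      # reduction operator name, only used with REDUCE


def non_interfering(a, b):
    """True if two requirements can safely be used in parallel."""
    if a.entries.isdisjoint(b.entries):      # sub-region non-interference
        return True
    if a.fields.isdisjoint(b.fields):        # field non-interference
        return True
    if a.privilege == READ_ONLY and b.privilege == READ_ONLY:
        return True                          # privilege non-interference
    # Assumption: reductions with the same operator also commute.
    if a.privilege == REDUCE and b.privilege == REDUCE and a.redop == b.redop:
        return True
    return False


def dependences(tasks):
    """tasks: list of (name, [Requirement]) in program order.

    Returns the set of (earlier, later) pairs that must run in order.
    """
    deps = set()
    for j, (later, later_reqs) in enumerate(tasks):
        for earlier, earlier_reqs in tasks[:j]:
            if any(not non_interfering(a, b)
                   for a in earlier_reqs for b in later_reqs):
                deps.add((earlier, later))
    return deps


def req(entries, fields, privilege):
    return Requirement(frozenset(entries), frozenset(fields), privilege)


tasks = [
    ("t1", [req(range(0, 4), {"voltage"}, READ_WRITE)]),
    ("t2", [req(range(4, 8), {"voltage"}, READ_WRITE)]),  # disjoint entries from t1
    ("t3", [req(range(0, 4), {"charge"}, READ_WRITE)]),   # disjoint fields from t1
    ("t4", [req(range(0, 8), {"voltage"}, READ_ONLY)]),   # reads what t1 and t2 wrote
    ("t5", [req(range(0, 8), {"voltage"}, READ_ONLY)]),   # read-only alongside t4
]
deps = dependences(tasks)
```

In this sketch t1, t2, and t3 have no ordering among themselves, and t4 and t5 each wait only for t1 and t2, which is the kind of parallelism an out-of-order processor extracts from a sequential instruction stream.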

Legion: Programming Distributed Heterogeneous Architectures with Logical Regions

Michael Bauer, Sean Treichler, Elliott Slaughter, Alex Aiken Stanford University

Overview

Partitioning

Implicit Task Parallelism

S3D

Evaluation

Three Important Trends
1. Cost of Data Movement Dominates
2. Dynamism in Both Hardware and Software
3. Heterogeneity of Processors and Memory

Legion Goals
1. Abstractions for logically describing program data to minimize data movement
2. Emphasize runtime decision making to handle dynamism
3. Decouple specification from mapping to handle heterogeneity

Sub-Region Non-Interference

Mapping

Logical Regions: A Relational Abstraction for Data
Logical regions describe data in a relational style: entries (rows) and fields (columns).
They support some relational operators that provide different views of the data.
- Partitioning is selection (σ)
- Field-slicing is projection (π)
Relations can encode arbitrary data structures.
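The relational reading of regions can be illustrated with a small sketch. It models a region as a list of rows and is not Legion code: the helper names `partition` and `field_slice`, the coloring function, and the fields `id`, `voltage`, and `charge` are hypothetical. Unlike this sketch, which builds new lists, a real runtime would presumably expose sub-regions as views onto the same data.

```python
# A logical region modeled as entries (rows), each a dict of fields (columns).
region = [
    {"id": 0, "voltage": 1.0, "charge": 0.1},
    {"id": 1, "voltage": 2.0, "charge": 0.2},
    {"id": 2, "voltage": 3.0, "charge": 0.3},
    {"id": 3, "voltage": 4.0, "charge": 0.4},
]


def partition(region, color_of):
    """Selection: group entries into sub-regions by a computed color."""
    subregions = {}
    for entry in region:
        subregions.setdefault(color_of(entry), []).append(entry)
    return subregions


def field_slice(region, fields):
    """Projection: keep only the named fields of every entry."""
    return [{f: entry[f] for f in fields} for entry in region]


# Partition into two sub-regions of two entries each, then look at one
# field of the second sub-region.
pieces = partition(region, lambda entry: entry["id"] // 2)
view = field_slice(pieces[1], ["voltage"])
```

Here `pieces` has the two colors 0 and 1, and `view` contains only the `voltage` column of entries 2 and 3.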

[Figure: a logical region drawn as a table, with Field columns and Entry rows]

Logical regions can be partitioned into sub-regions
- Partitions are computed dynamically, either disjoint or aliased
- Logical regions can be partitioned recursively
- Logical regions support multiple partitions

Logical region trees capture important properties of program data
- Independence: disjoint logical regions in same partition
- Locality: elements in same logical regions
- Sub-region: parent and child logical regions
- Aliasing: aliased partitions and different partitions of same region
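One way to see how a region tree answers "can these two regions overlap?" is the sketch below. It is a toy model under stated assumptions, not the Legion runtime's algorithm: the `Node` class, the `may_alias` function, and the example tree (a root `N` with a disjoint partition `P` and an aliased partition `G`) are hypothetical. Treating regions from unrelated trees as independent is also an assumption.

```python
class Node:
    """A node in a toy region tree. Regions and partitions alternate by level."""

    def __init__(self, name, parent=None, disjoint=None):
        self.name = name
        self.parent = parent
        self.disjoint = disjoint   # True/False for partitions, None for regions

    def path(self):
        """Nodes from the root down to this node."""
        node, out = self, []
        while node is not None:
            out.append(node)
            node = node.parent
        return out[::-1]


def region(name, parent_partition=None):
    return Node(name, parent_partition)


def partition(name, parent_region, disjoint):
    return Node(name, parent_region, disjoint)


def may_alias(r1, r2):
    """Conservative test: can regions r1 and r2 share entries?"""
    p1, p2 = r1.path(), r2.path()
    i = 0
    while i < min(len(p1), len(p2)) and p1[i] is p2[i]:
        i += 1
    if i == 0:
        return False              # assumption: separate trees never overlap
    if i == len(p1) or i == len(p2):
        return True               # same region, or parent and child (sub-region)
    common = p1[i - 1]            # deepest node shared by both paths
    if common.disjoint is None:
        return True               # a region: paths use different partitions of it
    return not common.disjoint    # a partition: siblings overlap only if aliased


N = region("N")
P = partition("P", N, disjoint=True)
p1, p2 = region("p1", P), region("p2", P)
G = partition("G", N, disjoint=False)
g1, g2 = region("g1", G), region("g2", G)
```

With this tree, `p1` and `p2` are independent, `g1` and `g2` may overlap because their partition is aliased, and `p1` and `g1` may overlap because they come from different partitions of the same region.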

[Figure: logical region tree rooted at N, with labels SP, sub-regions p1 … pn, s1 … sn, g1 … gn, and the annotations "disjoint" and "possibly overlap"]

[Figure: runtime pipeline stages: Dep. Analysis, Distribute, Map, Execute, Resolve Spec., Complete, Commit]

[Figure panels: Field Non-Interference, Privilege Non-Interference]

Legion applications are mapped onto different architectures
- Tasks assigned to processors, region instances assigned to memories
- All decisions are programmatically available via mapper interface
- Mapper objects can make dynamic decisions based on runtime information

Mapping decisions are independent of correctness
- Makes tuning Legion applications simple
- Porting can be performed easily and only requires writing a new mapper

Mappers are customizable and composable
- Provide a default mapper with heuristics
- Can write Legion applications and gradually refine mapping by overriding virtual functions
- Different mappers in same application
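The idea of refining a default mapper by overriding individual decisions can be sketched as follows. This is a minimal illustration with a made-up interface: the classes `DefaultMapper` and `GpuFirstMapper`, the methods `select_processor` and `select_memory`, the `run` helper, and the machine description are all hypothetical and do not reproduce Legion's mapper API.

```python
class DefaultMapper:
    """Toy default mapper with simple heuristics (hypothetical interface)."""

    def __init__(self, processors, memories):
        self.processors = processors   # processor name -> kind, e.g. "x86"
        self.memories = memories       # processor name -> memories, closest first

    def select_processor(self, task):
        # Heuristic: first processor whose kind the task has a variant for.
        for name, kind in self.processors.items():
            if kind in task["variants"]:
                return name
        raise ValueError("no processor can run this task")

    def select_memory(self, task, processor):
        # Heuristic: the memory closest to the chosen processor.
        return self.memories[processor][0]


class GpuFirstMapper(DefaultMapper):
    """Overrides one decision and inherits the rest."""

    def select_processor(self, task):
        for name, kind in self.processors.items():
            if kind == "CUDA" and "CUDA" in task["variants"]:
                return name
        return super().select_processor(task)


def run(task, mapper):
    """The mapper picks where the task runs; the task body fixes what it computes."""
    processor = mapper.select_processor(task)
    memory = mapper.select_memory(task, processor)
    return task["body"](), processor, memory


processors = {"cpu0": "x86", "gpu0": "CUDA"}
memories = {"cpu0": ["DRAM"], "gpu0": ["FB", "DRAM"]}
task = {"variants": {"x86", "CUDA"}, "body": lambda: sum(range(10))}

default_result = run(task, DefaultMapper(processors, memories))
gpu_result = run(task, GpuFirstMapper(processors, memories))
```

Both runs return the same value, 45; only the placement differs (`cpu0` with `DRAM` versus `gpu0` with `FB`). That separation is what lets tuning or porting change a mapper without touching application logic.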

[Figure: example mapping of tasks t1 to t5 and regions rc, rw, rw1, rw2, rn, rn1, rn2 onto a machine with x86 and CUDA processors and memories labeled $, NUMA, FB, and DRAM]

Production combustion simulation from the Department of Energy
- Full scale application consisting of more than 200K lines of Fortran MPI
- Significant task and data level parallelism
- Data movement is the limiting factor

Ported main simulation loop (100K lines) of S3D into Legion C++
- Interoperates with the MPI version
- Legion automatically discovers significant task-level parallelism
- Can explore different mappings running on Titan, the world's #2 supercomputer

Many applications ported to Legion, showing both strong and weak scaling
- Circuit simulation, AMR, fluid flow, unstructured mesh, S3D
- Running on Titan and Keeneland supercomputers at Oak Ridge National Lab
- Can handle CPU+GPU, InfiniBand, Gemini, Aries, NUMA, …

S3D performance results demonstrate that Legion can weak scale
- Compare against MPI+OpenACC version tuned by NVIDIA and Cray teams
- Take best Legion mapping after trying many different tuning techniques
- Between 2.0X and 2.27X faster than MPI+OpenACC

[Figure: S3D weak-scaling plots of throughput (points/s) versus number of nodes. Panel titles: DME Mechanism, Heptane Mechanism. One plot spans 8 to 512 nodes with series Legion 64^3 and MPI+OpenACC 64^3; the other spans 8 to 1024 nodes with series Legion 48^3 and MPI+OpenACC 48^3.]