Kokkos%...

17
Photos placed in horizontal position with even amount of white space between photos and header Photos placed in horizontal position with even amount of white space between photos and header Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND NO. 2011-XXXXP Kokkos Hierarchical TaskData Parallelism for C++ HPC Applica9ons GPU Tech. Conference April 47, 2016 San Jose, CA H. Carter Edwards SAND20163114 C

Transcript of Kokkos%...

Page 1: Kokkos% Hierarchical%TaskData%Parallelism%on-demand.gputechconf.com/gtc/2016/presentation/s6145... · 2016. 4. 26. · Photos placed in horizontal position with even amount of white

Photos placed in horizontal position with even amount

of white space between photos

and header

Photos placed in horizontal position

with even amount of white space

between photos and header

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND NO. 2011-XXXXP

Kokkos  Hierarchical  Task-­‐Data  Parallelism  

for  C++  HPC  Applica9ons  

GPU  Tech.  Conference  April  4-­‐7,  2016  San  Jose,  CA    H.  Carter  Edwards    SAND2016-­‐3114  C  

Page 2: Kokkos% Hierarchical%TaskData%Parallelism%on-demand.gputechconf.com/gtc/2016/presentation/s6145... · 2016. 4. 26. · Photos placed in horizontal position with even amount of white

1  

DDR  

HBM  

DDR  

HBM  

DDR  DDR  

DDR  

HBM  HBM  

Mul9-­‐Core   Many-­‐Core   APU   CPU+GPU  

Drekar   Trilinos  SPARC  

Applica9ons  &  Libraries  

Kokkos*  performance  portability  for  C++  applica?ons  

Albany  EMPIRE  LAMMPS  

*κόκκος Greek:  “granule”  or  “grain”  ;    like  grains  of  sand  on  a  beach  

Page 3: Kokkos% Hierarchical%TaskData%Parallelism%on-demand.gputechconf.com/gtc/2016/presentation/s6145... · 2016. 4. 26. · Photos placed in horizontal position with even amount of white

2  

Abstrac9ons:  PaWerns,  Policies,  and  Spaces  

§ Parallel  PaWern  of  user’s  computa9ons  §  parallel_for,  parallel_reduce,  parallel_scan,  ...  EXTENSIBLE  

§  Execu9on  Policy  tells  how  user  computa9on  will  be  executed  §  Sta9c  scheduling,  dynamic  scheduling,  thread-­‐teams,  ...  EXTENSIBLE  

§  Execu9on  Space  tells  where  user  computa9ons  will  execute  §  Which  cores,  numa  region,  GPU,  ...  (extensible)  

§ Memory  Space  tells  where  user  data  resides  §  Host  memory,  GPU  memory,  high  bandwidth  memory,  ...  (extensible)  

§ Array  Layout  (policy)  tells  how  user  data  is  laid  out  in  memory  §  Row-­‐major,  column-­‐major,  array-­‐of-­‐struct,  struct-­‐of-­‐array  …    (extensible)  

§ Differen9a9ng  feature:  Array  Layout  and  Memory  Space  §  Versus  other  programming  models  (OpenMP,  OpenACC,  …)  §  Cri9cal  for  performance  portability  …  see  previous  GPU-­‐Tech  Kokkos  talks  

Page 4: Kokkos% Hierarchical%TaskData%Parallelism%on-demand.gputechconf.com/gtc/2016/presentation/s6145... · 2016. 4. 26. · Photos placed in horizontal position with even amount of white

3  

§  Sandia  Na9onal  Laboratory  (SNL)  Laboratory  Directed  Research  &  Development  (LDRD)  Project  §  Strategic  R&D  funded  at  Sandia’s  discre9on  §  Recognized  with  $$  as  strategic  high  priority  

§  SNL  LDRD  Team  §  myself,  Stephen  Olivier,  Jonathan  Berry,  Greg  Mackey,  Siva  Rajamanickam,  Kyungjoo  Kim,  George  Stelle,  and  Michael  Wolf  

§  Scheduling  algorithm  inspired  from  SNL’s  Qthreads  library  

§ Cri9cal  support  from  NVIDIA  §  Thanks  to  Tony  Scudiero,  Greg  Branch,  Ujval  Kapasi,  and  the                                        whole  nvcc  development  team  

§  Early  access  to  nvcc  CUDA  8  essen9al  for  relocatable  device  code  

Acknowledgements  

Page 5: Kokkos% Hierarchical%TaskData%Parallelism%on-demand.gputechconf.com/gtc/2016/presentation/s6145... · 2016. 4. 26. · Photos placed in horizontal position with even amount of white

4  

New  (prototype)  Kokkos  Capability:  

Dynamic  Directed  Acyclic  Graph  (DAG)  of  Tasks  

§  Extension  of  Parallel  PaWern  §  Tasks:  Heterogeneous  collec9on  of  parallel  computa9ons  §  DAG:  Tasks  may  have  acyclic  “execute  aker”  dependences  §  Dynamic:  New  tasks  may  be  created/allocated  by  execu9ng  tasks  

§  Extension  of  Execu9on  Policy    §  Schedule  tasks  for  execu9on  §  Manage  tasks’  dynamic  lifecycle  

Page 6: Kokkos% Hierarchical%TaskData%Parallelism%on-demand.gputechconf.com/gtc/2016/presentation/s6145... · 2016. 4. 26. · Photos placed in horizontal position with even amount of white

5  

Use  Cases  (mini-­‐applica9ons)  1.   Incomplete  Level-­‐K  Cholesky  factoriza9on  of  sparse  matrix    §  Block  par99oning  into  submatrices  §  DAG  of  submatrix  computa9ons  §  Each  submatrix  computa9on  is  internally  data  parallel  

§  Lead:  Kyungjoo  Kim  

2.   Triangle  enumera9on  in  social  networks  (highly  irregular  graphs)    §  Iden9fy  triangles  within  the  graph  §  Compute  sta9s9cs  on  triangles  §  Triangles  are  an  intermediate  result  that  do  not  need  to  be  saved  /  stored  Ø  Problem:  memory  “high  water  mark”  

§  Lead:  Michael  Wolf  

114 875 91 6 102 30

3

0

8

10

2

7

6

11

5

9

4

1

X X

11

10

9

8

6 X

7

X4

5

X

XXX

X

X

X

X X

X

X

3

2

X

X

1

0

Chol

Trsm

Herk

Chol

Herk

HerkGemmHerk

Herk Gemm

Herk

Trsm

Trsm Trsm

Chol

Chol

Chol

Trsm

Trsm

4 5 1

2

3

k2

k1

Page 7: Kokkos% Hierarchical%TaskData%Parallelism%on-demand.gputechconf.com/gtc/2016/presentation/s6145... · 2016. 4. 26. · Photos placed in horizontal position with even amount of white

6  

Use  Case  1:  Incomplete  Cholesky  Factoriza9on  § Reordering  and  Block  Par99oning  of  Sparse  Matrix  § One  driver  task  that  creates  and  spawns  submatrix  task-­‐DAG  §  Submatrix  tasks  factor,  triangular  solve,  rank-­‐k  update,  mul9ply  

114 875 91 6 102 30

3

0

8

10

2

7

6

11

5

9

4

1

X X

11

10

9

8

6 X

7

X4

5

X

XXX

X

X

X

X X

X

X

3

2

X

X

1

0

Chol

Trsm

Herk

Chol

Herk

HerkGemmHerk

Herk Gemm

Herk

Trsm

Trsm Trsm

Chol

Chol

Chol

Trsm

Trsm

Page 8: Kokkos% Hierarchical%TaskData%Parallelism%on-demand.gputechconf.com/gtc/2016/presentation/s6145... · 2016. 4. 26. · Photos placed in horizontal position with even amount of white

X X

11

10

9

8

6 X

7

X4

5

X

XXX

X

X

X

X X

X

X

3

2

X

X

1

0

0

1

2

3

4

0 1 2 3 4

TRSMCHOL

HERK

CHOL

TRSM

HERK

CHOL

HERK

HERKGEMMHERK

HERK GEMM

HERK

TRSM

TRSM TRSM

CHOL

CHOL

CHOL

TRSM

TRSM

7  

Use  Case  1:  Incomplete  Cholesky  Factoriza9on  Driver  Task:  Spawn  Submatrix  Task-­‐DAG  

§  Periodically  stop  itera9ng  and  respawn  this  task  with  low  priority  §  ThroWle  back  submatrix  task  genera9on  for  memory  constraint  §  Allow  submatrix  tasks  to  complete  and  their  memory  to  be  reclaimed  

X X

11

10

9

8

6 X

7

X4

5

X

XXX

X

X

X

X X

X

X

3

2

X

X

1

0

0

1

2

3

4

0 1 2 3 4

GEMM

TRSM

DONE

HERK

CHOL T

H

DONEDONE

CHOL

TRSM

HERK

CHOL

HERK

HERKGEMMHERK

HERK GEMM

HERK

TRSM

TRSM TRSM

CHOL

CHOL

CHOL

TRSM

TRSM

X X

11

10

9

8

6 X

7

X4

5

X

XXX

X

X

X

X X

X

X

3

2

X

X

1

0

0

1

2

3

4

0 1 2 3 4

GEMM

HERK

H

DONE DONE

CHOL TRSMT

CHOL

TRSM

HERK

CHOL

HERK

HERKGEMMHERK

HERK GEMM

HERK

TRSM

TRSM TRSM

CHOL

CHOL

CHOL

TRSM

TRSM

X X

11

10

9

8

6 X

7

X4

5

X

XXX

X

X

X

X X

X

X

3

2

X

X

1

0

0

1

2

3

4

0 1 2 3 4

HERK

C

DONE DONE

TRSM

CHOL

TRSM

HERK

CHOL

HERK

HERKGEMMHERK

HERK GEMM

HERK

TRSM

TRSM TRSM

CHOL

CHOL

CHOL

TRSM

TRSM

X X

11

10

9

8

6 X

7

X4

5

X

XXX

X

X

X

X X

X

X

3

2

X

X

1

0

0

1

2

3

4

0 1 2 3 4

DONE DONE

CHOL

CHOL

TRSM

HERK

CHOL

HERK

HERKGEMMHERK

HERK GEMM

HERK

TRSM

TRSM TRSM

CHOL

CHOL

CHOL

TRSM

TRSM

Page 9: Kokkos% Hierarchical%TaskData%Parallelism%on-demand.gputechconf.com/gtc/2016/presentation/s6145... · 2016. 4. 26. · Photos placed in horizontal position with even amount of white

8  

Use  Case  2:  Triangle  Enumera9on  and  Sta9s9cs  of  Social  Network    

§ miniTri:  proxy  (mini-­‐applica9on)  for  triangle  based  data  analy9cs  

§  Current  linear-­‐algebra  strategy  §  “A”  is  the  adjacency  matrix  §  “B”  is  the  incidence  matrix  

§  Challenges  §  Very  irregular  graph,  difficult  to  sta9cally  load  balance  §  Graph  BLAS  strategy  explicitly  forms  “C”  which  is  all  triangles  in  the  graph  Ø Extremely  large  intermediate  storage  of  “C”  

§  Task  parallelism  pipelines  opera9ons  §  Each  phase  is  a  data  parallel  BLAS  §  Block  par99on  and  pipeline  via  DAG  §  Priori9ze  downstream  tasks  to  “re9re”  temporary  “C”  submatrices  

miniTri 1 C = A * B 2 tv = C * 1 3 te = CT * 1 4 kcount(C, tv, te)

K-count

core 1

core 2

core 3

core 4

C=A!B T4 T5 T6 T7 T8 T2 T3

te = CT!1

tv = C!1

T6 T1 T2 T4 T5 T7 T8 T3

T1

T4 T5 T6 T7 T8 T3 T1 T2

T6 T1 T2 T4 T5 T7 T8 T3

Page 10: Kokkos% Hierarchical%TaskData%Parallelism%on-demand.gputechconf.com/gtc/2016/presentation/s6145... · 2016. 4. 26. · Photos placed in horizontal position with even amount of white

Task  Parallelism  for  Resource  Constraints    

§  Key  insight:    Task  parallelism  can  be  used  to  reduce  memory  footprint    §  Priori?ze  k-­‐count  tasks  to  free  blocks  of  triangles  from  memory  §  Need  run?me  system  to  support  advanced  resource  management/

priori?es  (ongoing  effort:  HPX  and  Kokkos/Qthreads)  

9  

K-count

core 1

core 2

core 3

core 4

C=A!B T4 T5 T6 T7 T8 T2 T3

te = CT!1

tv = C!1

T6 T1 T2 T4 T5 T7 T8 T3

T1

T4 T5 T6 T7 T8 T3 T1 T2

T6 T1 T2 T4 T5 T7 T8 T3

Prioritize 1.00E+02'

1.00E+03'

1.00E+04'

1.00E+05'

1.00E+06'

1.00E+07'

1.00E+08'

1.00E+09'

1.00E+10'

1.00E+11'

1.00E+12'

1.00E+13'

1.00E+14'

1k' 10k' 100k' 1M' 10M' 100M' 1B' 10B'

Mem

ory'size'(M

B)'

Number'of'ver3ces'

Memory'Usage'for'miniTri'

Task'//'GABB'

GABB'

Supercomputer'(1PB)'

WorkstaEon'(512GB)'

Task parallel approach allows Graph BLAS implementation of miniTri to solve much larger problems

Page 11: Kokkos% Hierarchical%TaskData%Parallelism%on-demand.gputechconf.com/gtc/2016/presentation/s6145... · 2016. 4. 26. · Photos placed in horizontal position with even amount of white

parallel_for

parallel_reduce

10  

Hierarchical  Parallelism  §  Shared  func9onality  with  hierarchical  data-­‐data  parallelism  

§  The  same  kernel  (task)  executed  on  …  §  OpenMP:  League  of  Teams  of  Threads  §  Cuda:                Grid            of  Blocks  of  Threads  

§  Intra-­‐Team  Parallelism  (data  or  task)  §  Threads  within  a  team  execute  concurrently    §  Data:  each  team  executes  the  same  computa9on  §  Task:  each  team  executes  a  different  task  Ø Nested  parallel  paWerns:  for,  reduce,  scan  

§  Mapping  teams  onto  hardware  §  CPU  :  team  ==  hyperthreads  sharing  L1  cache  

§  Requires  low  degree  of  intra-­‐team  parallelism  §  Cuda  :  team  ==  thread  block  

§  Requires  high  degree  of  intra-­‐team  parallelism  §  …  revisit  this  later  

Page 12: Kokkos% Hierarchical%TaskData%Parallelism%on-demand.gputechconf.com/gtc/2016/presentation/s6145... · 2016. 4. 26. · Photos placed in horizontal position with even amount of white

11  

Anatomy  and  Life-­‐cycle  of  a  Task  § Anatomy  §  Is  a  C++  closure  (e.g.,  functor)  of  data  +  func9on  §  Is  referenced  by  a  Kokkos::future  §  Executes  on  a  single  thread  or  thread  team  §  May  only  execute  when  its  dependences  are  complete  (DAG)  

§  Life-­‐cycle:  construc9ng  

wai9ng  

execu9ng  

task  with  internal  data  parallelism  on  a  thread  team  

serial  task  on  a  single  thread  

complete  

Page 13: Kokkos% Hierarchical%TaskData%Parallelism%on-demand.gputechconf.com/gtc/2016/presentation/s6145... · 2016. 4. 26. · Photos placed in horizontal position with even amount of white

12  

Dynamic  Task  DAG  Execu9on  Policy  § Manage  a  heterogeneous  collec9on  of  tasks  §  Map  task  execu9on  to  a  thread  team  or  single  thread    §  Execu9on  constrained  by  task  dependence  DAG  §  Memory  management  for  tasks’  dynamically  allocated  memory  

§ Challenges  §  Portability  across  mul9core/manycore  architectures:  CPU,  GPU,  Xeon  Phi,  …  

§  GPU  func9on  pointer  accessibility  on  host  and  device  §  Dynamic  –  crea9ng  tasks  within  execu9ng  tasks  on  GPU  §  Performance  –  thread  scalable  alloca9on/dealloca9on  within  finite  memory  §  Performance  –  execu9on  overhead  and  thread  scalable  scheduling  

§ Non-­‐blocking  constraint  for  portability  and  performance  §  An  execu9ng  task  cannot  block  or  yield  §  Eliminates  overhead  of  saving  execu9on  state:  registers,  stack,  …  §  Reduces  overhead  of  context  switching  

Page 14: Kokkos% Hierarchical%TaskData%Parallelism%on-demand.gputechconf.com/gtc/2016/presentation/s6145... · 2016. 4. 26. · Photos placed in horizontal position with even amount of white

13  

Managing  a  Non-­‐blocking  Task’s  Lifecycle  § Create:  allocate  and  construct    §  By  main  process  or  within  another  task    §  Allocate  from  a  memory  pool  §  Construct  internal  data  §  Assign  DAG  dependences  

§  Spawn:  enqueue  to  scheduler  § Respawn:  re-­‐enqueue  to  scheduler  §  Instead  of  the  task  wai9ng  or  yielding  §  Can  reassign  DAG  dependences  

§  Essen9al  capability  for  mini-­‐Tri  

§  Used  by  Cholesky  factoriza9on  to  throWle  back  task  genera9on  §  Reconceived  wait-­‐for-­‐child-­‐task  use  case  

Ø  Create  &  spawn  child  task(s)  Ø  Reassign  DAG  dependence(s)  to  new  child  task(s)  Ø  Re-­‐spawn  to  execute  again  aker  child  task(s)  complete  

construc9ng  

wai9ng  

execu9ng  

complete  

create  

spawn  

respawn  

Page 15: Kokkos% Hierarchical%TaskData%Parallelism%on-demand.gputechconf.com/gtc/2016/presentation/s6145... · 2016. 4. 26. · Photos placed in horizontal position with even amount of white

14  

CPU/GPU  Portable  Scheduler  § Mul9ple  Queues  of  Tasks  §  Wai9ng  on  incomplete  task(s);  i.e.  task  dependences  §  Mul9ple  ready  to  execute  queues  with  priori9es:  high,  regular,  low  §  Queues  managed  with  atomic  opera9ons  

§ Persistent  Threads  Strategy  (CPU  pthreads,  GPU  thread  blocks)  §  Pop  ready  tasks,  by  priority,  and  execute  §  When  task  is  complete  update  dependent  tasks;  e.g.,  move  to  ready  queue    §  When  task  is  dereferenced  (last  future  is  destroyed)  reclaim  memory  

§ Constraint:  Tasks  reside  within  fixed  size  Memory  Pool  §  Tasks  can  create  (allocate)  and  spawn  new  tasks,  even  on  GPU  §  Algorithms  must  account  for  fixed  size  constraint  when  crea9ng  new  tasks  

§  E.g.,  Incomplete  Cholesky  factoriza9on  throWles  back  task  crea9on  

Page 16: Kokkos% Hierarchical%TaskData%Parallelism%on-demand.gputechconf.com/gtc/2016/presentation/s6145... · 2016. 4. 26. · Photos placed in horizontal position with even amount of white

15  

Use  Case  1:  Incomplete  Cholesky  Factoriza9on  GPU  Performance  Evalua9on  

§  Ini9al  prototype,  to-­‐be-­‐done  improvements  &  op9miza9on  ü  Successfully  executes  dynamic  task-­‐DAG  

§ Recall  thread-­‐team  tasks  are  mapped  to  CUDA  thread  blocks  Ø  Requires  high  degree  of  intra-­‐team  parallelism  

§   Sparse  submatrix  tasks  use  intra-­‐team  parallelism  §  Dominated  by  non-­‐coalesced  indirect  memory  access  (reference  chasing)  §  Insufficient  work,  low  computa9onal  intensity,  high  register  usage  Ø  12%  occupancy  and  poor  memory  access  paWerns  

§  To-­‐do  Improvement:  map  thread-­‐team  tasks  to  CUDA  warps  §  Requires  refactoring  of  intra-­‐team  parallel  paWerns  and  policies  §  The  revisit  memory  access  paWerns  §  …  stay  tuned    

unacceptable  

Page 17: Kokkos% Hierarchical%TaskData%Parallelism%on-demand.gputechconf.com/gtc/2016/presentation/s6145... · 2016. 4. 26. · Photos placed in horizontal position with even amount of white

16  

Conclusion  and  Ongoing  R&D    ü  Prototype  Portable  Dynamic  Task-­‐DAG    §  Portable:  CPU  and  NVIDIA  GPU  architectures  §  Collec9on  of  heterogeneous  parallel  computa9ons;  a.k.a.,  tasks  §  With  directed  acyclic  graph  (DAG)  of  task  dependences  §  Dynamic  –  tasks  may  create  tasks  §  Hierarchical  –  thread-­‐team  data  parallelism  within  tasks  

§  Challenges,  primarily  for  GPU  portability  and  performance  §  Non-­‐blocking  tasks  !  respawn  instead  of  wait  §  Memory  pool  for  dynamically  allocatable  tasks  §  TBD:  Thread-­‐team  map  onto  GPU  warps,  not  thread  blocks  

§  In  progress:  Refactoring  of  thread-­‐team  mapping  for  GPU  Ø Unfortunately,  the  clock  ran  out  on  us  for  GPU-­‐Tech  2016  §  …  stay  tuned