CILK/CILK++ AND REDUCERS - WordPress.com · Pictures from “Reducers and Other CILK+...

Post on 16-Aug-2020

CILK/CILK++ AND REDUCERS
YUNMING ZHANG
RICE UNIVERSITY

OUTLINE
•  CILK and CILK++ Language Features and Usages
•  Work stealing runtime
•  CILK++ Reducers
•  Conclusions

IDEALIZED SHARED MEMORY ARCHITECTURE
•  Hardware model
    •  Processors
    •  Shared global memory
•  Software model
    •  Threads
    •  Shared variables
    •  Communication
    •  Synchronization
Slide from Comp 422 Rice University Lecture 4

CILK AND CILK++ DESIGN GOALS
•  Programmer friendly
    •  Dynamic tasking
    •  Parallel extension to C
•  Scalable performance
    •  Efficient runtime system
    •  Minimum program overhead

CILK KEYWORDS
•  cilk: declares a Cilk function
•  spawn: the call can execute asynchronously in a concurrent thread
•  sync: the current thread waits for all locally spawned functions to complete

CILK EXAMPLE
cilk int fib(int n) {
    if (n < 2) return n;
    else {
        int n1, n2;
        n1 = spawn fib(n-1);   /* may run in parallel with the next spawn */
        n2 = spawn fib(n-2);
        sync;                  /* wait for both spawned calls */
        return (n1 + n2);
    }
}
Borrowed from Comp 422 Rice University Lecture 4

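CILK++ EXAMPLE
int fib(int n) {
    if (n < 2) return n;
    else {
        int n1, n2;
        n1 = cilk_spawn fib(n-1);   // may run in parallel with the continuation
        n2 = fib(n-2);              // ordinary call, executed by the parent
        cilk_sync;                  // wait for the spawned call
        return (n1 + n2);
    }
}
Borrowed from Comp 422 Rice University Lecture 4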
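
For readers without a Cilk++ compiler, the same fork-join shape can be sketched in standard C++. This is only an approximation under stated assumptions: std::async stands in for cilk_spawn, future::get() stands in for cilk_sync, and the names fib_seq, fib_par and the depth cutoff are hypothetical helpers, not part of Cilk++. Unlike the Cilk++ runtime, std::async does no work stealing, so the cutoff is there to bound the number of threads.

```cpp
#include <cassert>
#include <future>

// Serial version: what remains if the parallel keywords are removed.
int fib_seq(int n) {
    return n < 2 ? n : fib_seq(n - 1) + fib_seq(n - 2);
}

// Fork-join version. `depth` (an assumption of this sketch) limits how many
// levels of the recursion launch a new thread.
int fib_par(int n, int depth) {
    assert(n >= 0 && depth >= 0);
    if (n < 2) return n;
    if (depth == 0) return fib_seq(n);
    // Plays the role of: n1 = cilk_spawn fib(n-1);
    std::future<int> n1 =
        std::async(std::launch::async, fib_par, n - 1, depth - 1);
    // The parent runs the second call itself, as in the CILK++ example.
    int n2 = fib_par(n - 2, depth - 1);
    // Plays the role of: cilk_sync;
    return n1.get() + n2;
}
```

A call such as fib_par(20, 3) launches at most seven helper threads and then falls back to the serial version.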

CILK++ EXAMPLE WITH DAG
[Figure: DAG of the CILK++ example]
Pictures from “Reducers and Other Cilk++ Hyperobjects”, talk by Matteo Frigo (Intel), Pablo Halpern (Intel), Charles E. Leiserson (MIT), Stephen Lewin-Berlin (Intel).


WORK FIRST PRINCIPLE
•  Work: T_1
•  Critical path length: T_∞
•  Number of processors: P
•  Expected time
    •  T_P = T_1/P + O(T_∞)
•  Parallel slackness assumption
    •  T_1/P >> C_∞ T_∞

WORK FIRST PRINCIPLE
•  Minimize the scheduling overhead borne by the work, even at the expense of increasing the critical path
•  T_P ≤ C_1 T_S/P + C_∞ T_∞ ≈ C_1 T_S/P, where T_S is the running time of the serial program
•  Minimize C_1 even at the expense of a larger C_∞
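
To make T_1 and T_∞ concrete, the sketch below counts them for the fib example, assuming every call to fib costs one unit of time. That unit cost, the names WorkSpan, fib_cost and greedy_estimate, and the choice of 1 for both constants are assumptions of this illustration, not measurements of a real Cilk runtime.

```cpp
#include <algorithm>
#include <cassert>

// Work (T_1) and critical path length (T_inf) of a computation.
struct WorkSpan {
    long work;
    long span;
};

// Cost of fib(n) when each invocation counts as one unit of time.
// Work adds up both recursive calls; span follows the longer of the two.
WorkSpan fib_cost(int n) {
    assert(n >= 0);
    if (n < 2) return {1, 1};
    WorkSpan a = fib_cost(n - 1);
    WorkSpan b = fib_cost(n - 2);
    return {1 + a.work + b.work, 1 + std::max(a.span, b.span)};
}

// T_1/P + T_inf, i.e. the expected-time formula with all constants set to 1.
double greedy_estimate(WorkSpan c, int p) {
    assert(p > 0);
    return static_cast<double>(c.work) / p + static_cast<double>(c.span);
}
```

For fib(10) this gives a work of 177 and a span of 10, so on 4 processors the estimate is 177/4 + 10 = 54.25 time units. The work term dominates, which is the parallel slackness situation the slide assumes.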

WORK STEALING DESIGN GOALS
•  Minimize contention
    •  Decentralized task deque
    •  Doubly linked deque
•  Minimize communication
    •  Steal work rather than push work
•  Load balance across cores
    •  Lazy task creation
    •  Steal from the top of the deque
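
The deque discipline behind these goals can be sketched as follows: each worker owns one deque, pushes and pops its own work at the bottom, and a thief takes the oldest entry from the top. WorkDeque and its method names are hypothetical, and the single mutex is a simplifying assumption made for clarity; the slides do not describe the synchronization the real runtime uses.

```cpp
#include <deque>
#include <mutex>
#include <optional>
#include <utility>

// One deque per worker (decentralized), sketched with a plain mutex.
template <typename Task>
class WorkDeque {
public:
    // Owner: push newly created work at the bottom.
    void push_bottom(Task t) {
        std::lock_guard<std::mutex> guard(mutex_);
        tasks_.push_back(std::move(t));
    }

    // Owner: take the most recently pushed task from the bottom.
    std::optional<Task> pop_bottom() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.back());
        tasks_.pop_back();
        return t;
    }

    // Thief: take the oldest task from the top.
    std::optional<Task> steal_top() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.front());
        tasks_.pop_front();
        return t;
    }

private:
    std::mutex mutex_;
    std::deque<Task> tasks_;
};
```

Because the owner and the thief work at opposite ends, they only meet when the deque is nearly empty, and the oldest entry tends to represent the largest piece of remaining work.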

CILK WORK STEALING SCHEDULER
[Figure sequence, 14 slides: step-by-step pictures of the work stealing scheduler, from “Reducers and Other Cilk++ Hyperobjects”]

TWO CLONE STRATEGY
•  Fast clone
    •  Identical in most respects to the C elision of the Cilk program
    •  Very little execution overhead
    •  Sync statements compile to no-ops
    •  Allocates a continuation
        •  Program variables and instruction pointer
•  Slow clone
    •  A spawned procedure is converted to its slow clone only when it is stolen
    •  Restores program state from the activation frame, which contains the local variables, program counter and other parts of the procedure instance

FAST CLONE
[Figure: fast clone]

SLOW CLONE
Slow_fib(frame * _cilk_frame) {
    switch (_cilk_frame->header.entry) {
        case 1: goto _cilk_sync1;   /* resume after fast_fib(_cilk_frame->n - 1) */
        case 2: goto _cilk_sync2;   /* resume after fast_fib(_cilk_frame->n - 2) */
        case 3: goto _cilk_sync3;   /* resume after sync (not a no-op) */
    }
}
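
The idea of restoring state from a frame and jumping to a saved entry point can be imitated in plain C++ with a switch on the entry number. This is a toy sketch, not the code the Cilk compiler generates: FibFrame, resume_fib and the entry numbering 0/1/2 are hypothetical, and both recursive calls are simply run serially.

```cpp
#include <cassert>

// Hypothetical activation frame: saved entry point plus local variables.
struct FibFrame {
    int entry = 0;  // where to resume: 0 = start, 1 = after first call, 2 = after second
    int n = 0;
    int x = 0;      // result of fib(n-1), once computed
    int y = 0;      // result of fib(n-2), once computed
};

int fib_seq(int n) {
    return n < 2 ? n : fib_seq(n - 1) + fib_seq(n - 2);
}

// Continues fib(f.n) from the entry point recorded in the frame, reusing
// whatever partial results the frame already holds.
int resume_fib(FibFrame& f) {
    switch (f.entry) {
        case 0:
            if (f.n < 2) return f.n;
            f.x = fib_seq(f.n - 1);
            f.entry = 1;
            [[fallthrough]];
        case 1:
            f.y = fib_seq(f.n - 2);
            f.entry = 2;
            [[fallthrough]];
        case 2:
            return f.x + f.y;
        default:
            assert(false && "unknown entry point");
            return -1;
    }
}
```

A frame saved with entry = 1, n = 10 and x = 34 resumes after the first call and still returns 55, which is the behavior the slow clone needs after a steal.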

FRAMES
•  C++ Main Frame
    •  Local variables of the procedure instance
    •  Temporary variables
    •  Linkage information for return values

FRAMES
•  CILK++ Stack Frame
    •  Everything in C++ Main Frame
    •  Continuation
    •  Parent pointer
    •  Has exactly one child
    •  Used by the fast clone
    •  A worker can have multiple Stack Frames

FRAMES
•  CILK++ Full Frame (used by the slow clone)
    •  Everything in CILK++ Stack Frame
    •  Lock
    •  Join counter
    •  List of children (can have more than one child)
    •  A worker has at most one Full Frame

EXTENDED DEQUE WITH CALL STACKS
[Figure: an extended deque made of call stacks; legend: stack frame, full frame, extended deque, call stack]

FUNCTION CALL
Operations on the extended deque: function call, spawn, call return, spawn return, sync, randomly steal, provably good steal, unconditionally steal, resume full frame.
[Figures: extended deque before and after a function call]
•  After the call: new stack frame

SPAWN
[Figures: extended deque before and after a spawn call]
•  Set continuation in last stack frame

RESUME FULL FRAME
[Figure: extended deque]
•  Set the full frame to be the only frame in the call stack, resume execution on the continuation

RANDOMLY STEAL
[Figure sequence, 3 slides: extended deque; the last two pictures label frames with 1, 1, 1]
•  Steal this call stack

PROVABLY GOOD STEAL
[Figure: extended deque; a frame is labeled 0]

UNCONDITIONALLY STEAL
[Figure: extended deque; a frame is labeled 2]
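
One way to read the numbers in these pictures is as join counters of full frames, that is, the number of children that have not yet returned. Under that reading (an assumption of this sketch, since the transcript does not preserve the pictures), the difference between the two kinds of steal can be written down as below. FullFrame is reduced to a single field and all function names are hypothetical.

```cpp
#include <cassert>

// Hypothetical, stripped-down full frame: only the join counter is kept.
struct FullFrame {
    int join_counter = 0;  // children that have not returned yet
};

// Provably good steal: the parent may be resumed only if no child is
// still outstanding.
bool provably_good_steal(const FullFrame& parent) {
    return parent.join_counter == 0;
}

// Unconditional steal: the parent is resumed whatever its counter says,
// as happens when a called (not spawned) child returns.
bool unconditional_steal(const FullFrame&) {
    return true;
}

// A spawned child returns: one fewer outstanding child, then the worker
// attempts a provably good steal of the parent.
bool on_spawn_return(FullFrame& parent) {
    assert(parent.join_counter > 0);
    --parent.join_counter;
    return provably_good_steal(parent);
}
```

With a counter of 2, the first returning child fails to steal the parent and the second one succeeds, while an unconditional steal would go ahead in either state.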

FUNCTION CALL RETURN
[Figures: extended deque before and after a return from a call, case 1]
[Figure: return from a call, case 2]
•  Case 2: worker executes an unconditional steal

SPAWN RETURN
[Figures: extended deque before and after a spawn return, case 1]
[Figure: return from a spawn, case 2]
•  Case 2: worker executes a provably good steal

SYNC
[Figures: extended deque, sync case 1 and case 2]
•  Case 1: do nothing if it is a stack frame (no-op)
•  Case 2: pop the frame, provably good steal


PROBLEMS WITH NON-LOCAL VARIABLES
bool has_property(Node *);
List<Node *> output_list;
void walk(Node *x) {
    if (x) {
        if (has_property(x)) output_list.push_back(x);   // data race: parallel walks update the same list
        cilk_spawn walk(x->left);
        walk(x->right);
        cilk_sync;
    }
}

REDUCER DESIGN GOALS
•  Support parallelization of programs containing global variables
•  Enable efficient parallel scaling by avoiding a single point of contention
•  Provide deterministic results for associative reduce operations
•  Operate independently of any control constructs

REDUCER EXAMPLE
bool has_property(Node *);
List_append_reducer<Node *> output_list;
void walk(Node *x) {
    if (x) {
        if (has_property(x)) output_list.push_back(x);   // updates the current strand's view
        cilk_spawn walk(x->left);
        walk(x->right);
        cilk_sync;
    }
}

HYPER OBJECTS
[Figure from “Reducers and Other Cilk++ Hyperobjects”]

REDUCER
[Figure from “Reducers and Other Cilk++ Hyperobjects”]

SEMANTICS OF REDUCERS
•  The child strand owns the view owned by the parent function before cilk_spawn
•  The parent strand owns a new view, initialized to the identity view e
    •  A special optimization covers the case where a view is unchanged when combined with the identity view
•  The parent strand P owns the view from completed child strands
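
The effect of these rules can be imitated in standard C++ by giving each strand its own list and concatenating the lists in serial order at the join. This is a sketch under assumptions, not the Cilk++ reducer library: std::async stands in for cilk_spawn, the Node layout and the even-value has_property are hypothetical, and for simplicity the child starts from a fresh empty view while the parent keeps its own, which yields the same final list because concatenation is associative.

```cpp
#include <future>
#include <list>

struct Node {
    int value;
    const Node* left;
    const Node* right;
};

// Hypothetical property for the illustration: the value is even.
bool has_property(const Node* x) { return x->value % 2 == 0; }

// Each call works on its own view and returns it; no list is shared.
std::list<const Node*> walk(const Node* x) {
    std::list<const Node*> view;  // identity view: the empty list
    if (!x) return view;
    if (has_property(x)) view.push_back(x);
    // Plays the role of: cilk_spawn walk(x->left);
    std::future<std::list<const Node*>> left =
        std::async(std::launch::async, walk, x->left);
    std::list<const Node*> right = walk(x->right);
    // Plays the role of cilk_sync: reduce in serial order (node, left, right).
    view.splice(view.end(), left.get());
    view.splice(view.end(), right);
    return view;
}
```

Whatever the thread timing, the returned list has the same order as a serial pre-order walk, which is the determinism the reducer design goals ask for.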

REDUCING OVER LIST CONCATENATION
[Figure sequence, 2 slides, from “Reducers and Other Cilk++ Hyperobjects”]

IMPLEMENTATION OF REDUCER
•  Each worker maintains a hypermap
•  Hypermap
    •  Maps reducers to the views
•  User
    •  The view of the current procedure
•  Children
    •  The view of the children procedures
•  Right
    •  The view of the right sibling
•  Identity
    •  The default value of a view

UNDERSTANDING HYPERMAPS
bool has_property(Node *);
List_append_reducer<Node *> output_list;
void walk(Node *x)                      // <-- Proc A
{
    if (x) {
        if (has_property(x)) output_list.push_back(x);
        cilk_spawn walk(x->left);       // <-- Proc B
        cilk_spawn walk(x->right);      // <-- Proc C
        cilk_sync;
    }
}

LAZY CREATION
•  A new view will only be created
    •  After a steal
    •  On demand

HYPERMAP CREATION
[Figure sequence, 5 slides, from “Reducers and Other Cilk++ Hyperobjects”]

LOOK UP FAILURE
•  Inserts a view containing an identity element for the reducer into the hypermap
    •  Following the lazy principle
•  Look up returns the newly inserted identity view
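
The look up rule above amounts to "find, or insert the identity and return it". A minimal sketch, assuming a reducer is identified by an integer key and a view is a vector under concatenation (ReducerId, View and HyperMap are hypothetical names, not the runtime's types):

```cpp
#include <unordered_map>
#include <vector>

using ReducerId = int;          // hypothetical key for a reducer
using View = std::vector<int>;  // hypothetical view; identity is the empty vector

struct HyperMap {
    std::unordered_map<ReducerId, View> views;

    // Returns the view for reducer r. On a look up failure, an identity
    // view is inserted first, so views exist only for reducers that are
    // actually accessed (lazy creation).
    View& lookup(ReducerId r) {
        auto it = views.find(r);
        if (it == views.end()) {
            it = views.emplace(r, View{}).first;
        }
        return it->second;
    }
};
```

A second look up of the same reducer finds the view inserted by the first one, so updates made through the returned reference are kept.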

RANDOM WORK STEALING
A random steal operation steals a full frame P and replaces it with a new full frame C in the victim.
USER_C ← USER_P; USER_P ← ∅; CHILDREN_P ← ∅; RIGHT_P ← ∅.
[Figure from “Reducers and Other Cilk++ Hyperobjects”]

RETURN FROM A CALL
Let C be a child frame of the parent frame P that originally called C, and suppose that C returns.
•  If C is a stack frame, do nothing
•  If C is a full frame
    •  Transfer ownership of view
    •  Children and Right are empty
    •  USER_P ← USER_C
[Figures: extended deque before and after a return from a call, case 1; in case 2 the worker executes an unconditional steal]

RETURN FROM A SPAWN
Let C be a child frame of the parent frame P that originally spawned C, and suppose that C returns.
•  Always do USER_C ← REDUCE(USER_C, RIGHT_C)
•  If C is a stack frame, do nothing
•  If C is a full frame
    •  If C has a left sibling L
        •  RIGHT_L ← REDUCE(RIGHT_L, USER_C)
    •  If C is the leftmost child
        •  CHILDREN_P ← REDUCE(CHILDREN_P, USER_C)

RETURN FROM A SPAWN EXAMPLE
bool has_property(Node *);
List_append_reducer<Node *> output_list;
void walk(Node *x)                      // <-- Proc A
{
    if (x) {
        if (has_property(x)) output_list.push_back(x);
        cilk_spawn walk(x->left);       // <-- Proc B
        cilk_spawn walk(x->right);      // <-- Proc C
        cilk_sync;
    }
}

SYNC
A cilk_sync statement waits until all children have completed. When frame P executes a cilk_sync, one of the following two cases applies:
•  If P is a stack frame, do nothing.
•  If P is a full frame,
    •  USER_P ← REDUCE(CHILDREN_P, USER_P).
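
The spawn return and sync rules can be traced on the Proc A / B / C example with a small model in which every frame is a full frame and a view is a string under concatenation. Frame, reduce, return_from_spawn and sync_frame are hypothetical names for this sketch, the left_sibling pointer is an assumed way to represent "C has a left sibling L", and clearing CHILDREN after the sync is an assumption added so that a second sync would not count the same views twice.

```cpp
#include <cassert>
#include <string>

using View = std::string;  // concatenation; the identity is ""

View reduce(const View& left, const View& right) { return left + right; }

struct Frame {
    View user, children, right;     // the three hypermap entries, one reducer
    Frame* parent = nullptr;
    Frame* left_sibling = nullptr;  // nearest left sibling still outstanding
};

// Spawn return for a full frame C.
void return_from_spawn(Frame& c) {
    assert(c.parent != nullptr);
    c.user = reduce(c.user, c.right);            // USER_C <- REDUCE(USER_C, RIGHT_C)
    if (c.left_sibling) {                        // RIGHT_L <- REDUCE(RIGHT_L, USER_C)
        c.left_sibling->right = reduce(c.left_sibling->right, c.user);
    } else {                                     // CHILDREN_P <- REDUCE(CHILDREN_P, USER_C)
        c.parent->children = reduce(c.parent->children, c.user);
    }
}

// Sync for a full frame P.
void sync_frame(Frame& p) {
    p.user = reduce(p.children, p.user);         // USER_P <- REDUCE(CHILDREN_P, USER_P)
    p.children.clear();                          // assumption: the view has been consumed
}
```

If B holds "ab", C holds "c", and C returns before B, then C's view is parked in RIGHT_B, B's return produces "abc" in CHILDREN_A, and the sync leaves "abc" in USER_A, the same order a serial run would produce.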

BENEFITS OF REDUCERS


CONCLUSIONS
•  CILK and CILK++ provide a programmer friendly programming model
    •  Extension to C
    •  Incremental parallelism
    •  Scaling on future machines
•  Non-compromising performance
    •  Work stealing runtime
    •  Minimizing overheads
    •  Reducers

FINAL NOTES
•  Designed for an idealized shared memory model
    •  Today’s architectures are typically NUMA
•  Task creation can be lazier
    •  http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6012915&tag=1
•  Cilk_for
    •  Divide and conquer parallelization
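
The divide and conquer idea behind a parallel loop can be sketched in standard C++: split the index range in half, run one half asynchronously, recurse on the other, and fall back to a serial loop below a grain size. parallel_for, squares and the grain value of 128 are hypothetical choices for this illustration, and std::async is used as a stand-in since it does not schedule by work stealing.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Applies body(i) for every i in [lo, hi), splitting the range recursively.
void parallel_for(std::size_t lo, std::size_t hi, std::size_t grain,
                  const std::function<void(std::size_t)>& body) {
    assert(lo <= hi && grain > 0);
    if (hi - lo <= grain) {  // small range: plain serial loop
        for (std::size_t i = lo; i < hi; ++i) body(i);
        return;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    // Left half runs asynchronously (the "spawn").
    std::future<void> left = std::async(std::launch::async, [&] {
        parallel_for(lo, mid, grain, body);
    });
    parallel_for(mid, hi, grain, body);  // right half runs in this thread
    left.get();                          // the "sync"
}

// Usage: each iteration writes a distinct element, so there is no race.
std::vector<long> squares(std::size_t n) {
    std::vector<long> out(n);
    parallel_for(0, n, 128, [&out](std::size_t i) {
        out[i] = static_cast<long>(i) * static_cast<long>(i);
    });
    return out;
}
```

Splitting by halves keeps the critical path logarithmic in the number of chunks, whereas spawning the iterations one after another from a single loop would make it linear.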