ParaFormance™: An Advanced Refactoring Tool for Parallelising C++ Programs – Part 3
Chris Brown, Vladimir Janjic, Kevin Hammond University of St Andrews, Scotland @chrismarkbrown @rephrase_eu
Outline
1. Introduction and overview of Safety Checking
2. Live Demonstration of Safety Checking
3. Follow along interactively
4. Ant Colony
5. Wrap up
6. Questions and Answers
What is Safety Checking?
1. Introduces parallelism into an application semi-automatically
2. Refactors a sequential portion of code into a parallel version
3. Introduces all parallel 'business logic'
Why?
• Huge saving in effort over manual parallelisation
• Parallelism is difficult to get right!
• Can work on inserted code or existing code
• Without safety checking there is essentially no way to know whether your existing application will run without errors
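To make the point concrete, here is a minimal, hypothetical illustration of the kind of hazard a safety check looks for: the first loop is safe to parallelise, while the second carries a dependency between iterations. These functions are illustrative examples, not ParaFormance output.

```cpp
#include <cstddef>
#include <vector>

// Safe to parallelise: each iteration touches only its own element,
// reading nothing written by any other iteration.
void scale(std::vector<double>& v, double k) {
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] *= k;
}

// NOT safe to parallelise as written: iteration i reads v[i - 1],
// which iteration i - 1 writes -- a loop-carried dependency that a
// safety checker must flag before any parallel refactoring is applied.
void prefix_sum(std::vector<double>& v) {
    for (std::size_t i = 1; i < v.size(); ++i)
        v[i] += v[i - 1];
}
```

Running `scale` in parallel gives the same answer as running it sequentially; running `prefix_sum` in parallel would not.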
Demonstration
Predicting Performance
Which Configuration?
Predicted vs. Actual Speedups
[Chart: speedups for denoise on 1–24 cores — predicted vs. actual for Pipe(G, D), Pipe(G, ParMap(D)) and Pipe(Farm(G), ParMap(D)).]
Predicted vs. Actual Speedups
[Charts: speedups on 1–24 cores — predicted vs. actual for denoise (Pipe(G, D), Pipe(G, ParMap(D)), Pipe(Farm(G), ParMap(D))) and for sumEuler (Farm, ParMap with 400 Chunk, Pred. Farm).]
Fig. 5. Predicted (dashed) vs. actual speedups (solid) for denoise(1024) and sumEuler(10000).
Stage 2: Chunking. While using a task farm for sumEuler creates a reasonable amount of parallelism, the parallelism is too fine-grained and the program does not scale as we expect. This is a common problem in the early stages of writing parallel programs. To combat this, we use the IntroChunkFarm rule from the Introduce Chunking refactoring. This refactoring allows us to group a number of small tasks into one larger parallel task, where each parallel thread operates over a sub-list rather than just one element. We want each worker thread to be busy, so we chunk by assigning 416 to the parameter C below, giving groups of 416 elements (10000 tasks / 24 workers). By chunking in this way, we also decrease the communication costs and reduce parallel overheads. Chunking can generally be achieved in a variety of different ways. In our example, we refactor our task farm into a map with a chunking and de-chunking stage:
sumEuler(N) -> skel:run([{map, [{seq, fun ?MODULE:euler/1}],
                                fun ?MODULE:partition/1,
                                fun ?MODULE:combine/1}],
                        chunk(List, C, lists:length(List))),
The refactoring also introduces new functions, combine and partition:
partition(X) -> X.
combine([]) -> [];
combine([X|Xs]) -> lists:append(X, combine(Xs)).
5.3 Predicted versus Actual times
All measurements were made on an 800 MHz, 24-core, dual AMD Opteron 6176 architecture, running CentOS Linux 2.6.18-274.el5 and Erlang 5.9.1 R15B01, averaging over 10 runs. Figure 5, left, compares the predicted (dashed) speedups against the actual (solid) speedups. The overall predicted speedup for denoise on 24 cores for the Pipe(Farm(G), ParMap(D)) version is 19.09.
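A minimal sketch of how such a prediction can be computed is given below. This is an illustrative assumption, not the paper's actual cost model: a farm or parallel map stage with n workers is assumed to take 1/n of its sequential time, and a pipeline runs at the speed of its slowest stage.

```cpp
#include <algorithm>

// Minimal skeleton cost model -- an illustrative assumption, not the
// paper's actual model. A map/farm stage with n workers is assumed to
// take t/n, and a pipeline runs at the speed of its slowest stage.
double stage_time(double t_seq, int workers) {
    return t_seq / workers;
}

double pipe_service_time(double t1, double t2) {
    return std::max(t1, t2);
}

// Predicted speedup of Pipe(Farm(G), ParMap(D)) over sequential G;D,
// with nG workers in the farm and nD workers in the parallel map.
double predicted_speedup(double tG, double tD, int nG, int nD) {
    double seq = tG + tD;
    double par = pipe_service_time(stage_time(tG, nG), stage_time(tD, nD));
    return seq / par;
}
```

Under this simplified model, two equally sized stages with two workers each would predict a speedup of 4; real predictions fall short of such ideals because of communication and setup overheads, which is why the measured 24-core speedup stays below 24.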
Wrap up
Comparison of Development Times
Use case          Manual Time   Refactoring Time
Convolution       24 hours      3 hours
Ant Colony        8 hours       1 hour
BasicN2           40 hours      5 hours
Graphical Lasso   15 hours      2 hours
Image Convolution
[Charts: Image Convolution speedups vs. number of threads on titanic (1–24), xookik (1–12) and power8 (1–160); legend: OpenMP, TBB and FF (FastFlow), each as (s | m), (m | m) and m.]
Fig. 10. Image Convolution speedups on titanic, xookik and power. Here, | is a parallel pipeline, m is a parallel map and s is a sequential stage.
1 for (j = 0; j < num_iter; j++) {
2   for (i = 0; i < num_ants; i++)
3     cost[i] = solve(i, p, d, w, t);
4   best_t = pick_best(&best_result);
5   for (i = 0; i < n; i++)
6     t[i] = update(i, best_t, best_result);
7 }
Since pick_best on Line 4 cannot start until all of the ants have computed their solutions, and the for loop that updates t cannot start until pick_best finishes, we have implicit ordering in the code above. Therefore, the structure can be described in RPL as:

seq (solve) ; pick_best ; seq (update)
where ; denotes the ordering between computations. Due to the ordering between solve, pick_best and update, the only way to parallelise the sequential code is to convert seq (solve) and/or seq (update) into maps. Therefore, the possible parallelisations are:

1) map (solve) ; pick_best ; update
2) solve ; pick_best ; map (update)
3) map (solve) ; pick_best ; map (update)
Since solve dominates the computing time, we consider only parallelisations 1) and 3). Speedups for these two parallelisations, on titanic, xookik and power, with a varying number of CPU threads, are given in Figure 11. In the figure, we denote a map by m and a sequential stage by s; therefore, (m ; s ; s) denotes that solve is a parallel map while pick_best and update are sequential. From Figure 11, we can observe (similarly to the Image Convolution example) that speedups are similar for all parallel libraries. The only exception is FastFlow on power, which gives slightly better speedup than the other libraries. Furthermore, the two parallelisations give approximately the same speedups, with the (m ; s ; m) parallelisation using more resources (threads) altogether. This indicates that it is not always the best idea to parallelise everything that can be parallelised. Finally, we can note that none of the libraries achieves linear speedup, and on each system the speedups tail off after a certain number of threads. This is due to the fact that a lot of data is shared between threads, and data access is slower for cores that are farther from the data. The maximum speedups achieved are 12, 11 and 16 on titanic, xookik and power, respectively.
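Parallelisation 3), map (solve) ; pick_best ; map (update), can be sketched in plain C++ as follows. This is a sketch built on std::async, not the output of any of the libraries measured above; the cost and update formulas are hypothetical placeholders for the real ant-colony functions.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Generic parallel map via std::async: one possible concrete form of
// the "map" pattern in the text (a sketch, not actual tool output).
template <typename F>
std::vector<double> par_map(std::size_t n, F f) {
    std::vector<std::future<double>> futs;
    futs.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        futs.push_back(std::async(std::launch::async, f, i));
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = futs[i].get();
    return out;
}

// Parallelisation 3): map (solve) ; pick_best ; map (update).
// The formulas below are hypothetical placeholders for the real
// ant-colony solve/update functions.
std::vector<double> ant_iteration(std::size_t num_ants, std::size_t n) {
    auto cost = par_map(num_ants, [](std::size_t i) {          // map (solve)
        return 1.0 + static_cast<double>(i);
    });
    double best = *std::min_element(cost.begin(), cost.end()); // pick_best
    return par_map(n, [best](std::size_t i) {                  // map (update)
        return best * static_cast<double>(i + 1);
    });
}
```

The sequential pick_best between the two parallel maps is exactly the ordering constraint described above: it acts as a barrier, so the second map cannot overlap with the first.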
VI. RELATED WORK
Early work in refactoring is described in [22]; a good survey (as of 2004) can be found in [17]. There has so far been only a limited amount of work on refactoring for parallelism [4]. In [5], a parallel refactoring methodology for Erlang programs, including a refactoring tool, is introduced for skeletons in Erlang. Unlike the work presented here, the technique is limited to Erlang and does not evaluate reductions in development time. Other work on parallel refactoring has mostly considered loop parallelisation in Fortran [21] and Java [9]. However, these approaches are limited to concrete and simple structural changes (e.g. loop unrolling).
Parallel design patterns are provided as algorithmic skeletons in a number of different parallel programming frameworks [11], and several authors advocated the massive usage of patterns for writing parallel applications [16], [20] after the well-known Berkeley report [6] indicated parallel design patterns as a viable way to solve the problems related to the development of parallel applications with traditional (low-level) parallel programming frameworks.
In the algorithmic skeleton research frameworks, there is a lot of work on improving extra-functional features of parallel programs by using pattern rewriting rules [15], [2], [13]. We use these rules to support design space exploration in our system. Other authors use rewriting/refactoring to support efficient code generation from skeletons/patterns [12], a concept similar to our approach. Finally, [14] proposed a "parallel" embedded DSL exploiting annotations, unlike the external DSL we use here. Other authors have proposed DSL approaches to parallel programming [18], [25] similar to what we propose here, although the DSLs proposed are embedded and mostly aim at targeting heterogeneous CPU/GPU hardware.
VII. CONCLUSIONS AND FUTURE WORK
In this paper, we presented a high-level domain-specific language, the Refactoring Pattern Language (RPL), that can be used to concisely and efficiently capture parallel patterns, and therefore describe the parallel structure of an application. RPL can
Image Convolution
Image Convolution – 20 Cores!
0 1 2 4 6 8 10 12 14 16 18 20 22 24
1
4
8
12
16
20
24
Nr. threads
Spee
dup
Speedups for Image Convolution on titanic
0 1 2 4 6 8 10 12
1
4
8
12
Nr. threads
Speedups for Image Convolution on xookik
0 20 40 60 80 100 120 140 160
1
4
8
12
16
20
24
28
32
36
Nr. threads
Speedups for Image Convolution on power8
OpenMP (s | m)
OpenMP (m | m)OpenMP m
TBB (s | m)
TBB (m | m)TBB m
FF (s | m)
FF (m | m)FF m
Fig. 10. Image Convolution speedups on titanic, xookik and power. Here, | is a parallel pipeline, m is a parallel map and s is a sequential stage.
1 for (j=0; j<num iter; j++) {2 for (i=0; i<num ants; i++)3 cost[i ] = solve (i, p,d,w,t);4 best t = pick best(&best result);5 for (i=0; i<n; i++)6 t [i ] = update(i, best t, best result);7 }
Since pick best in Line 4 cannot start until all of the ants havecomputed their solutions, and the for loop that updates t cannotstart until pick best finishes, we have implicit ordering in thecode above. Therefore, the structure can be described in theRPL with:
seq (solve) ; pick best ; seq (update)
where ; denotes the ordering between computations. Due to anordering between solve, pick best and update, the only way toparallelise the sequential code is to convert seq (solve) and/orseq (update) into maps. Therefore, the possible parallelisationsare:
1) map (solve) ; pick best ; update2) solve ; pick best ; map (update)3) map (solve) ; pick best ; map (update)
Since solve dominates computing time, we are going to con-sider only parallelisations 1) and 3). Speedups for these twoparallelisations, on titanic, xookik and power, with a varyingnumber of CPU threads used, are given in Figure 11. In thefigure, we denote map by m and a sequential stage by s.Therefore, (m ; s ; s) denotes that solve is a parallel map,pick best is sequential and update is also sequential. From theFigure 11, we can observe (similarly to the Image Convolutionexample) that speedups are similar for all parallel libraries.The only exception is Fastflow on power, which gives slightlybetter speedup than other libraries. Furthermore, both differentparallelisations give approximately the same speedups, withthe (m ; s ; m) parallelisation using more resources (threads)altogether. This indicates that it is not always the best ideato parallelise everything that can be parallelised. Finally, wecan note that none of the libraries is able to achieve linearspeedups, and on each system speedups tail off after certainnumber of threads is used. This is due to a fact that a lot ofdata is shared between threads and data-access is slower for
cores that are farther from the data. The maximum speedupsachieved are 12, 11 and 16 on titanic, xookik and power,respectively.
VI. RELATED WORK
Early work in refactoring has been described in [22]. A goodsurvey (as of 2004) can be found in [17]. There has sofar been only a limited amount of work on refactoring forparallelism [4]. In [5], a parallel refactoring methodology forErlang programs, including a refactoring tool, is introducedfor Skeletons in Erlang. Unlike the work presented here, thetechnique is limited to Erlang and does not evaluate reductionsin development time. Other work on parallel refactoring hasmostly considered loop parallelisation in Fortran [21] and Java[9]. However, these approaches are limited to concrete andsimple structural changes (e.g. loop unrolling).
Parallel design patterns are provided as algorithmic skele-tons in a number of different parallel programming frameworks[11] and several different authors advocated the massive usageof patterns for writing parallel applications [16], [20] afterthe well-known Berkeley report [6] indicated parallel designpatterns as a viable way to solve the problems related tothe development of parallel applications with traditional (lowlevel) parallel programming frameworks.
In the algorithmic skeleton research frameworks, thereis a lot of work on improving extra-functional features ofparallel programs by using pattern rewriting rules [15], [2],[13]. We use these rules to support design space explorationin our system. Other authors use rewriting/refactoring to sup-port efficient code generation from skeletons/patterns [12],which is a similar concept w.r.t. our approach. Finally, [14]proposed a “parallel” embedded DSL exploiting annotations,differently from what we use here, which is an external DSL.Other authors proposed to adopt DSL approaches to parallelprogramming [18], [25] similarly to what we propose here,although the DSL proposed is an embedded DSL and mostlyaims at targeting heterogenous CPU/GPU hardware.
VII. CONCLUSIONS AND FUTURE WORK
In this paper, we presented a high-level domain-specific lan-guage, Refactoring Pattern Language (RPL), that can be usedto concisely and efficiently capture parallel patterns, and there-fore describe the parallel structure of an application. RPL can
0 1 2 4 6 8 10 12 14 16 18 20 22 24
1
4
8
12
16
20
24
Nr. threads
Spee
dup
Speedups for Image Convolution on titanic
0 1 2 4 6 8 10 12
1
4
8
12
Nr. threads
Speedups for Image Convolution on xookik
0 20 40 60 80 100 120 140 160
1
4
8
12
16
20
24
28
32
36
Nr. threads
Speedups for Image Convolution on power8
OpenMP (s | m)
OpenMP (m | m)OpenMP m
TBB (s | m)
TBB (m | m)TBB m
FF (s | m)
FF (m | m)FF m
Fig. 10. Image Convolution speedups on titanic, xookik and power. Here, | is a parallel pipeline, m is a parallel map and s is a sequential stage.
1 for (j=0; j<num iter; j++) {2 for (i=0; i<num ants; i++)3 cost[i ] = solve (i, p,d,w,t);4 best t = pick best(&best result);5 for (i=0; i<n; i++)6 t [i ] = update(i, best t, best result);7 }
Since pick best in Line 4 cannot start until all of the ants havecomputed their solutions, and the for loop that updates t cannotstart until pick best finishes, we have implicit ordering in thecode above. Therefore, the structure can be described in theRPL with:
seq (solve) ; pick best ; seq (update)
where ; denotes the ordering between computations. Due to anordering between solve, pick best and update, the only way toparallelise the sequential code is to convert seq (solve) and/orseq (update) into maps. Therefore, the possible parallelisationsare:
1) map (solve) ; pick best ; update2) solve ; pick best ; map (update)3) map (solve) ; pick best ; map (update)
Since solve dominates computing time, we are going to con-sider only parallelisations 1) and 3). Speedups for these twoparallelisations, on titanic, xookik and power, with a varyingnumber of CPU threads used, are given in Figure 11. In thefigure, we denote map by m and a sequential stage by s.Therefore, (m ; s ; s) denotes that solve is a parallel map,pick best is sequential and update is also sequential. From theFigure 11, we can observe (similarly to the Image Convolutionexample) that speedups are similar for all parallel libraries.The only exception is Fastflow on power, which gives slightlybetter speedup than other libraries. Furthermore, both differentparallelisations give approximately the same speedups, withthe (m ; s ; m) parallelisation using more resources (threads)altogether. This indicates that it is not always the best ideato parallelise everything that can be parallelised. Finally, wecan note that none of the libraries is able to achieve linearspeedups, and on each system speedups tail off after certainnumber of threads is used. This is due to a fact that a lot ofdata is shared between threads and data-access is slower for
cores that are farther from the data. The maximum speedupsachieved are 12, 11 and 16 on titanic, xookik and power,respectively.
VI. RELATED WORK
Early work in refactoring has been described in [22]. A goodsurvey (as of 2004) can be found in [17]. There has sofar been only a limited amount of work on refactoring forparallelism [4]. In [5], a parallel refactoring methodology forErlang programs, including a refactoring tool, is introducedfor Skeletons in Erlang. Unlike the work presented here, thetechnique is limited to Erlang and does not evaluate reductionsin development time. Other work on parallel refactoring hasmostly considered loop parallelisation in Fortran [21] and Java[9]. However, these approaches are limited to concrete andsimple structural changes (e.g. loop unrolling).
Parallel design patterns are provided as algorithmic skele-tons in a number of different parallel programming frameworks[11] and several different authors advocated the massive usageof patterns for writing parallel applications [16], [20] afterthe well-known Berkeley report [6] indicated parallel designpatterns as a viable way to solve the problems related tothe development of parallel applications with traditional (lowlevel) parallel programming frameworks.
In the algorithmic skeleton research frameworks, thereis a lot of work on improving extra-functional features ofparallel programs by using pattern rewriting rules [15], [2],[13]. We use these rules to support design space explorationin our system. Other authors use rewriting/refactoring to sup-port efficient code generation from skeletons/patterns [12],which is a similar concept w.r.t. our approach. Finally, [14]proposed a “parallel” embedded DSL exploiting annotations,differently from what we use here, which is an external DSL.Other authors proposed to adopt DSL approaches to parallelprogramming [18], [25] similarly to what we propose here,although the DSL proposed is an embedded DSL and mostlyaims at targeting heterogenous CPU/GPU hardware.
VII. CONCLUSIONS AND FUTURE WORK
In this paper, we presented a high-level domain-specific language, the Refactoring Pattern Language (RPL), that can be used to concisely and efficiently capture parallel patterns, and therefore describe the parallel structure of an application. RPL can
Comparable Performance

15

[Figure: left, speedups for the Convolution use case (a two-stage pipeline), plotted against the number of second-stage workers with 1, 2, 4, 6, 8 or 10 first-stage workers, on up to 16 workers; right, speedups for Ant Colony Optimisation, BasicN2 and Graphical Lasso, refactored vs. manual versions, on up to 24 workers.]

Figure 3. Refactored Use Case Results in FastFlow
code and simply points the refactoring tool towards them. The actual parallelisation is then performed by the refactoring tool, supervised by the programmer. This can give significant savings in effort, of about one order of magnitude. This is achieved without major performance losses: as desired, the speedups achieved with the refactoring tool are approximately the same as for full-scale manual implementations by an expert. In future, we expect to develop this work in a number of new directions, including adding advanced performance models to the refactoring process, thus allowing the user to accurately predict the parallel performance of applying a particular refactoring with a specified number of threads. This may be particularly useful when porting applications to different architectures, including adding refactoring support for GPU programming in OpenCL. Also, once sufficient automation of the refactoring tool is achieved, the best parametrisation with respect to parallel efficiency can be determined via optimisation, further facilitating this approach. In addition, we also plan to implement more skeletons, particularly in the fields of computer algebra and physics, and to demonstrate the refactoring approach with these new skeletons on a wide range of realistic applications. This will add to the evidence that our approach is general, usable and scalable. Finally, we intend to investigate the limits of scalability that we have observed for some of our use cases, aiming to determine whether the limits are hardware artefacts or algorithmic.
REFERENCES
[1] M. Aldinucci, M. Danelutto, P. Kilpatrick and M. Torquati. FastFlow: High-Level and Efficient Streaming on Multi-Core. In Programming Multi-core and Many-core Computing Systems, Parallel and Distributed Computing, Chap. 13. Wiley, 2013.
[2] M. P. Allen. Introduction to Molecular Dynamics Simulation. Computational Soft Matter: From Synthetic Polymers to Proteins, 23:1–28, 2004.
[3] M. den Besten, T. Stuetzle and M. Dorigo. Ant Colony Optimization for the Total Weighted Tardiness Problem. PPSN 6, pp. 611–620, Sept. 2000.
[4] C. Brown, K. Hammond, M. Danelutto, P. Kilpatrick and A. Elliott. Cost-Directed Refactoring for Parallel Erlang Programs. International Journal of Parallel Programming, HLPP 2013 Special Issue. Springer, Paris, September 2013. DOI 10.1007/s10766-013-0266-5.
[5] C. Brown, K. Hammond, M. Danelutto and P. Kilpatrick. A Language-Independent Parallel Refactoring Framework. In Proc. of the Fifth Workshop on Refactoring Tools (WRT '12), pages 54–58. ACM, New York, USA, 2012.
[6] C. Brown, H. Li and S. Thompson. An Expression Processor: A Case Study in Refactoring Haskell Programs. Eleventh Symp. on Trends in Func. Prog., May 2010.
[7] C. Brown, H. Loidl and K. Hammond. Paraforming: Forming Haskell Programs using Novel Refactoring Techniques. 12th Symp. on Trends in Func. Prog., Spain, May 2011.
[8] C. Brown, K. Hammond, M. Danelutto, P. Kilpatrick, H. Schöner and T. Breddin. Paraphrasing: Generating Parallel Programs Using Refactoring. In 10th International Symposium, FMCO 2011, Turin, Italy, October 3–5, 2011, Revised Selected Papers, pages 237–256. Springer, Berlin/Heidelberg.
[9] R. M. Burstall and J. Darlington. A Transformation System for Developing Recursive Programs. J. of the ACM, 24(1):44–67, 1977.
[10] M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computations. Research Monographs in Par. and Distrib. Computing. Pitman, 1989.
[11] M. Cole. Bringing Skeletons out of the Closet: A Pragmatic Manifesto for Skeletal Parallel Programming. Par. Computing, 30(3):389–406, 2004.
[12] D. Dig. A Refactoring Approach to Parallelism. IEEE Softw., 28:17–22, January 2011.
[13] J. Friedman, T. Hastie and R. Tibshirani. Sparse Inverse Covariance Estimation with the Graphical Lasso. Biostatistics, 9(3):432–441, July 2008.
[14] R. Loogen, Y. Ortega-Mallén and R. Peña-Marí. Parallel Func. Prog. in Eden. J. of Func. Prog., 15(3):431–475, 2005.
[15] T. Mens and T. Tourwé. A Survey of Software Refactoring. IEEE Trans. Softw. Eng., 30(2):126–139, 2004.
[16] H. Partsch and R. Steinbruggen. Program Transformation Systems. ACM Comput. Surv., 15(3):199–236, 1983.
[17] K. Hammond, M. Aldinucci, C. Brown, F. Cesarini, M. Danelutto, H. Gonzalez-Velez, P. Kilpatrick, R. Keller, T. Natschlager and G. Shainer. The ParaPhrase Project: Parallel Patterns for Adaptive Heterogeneous Multicore Systems. FMCO, Feb. 2012.
[18] K. Hammond, J. Berthold and R. Loogen. Automatic Skeletons in Template Haskell. Parallel Processing Letters, 13(3):413–424, September 2003.
[19] W. Opdyke. Refactoring Object-Oriented Frameworks. PhD Thesis, Dept. of Comp. Sci., University of Illinois at Urbana-Champaign, Champaign, IL, USA, 1992.
[20] T. Sheard and S. P. Jones. Template Meta-Programming for Haskell. SIGPLAN Not., 37:60–75, December 2002.
[21] D. B. Skillicorn and W. Cai. A Cost Calculus for Parallel Functional Programming. J. Parallel Distrib. Comput., 28(1):65–83, 1995.
[22] J. Wloka, M. Sridharan and F. Tip. Refactoring for Reentrancy. In ESEC/FSE '09, pages 173–182, Amsterdam, 2009. ACM.
Road Map – Watch this space!
16
[Timeline: First Version Released, May 2017 → Now]
Conclusions
§ Refactoring tool support:
  § Guides a programmer through steps to achieve parallelism
  § Warns the user if they are going wrong
  § Avoids common pitfalls
  § Helps with understanding and intuition
  § Reduces amount of boilerplate code
§ Allows the programmer to concentrate on the algorithm, rather than the parallelism.
17
ParaFormance™ Needs You!
• Please join our mailing list and help grow our user community
  § news items
  § access to free development software
  § chat to the developers
  § free developer workshops
  § bug tracking and fixing
  § sign up for tool demos/trials
• Subscribe at https://mailman.cs.st-andrews.ac.uk/mailman/listinfo/paraformance-news

18
Q&A
19
THANK YOU!
http://rephrase-ict.eu
@rephrase_eu
20