Parallel Programming languages on heterogeneous architectures4 .docx
7/25/2019 Parallel Programming languages on heterogeneous architectures4 .docx
1/26
Title
PARALLEL PROGRAMMING LANGUAGES ON HETEROGENEOUS ARCHITECTURES USING OPENMPC, OMPSS, OPENACC AND OPENMP
Título
LENGUAJES PARA PROGRAMACIÓN PARALELA EN ARQUITECTURAS HETEROGÉNEAS UTILIZANDO OPENMPC, OMPSS, OPENACC Y OPENMP
Authors
Esteban Hernández B.
Network engineer with a Master's degree in software engineering and free software construction, and minor degrees in applied mathematics and network software construction. He is currently pursuing doctoral studies in Engineering at Universidad Distrital and works as principal architect at NT. His work focuses on parallel programming, high performance computing, computational numerical simulation and numerical weather forecasting.
Contact: [email protected]
Gerardo de Jesús Montoya Gaviria
Engineer in meteorology from the University of Leningrad, with a doctorate in physical-mathematical sciences from Moscow State University. A pioneer in the area of meteorology in Colombia, in charge of the meteorology graduate program at the National University of Colombia, researcher and director of more than 12 graduate theses in the areas of dynamic meteorology, numerical forecasting, air quality, and the efficient use of climate and weather models. He is currently a full professor in the Faculty of Geosciences at the National University of Colombia.
Contact: [email protected]
Carlos Enrique Montenegro
Systems engineer with PhD and Master's degrees in informatics, director of the research group GIIRA, focused on social network analysis, e-learning and data visualization. He is currently an associate professor in the Engineering Faculty at Universidad Distrital.
Contact: [email protected]
Abstract: The field of parallel programming has seen a major new player arrive in the last 10 years. GPUs have taken on a relevant role in scientific computing because they offer high computing performance, low cost and simplicity of implementation. However, one of the most important challenges they pose is the programming languages used for these devices; the effort of recoding algorithms originally designed for CPUs is a significant problem. In this article we review three of the principal frameworks for programming CUDA devices and compare them with the new directives introduced in the OpenMP 4 standard, solving the Jacobi iterative method (Meijerink & Vorst, 1977; Wang, n.d.).
Resumen: En el campo de la programación paralela ha arribado un nuevo gran jugador en los últimos 10 años. Las GPUs han tomado una importancia relevante en la computación científica debido a que ofrecen alto rendimiento computacional, bajo costo y simplicidad de implementación; sin embargo, uno de los desafíos más grandes que poseen son los lenguajes utilizados para la programación de los dispositivos. El esfuerzo de reescribir algoritmos diseñados originalmente para CPUs es uno de los problemas más relevantes. En este artículo se revisan tres frameworks de programación para la tecnología CUDA y se realiza una comparación con el reciente estándar OpenMP versión 4, resolviendo el método iterativo de Jacobi (Meijerink & Vorst, 1977; Wang, n.d.).
Keywords: CUDA, Jacobi method, OmpSs, OpenACC, OpenMPC, OpenMP, Parallel Programming
Palabras clave: Método de Jacobi, OmpSs, OpenACC, OpenMP, Programación paralela.
I. INTRODUCTION
Over the last 10 years, massively parallel processors have used GPUs as the principal element of a new approach to parallel programming: the GPU has evolved from a graphics-specific accelerator into a general-purpose computing device, and this period can be considered the era of the GPU (Nickolls & Dally, 2010). However, the principal obstacle to broad adoption in the programmer community has been the lack of standards that allow programming the different existing hardware solutions in a unified form (Nickolls & Dally, 2010). The most important player in GPU solutions is Nvidia®, with the CUDA® programming language and its own compiler, nvcc (Hill & Marty, 2008), with thousands of installed solutions represented in the Top500 supercomputer list; portability, however, is the principal problem. Some community projects and hardware alliances have proposed solutions to resolve this issue: OmpSs, OpenACC and OpenMPC have emerged as the most promising ones (Vetter, 2012), all building on the OpenMP base model. In the last year, the OpenMP board released version 4 of the standard, which adds support for accelerator devices.
A. OmpSs
OmpSs (a programming model from the Barcelona Supercomputing Center based on OpenMP and StarSs) is a framework focused on the task-decomposition paradigm for developing parallel applications on cluster environments with heterogeneous architectures. It provides a set of compiler directives that can be used to annotate a sequential code, and additional features have been added to support the use of accelerators like GPUs. OmpSs is based on StarSs, a task-based programming model, and works by annotating a serial application with directives that are translated by the compiler. With it, the same program that runs sequentially in a node with a single GPU can run in parallel on multiple GPUs, either local (single node) or remote (cluster of GPUs). Besides performing a task-based parallelization, the runtime system moves the data as needed between the different nodes and GPUs, minimizing the impact of communication by using affinity scheduling, caching, and by overlapping communication with the computational tasks.
OmpSs is based on the OpenMP programming model with modifications to its execution and memory model. It also provides some extensions for synchronization, data motion and heterogeneity support.
1) Execution model: OmpSs uses a thread-pool execution model instead of the traditional OpenMP fork-join model. The master thread starts the execution and all other threads cooperate executing the work it creates (whether it comes from work-sharing or task constructs). Therefore, there is no need for a parallel region. Nesting of constructs allows other threads to generate work as well (Figure 1).
2) Memory model: OmpSs assumes that multiple address spaces may exist. As such, shared data may reside in memory locations that are not directly accessible from some of the computational resources. Therefore, all parallel code can only safely access private data and shared data which has been marked explicitly with the extended syntax. This assumption holds even for SMP machines, as the implementation may reallocate shared data to improve memory accesses (e.g., NUMA).
3) Extensions:
Function tasks: OmpSs allows annotating function declarations or definitions with a task directive, as in Cilk (Duran, Perez, Ayguadé, Badia, & Labarta, 2008). In this case, any call to the function creates a new task that will execute the function body. The data environment of the task is captured from the function arguments.
Dependency synchronization: OmpSs integrates the StarSs dependence support (Duran et al., 2008). It allows annotating tasks with three clauses: input, output and inout. They express, respectively, that a given task depends on some data produced before, that it will produce some data, or both. The syntax of the clauses allows specifying scalars, arrays, pointers and pointed data.
Figure 1. OmpSs execution model
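As a concrete illustration of the function-task and dependence clauses just described, the following C sketch annotates two functions OmpSs-style. The clause spellings (input/output) and the array-section syntax follow the description above and should be read as an illustrative assumption rather than verified OmpSs syntax; a plain C compiler simply ignores the pragmas, so the fragment also runs sequentially.

```c
#include <stdio.h>

/* Function task: any call creates a task; the data environment
   comes from the arguments (array sections are OmpSs-style). */
#pragma omp task input(a[0;n], b[0;n]) output(c[0;n])
void vec_add(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

#pragma omp task inout(c[0;n])
void vec_scale(float *c, float s, int n) {
    for (int i = 0; i < n; i++)
        c[i] *= s;
}

void demo(void) {
    float a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1}, c[4];
    vec_add(a, b, c, 4);   /* task producing c */
    vec_scale(c, 0.5f, 4); /* runtime orders it after vec_add via the c dependence */
#pragma omp taskwait
    for (int i = 0; i < 4; i++)
        printf("%.1f ", c[i]); /* prints: 2.5 2.5 2.5 2.5 */
}
```

Under the OmpSs runtime the two calls become tasks ordered by the dependence on c, with no explicit parallel region, matching the thread-pool execution model described above.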
B. OpenACC
OpenACC is an industry standard for heterogeneous computing proposed at the Supercomputing Conference 2011. OpenACC follows the OpenMP approach of annotating sequential code with compiler directives (pragmas), indicating those regions of code susceptible to being executed on the GPU.
The execution model targeted by OpenACC API-enabled implementations is host-directed execution with an attached accelerator device, such as a GPU. Much of a user application executes on the host. Compute-intensive regions are offloaded to the accelerator device under control of the host. The device executes parallel regions, which typically contain work-sharing loops, or kernels regions, which typically contain one or more loops which are executed as kernels on the accelerator. Even in accelerator-targeted regions, the host may orchestrate the execution by allocating memory on the accelerator device, initiating data transfer, sending the code to the accelerator, passing arguments to the compute region, queuing the device code, waiting for completion, transferring results back to the host, and deallocating memory (Figure 2). In most cases, the host can queue a sequence of operations to be executed on the device, one after the other (Wolfe, 2013).
The actual problems with OpenACC are its support for only the fork-join model and the fact that only commercial compilers support its directives (PGI, Cray and CAPS) (Wolfe, 2013; Reyes, López, Fumero, & Sande, 2012). In the last year only one open-source implementation, accULL, has appeared (Reyes & López-Rodríguez, 2012).
Figure 2. OpenACC execution model
C. OpenMPC
OpenMPC (OpenMP extended for CUDA) is a framework that hides the complexity of the programming and memory model from the user (Lee & Eigenmann, 2010). OpenMPC consists of a standard OpenMP API plus a new set of directives and environment variables to control important CUDA-related parameters and optimizations.
OpenMPC addresses two important issues in GPGPU programming: programmability and tunability. As a front-end programming model, OpenMPC provides programmers with abstractions of the complex CUDA programming model and high-level controls over various optimizations and CUDA-related parameters. OpenMPC includes a fully automatic compilation and user-assisted tuning system. In addition to a range of compiler transformations and optimizations, the system includes tuning capabilities for generating, pruning, and navigating the search space of compilation variants.
OpenMPC uses the Cetus compiler (Dave, Bae, Min, & Lee, 2009) for automatic source-to-source parallelization. The C source code goes through 3 levels of analysis:
Privatization
Reduction variable recognition
Induction variable substitution
OpenMPC adds a number of pragmas to annotate OpenMP parallel regions and select optimization regions. The added pragmas have the following form: #pragma cuda <<function>>.
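A hypothetical OpenMPC-style annotation of a reduction loop might look as follows. The #pragma cuda lines follow the <<function>> form described above; the specific directive names (ainfo, gpurun) are assumptions drawn from the OpenMPC literature, not verified here, and an ordinary compiler ignores both pragma families, so the code also runs as plain OpenMP or sequential C.

```c
/* OpenMP region annotated OpenMPC-style; the cuda pragmas steer how the
   OpenMPC translator maps this loop onto a CUDA kernel (names assumed). */
float sum_squares(const float *x, int n) {
    float sum = 0.0f;
#pragma cuda ainfo kernelid(0) procname(sum_squares)
#pragma cuda gpurun
#pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * x[i];
    return sum;
}
```

The point of the design is visible here: the parallelism is still expressed in standard OpenMP, and the CUDA-specific tuning knobs live in separate, ignorable pragmas.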
D. OpenMP release 4
OpenMP is the most widely used framework for programming parallel software with shared memory, with support in most of the existing compilers. With the explosion of multicore and manycore systems, OpenMP gained acceptance in the parallel programming community and among hardware vendors. From its creation to version 3 the focus of the API was CPU environments, but with the introduction of GPUs and vector accelerators the new release 4 includes support for external devices (OpenMP, 2013). Historically OpenMP has supported the Single Instruction Multiple Data (SIMD) model only through the fork-join model (Figure 3), but in this new release the task-based model (Duran et al., 2008; Podobas, Brorsson, & Faxén, 2010) has been introduced to gain performance through more parallelism on external devices. The most important directives introduced are target, teams and distribute. These directives permit a group of threads to be distributed onto a special device and the result to be copied back to host memory (Figure 3).
Figure 3. OpenMP 4 execution model
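A minimal sketch of the new directives: target offloads the region to a device, teams/distribute spread the iterations over the device's thread groups, and map() describes the copies between host and device memory. Without an offloading compiler the region simply runs on the host.

```c
/* SAXPY offloaded with the OpenMP 4 device constructs: x is copied to the
   device, y is copied in and back out (tofrom), matching Figure 3. */
void saxpy_target(int n, float a, const float *x, float *y) {
#pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```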
III. JACOBI ITERATIVE METHOD
Iterative methods are suitable for large-scale linear equations. There are three commonly used iterative methods: Jacobi's method, the Gauss-Seidel method and SOR iterative methods (Gravvanis, Filelis-Papadopoulos, & Lipitakis, 2013; Huang, Teng, Wahid, & Ko, 2009).
The last two iterative methods converge faster than Jacobi's method but lack parallelism; they have advantages over the Jacobi method only when implemented in sequential fashion and executed on traditional CPUs. On the other hand, Jacobi's iterative method has inherent parallelism: it is suitable for implementation on CUDA or vector accelerators to run concurrently on many cores. The basic idea of the Jacobi method is to convert the system Ax = b into an equivalent fixed-point system, and then solve Equation (1) and Equation (2). For a 3x3 system:

a_{11} x_1 + a_{12} x_2 + a_{13} x_3 = b_1
a_{21} x_1 + a_{22} x_2 + a_{23} x_3 = b_2
a_{31} x_1 + a_{32} x_2 + a_{33} x_3 = b_3

solving each equation for the unknown on its diagonal gives:

x_1 = (b_1 - a_{12} x_2 - a_{13} x_3) / a_{11}
x_2 = (b_2 - a_{21} x_1 - a_{23} x_3) / a_{22}
x_3 = (b_3 - a_{31} x_1 - a_{32} x_2) / a_{33}

Equation (1)
On each iteration we solve:

x_i^{(k)} = (1 / a_{ii}) (b_i - \sum_{j \neq i} a_{ij} x_j^{(k-1)})

Equation (2)
where the values from the (k-1)-st iteration are used to compute the values for the k-th iteration. The pseudocode for the Jacobi method (Dongarra, Golub, Grosse, Moler, & Moore, 2008) is:

Choose an initial guess x^{(0)} to the solution x
for k = 1, 2, ...
    for i = 1, 2, ..., n
        sigma = 0
        for j = 1, 2, ..., i-1, i+1, ..., n
            sigma = sigma + a_{ij} x_j^{(k-1)}
        end
        x_i^{(k)} = (b_i - sigma) / a_{ii}
    end
    check convergence; continue if necessary
end

This iterative method can be implemented in parallel form, using shared or distributed memory (Margaris, Souravlas, & Roumeliotis, 201…).
Figure 4. Parallel form of the Jacobi method on shared memory (Alsemmeri, n.d.)
IV. OPENMP IMPLEMENTATION

int n = LIMIT_N;
int m = LIMIT_M;
float A[n][m];     // the 2-D matrix
float Anew[n][m];
float y_v[n];      // b vector
// fill the matrix with initial conditions
...
#pragma omp parallel for shared(m, n, Anew, A)
for (int j = 1; j < n-1; j++)
{
    for (int i = 1; i < m-1; i++)
    {
        Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                             + A[j-1][i] + A[j+1][i] );
        error = fmaxf( error, fabsf(Anew[j][i] - A[j][i]) );
    }
}
#pragma omp parallel for shared(m, n, Anew, A)
for (int j = 1; j < n-1; j++)
{
    for (int i = 1; i < m-1; i++)
    {
        A[j][i] = Anew[j][i];
    }
}
V. OPENACC IMPLEMENTATION

#pragma acc kernels
for (int j = 1; j < n-1; j++)
{
    for (int i = 1; i < m-1; i++)
    {
        A[j][i] = Anew[j][i];
    }
}
if (iter % 100 == 0) printf("%d, %0.6f\n", iter, error);
iter++;
} // ... print the result

In this implementation a new annotation appears, #pragma acc kernels, which indicates the region of code that will be executed on the accelerator. This simple annotation hides a complex CUDA kernel implementation: the copying of data from host to device and back, the definition of the grid of threads, and the data manipulation (Amorim & Haase, 2009; Sanders & Kandrot, 2011; Zhang et al., 2009).
VI. PURE CUDA IMPLEMENTATION (Wang, n.d.)

...
int dimB, dimT;
dimT = 256;
dimB = (dim / dimT) + 1;
float err = 1.0;
// set up the memory for the GPU
float *LU_d;
float *B_d;
float *diag_d;
float *X_d, *X_old_d;
float *tmp;
cudaMalloc( (void **) &B_d, sizeof(float) * dim );
cudaMalloc( (void **) &diag_d, sizeof(float) * dim );
cudaMalloc( (void **) &LU_d, sizeof(float) * dim * dim );
cudaMemcpy( LU_d, LU, sizeof(float) * dim * dim, cudaMemcpyHostToDevice );
cudaMemcpy( B_d, B, sizeof(float) * dim, cudaMemcpyHostToDevice );
cudaMemcpy( diag_d, diag, sizeof(float) * dim, cudaMemcpyHostToDevice );
cudaMalloc( (void **) &X_d, sizeof(float) * dim );
cudaMalloc( (void **) &X_old_d, sizeof(float) * dim );
cudaMalloc( (void **) &tmp, sizeof(float) * dim );
...
// call the cuda kernels
// 2. compute X from A and X_old
cudaMemcpy( X_old_d, X_old, sizeof(float) * dim, cudaMemcpyHostToDevice );
matMultVec<<<dimB, dimT>>>( LU_d, X_old_d, tmp, dim, dim ); // use X_old to compute LU * X_old, stored in tmp
substract<<<dimB, dimT>>>( B_d, tmp, X_d, dim );            // get (B - LU * X_old), stored in X_d
diaMultVec<<<dimB, dimT>>>( diag_d, X_d, dim );             // get the new X
// 3. copy the new X back to host memory
cudaMemcpy( X, X_d, sizeof(float) * dim, cudaMemcpyDeviceToHost );
// ...
...
        rowId += gridDim.x * blockDim.x;
    }
    __syncthreads();
}

// cuda kernel for vector subtraction
__global__ void substract(float *a_d,
                          float *b_d,
                          float *c_d,
                          int dim)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < dim)
    {
        c_d[tid] = a_d[tid] - b_d[tid];
        tid += gridDim.x * blockDim.x;
    }
}

// cuda kernel for the max-reduction of a vector
__global__ void VecMax(float *vec, int dim)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (dim > 1)
    {
        int mid = dim / 2;                 // get the half size
        if (tid < mid)                     // filter the active threads
        {
            if (vec[tid] < vec[tid + mid]) // take the larger of vec[tid] and vec[tid + mid]
                vec[tid] = vec[tid + mid]; // and store it in vec[tid]
        }
        // deal with the odd case
        if (dim % 2)                       // if dim is odd, care about the last element
        {
            if (tid == 0)                  // only vec[0] is compared with vec[dim-1]
            {
                if (vec[tid] < vec[dim - 1])
                    vec[tid] = vec[dim - 1];
            }
        }
        __syncthreads();                   // sync all threads
        dim /= 2;                          // halve the vector size
    }
}
The effort of writing three kernels, managing the logic of the grid dimensions, copying data from host to GPU and GPU to host, synchronizing the threads within the thread groups on the GPU, and handling details such as the thread-ID calculation demands deep knowledge of the hardware devices and a high programming effort.
VII. TEST TECHNIQUE
We take the three implementations of the Jacobi method (pure OpenMP, OpenACC, OpenMPC), run them on the SUT (System Under Test) of Table 1, and make multiple runs with square matrices of incremental sizes (Table 2), taking the processing time to analyze performance (Kim, n.d.; Sun & Gustafson, 1991) against the number of code lines needed for the algorithm.

Table 1. Characteristics of the System under Test
System:  Supermicro SYS-1027GR-TRF
CPU:     Intel® Xeon® 10-core E5-2680 V2 @ 2.80 GHz
Memory:  32 GB DDR3 1600 MHz
GPU:     NVIDIA Kepler K…
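The measurement procedure described above can be sketched as a small timing harness; time_sweep is a hypothetical helper name and the zero-initialized grid is illustrative only.

```c
#include <stdlib.h>
#include <time.h>

/* Time one Jacobi sweep over an n x n grid and return elapsed CPU seconds. */
double time_sweep(int n) {
    float *A = calloc((size_t)n * n, sizeof *A);
    float *Anew = calloc((size_t)n * n, sizeof *Anew);
    if (!A || !Anew) { free(A); free(Anew); return -1.0; }
    clock_t t0 = clock();
    for (int j = 1; j < n - 1; j++)
        for (int i = 1; i < n - 1; i++)
            Anew[j*n + i] = 0.25f * (A[j*n + i+1] + A[j*n + i-1] +
                                     A[(j-1)*n + i] + A[(j+1)*n + i]);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    free(A); free(Anew);
    return secs;
}

/* usage: for (int n = 256; n <= 4096; n *= 2) printf("%d %f\n", n, time_sweep(n)); */
```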
IX. CONCLUSION
The new devices for high performance computing need standardized methods and languages that permit interoperability, easy programming and well-defined interfaces for integrating the data interchange between the memory segments of the processors (CPUs) and the devices. It is also necessary for languages to support working with two or more devices in parallel, using the same code while running highly parallel segments in automatic form. OpenMP is the de facto standard for the shared-memory programming model, but its support for heterogeneous devices (GPUs, accelerators, FPGAs, etc.) is at a very early stage; the new frameworks and industrial APIs need help to grow and to mature the standard.
X. FINANCING
This research was developed with the authors' own resources, within the doctoral thesis process at Universidad Distrital.
XI. FUTURE WORKS
It is necessary to analyze the impact of these frameworks with multiple devices on the same host, for managing problems of memory locality, unified memory between CPU and devices, and I/O operations from the devices, using the most recent version of each standard framework. With the support for GPU devices in the following version of GCC 5, a compiler performance comparison between commercial and standard open-source solutions becomes possible. It is also necessary to review the impact of using clusters with multiple devices, message passing and MPI integration for high performance computing using the language extensions (Schaa & Kaeli, 2009).
REFERENCES
Dannert, T., Marek, A., & Rampp, M. (2013). Porting large HPC applications to GPU clusters: The codes GENE and VERTEX. Retrieved from http://arxiv.org/abs/1310.1…
Dongarra, J., Golub, G. H., Grosse, E., Moler, C., & Moore, K. (2008). Netlib and NA-Net: Building a scientific computing community. IEEE Annals of the History of Computing, 30(2).
Gupta, S., Xiang, P., & Zhou, H. (2013). Analyzing locality of memory references in GPU architectures. Proceedings of the ACM SIGPLAN Workshop on Memory Systems Performance and Correctness. doi:10.11…
Reyes, R., & López-Rodríguez, I. (2012). accULL: An OpenACC implementation with CUDA and OpenCL support. Euro-Par 2012 Parallel Processing, 871–882. Retrieved from http://link.springer.com/chapter/10.1007/978-3-6…
Sanders, J., & Kandrot, E. (2011). CUDA by example: An introduction to general-purpose GPU programming. Retrieved from http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:CUDA+by+Example#1
Schaa, D., & Kaeli, D. (2009). Exploring the multiple-GPU design space. 2009 IEEE International Symposium on Parallel & Distributed Processing, 1–12. doi:10.1109/IPDPS.2009.5161068
Sun, X.-H., & Gustafson, J. (1991). Toward a better parallel performance metric. Parallel Computing, 17, 1093–1109. Retrieved from http://www.sciencedirect.com/science/article/pii/S0167819105800286
Vetter, J. (2012). On the road to exascale: Lessons from contemporary scalable GPU systems. A*CRC Workshop on Accelerator Technologies. Retrieved from http://www.acrc.a-star.edu.sg/astaratireg_2012/Proceedings/Presentation - Jeffrey Vetter.pdf
Wang, Y. (n.d.). Jacobi iteration on GPU. Retrieved September 25, 201…