PHY 604: Computational Methods in Physics and Astrophysics II
Parallel Computing
Optimization
● Getting performance out of your code means
– Picking the right algorithm
– Implementing the algorithm efficiently
● We talked a lot about picking the proper algorithm, and saw some examples of the speed-ups you can get
● For performance in the implementation:
– You need to understand a bit about how the computer's CPU works
– You may need to consider parallel methods
Modern CPU + Memory System
● Memory hierarchy
– Data is stored in main memory
– Multiple levels of cache (L3, L2, L1)
● A line of memory is moved into cache - you amortize the cost if you use all the data in the line
– Data gets to the registers in the CPU - this is where the computation takes place
● It is expensive to move data from main memory to the registers
– You need to exploit cache
– For arrays, loop over data such that you operate on elements that are adjacent in memory
Modern CPU + Memory System
● Some numbers (http://www.7-cpu.com/cpu/Haswell.html)
● Intel i7-4770 (Haswell), 3.4 GHz (Turbo Boost off), 22 nm. RAM: 32 GB (PC3-12800 cl11 cr2).
– L1 Data cache = 32 KB, 64 B/line, 8-way
– L1 Instruction cache = 32 KB, 64 B/line, 8-way
– L2 cache = 256 KB, 64 B/line, 8-way
– L3 cache = 8 MB, 64 B/line
● L1 Data Cache Latency = 4 cycles for simple access via pointer
● L1 Data Cache Latency = 5 cycles for access with complex address calculation (size_t n, *p; n = p[n])
● L2 Cache Latency = 12 cycles
● L3 Cache Latency = 36 cycles
● RAM Latency = 36 cycles + 57 ns
Arrays
● Row vs. column major: A(m,n)
– The first index is called the row
– The second index is called the column
– Multi-dimensional arrays are flattened into a one-dimensional sequence for storage
– Row-major (C, python): rows are stored one after the other
– Column-major (Fortran, matlab): columns are stored one after the other
● Ordering matters for:
– Passing arrays between languages
– Deciding which index to loop over first
Row major
Column major
Arrays
● This is why in Fortran, you want to loop as:
double precision :: A(M,N)

do j = 1, N
   do i = 1, M
      A(i,j) = …
   enddo
enddo
● And in C:
double A[M][N];

for (i = 0; i < M; i++) {
   for (j = 0; j < N; j++) {
      A[i][j] = …
   }
}
Arrays
● The floating point unit uses pipelining to perform operations
● It is most efficient if you can keep the pipe full - again, taking advantage of nearby data in cache
Parallel Computing
● Individual processors themselves are not necessarily getting much faster on their own (the GHz-wars are over)
– Chips are packing more processing cores into the same package
– Even your phone is likely a multicore chip
● If you don't use the other cores, then they are just "space heaters"
● Some techniques for parallelism require only simple modifications of your codes and can provide great gains on a single workstation
● There are lots of references online
– Great book: High Performance Computing by Dowd and Severance - freely available (linked to from our webpage)
– We'll use this for some background
Types of Machines
● Modern computers have multiple cores that all access the same pool of memory directly - this is a shared-memory architecture
● Supercomputers are built by connecting LOTS of nodes (each a shared-memory machine with ~4-32 cores) together with a high-speed network - this is a distributed-memory architecture
● Different parallel techniques and libraries are used for each of these paradigms:
– Shared-memory: OpenMP
– Distributed-memory: the message-passing interface (MPI)
– Offloading to accelerators: OpenACC, OpenMP, or CUDA
Moore's Law
"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year... Certainly over the short term this rate can be expected to continue, if not to increase."
– Gordon Moore, Electronics Magazine, 1965
(Steve Jurvetson/Wikipedia)
Processor Trends
Top 500 List
Amdahl's Law
● In a typical program, you will have sections of code that adapt easily to parallelism, and stuff that remains serial
– For instance: initialization may be serial and the resulting computation parallel
● Amdahl's law: the speedup attained from increasing the number of processors, N, given the fraction of the code that is parallel, P:
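In its standard form, the speedup from running on N processors is

    S(N) = 1 / [ (1 - P) + P/N ]

so even as N → ∞, the speedup is limited to 1/(1 - P).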
Amdahl's Law
(Daniels220 at English Wikipedia)
Amdahl's Law
● This seems to argue that we'd never be able to use 100,000s of processors
● However (Dowd & Severance):
– New algorithms have been designed to exploit massive parallelism
– Larger computers mean bigger problems are possible - as you increase the problem size, the fraction of the code that is serial likely decreases
Types of Parallelism
● Flynn's taxonomy classifies computer architectures
● 4 classifications: single/multiple data; single/multiple instruction
– Single instruction, single data (SISD)
● Think of a typical application on your computer - no parallelism
– Single instruction, multiple data (SIMD)
● The same instruction is performed on multiple pieces of data all at once
● Old days: vector computers; today: GPUs
– Multiple instructions, single data (MISD)
● Not very interesting...
– Multiple instructions, multiple data (MIMD)
● What we typically think of as parallel computing. The machines on the top 500 list fall into this category
(Wikipedia)
Types of Parallelism
● We can do MIMD different ways:
– Single program, multiple data
● This is what we normally do. MPI allows this
● Differs from SIMD in that general CPUs can be used, and it doesn't require direct synchronization for all tasks
(Wikipedia)
Trivially Parallel
● Sometimes our tasks are trivially parallel
– No communication is needed between processes
● Ex: ray tracing or Monte Carlo
– Each realization can do its work independently
– At the end, maybe, we need to do some simple processing of all the results
● Large data analysis
– You have a bunch of datasets and a reduction pipeline to work on them
– Use multiple processors to work on the different data files as resources become available
– Each file is processed on a single core
Trivially Parallel via Shell Script
● Ex: data analysis - launch independent jobs
● This can be done via a shell script - no libraries necessary
– Loop over files
● Run jobs until all of the processors are full
● Use lockfiles to indicate a job is running
● When resources become free, start up the next job
● Let's look at the code...
● Also see GNU parallel
How Do We Make Our Code Parallel?
● Despite your best wishes, there is no simple compiler flag "--make-this-parallel"
– You need to understand your algorithm and determine what parts are amenable to parallelism
● However... if the bulk of your work is in one specific piece (say, solving a linear system), you may get all that you need by using a library that is already parallel
– This will require minimal changes to your code
Shared Memory vs. Distributed
● Imagine that you have a single problem to solve and you want to divide the work on that problem across the available processors
● If all the cores see the same pool of memory (shared memory), then parallelism is straightforward
– Allocate a single big array for your problem
– Spawn threads: separate instances of a sequence of instructions operating
● Multiple threads operate simultaneously
– Each core/thread operates on a smaller portion of the same array, writing to the same memory
– Some intermediate variables may need to be duplicated on each thread - thread-private data
– OpenMP is the standard here
Shared Memory vs. Distributed
● Distributed computing: running on a collection of separate computers (CPU + memory, etc.) connected by a high-speed network
– Each task cannot directly see the memory of the other tasks
– You need to explicitly send messages from one machine to another over the network, exchanging the needed data
– MPI is the standard here
Shared Memory
● Nodes consist of one or more chips, each with many cores (2-16 typically)
– Everything can access the same pool of memory
(figure: a single 4-core chip and its pool of memory)
Shared Memory
● Some machines are more complex - multiple chips, each with their own pool of local memory, can talk to one another on the node
– Latency may be higher when going "off-chip"
● The best performance will require knowing your machine's architecture
(figure: two 4-core chips comprising a single node - each has its own pool of memory)
Ex: Blue Waters Machine
(Cray, Inc.)
OpenMP
● Threads are spawned as needed
● When you run the program, there is one thread - the master thread
– When you enter a parallel region, multiple threads run concurrently
(Wikipedia - OpenMP)
Parallel Computing
OpenMP "Hello World"
● OpenMP is done via directives or pragmas
– They look like comments unless you tell the compiler to interpret them
– The environment variable OMP_NUM_THREADS sets the number of threads
– Support for C, C++, and Fortran
● Hello world:
– Compile with: gfortran -o hello -fopenmp hello.f90
program hello

  !$OMP parallel
  print *, "Hello world"
  !$OMP end parallel

end program hello
C Hello World
● In C, the preprocessor is used for the pragmas
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    printf("Hello world\n");

    return 0;
}
OMP Functions
● In addition to using pragmas, there are a few functions that OpenMP provides to get the number of threads, the current thread, etc.
program hello

  use omp_lib

  print *, "outside parallel region, num threads = ", &
       omp_get_num_threads()

  !$OMP parallel
  print *, "Hello world", omp_get_thread_num()
  !$OMP end parallel

end program hello

code: hello-omp.f90
OpenMP
● Most modern compilers support OpenMP
– However, the performance across them can vary greatly
– GCC does a reasonable job; Intel is the fastest
● There is an overhead associated with spawning threads
– You may need to experiment
– Some regions of your code may not have enough work to offset the overhead
Number of Threads
● There will be a systemwide default for OMP_NUM_THREADS
● Things will still run if you use more threads than the cores available on your machine - but don't!
● Scaling: if you double the number of cores, does the code take 1/2 the time?
Aside: Stack vs. Heap
● Memory allocated at compile time is put on the stack, e.g.:
– Fortran: double precision a(1000)
– C: double a[1000]
● Stack memory has a fixed (somewhat small) size
– It's managed by the operating system
– You don't need to clean up this memory
● Dynamic allocation puts the memory on the heap (see the sketch below)
– Much bigger pool
– You are responsible for deallocating it
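A minimal sketch contrasting the two (the array sizes here are illustrative, not from the lecture):

! stack_vs_heap.f90: fixed-size arrays live on the stack; allocatables live on the heap
program stack_vs_heap
  implicit none
  double precision :: a(1000)              ! size known at compile time -- stack
  double precision, allocatable :: b(:)    ! dynamically allocated -- heap

  allocate(b(10000000))                    ! large arrays belong on the heap
  a(:) = 1.0d0
  b(:) = 2.0d0
  print *, a(1), b(1)
  deallocate(b)                            ! heap memory must be freed explicitly
end program stack_vs_heap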
Parallel Loops
● Ex: matrix-vector multiplication:
program matmul

  use omp_lib

  implicit none

  integer, parameter :: N = 50000

  double precision, allocatable :: a(:,:)
  double precision :: x(N), b(N)
  double precision :: start_omp, finish_omp

  integer :: i, j

  start_omp = omp_get_wtime()

  allocate(a(N,N))

  !$omp parallel private(i, j)
  !$omp do
  do j = 1, N
     do i = 1, N
        a(i,j) = dble(i + j)
     enddo
     x(j) = j
     b(j) = 0.0
  enddo
  !$omp end do
Parallel Loops
  ! multiply
  !$omp do
  do j = 1, N
     do i = 1, N
        b(i) = b(i) + a(i,j)*x(j)
     enddo
  enddo
  !$omp end do
  !$omp end parallel

  finish_omp = omp_get_wtime()

  print *, "execution time: ", finish_omp - start_omp

end program matmul

code: matmul.f90
Timing
● We can use the omp_get_wtime() function to get the current wallclock time (in seconds)
– This is better than, e.g., the Fortran cpu_time() intrinsic, which measures the time for all threads summed together
OMP_NUM_THREADS   run 1 time (s)   run 2 time (s)   run 3 time (s)
 1                    26.276           26.294           26.696
 2                    18.696           17.514           16.287
 4                     8.628            9.072           10.680
 8                     4.744            6.582            4.923
16                     3.066            3.146            3.111

Timings on 2x Intel Xeon Gold 5115 CPUs using gfortran, N = 50000
Loop Ordering
● This is a great example to see the effects of loop ordering - what happens if you switch the order of the loops?
Loop Parallelism
● We want to parallelize all the loops possible
– Instead of f(:,:) = 0.d0, we write out the loops and thread them
● Private data
– Inside the loop, all threads will have access to all the variables declared in the main program
– For some things, we will want a private copy on each thread. These are put in the private() clause (see the sketch below)
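A minimal sketch of a loop where a scratch variable must be private (the array and the work done here are made up for illustration):

! private_demo.f90: tmp is per-iteration scratch space, so each thread needs its own copy
program private_demo
  use omp_lib
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  double precision :: tmp, f(n)

  !$omp parallel do private(i, tmp)
  do i = 1, n
     tmp = dble(i)**2         ! thread-private scratch value
     f(i) = tmp + 1.0d0       ! f is shared; each thread writes distinct elements
  enddo
  !$omp end parallel do

  print *, f(1), f(n)
end program private_demo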
Reduction
● Suppose you are finding the minimum value of something, or summing
– The loop is spread across threads
– How do we get the data from each thread back into a single variable that all threads see?
● The reduction() clause
– Has both shared and private behaviors
– The compiler ensures that the data is synchronized at the end
Reduction
● Example of a reduction:
program reduce

  implicit none

  integer :: i
  double precision :: sum

  sum = 0.0d0

  !$omp parallel do private(i) reduction(+:sum)
  do i = 1, 10000
     sum = sum + exp(mod(dble(i), 5.0d0) - 2*mod(dble(i), 7.0d0))
  end do
  !$omp end parallel do

  print *, sum

end program reduce

Do we get the same answer when run with differing numbers of threads?
code: reduce.f90
Example: Relaxation
● In two dimensions, with Δx = Δy, the update for each cell is:
v(i,j) = 1/4 [ v(i-1,j) + v(i+1,j) + v(i,j-1) + v(i,j+1) - Δx² f(i,j) ]
– Red-black Gauss-Seidel:
● Update in place
● First update the red cells (the black cells are unchanged)
● Then update the black cells (the red cells are unchanged)
Example: Relaxation
● Let's look at the code
● All two-dimensional loops are wrapped with OpenMP directives
● We can measure the performance
– Fortran 95 has a cpu_time() intrinsic
● Be careful though - it returns the CPU time summed across all threads
– OpenMP has the omp_get_wtime() function
● This returns wallclock time
– Looking at wallclock: if we double the number of processors, we want the code to take 1/2 the wallclock time
Example: Relaxation
● Performance:
This is an example of a strong scaling test - the amount of work is held fixed as the number of cores is increased
code: relax.f90
groot w/ gfortran -Ofast

512x512
threads   wallclock time (s)
 1        1.583
 2        0.8413
 4        0.3979
 8        0.2253
16        0.1634

1024x1024
threads   wallclock time (s)
 1        7.211
 2        3.179
 4        1.717
 8        0.8832
16        0.5076
Threadsafe
● When sharing memory, you need to make sure you have private copies of any data that you are changing directly
● This applies to functions that you call in the parallel regions too!
● What if your answer changes when running with multiple threads?
– Some roundoff-level error is to be expected if sums are done in a different order
– Large differences indicate a bug - most likely something needs to be private that is not
● Unit testing
– Run with 1 and multiple threads and compare the output
Threadsafe
● Fortran:
– Common blocks are simply a list of memory spaces where data can be found. This is shared across multiple routines
● Very dangerous - if one thread updates something in a common block, every other thread sees that update
● Much safer to use arguments to share data between functions
– The save statement: the value of the data persists from one call to the next
● What if a different thread is the next to call that function - is the saved quantity the correct value? (see the sketch below)
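A minimal sketch of the danger (the function and the loop are made up for illustration): the saved counter is shared by every thread, so the unsynchronized update is a race.

! save_danger.f90: a saved local variable is shared by all threads -- not threadsafe
program save_danger
  implicit none
  integer :: i, total

  total = 0

  !$omp parallel do private(i) reduction(+:total)
  do i = 1, 1000
     total = total + counter()
  enddo
  !$omp end parallel do

  ! serially, counter() returns 1, 2, ..., 1000 and total = 500500;
  ! with several threads the saved variable may be updated simultaneously
  print *, "total = ", total

contains

  function counter() result(c)
    integer :: c
    integer, save :: ncalls = 0   ! persists across calls and is shared by all threads
    ncalls = ncalls + 1           ! unsynchronized update -- a race condition
    c = ncalls
  end function counter

end program save_danger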
Critical Sections
● Within a parallel region, sometimes you need to ensure that only one thread at a time can write to a variable
● Consider the following:
– If this is in the middle of a loop, what happens if 2 different threads meet the criteria?
– Marking this section as critical will ensure that only one thread changes things at a time
● Warning: critical sections can be VERY slow
if (a(i,j) > maxa) then
   maxa = a(i,j)
   imax = i
   jmax = j
endif
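A minimal sketch of that pattern inside a threaded loop (the array and its contents are made up for illustration); the outer test is an unsynchronized fast path, and the re-test inside the critical section is what makes the update safe:

! critical_max.f90: only one thread at a time may update the max and its location
program critical_max
  use omp_lib
  implicit none
  integer, parameter :: n = 1000
  double precision :: a(n,n), maxa
  integer :: i, j, imax, jmax

  call random_number(a)
  maxa = -1.0d0
  imax = -1
  jmax = -1

  !$omp parallel do private(i, j)
  do j = 1, n
     do i = 1, n
        if (a(i,j) > maxa) then
           !$omp critical
           if (a(i,j) > maxa) then   ! re-test: another thread may have already updated maxa
              maxa = a(i,j)
              imax = i
              jmax = j
           endif
           !$omp end critical
        endif
     enddo
  enddo
  !$omp end parallel do

  print *, "max value", maxa, "at", imax, jmax
end program critical_max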
OpenMP
● OpenMP is relatively big
Porting to OpenMP
● You can parallelize your code piece-by-piece
● Since OpenMP directives look like comments to the compiler, your old version is still there
● Generally, you are not changing any of your original code - just adding directives
More Advanced OpenMP
● The if clause tells OpenMP to parallelize only if a certain condition is met (e.g., a test of the size of an array)
● firstprivate: like private, except each copy is initialized to the value of the original variable
● schedule: affects the balance of the work distributed to the threads
OpenMP in Python
● Python enforces a "global interpreter lock" which means only one thread can talk to the interpreter at any one time
– OpenMP within pure python is not possible
● However, C (or Fortran) extensions called from python can do shared-memory parallelism
– The underlying code can do parallel OpenMP
MPI
● The Message Passing Interface (MPI) is the standard library for distributed parallel computing
– Now each core cannot directly see the others' memory
– You need to manage how the work is divided and explicitly send messages from one process to another as needed
MPI Hello World
● No longer do we simply use comments - now we call subroutines in the library:
program hello

  use mpi

  implicit none

  integer :: ierr, mype, nprocs

  call MPI_Init(ierr)
  call MPI_Comm_Rank(MPI_COMM_WORLD, mype, ierr)
  call MPI_Comm_Size(MPI_COMM_WORLD, nprocs, ierr)

  if (mype == 0) then
     print *, "Running Hello, World on ", nprocs, " processors"
  endif

  print *, "Hello World", mype

  call MPI_Finalize(ierr)

end program hello
MPI Hello World
● MPI jobs are run using a command-line tool
– usually mpirun or mpiexec
– e.g.: mpiexec -n 4 ./hello
● You need to install the MPI libraries on your machine to build and run MPI jobs
– MPICH is the most popular
– Fedora: dnf install mpich mpich-devel mpich-autoload
code: hello_mpi.f90
MPI Concepts
● A separate instance of your program is run on each processor - these are the MPI processes
– Threadsafety is not an issue here, since each instance of the program is isolated from the others
● You need to tell the library the datatype of the variable you are communicating and how big it is (the buffer size)
– Together with the address of the buffer, these specify what is being sent
● Processors can be grouped together
– Communicators label the different groups
– MPI_COMM_WORLD is the default communicator (all processes)
● Many types of operations:
– Send/receive, collective communications (broadcast, gather/scatter)
(based on Using MPI)
MPI Concepts
● There are > 100 functions
– But you can do any message-passing algorithm with only 6:
● MPI_Init
● MPI_Comm_Size
● MPI_Comm_Rank
● MPI_Send
● MPI_Recv
● MPI_Finalize
– More efficient communication can be done by using some of the more advanced functions
– System vendors will usually provide their own MPI implementation that is well-matched to their hardware
(based on Using MPI)
Ex: Computing Pi
● This is an example from Using MPI
– Compute π by doing the integral:
π = ∫₀¹ 4/(1 + x²) dx
● We will divide the interval up, so that each processor sees only a small portion of [0,1]
● Each processor computes the sum for its intervals
● We add all the integrals together at the end to get the value of the total integral
– We'll pick one processor as the I/O processor - it will communicate with us
– Let's look at the code...
(based on Using MPI)
code: pi.f90
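A minimal sketch of the idea (this is not the course's pi.f90; the interval count and the stride-based division of work are assumptions):

! pi_mpi.f90: each rank sums every nprocs-th midpoint of 4/(1+x^2) on [0,1],
! then MPI_Reduce adds the partial integrals on rank 0
program pi_mpi
  use mpi
  implicit none
  integer, parameter :: n = 100000
  integer :: ierr, mype, nprocs, i
  double precision :: h, x, local_sum, pi

  call MPI_Init(ierr)
  call MPI_Comm_Rank(MPI_COMM_WORLD, mype, ierr)
  call MPI_Comm_Size(MPI_COMM_WORLD, nprocs, ierr)

  h = 1.0d0/n
  local_sum = 0.0d0

  ! this rank handles intervals mype+1, mype+1+nprocs, mype+1+2*nprocs, ...
  do i = mype+1, n, nprocs
     x = h*(dble(i) - 0.5d0)
     local_sum = local_sum + 4.0d0/(1.0d0 + x*x)
  enddo
  local_sum = local_sum*h

  ! combine the partial integrals on the I/O processor (rank 0)
  call MPI_Reduce(local_sum, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, MPI_COMM_WORLD, ierr)

  if (mype == 0) print *, "pi = ", pi

  call MPI_Finalize(ierr)
end program pi_mpi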
Send/Receive Example
● The main idea in MPI is sending messages between processes
● MPI_Send() and MPI_Recv() pairs provide this functionality
– This is a blocking send/receive
● For the sending code, the program resumes when it is safe to reuse the buffer
● For the receiving code, the program resumes when the message has been received
– May cause network contention if the destination process is busy doing its own communication
– See Using MPI for some diagnostics on this
● There are also non-blocking sends, and sends where you explicitly attach a buffer
Send/Receive Example
● Simple example (mimics ghost cell filling)
– On each processor, allocate an integer array of 5 elements
– Fill the middle 3 with a sequence (proc 0: 0,1,2; proc 1: 3,4,5, ...)
– Send messages to fill the left and right elements with the corresponding elements from the neighboring processors
code: send_recv.f90
Send/Receive
● Good communication performance often requires staggering the communication
● A combined sendrecv() call can help avoid deadlocking
● Let's look at the same task with MPI_Sendrecv() (a sketch follows below)
code: sendrecv.f90
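A minimal sketch of that exchange with MPI_Sendrecv() (this is not the course's sendrecv.f90); using MPI_PROC_NULL for the missing neighbors turns the edge calls into no-ops:

! ghost_sendrecv.f90: the 5-element ghost cell exchange done with MPI_Sendrecv
program ghost_sendrecv
  use mpi
  implicit none
  integer :: ierr, mype, nprocs, left, right
  integer :: a(5)
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_Rank(MPI_COMM_WORLD, mype, ierr)
  call MPI_Comm_Size(MPI_COMM_WORLD, nprocs, ierr)

  ! interior data: rank 0 holds 0,1,2; rank 1 holds 3,4,5; ...
  a(:) = -1
  a(2:4) = (/ 3*mype, 3*mype + 1, 3*mype + 2 /)

  left  = mype - 1
  right = mype + 1
  if (left < 0)        left  = MPI_PROC_NULL
  if (right >= nprocs) right = MPI_PROC_NULL

  ! send my right edge to the right neighbor while receiving my left ghost cell
  call MPI_Sendrecv(a(4), 1, MPI_INTEGER, right, 0, &
                    a(1), 1, MPI_INTEGER, left,  0, &
                    MPI_COMM_WORLD, status, ierr)

  ! send my left edge to the left neighbor while receiving my right ghost cell
  call MPI_Sendrecv(a(2), 1, MPI_INTEGER, left,  1, &
                    a(5), 1, MPI_INTEGER, right, 1, &
                    MPI_COMM_WORLD, status, ierr)

  print *, "rank", mype, ":", a
  call MPI_Finalize(ierr)
end program ghost_sendrecv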
Parallel Computing
Relaxation
● Let's do the same relaxation problem, but now using MPI instead of OpenMP
– In the OpenMP version, we allocated a single array covering the entire domain, and all processors saw the whole array
– In the MPI version, each processor will allocate a smaller array, covering only a portion of the entire domain, and each will only see its part directly
Relaxation
● We will do 1-d domain decomposition
– Each processor allocates a slab that covers the full y-extent of the domain
– The width in x is nx/nprocs
● If nx is not evenly divisible, then some slabs have a width of 1 more cell
– A perimeter of 1 ghost cell surrounds each subgrid
● We will refer to a global index space [0:nx-1]×[0:ny-1]
– The memory needs are spread across all processors
– Arrays are allocated as shown below (a sketch of the slab setup follows):
f(ilo-ng:ihi+ng, jlo-ng:jhi+ng)
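A minimal sketch of how each rank could compute its slab and allocate its local array (the variable names follow the slide; the give-the-remainder-to-the-first-ranks choice is an assumption):

! decompose.f90: split the global x-index range [0:nx-1] into slabs, one per rank
program decompose
  use mpi
  implicit none
  integer, parameter :: nx = 100, ny = 64, ng = 1
  integer :: ierr, mype, nprocs
  integer :: nlocal, nleft, ilo, ihi, jlo, jhi
  double precision, allocatable :: f(:,:)

  call MPI_Init(ierr)
  call MPI_Comm_Rank(MPI_COMM_WORLD, mype, ierr)
  call MPI_Comm_Size(MPI_COMM_WORLD, nprocs, ierr)

  ! base slab width; the first nleft ranks get one extra column
  nlocal = nx/nprocs
  nleft  = mod(nx, nprocs)

  if (mype < nleft) then
     ilo = mype*(nlocal + 1)
     ihi = ilo + nlocal
  else
     ilo = nleft*(nlocal + 1) + (mype - nleft)*nlocal
     ihi = ilo + nlocal - 1
  endif

  ! each slab covers the full y-extent of the domain
  jlo = 0
  jhi = ny - 1

  ! local data plus a perimeter of ng ghost cells
  allocate(f(ilo-ng:ihi+ng, jlo-ng:jhi+ng))
  f(:,:) = 0.0d0

  print *, "rank", mype, "owns columns", ilo, "to", ihi
  deallocate(f)
  call MPI_Finalize(ierr)
end program decompose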
Relaxation
● The left set of ghost cells is filled by receiving a message from the processor (slab) to the left
Relaxation
● The right set of ghost cells is filled by receiving a message from the processor (slab) to the right
● The top and bottom ghost cells are physical boundaries
Domain Decomposition
● Generally speaking, you want to minimize the surface-to-volume ratio (this reduces communication)
Relaxation
● Most of the parallelism comes in the ghost cell filling
– Fill the left GCs by receiving data from the processor to the left
– Fill the right GCs by receiving data from the processor to the right
– Send/receive pairs - we want to try to avoid contention (this can be very tricky, and people spend a lot of time worrying about this...)
● On the physical boundaries, we simply fill as usual
● The way this is written, our relaxation routine doesn't need to do any parallelism itself - it just operates on the domain it is given
● For computing a norm, we will need to reduce the local sums across processors (see the sketch below)
● Let's look at the code...
code: relax_mpi.f90
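A minimal sketch of the norm reduction (this is not the course's relax_mpi.f90; the local data is made up): each rank sums the squares of its own portion, and MPI_Allreduce gives every rank the global sum:

! global_norm.f90: combine per-rank sums of squares into a global L2 norm
program global_norm
  use mpi
  implicit none
  integer, parameter :: nlocal = 1000
  integer :: ierr, mype, nprocs, i
  double precision :: v(nlocal), local_sum, global_sum

  call MPI_Init(ierr)
  call MPI_Comm_Rank(MPI_COMM_WORLD, mype, ierr)
  call MPI_Comm_Size(MPI_COMM_WORLD, nprocs, ierr)

  ! some local data (illustrative)
  do i = 1, nlocal
     v(i) = dble(mype*nlocal + i)
  enddo

  local_sum = sum(v**2)

  ! every rank gets the global sum of squares
  call MPI_Allreduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                     MPI_SUM, MPI_COMM_WORLD, ierr)

  if (mype == 0) print *, "global L2 norm =", sqrt(global_sum)

  call MPI_Finalize(ierr)
end program global_norm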
MPI Relaxation Results
● Note that the smaller problem sizes become work-starved more easily
Weak vs. Strong Scaling
● In assessing the parallel performance of your code, there are two methods that are commonly used
– Strong scaling: keep the problem size fixed and increase the number of processors
● Eventually you will become work-starved, and your scaling will stop (communication dominates)
– Weak scaling: increase the amount of work in proportion to the number of processors
● In this case, perfect scaling will result in the same wallclock time for all processor counts
Ex: Maestro Scaling
● Maestro is a publicly available adaptive mesh refinement low Mach number hydrodynamics code
– Models astrophysical flows
– General equation of state, reactions, implicit diffusion
– Elliptic constraint enforced via multigrid
– https://github.com/AMReX-Astro/MAESTRO
Ex: Castro Scaling
● Castro is a publicly available adaptive mesh refinement compressible radiation hydrodynamics code
– Used to model stellar explosions
– Self-gravity solved via multigrid
Debugging
● There are parallel debuggers (but these are pricey)
● It's possible to spawn multiple gdb sessions, but this gets out of hand quickly
● Print is still your friend
– Run as small a problem as possible on as few processors as necessary
● Some roundoff-level differences are to be expected from sums (different order of operations)
Hybrid Parallelism
● To get good performance on current supercomputers, you need to do hybrid parallelism:
– OpenMP within a node, MPI across nodes
● For example, in our MPI relaxation code, we could split the loops over each subdomain across the multiple cores on a node using OpenMP
– Then we have MPI to communicate across nodes and OpenMP within nodes
– This hybrid approach is often needed to get the best performance on big machines (a sketch follows below)
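A minimal sketch of the hybrid pattern (the work here is a made-up sum, not the relaxation code): OpenMP threads share the loop on each rank's portion, and MPI combines the per-rank results.

! hybrid.f90: MPI across ranks (nodes), OpenMP threads within a rank
program hybrid
  use mpi
  implicit none
  integer, parameter :: nlocal = 1000000
  integer :: ierr, mype, nprocs, i
  double precision :: local_sum, global_sum

  call MPI_Init(ierr)
  call MPI_Comm_Rank(MPI_COMM_WORLD, mype, ierr)
  call MPI_Comm_Size(MPI_COMM_WORLD, nprocs, ierr)

  ! OpenMP threads share this rank's portion of the work
  local_sum = 0.0d0
  !$omp parallel do private(i) reduction(+:local_sum)
  do i = 1, nlocal
     local_sum = local_sum + dble(mype*nlocal + i)
  enddo
  !$omp end parallel do

  ! MPI combines the per-rank results
  call MPI_Reduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, MPI_COMM_WORLD, ierr)

  if (mype == 0) print *, "global sum =", global_sum

  call MPI_Finalize(ierr)
end program hybrid

Run with, e.g., OMP_NUM_THREADS=8 and mpiexec -n 4 ./hybrid, so that each MPI rank spawns its own team of OpenMP threads.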
Parallel Python
● MPI has interfaces for Fortran and C/C++
● There are several python modules for MPI
– mpi4py: a module that can be imported into python
– pyMPI: changes the python interpreter itself
Parallel Python
● Hello world:
● Run with mpiexec -n 4 python hello.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

print("Hello, world", rank)
Parallel Python
● We can easily parallelize our Monte Carlo poker odds code
– Each processor considers hands independently
– Do a reduction at the end
code: poker-mpi.py
Parallel Libraries
● There are lots of libraries that provide parallel frameworks for writing your application
● Some examples:
– Linear Algebra / PDEs
● PETSc: linear and nonlinear system solvers, parallel matrix/vector routines
● hypre: sparse linear system solvers
– I/O
● HDF5: platform-independent parallel I/O built on MPI-IO
– Adaptive mesh refinement (grids)
● BoxLib: logically Cartesian AMR with elliptic solvers
Coarray Fortran
● Part of the Fortran 2008 standard
– A parallel version of Fortran
– A separate image (an instance of the program) is run on each processor
– [ ] on arrays is used to refer to different processors
– Not yet widely available
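A minimal coarray sketch (the compile flags vary by compiler; gfortran with -fcoarray=lib plus OpenCoarrays is one route, and that detail is an assumption):

! hello_caf.f90: one image runs per processor; [ ] indexes another image's copy
program hello_caf
  implicit none
  integer :: x[*]            ! a coarray: each image has its own x

  x = this_image()
  sync all                   ! make sure every image has written its value

  if (this_image() == 1) then
     print *, "running on", num_images(), "images"
     if (num_images() >= 2) print *, "image 2 sees x =", x[2]
  endif
end program hello_caf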
GPUs
● GPU offloading can greatly accelerate computing
● Main issue: data needs to transfer from the CPU across the (relatively slow) PCIe bus to the GPU
– Good performance requires that lots of work is done on the data to "pay" the cost of the transfer
● GPUs work as SIMD parallel machines
– The same instructions operate on all the data in lockstep
– Branching (if-tests) is slower
● Best performance requires that you structure your code to be vectorized
OpenACC
● OpenACC is a directives-based method for offloading computing to GPUs
– It looks like OpenMP
– A big difference is that you need to explicitly add directives that control data movement
● There's a big cost in moving data from the CPU to the GPU
– You need to do a lot of computing on the GPU to cover that expense
– We can separately control what is copied to and from the GPU
● We can do our same relaxation example using OpenACC
– Note: we need to explicitly write out the separate red-black updates to ensure that a loop doesn't access adjacent elements
code: relax-openacc.f90
OpenACC
code: relax-openacc.f90
!$acc data copyin(f, dx, imin, imax, jmin, jmax, bc_lo_x, bc_hi_x, bc_lo_y, bc_hi_y) copy(v)
do m = 1, nsmooth

   !$acc parallel

   !$acc loop
   do j = jmin, jmax
      v(imin-1,j) = 2*bc_lo_x - v(imin,j)
      v(imax+1,j) = 2*bc_hi_x - v(imax,j)
   enddo

   !$acc loop
   do i = imin, imax
      v(i,jmin-1) = 2*bc_lo_y - v(i,jmin)
      v(i,jmax+1) = 2*bc_hi_y - v(i,jmax)
   enddo

   !$acc wait

   !$acc loop collapse(2)
   do j = jmin, jmax, 2
      do i = imin, imax, 2
         v(i,j) = 0.25d0*(v(i-1,j) + v(i+1,j) + &
                          v(i,j-1) + v(i,j+1) - dx*dx*f(i,j))
      enddo
   enddo

   ...

This is part of the smoother function marked up with OpenACC
OpenACC
● This relaxation code runs about 30× faster on the GPU vs. the CPU (single core) on a local machine
– Note: when comparing CPU to GPU, a fair comparison would include all of the CPU cores, so for a 12-core machine, it is about 2.5× faster on the GPU
Supercomputing Centers
● Supercomputing centers
– National centers run by the NSF (through the XSEDE program) and the DOE (NERSC, OLCF, ALCF)
– You can apply for time - starter accounts are available at most centers to get up to speed
– To get lots of time, you need to demonstrate that your codes can scale to O(10^4) processors or more
● Queues
– You submit your job to a queue, specifying the number of processors (MPI + OpenMP threads) and the length of time
– Typical queue windows are 2-24 hours
– The job waits until resources are available
Supercomputing Centers
● Checkpoint/restart
– Long jobs won't be able to finish in the limited queue window
– You need to write your code so that it saves all of the data necessary to restart where it left off
● Archiving
– Mass storage at centers is provided (usually through HPSS)
– Typically you generate far more data than is reasonable to bring back locally - remote analysis and visualization are necessary
Future...
● The big thing in supercomputing these days is accelerators
– GPUs or Intel Phi boards
– These add a SIMD-like capability to the more general CPU
● Originally with GPUs, there were proprietary languages for interacting with them (e.g., CUDA)
● Currently, OpenACC is an OpenMP-like way of dealing with GPUs/accelerators
– Still maturing
– Portable
– Will merge with OpenMP in the near future
● Data transfer to the accelerators moves across the slow system bus
– Future processors may move these capabilities on-die