Transcript of Getting Performance from OpenMP Programs on NUMA Architectures
EU H2020 Centre of Excellence (CoE), 1 October 2015 – 31 March 2018
Grant Agreement No 676553

Getting Performance from OpenMP Programs on NUMA Architectures
Christian Terboven, RWTH Aachen University
OpenMP and Performance
• Among the most obscure things that can negatively impact the performance of OpenMP programs are cc-NUMA effects
• These are not restricted to OpenMP
• But they mostly show up because you used OpenMP
• In any case, they are important enough to be covered here
What is cc-NUMA?
• A set of processors is organized inside a locality domain with a locally connected memory
• The memory of all locality domains is accessible over a shared virtual address space
• Other locality domains are accessed over an interconnect; the local domain can be accessed very efficiently without resorting to a network of any kind
[Figure: two locality domains, each with cores, on-chip caches, and a local memory, connected by an interconnect and exposed as one virtual address space]
Advantages & Disadvantages
• Advantages
• Scalable in terms of memory bandwidth
• "Arbitrarily" large numbers of processors: there exist systems with over 1024 processors
• Disadvantages
• Efficient programming requires precautions with respect to local and remote memory, although all processors share one address space
• Cache coherence is hard and expensive to implement
• e.g., recent writes need invalidation and may consume a lot of the available bandwidth
Query your System
• You should have a basic understanding of the system topology. You could use one of the following options on a target machine:
• numactl tool to control the Linux NUMA policy
• numactl --hardware
• Delivers compact information about NUMA nodes and the associated processor IDs
• Intel MPI's cpuinfo tool
• cpuinfo
• Delivers information about the number of sockets (= packages) and the mapping of processor IDs to CPU cores used by the OS
• hwloc's hwloc-ls tool (comes with Open MPI)
• hwloc-ls
• Displays a (graphical) representation of the system topology, separated into NUMA nodes, along with the mapping of processor IDs to CPU cores used by the OS and additional information on caches
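The same information can also be obtained programmatically. A minimal sketch, assuming libnuma and its development headers are installed on the target machine:

#include <numa.h>    /* libnuma API, link with -lnuma */
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        printf("no NUMA support on this system\n");
        return 1;
    }
    int nodes = numa_max_node() + 1;
    int cpus  = numa_num_configured_cpus();
    printf("%d NUMA node(s), %d logical CPU(s)\n", nodes, cpus);
    for (int c = 0; c < cpus; c++)          /* map each processor ID to its NUMA node */
        printf("CPU %d -> node %d\n", c, numa_node_of_cpu(c));
    return 0;
}

Compile with, e.g., gcc query_topology.c -lnuma (the file name is only illustrative).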
To be NUMA or not to be
• Stream example with and without NUMA-aware data placement
• 2-socket system with Xeon X5675 processors, 12 OpenMP threads
            copy        scale       add         triad
ser_init    18.8 GB/s   18.5 GB/s   18.1 GB/s   18.2 GB/s
par_init    41.3 GB/s   39.3 GB/s   40.3 GB/s   40.4 GB/s
[Figure: threads T1-T6 run on CPU 0 and T7-T12 on CPU 1. With ser_init, all of a[0,N-1], b[0,N-1], c[0,N-1] resides in the memory attached to CPU 0. With par_init, a[0,(N/2)-1], b[0,(N/2)-1], c[0,(N/2)-1] resides in the memory of CPU 0 and a[N/2,N-1], b[N/2,N-1], c[N/2,N-1] in the memory of CPU 1]
Data Placement?
double* A;
A = (double*)malloc(N * sizeof(double));
for (int i = 0; i < N; i++) {
    A[i] = 0.0;
}
[Figure: two NUMA nodes, each with cores, on-chip caches, and a local memory, connected by an interconnect]
How to distribute the data?
Serial Data Placement
double* A;
A = (double*)malloc(N * sizeof(double));
for (int i = 0; i < N; i++) {
    A[i] = 0.0;   // first touch by the single initializing thread
}
[Figure: the entire array A[0]…A[N] ends up in the memory of the NUMA node on which the initializing thread runs]
• Serial code: all array elements are allocated in the memory of the NUMA node closest to the core executing the initializer thread (first touch)
First Touch Data Placement
double* A;
A = (double*)malloc(N * sizeof(double));
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    A[i] = 0.0;   // first touch by the thread that executes this iteration
}
[Figure: with parallel first-touch initialization, A[0]…A[N/2] resides in the memory of the first NUMA node and A[N/2]…A[N] in the memory of the second]
• Parallel code: each thread's chunk of the array is allocated in the memory of the NUMA node closest to the core executing the thread that touches it first, so the array is distributed across both nodes
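Putting this together, a sketch of a NUMA-aware vector kernel: the arrays are initialized with the same parallel loop schedule that the computation later uses, so each thread operates on data that first touch placed into its local memory (the array names, sizes, and the static schedule are illustrative assumptions, not the original Stream source):

#include <stdlib.h>
#include <omp.h>

#define N (32 * 1024 * 1024)

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));

    /* parallel first touch: each thread's chunk lands in its local memory */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        a[i] = 0.0; b[i] = 1.0; c[i] = 2.0;
    }

    /* triad: the identical static schedule hands each thread the chunk it touched first */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + 3.0 * c[i];
    }

    free(a); free(b); free(c);
    return 0;
}

Note that this only pays off if the threads stay where they were during initialization; thread binding (see the following slides) keeps them from migrating between the two loops.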
Decide for Binding Strategy
• Selecting the "right" binding strategy depends not only on the topology, but also on the characteristics of your application
• Putting threads far apart, i.e., on different sockets
• May improve the aggregated memory bandwidth available to your application
• May improve the combined cache size available to your application
• May decrease performance of synchronization constructs
• Putting threads close together, i.e., on two adjacent cores that possibly share some caches
• May improve performance of synchronization constructs
• May decrease the available memory bandwidth and cache size
• If you are unsure, just try a few options and then select the best one, as in the example below
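A quick way to compare the two strategies, assuming an OpenMP 4.0 runtime that honours these environment variables, is to run the same binary with both policies:
• OMP_PLACES=cores OMP_PROC_BIND=spread ./a.out (threads far apart: more aggregate bandwidth and cache)
• OMP_PLACES=cores OMP_PROC_BIND=close ./a.out (threads packed together: cheaper synchronization)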
OpenMP 4.0: Places + Policies
• Define OpenMP places
• a set of OpenMP threads running on one or more processors
• can be defined by the user, e.g., OMP_PLACES=cores
• Define a set of OpenMP thread affinity policies
• SPREAD: spread OpenMP threads evenly among the places, partition the place list
• CLOSE: pack OpenMP threads near the master thread
• MASTER: collocate OpenMP threads with the master thread
• Goals
• the user has a way to specify where to execute OpenMP threads, for locality between OpenMP threads / less false sharing / memory bandwidth
OMP_PLACES env. variable
• Assume the following machine:
• 2 sockets, 4 cores per socket, 4 hyper-threads per core
• Abstract names for OMP_PLACES:
• threads: Each place corresponds to a single hardware thread on the target machine
• cores: Each place corresponds to a single core (having one or more hardware threads) on the target machine
• sockets: Each place corresponds to a single socket (consisting of one or more cores) on the target machine
[Figure: with OMP_PLACES=cores, the example machine exposes eight places p0 … p7]
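Places can also be listed explicitly; a sketch of equivalent settings for the machine above, assuming the OS numbers the four hardware threads of the first core 0-3, the second core 4-7, and so on (the interval form is <place>:<count>:<stride>):
• OMP_PLACES="{0:4},{4:4},{8:4},{12:4},{16:4},{20:4},{24:4},{28:4}" (one place per core, four hardware threads each)
• OMP_PLACES="{0:4}:8:4" (the same eight places in interval notation)
• OMP_PLACES=cores (the abstract name, letting the runtime derive the same list)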
OpenMP 4.0: Places + Policies
• Example's objective:
• separate cores for the outer loop and near cores for the inner loop
• Outer parallel region: proc_bind(spread), inner: proc_bind(close)
• spread creates a partition, close binds threads within the respective partition
OMP_PLACES="{0,1,2,3},{4,5,6,7},..." = "{0:4}:8:4" = cores
#pragma omp parallel proc_bind(spread) num_threads(4)
#pragma omp parallel proc_bind(close) num_threads(4)
• Example:
[Figure: places p0 … p7 showing the resulting thread placement for 'initial', 'spread 4' (one thread at the start of each partition) and 'close 4' (four threads packed on adjacent places)]
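A compilable sketch of this nested construct (the printf body, thread counts, and the use of omp_set_nested are illustrative assumptions; run with OMP_PLACES=cores set as above):

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_nested(1);   /* allow the inner parallel region to create its own team */

    /* outer team: 4 threads spread over the place list, each receiving its own partition */
    #pragma omp parallel proc_bind(spread) num_threads(4)
    {
        int outer = omp_get_thread_num();

        /* inner team: 4 threads packed close together within the outer thread's partition */
        #pragma omp parallel proc_bind(close) num_threads(4)
        {
            printf("outer thread %d, inner thread %d\n", outer, omp_get_thread_num());
        }
    }
    return 0;
}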
NUMA Strategies: Overview
• First Touch: Modern operating systems (e.g., Linux >= 2.4) determine the physical location of a memory page during the first page fault, when the page is first "touched", and put it close to the CPU that causes the page fault
• Everything under control?
• In principle yes, but only if
• threads can be bound explicitly,
• data can be placed well by first touch, or can be migrated
• What if the data access pattern changes over time?
• Explicit Migration: Selected regions of memory (pages) are moved from one NUMA node to another via an explicit OS syscall
• Automatic Migration: No support for this in current operating systems
User Control of Memory Affinity
• Explicit NUMA-aware memory allocation:
• By carefully touching data by the thread which later uses it
• By changing the default memory allocation strategy with numactl
• By explicit migration of memory pages
• Linux: move_pages()
• Include the <numaif.h> header, link with -lnuma
long move_pages(int pid, unsigned long count, void **pages, const int *nodes, int *status, int flags);
• Example: using numactl to distribute pages round-robin:
• numactl --interleave=all ./a.out
• Example: run on node 0 with memory allocated on nodes 0 and 1:
• numactl --cpubind=0 --membind=0,1 ./a.out
• A page-migration sketch with move_pages() follows below
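A minimal sketch of explicit migration with move_pages(), assuming libnuma is installed and the system has at least two NUMA nodes; the array size and the target node are illustrative:

#include <numaif.h>   /* move_pages(), MPOL_MF_MOVE; link with -lnuma */
#include <unistd.h>   /* sysconf() */
#include <stdlib.h>
#include <stdio.h>

int main(void) {
    const size_t N = 1 << 24;                       /* 16M doubles = 128 MB */
    long pagesize = sysconf(_SC_PAGESIZE);
    double *A = malloc(N * sizeof(double));

    for (size_t i = 0; i < N; i++)                  /* serial first touch: all pages near this thread */
        A[i] = 0.0;

    /* collect one address per page in the second half of A and request node 1 for each */
    size_t first  = (N / 2) * sizeof(double) / pagesize;
    size_t npages = N * sizeof(double) / pagesize - first;
    void **pages  = malloc(npages * sizeof(void *));
    int   *nodes  = malloc(npages * sizeof(int));
    int   *status = malloc(npages * sizeof(int));
    for (size_t p = 0; p < npages; p++) {
        pages[p] = (char *)A + (first + p) * pagesize;
        nodes[p] = 1;                               /* illustrative target node */
    }

    /* pid 0 = calling process; MPOL_MF_MOVE moves only pages owned exclusively by it */
    long rc = move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE);
    if (rc < 0)
        perror("move_pages");
    else
        printf("first migrated page is now on node %d\n", status[0]);

    free(pages); free(nodes); free(status); free(A);
    return 0;
}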
Contact: https://www.pop-coe.eu, mailto:[email protected]
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 676553.
Performance Optimisation and Productivity: A Centre of Excellence in Computing Applications