Lecture 7. Performance
Prof. Taeweon Suh
Computer Science Education
Korea University
COM503 Parallel Computer Architecture & Programming
Korea Univ
Parallel Performance of OpenMP
• Performance is influenced by at least the following factors:
  - Memory access pattern of the individual threads
    If each thread consistently accesses a distinct portion of data throughout the program, it probably makes excellent use of the memory hierarchy
  - Overhead of OpenMP constructs
    When a parallel region is created, threads might have to be created or woken up, and some data structures have to be set up to carry information needed by the runtime system
  - Load imbalance between synchronization points
    Threads might have to wait for a member of the team to carry out the work of a single construct
  - Other synchronization costs
    Threads typically waste time waiting for access to a critical region (or to acquire a lock)
#threads on Performance
• When running a parallel application, make sure that the load (#threads) on the system does not exceed the number of processors
  - If it does, the system is said to be oversubscribed
  - Oversubscription not only degrades performance but also makes it hard to analyze the program's behavior
• On an SMP system, a program should use fewer threads than the number of processors
  - OS daemons and services need to run on a processor, too
  - If all processors are in use by the application, even a relatively lightweight daemon disrupts the execution of the user program, because one thread has to give way to this process
Performance
• Sequential performance of an application program is still a major concern when creating a parallel program
• Poor sequential performance is often caused by suboptimal use of the caches found in contemporary computers
  - In particular, a cache miss is expensive because it implies that the data must be fetched from main memory
  - If cache misses happen frequently, they can severely reduce program performance
• On an SMP system, the impact of cache misses can be even stronger, because of the limited bandwidth and the latency of the interconnect
Cache
• A major goal is to organize data accesses so that data are used as often as possible while they are still in the cache
• The most common strategies rely on the fact that programming languages typically store the elements of arrays contiguously in memory
  - Take advantage of temporal and spatial locality
Cache-friendly Code
• In C, a 2-dimensional array is stored in row-major order
  - Example: int A[10][8]
for (i=0; i<n; i++)
    for (j=0; j<n; j++)
        sum += a[i][j];
[Figure: a cache line (block) holds A[0][0]..A[0][7]; in memory, A[0][0], A[0][1], ..., A[0][7], A[1][0], A[1][1], ... are laid out contiguously, so the inner j-loop walks along one cache line at a time]
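The effect of this layout can be demonstrated with a small sketch (the array size and function names below are mine, not from the lecture): both traversals compute the same sum, but only the row-major one walks memory sequentially.

```c
#include <assert.h>

#define N 64

/* Cache-friendly: the inner loop walks along a row, i.e. along
 * consecutive addresses, so each cache line is fully used. */
long sum_row_major(int a[N][N]) {
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Cache-hostile: the inner loop walks down a column, jumping
 * N*sizeof(int) bytes per access, so it touches a new line almost
 * every time when the array is larger than the cache. */
long sum_col_major(int a[N][N]) {
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```

Both functions are semantically identical; the difference shows up only in cache behavior and, for large arrays, in run time.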
Cache-friendly Code
• In Fortran, a 2-dimensional array is stored in column-major order
  - Example: INTEGER A(10,8)
[Figure: column-major layout of A in memory; a cache line holds eight consecutive elements of one column, so the inner I-loop walks along one cache line at a time]
DO J = 1, 8
    DO I = 1, 10
        sum = sum + A(I,J)
    END DO
END DO
TLB Consideration
• The page size is determined by what the CPU supports, together with the choice offered by the operating system
  - Typically, the page size is 4KB
• The TLB is on the critical path for performance
  - Think about a PIPT cache
• Just as with the data cache, it is important to make good use of the TLB entries
Loop Optimizations
• Both the programmer and the compiler can improve the use of memory
• A simple reordering of the statements inside the body of a loop nest may make a difference:
  - Loop Interchange (or Loop Exchange)
  - Loop Unrolling
  - Unroll and Jam
  - Loop Fusion
  - Loop Fission
  - Loop Tiling (or Blocking)
Loop Interchange
/* Before */
for (j=0; j<100; j++)
    for (i=0; i<5000; i++)
        x[i][j] = 2*x[i][j];

/* After */
for (i=0; i<5000; i++)
    for (j=0; j<100; j++)
        x[i][j] = 2*x[i][j];

[Figure: access order over the array before (column-wise, starting at j=0) and after (row-wise, starting at i=0) the interchange]
• Improved cache efficiency with row-major ordering
• What is the worst that could happen?
Slide from Prof. Sean Lee, Georgia Tech
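As a sanity check that the interchange is legal here, a small sketch (the sizes and the function name are illustrative, not from the slides): because every iteration touches a distinct element, the two loop orders must produce the same array.

```c
#include <assert.h>

#define ROWS 8
#define COLS 8

/* The interchanged (cache-friendly) version: the inner j-loop walks
 * along a row, matching C's row-major layout. Since the iterations
 * are independent, interchanging the loops cannot change the result. */
void scale_interchanged(int x[ROWS][COLS]) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}
```

When iterations carry dependences, an interchange can silently change the program's meaning; that is "the worst that could happen."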
Loop Unrolling
for (int i=1; i<100; i++) {
    a[i] = b[i] + 1;
    c[i] = b[i] + a[i-1] + b[i-1];
}
• The loop overhead includes incrementing the loop variable, testing its value, and branching to the start of the loop
• Unrolling the loop (in the example, by a factor of 2) brings:
  - Overall loop overhead roughly halved
  - Data reuse improved: the value of a[i] just computed can be used immediately
  - ILP could be increased
• Nowadays, a programmer seldom needs to apply this transformation manually, since compilers are very good at doing it
for (int i=1; i+1<100; i+=2) {
    a[i] = b[i] + 1;
    c[i] = b[i] + a[i-1] + b[i-1];
    a[i+1] = b[i+1] + 1;
    c[i+1] = b[i+1] + a[i] + b[i];
}
/* a cleanup iteration handles i = 99 */

Unroll factor = 2
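A complete unrolled version needs a cleanup loop whenever the trip count is not a multiple of the unroll factor. The sketch below (function names and sizes are mine) checks the unrolled loop against the rolled original:

```c
#include <assert.h>

/* Rolled loop, starting at 1 so that a[i-1] and b[i-1] are in bounds. */
void compute(int n, int a[], int c[], const int b[]) {
    for (int i = 1; i < n; i++) {
        a[i] = b[i] + 1;
        c[i] = b[i] + a[i-1] + b[i-1];
    }
}

/* Unrolled by 2, with a cleanup loop for the leftover iteration when
 * the trip count is odd. */
void compute_unrolled(int n, int a[], int c[], const int b[]) {
    int i;
    for (i = 1; i + 1 < n; i += 2) {
        a[i] = b[i] + 1;
        c[i] = b[i] + a[i-1] + b[i-1];
        a[i+1] = b[i+1] + 1;            /* reuses a[i] computed just above */
        c[i+1] = b[i+1] + a[i] + b[i];
    }
    for (; i < n; i++) {                /* cleanup */
        a[i] = b[i] + 1;
        c[i] = b[i] + a[i-1] + b[i-1];
    }
}
```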
Unroll and Jam
• Unroll and Jam is an extension of loop unrolling that is appropriate for some loop nests with multiple loops
for (int j=0; j<n; j++)
    for (int i=0; i<n; i++)
        a[i][j] = b[i][j] + 1;

Outer loop unrolling (assuming n is even):

for (int j=0; j<n; j+=2) {
    for (int i=0; i<n; i++)
        a[i][j] = b[i][j] + 1;
    for (int i=0; i<n; i++)
        a[i][j+1] = b[i][j+1] + 1;
}

Jam:

for (int j=0; j<n; j+=2)
    for (int i=0; i<n; i++) {
        a[i][j] = b[i][j] + 1;
        a[i][j+1] = b[i][j+1] + 1;
    }
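A quick equivalence check of the jammed version against the original nest (the fixed size N is my assumption; the slide's n is generic and assumed even):

```c
#include <assert.h>

#define N 8   /* must be even, as the slide's unrolled version assumes */

void add_one(int a[N][N], int b[N][N]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = b[i][j] + 1;
}

/* Outer loop unrolled by 2 and the two inner loops jammed together:
 * each pass over i now updates two adjacent columns, so b[i][j] and
 * b[i][j+1] are read while their cache line is still resident. */
void add_one_jammed(int a[N][N], int b[N][N]) {
    for (int j = 0; j < N; j += 2)
        for (int i = 0; i < N; i++) {
            a[i][j]   = b[i][j]   + 1;
            a[i][j+1] = b[i][j+1] + 1;
        }
}
```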
Loop Fusion
• Loop Fusion merges two or more loops to create a bigger loop
  - May improve cache efficiency
  - Could increase the amount of computation per iteration in order to improve ILP
  - Lowers loop overhead
for (int i=0; i<n; i++)
    a[i] = b[i] * 2;

for (int i=0; i<n; i++) {
    x[i] = 2 * x[i];
    c[i] = a[i] * 2;
}

After fusion:

for (int i=0; i<n; i++) {
    a[i] = b[i] * 2;
    x[i] = 2 * x[i];
    c[i] = a[i] * 2;
}
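The fused loop can be checked against the two original loops; note that the fusion is only legal because a[i] is written before c[i] reads it within the same iteration. The sizes and values below are illustrative:

```c
#include <assert.h>

/* Fused version: one pass over i performs all three updates, so b[i]
 * (and the freshly written a[i]) are reused while still in cache, and
 * the loop overhead is paid once instead of twice. */
void fused(int n, int a[], const int b[], int x[], int c[]) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] * 2;
        x[i] = 2 * x[i];
        c[i] = a[i] * 2;
    }
}
```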
Loop Fission
• Loop Fission is a transformation that breaks up a loop into several loops
  - May improve use of the cache or isolate a part that inhibits full optimization of the loop
  - Likely to be most useful if a loop nest is large and its data does not fit into the cache
for (int i=0; i<n; i++) {
    c[i] = exp(i/n);
    for (int j=0; j<m; j++)
        a[j][i] = b[j][i] + d[j] * e[i];
}

After loop fission and loop interchange:

for (int i=0; i<n; i++)
    c[i] = exp(i/n);

for (int j=0; j<m; j++)
    for (int i=0; i<n; i++)
        a[j][i] = b[j][i] + d[j] * e[i];
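The fissioned code can be sketched as follows (the fixed sizes are my additions, and a plain expression stands in for the slide's exp(i/n) to keep the sketch self-contained; note that with integer operands, i/n would truncate to 0):

```c
#include <assert.h>

#define N 4
#define M 3

/* Fissioned loops: c[] gets its own loop, and the 2-D update becomes a
 * separate nest that can then be interchanged for row-major access. */
void fissioned(double c[N], double a[M][N], double b[M][N],
               double d[M], double e[N]) {
    for (int i = 0; i < N; i++)
        c[i] = (double)i / N;   /* the slide computes exp(i/n) here */

    for (int j = 0; j < M; j++)
        for (int i = 0; i < N; i++)
            a[j][i] = b[j][i] + d[j] * e[i];
}
```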
Why Loop Blocking?
/* Before */
for (i=0; i<N; i++)
    for (j=0; j<N; j++)
        for (k=0; k<N; k++)
            x[i][j] += y[i][k]*z[k][j];

[Figure: access patterns of y[i][k], z[k][j], and x[i][j]; walking down an entire column of z does not exploit locality!]
Modified slide from Prof. Sean Lee, Georgia Tech
Loop Blocking (Loop Tiling)
• Partition the loop’s iteration space into many smaller chunks and ensure that the data stays in the cache until it is reused
[Figure: blocked access patterns of y[i][k], z[k][j], and x[i][j]; each B×B tile is consumed before moving on]
Modified slide from Prof. Sean Lee, Georgia Tech
/* After */
for (jj=0; jj<N; jj=jj+B)          // B: blocking factor
    for (kk=0; kk<N; kk=kk+B)
        for (i=0; i<N; i++)
            for (j=jj; j<min(jj+B,N); j++)
                for (k=kk; k<min(kk+B,N); k++)
                    x[i][j] += y[i][k]*z[k][j];
[Figure: traversal order of the B×B tiles; the numbered tiles of y and z show the order in which each block of x is updated]
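The blocked multiply can be verified against the naive triple loop; a minimal sketch (the sizes and helper names are mine; B must divide into the problem size only for neatness, since min() handles the edges):

```c
#include <assert.h>
#include <string.h>

#define N 16
#define BLK 4   /* blocking factor; the slide calls it B */

static int min_int(int x, int y) { return x < y ? x : y; }

/* Tiled matrix multiply from the slide. x must start at zero because
 * the blocked loops accumulate partial products tile by tile. */
void matmul_blocked(double x[N][N], double y[N][N], double z[N][N]) {
    memset(x, 0, sizeof(double) * N * N);
    for (int jj = 0; jj < N; jj += BLK)
        for (int kk = 0; kk < N; kk += BLK)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min_int(jj + BLK, N); j++)
                    for (int k = kk; k < min_int(kk + BLK, N); k++)
                        x[i][j] += y[i][k] * z[k][j];
}

/* Naive triple loop, used as a reference. */
void matmul_naive(double x[N][N], double y[N][N], double z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            x[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                x[i][j] += y[i][k] * z[k][j];
        }
}
```

For each x[i][j], the blocked version accumulates over k in the same ascending order as the naive one, so the results match exactly.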
Use of Pointers and Contiguous Memory in C
• Pointers pose a serious challenge for performance tuning
• Pointer Aliasing Problem
  - The memory model in C is such that, without additional information, one must assume that all pointers may reference any memory address
  - This prevents the compiler from performing many program optimizations, since it cannot determine that they are safe
  - If pointers are guaranteed to point to non-overlapping portions of memory (for example, because each pointer targets memory allocated through a separate call to malloc()), more aggressive techniques can be applied
  - In general, only the programmer knows what memory locations a pointer may refer to
Use of Pointers and Contiguous Memory in C
• The restrict keyword, introduced in C99, informs the compiler that the memory referenced by one pointer does not overlap with the memory section pointed to by another pointer
void mxv(int m, int n, double * restrict a,
         double * restrict b, double * restrict c)
{
    int i, j;
    for (i=0; i<m; i++) {
        a[i] = 0.0;
        for (j=0; j<n; j++)
            a[i] += b[i*n+j]*c[j];
    }
}
C99: informal name for ISO/IEC 9899:1999, the 1999 revision of the ISO C language standard (Wikipedia)
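A minimal use of the mxv routine above (the 2×2 values are mine): b holds an m-by-n matrix in row-major order, and with restrict the compiler may keep a[i] in a register across the inner loop.

```c
#include <assert.h>

/* The slide's matrix-vector product: a = B*c, with b storing the
 * m-by-n matrix row by row. The restrict qualifiers promise the
 * compiler that a, b, and c never overlap. */
void mxv(int m, int n, double * restrict a,
         double * restrict b, double * restrict c) {
    for (int i = 0; i < m; i++) {
        a[i] = 0.0;
        for (int j = 0; j < n; j++)
            a[i] += b[i*n + j] * c[j];
    }
}
```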
Using Compilers
• Modern compilers implement most, if not all, of these loop optimizations
  - They perform a variety of analyses (such as data dependence analysis) to determine whether the optimizations may be applied
  - Check which compiler options are available
• However, the compiler's ability to transform code is limited by its ability to analyze the program
  - It may be hindered by the presence of pointers
• So the programmer has to take action: some rewriting of the source code may lead to better results
Best Practices
• General recommendations for efficient OpenMP programs:
  - Optimize barrier use
  - Avoid the ordered construct
  - Avoid large critical regions
  - Maximize parallel regions
  - Avoid parallel regions in inner loops
  - Balance the load
• Additional performance considerations:
  - single vs. master construct
  - Private vs. shared data
  - Avoid false sharing
Optimize Barrier Use
• No matter how efficiently barriers are implemented, they are expensive operations
  - It is always worthwhile to reduce their use to the minimum
  - The nowait clause makes it easy to eliminate the barrier that is implied on several constructs
#pragma omp parallel
{
    ..
    #pragma omp for
    for (int i=0; i<n; i++)
        ..
    ..
    #pragma omp for nowait
    for (int i=0; i<n; i++)
        ..
} // barrier is implied

• The barrier implied by the second loop is redundant, because the barrier at the end of the parallel region follows immediately; nowait removes it
• A compiler might do this anyway
Optimize Barrier Use Example
#pragma omp parallel default(none) \
        shared(n, a, b, c, d, sum) private(i)
{
    #pragma omp for nowait
    for (i=0; i<n; i++)
        a[i] += b[i];

    #pragma omp for nowait
    for (i=0; i<n; i++)
        c[i] += d[i];

    #pragma omp barrier

    #pragma omp for nowait reduction(+:sum)
    for (i=0; i<n; i++)
        sum += a[i] + c[i];

} // barrier is implied
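The pattern above can be sketched as a function (in this sketch the reduction clause is moved onto the parallel directive, and the names are mine). Compiled without -fopenmp, the pragmas are ignored and the code runs serially with the same result:

```c
#include <assert.h>

/* Two independent update loops run without their implied barriers
 * (nowait); an explicit barrier then guarantees a[] and c[] are
 * complete before the dependent reduction loop reads them. */
double combine(int n, double a[], const double b[],
               double c[], const double d[]) {
    double sum = 0.0;
    #pragma omp parallel reduction(+:sum)
    {
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            a[i] += b[i];

        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            c[i] += d[i];

        #pragma omp barrier

        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            sum += a[i] + c[i];
    }   /* barrier implied here; the reduction result is ready after it */
    return sum;
}
```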
Avoid the Ordered Construct
• The ordered construct ensures that the corresponding block of code within a parallel loop is executed in the order of the loop iterations
  - It is expensive to implement: the runtime system has to keep track of which iterations have finished and possibly keep threads in a wait state until their results are needed
  - It inevitably slows program execution
Avoid Large Critical Regions
• A critical region is used to ensure that no two threads execute a piece of code simultaneously
  - The more code contained in the critical region, the greater the likelihood that threads have to wait to enter it
  - Thus, the programmer should minimize the amount of code enclosed within a critical region
• If possible, an atomic update is to be preferred
  - Whereas a critical region forces threads to execute all of the code enclosed within it one at a time, an atomic update enforces exclusive access to just one memory location
Maximize Parallel Regions
• Indiscriminate use of parallel regions may give rise to suboptimal performance
  - Overheads are associated with starting and terminating a parallel region
• Large parallel regions offer more opportunities for using data in cache and provide a bigger context for other compiler optimizations
#pragma omp parallel for
for ( … ) { /* Work-sharing loop 1 */ }

#pragma omp parallel for
for ( … ) { /* Work-sharing loop 2 */ }

#pragma omp parallel for
for ( … ) { /* Work-sharing loop 3 */ }

Combined into one parallel region:

#pragma omp parallel
{
    #pragma omp for
    for ( … ) { /* Work-sharing loop 1 */ }

    #pragma omp for
    for ( … ) { /* Work-sharing loop 2 */ }

    #pragma omp for
    for ( … ) { /* Work-sharing loop 3 */ }
}

• Fewer implied barriers
• Potential for cache data reuse between loops
• Downside: no adjustment of #threads on a per-loop basis
Avoid Parallel Regions in Inner Loops
• Another common technique to improve performance is to move parallel regions out of innermost loops
  - Otherwise, we repeatedly pay the overhead of the parallel construct
for (i=0; i<n; i++)
    for (j=0; j<n; j++)
        #pragma omp parallel for
        for (k=0; k<n; k++)
        { ……… }

• Overheads of the parallel region are incurred n^2 times

#pragma omp parallel
for (i=0; i<n; i++)
    for (j=0; j<n; j++)
        #pragma omp for
        for (k=0; k<n; k++)
        { ……… }

• The parallel construct overheads are minimized
Load Balancing
• In some parallel algorithms, threads have different amounts of work to do
  - One solution is to use the schedule clause with a non-static schedule
  - The caveat is that the dynamic and guided schedules have higher overheads than the static schedule
  - However, if the load imbalance is severe enough, this cost is offset by the more flexible allocation of work to threads
Pipelined Processing
for (i=0; i<N; i++) {
    ReadFromFile(i, …);
    for (j=0; j<ProcessingNum; j++)
        ProcessData();
    WriteResultsToFile(i);
}

Pipelined version:

#pragma omp parallel
{
    /* preload data to be used in the first iteration of the i-loop */
    #pragma omp single
    ReadFromFile(0, …);

    for (i=0; i<N; i++) {
        #pragma omp single nowait
        ReadFromFile(i+1, …);

        #pragma omp for schedule(dynamic)
        for (j=0; j<ProcessingNum; j++)
            ProcessChunkOfData();

        #pragma omp single nowait
        WriteResultsToFile(i);
    }
}

• The implied barrier at the end of the j-loop ensures that:
  - Data for the next loop iteration is available
  - The results of the previous iteration have been written before work proceeds
Single vs Master
• The functionality of the single and master constructs is similar
• The difference is that a single region can be executed by any thread (typically the first to encounter it), whereas this is not the case for the master region
• Efficiency is implementation- and application-dependent
  - In general, the master construct is more efficient, as the single construct requires more work in the OpenMP library
Private vs Shared
• The programmer may often choose whether data should be shared or private
  - Either choice might lead to a correct application, but the performance impact can be substantial if the wrong choice is made
• As an example, if threads need unique read/write access to a 1-dimensional array, there are two options: declare a 2-dimensional shared array with one row accessed by each thread, or let each thread allocate its own private 1-dimensional array
  - In general, the latter is to be preferred: in the former, a modified data element might sit in the same cache line as data modified by another thread, so performance degrades because of false sharing
Private vs Shared
• If data is only ever read in a parallel region, it can be shared
• But it could also be privatized, so that each thread has a local copy, using the firstprivate clause to initialize it to the values held prior to the parallel region
• Both approaches work, but the performance can differ
  - Sharing the data seems the reasonable choice: there is no risk of false sharing because the data is not modified, memory usage does not increase, and there is no runtime overhead to copy the data
• How about on a ccNUMA system?
Avoid False Sharing
• One of the factors limiting scalable performance is false sharing
  - It is a side effect of the cache-line granularity of cache coherence
  - When threads running on different processors update different words in the same cache line, the cache coherence protocol maintains data consistency by invalidating the entire cache line
  - If some or all of the threads update the same cache line frequently, performance degrades
False Sharing Example
• Assume that:
  - The cache line size is 8 words
  - #threads is 8 (Nthreads = 8)
#pragma omp parallel for shared(Nthreads, a) schedule(static, 1)
for (i=0; i<Nthreads; i++)
    a[i] += i;
• Array padding can be used to eliminate the problem
  - Padding the array by dimensioning it as a[n][8] and changing the indexing from a[i] to a[i][0] eliminates the false sharing
• Since the cache line size must be taken into account, padding is also non-portable
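The padding fix can be written out as below (8 ints per line is the slide's assumption; on many real machines a line is 64 bytes, i.e. 16 ints, which is exactly why the trick is non-portable):

```c
#include <assert.h>

#define NTHREADS 8
#define PAD 8   /* words per cache line, per the slide's assumption */

int a[NTHREADS][PAD];   /* a[i][0] replaces the old a[i] */

/* Each a[i][0] now begins a new 8-word region, so threads running
 * with schedule(static,1) update different cache lines and no longer
 * invalidate one another's copies: no false sharing. Without
 * -fopenmp the pragma is ignored and the loop runs serially. */
void update(void) {
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < NTHREADS; i++)
        a[i][0] += i;
}
```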
Avoid False Sharing
• In general, using private data instead of shared data significantly reduces the risk of false sharing
  - In contrast to array padding, it is also a portable optimization
Binding Threads to CPUs
• Use an environment variable (GCC's libgomp):
  export GOMP_CPU_AFFINITY="0 4 1 5"
  - Thread 0 attached to CPU 0
  - Thread 1 attached to CPU 4
  - Thread 2 attached to CPU 1
  - Thread 3 attached to CPU 5
• Try the lstopo command (it shows the topology of the system)
Our Server Config.
Single Thread Overhead
• Single-thread overhead: how effective the parallel version is when executed on a single thread
  - Ideally, the execution time of the OpenMP version equals that of the sequential version
  - In many cases, the sequential version is faster
  - However, there is also a chance that the OpenMP version on a single thread is faster, because of a difference in compiler optimizations
Overhead(single thread) = 100 x ( ElapsedTime(OpenMP, 1 thread) / ElapsedTime(Sequential) - 1 ) %
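As a worked example of this formula (the timings below are made up): a sequential run of 10 s and an OpenMP single-thread run of 12 s give a 20% overhead.

```c
#include <assert.h>

/* Single-thread overhead, in percent, as defined on this slide:
 * positive when the OpenMP version on one thread is slower than the
 * pure sequential version. */
double single_thread_overhead(double t_sequential, double t_omp_1thread) {
    return 100.0 * (t_omp_1thread / t_sequential - 1.0);
}
```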
Case Study
• Matrix x Vector Product
• Experiment environment
  - Sun Fire E6900: NUMA machine with UltraSparc IV (dual-core) processors, 2006
    Sun Fire was a series of server computers introduced in 2001
    6 CPU/memory boards (each board (SB#) can hold up to 4 UltraSparc processors)
    24 processors (= 48 cores)
  - Solaris 9 OS
• In general, performance results are significantly influenced by:
  - The application developer's coding style
  - The compiler, compiler options, and its runtime libraries
  - OS features, including its support for memory allocation and thread scheduling
  - Hardware characteristics: memory hierarchy, cache coherence mechanisms, support for atomic operations, and more
Case Study
• Single-thread overhead
Case Study
• Performance
Superlinear Speedup
• With a parallel program, there can be a positive cache effect that offsets some of the performance loss caused by sequential code and the various overheads
• This is because a parallel program has more aggregate cache capacity at its disposal, since each thread has some amount of local cache
• It might result in a superlinear speedup: the speedup exceeds the number of processors used
Backup
Overheads of the OpenMP Translation
• A cost is associated with the creation of OpenMP parallel regions, with the sharing of work among threads, and with all kinds of synchronization
• The sources of the overheads include:
  - The cost of starting up threads and creating their execution environment
  - The potential additional expense incurred by the encapsulation of a parallel region in a separate function
  - The cost of computing the schedule
  - The time taken to block and unblock threads, and the time for them to fetch work and signal that they are ready
Overheads of the OpenMP Translation
• Minor overheads are incurred by the firstprivate and lastprivate clauses
  - In most cases, however, these are relatively modest compared to the cost of barriers and other forms of thread synchronization, as well as the loss in speedup whenever one or more threads are idle
• Dynamic forms of scheduling lead to much more thread interaction than static schedules, and therefore inevitably incur higher overheads
  - On the other hand, they may reduce thread idle time in the presence of load imbalance
Overheads of the OpenMP Translation
• The EPCC microbenchmarks were created to help programmers estimate the relative cost of using different OpenMP constructs
  - The chart shows overheads for major OpenMP constructs as measured by the EPCC microbenchmarks for the first version of the OpenUH compiler
• A few results:
  - Overheads for the for directive and for the barrier are almost identical
  - Overheads for the parallel loop consist of invoking the static loop schedule and the barrier
  - Overheads for parallel for are just slightly higher than those for parallel
    This is accounted for by the overhead of sharing the work, which is negligible for the default static scheduling policy
  - The single directive has higher overheads than a barrier
    This is not surprising: the overheads consist of a call to a runtime library routine that ensures one thread executes the region, plus a barrier at the end
  - The reduction clause is costly because it is implemented via a critical region
Overheads of the OpenMP Translation
[Chart: EPCC-measured overheads of the single, reduction, parallel for, parallel, barrier, and for constructs]
Overheads of the OpenMP Translation
• Overheads for the different kinds of loop schedules
  - The measurements clearly show the performance benefit of a static schedule, and the penalty incurred by a dynamic schedule, where threads must grab chunks of work (especially small chunks) at run time
[Chart: loop-schedule overheads for dynamic,n; guided,n; static,n; and static]