Lecture 7. Performance
Prof. Taeweon Suh
Computer Science Education
Korea University
COM503 Parallel Computer Architecture & Programming
Korea Univ
Parallel Performance of OpenMP
• Performance is influenced by at least the following factors:
  - Memory access pattern of the individual threads
    If each thread consistently accesses a distinct portion of data throughout the program, it probably makes excellent use of the memory hierarchy
  - Overhead of OpenMP constructs
    When a parallel region is created, threads might have to be created or woken up, and some data structures have to be set up to carry information needed by the runtime system
  - Load imbalance between synchronization points
    Threads might have to wait for a member of the team to carry out the work of a single construct
  - Other synchronization costs
    Threads typically waste time waiting for access to a critical region (or to acquire a lock)
#threads on Performance
• When running a parallel application, make sure that the load (#threads) on the system does not exceed the number of processors
  - If it does, the system is said to be oversubscribed
  - Oversubscription not only degrades performance but also makes it hard to analyze the program's behavior
• On an SMP system, a program should use fewer threads than the number of processors
  - OS daemons and services need to run on a processor, too
  - If all processors are in use by the application, even a relatively lightweight daemon disrupts the execution of the user program, because one thread has to give way to this process
Performance
• Sequential performance of an application program is still a major concern when creating a parallel program
• Poor sequential performance is often caused by suboptimal use of the caches found in contemporary computers
  - In particular, a cache miss is expensive because it implies that the data must be fetched from main memory
  - If cache misses happen frequently, they can severely reduce program performance
• On an SMP system, the impact of cache misses can be even stronger, because of the limited bandwidth and the latency of the interconnect
Cache
• A major goal is to organize data accesses so that data are used as often as possible while they are still in the cache
• The most common strategies rely on the fact that programming languages typically store the elements of arrays contiguously in memory
  - Take advantage of temporal and spatial locality
Cache-friendly Code
• In C, a 2-dimensional array is stored in row-major order
  - Example: int A[10][8]
for (i=0; i<n; i++)
    for (j=0; j<n; j++)
        sum += a[i][j];
[Figure: a cache line (block) holds A[0][0]..A[0][7]; in memory, A[0][0], A[0][1], ..., A[0][7], A[1][0], A[1][1], ... are laid out contiguously, so the inner j-loop walks along one cache line at a time]
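The effect of this layout can be demonstrated with a small sketch (the array size and function names below are mine, not from the lecture): both traversals compute the same sum, but only the row-major one walks memory sequentially.

```c
#include <assert.h>

#define N 64

/* Cache-friendly: the inner loop walks along a row, i.e. along
 * consecutive addresses, so each cache line is fully used. */
long sum_row_major(int a[N][N]) {
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Cache-hostile: the inner loop walks down a column, jumping
 * N*sizeof(int) bytes per access, so it touches a new line almost
 * every time when the array is larger than the cache. */
long sum_col_major(int a[N][N]) {
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```

Both functions are semantically identical; the difference shows up only in cache behavior and, for large arrays, in run time.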
Cache-friendly Code
• In Fortran, a 2-dimensional array is stored in column-major order
  - Example: INTEGER A(10,8)
[Figure: column-major layout of A in memory; a cache line holds eight consecutive elements of one column, so the inner I-loop walks along one cache line at a time]
DO J = 1, 8
    DO I = 1, 10
        sum = sum + A(I,J)
    END DO
END DO
TLB Consideration
• The page size is determined by what the CPU supports, together with the choice offered by the operating system
  - Typically, the page size is 4KB
• The TLB is on the critical path for performance
  - Think about a PIPT cache
• Just as with the data cache, it is important to make good use of the TLB entries
Loop Optimizations
• Both the programmer and the compiler can improve the use of memory
• A simple reordering of the statements inside the body of a loop nest may make a difference:
  - Loop Interchange (or Loop Exchange)
  - Loop Unrolling
  - Unroll and Jam
  - Loop Fusion
  - Loop Fission
  - Loop Tiling (or Blocking)
Loop Interchange
/* Before */
for (j=0; j<100; j++)
    for (i=0; i<5000; i++)
        x[i][j] = 2*x[i][j];

/* After */
for (i=0; i<5000; i++)
    for (j=0; j<100; j++)
        x[i][j] = 2*x[i][j];

[Figure: access order over the array before (column-wise, starting at j=0) and after (row-wise, starting at i=0) the interchange]
• Improved cache efficiency with row-major ordering
• What is the worst that could happen?
Slide from Prof. Sean Lee, Georgia Tech
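As a sanity check that the interchange is legal here, a small sketch (the sizes and the function name are illustrative, not from the slides): because every iteration touches a distinct element, the two loop orders must produce the same array.

```c
#include <assert.h>

#define ROWS 8
#define COLS 8

/* The interchanged (cache-friendly) version: the inner j-loop walks
 * along a row, matching C's row-major layout. Since the iterations
 * are independent, interchanging the loops cannot change the result. */
void scale_interchanged(int x[ROWS][COLS]) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}
```

When iterations carry dependences, an interchange can silently change the program's meaning; that is "the worst that could happen."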
Loop Unrolling
for (int i=1; i<100; i++) {
    a[i] = b[i] + 1;
    c[i] = b[i] + a[i-1] + b[i-1];
}
• The loop overhead includes incrementing the loop variable, testing its value, and branching to the start of the loop
• Unrolling the loop (in the example, by a factor of 2) brings:
  - Overall loop overhead roughly halved
  - Data reuse improved: the value of a[i] just computed can be used immediately
  - ILP could be increased
• Nowadays, a programmer seldom needs to apply this transformation manually, since compilers are very good at doing it
for (int i=1; i+1<100; i+=2) {
    a[i] = b[i] + 1;
    c[i] = b[i] + a[i-1] + b[i-1];
    a[i+1] = b[i+1] + 1;
    c[i+1] = b[i+1] + a[i] + b[i];
}
/* a cleanup iteration handles i = 99 */

Unroll factor = 2
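A complete unrolled version needs a cleanup loop whenever the trip count is not a multiple of the unroll factor. The sketch below (function names and sizes are mine) checks the unrolled loop against the rolled original:

```c
#include <assert.h>

/* Rolled loop, starting at 1 so that a[i-1] and b[i-1] are in bounds. */
void compute(int n, int a[], int c[], const int b[]) {
    for (int i = 1; i < n; i++) {
        a[i] = b[i] + 1;
        c[i] = b[i] + a[i-1] + b[i-1];
    }
}

/* Unrolled by 2, with a cleanup loop for the leftover iteration when
 * the trip count is odd. */
void compute_unrolled(int n, int a[], int c[], const int b[]) {
    int i;
    for (i = 1; i + 1 < n; i += 2) {
        a[i] = b[i] + 1;
        c[i] = b[i] + a[i-1] + b[i-1];
        a[i+1] = b[i+1] + 1;            /* reuses a[i] computed just above */
        c[i+1] = b[i+1] + a[i] + b[i];
    }
    for (; i < n; i++) {                /* cleanup */
        a[i] = b[i] + 1;
        c[i] = b[i] + a[i-1] + b[i-1];
    }
}
```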
Unroll and Jam
• Unroll and Jam is an extension of loop unrolling that is appropriate for some loop nests with multiple loops
for (int j=0; j<n; j++)
    for (int i=0; i<n; i++)
        a[i][j] = b[i][j] + 1;

Outer loop unrolling (assuming n is even):

for (int j=0; j<n; j+=2) {
    for (int i=0; i<n; i++)
        a[i][j] = b[i][j] + 1;
    for (int i=0; i<n; i++)
        a[i][j+1] = b[i][j+1] + 1;
}

Jam:

for (int j=0; j<n; j+=2)
    for (int i=0; i<n; i++) {
        a[i][j] = b[i][j] + 1;
        a[i][j+1] = b[i][j+1] + 1;
    }
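A quick equivalence check of the jammed version against the original nest (the fixed size N is my assumption; the slide's n is generic and assumed even):

```c
#include <assert.h>

#define N 8   /* must be even, as the slide's unrolled version assumes */

void add_one(int a[N][N], int b[N][N]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = b[i][j] + 1;
}

/* Outer loop unrolled by 2 and the two inner loops jammed together:
 * each pass over i now updates two adjacent columns, so b[i][j] and
 * b[i][j+1] are read while their cache line is still resident. */
void add_one_jammed(int a[N][N], int b[N][N]) {
    for (int j = 0; j < N; j += 2)
        for (int i = 0; i < N; i++) {
            a[i][j]   = b[i][j]   + 1;
            a[i][j+1] = b[i][j+1] + 1;
        }
}
```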
Loop Fusion
• Loop Fusion merges two or more loops to create a bigger loop
  - May improve cache efficiency
  - Could increase the amount of computation per iteration in order to improve ILP
  - Lowers loop overhead
for (int i=0; i<n; i++)
    a[i] = b[i] * 2;

for (int i=0; i<n; i++) {
    x[i] = 2 * x[i];
    c[i] = a[i] * 2;
}

After fusion:

for (int i=0; i<n; i++) {
    a[i] = b[i] * 2;
    x[i] = 2 * x[i];
    c[i] = a[i] * 2;
}
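The fused loop can be checked against the two original loops; note that the fusion is only legal because a[i] is written before c[i] reads it within the same iteration. The sizes and values below are illustrative:

```c
#include <assert.h>

/* Fused version: one pass over i performs all three updates, so b[i]
 * (and the freshly written a[i]) are reused while still in cache, and
 * the loop overhead is paid once instead of twice. */
void fused(int n, int a[], const int b[], int x[], int c[]) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] * 2;
        x[i] = 2 * x[i];
        c[i] = a[i] * 2;
    }
}
```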
Loop Fission
• Loop Fission is a transformation that breaks up a loop into several loops
  - May improve use of the cache or isolate a part that inhibits full optimization of the loop
  - Likely to be most useful if a loop nest is large and its data does not fit into the cache
for (int i=0; i<n; i++) {
    c[i] = exp(i/n);
    for (int j=0; j<m; j++)
        a[j][i] = b[j][i] + d[j] * e[i];
}

After loop fission and loop interchange:

for (int i=0; i<n; i++)
    c[i] = exp(i/n);

for (int j=0; j<m; j++)
    for (int i=0; i<n; i++)
        a[j][i] = b[j][i] + d[j] * e[i];
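The fissioned code can be sketched as follows (the fixed sizes are my additions, and a plain expression stands in for the slide's exp(i/n) to keep the sketch self-contained; note that with integer operands, i/n would truncate to 0):

```c
#include <assert.h>

#define N 4
#define M 3

/* Fissioned loops: c[] gets its own loop, and the 2-D update becomes a
 * separate nest that can then be interchanged for row-major access. */
void fissioned(double c[N], double a[M][N], double b[M][N],
               double d[M], double e[N]) {
    for (int i = 0; i < N; i++)
        c[i] = (double)i / N;   /* the slide computes exp(i/n) here */

    for (int j = 0; j < M; j++)
        for (int i = 0; i < N; i++)
            a[j][i] = b[j][i] + d[j] * e[i];
}
```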
Why Loop Blocking?
/* Before */
for (i=0; i<N; i++)
    for (j=0; j<N; j++)
        for (k=0; k<N; k++)
            x[i][j] += y[i][k]*z[k][j];

[Figure: access patterns of y[i][k], z[k][j], and x[i][j]; walking down an entire column of z does not exploit locality!]
Modified slide from Prof. Sean Lee, Georgia Tech
Loop Blocking (Loop Tiling)
• Partition the loop’s iteration space into many smaller chunks and ensure that the data stays in the cache until it is reused
[Figure: blocked access patterns of y[i][k], z[k][j], and x[i][j]; each B×B tile is consumed before moving on]
Modified slide from Prof. Sean Lee, Georgia Tech
/* After */
for (jj=0; jj<N; jj=jj+B)          // B: blocking factor
    for (kk=0; kk<N; kk=kk+B)
        for (i=0; i<N; i++)
            for (j=jj; j<min(jj+B,N); j++)
                for (k=kk; k<min(kk+B,N); k++)
                    x[i][j] += y[i][k]*z[k][j];
[Figure: traversal order of the B×B tiles; the numbered tiles of y and z show the order in which each block of x is updated]
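The blocked multiply can be verified against the naive triple loop; a minimal sketch (the sizes and helper names are mine; B must divide into the problem size only for neatness, since min() handles the edges):

```c
#include <assert.h>
#include <string.h>

#define N 16
#define BLK 4   /* blocking factor; the slide calls it B */

static int min_int(int x, int y) { return x < y ? x : y; }

/* Tiled matrix multiply from the slide. x must start at zero because
 * the blocked loops accumulate partial products tile by tile. */
void matmul_blocked(double x[N][N], double y[N][N], double z[N][N]) {
    memset(x, 0, sizeof(double) * N * N);
    for (int jj = 0; jj < N; jj += BLK)
        for (int kk = 0; kk < N; kk += BLK)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min_int(jj + BLK, N); j++)
                    for (int k = kk; k < min_int(kk + BLK, N); k++)
                        x[i][j] += y[i][k] * z[k][j];
}

/* Naive triple loop, used as a reference. */
void matmul_naive(double x[N][N], double y[N][N], double z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            x[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                x[i][j] += y[i][k] * z[k][j];
        }
}
```

For each x[i][j], the blocked version accumulates over k in the same ascending order as the naive one, so the results match exactly.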
Use of Pointers and Contiguous Memory in C
• Pointers pose a serious challenge for performance tuning
• Pointer Aliasing Problem
  - The memory model in C is such that, without additional information, one must assume that all pointers may reference any memory address
  - This prevents the compiler from performing many program optimizations, since it cannot determine that they are safe
  - If pointers are guaranteed to point to non-overlapping portions of memory (for example, because each pointer targets memory allocated through a separate call to malloc()), more aggressive techniques can be applied
  - In general, only the programmer knows what memory locations a pointer may refer to
Use of Pointers and Contiguous Memory in C
• The restrict keyword, introduced in C99, informs the compiler that the memory referenced by one pointer does not overlap with the memory section pointed to by another pointer
void mxv(int m, int n, double * restrict a,
         double * restrict b, double * restrict c)
{
    int i, j;
    for (i=0; i<m; i++) {
        a[i] = 0.0;
        for (j=0; j<n; j++)
            a[i] += b[i*n+j]*c[j];
    }
}
C99: informal name for ISO/IEC 9899:1999, the 1999 revision of the ISO C language standard (Wikipedia)
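A minimal use of the mxv routine above (the 2×2 values are mine): b holds an m-by-n matrix in row-major order, and with restrict the compiler may keep a[i] in a register across the inner loop.

```c
#include <assert.h>

/* The slide's matrix-vector product: a = B*c, with b storing the
 * m-by-n matrix row by row. The restrict qualifiers promise the
 * compiler that a, b, and c never overlap. */
void mxv(int m, int n, double * restrict a,
         double * restrict b, double * restrict c) {
    for (int i = 0; i < m; i++) {
        a[i] = 0.0;
        for (int j = 0; j < n; j++)
            a[i] += b[i*n + j] * c[j];
    }
}
```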
Using Compilers
• Modern compilers implement most, if not all, of these loop optimizations
  - They perform a variety of analyses (such as data dependence analysis) to determine whether the optimizations may be applied
  - Check which compiler options are available
• However, the compiler's ability to transform code is limited by its ability to analyze the program
  - It may be hindered by the presence of pointers
• So the programmer has to take action: some rewriting of the source code may lead to better results
Best Practices
• General recommendations for efficient OpenMP programs:
  - Optimize barrier use
  - Avoid the ordered construct
  - Avoid large critical regions
  - Maximize parallel regions
  - Avoid parallel regions in inner loops
  - Balance the load
• Additional performance considerations:
  - single vs. master construct
  - Private vs. shared data
  - Avoid false sharing
Optimize Barrier Use
• No matter how efficiently barriers are implemented, they are expensive operations
  - It is always worthwhile to reduce their use to the minimum
  - The nowait clause makes it easy to eliminate the barrier that is implied on several constructs
#pragma omp parallel
{
    ..
    #pragma omp for
    for (int i=0; i<n; i++)
        ..
    ..
    #pragma omp for nowait
    for (int i=0; i<n; i++)
        ..
} // barrier is implied

• The barrier implied by the second loop is redundant, because the barrier at the end of the parallel region follows immediately; nowait removes it
• A compiler might do this anyway
Optimize Barrier Use Example
#pragma omp parallel default(none) \
        shared(n, a, b, c, d, sum) private(i)
{
    #pragma omp for nowait
    for (i=0; i<n; i++)
        a[i] += b[i];

    #pragma omp for nowait
    for (i=0; i<n; i++)
        c[i] += d[i];

    #pragma omp barrier

    #pragma omp for nowait reduction(+:sum)
    for (i=0; i<n; i++)
        sum += a[i] + c[i];

} // barrier is implied
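The pattern above can be sketched as a function (in this sketch the reduction clause is moved onto the parallel directive, and the names are mine). Compiled without -fopenmp, the pragmas are ignored and the code runs serially with the same result:

```c
#include <assert.h>

/* Two independent update loops run without their implied barriers
 * (nowait); an explicit barrier then guarantees a[] and c[] are
 * complete before the dependent reduction loop reads them. */
double combine(int n, double a[], const double b[],
               double c[], const double d[]) {
    double sum = 0.0;
    #pragma omp parallel reduction(+:sum)
    {
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            a[i] += b[i];

        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            c[i] += d[i];

        #pragma omp barrier

        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            sum += a[i] + c[i];
    }   /* barrier implied here; the reduction result is ready after it */
    return sum;
}
```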
Avoid the Ordered Construct
• The ordered construct ensures that the corresponding block of code within a parallel loop is executed in the order of the loop iterations
  - It is expensive to implement: the runtime system has to keep track of which iterations have finished and possibly keep threads in a wait state until their results are needed
  - It inevitably slows program execution
Avoid Large Critical Regions
• A critical region is used to ensure that no two threads execute a piece of code simultaneously
  - The more code contained in the critical region, the greater the likelihood that threads have to wait to enter it
  - Thus, the programmer should minimize the amount of code enclosed within a critical region
• If possible, an atomic update is to be preferred
  - Whereas a critical region forces threads to execute all of the code enclosed within it one at a time, an atomic update enforces exclusive access to just one memory location
Maximize Parallel Regions
• Indiscriminate use of parallel regions may give rise to suboptimal performance
  - Overheads are associated with starting and terminating a parallel region
• Large parallel regions offer more opportunities for using data in cache and provide a bigger context for other compiler optimizations
#pragma omp parallel for
for ( … ) { /* Work-sharing loop 1 */ }

#pragma omp parallel for
for ( … ) { /* Work-sharing loop 2 */ }

#pragma omp parallel for
for ( … ) { /* Work-sharing loop 3 */ }

Combined into one parallel region:

#pragma omp parallel
{
    #pragma omp for
    for ( … ) { /* Work-sharing loop 1 */ }

    #pragma omp for
    for ( … ) { /* Work-sharing loop 2 */ }

    #pragma omp for
    for ( … ) { /* Work-sharing loop 3 */ }
}

• Fewer implied barriers
• Potential for cache data reuse between loops
• Downside: no adjustment of #threads on a per-loop basis
Avoid Parallel Regions in Inner Loops
• Another common technique to improve performance is to move parallel regions out of innermost loops
  - Otherwise, we repeatedly pay the overhead of the parallel construct
for (i=0; i<n; i++)
    for (j=0; j<n; j++)
        #pragma omp parallel for
        for (k=0; k<n; k++)
        { ……… }

• Overheads of the parallel region are incurred n^2 times

#pragma omp parallel
for (i=0; i<n; i++)
    for (j=0; j<n; j++)
        #pragma omp for
        for (k=0; k<n; k++)
        { ……… }

• The parallel construct overheads are minimized
Load Balancing
• In some parallel algorithms, threads have different amounts of work to do
  - One solution is to use the schedule clause with a non-static schedule
  - The caveat is that the dynamic and guided schedules have higher overheads than the static schedule
  - However, if the load imbalance is severe enough, this cost is offset by the more flexible allocation of work to threads
Pipelined Processing
for (i=0; i<N; i++) {
    ReadFromFile(i, …);
    for (j=0; j<ProcessingNum; j++)
        ProcessData();
    WriteResultsToFile(i);
}

Pipelined version:

#pragma omp parallel
{
    /* preload data to be used in the first iteration of the i-loop */
    #pragma omp single
    ReadFromFile(0, …);

    for (i=0; i<N; i++) {
        #pragma omp single nowait
        ReadFromFile(i+1, …);

        #pragma omp for schedule(dynamic)
        for (j=0; j<ProcessingNum; j++)
            ProcessChunkOfData();

        #pragma omp single nowait
        WriteResultsToFile(i);
    }
}

• The implied barrier at the end of the j-loop ensures that:
  - Data for the next loop iteration is available
  - The results of the previous iteration have been written before work proceeds
Single vs Master
• The functionality of the single and master constructs is similar
• The difference is that a single region can be executed by any thread (typically the first to encounter it), whereas this is not the case for the master region
• Efficiency is implementation- and application-dependent
  - In general, the master construct is more efficient, as the single construct requires more work in the OpenMP library
Private vs Shared
• The programmer may often choose whether data should be shared or private
  - Either choice might lead to a correct application, but the performance impact can be substantial if the wrong choice is made
• As an example, if threads need unique read/write access to a 1-dimensional array, there are two options: declare a 2-dimensional shared array with one row accessed by each thread, or let each thread allocate its own private 1-dimensional array
  - In general, the latter is to be preferred: in the former, a modified data element might sit in the same cache line as data modified by another thread, so performance degrades because of false sharing
Private vs Shared
• If data is only ever read in a parallel region, it can be shared
• But it could also be privatized, so that each thread has a local copy, using the firstprivate clause to initialize it to the values held prior to the parallel region
• Both approaches work, but the performance can differ
  - Sharing the data seems the reasonable choice: there is no risk of false sharing because the data is not modified, memory usage does not increase, and there is no runtime overhead to copy the data
• How about on a ccNUMA system?
Avoid False Sharing
• One of the factors limiting scalable performance is false sharing
  - It is a side effect of the cache-line granularity of cache coherence
  - When threads running on different processors update different words in the same cache line, the cache coherence protocol maintains data consistency by invalidating the entire cache line
  - If some or all of the threads update the same cache line frequently, performance degrades
False Sharing Example
• Assume that:
  - The cache line size is 8 words
  - #threads is 8 (Nthreads = 8)
#pragma omp parallel for shared(Nthreads, a) schedule(static, 1)
for (i=0; i<Nthreads; i++)
    a[i] += i;
• Array padding can be used to eliminate the problem
  - Padding the array by dimensioning it as a[n][8] and changing the indexing from a[i] to a[i][0] eliminates the false sharing
• Since the cache line size must be taken into account, padding is also non-portable
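The padding fix can be written out as below (8 ints per line is the slide's assumption; on many real machines a line is 64 bytes, i.e. 16 ints, which is exactly why the trick is non-portable):

```c
#include <assert.h>

#define NTHREADS 8
#define PAD 8   /* words per cache line, per the slide's assumption */

int a[NTHREADS][PAD];   /* a[i][0] replaces the old a[i] */

/* Each a[i][0] now begins a new 8-word region, so threads running
 * with schedule(static,1) update different cache lines and no longer
 * invalidate one another's copies: no false sharing. Without
 * -fopenmp the pragma is ignored and the loop runs serially. */
void update(void) {
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < NTHREADS; i++)
        a[i][0] += i;
}
```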
Avoid False Sharing
• In general, using private data instead of shared data significantly reduces the risk of false sharing
  - In contrast to array padding, it is also a portable optimization
Binding Threads to CPUs
• Use an environment variable (GCC's libgomp):
  export GOMP_CPU_AFFINITY="0 4 1 5"
  - Thread 0 attached to CPU 0
  - Thread 1 attached to CPU 4
  - Thread 2 attached to CPU 1
  - Thread 3 attached to CPU 5
• Try the lstopo command (it shows the topology of the system)
Our Server Config.
Single Thread Overhead
• Single-thread overhead: how effective the parallel version is when executed on a single thread
  - Ideally, the execution time of the OpenMP version equals that of the sequential version
  - In many cases, the sequential version is faster
  - However, there is also a chance that the OpenMP version on a single thread is faster, because of a difference in compiler optimizations
Overhead(single thread) = 100 x ( ElapsedTime(OpenMP, 1 thread) / ElapsedTime(Sequential) - 1 ) %
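As a worked example of this formula (the timings below are made up): a sequential run of 10 s and an OpenMP single-thread run of 12 s give a 20% overhead.

```c
#include <assert.h>

/* Single-thread overhead, in percent, as defined on this slide:
 * positive when the OpenMP version on one thread is slower than the
 * pure sequential version. */
double single_thread_overhead(double t_sequential, double t_omp_1thread) {
    return 100.0 * (t_omp_1thread / t_sequential - 1.0);
}
```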
Case Study
• Matrix x Vector Product
• Experiment environment
  - Sun Fire E6900: NUMA machine with UltraSparc IV (dual-core) processors, 2006
    Sun Fire was a series of server computers introduced in 2001
    6 CPU/memory boards (each board (SB#) can hold up to 4 UltraSparc processors)
    24 processors (= 48 cores)
  - Solaris 9 OS
• In general, performance results are significantly influenced by:
  - The application developer's coding style
  - The compiler, compiler options, and its runtime libraries
  - OS features, including its support for memory allocation and thread scheduling
  - Hardware characteristics: memory hierarchy, cache coherence mechanisms, support for atomic operations, and more
Case Study
• Single-thread overhead
Case Study
• Performance
Superlinear Speedup
• With a parallel program, there can be a positive cache effect that offsets some of the performance loss caused by sequential code and the various overheads
• This is because a parallel program has more aggregate cache capacity at its disposal, since each thread has some amount of local cache
• It might result in a superlinear speedup: the speedup exceeds the number of processors used
Backup
Overheads of the OpenMP Translation
• A cost is associated with the creation of OpenMP parallel regions, with the sharing of work among threads, and with all kinds of synchronization
• The sources of the overheads include:
  - The cost of starting up threads and creating their execution environment
  - The potential additional expense incurred by the encapsulation of a parallel region in a separate function
  - The cost of computing the schedule
  - The time taken to block and unblock threads, and the time for them to fetch work and signal that they are ready
Overheads of the OpenMP Translation
• Minor overheads are incurred by the firstprivate and lastprivate clauses
  - In most cases, however, these are relatively modest compared to the cost of barriers and other forms of thread synchronization, as well as the loss in speedup whenever one or more threads are idle
• Dynamic forms of scheduling lead to much more thread interaction than static schedules, and therefore inevitably incur higher overheads
  - On the other hand, they may reduce thread idle time in the presence of load imbalance
Overheads of the OpenMP Translation
• The EPCC microbenchmarks were created to help programmers estimate the relative cost of using different OpenMP constructs
  - The chart shows overheads for major OpenMP constructs as measured by the EPCC microbenchmarks for the first version of the OpenUH compiler
• A few results:
  - Overheads for the for directive and for the barrier are almost identical
  - Overheads for the parallel loop consist of invoking the static loop schedule and the barrier
  - Overheads for parallel for are just slightly higher than those for parallel
    This is accounted for by the overhead of sharing the work, which is negligible for the default static scheduling policy
  - The single directive has higher overheads than a barrier
    This is not surprising: the overheads consist of a call to a runtime library routine that ensures one thread executes the region, plus a barrier at the end
  - The reduction clause is costly because it is implemented via a critical region
Overheads of the OpenMP Translation
[Chart: EPCC-measured overheads of the single, reduction, parallel for, parallel, barrier, and for constructs]
Overheads of the OpenMP Translation
• Overheads for the different kinds of loop schedules
  - The measurements clearly show the performance benefit of a static schedule, and the penalty incurred by a dynamic schedule, where threads must grab chunks of work (especially small chunks) at run time
[Chart: loop-schedule overheads for dynamic,n; guided,n; static,n; and static]