Performance Analysis for Parallel Applications · 2020-03-23 · Spring 2019 CS4823/6643 Parallel...
Spring 2019 CS4823/6643 Parallel Computing 1
Performance Analysis for Parallel Applications
Wei Wang
Achieving Desired Performance
● Ideally, parallel speedup should be linear
– i.e., using n processors => n times faster than using 1 processor
● Reality is stark
– Seldom do we achieve linear speedup
– Sometimes using more processors even causes a slowdown
● It is important to be able to understand and analyze performance issues of parallel applications
Theoretical Performance Models
Theoretical Performance Models
● Theoretical performance models improve our basic understanding of parallel performance
– They also define the maximum and minimum possible speedups
● Theoretical models
– Amdahl’s Law
– Gustafson’s Law
– Karp–Flatt metric
– Speedup model with communication cost
The Basic Speedup Definition
● Speedup of a parallel application is defined as
Speedup = Sequential Execution Time / Parallel Execution Time
Amdahl’s Law
● Intuition of Amdahl’s law:
– If an application has a p% parallelizable part and a (1 − p%) serial part,
● the minimum execution time of the application is the execution time of the serial part
● the maximum speedup is bounded by the serial part.
[Figure: a time bar split into a serial part (1 − p%) and a parallelizable part (p%). Given enough parallel processors, the parallelizable part’s execution time may approach 0, leaving the serial part’s execution time as the minimum execution time.]
Amdahl’s Law: Equation for Speedup
● The equation of Amdahl’s Law:
– Speedup is the overall speedup of the whole application after parallelization
– p% is the percentage of the execution time of the parallelizable part
– s is the speedup of the parallel part
Speedup = Sequential Exec Time / Parallel Exec Time
        = ((1 − p%) + p%) / ((1 − p%) + p%/s)
        = 1 / ((1 − p%) + p%/s)
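The equation above can be sketched in code. Below is a minimal Python sketch (not from the slides; the function name and the example values are assumptions for illustration):

```python
# Amdahl's law: speedup = 1 / ((1 - p) + p / s), where p is the
# parallelizable fraction (0..1) and s is the speedup of the parallel part.

def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the work gains a speedup of s."""
    return 1.0 / ((1.0 - p) + p / s)

# Example: 90% parallelizable, parallel part sped up 10x.
# The overall speedup is only ~5.26, far below 10.
print(amdahl_speedup(0.9, 10))
```

Note how the serial 10% keeps the overall speedup well below the parallel part’s speedup, which is exactly the bound the law describes.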
Amdahl’s Law: The Limit on Speedup
● If the parallel part speedup approaches infinity:
– i.e., the maximum speedup is bounded by the execution time of the serial part
lim(s → ∞) Speedup = 1 / ((1 − p%) + p%/s) = 1 / (1 − p%)
Another Visualization of Amdahl’s Law
* Figure by Daniels220 at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=6678551
Karp-Flatt Metric
● In practice, the ratio of the serial part is usually hard to know
● However,
– Speedup can be easily measured
– Parallel part speedup, s, is usually assumed to be the same as the number of processors
● If both Speedup and s are known, then we can estimate the ratio of the serial part with Amdahl’s law equation
● Karp-Flatt metric, e, is the name of the “serial part ratio” estimated based on experimentally measured Speedup
e = 1 − p% = (1/Speedup − 1/s) / (1 − 1/s)
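The metric is straightforward to compute. Below is a small Python sketch (not from the slides; the function name and the measured values are assumptions for illustration):

```python
# Karp-Flatt metric: e = (1/Speedup - 1/s) / (1 - 1/s),
# where s is usually taken to be the number of processors.

def karp_flatt(speedup, s):
    """Estimate the serial fraction e from a measured speedup on s processors."""
    return (1.0 / speedup - 1.0 / s) / (1.0 - 1.0 / s)

# Example: a measured speedup of 6 on 8 processors
# implies roughly a 4.8% serial part.
print(karp_flatt(6.0, 8))
```

A growing e as processors are added suggests parallel overhead rather than a fixed serial part.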
The Limitations of Amdahl’s Law and Karp-Flatt Metric
● Amdahl’s law defines the performance limit of a parallel application, with
– A fixed problem size, i.e., the ratios of the parallel part and the serial part are fixed.
● In real life, problem size usually grows with system size
● The parallel part usually grows faster than the serial part
● This limitation is addressed by Gustafson’s law
● Amdahl’s law does not provide any insights about parallel overhead
Gustafson's Law
● Gustafson’s Law considers the growth of problem size
● Intuitively, Gustafson’s Law says that
– if we can indefinitely scale up the problem size (and its parallelizable part) with the increase of the processor count, the overall speedup of the whole application is the same as the speedup of the parallelizable part.
● Gustafson’s law reveals the true power of parallelization: we can solve much bigger problems within a reasonable amount of time
Deriving Gustafson’s Law
● Let (similar to Amdahl’s law),
– p% be the percentage of the execution time of the parallelizable part in the original workload
– s be the speedup of the parallel part
– W be the original problem size on one processor
● With more processors, increase the workload size
● The speedup for the new workload is then
new_workload = (1 − p%)×W + (s×p%)×W

Speedup = [(1 − p%)×W + (s×p%)×W] / [(1 − p%)×W + ((s×p%)×W)/s]
        = [(1 − p%)×W + (s×p%)×W] / W
        = 1 − p% + s×p%
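The derivation above reduces to a one-line formula. Below is a minimal Python sketch (not from the slides; the function name and the example values are assumptions for illustration):

```python
# Gustafson's scaled speedup: speedup = (1 - p) + s * p, where p is the
# parallelizable fraction of the original workload and s is the speedup
# (roughly the processor count) of the parallel part.

def gustafson_speedup(p, s):
    """Scaled speedup when the parallel part of the workload grows s-fold."""
    return (1.0 - p) + s * p

# Example: 90% parallelizable on 64 processors (s = 64)
# gives a scaled speedup of 57.7 -- close to s, unlike
# Amdahl's fixed-size bound of 1/(1 - 0.9) = 10.
print(gustafson_speedup(0.9, 64))
```

The contrast with Amdahl’s law comes from letting the parallel part grow with the machine instead of holding the problem size fixed.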
Maximum Speedup By Gustafson’s Law
● If the parallelizable part is indefinitely large
– i.e., if the problem size is large enough, and especially if the parallelizable part is large enough, the overall speedup is the same as the speedup of the parallelizable part
lim(p% → 1) Speedup = (1 − p%) + s×p% = s
The Limitations of Gustafson's Law
● The sizes of some problems are limited.– e.g., the size of a chess game at a certain stage.
– Essentially, these applications do not benefit much from large scale parallelization.
● Similar to Amdahl’s law, Gustafson’s law does not provide any insights about parallel overhead
Speedup Equation with Communication Cost
● Probably originated from Michael Quinn’s Parallel Computing textbook
● Let p% be the percentage of the execution time of the parallelizable part in the original workload
● Sequential execution time is (still)
Sequential Execution Time = (1 − p%) + p% = 1
Speedup Equation with Communication Cost
● Let – n be the number of processors
– k(n) be the cost of communication on n processors
● The parallel execution time is then (assuming the parallel computation has linear speedup)
Parallel Execution Time = (1 − p%) + p%/n + k(n)

● Overall speedup is

Speedup = Sequential Exec Time / Parallel Exec Time
        = 1 / ((1 − p%) + p%/n + k(n))
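This model can be explored numerically. Below is a small Python sketch (not from the slides; the function names, p = 0.95, and the linear cost k(n) = 0.002·n are all assumed example values for illustration):

```python
# Speedup with communication cost: speedup(n) = 1 / ((1 - p) + p/n + k(n)).
# As an assumed example, take k(n) = 0.002 * n, a communication cost
# that grows linearly with the processor count.

def speedup_with_comm(p, n, k):
    """Speedup on n processors when communication costs k(n)."""
    return 1.0 / ((1.0 - p) + p / n + k(n))

k = lambda n: 0.002 * n

# Find the processor count that maximizes speedup for p = 0.95.
best_n = max(range(1, 101), key=lambda n: speedup_with_comm(0.95, n, k))
print(best_n, speedup_with_comm(0.95, best_n, k))
# Past best_n, adding processors only reduces speedup.
```

The maximum exists because p%/n shrinks while k(n) grows; beyond the crossover, communication dominates.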
Plotting the Execution Time of the Parallel Part (p%/n)
[Plot: execution time of the parallel part (p%/n), decreasing as the number of processors (n) grows from 1 to 25]
Plotting the Communication Cost k(n)
[Plot: communication cost (k(n)), increasing as the number of processors (n) grows from 1 to 25]
Plotting the Parallel Execution time (%p/n + k(n)) and Speedup
[Plot: parallel part time, communication cost, and speedup vs. number of processors (n = 1 to 25); total execution time (p%/n + k(n)) first falls and then rises, so speedup shows diminishing returns]
Real Speedup with Communication Cost
● Due to the increasing communication cost,
– real speedup is usually sub-linear
– Speedup reaches its maximum at a certain number of processors; adding more cores beyond that only reduces speedup
Super-linear Speedup
● It is impossible to get super-linear speedup with our current theoretical models.
● However, super-linear speedup is occasionally possible in practice (very rare)
– Super-linear speedup is usually caused by improved utilization and/or capacity of memory resources, such as caches or DRAM
– Super-linear speedup is also possible in some algorithms employing exploratory decomposition
Basic Performance Analysis for Parallel Applications
Common Causes for Poor Parallel Performance
● Algorithm issues:
– Work is not evenly partitioned among tasks/threads/processes.
– Some computations are duplicated/repeated.
● Communication issues:
– Blocking communications/synchronizations cause idling.
– Spinning synchronizations consume too much processor time.
– Too much data needs to be communicated.
● Memory/data issues:
– Threads/processes contend for cache and DRAM resources.
– Cache space and memory bandwidth are limited.
– Data and computations are too far apart (on large-scale systems).
● System management issues:
– Processors/cores are not load-balanced.
Determine Performance Issues by Observing Application Behaviors
● Performance analysis essentially means to,
1. Observe application behaviors
2. Based on the behaviors, determine the potential causes of poor performance
● Some examples of application behaviors:
– CPU (processor) utilization
– System time, user time
– Number of synchronization operations
– Memory usage
– Instruction counts
– CPU stalled cycles
● Usually a specific behavior can be associated with several potential performance issues.
– Further analysis of the algorithm, application code, and system configuration is required to pinpoint the exact cause.
– It is also common that an application is affected by multiple performance issues at the same time.
Performance Analysis on Linux
● Modern operating systems (OSes) provide a myriad of performance measurements.
– Available on almost all OSes, including Windows, Linux, and macOS.
● Modern CPUs/GPUs also provide many performance measurements to investigate the internal usage of CPU resources.
– These measurements are provided by hardware Performance Monitoring Units (PMUs).
– Available on almost all CPUs/GPUs from Intel, AMD, ARM, IBM, and Nvidia.
– The OS provides a high-level interface to use PMUs.
● We will study some basic performance measurements on Linux:
– From the OS: CPU utilization.
– From PMUs: instruction count, CPU cycles, CPU stalls, and cache misses.
Read CPU Utilization on Linux
● CPU utilization indicates the percentage of execution time that an application actively uses the CPU.
– E.g., 80% CPU utilization means that for 80% of the execution time the application is running on the CPU, and for 20% of the time the application is idling.
– On multi-core processors, the maximum CPU utilization per core is 100%. The maximum CPU utilization over all cores is 100% × #_of_cores.
● e.g., an octa-core CPU has a max CPU utilization of 800%.
● CPU utilization can be acquired by– Executing the parallel application with /usr/bin/time (like we did in
our homework)
Use CPU Utilization to determine Performance Issues
● App behavior: Low CPU utilization, i.e.,– Application’s CPU utilization is much smaller than the possible maximum CPU utilization
● Performance issues (that cause low CPU utilization):– Uneven workload partitioning
● Verification: check algorithm and/or code● Solution: revise algorithm and/or code
– Blocking communication● Verification: check blocking communication in code● Solution:
– Use non-blocking communication
– Use spinning-based (non-blocking) synchronizations
– Create more processes/threads to overlap computation with communication
– Processors/cores not load-balanced● Verification: check if every processor/core executes the same number of threads/processes● Solution: adjust process/thread scheduling policy
Read PMUs on Linux
● Linux provides a kernel tool called “perf” to read various PMUs
● To read the performance counters for a program, execute “perf” with the following parameters,– perf stat -e “event,event,...” program prog_params
– E.g., to read the total number of cycles for program “ls -l” use command:
● perf stat -e “cycles” ls -l
Read PMUs on Linux
● Events specify what performance data to collect.● The common events are:
– cycles: total cycles used by a program
– instructions: total instructions executed by a program
– stalled-cycles-frontend and stalled-cycles-backend: the stalled CPU cycles in the frontend (instruction issue) and the backend (instruction write-back) of the CPU; sum them to get the total stalled CPU cycles over the whole execution
● To get the full list of events, run: perf list
Use CPU Instructions to determine Performance Issues
● App behavior: the total instruction count is much larger in the parallel execution than in the sequential execution
● Performance Issues (that increase CPU instructions):– Computation is duplicated
● Verification: check algorithm and/or code● Solution: revise algorithm and/or code
– Spinning synchronizations consume too much processor time
● Verification: check the time spent on spinning synchronizations
● Solution: use blocking synchronizations, and create more threads/processes to overlap computation with synchronization
Use CPU Cycles to determine Performance Issues
● App behavior: the total CPU cycle count is much larger in the parallel execution than in the sequential execution
● Performance issues (that increase CPU cycles):
– Too much data to communicate
● Verification: check communications in code● Solution: reduce and/or combine communications
– Memory/data issues (more on following slides):● Verification: check if CPU stalled cycles increase● Solution: more on following slides
Use CPU Stalled Cycles to determine Performance Issues
● App behavior: the total CPU stalled cycle count is much larger in the parallel execution than in the sequential execution
● Performance issues (that increase CPU stalled cycles):
– Cache contention:
● Verification: check if the cache miss count increases
● Solution: rearrange memory accesses to avoid contention
– Memory bandwidth limitation:
● Verification: check if the maximum memory bandwidth is reached
● Solution: improve cache utilization, or use a machine with higher memory bandwidth
– Data/computation placement:
● Verification: check if data and computation are located on the same machine or the same network
● Solution: change the process/job scheduling policy to co-locate data and computation on the same machine or the same network
Other Performance Monitoring Tools
● The “perf” tool from the Linux kernel supports a very limited set of events. To access more PMU events, try:
– Perfmon2 (http://perfmon2.sourceforge.net/)
– PAPI (http://icl.cs.utk.edu/papi/)
● Allows reading PMU counters programmatically, i.e., reading counters in code.
– Oprofile (http://oprofile.sourceforge.net/news/)
– I use my own tool based on perfmon2 for thread-level monitoring (https://github.com/wwang/pfm_multi)
● Intel offers a comprehensive commercial performance analysis tool – VTune
● There are also application instrumentation tools that monitor behavior at the function/instruction level, e.g.,
– Valgrind
– Intel Pin
● There is similar support for PMU reading on Windows and macOS, as well as on GPUs.