Performance Analysis for Parallel Applications · 2020-03-23 · Spring 2019 CS4823/6643 Parallel...
Spring 2019 CS4823/6643 Parallel Computing 1
Performance Analysis for Parallel Applications
Wei Wang
Achieving Desired Performance
● Ideally, parallel speedup should be linear
– i.e., using n processors => n times faster than using 1 processor
● Reality is stark
– Seldom do we achieve linear speedup
– Sometimes using more processors even causes a slowdown
● It is important to be able to understand and analyze performance issues of parallel applications
Theoretical Performance Models
Theoretical Performance Models
● Theoretical performance models improve our basic understanding of parallel performance
– They also define the maximum and minimum possible speedups
● Theoretical models
– Amdahl’s Law
– Gustafson’s Law
– Karp–Flatt metric
– Speedup model with communication cost
The Basic Speedup Definition
● Speedup of a parallel application is defined as
Speedup = Sequential Execution Time / Parallel Execution Time
Amdahl’s Law
● Intuition of Amdahl’s law:
– If an application has a p% parallelizable part and a (1 − p%) serial part,
● the minimum execution time of the application is the execution time of the serial part
● the maximum speedup is bounded by the serial part.
[Figure: a time bar split into a serial part (1 − p%) and a parallelizable part (p%). Given enough parallel processors, the parallelizable part’s execution time may approach 0, leaving the serial part’s execution time as the minimum execution time.]
Amdahl’s Law: Equation for Speedup
● The equation of Amdahl’s Law:
– Speedup is the overall speedup of the whole application after parallelization
– p% is the percentage of the execution time of the parallelizable part
– s is the speedup of the parallel part
Speedup = Sequential Exec Time / Parallel Exec Time
        = ((1 − p%) + p%) / ((1 − p%) + p%/s)
        = 1 / ((1 − p%) + p%/s)
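The equation above can be sketched in code. Below is a minimal Python sketch (not from the slides; the function name and the example values are assumptions for illustration):

```python
# Amdahl's law: speedup = 1 / ((1 - p) + p / s), where p is the
# parallelizable fraction (0..1) and s is the speedup of the parallel part.

def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the work gains a speedup of s."""
    return 1.0 / ((1.0 - p) + p / s)

# Example: 90% parallelizable, parallel part sped up 10x.
# The overall speedup is only ~5.26, far below 10.
print(amdahl_speedup(0.9, 10))
```

Note how the serial 10% keeps the overall speedup well below the parallel part’s speedup, which is exactly the bound the law describes.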
Amdahl’s Law: The Limit on Speedup
● If the parallel part speedup approaches infinity:
– i.e., the maximum speedup is bounded by the execution time of the serial part
lim(s → ∞) Speedup = 1 / ((1 − p%) + p%/s) = 1 / (1 − p%)
Another Visualization of Amdahl’s Law
* Figure by Daniels220 at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=6678551
Karp-Flatt Metric
● In practice, the ratio of the serial part is usually hard to know
● However,
– Speedup can be easily measured
– Parallel part speedup, s, is usually assumed to be the same as the number of processors
● If both Speedup and s are known, then we can estimate the ratio of the serial part with Amdahl’s law equation
● Karp-Flatt metric, e, is the name of the “serial part ratio” estimated based on experimentally measured Speedup
e = 1 − p% = (1/Speedup − 1/s) / (1 − 1/s)
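The metric is straightforward to compute. Below is a small Python sketch (not from the slides; the function name and the measured values are assumptions for illustration):

```python
# Karp-Flatt metric: e = (1/Speedup - 1/s) / (1 - 1/s),
# where s is usually taken to be the number of processors.

def karp_flatt(speedup, s):
    """Estimate the serial fraction e from a measured speedup on s processors."""
    return (1.0 / speedup - 1.0 / s) / (1.0 - 1.0 / s)

# Example: a measured speedup of 6 on 8 processors
# implies roughly a 4.8% serial part.
print(karp_flatt(6.0, 8))
```

A growing e as processors are added suggests parallel overhead rather than a fixed serial part.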
The Limitations of Amdahl’s Law and Karp-Flatt Metric
● Amdahl’s law defines the performance limit of a parallel application, with
– A fixed problem size, i.e., the ratios of the parallel part and the serial part are fixed.
● In real life, problem size usually grows with system size
● The parallel part usually grows faster than the serial part
● This limitation is addressed by Gustafson’s law
● Amdahl’s law does not provide any insights about parallel overhead
Gustafson's Law
● Gustafson’s Law considers the growth of problem size
● Intuitively, Gustafson’s Law says that
– if we can indefinitely scale up the problem size (and its parallelizable part) with the increase of the processor count, the overall speedup of the whole application is the same as the speedup of the parallelizable part.
● Gustafson’s law reveals the true power of parallelization: we can solve much bigger problems within a reasonable amount of time
Deriving Gustafson’s Law
● Let (similar to Amdahl’s law),
– p% be the percentage of the execution time of the parallelizable part in the original workload
– s be the speedup of the parallel part
– W be the original problem size on one processor
● With more processors, increase the workload size
● The speedup for the new workload is then
new_workload = (1 − p%)×W + (s×p%)×W

Speedup = [(1 − p%)×W + (s×p%)×W] / [(1 − p%)×W + ((s×p%)×W)/s]
        = [(1 − p%)×W + (s×p%)×W] / W
        = 1 − p% + s×p%
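The derivation above reduces to a one-line formula. Below is a minimal Python sketch (not from the slides; the function name and the example values are assumptions for illustration):

```python
# Gustafson's scaled speedup: speedup = (1 - p) + s * p, where p is the
# parallelizable fraction of the original workload and s is the speedup
# (roughly the processor count) of the parallel part.

def gustafson_speedup(p, s):
    """Scaled speedup when the parallel part of the workload grows s-fold."""
    return (1.0 - p) + s * p

# Example: 90% parallelizable on 64 processors (s = 64)
# gives a scaled speedup of 57.7 -- close to s, unlike
# Amdahl's fixed-size bound of 1/(1 - 0.9) = 10.
print(gustafson_speedup(0.9, 64))
```

The contrast with Amdahl’s law comes from letting the parallel part grow with the machine instead of holding the problem size fixed.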
Maximum Speedup By Gustafson’s Law
● If the parallelizable part is indefinitely large
– i.e., if the problem size is large enough, and especially if the parallelizable part is large enough, the overall speedup is the same as the speedup of the parallelizable part
lim(p% → 1) Speedup = (1 − p%) + s×p% = s
The Limitations of Gustafson's Law
● The sizes of some problems are limited.– e.g., the size of a chess game at a certain stage.
– Essentially, these applications do not benefit much from large scale parallelization.
● Similar to Amdahl’s law, Gustafson’s law does not provide any insights about parallel overhead
Speedup Equation with Communication Cost
● Probably originated from Michael Quinn’s Parallel Computing textbook
● Let p% be the percentage of the execution time of the parallelizable part in the original workload
● Sequential execution time is (still)
Sequential Execution Time = (1 − p%) + p% = 1
Speedup Equation with Communication Cost
● Let – n be the number of processors
– k(n) be the cost of communication on n processors
● The parallel execution time is then (assuming the parallel computation has linear speedup)
Parallel Execution Time = (1 − p%) + p%/n + k(n)

● Overall speedup is

Speedup = Sequential Exec Time / Parallel Exec Time
        = 1 / ((1 − p%) + p%/n + k(n))
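This model can be explored numerically. Below is a small Python sketch (not from the slides; the function names, p = 0.95, and the linear cost k(n) = 0.002·n are all assumed example values for illustration):

```python
# Speedup with communication cost: speedup(n) = 1 / ((1 - p) + p/n + k(n)).
# As an assumed example, take k(n) = 0.002 * n, a communication cost
# that grows linearly with the processor count.

def speedup_with_comm(p, n, k):
    """Speedup on n processors when communication costs k(n)."""
    return 1.0 / ((1.0 - p) + p / n + k(n))

k = lambda n: 0.002 * n

# Find the processor count that maximizes speedup for p = 0.95.
best_n = max(range(1, 101), key=lambda n: speedup_with_comm(0.95, n, k))
print(best_n, speedup_with_comm(0.95, best_n, k))
# Past best_n, adding processors only reduces speedup.
```

The maximum exists because p%/n shrinks while k(n) grows; beyond the crossover, communication dominates.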
Plotting the Execution Time of the Parallel Part (p%/n)
[Plot: execution time of the parallel part (p%/n), decreasing as the number of processors (n) grows from 1 to 25]
Plotting the Communication Cost k(n)
[Plot: communication cost (k(n)), increasing as the number of processors (n) grows from 1 to 25]
Plotting the Parallel Execution time (%p/n + k(n)) and Speedup
[Plot: parallel part time, communication cost, and speedup vs. number of processors (n = 1 to 25); total execution time (p%/n + k(n)) first falls and then rises, so speedup shows diminishing returns]
Real Speedup with Communication Cost
● Due to the increasing communication cost,
– real speedup is usually sub-linear
– Speedup reaches its maximum at a certain number of processors; adding more cores beyond that only reduces speedup
Super-linear Speedup
● It is impossible to get super-linear speedup with our current theoretical models.
● However, super-linear speedup is occasionally possible in practice (very rare)
– Super-linear speedup is usually caused by improved utilization and/or capacity of memory resources, such as caches or DRAM
– Super-linear speedup is also possible in some algorithms employing exploratory decomposition
Basic Performance Analysis for Parallel Applications
Common Causes for Poor Parallel Performance
● Algorithm issues:
– Work is not evenly partitioned among tasks/threads/processes.
– Some computations are duplicated/repeated.
● Communication issues:
– Blocking communications/synchronizations cause idling.
– Spinning synchronizations consume too much processor time.
– Too much data needs to be communicated.
● Memory/data issues:
– Threads/processes contend for cache and DRAM resources.
– Cache space and memory bandwidth are limited.
– Data and computations are too far apart (on large-scale systems).
● System management issues:
– Processors/cores are not load-balanced.
Determine Performance Issues by Observing Application Behaviors
● Performance analysis essentially means to,
1. Observe application behaviors
2. Based on the behaviors, determine the potential causes of poor performance
● Some examples of application behaviors:
– CPU (processor) utilization
– System time, user time
– Number of synchronization operations
– Memory usage
– Instruction counts
– CPU stalled cycles
● Usually a specific behavior can be associated with several potential performance issues.
– Further analysis of the algorithm, application code, and system configuration is required to pinpoint the exact cause.
– It is also common that an application is affected by multiple performance issues at the same time.
Performance Analysis on Linux
● Modern operating systems (OSes) provide a myriad of performance measurements.
– Available on almost all OSes, including Windows, Linux, and macOS.
● Modern CPUs/GPUs also provide many performance measurements to investigate the internal usage of CPU resources.
– These measurements are provided by hardware Performance Monitoring Units (PMUs).
– Available on almost all CPUs/GPUs from Intel, AMD, ARM, IBM, and Nvidia.
– The OS provides a high-level interface to use PMUs.
● We will study some basic performance measurements on Linux:
– From the OS: CPU utilization.
– From PMUs: instruction count, CPU cycles, CPU stalls, and cache misses.
Read CPU Utilization on Linux
● CPU utilization indicates the percentage of execution time that an application actively uses the CPU.
– E.g., 80% CPU utilization means that for 80% of the execution time the application is running on the CPU, and for 20% of the time the application is idling.
– On multi-core processors, the maximum CPU utilization per core is 100%. The maximum CPU utilization over all cores is 100% × #_of_cores.
● e.g., an octa-core CPU has a max CPU utilization of 800%.
● CPU utilization can be acquired by– Executing the parallel application with /usr/bin/time (like we did in
our homework)
Use CPU Utilization to determine Performance Issues
● App behavior: Low CPU utilization, i.e.,– Application’s CPU utilization is much smaller than the possible maximum CPU utilization
● Performance issues (that cause low CPU utilization):– Uneven workload partitioning
● Verification: check algorithm and/or code● Solution: revise algorithm and/or code
– Blocking communication● Verification: check blocking communication in code● Solution:
– Use non-blocking communication
– Use spinning-based (non-blocking) synchronizations
– Create more processes/threads to overlap computation with communication
– Processors/cores not load-balanced● Verification: check if every processor/core executes the same number of threads/processes● Solution: adjust process/thread scheduling policy
Read PMUs on Linux
● Linux provides a kernel tool called “perf” to read various PMUs
● To read the performance counters for a program, execute “perf” with the following parameters,– perf stat -e “event,event,...” program prog_params
– E.g., to read the total number of cycles for program “ls -l” use command:
● perf stat -e “cycles” ls -l
Read PMUs on Linux
● Events specify what performance data to collect.● The common events are:
– cycles: total cycles used by a program
– instructions: total instructions executed by a program
– stalled-cycles-frontend and stalled-cycles-backend: the stalled CPU cycles in the frontend (instruction issue) and the backend (instruction write-back) of the CPU; sum them to get the total stalled CPU cycles over the whole execution
● To get the full list of events, run: perf list
Use CPU Instructions to determine Performance Issues
● App behavior: the total instruction count is much larger in the parallel execution than in the sequential execution
● Performance Issues (that increase CPU instructions):– Computation is duplicated
● Verification: check algorithm and/or code● Solution: revise algorithm and/or code
– Spinning synchronizations consume too much processor time
● Verification: check the time spent on spinning synchronizations
● Solution: use blocking synchronizations, and create more threads/processes to overlap computation with synchronization
Use CPU Cycles to determine Performance Issues
● App behavior: the total CPU cycle count is much larger in the parallel execution than in the sequential execution
● Performance issues (that increase CPU cycles):
– Too much data to communicate
● Verification: check communications in code● Solution: reduce and/or combine communications
– Memory/data issues (more on following slides):● Verification: check if CPU stalled cycles increase● Solution: more on following slides
Use CPU Stalled Cycles to determine Performance Issues
● App behavior: the total CPU stalled cycle count is much larger in the parallel execution than in the sequential execution
● Performance issues (that increase CPU stalled cycles):
– Cache contention:
● Verification: check if the cache miss count increases
● Solution: rearrange memory accesses to avoid contention
– Memory bandwidth limitation:
● Verification: check if the maximum memory bandwidth is reached
● Solution: improve cache utilization, or use a machine with higher memory bandwidth
– Data/computation placement:
● Verification: check if data and computation are located on the same machine or the same network
● Solution: change the process/job scheduling policy to co-locate data and computation on the same machine or the same network
Other Performance Monitoring Tools
● The “perf” tool from the Linux kernel supports a very limited set of events. To access more PMU events, try:
– Perfmon2 (http://perfmon2.sourceforge.net/)
– PAPI (http://icl.cs.utk.edu/papi/)
● Allows reading PMU counters programmatically, i.e., reading counters in code.
– Oprofile (http://oprofile.sourceforge.net/news/)
– I use my own tool based on perfmon2 for thread-level monitoring (https://github.com/wwang/pfm_multi)
● Intel offers a comprehensive commercial performance analysis tool – VTune
● There are also application instrumentation tools that monitor behavior at the function/instruction level, e.g.,
– Valgrind
– Intel Pin
● There is similar support for PMU reading on Windows and macOS, as well as on GPUs.