Post on 13-Jan-2022
AbstractThis white paper is targeted at users, administrators, and developers of parallel applications on the Solaris™
OS. It describes the current challenges of performance analysis and tuning of complex applications, and the
tools and techniques available to overcome these challenges. Several examples are provided, from a range of
application types, environments, and business contexts, with a focus on the use of the Solaris Dynamic Tracing
(DTrace) utility. The examples cover virtualized environments, multicore platforms, I/O and memory bottlenecks,
Open Multi-Processing (MP) and Message Passing Interface (MPI).
TUNING PARALLEL CODE ON THE SOLARIS™ OPERATING SYSTEM — LESSONS LEARNED FROM HPCWhite PaperSeptember 2009
Sun Microsystems, Inc.
Table of Contents
Introduction...................................................................................................1
Challenges of Tuning Parallel Code .......................................................................1
Some Solaris™ OS Performance Analysis Tools .......................................................2
DTrace ............................................................................................................2
System Information Tools ................................................................................2
Sun™ Studio Performance Analyzer ...................................................................3
Combining DTrace with Friends ........................................................................3
Purpose of this Paper ..........................................................................................4
DTrace and its Visualization Tools .....................................................................5
Chime ................................................................................................................5
DLight ................................................................................................................6
gnuplot ..............................................................................................................7
Performance Analysis Examples .......................................................................8
Analyzing Memory Bottlenecks with !"#$%&!', (#))$&$, and the Sun Studio
Performance Analyzer .........................................................................................8
Improving Performance by Reducing Data Cache Misses .................................... 8
Improving Scalability by Avoiding Memory Stalls Using the Sun Studio Analyzer 11
Memory Placement Optimization with OpenMP .............................................. 16
Analyzing I/O Performance Problems in a Virtualized Environment....................... 19
Using DTrace to Analyze Thread Scheduling with OpenMP ................................... 21
Using DTrace with MPI ...................................................................................... 27
Setting the Correct *"+%#, Privileges ........................................................... 28
A Simple Script for MPI Tracing ...................................................................... 28
Tracking Down MPI Memory Leaks ................................................................. 30
Using the DTrace *"+"-%#)- Provider ......................................................... 33
A Few Simple *"+"-%#)- Usage Examples ................................................... 35
Conclusion ...................................................................................................37
Acknowledgments ............................................................................................ 37
References ...................................................................................................38
Sun Microsystems, Inc.
Appendix .....................................................................................................39
Chime Screenshots ............................................................................................ 39
Listings for Matrix Addition !"#$%&!' Example ................................................ 40
gnuplot Program to Graph the Data Generated by !"#$%&!' ........................ 40
!"#$%&!'.&//.%012/&$& ................................................................. 40
!"#$%&!'.&//.!032/&$& ................................................................. 40
Listings for Matrix Addition (#))$&$ Example .................................................. 42
gnuplot Program to Graph the Data Generated by (#))$&$ .......................... 42
(#))$&$.&//.%012/&$& ................................................................... 42
(#))$&$.&//.!032/&$& ................................................................... 42
Sun Studio Analyzer Hints to Collect Cache Misses on an UltraSPARC T1® Processor 44
Analyzing I/O Performance Problems in a Virtualized Environment — a Complete
Description ....................................................................................................... 45
Tracking MPI Requests with DTrace — *"+)$&$2/ ........................................... 51
Sun Microsystems, Inc.1 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
1. In the Solaris™ Operating System (OS), these include various analysis and debugging tools such as truss(1), apptrace(1) and mdb(1). Other operating environments have similar tools.
Chapter 1
Introduction
Most computers today are equipped with multicore processors and manufacturers
are increasing core and thread counts, as the most cost effective and energy efficient
route to increased processing throughput. While the next level of parallelism — grid
computers with fast interconnects — is still very much the exclusive domain of high
performance computing (HPC) applications, mainstream computing is developing
ever increasing levels of parallelism. This trend is creating an environment in
which the optimization of parallel code is becoming a core requirement of tuning
performance in a range of application environments. However, tuning parallel
applications brings with it complex challenges that can only be overcome with
increasingly advanced tools and techniques.
Challenges of Tuning Parallel CodeSingle threaded code, or code with a relatively low level of parallelism, can be tuned
by analyzing its performance in test environments, and improvements can be made
by identifying and resolving bottlenecks. Developers tuning in this manner are
confident that any results they obtain in testing can result in improvements to the
corresponding production environment.
As the level of parallelism grows, due to interaction between various concurrent
events typical of complex production environments, it becomes more difficult to
create a test environment that closely matches the production environment. In turn,
this means that analysis and tuning of highly multithreaded, parallel code in test
environments is less likely to improve its performance in production environments.
Consequentially, if the performance of parallel applications is to be improved, they
must be analyzed in production environments. At the same time, the complex
interactions typical of multithreaded, parallel code are far more difficult to analyze
and tune using simple analytical tools, and advanced tools and techniques are
needed.
Conventional tools1 used for performance analysis were developed with low levels
of parallelism in mind, are not dynamic, and do not uncover problems relating to
timing or transient errors. Furthermore, these tools do not provide features that
allow the developer to easily combine data from different processes and to execute
an integrated analysis of user and kernel activity. More significantly, when these
tools are used, there is a risk that the act of starting and stopping, or tracing can
affect and obscure the problems that require investigation, thus rendering any
results found suspect and creating a potential risk to business processing.
Sun Microsystems, Inc.2 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
2. See discussion of the proposed cpc DTrace provider and the DTrace limitations page in the Solaris internals wiki through links in “References” on page 38
Some Solaris™ OS Performance Analysis ToolsIn addition to performance and scalability, the designers of the Solaris OS set
observability as a core design goal. Observability is achieved by providing users
with tools that allow them to easily observe the inner workings of the operating
system and applications, and analyze, debug, and optimize them. The set of tools
needed to achieve this goal was expanded continuously in each successive version
of the Solaris OS, culminating in the Solaris 10 OS with DTrace — arguably the most
advanced observability tool available today in any operating environment.
DTraceDTrace provides the Solaris 10 OS user with the ultimate observability tool — a
framework that allows the dynamic instrumentation of both kernel and user
level code. DTrace permits users to trace system data safely without affecting
performance. Users can write programs in D — a dynamically interpreted scripting
language — to execute arbitrary actions predicated on the state of specific data
exposed by the system or user component under observation. However, currently
DTrace has certain limitations — most significantly, it is unable to read hardware
counters, although this might change in the near future2.
System Information ToolsThere are many other tools that are essential supplements to the capabilities of
DTrace that can help developers directly observe system and hardware performance
characteristics in additional ways. These tools and their use are described in
this document through several examples. There are other tools that can be used
to provide an overview of the system more simply and quickly than doing the
equivalent in DTrace. See Table 1 for a non-exhaustive list of these tools.
Table 1. Some Solaris system information tools
Command Purpose
+,$%)$&$4567 Gathers and displays run-time interrupt statistics per device and processor or processor set
(#))$&$4567 Report memory bus related performance statistics
!"#$%&!'456789!"#)$&$4567
Monitor system and/or application performance using CPU hardware counters
$%&")$&$4567 Reports trap statistic per processor or processor set
"%)$&$4567 Report active process statistics
:*)$&$4567 Report virtual memory statistics
Sun Microsystems, Inc.3 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
Sun™ Studio Performance AnalyzerThe Sun Studio Performance Analyzer and the Sun Studio Collector are components
of the Sun Studio integrated development environment (IDE). The collector is used
to run experiments to collect performance related data for an application, and the
analyzer is used to analyze and display this data with an advanced GUI.
The Sun Studio Performance Analyzer and Collector can perform a variety of
measurements on production codes without special compilation, including clock- and
hardware-counter profiling, memory allocation tracing, Message Passing Interface
(MPI) tracing, and synchronization tracing. The tool provides a rich graphical user
interface, supports the two common programming models for HPC — OpenMP and
MPI — and can collect data on lengthy production runs.
Combining DTrace with FriendsWhile DTrace is an extremely powerful tool, there are other tools that can be
executed from the command line, which makes them more appropriate to view a
system’s performance. These tools can be better when a quick, initial diagnosis is
needed to answer high level questions. A few examples of this type of question are:
Are there a significant number of cache misses?
Is a CPU accessing local memory or is it accessing memory controlled by another
CPU?
How much time is spent in user versus system mode?
Is the system short on memory or other critical resources?
Is the system running at high interrupt rates and how are they assigned to
different processors?
What are the system’s I/O characteristics?
An excellent example is the use of the "%)$&$ utility which, when invoked with the
;<* option, provides microstates per thread that offer a wide view of the system.
This view can help identify what type of further investigation is required, using
DTrace or other more specific tools, as described in Table 2.
Sun Microsystems, Inc.4 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
Table 2. Using the "%)$&$ command for initial analysis
!"#$%$ column
Meaning Investigate if high values are seen
=>? % of time in user mode Profile user mode with DTrace using either "+/ or !"#$%& providers
>@> % of time in system mode Profile the kernel
<AB % of time waiting for locks Use the "30!')$&$ DTrace provider or the "30!')$&$4567 utility to see which user locks are used extensively
><C % of time sleeping Use the )!D-/ DTrace provider and view call stacks with DTrace to see why the threads are sleeping
EF< or GF<
% of time processing text or data page faults
Use the :*+,H0 DTrace provider to identify the source of the page faults
Once the various tools are used to develop an initial diagnosis of a problem, it is
possible to use DTrace, Sun Studio Performance Analyzer, and other more specific
tools for an in-depth analysis. This approach is described in further detail in the
examples to follow.
Purpose of this PaperThis white paper is targeted at users, administrators, and developers of parallel
applications. The lessons learned in the course of performance tuning efforts are
applicable to a wide range of similar scenarios common in today’s computing
environments in different public sector, business, and academic settings.
Sun Microsystems, Inc.5 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
Chapter 2
DTrace and its Visualization Tools
DTrace is implemented through thousands of probes (also known as tracing points)
embedded in the Solaris 10 kernel, utilities, and in other software components that
run on the Solaris OS. The kernel probes are interpreted in kernel context, providing
detailed insight into the inner workings of the kernel. The information exposed by
the probes is accessed through scripts written in the D programming language.
Administrators can monitor system resources to analyze and debug problems,
developers can use it to help debugging and performance tuning, and end-users can
analyze applications for performance and logic problems.
The D language provides primitives to print text to a file or the standard-output of
a process. While suitable for brief, simple analysis sessions, direct interpretation of
DTrace output in more complex scenarios can be quite difficult and lengthy. A more
intuitive interface that allows the user to easily analyze DTrace output is provided by
several tools — Chime, DLight, and gnuplot — that are covered in this paper.
For detailed tutorials and documentation on DTrace and D, see links in “References”
on page 38.
ChimeChime is a standalone visualization tool for DTrace and is included in the NetBeans
DTrace GUI plug-in. It is written in the Java™ programming language, supports the
Python and Ruby programming languages, includes predefined scripts and examples,
and uses XML to define how to visualize the results of a DTrace script. The Chime
GUI provides an interface that allows the user to select different views, to easily drill
down into the data, and to access the underlying DTrace code.
Chime includes the following additional features and capabilities:
Graphically displays aggregations from arbitrary DTrace programs
Displays moving averages
Supports record and playback capabilities
Provides an XML based configuration interface for predefined plots
When Chime is started it displays a welcome screen that allows the user to select
from a number of functional groups of displays through a pull-down menu, with
Initial Displays the default group shown. The Initial Displays group contains various
general displays including, for example, System Calls, that when selected brings up
a graphical view showing the application executing the most system calls, updated
Sun Microsystems, Inc.6 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
every second (Figure 8, in the section “Chime Screenshots” on page 39). The time
interval can be changed with the pull-down menu at the bottom left corner of the
window.
Figure 1. File I/O displayed with the rfilio script from the DTraceToolkit
Another useful choice available from the main window allows access to the
probes based on the DTraceToolkit, that are much more specific than those in
Initial Displays. For example, selecting rfileio brings up a window (Figure 1) with
information about file or block read I/O. The output can be sorted by the number
of reads, helping to identify the file that is the target of most of the system read
operations at this time — in this case, a raw device. Through the context menu of
the suspect entry, it is possible to bring up a window with a plotting of the reads
from this device over time (Figure 9, on page 39).
For further details on Chime, see links in “References” on page 38.
DLightDLight is a GUI for DTrace, built into the Sun Studio 12 IDE as a plug-in. DLight unifies
application and system profiling and introduces a simple drag-and-drop interface
that interactively displays how an application is performing. DLight includes a set of
DTrace scripts embedded in XML files. While DLight is not as powerful as the DTrace
Sun Microsystems, Inc.7 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
3. According to the developers of gnuplot (http://www.gnuplot.info/faq/faq.html#SECTION00032000000000000000) The real name of the program is “gnuplot” and is never capitalized.
command line interface, it does provide a useful way to see a quick overview of
standard system activities and can be applied directly to binaries using the DTrace
scripts included with it.
After launching DLight, the user selects a Sun Studio project, a DTrace script, or
a binary executable to apply DLight to. For the purpose of this discussion, the tar
utility was used to create a load on the system and DLight was used to observe it.
The output of the tar command is shown in the bottom right section of the display
(Figure 2). The time-line for the selected view (in this case File System Activity) is
shown in the middle panel of the display. Clicking on a sample prints the details
associated with it in the lower left panel of the display. The panels can be detached
from the main window and their size adjusted.
Figure 2. DLight main user interface window
For further details on DLight, see links in “References” on page 38.
gnuplotgnuplot3 is a powerful, general purpose, open-source, graphing tool that can help
visualize DTrace output. gnuplot can generate two- and three-dimensional plots of
DTrace output in many graphic formats, using scripts written in gnuplot’s own text-
based command syntax. An example using gnuplot with the output of the !"#$%&!'
and (#))$&$ commands is provided in the section “Analyzing Memory Bottlenecks
with cputrack, busstat, and the Sun Studio Performance Analyzer” on page 8, in the
“Performance Analysis Examples” chapter.
Sun Microsystems, Inc.8 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
Chapter 3
Performance Analysis Examples
In the following sections, several detailed examples of using DTrace and other Solaris
observability tools to analyze the performance of Solaris applications are provided.
These use-cases include the following:
1. Use (#))$&$, !"#$%&!', and the Sun Studio Performance Analyzer to explain
performance of different SPARC® processors and scalability issues resulting from
data cache misses and memory bank stalls
2. Use !"#)$&$ to analyze memory placement optimization with OpenMP code on
non-uniform memory access (NUMA) architectures
3. Use DTrace to resolve I/O related issues
4. Use DTrace to analyze thread scheduling with Open Multi-Processing (MP)
5. Use DTrace to analyze MPI performance
Analyzing Memory Bottlenecks with !"#$%&!', (#))$&$, and the Sun Studio Performance AnalyzerMost applications require fast memory access to perform optimally. In the following
sections, several use-cases are described where memory bottlenecks cause sub-
optimal application performance, resulting in sluggishness and scalability problems.
Improving Performance by Reducing Data Cache MissesThe !"#)$&$4567 and !"#$%&!'4567 commands provide access to CPU hardware
counters. (#))$&$4567 reports system-wide bus-related performance statistics.
These commands are passed sets of platform dependent event-counters to monitor.
In this example, the performance of two functionally identical instances of a small
program that sums two matrices are analyzed — &//.!03 and &//.%01. &//.!03
uses the data-cache inefficiently, with many cache misses, while &//.%01 uses the
data-cache efficiently. The insights provided by these examples are applicable in a
wide range of contexts.
Sun Microsystems, Inc.9 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
The following listing, &//.!032!, shows the implementation of &//.!03:
59I+,!3#/-9J)$/3+(2DKL9'(&$)&*+,-.****/01112M9(#34%&*56+,-.76+,-.78*46+,-.76+,-.78*96+,-.76+,-.7:N9;)<*=5;)>;)<*5"?98*9@5"*A5"?B67C*DO9**E;F&G<*;8*H:P9**I#"*>;*J*1:*;*K*+,-.:*;LLCQ9****I#"*>H*J*1:*H*K*+,-.:*HLLCR9******96H76;7*J*56H76;7*L*46H76;7:S9T
&//.%012! is identical to &//.!032!, except that line 8 — the line that executes
the addition of the matrix cells, runs across rows and takes the following form:
M*******96;76H7*J*56;76H7*L*46;76H7:
The programs are compiled without optimization or prefetching, to side-step the
compiler’s ability to optimize memory access (that would make the differences
between the programs invisible) and with large pages to avoid performance
problems resulting from the relatively small TLB cache available to the UltraSPARC
T1® processor and the UltraSPARC T2® processor:
I9!!9;U"&V-)+W-XLOP*9;U"%-H-$!DX,09;09&//.%019&//.%012!I9!!9;U"&V-)+W-XLOP*9;U"%-H-$!DX,09;09&//.!039&//.!032!
Using !"#$%&!' to Count Data Cache Misses
The programs are first run with !"#$%&!', while tracing data cache misses as
reported by the CPU’s hardware counters. The output is piped through the sed
command to remove the last line in the file, since it contains the summary for the
run and should not be plotted, and directed to !"#$%&!'.&//.!032/&$& and
!"#$%&!'.&//.%012/&$&, as appropriate:
'*9!3<"59N*O)*O&I*O9*!;91JPQG=;EE8!;9/J,)E<"G9)<*RS5((G"#T*U*V***********************E&(*WX(Y*Z*9!3<"59NG5((G"#TR(5<5'*9!3<"59N*O)*O&I*O9*!;91JPQG=;EE8!;9/J,)E<"G9)<*RS5((G9#%*U*V***********************E&(*WX(Y*Z*9!3<"59NG5((G9#%R(5<5
The ;, option is used to prevent the output of column headers, ;- instructs
!"#$%&!' to follow all -U-!4L7, or -U-!:-4L7 system calls, ;H instructs
!"#$%&!' to follow all child processes created by the H0%'4L7, H0%'54L7, or
:H0%'4L7 system calls. The ;! option is used to specify the set of events for the CPU
performance instrumentation counters (PIC) to monitor:
!;91JPQG=;EE — counts the number of data-cache misses
!;9/J,)E<"G9)< — counts the number of instructions executed by the CPU
Plotting the results of the two invocations with V,#"30$ (Figure 3) shows that the
number of cache-misses in the column-wise matrix addition is significantly higher
Sun Microsystems, Inc.10 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
4. The V,#"30$ program used to generate the graph in Figure 3 is listed in the section “Listings for Matrix Addition cputrack Example” on page 40, followed by the listings of !"#$%&!'.&//.!032/&$&, and !"#$%&!'.&//.%012/&$&.
than in the row-wise matrix addition, resulting in a much lower instruction count per
second and a much longer execution time for the column-wise addition4.
0
1000000
2000000
3000000
4000000
5000000
6000000
7000000
0 10 20 30 40 50 60
L1 D
-cac
he m
isse
s /s
Inst
ruct
ion
coun
t /s
time [s]
D-cache misses colD-cache misses rowInstruction count col
Instruction count row
Figure 3. Comparison of data-cache misses in row-wise and column-wise matrix addition
Using (#))$&$ to Monitor Memory Bus Use
The (#))$&$ command is only able to trace the system as a whole, so it is necessary
to invoke it in the background, run the program that is to be measured, wait for a
period of time, and terminate it. This must be done on a system that is not running
a significant workload at the same time, since it is not possible to isolate the data
generated by the test program from the other processes. The following short shell-
script, %#,.(#))$&$2)D, is used to run the test:
59'[S4;)SE@L943EE<5<*O)*OT*("5=18!;91J=&=G"&5(E*/*\M9!;(JX[N9&B5%*X]O9)3--"95P9N;%%*X!;(
The command to run — in this case &//.%01 or &//.!03 — is passed to the script
as a command line parameter and the script performs the following steps:
Runs (#))$&$ in the background in line 2
Saves (#))$&$’s process ID in line 3
Runs the command in line 4
After the command exits, the script waits for one second in line 5
Terminates (#))$&$ in line 6
Sun Microsystems, Inc.11 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
5. The V,#"30$ program used to generate the graph in Figure 4, is listed in the section “Listings for Matrix Addition busstat Example” on page 42, followed by the listings of (#))$&$.&//.!032/&$&, and (#))$&$.&//.%012/&$&.
The ;, (#))$&$ option is used to prevent the output of column headers, ;1
specifies the device to be monitored — in this case the memory bus (("5=1) and
!;91J*-*.%-&/) defines the counter to count the number of memory reads
performed by DRAM controller #0 over the bus. The number 5 at the end of the
command instructs (#))$&$ to print out the accumulated information every second.
The %#,.(#))$&$2)D script is used to run &//.!03 and &//.%01, directing the
output to (#))$&$.&//.!032/&$& and (#))$&$.&//.%012/&$&, as appropriate:
'*RS"3)G43EE<5<RE@*5((G9#%*Z*(#))$&$.&//.!032/&$&'*RS"3)G43EE<5<RE@*5((G"#T*Z*43EE<5<G5((G"#TR(5<5
Plotting the results of the two invocations with V,#"30$ (Figure 4) shows that the
number of cache-misses in the column-wise addition seen by !"#$%&!' is reflected
in the results reported by (#))$&$, that shows a much larger number of reads over
the memory bus5.
400000
600000
800000
1000000
1200000
1400000
1600000
1800000
2000000
2200000
2400000
2600000
0 10 20 30 40 50 60
read
s/s
time [s]
reads colreads row
Figure 4. Comparison of memory bus performance in row-wise and column-wise matrix addition
Improving Scalability by Avoiding Memory Stalls Using the Sun Studio AnalyzerIn this example, a Sun Fire™ T2000 server with a 1 GHz UltraSPARC T1 processor, with
eight CPU cores and four threads per core is compared with a Sun Fire V100 server
with a 550 MHz UltraSPARC-IIe single core processor. The Sun Fire V100 server was
chosen since its core design is similar to the Sun Fire T2000 server. The software used
Sun Microsystems, Inc.12 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
for the comparison is John the Ripper — a password evaluation tool that includes a
data encryption standard (DES) benchmark capability, has a small memory footprint,
and performs no floating point operations. Multiple instances of John the Ripper are
run in parallel to achieve any level of parallelism required.
The benchmark is compiled with the Sun Studio C compiler using the YH&)$ and
YU&%!DX,&$+:-PN options, and a single instance is executed on each of the servers.
Due to its higher clock-rate, it may seem reasonable to expect the Sun Fire T2000
server to perform twice as fast as the Sun Fire V100 server for a single invocation,
however, the UltraSPARC T1 processor is designed for maximum multithreaded
throughput so this expectation might not be valid. To demonstrate the reason for
this result it is necessary to first quantify it in further detail. With the !"#$%&!'
command it is possible to find out the number of instructions per cycle:
<0111Z*9!3<"59N*O^*_*O<*O9*!;9/J,)E<"G9)<*RSH#@)*O<&E<*OI#"=5<`P.+
The !"#$%&!' command is used to monitor the instruction count by defining
!;9/J,)E<"G9)<, and the ;E option sets the sampling interval to five seconds. The
output generated by this invocation indicates an average ratio of 0.58 instructions
per CPU cycle, calculated by dividing the total instruction count from the bottom cell
of the "+!5 column by the number of cycles in the bottom cell of the Z$+!' column:
$+*- 31" -:-,$ Z$+!' "+!5
_R1/a 5 $+!' _1/b00cM/0 bd1/edda//
/1R10a 5 $+!' PN5LM5NPQL b/M0de1/_M
/1R1b0 5 -U+$ 55NLPL5QQML c_Mdd0_//1
When the same command is run on the Sun Fire V100 server, the average
performance per core of the UltraSPARC-IIe architecture is higher, resulting in a ratio
of 2.05 instructions per CPU cycle.
The lower single threaded performance of the UltraSPARC T1 processor can be
attributed to the differing architectures — the single in-order issue pipeline of
each core of the UltraSPARC T1 processor is designed for lower single threaded
performance than the four-way superscalar architecture of the UltraSPARC IIe
processor. In this case, it is the different CPU designs that explain the results seen.
Next, the benchmark is run on the Sun Fire T2000 server in multiple process
invocations to check its scalability. While one to eight parallel processes scale
linearly as expected, at 12 processes performance is flat, with 16 processes, the
throughput goes down to a level comparable with that of four processes, and at 32
processes, the total throughput is similar to that of a dual process invocation.
Sun Microsystems, Inc.13 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
6. –g instructs the compiler to produce additional symbol table information needed by the analyzer, and –xhwcprof, in conjunction with –g, instructs the compiler to generate information that helps tools associate profiled load and store instructions with data-types and structure members.
Due to the fact that the UltraSPARC T1 processor has a 12-way L2 cache, it seems
reasonable to suspect that the scaling problem, where throughput drops rapidly
for more than 12 parallel invocations, is a result of L2 cache misses. To validate this
assumption, an analysis of the memory access patterns is needed to see if there is a
significant increase in memory access over the memory bus as the application scales.
This analysis can take advantage of the superior user interface of the Sun Studio
Performance Analyzer. To use it, John the Ripper is first recompiled, adding the ;V
and ;UD1!"%0H options to each compilation and link6. These options instruct the
collector to trace hardware counters and the analyzer to use them in analysis. In
addition, the analyzer requires UltraSPARC T1 processor specific hints to be added to
one of the .er.rc files, that instruct the collector to sample additional information,
including max memory stalls on a specific function. See the section “Sun Studio
Analyzer Hints to Collect Cache Misses on an UltraSPARC T1® Processor” on page 44 in
the Appendix for a list of these hints.
Next John the Ripper is run with the collector:
<0111Z*9#%%&9<*O!*L#)*Of*9#!g*Oh*#)*RSH#@)*OO<&E<*OOI#"=5<`P.+
The O!*L#) option triggers the collection of clock-based profiling data with the
default profiling interval of approximately 10 milliseconds, the Of*9#!g option
requests that the load objects used by the target process are copied into the
recorded experiment, and the ;F90, option instructs the collector to record the data
on descendant processes that are created by H0%' and -U-! system call families.
Figure 5. Profile of John the Ripper on a Sun Fire T2000 server
Once the performance data is collected, the analyzer is invoked and the data
collected for John the Ripper selected. The profile information (Figure 5) shows that
Sun Microsystems, Inc.14 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
7. To improve performance, the Solaris OS uses a page coloring algorithm which allocates pages to a virtual address space with a specific predetermined relationship between the virtual address to which they are mapped and their underlying physical address.
of the total of 9.897 seconds the process has run on the CPU, it stalled on memory
access for 9.387 seconds, or nearly 95% of the time. Furthermore, a single function
— DES_bs_crypt_25 — is responsible for almost 99% of the stalls.
When the hardware counter data related to the memory/cache subsystem is
examined (Figure 6) it is seen that of the four UST1_Bank Memory Objects, stalls
were similar on objects 0, 1 and 2 at 1.511, 1.561, and 1.321 seconds respectively
while on object 3 it was 4.993 seconds, or 53% of the total.
Figure 6. UST1_Bank memory objects in profile of John the Ripper on a Sun Fire T2000 server
Note: When hardware-counter based memory access profiling is used, the Sun Studio Analyzer can identify the specific instruction that accessed the memory more reliably than in the case of clock-based memory access profiling.
Further analysis (Figure 7) shows that 85% of the memory stalls are on a single page,
with half of all the stalls associated with a single memory bank. Note that the Solaris
OS uses larger pages for the sun4v architecture to help small TLB caches. This can
create memory hot spots, as illustrated here, since it limits cache mapping flexibility
and when the default page coloring7 is retained, it can cause unbalanced L2 cache
use. In the case of the Sun Fire T2000 server, this behavior is changed by adding a
)-$9!0,)+)$-,$.!030%+,VXL in S&<9SEgE<&= or by applying patch 118833-03 or
later. Note that this patch can be installed as part of a recommended patch set or a
Solaris update.
Sun Microsystems, Inc.15 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
Figure 7. 64K memory pages in profile of John the Ripper on a Sun Fire T2000 server
Note: While the change in the !0,)+)$-,$.!030%+,V parameter in S&<9SEgE<&= improves the performance of John the Ripper, it can adversely affect the performance of other applications.
The (#))$&$4567 command can be used to demonstrate the effect of the change
in the !0,)+)$-,$.!030%+,V parameter by measuring the memory stalls before
and after modifying the page coloring, when running John the Ripper with 1 to 32
invocations:
'*43EE<5<*OT*("5=18!;91J45)NG43EgGE<5%%E8!;9/J=&=G"&5(GT";<&*0
The ;1 option is used to instruct (#))$&$ to monitor the memory bus — ("5=18
while !;91J45)NG43EgGE<5%%E assigns the first counter to count the number of
memory stalls, and "+!5X*-*.%-&/.1%+$- assigns the second counter to count
the number of memory operations over the bus. The number L at the end of the
command instructs (#))$&$ to print out the accumulated information every two
seconds. The total number of memory accesses across the memory bus seen in the
results is 337 million, with 904 million memory stall events, indicating a very low L2
cache hit-rate.
After the !0,)+)$-,$.!030%+,V setting is modified to L8 John the Ripper
executed, and memory access measured again with the (#))$&$ command,
the total number of memory accesses across the memory bus goes down to 10.6
million, with 17.7 million memory stall events. This reduction of 97% in the number
of memory accesses over the memory bus indicates a huge improvement in the
L2 cache hit-rate. Running the multiple invocations of John the Ripper, scalability
Sun Microsystems, Inc.16 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
improves drastically, with 32 parallel invocations resulting in 1.7 million DES
encryptions/second, compared to less than 400,000 DES encryptions/second before
the !0,)+)$-,$.!030%+,V setting was changed.
Memory Placement Optimization with OpenMPThe Solaris OS optimizes memory allocation for NUMA architectures. This is currently
relevant to UltraSPARC IV+, UltraSPARC T2+, SPARC64-VII®, AMD Opteron™, and Intel®
Xeon® processor 5500 series multiprocessor systems. Each CPU has its own memory
controller and through it — its own memory.
On the systems with NUMA CPUs running the Solaris OS, by default, the first time
uninitialized memory is accessed or when memory is allocated due to a page fault,
it is allocated on the memory local to the CPU that runs the thread accessing the
memory. This mechanism is called memory placement optimization. However, if
the first thread accessing the memory is moved to another CPU, or if the memory
is accessed by a thread running on another CPU, this can result in higher memory
latency.
The program used to test this behavior implements a matrix by vector multiplication
that scales almost linearly with the number of CPUs, for large working sets. The
benefits of first-touch optimization will show in the working routines, where the
data is accessed repeatedly. However, the optimization itself must be applied when
the data is initialized, since that is when the memory is allocated. By making use
of the Solaris MPO the code can ensure that data gets initialized by the thread that
processes the data later. This results in reduced memory latency and improved
memory bandwidth. Following is the initialization code that scales almost linearly
for large matrix and vector sizes:
599+,+$./&$94+,$9*89+,$9,89/0#(3-9[:589/0#(3-9[[*589/0#(3-9[:L7L9*DM9***;)<*;8*H:N9*'!"5?=5*#=!*!5"5%%&%*(&I53%<>)#)&C*VO99999999999)D&%-/4*8,8:58*58:L79"%+:&$-4+8\7P9***DQ99I"%&V*&90*"9H0%R9*****I#"*>HJ1:*HK):*HLLCS9*******B06H7*J*/R1:/1*I"%&V*&90*"9H0%559****I#"*>;J1:*;K=:*;LLC*D5L9******B/6;7*J*O/a_eR1:5M9******I#"*>HJ1:*HK):*HLLC5N9********=/6;76H7*J*;:5O9****i*SAOO*.)(*#I*#=!*I#"*OOAS5P9**i*SAOO*.)(*#I*!5"5%%&%*"&?;#)*OOAS5Q9T
Sun Microsystems, Inc.17 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
8. The OpenMP API supports multiplatform shared-memory parallel programming in C/C++ and Fortran. OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface to develop parallel applications.
This code implements two top level loops:
Lines 8 and 9 write sequentially to floating point vector :L
Lines 11 to 15 implement an outer loop, in which the floating point elements of
vector :5 are set to -1957.0 (line 12)
In the inner loop, for every iteration of the external loop, a sequence of elements
of the matrix *5 is set to the value of the outer loop’s index (lines 13 and 14)
The three I"%&V*&90*" compile time directives instruct the compiler to use the
OpenMP8 API to generate code that executes the top level loops in parallel. OpenMP
generates code that spreads the loop workload over several threads. In this case,
the initialization is the first time the memory written to in these loops is accessed,
and first-touch attaches the memory the threads use to the CPU they run on. Here
the code is executed with one or two threads on a four-way dual-core AMD Opteron
system, with two cores —1 and 2 — on different processors defined as a processor
set using the ")%)-$4567 command. This setup helps ensure almost undisturbed
execution and makes the !"#)$&$4567 command output easy to parse, while
optimizing memory allocation. The !"#)$&$ command is invoked as follows:
9!3E<5<*O9*W!;91JjkG=&=G9<"%"G!5?&G599&EE83=5EN1J1l1/8V***!;9/JjkG=&=G9<"%"G!5?&G599&EE83=5EN/J1l108V***!;90JjkG=&=G9<"%"G!5?&G599&EE83=5EN0J1l1d8EgEY*0
The command line parameters instruct !"#)$&$ to read the AMD Opteron CPU’s
hardware performance counters to monitor memory accesses broken down
by internal latency. This is based on the #*&)' value, which associates each
performance instrumentation counter (PIC) with the L1 cache, L2 cache, and main
memory, respectively. The EgE token directs cpustat to execute in system mode, and
the sampling interval is set to two seconds.
Note: Many PICs are available for the different processors that run the Solaris OS and are documented in the processor developers’ guides. See “References” on page 32 for a link to the developer’s guide for the AMD Opteron processor relevant to this section.
To illustrate the affect of NUMA and memory placement on processing, the code is
executed in the following configurations:
One thread without first-touch optimization — achieved by setting the NCPUS
environment variable to 1
Two threads without first-touch optimization — achieved by removing all of the
I"%&V*&90*" directives from the code, resulting in code where a single thread
initializes and allocates all memory but both threads work on the working set
Two threads, with first-touch optimization — achieved by setting the NCPUS
environment variable to 2 and retaining all three I"%&V*&90*" directives
Sun Microsystems, Inc.18 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
Following are sample results of the invocations with one thread without first-touch
optimization, that show that the vast majority of memory accesses is handled by
CPU #2:
$+*- !"# -:-,$ !;91 "+!5 "+!L
0R11M 5 $+!' PLSRN 5M55N /1bM
0R11M L $+!' /b_b_d11 /a101a0e bcMe1dc
222 222 222 222 222 222
/1R11M 5 $+!' SPR5R /add1 5LMM
/1R11M L $+!' 5MLRPRLO /a1cd0ab bcaca10
Following are sample results of the invocation with two threads without first-touch
optimization. The results show that even though two threads are used, only one of
these threads had allocated the memory, so it is local to a single chip resulting in a
memory access pattern similar to the single thread example above:
$+*- !"# -:-,$ !;91 "+!5 "+!L
0R11b 5 $+!' 011M/M/e eM1eMM/ 0b01M0/
0R11b L $+!' daa1_ ONSS bM1
222 222 222 222 222 222
/1R11b 5 $+!' /111b_M0e LSOQOSL /M1e00_
/1R11b L $+!' QOL55 /01/_M QL5Q
/0R11b 5 $+!' O5QLQ d11a LSS
/0R11b L $+!' _11c_ MR5N MQ5
Finally, the sample results of the invocations with two threads with first-touch
optimization show that memory access is balanced over both controllers as each
thread has initialized it’s own working set:
$+*- !"# -:-,$ !;91 "+!5 "+!L
0R11a 5 $+!' _b1c/11b 5MN5RNNS /10b_cce
0R11a L $+!' _b010d/c /b0d1c1c /1/cMd1/
222 222 222 222 222 222
/1R11a 5 $+!' dcd/1/c0 aM10b1 5RSPPM
/1R11a L $+!' NPMSPMPR SQNM5N 5SPMLL
/0R11a 5 $+!' _/1Me MMS5 L5N
/0R11a L $+!' NSPRL M5PQ MSS
When the memory is initialized from one core and accessed from another
performance degrades, since access is through the hyperlink, which is slower than
direct access by the on-chip memory controller. The first-touch capability enables
each of the threads to access its data directly.
Sun Microsystems, Inc.19 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
For a large memory footprint the performance is about 800 MFLOPS for a single
thread, about 1050 MFLOPS for two threads without first-touch optimization, and
close to 1600 MFLOPS — which represents perfect scaling — for two threads with
first-touch optimization.
Analyzing I/O Performance Problems in a Virtualized EnvironmentA common performance problem is system sluggishness due to a high rate of I/O
system calls. To identify the cause it is necessary to determine which I/O system calls
are called and in what frequency, by which process, and to determine why. The first
step in the analysis is to narrow down the number of suspected processes, based on
the ratio of the time each process runs in system context versus user context since
processes that spend a high proportion of their running time in system context, can
be assumed to be requesting a lot of I/O.
Good tools to initiate the analysis of this type of problem are the :*)$&$4567 and
"%)$&$4567 commands, which examine all active processes on a system and report
statistics based on the selected output mode and sort order for specific processes,
users, zones, and processor-sets.
In the example described here and in further detail in the Appendix (“Analyzing I/O
Performance Problems in a Virtualized Environment — a Complete Description” on
page 45), a Windows 2008 server is virtualized on the OpenSolaris™ OS using the Sun™
xVM hypervisor for x86 and runs fine. When the system is activated as an Active
Directory domain controller, it becomes extremely sluggish.
As a first step in diagnosing this problem, the system is examined using the
:*)$&$9O command, which prints out a high-level summary of system activity every
five seconds, with the following results:
Examining these results shows that the number of system calls reported in the
Eg column increases rapidly as soon as the affected virtual machine is booted,
and remains quite high even though the CPU is constantly more than 79% idle,
as reported in the +/ column. While it is known from past experience that a CPU-
'$D% =&=#"g "&V- /+)' H$) !"#% ( 1 )1&" H%-- %- *H "+"0 H% /- )% =1 *5 *L *M+, Eg !) #) Eg +/1 1 1 5QPMOQLN d1acb_c 1 1 1 1 1 1 1 M M 1 1 SSN NN5 Q5Q 1 L SR1 1 1 5QPMOQLN d1acb_c 1 1 1 1 1 1 1 1 1 1 1 SP5 N5P Q5M 1 1 /111 1 1 5QPM5NNR d1a__0M QS NPO 1 1 1 1 1 1 1 1 1 /1ed a01_ 5NLR 5 L SQ1 1 1 /ec1d_0d d1e0/dM d1e NOON 1 5 5 1 1 P P 1 1 /1__M QLQRM 010/b N 5Q QS1 1 1 5QOSORLR d1c0bc1 /10 RLR 1 1 1 1 1 M M 1 1 MNN5 NNQNQ /1_01 5 5N RO1 1 1 5QOSRNSL d1cdc0M L L 1 1 1 1 1 5 5 1 1 OMPM 0M_1M RQOL L M SO1 1 1 5QOSRN5L d1c_1cM 1 1 1 1 1 1 1 01 01 1 1 /e1ee Mb10d b1cbb O Q RR1 1 1 /e_aM/1M d1c_/bc 1 1 1 1 1 1 1 1 1 1 1 RSO5 NPNOP /c/d1 L N SM
Sun Microsystems, Inc.20 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
bound workload on this system normally generates less than 5 thousand calls per
five second interval, here the number of calls is constantly more than 9 thousand
from the third interval and on, ranging as high as 83 thousand. Clearly something is
creating a very high system call load.
In the next step in the analysis, the processes are examined with "%)$&$9;<9;*. The
;< option instructs the "%)$&$ command to report statistics for each thread
separately, and the thread ID is appended to the executable name. The ;* option
instructs the "%)$&$ command to report information such as the percentage of time
the process has spent processing system traps, page faults, latency time, waiting for
user locks, and waiting for CPU, while the virtual machine is booted:
The first line of the output indicates that the ]-*#;/* process executes a very large
number of system calls — 200,000 — as shown in the >A< column. A factor of 100
more than U-,)$0%-/, the process with the second largest number of system calls.
At this stage, while it is known that the ]-*#;/* process is at fault, to solve the
problem it is necessary to identify which system calls are called. This is achieved with
the help of a D-script, 9#3)<GEgE95%%ER(, which prints system call rates for the
top-ten processes/system call combinations every five seconds (for the listing and
results of invocation see the Appendix).
When 9#3)<GEgE95%%ER( is run, of the top 10 process/system call pairs that are
printed, seven are executed by ]-*#;/*, with the number of executions of 3)--'
and 1%+$- the most prevalent by far.
The next stage of the investigation is to identify why ]-*#;/* calls these I/O
system calls. To answer this, an analysis of the I/O is implemented with another
D-script — ]-*#;)$&$2/, which collects the statistics of the I/O calls performed by
]-*#;/*, while focusing on 3)--' and 1%+$-9(see listing and results of invocation
in the Appendix).
The results of this invocation show that all calls to 3)--' with an absolute position
and most of the calls to 1%+$- target a single file. The vast majority of the calls to
3)--' move the file pointer of this file to an absolute position that is one byte away
m,P n+.ojfp. =>? >@> E?C EF< GF< <AB ><C 2f^ ^A_ ,Qq >A< +,Q morQ.++S2sm,P/cdM1 U:* P2S S2R 1R1 1R1 1R1 LQ ON 52R b1t 55N 2L6 1 u&=3O(=SbMPM U:* 1R/ 1R0 1R1 1R1 1R1 1R1 /11 1R1 N 5 LB 1 l&)E<#"&(S/5PMQN %00$ 1R1 1R/ 1R1 1R1 1R1 /11 1R1 1R1 /1 1 5B 1 (<"59&S/5PNN U:* 1R/ 1R/ 1R1 1R1 1R1 MM PP 1R1 OPS Q RMO 1 u&=3O(=SbLMSS %00$ 1R1 1R/ 1R1 1R1 1R1 1R1 /11 1R1 NS 1 MRR 1 EE@(S/5PMQP %00$ 1R1 1R/ 1R1 1R1 1R1 1R1 /11 1R1 MR 1 LSQ 1 !"E<5<S///e1_ U:* 1R1 1R/ 1R1 1R1 1R1 _1 _1 1R1 OQP 5O ROR 1 u&=3O(=Sd5POMP %00$ 1R1 1R/ 1R1 1R1 1R1 1R1 /11 1R1 NR 1 LRP 1 B)9B;&T&"S/222^#<5%`*bc*!"#9&EE&E8*/0a*%T!E8*%#5(*5B&"5?&E`*1Rcd8*1Rbe8*1Rb/
Sun Microsystems, Inc.21 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
9. See http://support.microsoft.com/kb/888794
from the previous position, and the vast majority of the calls to the 1%+$- system
call write a single byte. In other words, ]-*#;/* writes a data stream as single
bytes, without any buffering — an access pattern that is inherently slow.
Next, the !$%&E command is used to identify the file accessed by ]-*#;/* through
file descriptor O as the virtual Windows system disk:
'*!$%&E*/cdM1222_`*+G,ho.v*=#(&`1c11*(&B`/M08c__db*;)#`0c*3;(`c1*?;(`1*E;F&`//c0ba0be/0******rGoPsoUrG2fov.h,2.******SlB=S@&"=;5S(;ENG9SB(;ENRB=(N222
To further analyze this problem, it must be seen where the calls to 3)--' originate
by viewing the call stack. This is performed with the ]-*#;!&33)$&!'2/ script,
which prints the three most common call stacks for the 3)--' and 1%+$- system
calls every five seconds (see listing and invocation in the Appendix).
By examining the 1%+$- stack trace, it seems that the virtual machine is flushing the
disk cache very often, apparently for every byte. This could be the result of a disabled
disk cache. Further investigation uncovers the fact that if a Microsoft Windows server
acts as an Active Directory domain controller, the Active Directory directory service
performs unbuffered writes and tries to disable the disk write cache on volumes
hosting the Active Directory database and log files9. Active Directory also works in
this manner when it runs in a virtual hosting environment. Clearly this issue can only
be solved by modifying the behavior of the Microsoft Windows 2008 virtual server.
Using DTrace to Analyze Thread Scheduling with OpenMPTo achieve the best performance on a multithreaded application the workload needs
to be well balanced for all threads. When tuning a multithreaded application, the
analysis of the scheduling of the threads on the different processing elements —
whether CPUs, cores, or hardware threads — can provide significant insight into the
performance characteristics of the application.
Note: While this example focuses on the use of DTrace, many other issues important to profiling applications that rely on OpenMP for parallelization are addressed by the Sun Studio Performance Analyzer. For details see the Sun Studio Technical Articles Web site linked from “References” on page 38.
The following program — "&%$-)$2! — is includes a matrix initialization loop,
followed by a loop that multiplies two matrices. The program is compiled with the
;U&#$0"&% compiler and linker option, which instructs the compiler to generate
Sun Microsystems, Inc.22 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
10. If the program is executed on a single core system, it is likely to run slower, since it runs the code for multithreaded execution without benefiting from the performance improvement of running on multiple cores.
code that uses the OpenMP API to enable the program to run in several threads in
parallel on different cores, accelerating it on multicore hardware10:
;)<*=5;)>;)<*5"?98*9@5"*A5"?B67C*D****%#)?*;8*H:****I#"*>;*J*1:*;*K*,^.o:*;LLC* 56;7*J*46;7*J*96;7*J*;:****!3<E>w2rrm0xC:****I#"*>H*J*1:*H*K*o.m.f^:*HLLC* I#"*>;*J*1:*;*K*,^.o:*;LLC* ****96;7*LJ*56;7*A*46;7:T
This code lacks calls to library functions or system calls and is thus fully CPU bound.
As a result, it is only affected by CPU scheduling considerations.
The $D%-&/)!D-/2/ DTrace script that uses the schedule provider )!D-/ is
implemented to analyze the execution of "&%$-)$. This script can be used to
analyze the scheduling behavior of any multithreaded program.
$D%-&/)!D-/2/ uses the following variables:
Variable Function
(&)-3+,- Used to save the initial time-stamp when the script starts
!#%!"#;K!"#.+/ Current CPU ID
!#%!"#;K!"#.3V%" Current CPU locality group
)-3H;K/-3$& Thread local variable that measures the time elapsed between two events on a single thread
"+/ Process ID
)!&3- Scaling factor of time-stamps from nanoseconds to milliseconds
)-3H;K3&)$!"# Thread local ID of previous CPU the thread ran on
)-3H;K3&)$3V%" Thread local ID of previous locality group the thread ran on
)-3H;K)$&*" Thread local time-stamp which is also used as an initialization flag
$+/ Thread ID
59'[S3E"SE4;)S(<"59&*OEL9I"%&V*&9G90"$+0,9]#+-$M9k.v,jN9DO9**45E&%;)&*J*T5%%<;=&E<5=!:P9**E95%&*J*/111111:Q9T
The k.v,j probe fires when the script starts and is used to initialize the baseline
timestamp employed to measure all other times from the 1&33$+*-)$&*"
built-in DTrace variable. 1&33$+*-)$&*" and all other DTrace timestamps are
measured in nanoseconds, which is a much higher resolution than needed, so the
measurement is scaled down by a factor of 1 million to milliseconds. The scaling
factor is saved in the variable )!&3-.
Sun Microsystems, Inc.23 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
R9*E9@&(```#)O9!3S9*S*!;(*JJ*X<5"?&<*\\*[E&%IOZE<5=!*S/1*D559**E&%IOZE<5=!*J*T5%%<;=&E<5=!:5L9**E&%IOZ%5E<9!3*J*93"9!3OZ9!3G;(:5M9**E&%IOZ%5E<%?"!*J*93"9!3OZ9!3G%?"!:5N9**E&%IOZE<5=!*J*>T5%%<;=&E<5=!*O*45E&%;)&C*S*E95%&:5O9**!";)<I>wya(`yOa(*^,P*yb(*Qmn*yb(>y(C*9"&5<&(V)x85P9**E&%IOZE<5=!8*18*<;(8*93"9!3OZ9!3G;(8*93"9!3OZ9!3G%?"!C:5Q9T
The E9@&(``#)O9!3 probe fires whenever a thread is scheduled for running.
The first part of the predicate on line 9 — !;(*JJ*X<5"?&< — is common to all of
the probes in this example and ensures that the probe only fires for processes that
are controlled by this script.
DTrace initializes variables to zero, so the fact that the )-3H;K)$&*" thread-local
variable is still zero indicates that the thread is running for the first time and
)-3H;K)$&*" and the other thread-local variables are initialized in lines 11-14.
5R9E9@&(```#)O9!35S9S*!;(*JJ*X<5"?&<*\\*E&%IOZE<5=!*\\*E&%IOZ%5E<9!3*V01***************************************[J*93"9!3OZ9!3G;(*SL59DLL9**E&%IOZ(&%<5*J*>T5%%<;=&E<5=!*OE&%IOZE<5=!CSE95%&:LM9**E&%IOZE<5=!*J*T5%%<;=&E<5=!:LN9**E&%IOZE<5=!*J*>T5%%<;=&E<5=!*O*45E&%;)&C*S*E95%&:LO9**!";)<I>wya(`yOa(*^,P*yb(*I"#=OQmn*y(>y(C*w8*E&%IOZE<5=!8LP9**********E&%IOZ(&%<58*<;(8*E&%IOZ%5E<9!38*E&%IOZ%5E<%?"!C:LQ9**!";)<I>w<#OQmn*y(>y(C*Qmn*=;?"5<;#)V)x8LR9*******93"9!3OZ9!3G;(8*93"9!3OZ9!3G%?"!C:LS9**E&%IOZ%5E<9!3*J*93"9!3OZ9!3G;(:b1***E&%IOZ%5E<%?"!*J*93"9!3OZ9!3G%?"!:M59T
This code runs when a thread switches from one CPU to another.
The timestamp, the time the thread spent on the previous CPU, the last CPU this
thread ran on, and the CPU it is switching to are printed in lines 25-28.
ML9E9@&(```#)O9!3MM9S*!;(*JJ*X<5"?&<*\\*E&%IOZE<5=!*\\*E&%IOZ%5E<9!3*JJ*VMN9******************************************93"9!3OZ9!3G;(*SMO9DMP9**E&%IOZ(&%<5*J*>T5%%<;=&E<5=!*O*E&%IOZE<5=!C*S*E95%&:MQ9**E&%IOZE<5=!*J*T5%%<;=&E<5=!:MR9**E&%IOZE<5=!*J*>T5%%<;=&E<5=!*O*45E&%;)&C*S*E95%&:MS9**!";)<I>wya(`yOa(*^,P*yb(*Qmn*yb(>y(Cx8*E&%IOZE<5=!8d1********E&%IOZ(&%<58*<;(8*93"9!3OZ9!3G;(8*93"9!3OZ9!3G%?"!C:N59**!";)<I>w"&E<5"<&(*#)*E5=&*QmnV)x8NL9T
This code runs when the thread is rescheduled to run on the same CPU it ran on
the previous time it was scheduled to run.
Sun Microsystems, Inc.24 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
The timestamp is updated in line 37 and a message that the thread is rescheduled
to run on the same CPU is printed together with the timestamp, in lines 39-41.
NM9E9@&(```#IIO9!3NN9S*!;(*JJ*X<5"?&<*\\*E&%IOZE<5=!*SNO9DNP9**E&%IOZ(&%<5*J*>T5%%<;=&E<5=!*O*E&%IOZE<5=!C*S*E95%&:NQ9**E&%IOZE<5=!*J*T5%%<;=&E<5=!:NR9**E&%IOZE<5=!*J*>T5%%<;=&E<5=!*O*45E&%;)&C*S*E95%&:NS9**!";)<I>wya(`yOa(*^,P*yb(*Qmn*yb(>y(C*!"&&=!<&(V)x8*_1*9999999999)-3H;K)$&*"89)-3H;K/-3$&89$+/89!#%!"#;K!"#.+/89O59**********93"9!3OZ9!3G%?"!C:OL9T
The E9@&(``#IIO9!3 probe fires whenever a thread is about to be stopped by
the scheduler.
OM9E9@&(```E%&&!ON9S*!;(*JJ*X<5"?&<*SOO9DOP9**E&%IOZE#4H*J*>93"%T!E;)I#OZ!"GE<g!&*JJ*+rkzGpn^.q*{OQ9**wN&")&%*=3<&lx*`*93"%T!E;)I#OZ!"GE<g!&*JJ*+rkzGos2rQt*{OR9**wN&")&%*os*%#9Nx*`*93"%T!E;)I#OZ!"GE<g!&*JJ*+rkzGQ|*{OS9**w9#)(*B5"x*`*93"%T!E;)I#OZ!"GE<g!&*JJ*+rkzG+.pf*{c1***wN&")&%*E&=5!@#"&x*`*93"%T!E;)I#OZ!"GE<g!&*JJ*+rkzGn+.o*{P59**w3E&"O%&B&%*%#9Nx*`*93"%T!E;)I#OZ!"GE<g!&*JJ*+rkzGn+.oGm,*{*PL9**w3E&"O%&B&%*m,*%#9Nx*`*93"%T!E;)I#OZ!"GE<g!&*JJ*PM9**********************+rkzG+}n^^2.*{*wE@3<<%&x*`*w3)N)#T)xC:PN9**E&%IOZ(&%<5*J*>T5%%<;=&E<5=!*O*E&%IOZE<5=!C*SE95%&:PO9**E&%IOZE<5=!*J*T5%%<;=&E<5=!:PP9**E&%IOZE<5=!*J*>T5%%<;=&E<5=!*O*45E&%;)&C*S*E95%&:PQ9**!";)<I>wya(`yOa(*^,P*yb(*E%&&!;)?*#)*WyEYV)x8PR9******************E&%IOZE<5=!8*E&%IOZ(&%<58*<;(8*E&%IOZE#4HC:PS9T
The E9@&(```E%&&! probe fires before the thread sleeps on a synchronization
object.
The type of synchronization object is printed in lines 67-68.
e1*E9@&(```E%&&!Q59S*!;(*JJ*X<5"?&<*\\*>*93"%T!E;)I#OZ!"GE<g!&*JJ*+rkzGQ|*UUQL9**93"%T!E;)I#OZ!"GE<g!&*JJ*+rkzGn+.o*UUQM9**93"%T!E;)I#OZ!"GE<g!&*JJ*+rkzGn+.oGm,C*SQN9DQO9****3E<59N>C:QP9T
The second E9@&(```E%&&! probe fires when a thread is put to sleep on a
condition variable or user-level lock, which are typically caused by the application
itself, and prints the call-stack.
Sun Microsystems, Inc.25 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
11. For compatibility with legacy programs, setting the mfof22.2 environment variable has the same effect as setting rpmGjnpG^}o.fP+. However, if they are both set to different values, the runtime library issues an error message.
The ")%)-$ command is used to set up a processor set with two CPUs (0, 4) to
simulate CPU over-commitment:
@#E<'*!E"E&<*O9*1*d
Note: The numbering of the cores for the ")%)-$ command is system dependent.
The number of threads is set to three with the rpmGjnpG^}o.fP+11 environment
variable and $D%-&/)!D-/2/ is executed with "&%$-)$:
@#E<'*rpmGjnpG^}o.fP+Jb*RS<@"&5(E9@&(R(*O9*RS!5"<&E<
The output first shows the startup of the main thread (lines 1 to 5). The second
thread first runs at line 6 and the third at line 12:
59 1 ` 1 ^,P 59AC= 1>1C*9"&5<&(L9 1 ` 1 ^,P 59AC= 1>1C*"&E<5"<&(*#)*E5=&*QmnM9 1 ` 1 ^,P 59AC= 1>1C*!"&&=!<&(N9 1 ` 1 ^,P 59AC= 1>1C*"&E<5"<&(*#)*E5=&*QmnO9 1 ` 1 ^,P 59AC= 1>1C*!"&&=!<&(*P9 NS ` 1 ^,P L9AC= 1>1C*9"&5<&(Q9 NS ` 1 ^,P L9AC= 1>1C*"&E<5"<&(*#)*E5=&*QmnR9 NS ` 1 ^,P L9AC= 1>1C*!"&&=!<&(S9 NS ` 1 ^,P L9AC= 1>1C*"&E<5"<&(*#)*E5=&*Qmn/19 NS ` 1 ^,P 0*E%&&!;)?*#)*W3E&"O%&B&%*%#9NY559 NS ` 1 ^,P L9AC= 1>1C*!"&&=!<&(5L9 NS ` 1 ^,P M9AC= 1>1C*9"&5<&(5M9 NS ` 1 ^,P M9AC= 1>1C*"&E<5"<&(*#)*E5=&*Qmn
5N9 d01 ` be1 ^,P M9AC= 1>1C*!"&&=!<&(5O9 222
As the number of available CPUs is set to two, only two of the three threads can
run simultaneously resulting in many thread migrations between CPUs, as seen in
lines 19, 21, 36, 39, 43, and 46. At lines 24, 33, and 53 respectively, each of the three
threads goes to sleep on a condition variable:
Note: OpenMP generates code to synchronize threads which cause them to sleep. This is not related to threads migrating between CPUs resulting from a shortage of available cores, as demonstrated here.
5P9 <``CL5Q9 /ec10d ` /111 ^,P L9AC= 1>1C*!"&&=!<&(5R9 /ec10d ` 1 ^,P L9AC= 1>1C*"&E<5"<&(*#)*E5=&*Qmn5S9 /ecM1d ` 1 ^,P b*I"#=OQmn*d>1C*<#OQmn*1>1C*Qmn*=;?"5<;#)019 /ecM1d ` 1 ^,P M9AC= 1>1C*"&E<5"<&(*#)*E5=&*QmnL59 /ecM1d ` 1 ^,P /*I"#=OQmn*d>1C*<#OQmn*1>1C*Qmn*=;?"5<;#)LL9 /ecM1d ` 1 ^,P 59AC= 1>1C*"&E<5"<&(*#)*E5=&*QmnLM9 /ec10d ` 1 ^,P M9AC= d>1C*"&E<5"<&(*#)*E5=&*Qmn
Sun Microsystems, Inc.26 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
LN9 /ec/1d ` M1 ^,P b*E%&&!;)?*#)*W9#)(*B5"YLO9 /ec/1d ` 1 ^,P M9AC= d>1C*!"&&=!<&(LP9 5QPNRN ` bM1 ^,P M9AC= d>1C*"&E<5"<&(*#)*E5=&*QmnLQ9 5QPNRN ` 1 ^,P M9AC= d>1C*!"&&=!<&(LR9 5QPNRN ` b__1 ^,P 59AC= d>1C*"&E<5"<&(*#)*E5=&*QmnLS9 5QPPLN ` /d1 ^,P 59AC= d>1C*!"&&=!<&(b19 5QPPLN ` /d1 ^,P M9AC= d>1C*"&E<5"<&(*#)*E5=&*QmnM59 5R5LPP ` /M1 ^,P M9AC= d>1C*!"&&=!<&(ML9 5R5LPP ` /M1 ^,P 59AC= d>1C*"&E<5"<&(*#)*E5=&*QmnMM9 5R5LPP ` 1 ^,P /*E%&&!;)?*#)*W9#)(*B5"YMN9 /M_1ed ` /c1 ^,P 59AC= d>1C*"&E<5"<&(*#)*E5=&*QmnMO9 5RO5PN ` a1 ^,P 59AC= d>1C*!"&&=!<&(MP9 5RO5PN ` a1 ^,P 0*I"#=OQmn*1>1C*<#OQmn*d>1C*Qmn*=;?"5<;#)MQ9 5RO5PN ` 1 ^,P L9AC= d>1C*"&E<5"<&(*#)*E5=&*QmnMR9 5ROS5N ` e_1 ^,P L9AC= d>1C*!"&&=!<&(MS9 5ROS5N ` /01 ^,P /*I"#=OQmn*1>1C*<#OQmn*d>1C*Qmn*=;?"5<;#)d19 5ROS5N ` 1 ^,P 59AC= d>1C*"&E<5"<&(*#)*E5=&*QmnN59 /Mc1bd ` 1 ^,P M9AC= 1>1C*"&E<5"<&(*#)*E5=&*QmnNL9 /Mc1ed ` d1 ^,P M9AC= 1>1C*!"&&=!<&(NM9 /Mc1ed ` 1 ^,P /*I"#=OQmn*d>1C*<#OQmn*1>1C*Qmn*=;?"5<;#)NN9 /Mc1ed ` 1 ^,P 59AC= 1>1C*"&E<5"<&(*#)*E5=&*QmnNO9 5RPLQO ` 011 ^,P 59AC= 1>1C*!"&&=!<&(*NP9 5RPLQO ` /_1 ^,P 0*I"#=OQmn*d>1C*<#OQmn*1>1C*Qmn*=;?"5<;#)NQ9 5RPLQO ` 1 ^,P L9AC= 1>1C*"&E<5"<&(*#)*E5=&*QmnNR9 222NS9 01/bc0 ` 1 ^,P 59AC= d>1C*!"&&=!<&(_19 01/bc0 ` 1 ^,P M9AC= d>1C*"&E<5"<&(*#)*E5=&*QmnO59 01/bc0 ` 1 ^,P M9AC= d>1C*!"&&=!<&(OL9 01/bc0 ` 1 ^,P 59AC= d>1C*"&E<5"<&(*#)*E5=&*QmnOM9 01/bc0 ` 1 ^,P /*E%&&!;)?*#)*W9#)(*B5"Y
From line 54, the call stack dump shows that the last function called is $D%".\0+,,
which indicates the end of a parallelized section of the program with all threads
concluding their processing and only the main thread of the process remaining:
ON9 %;49RE#R/~GG%T!GT5;<L1ldOO9 %;49RE#R/~G<@"!GH#;)L1lbMOP9 %;4=<ENRE#R/~<@"&5(EG$);L1l/eMOQ9 %;4=<ENRE#R/~%;4=<ENG$);L1l/9OR9 %;4=<ENRE#R/~95%%G5""5gL1l51OS9 %;4=<ENRE#R/~95%%G$);L1l41c19 %;4=<ENRE#R/~5<&l;<G$);L1lM1P59 %;49RE#R/~G&l;<@5)(%&L1lddPL9 %;49RE#R/~&l;<L1ldPM9 !5"<&E<~GE<5"<L1l/Md
Sun Microsystems, Inc.27 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
Using DTrace with MPIThe Message Passing Interface (MPI) de facto standard is a specification for
an API that allows many computers to communicate with one another. Sun
HPC ClusterTools™ software is based directly on the Open MPI open-source
implementation of the MPI standard, and reaps the benefits of that community
initiative, which has deployments and demonstrated scalability on very large scale
systems. Sun HPC ClusterTools software is the default MPI distribution in the Sun
HPC Software. It is fully tested and supported by Sun on the wide spectrum of Sun
x86 and SPARC processor-based systems. Sun HPC ClusterTools provides the libraries
and run-time environment needed for creating and running MPI applications and
includes DTrace providers, enabling additional profiling capabilities.
Note: While this example focuses on the use of DTrace with MPI, alternative, complementary, and supplementary ways to profile MPI based applications are provided by the Sun Studio Performance Analyzer. The Analyzer directly supports the profiling of MPI, and provides features that help understand message transmission issues and MPI stalls. For details see the Sun Studio Technical Articles Web site linked from “References” on page 38.
The MPI standard states that MPI profiling should be implemented with wrapper
functions. The wrapper function performs the required profiling tasks, and the real
MPI function is called inside the wrapper through a profiling MPI (PMPI) interface.
However, using DTrace for profiling has a number of advantages over the standard
approach:
The PMPI interface is compiled code that requires restarting a job every time a
library is changed. DTrace is dynamic and profiling code can be implemented in D
and attached to a running process.
MPI profiling changes the target system, resulting in differences between the
behavior of profiling and non-profiling code. DTrace enables the testing of an
actual production system with, at worst, a negligible affect on the system under
test.
DTrace allows the user to define probes that capture MPI tracing information with
a very powerful and concise syntax.
The profiling interface itself is implemented in D, which includes built-in
mechanisms that allow for safe, dynamic, and flexible tracing on production
systems.
When DTrace is used in conjunction with MPI, DTrace provides an easy way to
identify the potentially problematic function and the desired job, process, and
rank that require further scrutiny.
D has several built-in functions that help in analyzing and trouble shooting
problematic programs.
Sun Microsystems, Inc.28 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
Setting the Correct *"+%#, PrivilegesThe *"+%#, command controls several aspects of program execution and uses
the Open Run-Time Environment (ORTE) to launch jobs. /$%&!-."%0! and
/$%&!-.#)-% privileges are needed to run a script with the *"+%#, command,
otherwise DTrace fails and reports an insufficient privileges error. To determine
whether the correct privileges are assigned, the following shell script, *"""%+:2)D,
must be executed by *"+%#,:
'[S4;)SE@I9*"""%+:2)D9;9%#,9""%+:4579#,/-%9&9)D-339$09)--9$D-9I9"%+:+3-V-)90H9$D-9"%0!-))9$D&$9*"+%#,9!%-&$-)!!";B*XX
Execute the following command to determine whether the correct privileges are
assigned for a cluster consisting of host1 and host2:
Z9*"+%#,9;,"9L9;;D0)$9D0)$58D0)$L9*"""%+:2)D
The privileges can be set by the system administrator. Once they are set,
/$%&!-.#)-% and /$%&!-."%0! privileges are reported by *"""%+:2)D as follows:
d1M_`*O9E@****�5?E*J*K)#)&Z****.`45E;98(<"59&G!"#98(<"59&G3E&"****,`45E;98(<"59&G!"#98(<"59&G3E&"****m`45E;98(<"59&G!"#98(<"59&G3E&"****2`*5%%
A Simple Script for MPI TracingThe following script, *"+$%&!-2/, is used to trace the entry into and exit from all of
the MPI API calls:
!;(X<5"?&<`%;4=!;`pm,GA`&)<"gD****!";)<I>w.)<&"&(*yERRRx8*!"#4&I3)9C:T!;(X<5"?&<`%;4=!;`pm,GA`"&<3")D****!";)<I>w&l;<;)?8*"&<3")*B5%3&*J*y(V)x8*5"?/C:T
To trace the entry and exit of MPI calls in four instances of an MPI application,
*"+&""8 running under dtrace and using the *"+$%&!-2/ script, it seems you
should execute the following command:
Z9*"+%#,9;,"9N9/$%&!-9;09*"+&""2$%&!-9;)9*"+$%&!-2/9;!9*"+&""
Sun Microsystems, Inc.29 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
Where:
;,"9N instructs the *"+%#, command to invoke four copies of /$%&!-
;)9*"+$%&!-2/ instructs /$%&!- to run the *"+$%&!-2/ D-script
;!9*"+&"" instructs each instance of the /$%&!- command to run *"+&""
In this invocation, the output for all processes is directed to the standard-output and
the different output from different processes is mixed up in the single output file
*"+&""2$%&!-. To direct the output of each invocation of the /$%&!- command to
a separate file, invoke it using the following "&%$%&!-2)D script:
'[S4;)SE@(<"59&*OE*X/*O9*X0*O#*X0RXrpm,GQrppGsro2PGofjtR<"59&
Where:
OE*X/ instructs /$%&!- to run the D-script passed to it as the first argument
O9*X0 instructs /$%&!- to run the application named in the second argument
O#*X0RXrpm,GQrppGsro2PGofjtR<"59& directs the trace-output to the file
named from the concatenation of the second argument, the MPI-rank of the shell
and the string trace. If this file exists, the output is appended to it.
Now invoke four copies of the "&%$%&!-2)D script that invokes *"+&"":
Z9*"+%#,9;,"9N9"&%$%&!-2)D9*"+$%&!-2/9*"+&""
Note: The status of the OMPI_COMM_WORLD_RANK shell variable is unstable and subject to change. The user may need to change it to its replacement.
To attach the /$%&!- command with the same *"+$%&!-2/ D-script to a running
instance of the MPI program *"+&"" with process id 24768 run the following
command:
Z9/$%&!-9;"9LNQPR9;)9*"+$%&!-2/
When DTrace running the *"+$%&!-2/ script is attached to a job that performs )-,/
and %-!: operations, the output looks similar to the following:
y*(<"59&*Ou*O!*0dee1*OE*=!;<"59&R(.)<&"&(*pm,G+&)(RRR&l;<;)?8*"&<3")*B5%3&*J*1.)<&"&(*pm,Go&9BRRR&l;<;)?8*"&<3")*B5%3&*J*1.)<&"&(*pm,G+&)(RRR&l;<;)?8*"&<3")*B5%3&*J*1.)<&"&(*pm,Go&9BRRR&l;<;)?8*"&<3")*B5%3&*J*1.)<&"&(*pm,G+&)(RRR&l;<;)?8*"&<3")*B5%3&*J*1
Sun Microsystems, Inc.30 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
12. In MPI, a communicator is an opaque object with a number of attributes together with simple rules that govern its creation, use, and destruction. Communicators are the basic data types of MPI.
The *"+$%&!-2/ script can be easily modified to include an argument list. The
resulting output resembles output of the $%#)) command:
59*!;(X<5"?&<`%;4=!;`pm,G+&)(`&)<"g8L9*!;(X<5"?&<`%;4=!;`pm,GAE&)(`&)<"g8M9*!;(X<5"?&<`%;4=!;`pm,Go&9B`&)<"g8N9*!;(X<5"?&<`%;4=!;`pm,GA"&9B`&)<"gO9*DP9***!";)<I>wyE>1lyl8*y(8*1lyl8*y(8*y(8*1lylCx8!"#4&I3)98Q9************************5"?18*5"?/8*5"?08*5"?b8*5"?d8*5"?_C:R99TS9*!;(X<5"?&<`%;4=!;`pm,G+&)(`"&<3")8/1*!;(X<5"?&<`%;4=!;`pm,GAE&)(`"&<3")8559!;(X<5"?&<`%;4=!;`pm,Go&9B`"&<3")85L9!;(X<5"?&<`%;4=!;`pm,GA"&9B`"&<3")5M9D5N9**!";)<I>wV<V<*J*y(V)x8*5"?/C:5O9T
The *"+$%#))2/ script demonstrates the use of wildcard names to match all send
and receive type function calls in the MPI library. The first probe shows the usage of
the built-in &%V variables to print out the argument list of the traced function. The
following example shows a sample invocation of the *"+$%#))2/ script:
y*(<"59&*Ou*O!*0dee1*OE*=!;<"3EER(pm,G+&)(>1lM1de1418*/8*1lM1c1IdM8*18*/8*1lM1c1(dMC*J*1pm,Go&9B>1lM1de15M8*/8*1lM1c1IdM8*18*18*1lM1c1(dMC*J*1pm,G+&)(>1lM1de1418*/8*1lM1c1IdM8*18*/8*1lM1c1(dMC*J*1pm,Go&9B>1lM1de15M8*/8*1lM1c1IdM8*18*18*1lM1c1(dMC*J*1
Tracking Down MPI Memory LeaksMPI communicators12 are allocated by the MPI middleware at the request of
application code, making it difficult to identify memory leaks. The following
D-script — *"+!0**!D-!'2/9a9demonstrates how DTrace can help overcome this
challenge.
*"+!0**!D-!'2/ probes for MPI functions that allocate and deallocate
communicators, and keeps track of the call-stack for each function. Every 10 seconds,
the script dumps out the count of MPI communicator calls and the total number of
instances of communicator that were allocated and freed. At the end of the DTrace
session, the script prints the total count and the different stack traces, as well as the
number of times those stack traces were seen.
59k.v,jL9DM9**5%%#9E*J*1:*(&5%%#9E*J*1:*E&9#)(E*J*1:N9T
The k.v,j probe fires when the script starts and initializes the count of
allocations, deallocations, and seconds counter.
Sun Microsystems, Inc.31 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
O9*!;(X<5"?&<`%;4=!;`pm,GQ#==G9"&5<&`&)<"g8P9*!;(X<5"?&<`%;4=!;`pm,GQ#==G(3!`&)<"g8Q9*!;(X<5"?&<`%;4=!;`pm,GQ#==GE!%;<`&)<"gR9*DS9***LL5%%#9E:/1***]9#3)<E6!"#4&I3)97*J*9#3)<>C:559**]E<59NE63E<59N>C7*J*9#3)<>C:5L9T
This code section implements the communicator constructor probes —
pm,GQ#==G9"&5<&, pm,GQ#==G(3!, or pm,GQ#==GE!%;<.
Line 9 increments the allocation counter, line 10 saves the count of calls to the
specific function, and line 11 saves the count of the stack frames.
5M9!;(X<5"?&<`%;4=!;`pm,GQ#==GI"&&`&)<"g5N9D5O9**LL(&5%%#9E:5P9**]9#3)<E6!"#4&I3)97*J*9#3)<>C:5Q9**]E<59NE63E<59N>C7*J*9#3)<>C:5R9T
This code section implements the communicator destructor probe —
pm,GQ#==GI"&&R
Line 15 increments the deallocation counter, line 16 saves the count of calls to
pm,GQ#==GI"&&,9and line 17 saves the count of the stack frames.
5S9!"#$%&```<;9NO/1E&901*DL59**!";)<I>wJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJxC:LL9**!";)<5>]9#3)<EC:LM9**!";)<I>wQ#==3);95<#"*5%%#9E*J*y(V)x8*5%%#9EC:LN9**!";)<I>wQ#==3);95<#"*(&5%%#9E*J*y(V)x8*(&5%%#9EC:LO9**E&9#)(E*J*1:LP9T
The code section runs every 10 seconds and prints the counters.
LQ9.jPLR9DLS9**!";)<I>wQ#==*5%%#9E*J*y(8*Q#==*(&5%%#9E*J*y(V)x8b1*******5%%#9E8*(&5%%#9EC:M59T
The .jP probe fires when the DTrace session exits.
The count of the number of times each call stack was seen is printed in line 29.
DTrace prints aggregations on exit by default, so the three communicator
constructors and the communicator destructor call stacks are printed.
When this script is run, a growing difference between the total number of allocations
and deallocations can indicate the existence of a memory leak. Using the stack
traces dumped, it is possible to determine where in the application code the objects
Sun Microsystems, Inc.32 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
are allocated and deallocated to help diagnose the issue. The following example
demonstrates such a diagnosis.
The *"+!0**3-&' MPI application performs three pm,GQ#==G(3! operations and
two pm,GQ#==GI"&& operations. The program thus allocates one communicator
more than it frees with each iteration of a specific loop. Note that this in itself does
not necessarily mean that the extra allocation is in fact a memory leak, since the
extra object might be referred to by code that is external to the loop in question.
When DTrace is attached to *"+!0**3-&' using *"+!0**!D-!'2/, the count
of allocations and deallocations is printed every 10 seconds. In this example, this
output shows that the count of the allocated communicators is growing faster than
the count of deallocations.
When the DTrace session exits, the .jP code outputs a total of five stack traces
— three for the constructor and two for the destructor call stacks, as well as the
number of times each call stack was encountered.
The following command attaches the *"+!0**!D-!'2/9D-script to process 24952:
Z9/$%&!-9;]9;"9LNSOL9;)9*"+!0**!D-!'2/
In the following output, the difference between the number of communicators
allocated and freed grows continuously, indicating a potential memory leak:
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
pm,GQ#==GI"&& N
pm,GQ#==G(3! P
Q#==3);95<#"*f%%#95<;#)E*J*c
A0**#,+!&$0%9G-&330!&$+0,)9X9N
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
pm,GQ#==GI"&& R
pm,GQ#==G(3! 5L
Q#==3);95<#"*f%%#95<;#)E*J*/0
A0**#,+!&$0%9G-&330!&$+0,)9X9R
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
pm,GQ#==GI"&& 5L
pm,GQ#==G(3! 5R
Q#==3);95<#"*f%%#95<;#)E*J*/M
A0**#,+!&$0%9G-&330!&$+0,)9X95L
Sun Microsystems, Inc.33 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
After the D-script is halted (by the user interrupting DTrace), the final count of
allocations and deallocations, together with the table of stack traces, is printed:
Q#==3);95<#"*f%%#95<;#)E*J*0/8*Q#==3);95<#"*P&5%%#95<;#)E*J*/d
%;4=!;RE#R1R1R1Ypm,GQ#==GI"&&
=!;9#==%&5NY(&5%%#95<&G9#==EL1l/a
=!;9#==%&5NY=5;)L1lc(
=!;9#==%&5NY1lM1_1M/5
Q
%;4=!;RE#R1R1R1Ypm,GQ#==GI"&&
=!;9#==%&5NY(&5%%#95<&G9#==EL1l0c
=!;9#==%&5NY=5;)L1lc(
=!;9#==%&5NY1lM1_1M/5
Q
%;4=!;RE#R1R1R1Ypm,GQ#==G(3!
=!;9#==%&5NY5%%#95<&G9#==EL1l/&
=!;9#==%&5NY=5;)L1l_4
=!;9#==%&5NY1lM1_1M/5
Q
%;4=!;RE#R1R1R1Ypm,GQ#==G(3!
=!;9#==%&5NY5%%#95<&G9#==EL1lb1
=!;9#==%&5NY=5;)L1l_4
=!;9#==%&5NY1lM1_1M/5
Q
%;4=!;RE#R1R1R1Ypm,GQ#==G(3!
=!;9#==%&5NY5%%#95<&G9#==EL1ld0
=!;9#==%&5NY=5;)L1l_4
=!;9#==%&5NY1lM1_1M/5
Q
In analyzing the output of the script, it is clear that 50% more communicators are
allocated than are deallocated, indicating a potential memory leak. The five stack
traces, each occurring seven times, point to the code that called the communicator
constructors and destructors. By analyzing the handling of the communicator
objects by the code executed following the calls to their constructors, it is possible to
identify whether there is in fact a memory leak, and its cause.
Using the DTrace *"+"-%#)- ProviderPERUSE is an extension interface to MPI that exposes performance related processes
and interactions between application software, system software, and MPI message-
passing middleware. PERUSE provides a more detailed view of MPI performance
Sun Microsystems, Inc.34 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
compared to the standard MPI profiling interface (PMPI). Sun HPC ClusterTools
includes a DTrace provider named *"+"-%#)- that supports DTrace probes into the
MPI library.
The DTrace *"+"-%#)- provider exposes the events defined in the MPI PERUSE
specification that track the MPI requests. To use an *"+"-%#)- probe, add the
string =!;!&"3E&X<5"?&<``` before the probe-name, remove the PERUSE_ prefix,
convert the letters into lower-case, and replace the underscore characters with
dashes. For example, to specify a probe to capture a PERUSE_COMM_REQ_ACTIVATE
event, use =!;!&"3E&X<5"?&<```9#==O"&uO59<;B5<&R*If the optional
object and function fields in the probe description are omitted, all of the
!0**;%-];&!$+:&$- probes in the MPI library and all of its plug-ins will fire.
All of the *"+"-%#)- probes recognize the following four arguments:
Argument Function
*"+!0**+,H0.$9[+ Provides a basic source and destination for the request and which protocol is expected to be used for the transfer.
#+,$"$%.$9#+/ The PERUSE unique ID for the request that fired the probe (as defined by the PERUSE specifications). For OMPI this is the address of the actual request.
#+,$.$90" Indicates whether the probe is for a send or receive request.
*"+!0**.)"-!.$9[!) Mimics the )"-! structure, as defined in the PERUSE specification.
To run D-scripts with *"+"-%#)- probes, use the O- (upper case) switch with the
/$%&!- command since the *"+"-%#)- probes do not exist at initial load time:
'[S4;)SE@'*!5"<"59&RE@*O*5*@&%!&"*E9";!<*<#*(<"59&*pm,*H#4E*I"#=I9$D-9)$&%$90H9$D-9\0(2(<"59&*O-*OE*X/*O9*X0*O#*X0RXrpm,GQrppGsro2PGofjtR<"59&
Tracking MPI Message Queues with DTrace
Sun HPC ClusterTools™ software includes an example of using *"+"-%#)- probes
on the Solaris OS that allows users to see the activities of the MPI message queues.
To run the script, *"+)$&$2/, a compiled MPI program that is executed with the
*"+%#, command is needed to create two MPI processes. In this example, the
process ID of the traced process is 5MMQ5.
Run the *"+)$&$2/ D-script:
Z9/$%&!-9;]9;"95MMQ59;)9*"+)$&$2/
Sun Microsystems, Inc.35 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
The *"+)$&$2/ D-script prints out the following columns of information on the MPI
message queues every second:
Col PERUSE event probe Meaning
1 o.�Gqh.oGQrj^,jn. Count of bytes received
2 o.�GfQ^,|f^. Number of message transfers activated
3 o.�G,j+.o^G,jGmr+^.PG� Number of messages inserted in the posted request queue
4 p+vG,j+.o^G,jGnj.qG� Number of messages inserted into the unexpected queue
5 p+vGpf^Q}Gmr+^.PGo.� Number of incoming messages matched to a posted request
6 o.�Gpf^Q}Gnj.q Number of user requests matched to an unexpected message
7 o.�Gqh.oG.jP Count of bytes that have been scheduled for transfer
8 o.�GfQ^,|f^. Number of times MPI has initiated the transfer of a message
A Few Simple *"+"-%#)- Usage ExamplesThe examples in this section show how to perform the described DTrace operations
from the command line. In each example, the process ID shown (254) must be
substituted with the process ID of the process to be monitored.
Count the number of messages to or from a host:
y*(<"59&*O!*0_d*O)*W=!;!&"3E&X<5"?&<```9#==O"&uOlI&"O&)(*D*V9**]65"?E617OZ9;G"&=#<&7*J*9#3)<>C:iY(<"59&`*(&E9";!<;#)*W=!;!&"3E&X<5"?&<```9#==O"&uOlI&"O&)(W**&$!D-/95Q9"%0(-)bAD0)$L9%-!:9MD0)$L9)-,/9M
Count the number of messages to or from specific MPI byte transfer layers (BTLs):
y*(<"59&*O!*0_d*O)*W=!;!&"3E&X<5"?&<```9#==O"&uOlI&"O&)(*D*V9**]65"?E617OZ9;G!"#<#9#%7*J*9#3)<>C:iY(<"59&`*(&E9";!<;#)*W=!;!&"3E&X<5"?&<```9#==O"&uOlI&"O&)(*W**&$!D-/95Q9"%0(-)bAE=*c1
Sun Microsystems, Inc.36 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
Obtain distribution plots of message sizes sent or received from a host:
y*(<"59&*O!*0_d*O)*W=!;!&"3E&X<5"?&<```9#==O"&uOlI&"O&)(*D*V9**]65"?E617OZ9;G"&=#<&7*J*u35)<;F&>5"?E6b7OZ=9EG9#3)<C:iY(<"59&`*(&E9";!<;#)*W=!;!&"3E&X<5"?&<```9#==O"&uOlI&"O&)(*W**&$!D-/95Q9"%0(-)bA=g@#E<
:&3#- ;;;;;;;;;;;;;9G+)$%+(#$+0,9;;;;;;;;;;;;; !0#,$
L U 1
N U]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] N
R U 1
Create distribution plots of message sizes by communicator, rank, and send/receive:
y*(<"59&*O!*0_d*O)*W=!;!&"3E&X<5"?&<```9#==O"&uOlI&"O&)(*D*V*9**]65"?E6b7OZ=9EG9#==8*5"?E6b7OZ=9EG!&&"8*5"?E6b7OZ=9EG#!7*V9**J*u35)<;F&>5"?E6b7OZ=9EG9#3)<C:iY(<"59&`*(&E9";!<;#)*W=!;!&"3E&X<5"?&<```9#==O"&uOlI&"O&)(*W**&$!D-/95S9"%0(-)bA5MNP5NRPN9 59%-!:
:&3#- ;;;;;;;;;;;;;9G+)$%+(#$+0,9;;;;;;;;;;;;; !0#,$
L U 1
N U]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] S
R U 15MNP5NRPN9 59)-,/
:&3#- ;;;;;;;;;;;;;9G+)$%+(#$+0,9;;;;;;;;;;;;; !0#,$
L U 1
N U]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] S
R U 1
Sun Microsystems, Inc.37 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
Chapter 4
Conclusion
The increased complexity of modern computing and storage architectures with
multiple CPUs, cores, threads, virtualized operating systems, networking, and
storage devices present serious challenges to system architects, administrators,
developers, and users. The requirements of high availability and reliability,
together with the increasing pressure on datacenter budgets and resources,
create an environment where it is necessary to maintain systems at a high level of
performance — without the ability to simply add resources to resolve performance
issues.
To achieve these seemingly conflicting objectives, Sun Microsystems provides a
comprehensive set of tools on the Solaris 10 OS with DTrace at their core, that
enable unprecedented levels of observability and insight into the workings of the
operating system and the applications running on it. These tools allow datacenter
professionals to quickly analyze and diagnose issues in the systems they maintain
without increasing operational risk, making the promise of observability as a driver
of consistent system performance and stability a reality.
AcknowledgmentsThomas Nau of the Infrastructure Department, Ulm University, provided the content
for the chapters on DTrace and its Visualization Tools and Performance Analysis
Examples, except for the section on Using DTrace with MPI.
The section on Using DTrace with MPI is based on the article Using the Solaris DTrace
Utility With Open MPI Applications, by Terry Dontje, Karen Norteman, and Rolf vande
Vaart of Sun Microsystems.
Sun Microsystems, Inc.38 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
Chapter 5
References
Table 3. Web sites for more information
Description URL
DLight tutorial http://blogs.sun.com/solarisdev/entry/project_d_light_tutorial
Another DLight tutorial http://developers.sun.com/sunstudio/documentation/tutorials/d_light_tutorial/index.html
Yet another DLight tutorial http://developers.sun.com/sunstudio/documentation/tutorials/dlight/
Chime Visualization Tool for DTrace http://opensolaris.org/os/project/dtrace-chime/
The NetBeans DTrace GUI Plugin Tutorial (includes Chime)
http://netbeans.org/kb/docs/ide/NetBeans_DTrace_GUI_Plugin_0_4.html
DTraceToolkit http://opensolaris.org/os/community/dtrace/dtracetoolkit/
Proposed cpc DTrace provider http://wikis.sun.com/display/DTrace/cpc+Provider
DTrace limitations page in the Solaris internals wiki
http://solarisinternals.com/wiki/index.php/DTrace_Topics_Limitations
Sun Studio documentation http://developers.sun.com/sunstudio/documentation/index.jsp
Sun Studio Technical Articles http://developers.sun.com/sunstudio/documentation/techart/
Sun HPC ClusterTools http://sun.com/clustertools
Sun HPC ClusterTools 8 Documentation
http://docs.sun.com/app/docs/prod/hpc.cluster80~hpc-clustertools8?l=en#hic
MPO/lgroup observability tools http://opensolaris.org/os/community/performance/numa/observability/tools/
John the Ripper password evaluation tool
http://openwall.com/john/
BIOS and Kernel Developer’s Guide for the AMD Athlon 64 and AMD Opteron Processors AMD publication #26094
http://support.amd.com/us/Processor_TechDocs/26094.PDF
Open MPI: Open-source High Performance Computing Web site
http://open-mpi.org
Using the Solaris DTrace Utility With Open MPI Applications
http://developers.sun.com/solaris/articles/dtrace_article.html
Sun Microsystems, Inc.39 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
Chapter 6
Appendix
Chime Screenshots
Figure 8. Processes sorted by system call load
Figure 9. Plotting of reads over time with the rfilio DTraceToolkit script
Sun Microsystems, Inc.40 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
Listings for Matrix Addition !"#$%&!' Example
gnuplot Program to Graph the Data Generated by !"#$%&!'
)-$9$-%*+,&39"0)$9-")9!030%9-,D)-$90#$"#$9c!"#$%&!'.&//.%-&/)2-")dE&<*l%54&%*w<;=&*6E7xE&<*g%54&%*w2/*PO959@&*=;EE&ESExE&<*g0%54&%*w,)E<"39<;#)*9#3)<SExE&<*I#"=5<*g**wyR/0?xE&<*I#"=5<*g0*wyRb?xE&<*N&g*";?@<*9&)<&"!%#<*w9!3<"59NG5((G9#%R(5<5x*3E;)?*_*<;<%&*wPO959@&*=;EE&E*9#%xV*T;<@*%;)&E*%<*/8*w9!3<"59NG5((G"#TR(5<5x*3E;)?*_*<;<%&*V*wPO959@&*=;EE&E*"#Tx*T;<@*%;)&E*%<*08w9!3<"59NG5((G9#%R(5<5xV9*3E;)?*c*<;<%&*w,)E<"39<;#)*9#3)<*9#%x*5l&E*l/g0*T;<@*%;)&E8Vw9!3<"59NG5((G"#TR(5<5x*3E;)?*c*<;<%&*w,)E<"39<;#)*9#3)<*"#TxV*5l&E*l/g0*T;<@*%;)&E
!"#$%&!'.&//.%012/&$& /R/ba* 01d1e* /* <;9N* _dbMec* Me/_1e_/0R/ba* 01d1e* /* <;9N* /1M1_/_* /e0M_c01dbR/ba* 01d1e* /* <;9N* e_/ddM* /01/a1cbbdR/Ma* 01d1e* /* <;9N* M11cd1* /0M1ddec__Rdbc* 01d1e* /* <;9N* /1/ca0b* /c0e1/abdcR10a* 01d1e* /* <;9N* /b01ddd* /1_cd/00/eRb0/* 01d1e* /* <;9N* 0/ee1b1* /ed/e/e10MR//a* 01d1e* /* <;9N* /eee_a0* /d00/dadeaR/0a* 01d1e* /* <;9N* /_daMb_* /0baabbMM/1R/dM*01d1e* /* <;9N* /_cbe/1* /0_/1beb1//R/da*01d1e* /* <;9N* 000ea1M* /eM0d000_/0R1bc*01d1e* /* <;9N* /0ca_cc* /1/_e1caM/bR10a*01d1e* /* <;9N* 00/bedM* /ee/1aba0/dR1da*01d1e* /* <;9N* /_cd/e1* /0_/d1/1c/_R1_a*01d1e* /* <;9N* /_dbaeM* /0b_0_1db/cR1ca*01d1e* /* <;9N* 00_/cdd* /M1/d//c0/eR/da*01d1e* /* <;9N* /cab1bc* /b_d_11_b/MR/_a*01d1e* /* <;9N* 00_/10_* /M11a/Md1/aR/ca*01d1e* /* <;9N* 00_/0/d* /M1/1cMe1
!"#$%&!'.&//.!032/&$& /R1ee* 01db_* /* <;9N* /Mbe* /M/c/10R01_* 01db_* /* <;9N* bacb* __a01bR/__* 01db_* /* <;9N* __a0* __a01dR///* 01db_* /* <;9N* __ad* __ad1_R//M* 01db_* /* <;9N* _cbaa10* _cd1deadcR1cM* 01db_* /* <;9N* _bdbd1_* _bdbadbMeR1eM* 01db_* /* <;9N* _ce1/ec* _ce1eMcMMR/0M* 01db_* /* <;9N* _Mcc10b* _Mccc0d0aR1MM* 01db_* /* <;9N* _ba0e0b* _bab0e__
Sun Microsystems, Inc.41 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
/1R1bM*01db_* /* <;9N* _bd/bae* _bd/ac/a//R1dM*01db_* /* <;9N* _cca0cb* _ccaMccc/0R1MM*01db_* /* <;9N* _Mde1ca* _Mdeceed/bR1dM*01db_* /* <;9N* _bab1Me* _babc_cb/dR1_M*01db_* /* <;9N* _cca1dc* _ccac_0M/_R1aM*01db_* /* <;9N* _M_1/0M* _M_1e0a//cR1_M*01db_* /* <;9N* _bM0bc/* _bM0adbb/eR1cM*01db_* /* <;9N* _ceaac1* _cM1_cbb/MR/1M*01db_* /* <;9N* _MbM/Mc* _MbMeMMa/aR1cM*01db_* /* <;9N* _bad0aa* _badM_Mb01R1eM*01db_* /* <;9N* _ce_a_0* _cec_bcM0/R1MM*01db_* /* <;9N* _e1dcd_* _e1_1deb00R1aM*01db_* /* <;9N* _eb0c_0* _eb0ac0e0bR/1M*01db_* /* <;9N* _eb/M_d* _eb0/c/M0dR//M*01db_* /* <;9N* _e0aMc/* _eb1/ecd0_R/0M*01db_* /* <;9N* _eb/eaM* _eb0/1Me0cR/bM*01db_* /* <;9N* _e0_ccc* _e0_aece0eR/dM*01db_* /* <;9N* _eb/edc* _eb01__d0MR10M*01db_* /* <;9N* daa1dd/* daa1e/de0aR/aM*01db_* /* <;9N* ccd1Me0* ccd/0b01b1R01M*01db_* /* <;9N* _eb/1c1* _eb/be1eb/R0/M*01db_* /* <;9N* _eb0MMc* _ebb/aceb0R10M*01db_* /* <;9N* d_ac/cc* d_acd/dbbbR1bM*01db_* /* <;9N* _eb1c_c* _eb1ae1ebdR1dM*01db_* /* <;9N* _ebb0d/* _ebb__0_b_R1_M*01db_* /* <;9N* _eb0/1b* _eb0d/dcbcR1cM*01db_* /* <;9N* _ebd01M* _ebd_/MebeR1eM*01db_* /* <;9N* _ebdeMM* _eb_1aMebMR1MM*01db_* /* <;9N* _eb10_b* _eb1_c/_baR1aM*01db_* /* <;9N* _eb0b1a* _eb0c/Mad1R/1M*01db_* /* <;9N* _eb0_aa* _eb0a/1_d/R//M*01db_* /* <;9N* _eb/ce_* _eb/aMdad0R/0M*01db_* /* <;9N* _eb/dbM* _eb/edMedbR/bM*01db_* /* <;9N* _ebbM/b* _ebd/0dcddR/dM*01db_* /* <;9N* _eb0e_e* _ebb1c_Md_R/_M*01db_* /* <;9N* _eb0bdM* _eb0cc1/dcR/cM*01db_* /* <;9N* _eb/bbd* _eb/cdb1deR/eM*01db_* /* <;9N* _ebb1c1* _ebbbe/ddMR/MM*01db_* /* <;9N* _eb/acc* _eb00ecddaR/aM*01db_* /* <;9N* _ebb01/* _ebb_/0/_1R01M*01db_* /* <;9N* _eb0cea* _eb0aa1c_/R0/M*01db_* /* <;9N* _e01eaM* _e0//1_e_0R10M*01db_* /* <;9N* d_adMcM* d_a_//cb_bR1bM*01db_* /* <;9N* _eb01eM* _eb0bMMe_dR1dM*01db_* /* <;9N* _eb0a_1* _ebb0c1M
Sun Microsystems, Inc.42 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
Listings for Matrix Addition (#))$&$ Example
gnuplot Program to Graph the Data Generated by (#))$&$
)-$9$-%*+,&39"0)$9-")9!030%9-,D)-$90#$"#$9c(#))$&$.&//.%-&/)2-")dE&<*l%54&%*w<;=&*6E7xE&<*g%54&%*w"&5(ESExE&<*I#"=5<*g*wyR/1?xE&<*N&g*";?@<*9&)<&""30$9c(#))$&$.&//.!032(5<5x*3E;)?*d*<;<%&*w"&5(E*9#%x*T;<@*%;)&E8Vc(#))$&$.&//.%012/&$&d9#)+,V9N9$+$3-9c%-&/)9%01d91+$D93+,-)
(#))$&$.&//.%012/&$& /* ("5=1* =&=G"&5(E* _bbacb* =&=G"&5(E* _bbac/0* ("5=1* =&=G"&5(E* eaabaa* =&=G"&5(E* eaad1/b* ("5=1* =&=G"&5(E* _a0a/d* =&=G"&5(E* _a0a/bd* ("5=1* =&=G"&5(E* _accb/* =&=G"&5(E* _accb/_* ("5=1* =&=G"&5(E* cdc01/* =&=G"&5(E* cdc010c* ("5=1* =&=G"&5(E* ec1a/0* =&=G"&5(E* ec1a/Me* ("5=1* =&=G"&5(E* a/d/d1* =&=G"&5(E* a/d/0/M* ("5=1* =&=G"&5(E* eab1a1* =&=G"&5(E* eab//1a* ("5=1* =&=G"&5(E* M1MM/a* =&=G"&5(E* M1MM0//1* ("5=1* =&=G"&5(E* M1c1Md* =&=G"&5(E* M1c1c0//* ("5=1* =&=G"&5(E* a_0_M_* =&=G"&5(E* a_0_a1/0* ("5=1* =&=G"&5(E* eMc_01* =&=G"&5(E* eMc_bd/b* ("5=1* =&=G"&5(E* a1Mc_/* =&=G"&5(E* a1Mc_//d* ("5=1* =&=G"&5(E* eMeadb* =&=G"&5(E* eMeadc/_* ("5=1* =&=G"&5(E* ea00be* =&=G"&5(E* ea00bd/c* ("5=1* =&=G"&5(E* a1add0* =&=G"&5(E* a1add0/e* ("5=1* =&=G"&5(E* eMMa//* =&=G"&5(E* eMMa/d/M* ("5=1* =&=G"&5(E* a1aMMd* =&=G"&5(E* a1aMMb/a* ("5=1* =&=G"&5(E* a1a1ee* =&=G"&5(E* a1a1e_01* ("5=1* =&=G"&5(E* MabMaa* =&=G"&5(E* MabMMe0/* ("5=1* =&=G"&5(E* e/0b_d* =&=G"&5(E* e/0bc100* ("5=1* =&=G"&5(E* e1cac_* =&=G"&5(E* e1caba
(#))$&$.&//.!032/&$& /* ("5=1* =&=G"&5(E* dbbce1* =&=G"&5(E* dbbcea0* ("5=1* =&=G"&5(E* debaea* =&=G"&5(E* debaeab* ("5=1* =&=G"&5(E* daadca* =&=G"&5(E* daade1d* ("5=1* =&=G"&5(E* _/_1Md* =&=G"&5(E* _/_1ea_* ("5=1* =&=G"&5(E* 0/Me_0/* =&=G"&5(E* 0/Me_/_c* ("5=1* =&=G"&5(E* 0/bb110* =&=G"&5(E* 0/bb110e* ("5=1* =&=G"&5(E* 0bbM1c/* =&=G"&5(E* 0bbMbMdM* ("5=1* =&=G"&5(E* 0000_00* =&=G"&5(E* 0000011a* ("5=1* =&=G"&5(E* 0/0ae/e* =&=G"&5(E* 0/0ae/_/1* ("5=1* =&=G"&5(E* 00aab/a* =&=G"&5(E* 00aacdM//* ("5=1* =&=G"&5(E* 00_eM_d* =&=G"&5(E* 00_e_0c
Sun Microsystems, Inc.43 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
/0* ("5=1* =&=G"&5(E* 0/b1a0c* =&=G"&5(E* 0/b1a0_/b* ("5=1* =&=G"&5(E* 0b/11dM* =&=G"&5(E* 0b/1bea/d* ("5=1* =&=G"&5(E* 00_//a1* =&=G"&5(E* 00_1M_a/_* ("5=1* =&=G"&5(E* 0/baec/* =&=G"&5(E* 0/baec0/c* ("5=1* =&=G"&5(E* 0b/Ma_d* =&=G"&5(E* 0b/a/c_/e* ("5=1* =&=G"&5(E* 00ce1aM* =&=G"&5(E* 00ccMMe/M* ("5=1* =&=G"&5(E* 0/bbeac* =&=G"&5(E* 0/bbeac/a* ("5=1* =&=G"&5(E* 0b/ce/c* =&=G"&5(E* 0b/e1db01* ("5=1* =&=G"&5(E* 00deedb* =&=G"&5(E* 00ded/_0/* ("5=1* =&=G"&5(E* 0/0ccdc* =&=G"&5(E* 0/0ccdc00* ("5=1* =&=G"&5(E* 0d00bb1* =&=G"&5(E* 0d00bb/0b* ("5=1* =&=G"&5(E* 0/0c/d1* =&=G"&5(E* 0/0c/ba0d* ("5=1* =&=G"&5(E* 00_eeae* =&=G"&5(E* 00_M/0_0_* ("5=1* =&=G"&5(E* 00a11ab* =&=G"&5(E* 00Maecc0c* ("5=1* =&=G"&5(E* 00M_/_e* =&=G"&5(E* 00M_/_e0e* ("5=1* =&=G"&5(E* 0d0/dMM* =&=G"&5(E* 0d0/dMM0M* ("5=1* =&=G"&5(E* 0/0e1cM* =&=G"&5(E* 0/0e1ce0a* ("5=1* =&=G"&5(E* 0b1c/0a* =&=G"&5(E* 0b1cd_/b1* ("5=1* =&=G"&5(E* 00d/b/e* =&=G"&5(E* 00d1aacb/* ("5=1* =&=G"&5(E* 0/0c/1a* =&=G"&5(E* 0/0c/1ab0* ("5=1* =&=G"&5(E* 0d0b/cc* =&=G"&5(E* 0d0b0bdbb* ("5=1* =&=G"&5(E* 0/0ca0c* =&=G"&5(E* 0/0cMc/bd* ("5=1* =&=G"&5(E* 0be0b__* =&=G"&5(E* 0be0cedb_* ("5=1* =&=G"&5(E* 0/e_1_M* =&=G"&5(E* 0/edebcbc* ("5=1* =&=G"&5(E* 0/cdd1_* =&=G"&5(E* 0/cdc/bbe* ("5=1* =&=G"&5(E* 0bab/M_* =&=G"&5(E* 0ba0aecbM* ("5=1* =&=G"&5(E* 0/0e1cM* =&=G"&5(E* 0/0e1cMba* ("5=1* =&=G"&5(E* 0bedddM* =&=G"&5(E* 0bedc_cd1* ("5=1* =&=G"&5(E* 001ba_M* =&=G"&5(E* 001be_/d/* ("5=1* =&=G"&5(E* 0/bdM_e* =&=G"&5(E* 0/b_/Med0* ("5=1* =&=G"&5(E* 0d/M_b/* =&=G"&5(E* 0d/M01/db* ("5=1* =&=G"&5(E* 0/0acac* =&=G"&5(E* 0/0aca_dd* ("5=1* =&=G"&5(E* 0d/_/bM* =&=G"&5(E* 0d/_de0d_* ("5=1* =&=G"&5(E* 0/b0b_1* =&=G"&5(E* 0/b01/cdc* ("5=1* =&=G"&5(E* 001M0bc* =&=G"&5(E* 001Md_/de* ("5=1* =&=G"&5(E* 0bd/ada* =&=G"&5(E* 0bd/ebddM* ("5=1* =&=G"&5(E* 0/0_M0c* =&=G"&5(E* 0/0_M0cda* ("5=1* =&=G"&5(E* 0d01a0b* =&=G"&5(E* 0d01a0d_1* ("5=1* =&=G"&5(E* 0/de/cb* =&=G"&5(E* 0/de/c0_/* ("5=1* =&=G"&5(E* 00MMMa0* =&=G"&5(E* 00Ma/1a_0* ("5=1* =&=G"&5(E* 00eb/ca* =&=G"&5(E* 00e0a_b_b* ("5=1* =&=G"&5(E* 0/b/_Md* =&=G"&5(E* 0/b/_Mb_d* ("5=1* =&=G"&5(E* 0d0/01c* =&=G"&5(E* 0d0/01e__* ("5=1* =&=G"&5(E* /MMa__/* =&=G"&5(E* /MMa__M_c* ("5=1* =&=G"&5(E* aedd10* =&=G"&5(E* aedd/__e* ("5=1* =&=G"&5(E* c01/1e* =&=G"&5(E* c011Mb
Sun Microsystems, Inc.44 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
Sun Studio Analyzer Hints to Collect Cache Misses on an UltraSPARC T1® Processor
-,./-)!90,+V,0%-.,0.UD1!"%0H=#4HG(&$)&*|5(("*|fPPo=#4HG(&$)&*m5(("*mfPPo=#4HG(&$)&*m"#9&EE*m,P=#4HG(&$)&*^@"&5(*>m,PA/111CL^}o,P=#4HG(&$)&*^@"&5(,P*^}o,P=#4HG(&$)&*+&9#)(E*>^+^fpmS/111111111C=#4HG(&$)&*p;)3<&E*>^+^fpmSc1111111111C=#4HG(&$)&*n+^/Gk5)N*>mfPPo\1l91CZZc=#4HG(&$)&*n+^/G20Q59@&2;)&*>mfPPo\1l0II91CZZc=#4HG(&$)&*n+^/G2/P5<5Q59@&2;)&*>mfPPo\1leI1CZZd=#4HG(&$)&*n+^/G+<"5)(*>Qmn,PC=#4HG(&$)&*n+^/GQ#"&*>Qmn,P\1l/9CZZ0=#4HG(&$)&*|fG20*|fPPoZZc=#4HG(&$)&*|fG2/*|fPPoZZd=#4HG(&$)&*mfG20*mfPPoZZc=#4HG(&$)&*mfG2/*mfPPoZZd|!5?&G0_cp*|fPPoZZ0Mm!5?&G0_cp*mfPPoZZ0M
Sun Microsystems, Inc.45 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
Analyzing I/O Performance Problems in a Virtualized Environment — a Complete DescriptionAs described in the summary of this example in the section “Analyzing I/O
Performance Problems in a Virtualized Environment” on page 19, the following is a
full description of resolving system sluggishness due to a high rate of I/O system
calls. To identify the cause it is necessary to determine which I/O system calls are
called and in what frequency, by which process, and to determine why. The first step
in the analysis is to narrow down the number of suspected processes, based on the
ratio of the time each process runs in system context versus user context, with the
:*)$&$4567 and "%)$&$4567 commands.
In the example described here in detail, a Windows 2008 server is virtualized on the
OpenSolaris™ OS using the Sun™ xVM hypervisor for x86 and runs fine. When the
system is activated as an Active Directory domain controller, it becomes extremely
sluggish.
As a first step in diagnosing this problem, the system is examined using the
:*)$&$9O command, which prints out a high-level summary of system activity every
five seconds, with the following results:
The results show that the number of system calls reported in the Eg column
increases rapidly as soon as the affected virtual machine is booted, and remains
quite high even though the CPU is constantly more than 79% idle, as reported in the
+/ column. The number of calls is constantly more than 9 thousand from the third
interval and on, ranging as high as 83 thousand. Clearly something is creating a very
high system call load.
In the next step in the analysis, the processes are examined with "%)$&$9;<9;*.
The ;< option instructs the "%)$&$ command to report statistics for each thread
separately, and the thread ID is the appended to the executable name. The ;* option
instructs the "%)$&$ command to report information such as the percentage of time
the process has spent processing system traps, page faults, latency time, waiting for
user locks, and waiting for CPU, while the virtual machine is booted:
'$D% =&=#"g "&V- /+)' H$) !"#% ( 1 )1&" H%-- %- *H "+"0 H% /- )% =1 *5 *L *M+, Eg !) #) Eg +/1 1 1 5QPMOQLN d1acb_c 1 1 1 1 1 1 1 M M 1 1 SSN NN5 Q5Q 1 L SR1 1 1 5QPMOQLN d1acb_c 1 1 1 1 1 1 1 1 1 1 1 SP5 N5P Q5M 1 1 /111 1 1 5QPM5NNR d1a__0M QS NPO 1 1 1 1 1 1 1 1 1 /1ed a01_ 5NLR 5 L SQ1 1 1 /ec1d_0d d1e0/dM d1e NOON 1 5 5 1 1 P P 1 1 /1__M QLQRM 010/b N 5Q QS1 1 1 5QOSORLR d1c0bc1 /10 RLR 1 1 1 1 1 M M 1 1 MNN5 NNQNQ /1_01 5 5N RO1 1 1 5QOSRNSL d1cdc0M L L 1 1 1 1 1 5 5 1 1 OMPM 0M_1M RQOL L M SO1 1 1 5QOSRN5L d1c_1cM 1 1 1 1 1 1 1 01 01 1 1 /e1ee Mb10d b1cbb O Q RR1 1 1 /e_aM/1M d1c_/bc 1 1 1 1 1 1 1 1 1 1 1 RSO5 NPNOP /c/d1 L N SM
Sun Microsystems, Inc.46 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
The first line of the output indicates that the ]-*#;/* process executes a very large
number of system calls — 200,000 — as shown in the >A< column. A factor of 100
more than U-,)$0%-/, the process with the second largest number of system calls.
At this stage, while it is known that the ]-*#;/* process is at fault, to solve the
problem it is necessary to identify which system calls are called. This is achieved with
the help of the following D-script, 9#3)<GEgE95%%ER(, which prints system call
rates for the top-ten processes/system call combinations every five seconds:
59*'[S3E"SE4;)S(<"59&*OEL99I"%&V*&9G90"$+0,9]#+-$M9*k.v,j*DN9***<;=&"*J*<;=&E<5=!:*SA*)5)#E&9#)(*<;=&E<5=!*ASO99TP9*EgE95%%```&)<"g*DQ9***]95%%G9#3)<6!;(8*&l&9)5=&8*!"#4&I3)97*J*9#3)<>C:R99TS9*<;9NO_E*D/1***<"3)9>]98*/1C:559**)#"=5%;F&>]95%%G9#3)<8*><;=&E<5=!O<;=&"C*S*/111111111C:5L9**!";)<5>wy_(*yO01E*yc](*yEV)x8*]95%%G9#3)<C:5M9**9%&5">]95%%G9#3)<C:5N9**!";)<I>wV)xC:5O9**<;=&"*J*<;=&E<5=!:5P9T
The k.v,j probe fires when the script starts and is used to initialize the timer.
The EgE95%%```&)<"g probe fires whenever a system call is called and the
system call name, the name of executable that called it, and the process ID are
saved in the !&33.!0#,$ aggregation.
The $+!';O) prints the information collected— line 10 truncates the aggregation
to its top 10 entries, line 12 prints the system call count, and line 13 clears the
contents of the aggregation.
When 9#3)<GEgE95%%ER( is run, of the top 10 process/system call pairs that are
printed, seven are executed by ]-*#;/*, with the number of executions of 3)--'
and 1%+$- the most prevalent by far:
m,P n+.ojfp. =>? >@> E?C EF< GF< <AB ><C 2f^ ^A_ ,Qq >A< +,Q morQ.++S2sm,P/cdM1 U:* P2S S2R 1R1 1R1 1R1 LQ ON 52R b1t 55N 2L6 1 u&=3O(=SbMPM U:* 1R/ 1R0 1R1 1R1 1R1 1R1 /11 1R1 N 5 LB 1 l&)E<#"&(S/5PMQN %00$ 1R1 1R/ 1R1 1R1 1R1 /11 1R1 1R1 /1 1 5B 1 (<"59&S/5PNN U:* 1R/ 1R/ 1R1 1R1 1R1 MM PP 1R1 OPS Q RMO 1 u&=3O(=SbLMSS %00$ 1R1 1R/ 1R1 1R1 1R1 1R1 /11 1R1 NS 1 MRR 1 EE@(S/5PMQP %00$ 1R1 1R/ 1R1 1R1 1R1 1R1 /11 1R1 MR 1 LSQ 1 !"E<5<S///e1_ U:* 1R1 1R/ 1R1 1R1 1R1 _1 _1 1R1 OQP 5O ROR 1 u&=3O(=Sd5POMP %00$ 1R1 1R/ 1R1 1R1 1R1 1R1 /11 1R1 NR 1 LRP 1 B)9B;&T&"S/222^#<5%`*bc*!"#9&EE&E8*/0a*%T!E8*%#5(*5B&"5?&E`*1Rcd8*1Rbe8*1Rb/
Sun Microsystems, Inc.47 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
'*RS9#3)<GEgE95%%ER(01a ,)!/ LQ U)$&$
5PMQP "%)$&$ MO "%-&//cdM1 ]-*#;/* 55Q !#%%EgE/cdM1 ]-*#;/* 5LM %-&//cdM1 ]-*#;/* 5MP +0!$3//e1_ ]-*#;/* 5NO !#%%EgE5PNN ]-*#;/* 5O5 !#%%EgE
5PMQN /$%&!- MM5 +0!$3/cdM1 ]-*#;/* MOO5L 3)--'/cdM1 ]-*#;/* b_c1e 1%+$-
The next stage of the investigation is to identify why ]-*#;/* calls these I/O system
calls. To answer this, an analysis of the I/O is implemented with another D-script —
]-*#;)$&$2/, which collects the statistics of the I/O calls performed by ]-*#;/*,
while focusing on 3)--' and 1%+$-:
59*'[S3E"SE4;)S(<"59&*OEL99I"%&V*&9G90"$+0,9]#+-$M9*k.v,j*DN9****E&&N**J*12:O99TP9*EgE95%%``%E&&N`&)<"gQ9*S*&l&9)5=&*JJ*wu&=3O(=x*\\*[5"?0*\\*E&&N*SR9*DS9*****]%E&&N65"?18*5"?/OE&&N7*J*9#3)<>C:/1*****E&&N*J*5"?/:559T5L9EgE95%%``%E&&N`&)<"g5M9S*&l&9)5=&*JJ*wu&=3O(=x*\\*[5"?0*\\*[E&&N*S5N9D5O9****E&&N*J*5"?/:5P9T
The two instances of the EgE95%%``%E&&N`&)<"g probe have predicates (lines 7
and 13) that ensure they are called only if the triggering call to 3)--' sets the file
pointer to an absolute value, by testing that &%VL (whence) is zero, or +..tG+.^.
The probes identify the file according to the file descriptor (5"?1).
To determine the I/O pattern, the D-script saves the last absolute position of the
file pointer passed to 3)--'47 in the variable )--' in line 10. The difference
between the current and previous position of the file pointer is used as the second
index of the aggregation in line 9.
5Q9EgE95%%``T";<&`&)<"g5R9S*&l&9)5=&*JJ*wu&=3O(=x*S5S9D01***]T";<&65"?18*5"?07*J*9#3)<>C:L59T
The EgE95%%``T";<&`&)<"g probe counts the number of times the 1%+$-
system call is called with a specific number of bytes written.
Sun Microsystems, Inc.48 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
LL9<;9NO_E*DLM9**<"3)9>]%E&&N8*_C:LN9**<"3)9>]T";<&8*_C:LO9**!";)<I>w%E&&N*I(&E9*****#IIE&<******9#3)<V)xC:LP9**!";)<5>w******y_%(*y/1%(*y/1]%3V)x8*]%E&&NC:LQ9**!";)<I>wV)T";<&*I(&E9*******E;F&******9#3)<V)xC:LR9**!";)<5>w******y_%(*y/1%(*y/1]%3V)x8*]T";<&C:LS9**!";)<I>wV)V)xC:b1***9%&5">]%E&&NC:M59**9%&5">]T";<&C:ML9T
Lines 23 and 24 truncate the saved counts to the top five values.
Lines 25 and 26 print out the count of calls to 3)--' and lines 27 and 28 print out
the count of calls to 1%+$-.
Following are results of a single invocation of ]-*#;)$&$2/:
3)--'9H/-)!9990HH)-$9999!0#,$9999999999O9999999LP9999999LR9999999999O9999999LS9999999LR**********_********1*******d09999999999O9999999L59999999NL**********_********/***/bd_d11%+$-9H/-)!99999)+W-9999!0#,$9999999999O9999999L59999999NL9999999995O99999999N9999999ON9999999995P99999999N9999999PM9999999995N99999999N999999NN59999999999O9999999959995MNOON
The results of this invocation show that all calls to 3)--' with an absolute position
and most of the calls to 1%+$- target a single file. The vast majority of the calls to
3)--' move the file pointer of this file to an absolute position that is one byte away
from the previous position, and the vast majority of the calls to the 1%+$- system
call write a single byte. In other words, ]-*#;/* writes a data stream as single
bytes, without any buffering — an access pattern that is inherently slow.
Next, the !$%&E command is used to identify the file accessed by ]-*#;/* through
file descriptor O as the virtual Windows system disk:
'*!$%&E*/cdM1222_`*+G,ho.v*=#(&`1c11*(&B`/M08c__db*;)#`0c*3;(`c1*?;(`1*E;F&`//c0ba0be/0******rGoPsoUrG2fov.h,2.******SlB=S@&"=;5S(;ENG9SB(;ENRB=(N222
Sun Microsystems, Inc.49 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
To drill down further into this problem, it is necessary to investigate where the
calls to 3)--' originate by viewing the call stack. This is performed with the
]-*#;!&33)$&!'2/ script, which prints the three most common call stacks for the
3)--' and 1%+$- system calls every five seconds:
59*'[S3E"SE4;)S(<"59&*OEL99I"%&V*&9G90"$+0,9]#+-$M9*EgE95%%``%E&&N`&)<"g8*EgE95%%``T";<&`&)<"gN9*S*&l&9)5=&*JJ*wu&=3O(=x*SO9*DP9*****]96!"#4&I3)98*3E<59N>C7*J*9#3)<>C:Q9TR9<;9NO_E*DS9*****<"3)9>]98*bC:/1*****!";)<5>]9C:559****9%&5">]9C:5L9T
Line 6 saves the call stack of 3)--' and 1%+$-.
Line 10 prints the three most common stacks.
Following is the most common stack trace of the calls to 1%+$-, extracted from a
single invocation of ]-*#;!&33)$&!'2/:
1%+$-****%;49RE#R/~GGT";<&L1l5****u&=3O(=~o^h;%&s";<&L1lbe****u&=3O(=~o^h;%&s";<&f<L1ldM****u&=3O(=~B=(Ns";<&P&E9";!<#"L1l/(_****u&=3O(=~B=(Nh%3E@,=5?&L1l0b****u&=3O(=~B=(Nh%3E@L1la****u&=3O(=~|Ph%3E@L1la/****u&=3O(=~B(;ENG�3E@L1l/9****u&=3O(=~4("BG�3E@L1l0&****u&=3O(=~;(>";<&G(=5G94L1l/Me****u&=3O(=~4("BG5;#G4@G94L1l/c****u&=3O(=~u&=3G4@G!#%%L1l0(****u&=3O(=~=5;)G%##!GT5;<L1l009****u&=3O(=~=5;)G%##!L1le5****u&=3O(=~=5;)L1l/MMc****u&=3O(=~GE<5"<L1lc99999LRQOR
By examining the 1%+$- stack trace, it seems that the virtual machine is flushing the
disk cache very often, apparently for every byte. This could be the result of a disabled
disk cache. Further investigation uncovers the fact that if a Microsoft Windows server
acts as an Active Directory domain controller, the Active Directory directory service
performs unbuffered writes and tries to disable the disk write cache on volumes
hosting the Active Directory database and log files. Active Directory also works in this
manner when it runs in a virtual hosting environment. Clearly this issue can only be
solved by modifying the behavior of the Microsoft Windows 2008 virtual server.
Sun Microsystems, Inc.50 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
Once the problem is resolved on the Microsoft Windows 2008 virtual server, the
]-*#;)$&$;]#&,$2/ D-script is implemented to print the patterns of calls to
1%+$-:
59*'[S3E"SE4;)S(<"59&*OEL99I"%&V*&9G90"$+0,9]#+-$M9*EgE95%%``T";<&`&)<"gN9*S*&l&9)5=&*JJ*wu&=3O(=x*\\*5"?1*JJ*X/*SO9*DP9***]T";<&u6wT";<&x7*J*u35)<;F&>5"?0C:Q99TR9*<;9NO_E*DS9***!";)<5>]T";<&uC:/1***9%&5">]T";<&uC:559T
Invoking it shows a much improved pattern of calls, in term of the size of each write:
'*RSu&=3OE<5<Ou35)<R(*_1%+$-99:&3#-99;;;;;;;;;;;;;9G+)$%+(#$+0,9;;;;;;;;;;;;;9!0#,$*****0_c*U*****************************************1*****_/0*U]]]]]]***********************************/d****/10d*U*****************************************1****01dM*U]]]]*************************************/1****d1ac*U]]]]]]]]]]]]]]]]*************************ba****M/a0*U]]]]]]]]]]]]]]]**************************bc***/cbMd*U*****************************************1
Sun Microsystems, Inc.51 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
Tracking MPI Requests with DTrace — *"+)$&$2/
59k.v,jL9DM9***"&9BEG4g<&EJ1:*"&9BEG59<J1:*"&9BEG!#E<&(GE;F&J1:*N9***"&9BEG3)&l!GE;F&J1:*"&9BEG!#E<&(G=5<9@&EJ1:*O9***"&9BEG3)&l!G=5<9@&EJ1:*P9***E&)(EG59<J1:*E&)(EG4g<&EJ1:*#3<!3<G9)<*J*1:Q9***!";)<I>w;)!3<>^#<5%C*�OE;F&E******�Op5<9@&ExC:R9***!";)<I>w************#3<!3<V)xC:S9***!";)<I>w4g<&E*59<;B&*!#E<&(*3)&l!*!#E<&(*3)&l!*4g<&EV)xC:/1***!";)<I>w59<;B&V)xy_(*yc(*yc(*y_(*yc(*y_(******y_(*yc(V)x8*559**********"&9BEG4g<&E8*"&9BEG59<8*"&9BEG!#E<&(GE;F&8*5L99999999999%-!:).#,-U".)+W-89%-!:)."0)$-/.*&$!D-)85M9**********"&9BEG3)&l!G=5<9@&E8*E&)(EG4g<&E8*E&)(EG59<C:5N9T5O9SA*m";)<*+<5<;E<;9E*&B&"g*/*E&9*AS5P9!"#$%&```<;9NO/E&95Q9D5R9**!";)<I>wy_(*yc(*yc(*y_(*yc(*y_(******y_(*yc(*V)x8*5S9**********"&9BEG4g<&E8*"&9BEG59<8*"&9BEG!#E<&(GE;F&8*01*9999999999%-!:).#,-U".)+W-89%-!:)."0)$-/.*&$!D-)89L59**********"&9BEG3)&l!G=5<9@&E8*E&)(EG4g<&E8*E&)(EG59<C:LL9**LL#3<!3<G9)<:LM9TLN9SA*Q#%%&9<*f9<;B&*+&)(*o&u3&E<E*ASLO9=!;!&"3E&X<5"?&<```9#==O"&uO59<;B5<&LP9S5"?E6b7OZ=9EG#!JJxE&)(xSLQ9DLR9**LLE&)(EG59<:LS9Tb1*SA*Q#%%&9<*o&=#B5%*#I*+&)(*o&u3&E<E*ASM59=!;!&"3E&X<5"?&<```9#==O"&uO)#<;IgML9S5"?E6b7OZ=9EG#!JJxE&)(xSMM9DMN9**OOE&)(EG59<:MO9TMP9SA*Q#%%&9<*4g<&E*+&)<*ASMQ9=!;!&"3E&X<5"?&<```9#==O"&uOlI&"O&)(MR9S5"?E6b7OZ=9EG#!JJxE&)(xSMS9Dd1***E&)(EG4g<&E*LJ*5"?E6b7OZ=9EG9#3)<:N59TNL9SA*Q#%%&9<*f9<;B&*o&9B*o&u3&E<*ASNM9=!;!&"3E&X<5"?&<```9#==O"&uO59<;B5<&NN9S5"?E6b7OZ=9EG#!JJx"&9BxSNO9DNP9**LL"&9BEG59<:NQ9TNR9SA*Q#%%&9<*o&=#B5%*#I*o&9B*o&u3&E<*ASNS9=!;!&"3E&X<5"?&<```9#==O"&uO)#<;Ig_1*S5"?E6b7OZ=9EG#!JJx"&9Bx\\"&9BEG59<Z1SO59DOL9**OO"&9BEG59<:OM9TON9SA*Q#%%&9<*o&u3&E<*m%59&(*#)*m#E<&(*�*ASOO9=!;!&"3E&X<5"?&<```9#==O"&uO;)E&"<O;)O!#E<&(OuOP9S5"?E6b7OZ=9EG#!JJx"&9BxSOQ9DOR9***LL"&9BEG!#E<&(GE;F&:OS9T
Sun Microsystems, Inc.52 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
c1*SA*Q#%%&9<*pE?*=5<9@&(*m#E<&(*�*ASP59=!;!&"3E&X<5"?&<```9#==O=E?O=5<9@O!#E<&(O"&uPL9S5"?E6b7OZ=9EG#!JJx"&9BxSPM9DPN9***LL"&9BEG!#E<&(G=5<9@&E:PO9TPP9=!;!&"3E&X<5"?&<```9#==O=E?O=5<9@O!#E<&(O"&uPQ9S5"?E6b7OZ=9EG#!JJx"&9Bx\\"&9BEG!#E<&(GE;F&Z1SPR9DPS9***OO"&9BEG!#E<&(GE;F&:e1*TQ59SA*Q#%%&9<*=&EE5?&E*;)*3)&l!*�*ASQL9=!;!&"3E&X<5"?&<```9#==O=E?O;)E&"<O;)O3)&lOuQM9S5"?E6b7OZ=9EG#!JJx"&9BxSQN9DQO9***LL"&9BEG3)&l!GE;F&:QP9TQQ9SA*Q#%%&9<*=&EE5?&E*"&=#B&(*I"#=*3)&l!*�*ASQR9=!;!&"3E&X<5"?&<```9#==O=E?O"&=#B&OI"#=O3)&lOuQS9S5"?E6b7OZ=9EG#!JJx"&9Bx\\"&9BEG3)&l!GE;F&Z1SM1*DR59***OO"&9BEG3)&l!GE;F&:RL9TRM9SA*Q#%%&9<*=&EE5?&E*"&=#B&(*I"#=*3)&l!*�*ASRN9=!;!&"3E&X<5"?&<```9#==O"&uO=5<9@O3)&lRO9S5"?E6b7OZ=9EG#!JJx"&9BxSRP9DRQ9***LL"&9BEG3)&l!G=5<9@&E:RR9TRS9SA*Q#%%&9<*9#3)<*#I*4g<&E*"&9;&B&(*ASa1*=!;!&"3E&X<5"?&<```9#==O"&uOlI&"O9#)<;)3&S59Sf"?E6b7OZ=9EG#!JJx"&9BxSSL9DSM9***"&9BEG4g<&E*LJ*5"?E6b7OZ=9EG9#3)<:SN9T
Sun Microsystems, Inc. 4150 Network Circle, Santa Clara, CA 95054 USA Phone 1-650-960-1300 or 1-800-555-9SUN (9786) Web sun.com© 2009 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, the Sun logo, Java, OpenSolaris, Solaris, Sun HPC ClusterTools, and Sun Fire are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the US and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc. AMD Opteron is a trademarks or registered trademarks of Advanced Micro Devices. Intel Xeon is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. Information subject to change without notice. Printed in USA 09/09
Sun Microsystems, Inc.Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC
Sun Microsystems, Inc. 4150 Network Circle, Santa Clara, CA 95054 USA Phone 1-650-960-1300 or 1-800-555-9SUN (9786) Web sun.com