
  • Red Hat Enterprise Linux 7 SystemTap Beginners Guide

    Introduction to SystemTap

    William Cohen, Red Hat

    Don Domingo, Red Hat

    Jacquelynn East, Red Hat

  • Legal Notice

    Copyright © 2013 Red Hat, Inc. and others.

    This document is licensed by Red Hat under the Creative Commons Attribution-ShareAlike 3.0 Unported License. If you distribute this document, or a modified version of it, you must provide attribution to Red Hat, Inc. and provide a link to the original. If the document is modified, all Red Hat trademarks must be removed.

    Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law.

    Red Hat, Red Hat Enterprise Linux, the Shadowman logo, JBoss, MetaMatrix, Fedora, the Infinity Logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries.

    Linux ® is the registered trademark of Linus Torvalds in the United States and other countries.

    Java ® is a registered trademark of Oracle and/or its affiliates.

    XFS ® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries.

    MySQL ® is a registered trademark of MySQL AB in the United States, the European Union and other countries.

    Node.js ® is an official trademark of Joyent. Red Hat Software Collections is not formally related to or endorsed by the official Joyent Node.js open source or commercial project.

    The OpenStack ® Word Mark and OpenStack Logo are either registered trademarks/service marks or trademarks/service marks of the OpenStack Foundation, in the United States and other countries and are used with the OpenStack Foundation's permission. We are not affiliated with, endorsed or sponsored by the OpenStack Foundation, or the OpenStack community.

    All other trademarks are the property of their respective owners.

    Abstract

    This guide provides basic instructions on how to use SystemTap to monitor different subsystems of Red Hat Enterprise Linux 7 in finer detail. The SystemTap Beginners Guide is recommended for users who have taken RHCT or have a similar level of expertise in Red Hat Enterprise Linux 7.

    http://creativecommons.org/licenses/by-sa/3.0/

  • Table of Contents

    Preface

    Chapter 1. Introduction
        1.1. Documentation Goals
        1.2. SystemTap Capabilities

    Chapter 2. Using SystemTap
        2.1. Installation and Setup
        2.2. Generating Instrumentation for Other Computers
        2.3. Running SystemTap Scripts

    Chapter 3. Understanding How SystemTap Works
        3.1. Architecture
        3.2. SystemTap Scripts
        3.3. Basic SystemTap Handler Constructs
        3.4. Associative Arrays
        3.5. Array Operations in SystemTap
        3.6. Tapsets

    Chapter 4. Useful SystemTap Scripts
        4.1. Network
        4.2. Disk
        4.3. Profiling
        4.4. Identifying Contended User-Space Locks

    Chapter 5. References

    Revision History

  • Preface


  • Chapter 1. Introduction

    SystemTap is a tracing and probing tool that allows users to study and monitor the activities of the operating system (particularly, the kernel) in fine detail. It provides information similar to the output of tools like netstat, ps, top, and iostat; however, SystemTap is designed to provide more filtering and analysis options for collected information.

    1.1. Documentation Goals

    SystemTap provides the infrastructure to monitor the running Linux system for detailed analysis. This can assist administrators and developers in identifying the underlying cause of a bug or performance problem.

    Without SystemTap, monitoring the activity of a running kernel would require a tedious instrument, recompile, install, and reboot sequence. SystemTap is designed to eliminate this, allowing users to gather the same information by simply running user-written SystemTap scripts.

    However, SystemTap was initially designed for users with intermediate to advanced knowledge of the kernel. This makes SystemTap less useful to administrators or developers with limited knowledge of and experience with the Linux kernel. Moreover, much of the existing SystemTap documentation is similarly aimed at knowledgeable and experienced users. This makes learning the tool similarly difficult.

    To lower these barriers, the SystemTap Beginners Guide was written with the following goals:

    To introduce users to SystemTap, familiarize them with its architecture, and provide setup instructions for all kernel types.

    To provide pre-written SystemTap scripts for monitoring detailed activity in different components of the system, along with instructions on how to run them and analyze their output.

    1.2. SystemTap Capabilities

    Flexibility: SystemTap's framework allows users to develop simple scripts for investigating and monitoring a wide variety of kernel functions, system calls, and other events that occur in kernel-space. With this, SystemTap is not so much a tool as it is a system that allows you to develop your own kernel-specific forensic and monitoring tools.

    Ease-Of-Use: as mentioned earlier, SystemTap allows users to probe kernel-space events without having to resort to the lengthy instrument, recompile, install, and reboot the kernel process.

    Most of the SystemTap scripts enumerated in Chapter 4, Useful SystemTap Scripts demonstrate system forensics and monitoring capabilities not natively available with other similar tools (such as top, oprofile, or ps). These scripts are provided to give readers extensive examples of the application of SystemTap, which in turn will educate them further on the capabilities they can employ when writing their own SystemTap scripts.


  • Chapter 2. Using SystemTap

    This chapter instructs users how to install SystemTap, and provides an introduction on how to run SystemTap scripts.

    2.1. Installation and Setup

    To deploy SystemTap, install the SystemTap packages along with the corresponding set of -devel, -debuginfo, and -debuginfo-common-arch packages for the kernel. To use SystemTap on more than one kernel where a system has multiple kernels installed, install the -devel and -debuginfo packages for each of those kernel versions.

    These procedures will be discussed in detail in the following sections.

    Important

    Many users confuse -debuginfo with -debug. Remember that the deployment of SystemTap requires the installation of the -debuginfo package of the kernel, not the -debug version of the kernel.

    2.1.1. Installing SystemTap

    To deploy SystemTap, install the following RPMs:

    systemtap

    systemtap-runtime

    Assuming that yum is installed in the system, these two RPMs can be installed with yum install systemtap systemtap-runtime. Install the required kernel information RPMs before using SystemTap.

    2.1.2. Installing Required Kernel Information RPMs

    SystemTap needs information about the kernel in order to place instrumentation in it (i.e. probe it). This information, which allows SystemTap to generate the code for the instrumentation, is contained in the matching -devel, -debuginfo, and -debuginfo-common-arch packages for the kernel. The necessary -devel and -debuginfo packages for the ordinary "vanilla" kernel are as follows:

    kernel-debuginfo

    kernel-debuginfo-common-arch

    kernel-devel

    Likewise, the necessary packages for the PAE kernel would be kernel-PAE-debuginfo, kernel-PAE-debuginfo-common-arch, and kernel-PAE-devel.

    To determine what kernel your system is currently using, use:

    uname -r


    For example, if you wish to use SystemTap on kernel version 2.6.32-53.el6 on an i686 machine, then you would need to download and install the following RPMs:

    kernel-debuginfo-2.6.32-53.el6.i686.rpm

    kernel-debuginfo-common-i686-2.6.32-53.el6.i686.rpm

    kernel-devel-2.6.32-53.el6.i686.rpm

    Important

    The version, variant, and architecture of the -devel, -debuginfo, and -debuginfo-common-arch packages must match the kernel to be probed with SystemTap exactly.
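
    As a rough illustration of the matching rule, the following shell sketch prints the three RPM file names expected for a given kernel release and architecture. The function name kernel_info_rpms is hypothetical and not part of SystemTap; the naming pattern is copied from the example file names listed earlier, and the sketch assumes the ordinary (non-variant) kernel.

```shell
#!/bin/sh
# Hypothetical helper (not part of SystemTap): print the kernel information
# RPM file names for an ordinary kernel, following the naming pattern of the
# example file names in this section.
#   $1: kernel version-release without the architecture, e.g. 2.6.32-53.el6
#   $2: architecture, e.g. i686
kernel_info_rpms() {
    ver=$1
    arch=$2
    printf 'kernel-debuginfo-%s.%s.rpm\n' "$ver" "$arch"
    printf 'kernel-debuginfo-common-%s-%s.%s.rpm\n' "$arch" "$ver" "$arch"
    printf 'kernel-devel-%s.%s.rpm\n' "$ver" "$arch"
}

# Example: kernel_info_rpms 2.6.32-53.el6 i686
```

    If the output of uname -r on your system already ends with the architecture, strip that suffix before passing the version to the function.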

    To obtain a list of the channels SystemTap needs on the system, use the following script:

    #! /bin/bash
    pkg=`rpm -q --whatprovides "redhat-release"`
    releasever=`rpm -q --qf "%{version}" $pkg`
    variant=`echo $releasever | tr -d "[:digit:]" | tr "[:upper:]" "[:lower:]" `
    if test -z "$variant"; then
      echo "No Red Hat Enterprise Linux variant (workstation/client/server) found."
      exit 1
    fi
    version=`echo $releasever | tr -cd "[:digit:]"`
    base=`uname -i`
    echo "rhel-$base-$variant-$version"
    echo "rhel-$base-$variant-$version-debuginfo"
    echo "rhel-$base-$variant-optional-$version-debuginfo"
    echo "rhel-$base-$variant-optional-$version"

    After the channels have been added, install the required -devel, -debuginfo, and -debuginfo-common-arch packages for the kernel using the command debuginfo-install kernelname-version. Replace kernelname with the appropriate kernel variant name (for example, kernel-PAE), and version with the target kernel's version. For example, to install the required kernel information packages for the kernel-PAE-2.6.32-53.el6 kernel, run:

    debuginfo-install kernel-PAE-2.6.32-53.el6

    2.1.3. Initial Testing

    If the kernel to be probed with SystemTap is currently being used, it is possible to immediately test whether the deployment was successful. If a different kernel is to be probed, reboot and load the appropriate kernel.

    To start the test, run the following command:

    stap -v -e 'probe vfs.read {printf("read performed\n"); exit()}'

    This command simply instructs SystemTap to print read performed then exit properly once a virtual file system read is detected. If the SystemTap deployment was successful, you should get output similar to the following:

    Pass 1: parsed user script and 45 library script(s) in 340usr/0sys/358real ms.
    Pass 2: analyzed script: 1 probe(s), 1 function(s), 0 embed(s), 0 global(s) in 290usr/260sys/568real ms.
    Pass 3: translated to C into "/tmp/stapiArgLX/stap_e5886fa50499994e6a87aacdc43cd392_399.c" in 490usr/430sys/938real ms.
    Pass 4: compiled C into "stap_e5886fa50499994e6a87aacdc43cd392_399.ko" in 3310usr/430sys/3714real ms.
    Pass 5: starting run.
    read performed
    Pass 5: run completed in 10usr/40sys/73real ms.

    The last three lines of the output (i.e. beginning with Pass 5) indicate that SystemTap was able to successfully create the instrumentation to probe the kernel, run the instrumentation, detect the event being probed (in this case, a virtual file system read), and execute a valid handler (print text then close it with no errors).
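
    The check described above can be automated with a small shell sketch. The function name stap_run_completed is hypothetical; it only looks for the final "Pass 5: run completed" line in captured output, which is a simplification of what a person would verify by reading the full transcript.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the captured "stap -v" output contains
# the final "Pass 5: run completed" line shown in the sample output.
# Reads the captured output on standard input.
stap_run_completed() {
    grep -q '^Pass 5: run completed'
}

# Example (both streams are merged here on the assumption that the pass
# messages may be written to standard error):
#   stap -v -e 'probe vfs.read {printf("read performed\n"); exit()}' 2>&1 \
#       | stap_run_completed && echo "deployment looks OK"
```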

    2.2. Generating Instrumentation for Other Computers

    When users run a SystemTap script, a kernel module is built out of that script. SystemTap then loads the module into the kernel, allowing it to extract the specified data directly from the kernel (refer to Procedure 3.1, “SystemTap Session” in Section 3.1, “Architecture” for more information).

    Normally, SystemTap scripts can only be run on systems where SystemTap is deployed (as in Section 2.1, “Installation and Setup”). This could mean that to run SystemTap on ten systems, SystemTap needs to be deployed on all those systems. In some cases, this may be neither feasible nor desired. For instance, corporate policy may prohibit an administrator from installing RPMs that provide compilers or debug information on specific machines, which will prevent the deployment of SystemTap.

    To work around this, use cross-instrumentation. Cross-instrumentation is the process of generating SystemTap instrumentation modules from a SystemTap script on one computer to be used on another computer. This process offers the following benefits:

    The kernel information packages for various machines can be installed on a single host machine.

    Each target machine only needs one RPM to be installed to use the generated SystemTap instrumentation module: systemtap-runtime.

    Note

    For the sake of simplicity, the following terms will be used throughout this section:

    instrumentation module: the kernel module built from a SystemTap script. The module is built on the host system and loaded on the target kernel of the target system.

    host system: the system on which the instrumentation modules are compiled (from SystemTap scripts), to be loaded on target systems.

    target system: the system for which the instrumentation module is built (from SystemTap scripts).

    target kernel: the kernel of the target system. This is the kernel that loads and runs the instrumentation module.

    Procedure 2.1. Configuring a Host System and Target Systems


    1. Install the systemtap-runtime RPM on each target system.

    2. Determine the kernel running on each target system by running uname -r on each target system.

    3. Install SystemTap on the host system. The instrumentation module will be built for the target systems on the host system. For instructions on how to install SystemTap, refer to Section 2.1.1, “Installing SystemTap”.

    4. Using the target kernel version determined earlier, install the target kernel and related RPMs on the host system by the method described in Section 2.1.2, “Installing Required Kernel Information RPMs”. If multiple target systems use different target kernels, repeat this step for each different kernel used on the target systems.

    After performing Procedure 2.1, “Configuring a Host System and Target Systems”, the instrumentation module (for any target system) can now be built on the host system.

    To build the instrumentation module, run the following command on the host system (be sure to specify the appropriate values):

    stap -r kernel_version script -m module_name -p4

    Here, kernel_version refers to the version of the target kernel (the output of uname -r on the target machine), script refers to the script to be converted into an instrumentation module, and module_name is the desired name of the instrumentation module.

    Note

    To determine the architecture notation of a running kernel, run uname -m.

    Once the instrumentation module is compiled, copy it to the target system and then load it using:

    staprun module_name.ko

    For example, to create the instrumentation module simple.ko for the target kernel 2.6.32-53.el6 from a one-line script (passed inline with -e here, rather than read from a file such as simple.stp), use the following command:

    stap -r 2.6.32-53.el6 -e 'probe vfs.read {exit()}' -m simple -p4

    This will create a module named simple.ko. To use the instrumentation module simple.ko, copy it to the target system and run the following command (on the target system):

    staprun simple.ko

    Important

    The host system must be the same architecture and running the same distribution of Linux as the target system in order for the built instrumentation module to work.
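
    The two halves of the cross-instrumentation workflow can be summarized in a short shell sketch. Both function names are hypothetical, and the functions only print the commands described in this section instead of running them, so the result can be reviewed before anything is built or loaded. Arguments are assumed to contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the host-side build command for a
# cross-instrumentation module.
#   $1: target kernel version (output of uname -r on the target system)
#   $2: path to the SystemTap script
#   $3: desired module name
cross_build_cmd() {
    printf 'stap -r %s %s -m %s -p4\n' "$1" "$2" "$3"
}

# Hypothetical helper: print the command to run on the target system once
# the compiled module has been copied there.
#   $1: module name (without the .ko suffix)
cross_run_cmd() {
    printf 'staprun %s.ko\n' "$1"
}

# Example:
#   cross_build_cmd 2.6.32-53.el6 simple.stp simple
#   cross_run_cmd simple
```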

    2.3. Running SystemTap Scripts


    SystemTap scripts are run through the command stap. stap can run SystemTap scripts from standard input or from a file.

    Running stap and staprun requires elevated privileges to the system. However, not all users can be granted root access just to run SystemTap. In some cases, for instance, a non-privileged user may need to run SystemTap instrumentation on their machine.

    To allow ordinary users to run SystemTap without root access, add them to both of these user groups:

    stapdev

    Members of this group can use stap to run SystemTap scripts, or staprun to run SystemTap instrumentation modules.

    Running stap involves compiling SystemTap scripts into kernel modules and loading them into the kernel. This requires elevated privileges to the system, which are granted to stapdev members. Unfortunately, such privileges also grant effective root access to stapdev members. As such, only grant stapdev group membership to users who can be trusted with root access.

    stapusr

    Members of this group can only use staprun to run SystemTap instrumentation modules. In addition, they can only run those modules from /lib/modules/kernel_version/systemtap/. Note that this directory must be owned only by the root user, and must only be writable by the root user.

    Running SystemTap Scripts

    In order to run SystemTap scripts a user must be in both the stapdev and stapusr groups.
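
    A quick way to confirm the group requirement is to inspect a user's group list. The sketch below is illustrative only: the function name in_both_stap_groups is hypothetical, and it expects the space-separated output of id -nG as its argument.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given space-separated group list
# contains both stapdev and stapusr.
#   $1: group list, e.g. the output of: id -nG username
in_both_stap_groups() {
    case " $1 " in
        *" stapdev "*) ;;
        *) return 1 ;;
    esac
    case " $1 " in
        *" stapusr "*) ;;
        *) return 1 ;;
    esac
    return 0
}

# Example:
#   in_both_stap_groups "$(id -nG "$USER")" || echo "missing a SystemTap group"
```

    Padding the list with spaces on both sides keeps a group such as stapdevx from being mistaken for stapdev.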

    Below is a list of commonly used stap options:

    -v

    Makes the output of the SystemTap session more verbose. This option can be repeated (for example, stap -vvv script.stp) to provide more details on the script's execution. It is particularly useful if you encounter any errors when running the script.

    -o filename

    Sends the standard output to file (filename).

    -S size,count

    Limits files to size megabytes and limits the number of files kept around to count. The file names will have a sequence number suffix. This option implements logrotate operations for SystemTap.

    When used with -o, the -S option limits the size of log files.

    -x process ID

    Sets the SystemTap handler function target() to the specified process ID. For more information about target(), refer to SystemTap Functions.


    -c command

    Sets the SystemTap handler function target() to the specified command. The full path to the specified command must be used; for example, instead of specifying cp, use /bin/cp (as in stap script -c /bin/cp). For more information about target(), refer to SystemTap Functions.

    -e 'script'

    Uses the script string rather than a file as input for the SystemTap translator.

    -F

    Uses SystemTap's flight recorder mode and makes the script a background process. For more information about flight recorder mode, refer to Section 2.3.1, “SystemTap Flight Recorder Mode”.

    stap can also be instructed to run scripts from standard input using the switch -. To illustrate:

    Example 2.1. Running Scripts From Standard Input

    echo "probe timer.s(1) {exit()}" | stap -

    Example 2.1, “Running Scripts From Standard Input” instructs stap to run the script passed by echo to standard input. Any stap options to be used should be inserted before the - switch; for instance, to make the example in Example 2.1, “Running Scripts From Standard Input” more verbose, the command would be:

    echo "probe timer.s(1) {exit()}" | stap -v -

    For more information about stap, refer to man stap.

    To run SystemTap instrumentation (i.e. the kernel module built from SystemTap scripts during a cross-instrumentation), use staprun instead. For more information about staprun and cross-instrumentation, refer to Section 2.2, “Generating Instrumentation for Other Computers”.

    Note

    The stap options -v and -o also work for staprun. For more information about staprun, refer to man staprun.

    2.3.1. SystemTap Flight Recorder Mode

    SystemTap's flight recorder mode allows a SystemTap script to be run for long periods while keeping only recent output. The flight recorder mode (the -F option) limits the amount of output generated. There are two variations of the flight recorder mode: in-memory and file mode. In both cases the SystemTap script runs as a background process.

    2.3.1.1. In-memory Flight Recorder

    When flight recorder mode (the -F option) is used without a file name, SystemTap uses a buffer in kernel memory to store the output of the script. Next, the SystemTap instrumentation module loads and the probes start running; the instrumentation then detaches and is put in the background. When the interesting event occurs, the instrumentation can be reattached, and the recent output in the memory buffer and any continuing output can be seen. The following command starts a script using the flight recorder in-memory mode:

    stap -F /usr/share/doc/systemtap-version/examples/io/iotime.stp

    Once the script starts, a message that provides the command to reconnect to the running script will appear:

    Disconnecting from systemtap module.
    To reconnect, type "staprun -A stap_5dd0073edcb1f13f7565d8c343063e68_19556"

    When the interesting event occurs, reattach to the currently running script and output the recent data in the memory buffer, then get the continuing output with the following command:

    staprun -A stap_5dd0073edcb1f13f7565d8c343063e68_19556

    By default, the kernel buffer is 1MB in size, but it can be increased with the -s option, which specifies the buffer size in megabytes (rounded up to the next power of 2). For example, -s2 on the SystemTap command line would specify 2MB for the buffer.
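
    To make the rounding concrete, the following shell sketch computes the size that would result from a requested value, assuming that "rounded up" means the smallest power of 2 that is not less than the request. The function name next_pow2 is hypothetical, and this is an illustration of the arithmetic, not SystemTap's own code.

```shell
#!/bin/sh
# Hypothetical helper: round a positive integer up to the nearest power of 2,
# mirroring the rounding described for the -s buffer size.
#   $1: requested size in megabytes (positive integer)
next_pow2() {
    n=$1
    p=1
    # Double p until it is at least as large as the requested size.
    while [ "$p" -lt "$n" ]; do
        p=$((p * 2))
    done
    printf '%s\n' "$p"
}

# Example: next_pow2 3
```

    Under this assumption, a request of 3 would be treated as 4, while a request of 2 would stay at 2.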

    2.3.1.2. File Flight Recorder

    The flight recorder mode can also store data to files. The number and size of the files kept is controlled by the -S option followed by two numerical arguments separated by a comma. The first argument is the maximum size in megabytes for each output file. The second argument is the number of recent files to keep. The file name is specified by the -o option followed by the name. SystemTap adds a number suffix to the file name to indicate the order of the files. The following will start SystemTap in file flight recorder mode, with the output going to files named /tmp/pfaults.log.[0-9]+, each file 1MB or smaller, and only the latest two files kept:

    stap -F -o /tmp/pfaults.log -S 1,2 pfaults.stp

    The number printed by the command is the process ID. Sending a SIGTERM to the process will shut down the SystemTap script and stop the data collection. For example, if the previous command listed 7590 as the process ID, the following command would shut down the SystemTap script:

    kill -s SIGTERM 7590

    Only the two most recent files generated by the script are kept; older files are removed. Thus, ls -sh /tmp/pfaults.log.* shows only two files:

    1020K /tmp/pfaults.log.5
    44K /tmp/pfaults.log.6

    One can look at the highest number file for the latest data, in this case /tmp/pfaults.log.6.
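
    Picking the highest-numbered file by eye is easy with two files, but a small shell sketch can do it reliably when suffixes grow past one digit (a plain alphabetical listing would place .10 before .6). The function name latest_flight_log is hypothetical; it assumes the files follow the name.number pattern described above.

```shell
#!/bin/sh
# Hypothetical helper: print the flight recorder file with the highest
# numeric suffix for a given base name.
#   $1: base name passed to -o, e.g. /tmp/pfaults.log
latest_flight_log() {
    best=""
    best_n=-1
    for f in "$1".*; do
        [ -e "$f" ] || continue          # glob matched nothing
        n=${f##*.}                       # text after the last dot
        case "$n" in
            ''|*[!0-9]*) continue ;;     # skip non-numeric suffixes
        esac
        if [ "$n" -gt "$best_n" ]; then
            best_n=$n
            best=$f
        fi
    done
    [ -n "$best" ] && printf '%s\n' "$best"
}

# Example: latest_flight_log /tmp/pfaults.log
```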


  • Chapter 3. Understanding How SystemTap Works

    SystemTap allows users to write and reuse simple scripts to deeply examine the activities of a running Linux system. These scripts can be designed to extract data, filter it, and summarize it quickly (and safely), enabling the diagnosis of complex performance (or even functional) problems.

    The essential idea behind a SystemTap script is to name events, and to give them handlers. When SystemTap runs the script, SystemTap monitors for the event; once the event occurs, the Linux kernel then runs the handler as a quick sub-routine, then resumes.

    There are several kinds of events: entering or exiting a function, timer expiration, session termination, and so on. A handler is a series of script language statements that specify the work to be done whenever the event occurs. This work normally includes extracting data from the event context, storing it into internal variables, and printing results.

    3.1. Architecture

    A SystemTap session begins when you run a SystemTap script. This session occurs in the following fashion:

    Procedure 3.1. SystemTap Session

    1. First, SystemTap checks the script against the existing tapset library (normally in /usr/share/systemtap/tapset/) for any tapsets used. SystemTap will then substitute any located tapsets with their corresponding definitions in the tapset library.

    2. SystemTap then translates the script to C, running the system C compiler to create a kernel module from it. The tools that perform this step are contained in the systemtap package (refer to Section 2.1.1, “Installing SystemTap” for more information).

    3. SystemTap loads the module, then enables all the probes (events and handlers) in the script. The staprun in the systemtap-runtime package (refer to Section 2.1.1, “Installing SystemTap” for more information) provides this functionality.

    4. As the events occur, their corresponding handlers are executed.

    5. Once the SystemTap session is terminated, the probes are disabled, and the kernel module is unloaded.

    This sequence is driven from a single command-line program: stap. This program is SystemTap's main front-end tool. For more information about stap, refer to man stap (once SystemTap is properly installed on your machine).

    3.2. SystemTap Scripts

    For the most part, SystemTap scripts are the foundation of each SystemTap session. SystemTap scripts instruct SystemTap on what type of information to collect, and what to do once that information is collected.

    As stated in Chapter 3, Understanding How SystemTap Works, SystemTap scripts are made up of two components: events and handlers. Once a SystemTap session is underway, SystemTap monitors the operating system for the specified events and executes the handlers as they occur.


    Note

    An event and its corresponding handler are collectively called a probe. A SystemTap script can have multiple probes.

    A probe's handler is commonly referred to as a probe body.

    In terms of application development, using events and handlers is similar to instrumenting the code by inserting diagnostic print statements in a program's sequence of commands. These diagnostic print statements allow you to view a history of commands executed once the program is run.

    SystemTap scripts allow insertion of the instrumentation code without recompilation of the code and allow more flexibility with regard to handlers. Events serve as the triggers for handlers to run; handlers can be specified to record specified data and print it in a certain manner.

    Format

    SystemTap scripts use the file extension .stp, and contain probes written in the following format:

    probe event {statements}

    SystemTap supports multiple events per probe; multiple events are delimited by a comma (,). If multiple events are specified in a single probe, SystemTap will execute the handler when any of the specified events occur.

    Each probe has a corresponding statement block. This statement block is enclosed in braces ({ }) and contains the statements to be executed per event. SystemTap executes these statements in sequence; special separators or terminators are generally not necessary between multiple statements.

    Note

    Statement blocks in SystemTap scripts follow the same syntax and semantics as the C programming language. A statement block can be nested within another statement block.

    SystemTap allows you to write functions to factor out code to be used by a number of probes. Thus, rather than repeatedly writing the same series of statements in multiple probes, you can just place the instructions in a function, as in:

    function function_name(arguments) {statements}
    probe event {function_name(arguments)}

    The statements in function_name are executed when the probe for event executes. The arguments are optional values passed into the function.

    Important

    Section 3.2, “SystemTap Scripts” is designed to introduce readers to the basics of SystemTap scripts. To understand SystemTap scripts better, it is advisable that you refer to Chapter 4, Useful SystemTap Scripts; each section therein provides a detailed explanation of the script, its events, handlers, and expected output.

    3.2.1. Event

    SystemTap events can be broadly classified into two types: synchronous and asynchronous.

    Synchronous Events

    A synchronous event occurs when any process executes an instruction at a particular location in kernel code. This gives other events a reference point from which more contextual data may be available.

    Examples of synchronous events include:

    syscall.system_call

    The entry to the system call system_call. If the exit from a syscall is desired, appending a .return to the event monitors the exit of the system call instead. For example, to specify the entry and exit of the system call close, use syscall.close and syscall.close.return respectively.

    vfs.file_operation

    The entry to the file_operation event for Virtual File System (VFS). Similar to the syscall event, appending a .return to the event monitors the exit of the file_operation operation.

    kernel.function("function")

    The entry to the kernel function function. For example, kernel.function("sys_open") refers to the "event" that occurs when the kernel function sys_open is called by any thread in the system. To specify the return of the kernel function sys_open, append the return string to the event statement; i.e. kernel.function("sys_open").return.

    When defining probe events, you can use asterisk (*) for wildcards. You can also trace the entry or exit of a function in a kernel source file. Consider the following example:

    Example 3.1. wildcards.stp

    probe kernel.function("*@net/socket.c") { }
    probe kernel.function("*@net/socket.c").return { }

    In the previous example, the first probe's event specifies the entry of ALL functions in the kernel source file net/socket.c. The second probe specifies the exit of all those functions. Note that in this example, there are no statements in the handler; as such, no information will be collected or displayed.

    kernel.trace("tracepoint")

    The static probe for tracepoint. Recent kernels (2.6.30 and newer) include instrumentation for specific events in the kernel. These events are statically marked with tracepoints. One example of a tracepoint available in SystemTap is kernel.trace("kfree_skb"), which indicates each time a network buffer is freed in the kernel.

    module("module").function("function")

    Allows you to probe functions within modules. For example:

    Example 3.2. moduleprobe.stp

    probe module("ext3").function("*") { }
    probe module("ext3").function("*").return { }

    The first probe in Example 3.2, “moduleprobe.stp” points to the entry of all functions for the ext3 module. The second probe points to the exits of all functions for that same module; the use of the .return suffix is similar to kernel.function(). Note that the probes in Example 3.2, “moduleprobe.stp” do not contain statements in the probe handlers, and as such will not print any useful data (as in Example 3.1, “wildcards.stp”).

    A system's kernel modules are typically located in /lib/modules/kernel_version, where kernel_version refers to the currently loaded kernel version. Modules use the file name extension .ko.

    Asynchronous Events

    Asynchronous events are not tied to a particular instruction or location in code. This family of probe points consists mainly of counters, timers, and similar constructs.

    Examples of asynchronous events include:

    begin

    The startup of a SystemTap session; i.e. as soon as the SystemTap script is run.

    end

    The end of a SystemTap session.

    timer events

    An event that specifies a handler to be executed periodically. For example:

    Example 3.3. timer-s.stp

    probe timer.s(4)
    {
      printf("hello world\n")
    }

    Example 3.3, “timer-s.stp” is an example of a probe that prints hello world every 4 seconds. Note that you can also use the following timer events:

    timer.ms(milliseconds)

    timer.us(microseconds)

    timer.ns(nanoseconds)

    timer.hz(hertz)

    timer.jiffies(jiffies)

    When used in conjunction with other probes that collect information, timer events allow you to print periodic updates and see how that information changes over time.
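    As a rough illustration of pairing a timer with a probe that collects data, the sketch below counts entries to the open system call and reports the running total every two seconds. The variable name opens and the ten-second cutoff are assumptions made for this example only.

```systemtap
global opens   # hypothetical counter, shared by the probes below

# Collect: count every entry to the open system call.
probe syscall.open {
  opens++
}

# Report: print the running total every 2 seconds.
probe timer.s(2) {
  printf("%d open calls so far\n", opens)
}

# End the session after 10 seconds.
probe timer.s(10) {
  exit()
}
```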

    Important

    SystemTap supports the use of a large collection of probe events. For more information about supported events, refer to man stapprobes. The SEE ALSO section of man stapprobes also contains links to other man pages that discuss supported events for specific subsystems and components.

    3.2.2. SystemTap Handler/Body

    Consider the following sample script:

    Example 3.4. helloworld.stp

    probe begin
    {
      printf ("hello world\n")
      exit ()
    }

    In Example 3.4, “helloworld.stp”, the event begin (i.e. the start of the session) triggers the handler enclosed in { }, which simply prints hello world followed by a new-line, then exits.

    Note

    SystemTap scripts continue to run until the exit() function executes. If you want to stop the execution of a script, you can interrupt it manually with Ctrl+C.

    printf () Statements

    The printf () statement is one of the simplest functions for printing data. printf () can also be used to display data using a wide variety of SystemTap functions in the following format:

    printf ("format string\n", arguments)

    The format string specifies how arguments should be printed. The format string of Example 3.4, “helloworld.stp” simply instructs SystemTap to print hello world, and contains no format specifiers.

    You can use the format specifiers %s (for strings) and %d (for numbers) in format strings, depending on your list of arguments. Format strings can have multiple format specifiers, each matching a corresponding argument; multiple arguments are delimited by a comma (,).

    Note

    Semantically, the SystemTap printf function is very similar to its C language counterpart. The aforementioned syntax and format for SystemTap's printf function is identical to that of the C-style printf.

    To illustrate this, consider the following probe example:

    Example 3.5. variables-in-printf-statements.stp

    probe syscall.open
    {
      printf ("%s(%d) open\n", execname(), pid())
    }

    Example 3.5, “variables-in-printf-statements.stp” instructs SystemTap to probe all entries to the system call open; for each event, it prints the current execname() (a string with the executable name) and pid() (the current process ID number), followed by the word open. A snippet of this probe's output would look like:

    vmware-guestd(2206) open
    hald(2360) open
    hald(2360) open
    hald(2360) open
    df(3433) open
    df(3433) open
    df(3433) open
    hald(2360) open

    SystemTap Functions

    SystemTap supports a wide variety of functions that can be used as printf () arguments. Example 3.5, “variables-in-printf-statements.stp” uses the SystemTap functions execname() (name of the process that called a kernel function/performed a system call) and pid() (current process ID).

    The following is a list of commonly-used SystemTap functions:

    tid()

    The ID of the current thread.

    uid()

    The ID of the current user.

    cpu()

    The current CPU number.

    gettimeofday_s()

    The number of seconds since the UNIX epoch (January 1, 1970).

    ctime()

    Converts the number of seconds since the UNIX epoch to a date string.

    pp()

    A string describing the probe point currently being handled.

    thread_indent()

    This particular function is quite useful, providing you with a way to better organize your print results. The function takes one argument, an indentation delta, which indicates how many spaces to add or remove from a thread's "indentation counter". It then returns a string with some generic trace data along with an appropriate number of indentation spaces.

    The generic data included in the returned string includes a timestamp (number of microseconds since the first call to thread_indent() by the thread), a process name, and the thread ID. This allows you to identify what functions were called, who called them, and the duration of each function call.

    If call entries and exits immediately precede each other, it is easy to match them. However, in most cases, after a first function call entry is made several other call entries and exits may be made before the first call exits. The indentation counter helps you match an entry with its corresponding exit by indenting the next function call if it is not the exit of the previous one.

    Consider the following example on the use of thread_indent():

    Example 3.6. thread_indent.stp

    probe kernel.function("*@net/socket.c")
    {
      printf ("%s -> %s\n", thread_indent(1), probefunc())
    }
    probe kernel.function("*@net/socket.c").return
    {
      printf ("%s <- %s\n", thread_indent(-1), probefunc())
    }

    4301 ftp(7223): -> sock_map_fd
    4644 ftp(7223): -> sock_map_file
    4699 ftp(7223):

    3.3.1. Variables

    Variables can be used freely throughout a handler; simply choose a name, assign a value from a function or expression to it, and use it in an expression. SystemTap automatically identifies whether a variable should be typed as a string or integer, based on the type of the values assigned to it. For instance, if you set the variable foo to gettimeofday_s() (as in foo = gettimeofday_s()), then foo is typed as a number and can be printed in a printf() with the integer format specifier (%d).

    Note, however, that by default variables are only local to the probe they are used in. This means that variables are initialized, used, and disposed of at each probe handler invocation. To share a variable between probes, declare the variable name using global outside of the probes. Consider the following example:

    Example 3.8. timer-jiffies.stp

    global count_jiffies, count_ms
    probe timer.jiffies(100) { count_jiffies ++ }
    probe timer.ms(100) { count_ms ++ }
    probe timer.ms(12345)
    {
      hz=(1000*count_jiffies) / count_ms
      printf ("jiffies:ms ratio %d:%d => CONFIG_HZ=%d\n",
        count_jiffies, count_ms, hz)
      exit ()
    }

    Example 3.8, “timer-jiffies.stp” computes the CONFIG_HZ setting of the kernel by using timers that count jiffies and milliseconds, then computing the ratio between the two counts. The global statement allows the variables count_jiffies and count_ms (set in their own respective probes) to be shared with probe timer.ms(12345).

    Note

    The ++ notation in Example 3.8, “timer-jiffies.stp” (i.e. count_jiffies ++ and count_ms ++) is used to increment the value of a variable by 1. In the following probe, count_jiffies is incremented by 1 every 100 jiffies:

    probe timer.jiffies(100) { count_jiffies ++ }

    In this instance, SystemTap understands that count_jiffies is an integer. Because no initial value was assigned to count_jiffies, its initial value is zero by default.
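    To make the difference between local and global variables concrete, consider the sketch below. The names ticks and now are hypothetical. The explicit = 0 initializer is optional, since a global integer starts at zero anyway; it is written out here only to show that a global can be given an initial value.

```systemtap
global ticks = 0   # shared by all probes; persists between invocations

probe timer.s(1) {
  now = gettimeofday_s()   # local: created afresh on every invocation
  ticks++                  # global: keeps its value from the last invocation
  printf("tick %d at %s\n", ticks, ctime(now))
}

probe timer.s(5) {
  exit()
}
```

    Here now is typed as an integer because gettimeofday_s() returns one, and it is discarded as soon as the handler finishes.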

    3.3.2. Conditional Statements

    In some cases, the output of a SystemTap script may be too big. To address this, you need to further refine the script's logic in order to narrow the output down to something more relevant or useful to your probe.

    You can do this by using conditionals in handlers. SystemTap accepts the following types of conditional statements:

    If/Else Statements

    Format:

    if (condition)
      statement1
    else
      statement2

    The statement1 is executed if the condition expression is non-zero. The statement2 is executed if the condition expression is zero. The else clause (else statement2) is optional. Both statement1 and statement2 can be statement blocks.

    Example 3.9. ifelse.stp

    global countread, countnonread
    probe kernel.function("vfs_read"),kernel.function("vfs_write")
    {
      if (probefunc()=="vfs_read")
        countread ++
      else
        countnonread ++
    }
    probe timer.s(5) { exit() }
    probe end
    {
      printf("VFS reads total %d\n VFS writes total %d\n", countread, countnonread)
    }

    Example 3.9, “ifelse.stp” is a script that counts how many virtual file system reads (vfs_read) and writes (vfs_write) the system performs within a 5-second span. When run, the script increments the value of the variable countread by 1 if the name of the function it probed matches vfs_read (as noted by the condition if (probefunc()=="vfs_read")); otherwise, it increments countnonread (else {countnonread ++}).

    While Loops

    Format:

    while (condition) statement

    So long as condition is non-zero, the block of statements in statement is executed. The statement is often a statement block, and it must change a value so that condition will eventually be zero.

    For Loops

    Format:

    for (initialization; conditional; increment) statement

    The for loop is simply shorthand for a while loop. The following is the equivalent while loop:

    initialization
    while (conditional) {
      statement
      increment
    }
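    The equivalence between the two loop forms can be checked with a small script. This is an illustrative sketch only; the variable names i and j are arbitrary, and the loops run inside a begin probe so that the script exits immediately.

```systemtap
probe begin {
  # for-loop form
  for (i = 0; i < 3; i++)
    printf("for: i=%d\n", i)

  # equivalent while-loop form: initialization first, increment last
  j = 0
  while (j < 3) {
    printf("while: j=%d\n", j)
    j++
  }

  exit()
}
```

    Both loops run their body three times, with the counter taking the values 0, 1, and 2.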

    Conditional Operators

    Aside from == ("is equal to"), you can also use the following operators in your conditional statements:

    >=

    Greater than or equal to

    Since associative arrays are normally processed in multiple probes (as we will demonstrate later), they should be declared as global variables in the SystemTap script. The syntax for accessing an element in an associative array is similar to that of awk, and is as follows:

    array_name[index_expression]

    Here, the array_name is any arbitrary name the array uses. The index_expression is used to refer to a specific unique key in the array. To illustrate, let us try to build an array named foo that specifies the ages of three people (i.e. the unique keys): tom, dick, and harry. To assign them the ages (i.e. associated values) of 23, 24, and 25 respectively, we'd use the following array statements:

    Example 3.11. Basic Array Statements

    foo["tom"] = 23
    foo["dick"] = 24
    foo["harry"] = 25

    You can specify up to nine index expressions in an array statement, each one delimited by a comma (,). This is useful if you wish to have a key that contains multiple pieces of information. The following line from disktop.stp uses 5 elements for the key: process ID, executable name, user ID, parent process ID, and string "W". It associates the value of devname with that key.

    device[pid(),execname(),uid(),ppid(),"W"] = devname

    Important

    All associative arrays must be declared as global, regardless of whether the associative array is used in one or multiple probes.
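    A smaller two-element key shows the same idea end to end. In this sketch, the array name opens and the loop variables cmd and id are hypothetical. The foreach statement with a bracketed key list is the form disktop.stp uses to walk arrays with multi-element keys.

```systemtap
global opens   # key: (executable name, process ID); value: call count

probe syscall.open {
  opens[execname(), pid()]++
}

probe timer.s(5) {
  # Walk the array in descending order of count, at most 5 entries.
  foreach ([cmd, id] in opens- limit 5)
    printf("%s(%d): %d\n", cmd, id, opens[cmd, id])
  exit()
}
```

    Two processes with the same executable name are counted separately, because the process ID is part of the key.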

    3.5. Array Operations in SystemTap

    This section enumerates some of the most commonly used array operations in SystemTap.

    3.5.1. Assigning an Associated Value

    Use = to set an associated value to indexed unique pairs, as in:

    array_name[index_expression] = value

    Example 3.11, “Basic Array Statements” shows a very basic example of how to set an explicit associated value to a unique key. You can also use a handler function as both your index_expression and value. For example, you can use arrays to set a timestamp as the associated value to a thread ID (which you wish to use as your unique key), as in:

    Example 3.12. Associating Timestamps to Process Names

    foo[tid()] = gettimeofday_s()

    Whenever an event invokes the statement in Example 3.12, “Associating Timestamps to Process Names”, SystemTap returns the appropriate tid() value (i.e. the ID of a thread, which is then used as the unique key). At the same time, SystemTap also uses the function gettimeofday_s() to set the corresponding timestamp as the associated value to the unique key defined by the function tid(). This creates an array composed of key pairs containing thread IDs and timestamps.

    In this same example, if tid() returns a value that is already defined in the array foo, the operator will discard the original associated value to it, and replace it with the current timestamp from gettimeofday_s().

    3.5.2. Reading Values From Arrays

    You can also read values from an array the same way you would read the value of a variable. To do so, include the array_name[index_expression] statement as an element in a mathematical expression. For example:

    Example 3.13. Using Array Values in Simple Computations

    delta = gettimeofday_s() - foo[tid()]

    This example assumes that the array foo was built using the construct in Example 3.12, “Associating Timestamps to Process Names” (from Section 3.5.1, “Assigning an Associated Value”). This sets a timestamp that will serve as a reference point, to be used in computing for delta.

    The construct in Example 3.13, “Using Array Values in Simple Computations” computes a value for the variable delta by subtracting the associated value of the key tid() from the current gettimeofday_s(). The construct does this by reading the value of tid() from the array. This particular construct is useful for determining the time between two events, such as the start and completion of a read operation.

    Note

    If the index_expression cannot find the unique key, it returns a value of 0 (for numerical operations, such as Example 3.13, “Using Array Values in Simple Computations”) or a null/empty string value (for string operations) by default.
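    Putting the assignment and the read together gives a simple latency measurement. The sketch below is one possible arrangement, not a script from this guide. The array name start is hypothetical, it uses the microsecond clock gettimeofday_us() for finer resolution, and the 1000-microsecond threshold is arbitrary. The membership test guards against the default value of 0 described in the note above, which would otherwise produce a huge bogus delta.

```systemtap
global start   # key: thread ID; value: entry time in microseconds

probe vfs.read {
  start[tid()] = gettimeofday_us()
}

probe vfs.read.return {
  # Only compute a delta if this thread's entry was actually recorded.
  if ([tid()] in start) {
    delta = gettimeofday_us() - start[tid()]
    delete start[tid()]
    if (delta > 1000)
      printf("%s(%d) vfs.read took %d us\n", execname(), tid(), delta)
  }
}

probe timer.s(10) {
  exit()
}
```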

    3.5.3. Incrementing Associated Values

    Use ++ to increment the associated value of a unique key in an array, as in:

    array_name[index_expression] ++

    Again, you can also use a handler function for your index_expression. For example, if you wanted to tally how many times a specific process performed a read to the virtual file system (using the event vfs.read), you can use the following probe:

    Example 3.14. vfsreads.stp

    probe vfs.read
    {
      reads[execname()] ++
    }

    In Example 3.14, “vfsreads.stp”, the first time that the probe returns the process name gnome-terminal (i.e. the first time gnome-terminal performs a VFS read), that process name is set as the unique key gnome-terminal with an associated value of 1. The next time that the probe returns the process name gnome-terminal, SystemTap increments the associated value of gnome-terminal by 1. SystemTap performs this operation for all process names as the probe returns them.

    3.5.4. Processing Multiple Elements in an Array

    Once you've collected enough information in an array, you will need to retrieve and process all elements in that array to make it useful. Consider Example 3.14, “vfsreads.stp”: the script collects information about how many VFS reads each process performs, but does not specify what to do with it. The obvious means for making Example 3.14, “vfsreads.stp” useful is to print the key pairs in the array reads, but how?

    The best way to process all key pairs in an array (as an iteration) is to use the foreach statement. Consider the following example:

    Example 3.15. cumulative-vfsreads.stp

    global reads
    probe vfs.read
    {
      reads[execname()] ++
    }
    probe timer.s(3)
    {
      foreach (count in reads)
        printf("%s : %d \n", count, reads[count])
    }

    In the second probe of Example 3.15, “cumulative-vfsreads.stp”, the foreach statement uses the variable count to reference each iteration of a unique key in the array reads. The reads[count] array statement in the same probe retrieves the associated value of each unique key.

    Given what we know about the first probe in Example 3.15, “cumulative-vfsreads.stp”, the script prints VFS-read statistics every 3 seconds, displaying names of processes that performed a VFS-read along with a corresponding VFS-read count.

    Now, remember that the foreach statement in Example 3.15, “cumulative-vfsreads.stp” prints all iterations of process names in the array, and in no particular order. You can instruct the script to process the iterations in a particular order by using + (ascending) or - (descending). In addition, you can also limit the number of iterations the script needs to process with the limit value option.

    For example, consider the following replacement probe:

    probe timer.s(3)
    {
      foreach (count in reads- limit 10)
        printf("%s : %d \n", count, reads[count])
    }

    This foreach statement instructs the script to process the elements in the array reads in descending order (of associated value). The limit 10 option instructs the foreach to only process the first ten iterations (i.e. print the first 10, starting with the highest value).

    3.5.5. Clearing/Deleting Arrays and Array Elements

    Sometimes, you may need to clear the associated values in array elements, or reset an entire array for re-use in another probe. Example 3.15, “cumulative-vfsreads.stp” in Section 3.5.4, “Processing Multiple Elements in an Array” allows you to track how the number of VFS reads per process grows over time, but it does not show you the number of VFS reads each process makes per 3-second period.

    To do that, you will need to clear the values accumulated by the array. You can accomplish this using the delete operator to delete elements in an array, or an entire array. Consider the following example:

    Example 3.16. noncumulative-vfsreads.stp

    global reads
    probe vfs.read
    {
      reads[execname()] ++
    }
    probe timer.s(3)
    {
      foreach (count in reads)
        printf("%s : %d \n", count, reads[count])
      delete reads
    }

    In Example 3.16, “noncumulative-vfsreads.stp”, the second probe prints the number of VFS reads each process made within the probed 3-second period only. The delete reads statement clears the reads array within the probe.
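    The delete operator can also remove a single element while leaving the rest of the array intact. The following variation is a sketch under the assumption that you want to exclude stapio, SystemTap's own user-space helper process, from each report; the loop variable cmd is a hypothetical name.

```systemtap
global reads

probe vfs.read {
  reads[execname()]++
}

probe timer.s(3) {
  delete reads["stapio"]   # remove one element: only the key "stapio"
  foreach (cmd in reads- limit 5)
    printf("%s : %d\n", cmd, reads[cmd])
  delete reads             # remove every element before the next period
}
```

    Like the other examples in this section, this script runs until you interrupt it with Ctrl+C.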

    Note

    You can have multiple array operations within the same probe. Using the examples from Section 3.5.4, “Processing Multiple Elements in an Array” and Section 3.5.5, “Clearing/Deleting Arrays and Array Elements”, you can track the number of VFS reads each process makes per 3-second period and tally the cumulative VFS reads of those same processes. Consider the following example:

    global reads, totalreads

    probe vfs.read
    {
      reads[execname()] ++
      totalreads[execname()] ++
    }

    probe timer.s(3)
    {
      printf("=======\n")
      foreach (count in reads-)
        printf("%s : %d \n", count, reads[count])
      delete reads
    }

    probe end
    {
      printf("TOTALS\n")
      foreach (total in totalreads-)
        printf("%s : %d \n", total, totalreads[total])
    }

    In this example, the arrays reads and totalreads track the same information, and are printed out in a similar fashion. The only difference here is that reads is cleared every 3-second period, whereas totalreads keeps growing.

    3.5.6. Using Arrays in Conditional Statements

    You can also use associative arrays in if statements. This is useful if you want to execute a subroutine once a value in the array matches a certain condition. Consider the following example:

    Example 3.17. vfsreads-print-if-1kb.stp

    global reads
    probe vfs.read
    {
      reads[execname()] ++
    }

    probe timer.s(3)
    {
      printf("=======\n")
      foreach (count in reads-)
        if (reads[count] >= 1024)
          printf("%s : %dkB \n", count, reads[count]/1024)
        else
          printf("%s : %dB \n", count, reads[count])
    }

    Every three seconds, Example 3.17, “vfsreads-print-if-1kb.stp” prints out a list of all processes, along with how many times each process performed a VFS read. If the associated value of a process name is equal to or greater than 1024, the if statement in the script converts and prints it out in kB.

    Testing for Membership

    You can also test whether a specific unique key is a member of an array. Further, membership in an array can be used in if statements, as in:

    if([index_expression] in array_name) statement

    To illustrate this, consider the following example:

    Example 3.18. vfsreads-stop-on-stapio2.stp

    global reads

    probe vfs.read
    {
      reads[execname()] ++
    }

    probe timer.s(3)
    {
      printf("=======\n")
      foreach (count in reads+)
        printf("%s : %d \n", count, reads[count])
      if(["stapio"] in reads) {
        printf("stapio read detected, exiting\n")
        exit()
      }
    }

    The if(["stapio"] in reads) statement instructs the script to print stapio read detected, exiting once the unique key stapio is added to the array reads.

    3.5.7. Computing for Statistical Aggregates

    Statistical aggregates are used to collect statistics on numerical values where it is important to accumulate new data quickly and in large volume (i.e. storing only aggregated stream statistics). Statistical aggregates can be used in global variables or as elements in an array.

    To add value to a statistical aggregate, use the operator <<< value.
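    The <<< operator adds a value to an aggregate, and extractor functions such as @count, @sum, and @avg read the accumulated statistics back. The following is a minimal sketch, not one of this guide's numbered examples. The array name sizes is hypothetical, and the script records $return, the number of bytes returned by each successful read, which requires kernel debug information to resolve.

```systemtap
global sizes   # key: executable name; value: aggregate of read sizes

probe vfs.read.return {
  # Feed each successful read's byte count into the aggregate.
  if ($return > 0)
    sizes[execname()] <<< $return
}

probe timer.s(5) {
  foreach (cmd in sizes- limit 5)
    printf("%s: %d reads, %d bytes total, %d bytes average\n",
           cmd, @count(sizes[cmd]), @sum(sizes[cmd]), @avg(sizes[cmd]))
  exit()
}
```

    The aggregate keeps only running statistics rather than every individual value, which is what makes it cheap to update at a high event rate.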

    Example 3.19. stat-aggregates.stp

    global reads
    probe vfs.read
    {
      reads[execname()]

    Example 3.20. Multiple Array Indexes

    global reads
    probe vfs.read
    {
      reads[execname(),pid()]

    Chapter 4. Useful SystemTap Scripts

    This chapter enumerates several SystemTap scripts you can use to monitor and investigate different subsystems. All of these scripts are available at /usr/share/systemtap/testsuite/systemtap.examples/ once you install the systemtap-testsuite RPM.

    4.1. Network

    The following sections showcase scripts that trace network-related functions and build a profile of network activity.

    4.1.1. Network Profiling

    This section describes how to profile network activity. nettop.stp provides a glimpse into how much network traffic each process is generating on a machine.

    nettop.stp

    #! /usr/bin/env stap

    global ifxmit, ifrecv
    global ifmerged

    probe netdev.transmit
    {
      ifxmit[pid(), dev_name, execname(), uid()]

      }

      print("\n")

      delete ifxmit
      delete ifrecv
      delete ifmerged
    }

    probe timer.ms(5000), end, error
    {
      print_activity()
    }

    Note that function print_activity() uses the following expressions:

    n_xmit ? @sum(ifxmit[pid, dev, exec, uid])/1024 : 0
    n_recv ? @sum(ifrecv[pid, dev, exec, uid])/1024 : 0

    These expressions are if/else conditionals. The second expression, for example, is simply a more concise way of writing the following pseudocode:

    if n_recv != 0 then
      @sum(ifrecv[pid, dev, exec, uid])/1024
    else
      0
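    The behavior of this conditional expression can be seen in isolation with plain integers. In the sketch below, the value 4096 is a made-up byte total standing in for the @sum() result; only the name n_recv comes from nettop.stp.

```systemtap
probe begin {
  bytes = 4096   # made-up byte total, for illustration only

  n_recv = 0
  printf("%d\n", n_recv ? bytes/1024 : 0)   # condition is zero: prints 0

  n_recv = 3
  printf("%d\n", n_recv ? bytes/1024 : 0)   # condition is non-zero: prints 4

  exit()
}
```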

    nettop.stp tracks which processes are generating network traffic on the system, and provides the following information about each process:

    PID — the ID of the listed process.

    UID — user ID. A user ID of 0 refers to the root user.

    DEV — which ethernet device the process used to send / receive data (e.g. eth0, eth1)

    XMIT_PK — number of packets transmitted by the process

    RECV_PK — number of packets received by the process

    XMIT_KB — amount of data sent by the process, in kilobytes

    RECV_KB — amount of data received by the process, in kilobytes

    nettop.stp provides network profile sampling every 5 seconds. You can change this setting by editing probe timer.ms(5000) accordingly. Example 4.1, “nettop.stp Sample Output” contains an excerpt of the output from nettop.stp over a 20-second period:

    Example 4.1. nettop.stp Sample Output

    [...]
      PID   UID DEV     XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
        0     0 eth0          0       5       0       0 swapper
    11178     0 eth0          2       0       0       0 synergyc

      PID   UID DEV     XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
     2886     4 eth0         79       0       5       0 cups-polld
    11362     0 eth0          0      61       0       5 firefox
        0     0 eth0          3      32       0       3 swapper
     2886     4 lo            4       4       0       0 cups-polld
    11178     0 eth0          3       0       0       0 synergyc

      PID   UID DEV     XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
        0     0 eth0          0       6       0       0 swapper
     2886     4 lo            2       2       0       0 cups-polld
    11178     0 eth0          3       0       0       0 synergyc
     3611     0 eth0          0       1       0       0 Xorg

      PID   UID DEV     XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
        0     0 eth0          3      42       0       2 swapper
    11178     0 eth0         43       1       3       0 synergyc
    11362     0 eth0          0       7       0       0 firefox
     3897     0 eth0          0       1       0       0 multiload-apple
    [...]

    4.1.2. Tracing Functions Called in Network Socket Code

    This section describes how to trace functions called from the kernel's net/socket.c file. This task helps you identify, in finer detail, how each process interacts with the network at the kernel level.

    socket-trace.stp

    #!/usr/bin/stap

    probe kernel.function("*@net/socket.c").call
    {
      printf ("%s -> %s\n", thread_indent(1), probefunc())
    }
    probe kernel.function("*@net/socket.c").return
    {
      printf ("%s <- %s\n", thread_indent(-1), probefunc())
    }

    8 scim-bridge(3883): -> sys_recvfrom
    12 scim-bridge(3883):-> sock_from_file
    16 scim-bridge(3883): sock_recvmsg
    24 scim-bridge(3883):

    UID CMD  PID  PORT IP_SOURCE
    0   sshd 3165 22   10.64.0.227
    0   sshd 3165 22   10.64.0.227

    4.1.4. Monitoring Network Packets Drops in Kernel

    The network stack in Linux can discard packets for various reasons. Some Linux kernels include a tracepoint, kernel.trace("kfree_skb"), which makes it easy to track where packets are discarded. dropwatch.stp uses kernel.trace("kfree_skb") to trace packet discards; the script summarizes which locations discarded packets in each five-second interval.

    dropwatch.stp

    #!/usr/bin/stap

    ############################################################
    # Dropwatch.stp
    # Author: Neil Horman
    # An example script to mimic the behavior of the dropwatch utility
    # http://fedorahosted.org/dropwatch
    ############################################################

    # Array to hold the list of drop points we find
    global locations

    # Note when we turn the monitor on and off
    probe begin { printf("Monitoring for dropped packets\n") }
    probe end { printf("Stopping dropped packet monitor\n") }

    # increment a drop counter for every location we drop at
    probe kernel.trace("kfree_skb") { locations[$location]

    Example 4.4. dropwatch.stp Sample Output

    Monitoring for dropped packets

    51 packets dropped at location 0xffffffff8024cd0f
    2 packets dropped at location 0xffffffff8044b472

    51 packets dropped at location 0xffffffff8024cd0f
    1 packets dropped at location 0xffffffff8044b472

    97 packets dropped at location 0xffffffff8024cd0f
    1 packets dropped at location 0xffffffff8044b472
    Stopping dropped packet monitor

    To make the location of packet drops more meaningful, refer to the /boot/System.map-`uname -r` file. This file lists the starting addresses for each function, allowing you to map the addresses in the output of Example 4.4, “dropwatch.stp Sample Output” to a specific function name. Given the following snippet of the /boot/System.map-`uname -r` file, the address 0xffffffff8024cd0f maps to the function unix_stream_recvmsg and the address 0xffffffff8044b472 maps to the function arp_rcv:

    [...]
    ffffffff8024c5cd T unlock_new_inode
    ffffffff8024c5da t unix_stream_sendmsg
    ffffffff8024c920 t unix_stream_recvmsg
    ffffffff8024cea1 t udp_v4_lookup_longway
    [...]
    ffffffff8044addc t arp_process
    ffffffff8044b360 t arp_rcv
    ffffffff8044b487 t parp_redo
    ffffffff8044b48c t arp_solicit
    [...]
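    As an alternative to looking addresses up by hand, a script can ask SystemTap to resolve them. The sketch below is a variation on the idea behind dropwatch.stp, not the script as shipped. It assumes that the symname() tapset function is available in your SystemTap version and that kernel symbol data is accessible to it; where the symbol cannot be resolved, the output may fall back to a raw address.

```systemtap
global locations   # key: address where the packet was freed

# Count one drop for every location at which a packet is freed.
probe kernel.trace("kfree_skb") {
  locations[$location] <<< 1
}

# Every 5 seconds, report drop counts by function name, then reset.
probe timer.s(5) {
  printf("\n")
  foreach (l in locations-)
    printf("%d packets dropped at %s\n", @count(locations[l]), symname(l))
  delete locations
}
```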

    4.2. Disk

    The following sections showcase scripts that monitor disk and I/O activity.

    4.2.1. Summarizing Disk Read/Write Traffic

    This section describes how to identify which processes are performing the heaviest disk reads/writes to the system.

    disktop.stp

    #!/usr/bin/stap
    #
    # Copyright (C) 2007 Oracle Corp.
    #
    # Get the status of reading/writing disk every 5 seconds,
    # output top ten entries
    #
    # This is free software,GNU General Public License (GPL);
    # either version 2, or (at your option) any later version.
    #
    # Usage:
    # ./disktop.stp
    #

    global io_stat,device
    global read_bytes,write_bytes

    probe vfs.read.return {
      if ($return>0) {
        if (devname!="N/A") {/*skip read from cache*/
          io_stat[pid(),execname(),uid(),ppid(),"R"] += $return
          device[pid(),execname(),uid(),ppid(),"R"] = devname
          read_bytes += $return
        }
      }
    }

    probe vfs.write.return {
      if ($return>0) {
        if (devname!="N/A") { /*skip update cache*/
          io_stat[pid(),execname(),uid(),ppid(),"W"] += $return
          device[pid(),execname(),uid(),ppid(),"W"] = devname
          write_bytes += $return
        }
      }
    }

    probe timer.ms(5000) {
      /* skip non-read/write disk */
      if (read_bytes+write_bytes) {

        printf("\n%-25s, %-8s%4dKb/sec, %-7s%6dKb, %-7s%6dKb\n\n",
               ctime(gettimeofday_s()),
               "Average:", ((read_bytes+write_bytes)/1024)/5,
               "Read:",read_bytes/1024,
               "Write:",write_bytes/1024)

        /* print header */
        printf("%8s %8s %8s %25s %8s %4s %12s\n",
               "UID","PID","PPID","CMD","DEVICE","T","BYTES")
      }
      /* print top ten I/O */
      foreach ([process,cmd,userid,parent,action] in io_stat- limit 10)
        printf("%8d %8d %8d %25s %8s %4s %12d\n",
               userid,process,parent,cmd,
               device[process,cmd,userid,parent,action],
               action,io_stat[process,cmd,userid,parent,action])

      /* clear data */
      delete io_stat
      delete device
      read_bytes = 0
      write_bytes = 0
    }

    probe end{
      delete io_stat
      delete device
      delete read_bytes
      delete write_bytes
    }

    disktop.stp outputs the top ten processes responsible for the heaviest reads/writes to disk. Example 4.5, “disktop.stp Sample Output” displays a sample output for this script, and includes the following data per listed process:

    UID — user ID. A user ID of 0 refers to the root user.

    PID — the ID of the listed process.

    PPID — the process ID of the listed process's parent process.

    CMD — the name of the listed process.

    DEVICE — which storage device the listed process is reading from or writing to.

    T — the type of action performed by the listed process; W refers to write, while R refers to read.

    BYTES — the amount of data read from or written to disk.

    The time and date in the output of disktop.stp are produced by the functions ctime() and gettimeofday_s(). gettimeofday_s() returns the number of seconds that have elapsed since the Unix epoch (January 1, 1970), and ctime() converts that count into a human-readable calendar timestamp for the output.
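    As a rough analogy only, the same two-step conversion (seconds since the epoch, then a calendar string) looks like this in Python. The helper name ctime_utc() is hypothetical, the result is assumed to be expressed in UTC, and the format string is chosen to resemble the timestamp in the sample output rather than taken from the SystemTap implementation.

```python
from datetime import datetime, timezone


def ctime_utc(epoch_seconds):
    """Format seconds since the Unix epoch as a calendar string in UTC."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%a %b %d %H:%M:%S %Y")


# 1222659508 seconds after January 1, 1970 (UTC).
print(ctime_utc(1222659508))
```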

    In this script, $return is a local variable that stores the actual number of bytes each process reads from or writes to the virtual file system. $return can only be used in return probes (e.g. vfs.read.return and vfs.write.return).

    Example 4.5. disktop.stp Sample Output

    [...]
    Mon Sep 29 03:38:28 2008 , Average:  19Kb/sec, Read:       7Kb, Write:     89Kb

         UID      PID     PPID                       CMD   DEVICE    T        BYTES
           0    26319    26294                   firefox     sda5    W        90229
           0     2758     2757           pam_timestamp_c     sda5    R         8064
           0     2885        1                     cupsd     sda5    W         1678

    Mon Sep 29 03:38:38 2008 , Average:   1Kb/sec, Read:       7Kb, Write:      1Kb

         UID      PID     PPID                       CMD   DEVICE    T        BYTES
           0     2758     2757           pam_timestamp_c     sda5    R         8064
           0     2885        1                     cupsd     sda5    W         1678
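    The summary line of each sample can be checked against the per-process byte counts. The Python sketch below mirrors the integer arithmetic of the timer probe (divide by 1024, then by the 5-second interval); summarize() is a hypothetical helper name, and it assumes that the listed rows account for all I/O in the interval, which holds for this short sample.

```python
def summarize(read_bytes, write_bytes, interval_s=5):
    """Return (average Kb/sec, read Kb, written Kb) using integer division."""
    average = ((read_bytes + write_bytes) // 1024) // interval_s
    return average, read_bytes // 1024, write_bytes // 1024


# First sample: one R row (8064 bytes) and two W rows (90229 and 1678 bytes).
first = summarize(read_bytes=8064, write_bytes=90229 + 1678)
print("Average: %dKb/sec, Read: %dKb, Write: %dKb" % first)

# Second sample: one R row (8064 bytes) and one W row (1678 bytes).
second = summarize(read_bytes=8064, write_bytes=1678)
print("Average: %dKb/sec, Read: %dKb, Write: %dKb" % second)
```

    Because every division truncates, small amounts of I/O can round down to 0Kb in the summary even though the rows list a nonzero byte count.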

    4.2.2. Tracking I/O Time For Each File Read or Write

    This section describes how to monitor the amount of time it takes for each process to read from or write to any file. This is useful if you wish to determine which files are slow to load on a given system.

    iotime.stp

    global start
    global entry_io
    global fd_io
    global time_io

    function timestamp:long() {
      return gettimeofday_us() - start
    }

    function proc:string() {
      return sprintf("%d (%s)", pid(), execname())
    }

    probe begin {
      start = gettimeofday_us()
    }

    global filenames
    global filehandles
    global fileread
    global filewrite

    probe syscall.open {
      filenames[pid()] = user_string($filename)
    }

    probe syscall.open.return {
      if ($return != -1) {
        filehandles[pid(), $return] = filenames[pid()]
        fileread[pid(), $return] = 0
        filewrite[pid(), $return] = 0
      } else {
        printf("%d %s access %s fail\n", timestamp(), proc(), filenames[pid()])
      }
      delete filenames[pid()]
    }

    probe syscall.read {
      if ($count > 0) {
        fileread[pid(), $fd] += $count
      }
      t = gettimeofday_us(); p = pid()
      entry_io[p] = t
      fd_io[p] = $fd
    }

    probe syscall.read.return {
      t = gettimeofday_us(); p = pid()
      fd = fd_io[p]
      time_io[p,fd]

    [...]
    117061 2460 (pcscd) access /dev/bus/usb/003/001 read: 43 write: 0
    117065 2460 (pcscd) iotime /dev/bus/usb/003/001 time: 7
    [...]
    3973737 2886 (sendmail) access /proc/loadavg read: 4096 write: 0
    3973744 2886 (sendmail) iotime /proc/loadavg time: 11
    [...]

    Example 4.6, “iotime.stp Sample Output” prints out the following data:

    A timestamp, in microseconds.

    Process ID and process name.

    An access or iotime flag.

    The file accessed.

    If a process was able to read or write any data, a pair of access and iotime lines should appear together. The access line's timestamp refers to the time that a given process started accessing a file; at the end of the line, it shows the amount of data read/written (in bytes). The iotime line shows the amount of time (in microseconds) that the process took to perform the read or write.

    If an access line is not followed by an iotime line, it simply means that the process did not read or write any data.
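    When post-processing a saved log, the access/iotime pairing described above can be reconstructed with a short Python sketch. It assumes the line layout of the sample output and paths without spaces; pair_records() and the field names are hypothetical and are not part of iotime.stp.

```python
import re

# timestamp, pid, "(name)", flag, path, then the flag-specific remainder.
LINE = re.compile(
    r"^(?P<ts>\d+) (?P<pid>\d+) \((?P<name>[^)]*)\) "
    r"(?P<kind>access|iotime) (?P<path>\S+) (?P<rest>.*)$"
)


def pair_records(lines):
    """Join each iotime line with the preceding access line for the same file."""
    pending = {}   # (pid, path) -> (bytes read, bytes written)
    results = []
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue                      # e.g. "[...]" separators
        key = (match.group("pid"), match.group("path"))
        fields = match.group("rest").split()
        if match.group("kind") == "access":
            if fields == ["fail"]:
                continue                  # failed open, no byte counts
            pending[key] = (int(fields[1]), int(fields[3]))
        else:
            read, written = pending.pop(key, (None, None))
            results.append((match.group("name"), match.group("path"),
                            read, written, int(fields[1])))
    return results


SAMPLE = """\
117061 2460 (pcscd) access /dev/bus/usb/003/001 read: 43 write: 0
117065 2460 (pcscd) iotime /dev/bus/usb/003/001 time: 7
[...]
3973737 2886 (sendmail) access /proc/loadavg read: 4096 write: 0
3973744 2886 (sendmail) iotime /proc/loadavg time: 11
"""

for record in pair_records(SAMPLE.splitlines()):
    print(record)
```

    An access line that never receives a matching iotime line simply stays in the pending table, which matches the case where no data was read or written.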

    4.2.3. Track Cumulative IO

    This section describes how to track the cumulative amount of I/O to the system.

    traceio.stp

    #! /usr/bin/env stap
    # traceio.stp
    # Copyright (C) 2007 Red Hat, Inc., Eugene Teo
    # Copyright (C) 2009 Kai Meyer
    #   Fixed a bug that allows this to run longer
    #   And added the humanreadable function
    #
    # This program is free software; you can redistribute it and/or modify
    # it under the terms of the GNU General Public License version 2 as
    # published by the Free Software Foundation.
    #

    global reads, writes, total_io

    probe vfs.read.return {
      reads[pid(),execname()] += $return
      total_io[pid(),execname()] += $return
    }

    probe vfs.write.return {
      writes[pid(),execname()] += $return
      total_io[pid(),execname()] += $return
    }

    function humanreadable(bytes) {
      if (bytes > 1024*1024*1024) {
        return sprintf("%d GiB", bytes/1024/1024/1024)
      } else if (bytes > 1024*1024) {
        return sprintf("%d MiB", bytes/1024/1024)
      } else if (bytes > 1024) {
        return sprintf("%d KiB", bytes/1024)
      } else {
        return sprintf("%d B", bytes)
      }
    }

    probe timer.s(1) {
      foreach([p,e] in total_io- limit 10)
        printf("%8d %15s r: %12s w: %12s\n",
               p, e, humanreadable(reads[p,e]),
               humanreadable(writes[p,e]))
      printf("\n")
      # Note we don't zero out reads, writes and total_io,
      # so the values are cumulative since the script started.
    }

    traceio.stp prints the top ten executables generating I/O traffic over time. It also tracks the cumulative amount of I/O reads and writes done by those ten executables. This information is tracked and printed out in 1-second intervals, and in descending order.

    Note that traceio.stp also uses the local variable $return, which is also used by disktop.stp from Section 4.2.1, “Summarizing Disk Read/Write Traffic”.
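    The unit selection in the script's humanreadable() function can be tried out with a direct Python transcription. This is a sketch for experimenting with the thresholds, not the script itself; the only change is the parameter name num_bytes.

```python
def humanreadable(num_bytes):
    """Pick the largest unit the value strictly exceeds; fractions are dropped."""
    if num_bytes > 1024 * 1024 * 1024:
        return "%d GiB" % (num_bytes // 1024 // 1024 // 1024)
    elif num_bytes > 1024 * 1024:
        return "%d MiB" % (num_bytes // 1024 // 1024)
    elif num_bytes > 1024:
        return "%d KiB" % (num_bytes // 1024)
    return "%d B" % num_bytes


for value in (512, 1024, 1536, 3 * 1024 * 1024):
    print(value, "->", humanreadable(value))
```

    Two consequences of the comparisons are worth noting: a value exactly on a boundary stays in the smaller unit (1024 bytes prints as 1024 B), and integer division truncates, so 1536 bytes prints as 1 KiB.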

    Example 4.7. traceio.stp Sample Output

    [...]
               Xorg r:   583401 KiB w:        0 KiB
           floaters r:       96 KiB w:     7130 KiB
    multiload-apple r:      538 KiB w:      537 KiB
               sshd r:       71 KiB w:       72 KiB
    pam_timestamp_c r:      138 KiB w:        0 KiB
            staprun r:       51 KiB w:       51 KiB
              snmpd r:       46 KiB w:        0 KiB
              pcscd r:       28 KiB w:        0 KiB
         irqbalance r:       27 KiB w:        4 KiB
              cupsd r:        4 KiB w:       18 KiB

               Xorg r:   588140 KiB w:        0 KiB
           floaters r:       97 KiB w:     7143 KiB
    multiload-apple r:      543 KiB w:      542 KiB
               sshd r:       72 KiB w:       72 KiB
    pam_timestamp_c r:      138 KiB w:        0 KiB
            staprun r:       51 KiB w:       51 KiB
              snmpd r:       46 KiB w:        0 KiB
              pcscd r:       28 KiB w:        0 KiB
         irqbalance r:       27 KiB w:        4 KiB
              cupsd r:        4 KiB w:       18 KiB

    4.2.4. I/O Monitoring (By Device)

    This section describes how to monitor I/O activity on a specific device.

    traceio2.stp

    #! /usr/bin/env stap

    global device_of_interest

    probe begin {
      /* The following is not the most efficient way to do this.
          One could directly put the result of usrdev2kerndev()
          into device_of_interest. However, want to test out
          the other device functions */
      dev = usrdev2kerndev($1)
      device_of_interest = MKDEV(MAJOR(dev), MINOR(dev))
    }

    probe vfs.write, vfs.read
    {
      if (dev == device_of_interest)
        printf ("%s(%d) %s 0x%x\n",
                execname(), pid(), probefunc(), dev)
    }

    traceio2.stp takes one argument: the whole device number. To get this number, use stat -c "0x%D" directory, where directory is located on the device you wish to monitor.

    The usrdev2kerndev() function converts the whole device number into the format understood by the kernel. The output produced by usrdev2kerndev() is used in conjunction with the MKDEV(), MINOR(), and MAJOR() functions to determine the major and minor numbers of a specific device.

    The output of traceio2.stp includes the name and ID of any process performing a read/write, the function it is performing (i.e. vfs_read or vfs_write), and the kernel device number.

    The following example is an excerpt from the full output of stap traceio2.stp 0x805, where 0x805 is the whole device number of /home. /home resides in /dev/sda5, which is the device we wish to monitor.

    Example 4.8. traceio2.stp Sample Output

    [...]
    synergyc(3722) vfs_read 0x800005
    synergyc(3722) vfs_read 0x800005
    cupsd(2889) vfs_write 0x800005
    cupsd(2889) vfs_write 0x800005
    cupsd(2889) vfs_write 0x800005
    [...]
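    The step from the user-space number 0x805 to the kernel device number 0x800005 seen in the output can be reproduced with a little bit arithmetic. The Python sketch below reuses the tapset function names for readability, but it is a reimplementation under an assumption: that the kernel encodes a device number as the major number shifted left by 20 bits plus the minor number, and that the user-space value follows the layout handled by the Linux new_decode_dev() helper.

```python
MINORBITS = 20                      # assumed width of the kernel minor field
MINORMASK = (1 << MINORBITS) - 1


def MKDEV(major, minor):
    """Combine major and minor numbers into a kernel device number."""
    return (major << MINORBITS) | minor


def MAJOR(dev):
    return dev >> MINORBITS


def MINOR(dev):
    return dev & MINORMASK


def usrdev2kerndev(dev):
    """Decode a user-space device number (as printed by stat) for the kernel."""
    major = (dev & 0xfff00) >> 8
    minor = (dev & 0xff) | ((dev >> 12) & 0xfff00)
    return MKDEV(major, minor)


kernel_dev = usrdev2kerndev(0x805)
print(hex(kernel_dev), MAJOR(kernel_dev), MINOR(kernel_dev))
```

    Under these assumptions, 0x805 decodes to major 8 and minor 5, which recombine into the 0x800005 value printed by traceio2.stp.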

    4.2.5. Monitoring Reads and Writes to a File

    This section describes how to monitor reads from and writes to a file in real time.

    inodewatch.stp

    #! /usr/bin/env stap

    probe vfs.write, vfs.read
    {
      # dev and ino are defined by vfs.write and vfs.read
      if (dev == MKDEV($1,$2) # major/minor device
          && ino == $3)
        printf ("%s(%d) %s 0x%x/%u\n",
          execname(), pid(), probefunc(), dev, ino)
    }

    inodewatch.stp takes the following information about the file as arguments on the command line:

    The file's major device number.

    The file's minor device number.

    The file's inode number.

    To get this information, use stat -c '%D %i' filename, where filename is an absolute path.

    For instance: if you wish to monitor /etc/crontab, run stat -c '%D %i' /etc/crontab first. This gives the following output:

    805 1078319

    805 is the base-16 (hexadecimal) device number. The lower two digits are the minor device number and the upper digits are the major number. 1078319 is the inode number. To start monitoring /etc/crontab, run stap inodewatch.stp 0x8 0x05 1078319 (the 0x prefixes indicate base-16 values).
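    Splitting the stat output by hand is easy to get wrong, so the rule above (lower two hexadecimal digits for the minor number, the remaining upper digits for the major number) is sketched here in Python. The helper name inodewatch_args() is hypothetical, and the split assumes a device whose minor number fits in two hexadecimal digits, as in this example.

```python
def inodewatch_args(stat_output):
    """Turn the output of stat -c '%D %i' into the three script arguments."""
    dev_hex, inode = stat_output.split()
    dev = int(dev_hex, 16)
    major = dev >> 8        # upper hexadecimal digits
    minor = dev & 0xff      # lower two hexadecimal digits
    return ["0x%x" % major, "0x%02x" % minor, inode]


args = inodewatch_args("805 1078319")
print("stap inodewatch.stp " + " ".join(args))
```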

    The output of this command contains the name and ID of any process performing a read/write, the function it is performing (i.e. vfs_read or vfs_write), the device number (in hex format), and the inode number. Example 4.9, “inodewatch.stp Sample Output” contains the output of stap inodewatch.stp 0x8 0x05 1078319 (when cat /etc/crontab is executed while the script is running):

    Example 4.9. inodewatch.stp Sample Output

    cat(16437) vfs_read 0x800005/1078319
    cat(16437) vfs_read 0x800005/1078319

    4.2.6. Monitoring Changes to File Attributes

    This section describes how to monitor if any processes are changing the attributes of a targeted file, in real time.

    inodewatch2-simple.stp

    global ATTR_MODE = 1

    probe kernel.function("inode_setattr") {
      dev_nr = $inode->i_sb->s_dev
      inode_nr = $inode->i_ino

      if (dev_nr == ($1 ia_valid & ATTR_MODE)
        printf ("%s(%d) %s 0x%x/%u %o %d\n",
          execname(), pid(), probefunc(), dev_nr, inode_nr, $attr->ia_mode, uid())
    }

    Like inodewatch.stp from Section 4.2.5, “Monitoring Reads and Writes to a File”, inodewatch2-simple.stp takes the targeted file's device number (in integer format) and inode number as arguments. For more information on how to retrieve this information, refer to Section 4.2.5, “Monitoring Reads and Writes to a File”.

    The output for inodewatch2-simple.stp is similar to that of inodewatch.stp, except that inodewatch2-simple.stp also contains the attribute changes to the monitored file, as well as the ID of the user responsible (uid()). Example 4.10, “inodewatch2-simple.stp Sample Output” shows the output of inodewatch2-simple.stp while monitoring /home/joe/bigfile when user joe executes chmod 777 /home/joe/bigfile and chmod 666 /home/joe/bigfile.

    Example 4.10. inodewatch2-simple.stp Sample Output

    chmod(17448) inode_setattr 0x800005/6011835 100777 500
    chmod(17449) inode_setattr 0x800005/6011835 100666 500
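    The fifth field of each output line is the new mode printed in octal (the %o conversion in the script), so it carries the file type as well as the permission bits. A short Python sketch using the standard stat module shows how to read it; describe_mode() is a hypothetical helper name.

```python
import stat


def describe_mode(octal_text):
    """Decode an octal mode string into an ls-style string and permission bits."""
    mode = int(octal_text, 8)
    return stat.filemode(mode), oct(stat.S_IMODE(mode))


for field in ("100777", "100666"):
    print(field, describe_mode(field))
```

    The leading 100 marks a regular file, and the trailing three digits are the permissions passed to chmod.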

    4.3. Profiling

    The following sections showcase scripts that profile kernel activity by monitoring function calls.

    4.3.1. Counting Function Calls Made

    This section describes how to identify how many times the system called a specific kernel function in a 30-second sample. Depending on your use of wildcards, you can also use this script to target multiple kernel functions.

    functioncallcount.stp

    #! /usr/bin/env stap
    # The following line command will probe all the functions
    # in kernel's memory management code:
    #
    # stap functioncallcount.stp "*@mm/*.c"

    probe kernel.function(@1).call { # probe functions listed on commandline
      called[probefunc()]

    global called

    probe end {
      foreach (fn in called-) # Sort by call count (in decreasing order)
      # (fn+ in called) # Sort by function name
        printf("%s %d\n", fn, @count(called[fn]))
      exit()
    }

    functioncallcount.stp takes the targeted kernel function as an argument. The argument supports wildcards, which enables you to target multiple kernel functions up to a certain extent.

    The output of functioncallcount.stp contains the name of the function called and how many times it was called during the sample time (in alphabetical order). Example 4.11, “functioncallcount.stp Sample Output” contains an excerpt from the output of stap functioncallcount.stp "*@mm/*.c":

    Example 4.11. functioncallcount.stp Sample Output

    [...]
    __vma_link 97
    __vma_link_file 66
    __vma_link_list 97
    __vma_link_rb 97
    __xchg 103
    add_page_to_active_list 102
    add_page_to_inactive_list 19
    add_to_page_cache 19
    add_to_page_cache_lru 7
    all_vm_events 6
    alloc_pages_node 4630
    alloc_slabmgmt 67
    anon_vma_alloc 62
    anon_vma_free 62
    anon_vma_lock 66
    anon_vma_prepare 98
    anon_vma_unlink 97
    anon_vma_unlock 66
    arch_get_unmapped_area_topdown 94
    arch_get_unmapped_exec_area 3
    arch_unmap_area_topdown 97
    atomic_add 2
    atomic_add_negative 97
    atomic_dec_and_test 5153
    atomic_inc 470
    atomic_inc_and_test 1
    [...]
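    Because the excerpt above is listed by function name, the busiest functions are not immediately visible. If the output has been saved to a file, it can be re-ranked by call count afterwards; the Python sketch below assumes one "name count" pair per line, and top_functions() is a hypothetical helper name.

```python
def top_functions(output, limit=5):
    """Return the (name, count) pairs with the highest call counts."""
    counts = []
    for line in output.splitlines():
        parts = line.split()
        # Skip separators such as "[...]" and anything that is not "name count".
        if len(parts) == 2 and parts[1].isdigit():
            counts.append((parts[0], int(parts[1])))
    counts.sort(key=lambda item: item[1], reverse=True)
    return counts[:limit]


EXCERPT = """\
[...]
__vma_link 97
alloc_pages_node 4630
atomic_dec_and_test 5153
atomic_inc 470
atomic_inc_and_test 1
[...]
"""

for name, count in top_functions(EXCERPT, limit=3):
    print("%-24s %d" % (name, count))
```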

    4.3.2. Call Graph Tracing

    This section describes how to trace incoming and outgoing function calls.

    para-callgraph.stp

    #! /usr/bin/env stap

    function trace(entry_p, extra) {
      %( $# > 1 %? if (tid() in trace) %)
      printf("%s%s%s %s\n",
             thread_indent (entry_p),
             (entry_p>0?"->":"

    file=0xffff8801116ce980 ppos=0xffff88010544df48 count=0x1000
    7 gnome-terminal(2921): do_sync_read filp=0xffff8801116ce980 buf=0xc86504 len=0x1000 ppos=0xffff88010544df48
    15 gnome-terminal(2921):

    thread-times.stp lists the top 20 processes currently taking up CPU time within a 5-second sample, along with the total number of CPU ticks made during the sample. The output of this script also notes the percentage of CPU time each process used, as well as whether that time was spent in kernel space or user space.

    Example 4.13, “thread-times.stp Sample Output” contains a 5-second sample of the output for thread-times.stp:

    Example 4.13. thread-times.stp Sample Output

      tid   %user %kernel (of 20002 ticks)
        0   0.00%  87.88%
    32169   5.24%   0.03%
     9815   3.33%   0.36%
     9859   0.95%   0.00%
     3611   0.56%   0.12%
     9861   0.62%   0.01%
    11106   0.37%   0.02%
    32167   0.08%   0.08%
     3897   0.01%   0.08%
     3800   0.03%   0.00%
     2886   0.02%   0.00%
     3243   0.00%   0.01%
     3862   0.01%   0.00%
     3782   0.00%   0.00%
    21767   0.00%   0.00%
     2522   0.00%   0.00%
     3883   0.00%   0.00%
     3775   0.00%   0.00%
     3943   0.00%   0.00%
     3873   0.00%   0.00%

    4.3.4. Monitoring Polling Applications

    This section describes how to identify and monitor which applications are polling. Doing so allows you to track unnecessary or excessive polling, which can help you pinpoint areas for improvement in terms of CPU usage and power savings.

    timeout.stp

    #! /usr/bin/env stap
    # Copyright (C) 2009 Red Hat, Inc.
    # Written by Ulrich Drepper
    # Modified by William Cohen

    global process, timeout_count, to
    global poll_timeout, epoll_timeout, select_timeout, itimer_timeout
    global nanosleep_timeout, futex_timeout, signal_timeout

    probe syscall.poll, syscall.epoll_wait {
      if (timeout) to[pid()]=timeout
    }

    probe syscall.poll.return {
      p = pid()
      if ($return == 0 && to[p] > 0 ) {
        poll_timeout[p]++
        timeout_count[p]++
        process[p] = execname()
        delete to[p]
      }
    }

    probe syscall.epoll_wait.return {
      p = pid()
      if ($return == 0 && to[p] > 0 ) {
        epoll_timeout[p]++
        timeout_count[p]++
        process[p] = execname()
        delete to[p]
      }
    }

    probe syscall.select.return {
      if ($return == 0) {
        p = pid()
        select_timeout[p]++
        timeout_count[p]++
        process[p] = execname()
      }
    }

    probe syscall.futex.return {
      if (errno_str($return) == "ETIMEDOUT") {
        p = pid()
        futex_timeout[p]++
        timeout_count[p]++
        process[p] = execname()
      }
    }

    probe syscall.nanosleep.return {
      if ($return == 0) {
        p = pid()
        nanosleep_timeout[p]++
        timeout_count[p]++
        process[p] = execname()
      }
    }

    probe kernel.function("it_real_fn") {
      p = pid()
      itimer_timeout[p]++
      timeout_count[p]++
      process[p] = execname()
    }

    probe syscall.rt_sigtimedwait.return {
      if (errno_str($return) == "EAGAIN") {
        p = pid()
        signal_timeout[p]++
        timeout_count[p]++
        process[p] = execname()
      }
    }

    probe syscall.exit {
      p = pid()
      if (p in process) {
        delete process[p]
        delete timeout_count[p]
        delete poll_timeout[p]
        delete epoll_timeout[p]
        delete select_timeout[p]
        delete itimer_timeout[p]
        delete futex_timeout[p]
        delete nanosleep_timeout[p]
        delete signal_timeout[p]
      }
    }

    probe timer.s(1) {
      ansi_clear_screen()
      printf ("  pid |   poll  select   epoll  itimer   futex nanosle  signal| process\n")
      foreach (p in timeout_count- limit 20) {
        printf ("%5d |%7d %7d %7d %7d %7d %7d %7d| %-.38s\n", p,
                poll_timeout[p], select_timeout[p],
                epoll_timeout[p], itimer_timeout[p],
                futex_timeout[p], nanosleep_timeout[p],
                signal_timeout[p], process[p])
      }
    }

    timeout.stp tracks how many times each application used the following system calls over time:

    poll

    select

    epoll

    itimer

    futex

    nanosleep

    signal

    In some applications, these system calls are used excessively. As such, they are normally identified as "likely culprits" for polling applications. Note, however, that an application may be using a different system call to poll excessively; sometimes, it is useful to find out the top system calls used by the system (refer to Section 4.3.5, “Tracking Most Frequently Used System Calls” for instructions). Doing so can help you identify any additional suspects, which you can add to timeout.stp for tracking.

    Example 4.14. timeout.stp Sample Output

      uid |   poll  select   epoll  itimer   futex nanosle  signal| process
    28937 | 148793       0       0    4727   37288       0       0| firefox
    22945 |      0   56949       0       1       0       0       0| scim-bridge
        0 |      0       0       0   36414       0       0       0| swapper
     4275 |  23140       0       0       1       0       0       0| mixer_applet2
     4191 |      0   14405       0       0       0       0       0| scim-launcher
    22941 |   7908       1       0      62       0       0       0| gnome-terminal
     4261 |      0       0       0       2       0    7622       0| escd
     3695 |      0       0       0       0       0    7622       0| gdm-binary
     3483 |      0    7206       0       0       0       0       0| dhcdbd
     4189 |   6916       0       0       2       0       0       0| scim-panel-gtk
     1863 |   5767       0       0       0       0       0       0| iscsid
     2562 |      0    2881       0       1       0    1438       0| pcscd
     4257 |   4255       0       0       1       0       0       0| gnome-power-man
     4278 |   3876       0       0      60       0       0       0| multiload-apple
     4083 |      0    1331       0    1728       0       0       0| Xorg
     3921 |   1603       0       0       0       0       0       0| gam_server
     4248 |   1591       0       0       0       0       0       0| nm-applet
     3165 |      0    1441       0       0       0       0       0| xterm
    29548 |      0    1440       0       0       0       0       0| httpd
     1862 |      0       0       0       0       0    1438       0| iscsid

    You can increase the sample time by editing the timer in the final probe (timer.s()). The output of timeout.stp contains the name and ID of the top 20 polling applications, along with how many times each application performed each polling system call (over time). Example 4.14, “timeout.stp Sample Output” contains an excerpt of the script's output.
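    When reviewing a captured screen of timeout.stp output, it can help to reduce each row to the call that dominates it. The Python sketch below assumes the row layout of the sample output (an ID, seven counters between the two | characters, then the process name); dominant_call() is a hypothetical helper name, and a process name containing | would need extra handling.

```python
COLUMNS = ["poll", "select", "epoll", "itimer", "futex", "nanosle", "signal"]


def dominant_call(row):
    """Return (id, process, busiest column, count) for one output row."""
    left, process = row.rsplit("|", 1)
    id_text, counts_text = left.split("|", 1)
    counts = [int(value) for value in counts_text.split()]
    best = max(range(len(COLUMNS)), key=lambda index: counts[index])
    return int(id_text), process.strip(), COLUMNS[best], counts[best]


ROWS = [
    "28937 | 148793       0       0    4727   37288       0       0| firefox",
    " 4261 |      0       0       0       2       0    7622       0| escd",
]

for row in ROWS:
    print(dominant_call(row))
```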

    4.3.5. Tracking Most Frequently Used System Calls

    timeout.stp from Section 4.3.4, “Monitoring Polling Applications” helps you identify which applications are polling by pointing out which ones used the following system calls most frequently:

    poll

    select

    epoll

    itimer

    futex

    nanosleep

    signal

    However, in some systems, a different system call might be responsible for excessive polling. If you suspect that a polling application is using a different system call to poll, you first need to identify the top system calls used by the system. To do this, use topsys.stp.

    topsys.stp

    #! /usr/bin/env stap
    #
    # This script continuously lists the top 20 systemcalls in the interval
    # 5 seconds
    #

    global syscalls_count

    probe syscall.* {
      syscalls_count[name]++
    }

    function print_systop () { printf ("%25s %10s\n", "SYSC