
Red Hat Enterprise Linux
Performance Tuning Guide

Red Hat Enterprise Linux: Performance Tuning Guide
Copyright © 2004 by Red Hat, Inc.

1801 Varsity Drive
Raleigh NC 27606-2072 USA
Phone: +1 919 754 3700
Phone: 888 733 4281
Fax: +1 919 754 3701
PO Box 13588
Research Triangle Park NC 27709 USA

This document is derived, in part, from papers by Henry Maine, Neil Horman, Rik Faith, Adrian Likins, Sherif Abdelgawad, Ulrich Drepper, Will Cohen, and Daniel Barrange.

Red Hat is a registered trademark and the Red Hat Shadow Man logo, RPM, and the RPM logo are trademarks of Red Hat, Inc.

Linux is a registered trademark of Linus Torvalds.

Intel™, Pentium™, Itanium™, and Celeron™ are registered trademarks of Intel Corporation.

All other trademarks and copyrights referred to are the property of their respective owners.

Copyright © 2005 by Red Hat Inc. This material may be distributed only subject to the terms and conditions set forth in the Open Publication License, V1.0 or later (the latest version is presently available at http://www.opencontent.org/openpub/).

A PDF version of this manual is available online at http://people.redhat.com/mbehm/.

Table of Contents

Preface
    1. Who this Manual is For
    2. Document Conventions
    3. We Need Feedback
1. Evidence-Based Performance Analysis and Tuning
    1.1. Is the Performance Problem 5% or 50%?
        1.1.1. Avoid Jumping to Conclusions
    1.2. Evidence-Based Analysis
        1.2.1. Developing a Model
        1.2.2. Planning What Evidence to Gather
        1.2.3. Gathering Evidence from the Running Application
        1.2.4. Analyzing the Evidence
        1.2.5. Summarize Findings and Propose an Experiment
    1.3. Evidence-Based Tuning
2. System Tuning Tools for Linux Servers
    2.1. Benchmarking Utilities
        2.1.1. Web Server Benchmarking Tools
    2.2. System-Monitoring Utilities
    2.3. Kernel Monitoring Utilities
    2.4. Disk Benchmarking and Monitoring Utilities
    2.5. Memory Monitoring Utilities
    2.6. Network Monitoring Utilities
    2.7. Tools for Improving Application Performance
        2.7.1. Pointer-Use Errors
        2.7.2. Tools for NPTL
3. Understanding the Virtual Memory Management System
    3.1. Virtual Memory Terminology
        3.1.1. What Comprises a VM
        3.1.2. Memory Management Unit (MMU)
        3.1.3. Zoned Buddy Allocator
        3.1.4. Slab Allocator
        3.1.5. Kernel Threads
        3.1.6. Components that Use the VM
    3.2. The Life of A Page
    3.3. Tuning the VM
        3.3.1. bdflush
        3.3.2. dcache_priority
        3.3.3. hugetlb_pool
        3.3.4. inactive_clean_percent
        3.3.5. kswapd
        3.3.6. max_map_count
        3.3.7. max-readahead
        3.3.8. min-readahead
        3.3.9. overcommit_memory
        3.3.10. overcommit_ratio
        3.3.11. pagecache
        3.3.12. page-cluster
4. Tuning Linux Servers
    4.1. Reducing System Load
        4.1.1. Avoiding Swapping
        4.1.2. ptys and ttys
    4.2. Disk Tuning
        4.2.1. SCSI Tuning
        4.2.2. Read-ahead
        4.2.3. File System Fragmentation
        4.2.4. Tuning the ext2 File System
        4.2.5. Tuning the ext3 File System
        4.2.6. Creating a Software RAID Device
        4.2.7. Use a Red Hat Global File System (GFS)
    4.3. Virtual Memory System Tuning
        4.3.1. File (IMAP, Web, etc.) Servers
        4.3.2. General Compute Server With Many Active Users
        4.3.3. Non-interactive (Batch) Computing Server
    4.4. Kernel Tuning
        4.4.1. Kernel Tuning Tool
        4.4.2. Flushing Old Pages with pdflush (RHEL 4) and kupdated (RHEL 3)
        4.4.3. Setting bdflush for Server Operation (RHEL 3)
        4.4.4. Disk I/O Elevators
        4.4.5. File Descriptor Limits
    4.5. Network Interface Card Tuning
        4.5.1. Network Interface Configuration
        4.5.2. Ethernet Channel Bonding
    4.6. TCP Tuning
        4.6.1. When a CPU is Overloaded by TCP
    4.7. NFS Tuning
        4.7.1. Tuning NFS Buffer Sizes
        4.7.2. NFS over UDP
    4.8. Java Tuning
    4.9. Apache Configuration
        4.9.1. Configuring Apache for Benchmark Scores
        4.9.2. Compiling Apache 2.0 for Your Environment
        4.9.3. Increasing the Number of Available File Handles
        4.9.4. Reduce Disk Writes with noatime
        4.9.5. ListenBacklog
        4.9.6. Using Static Content Servers
        4.9.7. Proxy Usage
        4.9.8. Samba Tuning
        4.9.9. OpenLDAP Tuning
5. Increasing Performance Through System Changes
    5.1. Can You Upgrade Your Hardware?
        5.1.1. Red Hat Enterprise Linux 3 System Limits
        5.1.2. Applications with Hardware Prerequisites
    5.2. Adding a Hardware RAID
6. Detecting and Preventing Application Memory Problems
    6.1. Using valgrind
        6.1.1. valgrind Overview
        6.1.2. valgrind Limitations and Dependencies
        6.1.3. Before Running valgrind
        6.1.4. Running valgrind
        6.1.5. How to Read the valgrind Error Report
    6.2. Detecting Memory-Handling Problems with mudflap
        6.2.1. Using Mudflap
        6.2.2. How to Read the libmudflap Error Report
    6.3. Best Programming Practices
        6.3.1. General Suggestions
        6.3.2. Notes About C Functions
7. Application Tuning with OProfile
    7.1. OProfile Installation
    7.2. OProfile Configuration
    7.3. Collecting And Analyzing Data
    7.4. Viewing the Overall System Profile
    7.5. Examining A Single Executable’s Profile
    7.6. Optimizing Code
        7.6.1. Memory References And Data Cache Misses
        7.6.2. Unaligned/Partial Memory Accesses
        7.6.3. Branch Misprediction
        7.6.4. Instruction Cache Misses
    7.7. Cautions About OProfile
    7.8. Performance Tuning with Eclipse
        7.8.1. Configuring the OProfile Daemon with the Launch Manager
        7.8.2. Data Analysis with the OProfile Perspective
8. Analyzing Java Application Performance Problems Using the IBM JVM
    8.1. Problem Scenario
    8.2. Diagnostic Tools
        8.2.1. Javadump
        8.2.2. Heapdump
        8.2.3. JFormat
    8.3. Diagnosing the Problem
    8.4. Conclusions
A. PerformanceAnalysisScript
B. Additional Resources
    B.1. General Linux Tuning
    B.2. Shared Memory/Virtual Memory
    B.3. NFS Tuning
    B.4. TCP Tuning
    B.5. mod_perl Tuning
    B.6. Application Tuning
    B.7. Samba Tuning Resources
    B.8. mod_proxy Tuning
    B.9. OProfile Resources
Index

Preface

This guide describes how to determine whether your system has performance problems and, if so, what to tune and what not to tune.

    Caution

While this guide contains information that is field-tested and proven, your unique environment will require that you carefully consider the ramifications of implementing the recommendations made here. It is unlikely that anything catastrophic would occur, but it is always important to have your data and systems backed up, and to have an implementation reversal plan should the need arise.

1. Who this Manual is For

This manual contains sections on system tuning and application tuning:

• The intended audience for the system tuning section is the same as for the Red Hat course RH401 Red Hat Enterprise Deployment and Systems Management; that is, primarily upper-level or senior system administrators. In addition to reading this manual, you should also take the Red Hat course RH442 Red Hat Enterprise System Monitoring and Performance Tuning.

• Chapter 1 Evidence-Based Performance Analysis and Tuning gives you a strategy for performance tuning. Once you have determined where the problem is, then you can determine how to fix it, whether this requires reconfigured software, new hardware, or modifications to the application.

• Chapter 2 System Tuning Tools for Linux Servers introduces many of the performance tuning tools that are available and describes how to use them to improve the performance of your system components.

• Chapter 3 Understanding the Virtual Memory Management System describes the virtual memory management system.

    • Chapter 4 Tuning Linux Servers describes how to tune various components in your server.

• Chapter 5 Increasing Performance Through System Changes looks at hardware upgrades that you could consider.

• Appendix A PerformanceAnalysisScript is a performance-analysis script that you may find useful.

    • The application-tuning chapters are for any application developer.

• Chapter 6 Detecting and Preventing Application Memory Problems describes how to find memory-handling problems in applications. There is also a section on coding techniques that can help avoid memory problems in the future.

• Chapter 7 Application Tuning with OProfile describes using OProfile to tune an application. Instructions are given for both the stand-alone OProfile application and the Eclipse OProfile plug-in.

• Chapter 8 Analyzing Java Application Performance Problems Using the IBM JVM is a case study involving Java application performance problems.

    • Appendix B Additional Resources lists links to more information.


2. Document Conventions

Certain words in this manual are represented in different fonts, styles, and weights. This highlighting indicates that the word is part of a specific category. The categories include the following:

    Courier font

    Courier font represents commands, file names and paths, and prompts.

When shown as below, it indicates computer output:

Desktop    about.html    logs     paulwesterberg.png
Mail       backupfiles   mail     reports

    bold Courier font

    Bold Courier font represents text that you are to type, such as: xload -scale 2

    italic Courier font

    Italic Courier font represents a variable, such as an installation directory: install_dir/bin/

bold font

Bold font represents application programs, a button on a graphical application interface (OK), or text found on a graphical interface.

Additionally, the manual uses different strategies to draw your attention to pieces of information. In order of how critical the information is to you, these items are marked as follows:

    Note

    Linux is case-sensitive: a rose is not a ROSE is not a rOsE.

    Tip

    The directory /usr/share/doc/ contains additional documentation for installed packages.

    Important

    Modifications to the DHCP configuration file take effect when you restart the DHCP daemon.

    Caution

Do not perform routine tasks as root; use a regular user account unless you need to use the root account for system administration tasks.


    Warning

Be careful to remove only the listed partitions. Removing other partitions could result in data loss or a corrupted system environment.

3. We Need Feedback

If you have thought of a way to make this manual better, submit a bug report against the documentation component of the product Red Hat Application Server in Bugzilla at: http://bugzilla.redhat.com/bugzilla/

    When submitting a bug report, be sure to mention the manual’s identifier:

    rhel-EN-3-PDF-RHI (2005-05-01T01:08)

If you have a suggestion for improving the documentation, try to be as specific as possible when describing it. If you have found an error, please include the section number and some of the surrounding text so we can find it easily.

If you have a support question (for example, if you are not sure how to partition your hard drives), use the online support system by registering your product at: http://www.redhat.com/apps/activate/


Chapter 1.

Evidence-Based Performance Analysis and Tuning

You may see “performance problems” when deploying an application. This manual discusses how to approach the diagnosis of system performance issues, what areas to examine, and how to tune (and when not to tune) systems for various workloads.

Scheduling pressure may make you inclined to try to fix the problem before you diagnose it. You may have a list of suggestions based loosely on known operating system problems or the general kind of application being used. For example, if you are using an old RHEL AS 2.1 kernel, you may consider upgrading; if you are using Java, you might try settings that have been used for other Java applications; if you are using TCP/IP, you may try kernel-tuning parameters.

    These initial, off-the-cuff suggestions never solve the problem.

Real-world performance problems are not solvable in 5 minutes with a short list of random suggestions. Instead, you must step back, understand the issues, and then make suggestions that have two key qualities:

    • The suggestions are based on evidence.

    • There is a metric that can be measured both before and after the change.

1.1. Is the Performance Problem 5% or 50%?

When looking at a performance problem, the first thing you must do is understand the level of performance increase required. You must also understand the level of performance increase typically provided by various technical solutions.

For example, binding IRQs to CPUs on SMP systems is standard operating procedure for high-throughput installations, and this is often the first suggestion for “performance problems” on SMP systems. However, a Samba netbench study shows that just enabling IRQ affinity can decrease performance. Would you be happy if the first thing someone suggested, without understanding the problem or taking any measurements, moved your performance by about -2%? IRQ affinity is useful for obtaining the last 5% performance improvement, but only after the rest of the system is tuned; it will not help a machine that is having trouble pushing more than 30 Mbps through a 100baseT connection.
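If you do eventually reach the point where testing IRQ affinity is worthwhile, it is set by writing a CPU bitmask into /proc. A minimal sketch follows; the IRQ number is only an example, so look up the real one in /proc/interrupts:

# Find the IRQ used by the first Ethernet interface:
grep eth0 /proc/interrupts
# Bind that IRQ (here, IRQ 24) to CPU 1 (bitmask 0x2):
echo 2 > /proc/irq/24/smp_affinity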

    Rule: Do not try 5% solutions when you need a 50% increase in performance.

1.1.1. Avoid Jumping to Conclusions

It is common to call all application deployment issues “performance problems”, and to assume that the “performance problem” is really a Linux kernel issue. This is seldom the case. Further, there might be two or more different issues that are independent, but that appear to be a single issue. Tackle the big issues first: a dozen “5% tweaks” will not fix the broken Ethernet switch to which the machine is attached.

    How do you avoid jumping to conclusions? Gather evidence.


1.2. Evidence-Based Analysis

Evidence-based analysis requires:

    1. Developing a model of the system in which you diagram all of the components.

    2. Planning what type of evidence you need to gather.

    3. Gathering evidence from the running application.

    4. Analyzing the evidence.

1.2.1. Developing a Model

Construct a model of the system you want to tune, including all of the components. This may require separate diagrams for each subsystem.

Figure 1-1. Basic System Diagram

[Figure: a diagram of the system components to record for each machine: CPU and MHz, RHEL version, RAM, bus, disk type, swap size, hardware RAID, software RAID, NFS and NFS protocol, NIC, and LAN.]

Once you have mapped the system, calculate the maximum theoretical data flows. This will be the baseline to which you will compare your observations.

1.2.2. Planning What Evidence to Gather

Your goal is to gather evidence that leads you to understand several basic facts:

    • Why do you think there is a performance problem?

The answer may be obvious (for example, an application can process 1000 transactions per second (tps) for long periods, but when the rate goes above 2000 tps, the machine hangs). However, in many cases, the answer is less clear (for example, the application rate was 1000 tps under an older Solaris® machine and is 2000 tps on a new Linux machine, but does not “scale”).

    • What is the scope of the problem?

Is the problem being experienced by a single user, or by all users? Does the problem appear only at certain times of the day, certain days of the week, or certain times of the year? If the onset of the problem was sudden, was it preceded by any change to the system (an upgrade to software, the replacement of hardware, or the addition of a new machine to the network)?

• What metric can be used to measure performance from one test run to another? Is this the right metric for the problem?

If you are to successfully solve a performance problem, you must have a metric by which to judge your progress. Obvious metrics are things such as bytes per second, packets per second, or transactions per second.

    • What are your goals? What is the measure of success?

In the case of an application that has been ported from Solaris to Linux, you may be getting good performance under Linux and want only to rule out problems with the Linux kernel that may be limiting scalability. (In cases such as this, the application is usually hitting internal application-related performance bottlenecks that were never seen under Solaris because the transaction rate was never high enough to trigger the problem.)

1.2.3. Gathering Evidence from the Running Application

After determining your metric and specific goals, the next step is to examine the running application. This is often difficult, especially when the application runs only in a production environment.

One of the main goals of your initial analysis will be to rule out common problems. With this goal in mind, a low-impact script (PerformanceAnalysisScript.sh) is provided with this document (see Appendix A PerformanceAnalysisScript) and online at http://people.redhat.com/mbehm/. This script can be run by hand before, during, and after the performance event (or you can modify it to be run every 30 minutes from cron). The script provides enough information to rule out obvious problems with swapping, process size, process count, Ethernet bandwidth availability, network errors, disk I/O bandwidth, disk errors, and interrupt counts. Always obtain at least two runs of any data-gathering scripts so that you can compare a “normal, loaded system” with the system when it is not performing as expected.
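As a sketch, a crontab entry such as the following would run the script every 30 minutes and keep the output for later comparison (the install path and log file are assumptions; adjust them to wherever you place the script):

# minute hour day-of-month month day-of-week command
*/30 * * * * /usr/local/bin/PerformanceAnalysisScript.sh >> /var/log/perf-analysis.log 2>&1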

A complete list of all the things you should look for at this point would be large. Initially, see if there are common problems that may have been overlooked:

    • Performance always degrades when the system starts swapping.

    • Network performance can be poor due to bad hardware or a bad negotiation.

• Common application errors that will result in poor Linux performance include such things as polling for I/O instead of using poll/select (or using these calls with a zero timeout!).

1.2.4. Analyzing the Evidence

After you have the run-time results in hand, you will understand the problem, the goal, the application, and the environment.

First, perform informal calculations (based on the performance logs, perhaps the output of a script or of sar(1) or sa1) to ensure that the goals are realistic:

• Is there sufficient capacity: is there enough memory to prevent swapping? Are there enough CPU cycles to do the required computations in the required amount of time?


The vmstat utility will show how many processes are in the run queue. If more processes are waiting to run (“r” under “procs”) than the machine has CPUs, then the system is probably CPU bound. Note that this does not necessarily mean it is running at maximum potential capacity; programming errors that waste CPU cycles on useless work will cause a system to appear CPU bound, but small changes in programming technique may be able to improve performance dramatically.

• Is there sufficient bandwidth: is there enough memory, PCI, network, and disk I/O bandwidth to service the number and size of transactions required to achieve the goals?

If the vmstat utility shows many processes in the blocked statistic (“b” under “procs”), then the system may be I/O bound. This could be because of slow disks or slow Ethernet cards. It could also be because very fast cards have not been placed on a 64-bit bus. Check the negotiated Ethernet rate and compute the throughput in bits per second; you can usually assume 5 Mbps for 10baseT, 80-90 Mbps for 100baseT, and 300-1000 Mbps for 1000baseT, depending on cards, PCI bus width and speed, and drivers. (See the example commands after this list.)

• Are you using hardware that enables fast performance, such as SCSI drives rather than IDE disk drives?
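The following commands are a minimal sketch of these capacity checks; the interface name eth0 is an assumption:

# Sample the run queue ("r") and blocked ("b") columns every 5 seconds:
vmstat 5 5
# Check the negotiated speed and duplex of the network interface:
ethtool eth0        # on older drivers, mii-tool eth0 may work instead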

    Second, determine what sort of improvement you need:

    • Does the system appear to be highly tuned?

If so, now is the time to make a list of “performance tweaks” that you predict will help eke out the remaining 5-10% of the performance requested. Make sure that you make only one change at a time and re-run the performance metrics to provide evidence that your tweak helped. Many tweaks have well-documented performance “gains” in the -5% to 5% range; make certain you do not make things worse for your particular workload.

• Does the system or application crash, hang, or require a 50-100% improvement in performance? This is the case where applying small performance tweaks fails dramatically. However, Red Hat provides some options:

• You can take the Red Hat course on system monitoring and performance tuning, RH442 (http://www.redhat.com/training/architect/courses/rh442.html).

• You can engage Red Hat consulting services (http://www.redhat.com/services/focus/hpc/).

• Confirm that you are running the latest update to your operating system. Red Hat updates fix bugs that can affect performance, so if the system is an old installation, run up2date.

    Third, check the performance logs for obvious system-level problems:

• Is the system swapping? Most production systems are implemented to avoid swapping, so this is unusual.

• Are there networking or other issues that may have been overlooked? (For example, are there any errors reported in ifconfig?)

• Are some resource limits being exceeded? Look especially at the number of open file descriptors, the number of processes, and the number of processes linked with libpthread (these were limited in various ways in AS2.1, for example). See the example commands after this list.

Try installing and running dklimits, which is described in Section 2.2 System-Monitoring Utilities.

• Are some system components under their limits? See Section 5.1 Can You Upgrade Your Hardware?.


• Make sure nothing suspicious is in /var/log/messages. This may include hard disk or other hardware errors that have been overlooked.

• Is there any evidence of any known kernel-implementation issue? These are rare, but large enterprise installations may uncover them.

• Are there any binary-only modules loaded in the kernel, and are there any known issues with them? Just pointing to the first binary-only module you notice qualifies as a random answer, unless you have evidence that that particular module causes a problem.
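A few quick commands cover most of these system-level checks; this is only a sketch, and the interface name is an assumption:

# Open file handles in use versus the system-wide limit:
cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
# Per-shell resource limits (open files, processes):
ulimit -n
ulimit -u
# Interface errors and dropped packets:
ifconfig eth0
# Recent kernel and hardware complaints:
tail -200 /var/log/messages | grep -i error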

Fourth, check the performance logs (from the PerformanceAnalysisScript.sh script (see Appendix A PerformanceAnalysisScript), from your custom script, or from sar(1)) for obvious application-level problems:

• Is the application using 100% of the CPU? Why? That is, is the application really doing work, or did someone implement a call to poll(2) incorrectly?

    • Is the memory footprint growing without bound? Is this by design or by accident?

• For ported applications, does the application take advantage of, or rely on, a feature (or implementation artifact) that is not available on Linux?

• Is the application making use of a feature that works well under Solaris but does not work well under Linux (for example, streams, certain uses of async I/O, or threads in AS2.1)?

1.2.5. Summarize Findings and Propose an Experiment

After gathering and analyzing the evidence you have, prove your assertions by performing an experiment and measuring a change in the metric you have previously identified.

Experiments should change one thing at a time. This is often difficult, especially when testing on production systems. However, if you change more than one variable, you will not know which change provided an improvement (and you will miss the possibility that one change makes a positive impact that was canceled out by another change). This is extremely important.

1.3. Evidence-Based Tuning

System tuning should never be performed without experimental evidence that the tuned parameters have a positive impact on system performance. Sometimes, this evidence can be gathered simply by getting a metric for a run, changing a parameter, and getting a metric for a second run. This is the rare case, however.

The more common case is that, without changing anything, the metric you are measuring will have 5-10% variation in it. Therefore, if you make a small tuning change that is known to generally produce a -5% to 5% performance change, doing a pair of runs is worthless. (In this case, a series of runs before and after the tuning change, together with appropriate statistical techniques (for example, ANOVA), can demonstrate a difference. Further, such small changes are likely to be dominated by other factors and are not worth the effort to make.)

If you are unable to demonstrate that a tuning change makes a measurable difference in performance, then do not make the change. Modern Linux kernels are designed to be self-tuning, and random changes you make could impact the effects of automatic tuning as the system runs for long periods of time.

Performance tuning is not without risks. If the machine is used for more than one task, you may be inadvertently degrading the performance of another application (for example, you may increase I/O throughput but severely degrade interactivity).


Chapter 2.

    System Tuning Tools for Linux Servers

    This chapter describes some useful tools and utilities.

2.1. Benchmarking Utilities

A good set of benchmarking utilities is very helpful in doing system-tuning work. It is impossible to duplicate "real-world" situations, but that is not really the goal of a good benchmark. A good benchmark typically tries to measure the performance of one particular thing very accurately. If you understand what the benchmarks are doing, they can be very useful tools.

Some useful benchmarking tools are listed below. Note that they have some overlap in functionality with the tools described later in this chapter.

    LMbench

LMbench is a suite of simple, portable benchmarks that measure bandwidth (memory operations and TCP) as well as latency.

    See http://www.bitmover.com/lmbench/ for details.

    Bonnie

Bonnie is a utility for testing driver performance. Its only drawback is that it sometimes requires the use of huge datasets on large-memory machines to get useful results.

Bonnie (http://www.textuality.com/bonnie/) has been around forever, and the numbers it produces are meaningful to many people. Thus, it is a good tool for producing information to share with others.

    Bonnie produces the following measures:

• Block Reads/Writes: In these tests, bonnie tries to read a file it has already written in blocks of 8 KB at a time. It does not do anything useful with the data it reads; it just reads a block of data in, then reads the next block when that is done. This gives some indication of how fast the operating system is at sequential disk reads. Keep in mind that this is a regular file read through the file system; no async I/O and no raw disk I/O is being done here. As a result, do not try to compare these results to the disk-read results from something like IOMeter under Windows NT: they are testing two totally separate things (IOMeter uses raw I/O and async I/O under NT to get its speed, which bypasses the file system layer entirely and tests only the speed of the disks and controller). By going through the file system, you get a closer-to-real-life metric, as you see just how fast a normal application can expect the operating system to feed it data. Most applications are not async-I/O and raw-I/O aware and will not be able to muster the numbers that IOMeter would suggest. For example, Microsoft Word under Windows NT does not use raw I/O and async I/O, so the numbers you get from bonnie would be a closer approximation than the numbers from IOMeter as to how fast NT could supply a large file to Microsoft Word. The same is true for writes.

• Character Reads/Writes: In these tests, bonnie writes to, or reads from, the file one byte at a time using either getc or putc. This tests only the speed of your libc library, because how the libc library implements the buffering of these single-byte requests, and how efficient it is outside of that, determines the speed of this test. Under Linux, the speed of putc/getc is also drastically affected by whether or not you use the thread-safe version. When using glibc-2.0 or later, the putc_unlocked() and getc_unlocked() routines greatly improve performance on these tests.


• Rewrites: In this test, bonnie reads an 8 KB chunk of data, modifies one byte, then writes that 8 KB chunk back to the original file. This tends to stress the file-system code quite heavily. The file system has to continually update data blocks on the disk or re-allocate blocks on the disk, etc. It also stresses how well the operating system caches data. This would most closely approximate what a smaller SQL database would be like during update phases. (bonnie does not even begin to tell you anything about databases, such as Oracle, that want to have access to plain, raw devices and to have no operating system intervention in their read/write patterns; it mimics only smaller databases that utilize plain files to store information.)

• Random Seeks: This provides a reasonable test of how well your operating system orders the requests it gets. This test randomly seeks into the file that was written and reads a small block of data. The more of these you can complete per second, the faster your operating system is going to be when performing real work. Since the biggest bottleneck for hard disks is not the actual disk throughput but the disk-seek times, better numbers here mean better performance overall. This is especially true in multi-user environments.

There is also a somewhat newer version of bonnie, called bonnie++ (http://www.coker.com.au/bonnie++/), that fixes a few bugs and includes extra tests.
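As a rough sketch, a typical Bonnie run looks like the following; the directory and size are placeholders, and the file size should be well beyond physical RAM so the page cache does not hide the disk (check bonnie's help output for the exact flags on your build):

# -d  directory on the file system under test
# -s  size of the test file in MB
bonnie -d /mnt/testfs -s 2048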

    dkftpbench

An FTP benchmarking utility that is designed to simulate real-world FTP usage (large number of clients, throttling connections to modem speeds, etc.). It also includes the useful dklimits utility.

    dkftpbench is available from Dan Kegel’s page (http://www.kegel.com/dkftpbench/).

2.1.1. Web Server Benchmarking Tools

You might want to apply benchmarking to high-traffic web sites or when you have an unexpected load spike (for example, a virus is flooding your servers with bogus requests).

    autobench

autobench is a simple Perl script for automating the process of benchmarking a web server (or for conducting a comparative test of two different web servers). The script is a wrapper around httperf. Autobench runs httperf a number of times against each host, increasing the number of requested connections per second on each iteration, and extracts the significant data from the httperf output, delivering a CSV or TSV format file which can be imported directly into a spreadsheet for analysis/graphing.

For more information on autobench, see http://www.xenoclast.org/autobench/. To download, go to: http://www.xenoclast.org/autobench/downloads

    http_load

http_load is a nice, simple HTTP benchmarking application that does integrity checking, parallel requests, and simple statistics. http_load generates load based on a test file of URLs to access, so it is flexible.

    http_load is available from ACME Labs (http://www.acme.com/software/http_load/).
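A sketch of a typical run, assuming a file urls.txt that contains one URL per line:

# 20 parallel fetchers for 60 seconds against the URLs in urls.txt
http_load -parallel 20 -seconds 60 urls.txt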

    httperf

httperf is a popular web server benchmark tool. It provides a flexible facility for generating various HTTP workloads and for measuring server performance. The focus of httperf is not on implementing one particular benchmark but on providing a robust, high-performance tool that facilitates the construction of both micro- and macro-level benchmarks. The three distinguishing characteristics of httperf are its robustness (which includes the ability to generate and sustain server overload), its support for the HTTP/1.1 protocol, and its extensibility to new workload generators and performance measurements.


For information, see http://www.hpl.hp.com/personal/David_Mosberger/httperf.html. To download, go to ftp://ftp.hpl.hp.com/pub/httperf/
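As a sketch, the following asks httperf to open 1000 connections at a rate of 100 connections per second, with one request per connection (the host name and URI are placeholders):

httperf --server www.example.com --port 80 --uri /index.html \
        --num-conns 1000 --rate 100 --timeout 5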

    WebStone

WebStone is a highly-configurable client-server benchmark for HTTP servers. It makes HTTP 1.0 GET requests for specific pages on a web server and measures the throughput and latency of each HTTP transfer. By default, only statistical data are returned, but you can request data for each transaction. WebStone also reports transaction failures ("Connection Refused" errors).

    For more information, see http://www.mindcraft.com/webstone/

2.2. System-Monitoring Utilities

Many system-monitoring utilities report the system load. A heavily loaded system will have a load average that is greater than two times the number of CPUs. Note that a high load may indicate a machine that is CPU-bound, but it can also indicate a memory-bound or I/O-bound system.

These system-monitoring utilities have some overlap in functionality with the tools described in Section 2.1 Benchmarking Utilities.

    xload

xload provides a graphical view of the system load. Of course, this requires that you run the X Window system.

xload -scale 2

    uptime

If you are not running the X Window system, you can see a snapshot of the load average for the previous 1, 5, and 15-minute periods by using uptime:

uptime

This provides information similar to the following:

 9:10:01 up 120 days, 0 min, 9 users, load average: 9.23, 0.96, 0.00

The value for users is the number of interactive login sessions on the system. The load average is the number of processes that are waiting to run or have been blocked while waiting for I/O. Uniprocessor machines should give values around 1.00; multi-processor machines might run up to 10 times that rate.

    vmstat

    vmstat, which is part of the procps package, provides information about virtual memory.

Here is a sample vmstat output from a lightly used server:

procs                      memory      swap          io     system       cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0 300492  16872 391760 436980    0    1     0     1    1    0  0  0  1  1

And here is some sample output from a heavily used server:

procs                      memory      swap          io     system        cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
16  0   2360 264400  96672   9400    0    0     0     1   53    24  3  1 96  1
24  0   2360 257284  96672   9400    0    0     0     6 3063 17713 64 36  0  1
15  0   2360 250024  96672   9400    0    0     0     3 3039 16811 66 34  0  1

The most interesting number here is the first one, “r”, in the procs (processes) section; this is the number of processes that are on the run queue. This value shows how many processes are ready to be executed but cannot run at the moment because other processes need to finish.


For lightly loaded systems, this should be something less than two to four times the number of processors (so, less than 8 on a dual-processor machine).

The other main procs value is b (blocked processes). If a value regularly appears in this column, the system is waiting on disk access or network access.

In the swap area, si is the number of KB of memory swapped in and so is the number of KB of memory swapped out.

Other interesting values include the "system" numbers for in and cs. The in value is the number of interrupts per second a system is getting. A system doing a lot of network or disk I/O will have high values here, as interrupts are generated every time something is read from or written to the disk or network.

The cs value is the number of context switches per second. A context switch occurs when the kernel has to remove the information stored in the processor’s registers for the current process and replace it with the context information for a process that has just preempted the current process. It is actually much more complicated than that, but that is the basic idea. Excessive context switches are bad, as it takes a fairly large number of cycles to perform a context switch. Consequently, if you are doing lots of context switches, you are spending all your time changing jobs and not actually doing any work.

vmstat -s provides a list of statistics including "IO-wait cpu ticks", "pages paged in", and "pages paged out".

Chapter 1 Evidence-Based Performance Analysis and Tuning has an example of using vmstat (Section 1.2.4 Analyzing the Evidence).

    ps

The ps command displays a snapshot of the current processes. Consider using these options:

ps -eo pid,%cpu,vsz,args,wchan

This shows every process, its pid, % of CPU, memory size, name, and the syscall it is currently executing.

A basic form of the command is:

ps aux

To view a particular process over time:

watch --interval=interval "ps auxw | grep pid"

To view all processes in tree form:

pstree

Parameter   Shows for every process
args        Command name
%cpu        Percentage of CPU usage
comm        Name of the executable
cmajflt     Number of major page faults for the process and all its children
cminflt     Number of minor page faults for the process and all its children
cmdline     Command that is running
majflt      Number of major page faults
minflt      Number of minor page faults
pid         Process identifier
rss         Resident set size in KB
vsz         Memory size (also written vsize)
wchan       The syscall currently executing

Table 2-1. Selected Parameters for ps

    top

The top command displays the processes that are currently using the most processor time. In effect, this is ps run every 5 seconds. Use the s (secure) and d x (duration seconds) options to monitor operations:

top sd 10

    You can also sort by the titles of the columns; for example, by the SIZE of the applications.

    Note that top may execute often enough to influence the load average.

    vtad

vtad is a rule-based performance monitoring system for Linux servers. It monitors the system’s performance and makes recommendations for tuning based on rulesets. Rulesets are suggestions for performance configuration. For example:

    • The configured maximum shared memory should be 25% of the physical memory.

    • The configured maximum number of files should be one-quarter the number of inodes.

    • The number of available local ports should be 10,000 (or up to 28,000 on servers that havemany proxy server connections).

    See: http://sourceforge.net/projects/vtad/

    dklimits

A simple utility to check the actual number of file descriptors available, ephemeral ports available, and poll()-able sockets. Be warned that it can take a while to run if there are a large number of file descriptors available, as it will try to open that many files, and then unlink them.

    This is part of the dkftpbench package (http://www.kegel.com/dkftpbench/).

    PerformanceAnalysisScript

PerformanceAnalysisScript.sh is a low-impact script that is provided in Appendix A PerformanceAnalysisScript and online at http://people.redhat.com/mbehm/. You can run this script by hand before, during, and after the performance event, or you can modify it to run periodically from cron. The script provides enough information to rule out obvious problems with swapping, process size, process count, Ethernet bandwidth availability, network errors, disk I/O bandwidth, disk errors, and interrupt counts. Always obtain at least two runs of any data-gathering scripts so that you can compare a “normal, loaded system” with the system when it is not performing as expected.

    2.3. Kernel Monitoring Utilities

    strace

strace collects information about kernel system calls and signals of the command following it on the command line; this can produce a lot of output and more detail than desired.


The lower-impact "-c" option counts the number of times that each kind of system call is invoked, the total time spent for each system call, and the average time per system call.

    The default is for strace to print out the information.

To generate a usable amount of data from the process with pid 27692 and send that to a file called strace_output.txt, enter:

strace -c -p 27692 -o strace_output.txt

    readprofile

readprofile reads and displays the time-based sample data collected by the profiler built into the kernel. Profiling is enabled with the kernel boot option "profile=2" (the "2" indicates that each bin in the kernel histogram is 2^2, or 4, bytes in size). readprofile provides information only on the kernel; it will not provide information on modules or on user-space code.
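A minimal sketch of a profiling session follows; the System.map path is an assumption, and many builds locate it automatically:

readprofile -r                                     # reset the collected samples
# ... run the workload you want to profile ...
readprofile -m /boot/System.map | sort -nr | head  # kernel functions with the most ticks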

    ipcs

ipcs lists the interprocess communication mechanisms, including shared-memory regions, message queues, and semaphores.
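For example, each mechanism can be listed separately, and -l shows the configured limits:

ipcs -m     # shared-memory segments
ipcs -q     # message queues
ipcs -s     # semaphore arrays
ipcs -l     # limits for all three mechanisms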

    2.4. Disk Benchmarking and Monitoring Utilities

    dbench

dbench is a good disk-I/O benchmarking utility. It is designed to simulate the disk I/O load of a system when running the NetBench benchmark suite; it seems to do an excellent job.

    Dbench is available at the Samba ftp site and mirrors (ftp://ftp.samba.org/pub/tridge/dbench/).
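A run usually needs only the number of simulated clients; as a sketch:

# Simulate 64 NetBench-style clients in the current directory
dbench 64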

    IOzone

A file-system benchmark tool that generates and measures a variety of file operations. IOzone is useful for performing a broad file-system analysis. The benchmark tests file I/O performance for the following operations: read, write, re-read, re-write, read backwards, read strided, fread, fwrite, random read, pread, mmap, aio_read, and aio_write.

    See the IOzone site (http://www.iozone.org/).
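As a sketch, an automatic run against a file on the file system being tested might look like this (the path and size cap are placeholders):

# -a  run the automatic test matrix
# -g  limit the maximum file size (here 2 GB)
# -f  temporary file to use for the tests
iozone -a -g 2g -f /mnt/testfs/iozone.tmp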

    tiobench

A multithreaded disk I/O benchmarking utility that does a good job at stressing the disks. tiobench comes with some useful scripts for generating reports and graphs.

    See the tiobench site (http://sourceforge.net/projects/tiobench).

    dt (data test)

dt is a very good disk/SCSI test program; it performs disk I/O, process creation, async I/O, etc. It is not really a benchmark program; it is more of a data-integrity testing program. However, it can also be used for benchmarking, as one of the intended roles of this program is to help people tune their drivers once they have verified that the driver is reliable.

    dt is available at: http://www.bit-net.com/~rmiller/dt.html

    iostat

iostat -d shows the disk performance in transactions per second, as well as statistics on block reads and writes.

The example that follows gives the results from the command that monitors the disk at 2-second intervals for three samplings:


    $ iostat -d 2 3

    Linux 2.4.20-18.9smp (perftest.redhat.com) 09/28/2004

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dev3-0            1.07         2.01        15.95    5904202   46798960

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dev3-0            0.00         0.00         0.00          0          0

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dev3-0            2.00         0.00        64.00          0        128

The initial values are the total number of blocks read and written since the last boot. The subsequent values are the reads and writes since that report; these very low values indicate a machine that is largely idle.

To get extended statistics, use the -x [device] argument:

iostat -x

    For further information, see Section 4.4.4 Disk I/O Elevators.

    Note

When disk device drivers actually controlled the movement of the read/write heads on the disk drive, it was easy for utilities, such as iostat, to calculate metrics for a device. For example, the access times for a disk drive could be calculated based on the amount of time it took to service a request. Modern disk controllers are much more advanced and perform many optimizations, so it becomes much more difficult to obtain granular performance metrics.

    vmstat

vmstat, which is part of the procps package, provides useful information when diagnosing disk performance problems.

    Here is a sample of vmstat output:

       procs                     memory    swap          io     system         cpu

     r  b  w   swpd   free   buff  cache  si  so   bi   bo   in   cs  us  sy  id
     1  0  0      0   9200  31856  34612   0   1  992  861  140  122   2   1   1

    The bi (blocks in) and bo (blocks out) values are the areas to watch for disk I/O.

    Other fields are described in Section 2.2 System-Monitoring Utilities.

    2.5. Memory Monitoring Utilities

    These system-load monitoring utilities have some overlap in functionality with the tools described elsewhere in this chapter. See, for example, the description of vmstat in Section 2.1 Benchmarking Utilities and top in Section 2.2 System-Monitoring Utilities.

    free

    /usr/bin/free lists the amount of memory and swap space in the system, as well as the buffers and cache used by the kernel.

                 total       used       free     shared    buffers     cached
    Mem:        512688     501828      10860          0      91552     152664
    -/+ buffers/cache:     257612     255076
    Swap:      1638620     250692    1387928


    If more than half the swap space is being used, add more swap space.

    pmap

    pmap takes pid numbers as input and produces a listing of the memory map for the process. This memory map includes information about whether each region is readable, writable, or executable. Each entry lists the starting address and the size of the region.

    pmap was introduced with RHEL 3.
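    For example, to display the memory map of a process with a hypothetical PID of 1234:

    pmap 1234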

    sar (System Activity Report)

    sar enables you to get real-time information about system resource usage. It displays activity records that are automatically stored in the system-activity data file (/var/log/sa/saDD, where DD is the date).

    To gather data about memory usage, use the following options:

    -n option

    Displays network statistics. option can be DEV (for basic statistics), EDEV (for errors), SOCK (for socket statistics), or FULL (for all statistics).

    -q

    Queue length and load averages.

    For example:

    sar -q 5 2
    Linux 2.4.9-e.16 (example.org)      04/04/2005

    03:19:34 PM   runq-sz   plist-sz   ldavg-1   ldavg-5
    03:19:39 PM         2         97      0.13      0.40
    03:19:44 PM         2         98      0.15      0.40

    -r

    Memory-usage statistics.

    -R

    Rate of change of memory usage.

    -s and -e

    Start and end times.

    For example:

    sar -s 12:00:00 -e 12:15:00
    12:00:00 PM   CPU   %user   %nice   %system   %idle
    12:10:00 PM   all    2.29    0.13      0.40   97.18
    12:20:00 PM   all    1.25    0.17      0.07   98.51

    -W

    Swap-page statistics.

    For example, to display the rate of change of memory usage at 5-second intervals for three reports:

    sar -R 5 3
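    Similarly, memory-usage and swap-page statistics can be sampled at the same interval with:

    sar -r 5 3
    sar -W 5 3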


    mprof

    mprof (Memory Profiler and Leak Detector) (http://www.software-facilities.com/devel-software/mprof.php) is another utility you can use to obtain memory profiling information.

    2.6. Network Monitoring Utilities

    These network monitoring utilities have some overlap in functionality with the tools described in Section 2.1 Benchmarking Utilities.

    netstat

    If you are primarily concerned with network servers, the netstat command can often be very useful. It can show the status of all incoming and outgoing sockets, which can give very handy information about the status of a network server.

    One of the more useful options is:

    netstat -pa

    The “-p” option tells it to try to determine which program has the socket open, which is often very useful information. For example, say someone NMAPs your system and you want to know what is using port 666. Running netstat -pa shows you its status and what is running on that TCP port.

    Another useful option is:

    netstat -i

    The “-i” option displays network statistics about sent and received packets. Large numbers of errors or dropped packets can indicate congestion or hardware problems.

    A complex, but useful, invocation is:

    netstat -a -n|grep -E "^(tcp)"| cut -c 68-|sort|uniq -c|sort -n

    The output shows a sorted list of how many sockets are in each connection state. For example:

     9 LISTEN

    21 ESTABLISHED

    ttcp

    TTCP is a basic network speed tester. It really only pumps data from one side of a TCP/IP network socket to another as fast as possible. The numbers this gives are really just a base guideline as to how well your TCP stack can fill the particular pipe it is talking through, but it is more accurate than trying to estimate with ftp or other protocols.

    On modern machines, the bottleneck usually is not the CPU, so those measurements become less useful. However, this is a useful means of testing reliability: TTCP generates a lot of TCP packet allocations/deallocations and network card interrupts, so it is useful in determining if your Ethernet card driver is reliable or if it might break under heavy loads.

    See: http://www.pcausa.com/Utilities/pcattcp.htm
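    A sketch of a typical two-machine test follows (receiverhost is a hypothetical host name; the receiving side must be started first):

    ttcp -r -s                  (on the receiving machine)
    ttcp -t -s receiverhost     (on the transmitting machine)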

    netperf

    NetPerf is a benchmark that you can use to measure the performance of many different types of networking. It provides tests for both unidirectional throughput and end-to-end latency. The environments currently measurable by NetPerf include: TCP and UDP via BSD Sockets, DLPI, Unix Domain Sockets, Fore ATM API, and HiPPI.

    For more information, see http://www.netperf.org/netperf/NetperfPage.html. To download, go to ftp://ftp.sgi.com/sgi/src/netperf/
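    For example, assuming the netserver daemon is already running on a remote machine (remotehost is a hypothetical host name), hedged invocations for bulk throughput and request/response latency are:

    netperf -H remotehost -t TCP_STREAM
    netperf -H remotehost -t TCP_RR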


    NetPIPE

    NetPIPE is a tool that visually represents the network performance under a variety of conditions. It combines the best qualities of TTCP and NetPerf.

    pchar

    pchar is a tool to characterize the bandwidth, latency, and loss of links along an end-to-end path through the Internet.

    For more information, see http://www.kitchenlab.org/www/bmah/Software/pchar/.

    2.7. Tools for Improving Application Performance

    Note

    If your application uses NPTL, also see Section 2.7.2 Tools for NPTL.

    hoard (libhoard)

    Hoard is a fast, scalable, and memory-efficient allocator for multiprocessors. It is a drop-in replacement for the C and C++ memory routines.

    Hoard solves the heap contention problem caused when multiple threads call dynamic memory-allocation functions such as malloc() and free() (or new and delete). Hoard can dramatically improve the performance of multithreaded programs running on multiprocessors.

    For more information, see: http://www.cs.umass.edu/~emery/hoard/
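    Because Hoard is a drop-in replacement, an existing binary can often be tested without recompiling by preloading the library; the library path and ./myapp below are assumptions for illustration:

    LD_PRELOAD=/usr/local/lib/libhoard.so ./myapp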

    oprofile

    OProfile collects data on all the running executables and the kernel. It is useful to identify hot spots where the processor spends significant amounts of time. OProfile can also use the performance monitoring hardware on processors to identify which processor resources are limiting performance; for example, cache and branch prediction hardware.

    See Chapter 7 Application Tuning with OProfile.
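    As a brief, hedged sketch of a simple profiling session (Chapter 7 covers the options in detail):

    opcontrol --no-vmlinux
    opcontrol --start
    (run the workload of interest)
    opcontrol --shutdown
    opreport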

    /usr/bin/gprof

    User code can be compiled and linked with "-pg" to enable profiling. This instruments the code to record function calls so that gprof can build a call graph that uses time-based sampling to indicate where the program spends time. The gprof support in RHEL is not thread safe, does not dump data until the program exits, and does not record information about shared libraries.
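    For example, assuming a hypothetical program myprog.c:

    gcc -pg -o myprog myprog.c
    ./myprog
    gprof myprog gmon.out | less

    Running the instrumented program writes gmon.out in the current directory; gprof then produces the flat profile and call graph.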

    /usr/bin/gcov

    gcov reads the information generated by programs compiled with GCC’s "-ftest-coverage" and "-fprofile-arcs" options. When the compiled program is run, data is collected on which basic blocks in the program are executed and the number of times they are executed. gcov maps that information back to the source code and indicates the number of times that specific lines in the source code are executed.
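    For example, again using a hypothetical myprog.c:

    gcc -ftest-coverage -fprofile-arcs -o myprog myprog.c
    ./myprog
    gcov myprog.c

    The last step produces an annotated copy of the source showing the execution count for each line.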

    /usr/bin/sprof

    The loader included in glibc allows profiling of a shared library. For example, to profile the shared libc library for the du command and write the data out to a file xyz, type:


    LD_PROFILE=libc.so.6 du

    To analyze the data:

    sprof libc.so.6 libc.so.6.profile|more

    /usr/bin/ltrace

    ltrace intercepts calls to shared library routines and prints out a trace of the shared library routines executed. This instrumentation can significantly slow the execution of the program.
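    For example, a hedged invocation that summarizes the library calls made by a hypothetical ./myprog, instead of printing every call as it happens:

    ltrace -c ./myprog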

    2.7.1. Pointer-Use Errors

    Many pointer-use errors relate to heap allocation. Writing past the end of a heap object, or accessing a pointer after a free, can sometimes be detected with nothing more than a library that replaces the standard library’s heap functions (malloc, free, etc.). This section examines programs that can detect such errors.

    ElectricFence

    The ElectricFence C memory debugging library (http://linux.maruhn.com/sec/electricfence.html) is a utility you can use to obtain more detailed statistics regarding memory usage by individual programs. It can also manage heap objects that abut inaccessible virtual memory pages. A buffer overrun there causes an instant segmentation fault. Some libraries provide a protected padding area around buffers. This padding is filled with code that can be periodically checked for changes, so program errors can be detected at a coarser granularity.

    You link the ElectricFence library in at compile time and it warns you of possible problems, such as freeing memory that does not exist.
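    For example, to link a hypothetical myprog.c against ElectricFence and run it under gdb so that an overrun stops at the faulting instruction:

    gcc -g -o myprog myprog.c -lefence
    gdb ./myprog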

    ElectricFence has a number of limitations:

    • It uses mmap() to allocate memory. mmap() can allocate memory only in page units, which can quickly exhaust the available address space on 32-bit platforms.

    • It can perform access checks only in one direction—either above or below the allocated buffer. You may want to run the program twice: once with EF_PROTECT_BELOW off and once with it on.

    • If the address is only slightly wrong and the block is aligned at the top of the page, some accesses beyond the top of the object might go unnoticed if the object size and alignment require a few fill bytes.

    • If the address is wrong by more than one page size, the access might be beyond the guard pages with PROT_NONE access created by the library; consequently, it might succeed.

    Bounded-pointers GCC Extension

    The bounded-pointers GCC extension (http://gcc.gnu.org/projects/bp/main.html) addresses pointer errors by replacing simple pointers with a three-word struct that also contains the legal bounds for that pointer. This changes the system ABI, making it necessary to recompile the entire application. The bounds are computed upon assignment from the address-of operator, and constructed for system calls within an instrumented version of the standard library. Each use of the pointer is quickly checked against its own bounds, and the application aborts upon a violation. Because there is no database of live objects, an instrumented program can offer no extra information to help debug the problem.

    gcc-checker Extension

    The gcc-checker extension (http://www-ala.doc.ic.ac.uk/~phjk/BoundsChecking.html) addresses pointer errors by mapping all pointer-manipulation operations and all variable lifetime scopes to calls into a runtime library. In this scheme, the instrumentation is heavy-weight, but the pointer representation in memory remains ABI-compatible. It may be possible to detect the moment a pointer becomes invalid (say, through a bad assignment or increment), before it is ever used to access memory.

    StackGuard GCC Extension

    The StackGuard GCC extension (http://immunix.org/stackguard.html) addresses stack-smashing attacks via buffer overruns. It does this by instrumenting a function to rearrange its stack frame, so that arrays are placed away from vulnerable values such as return addresses. Further guard padding is added around arrays and is checked before a function returns. This is light-weight and reasonably effective, but provides no general protection for pointer errors or debugging assistance.

    Purify

    The Purify package (http://www.rational.com/products/purify_unix/) is a well-known proprietary package for detecting memory errors. Purify works by batch instrumentation of object files, reverse-engineering patterns in object code that represent compiled pointer operations, and replacing them with a mixture of inline code and calls into a runtime library.

    Mudflap

    Mudflap is a pointer-use-checking technology based on compile-time instrumentation. It transparently adds protective code to a variety of potentially unsafe C/C++ constructs that detect actual erroneous uses at run time. It can detect NULL pointer dereferencing, running off the ends of buffers and strings, and leaking memory. Mudflap has heuristics that allow some degree of checking even if only a subset of a program’s object modules are instrumented.

    Mudflap is described in detail in Chapter 6 Detecting and Preventing Application Memory Problems.
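    As a brief sketch, assuming a GCC build that includes Mudflap support and a hypothetical myprog.c, instrumentation is enabled at compile and link time and violations are reported when the program runs:

    gcc -fmudflap -o myprog myprog.c -lmudflap
    ./myprog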

    valgrind

    Valgrind executes your program in a virtual machine, keeping track of the memory blocks that have been allocated, initialized, and freed. It simulates every instruction a program executes, which reveals errors not only in an application, but also in all supporting dynamically-linked (.so-format) libraries, including the GNU C library, the X client libraries, Qt (if you work with KDE), and so on.

    Valgrind is described in detail in Chapter 6 Detecting and Preventing Application Memory Problems.
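    A hedged example that runs a hypothetical ./myprog under the memcheck tool and reports memory leaks at exit:

    valgrind --tool=memcheck --leak-check=yes ./myprog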

    2.7.2. Tools for NPTL

    NPTL Test and Trace Project

    The NPTL Test and Trace Project aims to improve the NPTL library. Although the group is just underway, you may want to monitor their progress at http://nptl.bullopensource.org/home.php.

    NPTL Trace Tool

    NPTL Trace Tool provides a mechanism to trace the NPTL library as unobtrusively as possible for the application dynamics. The post-mortem analysis allows measurements (of contention and other issues) to give you an understanding of hangs or bugs.

    See http://sourceforge.net/projects/nptltracetool/.


    Tip

    See also Section 3.1.5 Kernel Threads.


    Chapter 3. Understanding the Virtual Memory Management System

    One of the most important aspects of an operating system is the Virtual Memory Management system. Virtual Memory (VM) allows an operating system to perform many of its advanced functions, such as process isolation, file caching, and swapping. As such, it is imperative that you understand the functions and tunable parameters of an operating system’s virtual memory manager so that optimal performance for a given workload can be achieved.

    This chapter provides a general overview of how a VM works, specifically the VM implemented in Red Hat Enterprise Linux 3. The VM may change between releases; however, with a well-grounded understanding of the general mechanics of a VM, it is fairly easy to convert your knowledge of VM tuning to another VM as the same general principles will apply. The documentation for a given kernel (including its specific tunable parameters) can be found in the corresponding kernel source tree in the file /usr/src/linux_kernel/Documentation/sysctl/vm.txt.

    3.1. Virtual Memory Terminology

    To properly understand how a Virtual Memory manager does its job, it helps to understand what components comprise a VM. While the low level details of a VM are overwhelming for most, a high level view is nonetheless helpful in understanding how a VM system works, and how it can be optimized for various workloads. A high level overview of the components that make up a Virtual Memory manager is presented in Figure 3-1.


    3.1.1. What Comprises a VM

    Figure 3-1. High Level Overview Of VM Subsystem

    While a VM is actually far more complicated than illustrated in Figure 3-1, the high-level function of the system is accurate. The following sections describe each of the listed components in the VM.

    3.1.2. Memory Management Unit (MMU)

    The Memory Management Unit (MMU) is the hardware base that makes a Virtual Memory system possible. The MMU allows software to reference physical memory by aliased addresses, quite often more than one. It accomplishes this through the use of pages and page tables. The MMU uses a section of memory to translate virtual addresses into physical addresses via a series of table lookups. Various processor architectures perform this function in slightly different ways, but in general, Figure 3-2 illustrates how a translation is performed from a virtual address to a physical address.


    Figure 3-2. Illustration of a Virtual-to-Physical Memory Translation

    Each table lookup provides a pointer to the base of the next table, as well as a set of extra bits that provide auxiliary data regarding that page or set of pages. This information typically includes the current page status, access privileges, and size. A separate portion of the virtual address being accessed provides an index into each table in the lookup process. The final table provides a pointer to the start of the physical page corresponding to the virtual address in RAM, while the last field in the virtual address selects the actual word in the page being accessed. Any one of the table lookups during this translation may direct the lookup operation to terminate and drive the operating system to perform another action. Some of these actions are somewhat observable at a system level, and have common names or references.

    Segmentation Violation

    A user space process requests a virtual address, and during the translation the kernel is interrupted and informed that the requested translation has resulted in a page that it has not allocated, or that the process does not have permission to access. The kernel responds by signaling the process that it has attempted to access an invalid memory region, after which it is terminated.

    Swapped out

    During a translation of an address from a user space process, the kernel was interrupted and informed that the page table entry lists the page as accessible, but not present in RAM. The kernel interprets this to mean that the requested address is in a page that has been swapped to disk. The user process requesting the address is put to sleep and an I/O operation is started to retrieve the page.

    3.1.3. Zoned Buddy Allocator

    The Zoned Buddy Allocator is responsible for the management of page allocations to the entire system. This code manages lists of physically contiguous pages and maps them into the MMU page tables to provide other kernel subsystems with valid physical address ranges when the kernel requests them (Physical to Virtual Address mapping is handled by a higher layer of the VM and is collapsed into the kernel subsystems block of Figure 3-1).


    The name Buddy Allocator is derived from the algorithm this subsystem uses to maintain its free page lists. All physical pages in RAM are cataloged by the buddy allocator and grouped into lists. Each list represents clusters of 2^n pages, where n is incremented in each list. There is a list of single pages, a list of 2-page clusters, a list of 4-page clusters, and so on. When a request comes in for an amount of memory, that value is rounded up to the nearest power of 2. An entry is then removed from the appropriate list, registered in the page tables of the MMU, and a corresponding physical address is returned to the caller, which is then mapped into a virtual address for kernel use. If no entries exist on the requested list, an entry from the next list up is broken into two separate clusters, one of which is returned to the caller while the other is added to the next list down. When an allocation is returned to the buddy allocator, the reverse process happens. The allocation is returned to the requisite list, and the list is then examined to determine if a larger cluster can be made from the existing entries on the list that was just updated. This algorithm is advantageous in that it automatically returns pages to the highest order free list possible. That is to say, as allocations are returned to the free pool, they automatically form larger clusters, so that when a need arises for a large amount of physically contiguous memory (for example, for a DMA operation), it is more likely that the request can be satisfied. Note that the buddy allocator allocates memory in page multiples only. Other subsystems are responsible for finer grained control over allocation size. For more information regarding the finer details of a buddy allocator, refer to Bovet and Cesati’s Understanding the Linux Kernel.

    Note that the Buddy Allocator also manages memory zones, which define pools of memory that have different purposes. Currently there are three memory pools for which the buddy allocator manages access:

    DMA

    This zone consists of the first 16 MB of RAM, from which legacy devices allocate to perform direct memory operations.

    NORMAL

    This zone encompasses memory addresses from 16 MB to 1 GB [1] and is used by the kernel for internal data structures, as well as other system and user space allocations.

    HIGHMEM

    This zone includes all memory above 1 GB and is used exclusively for system allocations (file system buffers, user space allocations, etc).

    3.1.4. Slab Allocator

    The Slab Allocator provides a more usable front end to the Buddy Allocator for those sections of the kernel that require memory in sizes that are more flexible than the standard 4 KB page. The Slab Allocator allows other kernel components to create caches of memory objects of a given size. The Slab Allocator is responsible for placing as many of the caches’ objects on a page as possible and monitoring which objects are free and which are allocated. When allocations are requested and no more are available, the Slab Allocator requests more pages from the Buddy Allocator to satisfy the request. This allows kernel components to use memory in a much simpler way: components that make use of many small portions of memory are not each required to implement their own memory management code, and pages are not wasted. [2]
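    The slab caches that the kernel currently maintains, along with their object sizes and usage counts, can be inspected through the proc file system; for example:

    cat /proc/slabinfo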

    [1] The hugemem kernel extends this zone to 3.9 GB of space.

    [2] The Slab Allocator may allocate only from the DMA and NORMAL zones.


    3.1.5. Kernel Threads

    The last components in the VM subsystem are the active tasks: kscand, kswapd, kupdated, and bdflush. These tasks are responsible for the recovery and management of in-use memory. All pages of memory have an associated state (for more information on the memory state machine, see Section 3.2 The Life of A Page).

    Freshly allocated memory initially enters the active state. As the page sits in RAM, it is periodically visited by the kscand task, which marks the page as being inactive. If the page is subsequently accessed again by a task other than kscand, it is returned to the active state. If a page is modified during its allocation it is additionally marked as being dirty. The kswapd task scans pages periodically in much the same way, except that when it finds a page that is inactive, kswapd considers the page to be a candidate to be returned to the free pool of pages for later allocation. If kswapd selects a page for freeing, it examines its dirty bit. In the event the dirty bit is set, it locks the page, and initiates a disk I/O operation to move the page to swap space on the hard drive. When the I/O operation is complete, kswapd modifies the page table entry to indicate that the page has been migrated to disk, unlocks the page, and places it back on the free list, making it available for further allocations. In general, the active tasks in the kernel relating to VM usage are responsible for attempting to move pages out of RAM. Periodically they examine RAM, trying to identify and free inactive memory so that it can be put to other uses in the system.

    3.1.5.1. NPTL Threading Library

    The NPTL Threading Library was introduced in Red Hat Enterprise Linux 3. If you have seen a reduction in performance after an upgrade to RHEL 3, it may be that your application does not use NPTL effectively.

    3.1.5.1.1. Explaining LD_ASSUME_KERNEL

    LD_ASSUME_KERNEL sets the kernel ABI that the application expects.

    LD_ASSUME_KERNEL is handled by the dynamic linker, but the behavior of a specific value of LD_ASSUME_KERNEL is not hardcoded in the dynamic linker.

    Every DSO (Dynamic Shared Object, also known as a shared library) can tell the dynamic linker in glibc which minimum operating system ABI version is needed. Note that dynamic linkers other than glibc’s do not have this feature. The information about the minimum operating system ABI version is encoded in an ELF note section usually named .note.ABI-tag. This note section must be referenced by a PT_NOTE entry in the DSO’s program header.

    To examine the content of a DSO’s note section, use the readelf program from elfutils (the version from binutils is inadequate). On a Red Hat system, the binary is called eu-readelf. Using it on a Fedora Core 2 system shows the following:

    $ eu-readelf -n /lib/tls/libc-2.3.3.so

    Note segment of 32 bytes at offset 0x174:
      Owner          Data size  Type
      GNU                   16  VERSION

    OS: Linux, ABI: 2.4.20

    This means the /lib/tls/libc-2.3.3.so DSO requires at least OS ABI version 2.4.20.

    The specific ABI version requirements on a RHL 9, Fedora Core 1, or Fedora Core 2 system are as follows (note that this implies IA-32 is the architecture):

    • DSOs in /lib/tls need ABI version 2.4.20.

    • DSOs in /lib/i686 need ABI version 2.4.1.

    • DSOs in /lib need ABI version 2.2.5.


    This means:

    • LD_ASSUME_KERNEL supports only versions 2.2.5 and later.

    • Versions from 2.2.5 to 2.4.0 will use the DSOs in /lib.

    • Versions from 2.4.1 to 2.4.19 will use the DSOs in /lib/i686.

    • Versions 2.4.20 and newer will use the DSOs in /lib/tls.

    For the Red Hat releases, this layout was chosen to provide the maximum amount of backward compatibility for applications that violate the threading specifications or use implementation artifacts (correctly written applications will have no problems).
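    For example, to force a hypothetical application ./myapp to use the DSOs in /lib/i686 (the older LinuxThreads implementation) rather than the NPTL versions in /lib/tls, you could run it as:

    LD_ASSUME_KERNEL=2.4.19 ./myapp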

    You must ensure that programmers use explicit thread stack sizes when creating new threads. For example, to cause all thread stacks to be 64k by default, use:

    ulimit -s 64

    This also limits the stack of the main thread (the thread created by the kernel) to this size, which is usually acceptable.

    3.1.6. Components that Use the VM

    It is worth mentioning the remaining components that sit on top of the VM subsystem. These components actively use the VM to acquire memory and present it to users, providing the overall “feel” of the system:

    Network Stack

    The Network Stack is responsible for the management of network buffers being received and sent out of the various network interfaces in a system.

    Standard C Library

    Via various system calls, the standard C library manages pages of virtual memory and presents user applications with an API allowing fine-grained memory control.

    Virtual File System

    The Virtual File System buffers data from disks for more rapid file access, and holds pages containing file data that has been memory-mapped by an application.

    3.2. The Life of A Page

    All of the memory managed by the VM is labeled by a state. These states let the VM know what to do with a given page under various circumstances. Depending on the current needs of the system, the VM may transfer pages from one state to the next, according to the state machine diagrammed in Figure 3-3:


    Figure 3-3. Basic Configuration

    Using these states, the VM can determine what is being done with a page by the system at a given time and what actions the VM may take on the page. The states that have particular meanings are as follows:

    FREE

    All pages available for allocation begin in this state. This indicates to the VM that the page is not being used for any purpose and is available for allocation.

    ACTIVE

    Pages that have been allocated from the Buddy Allocator enter this state. It indicates to the VM that the page has been allocated and is actively in use by the kernel or a user process.

    INACTIVE DIRTY

    This state indicates that the page has fallen into disuse by the entity that allocated it, and as such is a candidate for removal from main memory. The kscand task periodically sweeps through all the pages in memory, taking note of the amount of time the page has been in memory since it was last accessed. If kscand finds that a page has been accessed since it last visited the page, it increments the page’s age counter; otherwise, it decrements that counter. If kscand happens on a page that has its age counter at zero, then the page is moved to the inactive dirty state. Pages in the inactive dirty state are kept in a list of pages to be laundered.


    INACTIVE LAUNDERED

    This is an interim state entered by pages that have been selected for removal from main memory while their contents are being moved to disk. Only pages that were in the inactive dirty state enter this state. When the disk I/O operation is complete, the page is moved to the inactive clean state, where it may be deallocated or overwritten for another purpose. If, during the disk operation, the page is accessed, the page is moved back into the active state.

    INACTIVE CLEAN

    Pages in this state have been laundered. This means that the contents of the page are in sync with the backing data on disk. As such, they may be deallocated by the VM or overwritten for other purposes.

    3.3. Tuning the VM

    Now that the picture of the VM mechanism is sufficiently illustrated, how is it adjusted to fit certain workloads? There are two methods for changing tunable parameters in the Linux VM. The first is the sysctl interface. The sysctl interface is a programming-oriented interface that enables software programs to directly modify various tunable parameters. The sysctl interface is exported to system administrators via the sysctl utility, which enables an administrator to specify a value for any of the tunable VM parameters on the command line, as in the following example:

    sysctl -w vm.max_map_count=65535

    The sysctl utility also supports the use of the /etc/sysctl.conf configuration file, in which all the desirable changes to a VM can be recorded for a system and restored after a restart of the operating system, making this access method suitable for long-term changes to a system VM. The file is straightforward in its layout, using simple key-value pairs with comments for clarity, as in the following example:

    #Adjust the min and max read-ahead for files
    vm.max-readahead=64
    vm.min-readahead=32

    #turn on memory over-commit
    vm.overcommit_memory=2

    #bump up the percentage of memory in use to activate bdflush
    vm.bdflush="40 500 0 0 500 3000 60 20 0"
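    Settings recorded in /etc/sysctl.conf can also be applied immediately, without a reboot, by loading the file with:

    sysctl -p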

    The second method of modifying VM tunable parameters is via the proc file system. This method exports every group of VM tunables as a file, accessible via all the common Linux utilities used for modifying file contents. The VM tunables are available in the directory /proc/sys/vm, and are most commonly read and modified using the Linux cat and echo utilities, as in the following example:

    # cat kswapd
    512 32 8

    echo 511 31 7 > /proc/sys/vm/kswapd

    # cat kswapd
    511 31 7

    The proc file system interface is a convenient method for making adjustments to the VM while attempting to isolate the peak performance of a system. For convenience, the following sections list the VM tunable parameters as the filenames they are exported as in /proc/sys/vm. Pl