Tuning Informix for OLTP Workloads
Sun Microsystems Computer Corporation
2550 Garcia Avenue
Mountain View, CA 94043
U.S.A.
Technical White Paper
Denis Sheahan
Database Engineering, SMCC
1997 Sun Microsystems, Inc. 2550 Garcia Avenue, Mountain View, California 94043-1100 U.S.A.
All rights reserved. This product and related documentation are protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or related documentation may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any.
Portions of this product may be derived from the UNIX® and Berkeley 4.3 BSD systems, licensed from UNIX System Laboratories, Inc. and the University of California, respectively. Third-party font software in this product is protected by copyright and licensed from Sun’s Font Suppliers.
RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the United States Government is subject to the restrictions set forth in DFARS 252.227-7013 (c)(1)(ii) and FAR 52.227-19.
The product described in this manual may be protected by one or more U.S. patents, foreign patents, or pending applications.
TRADEMARKS
Sun, Sun Microsystems, Sun Microsystems Computer Corporation, the Sun logo, the Sun Microsystems Computer Corporation logo, are trademarks or registered trademarks of Sun Microsystems, Inc. UNIX and OPEN LOOK are registered trademarks of UNIX System Laboratories, Inc., a wholly owned subsidiary of Novell, Inc. All other product names mentioned herein are the trademarks of their respective owners.
All SPARC trademarks, including the SCD Compliant Logo, are trademarks or registered trademarks of SPARC International, Inc. SPARCstation, SPARCserver, SPARCengine, SPARCworks, and SPARCompiler are licensed exclusively to Sun Microsystems, Inc. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.
The OPEN LOOK® and Sun™ Graphical User Interfaces were developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun’s licensees who implement OPEN LOOK GUIs and otherwise comply with Sun’s written license agreements.
X Window System is a trademark and product of the Massachusetts Institute of Technology.
TPC-C Benchmark™ is a trademark of the Transaction Processing Council.
Informix ODS 7 is a registered trademark of Informix Inc.
VERITAS, VxVM, VxVA, VxFS, and the VERITAS logo are registered trademarks of VERITAS Software Corporation.
THIS PUBLICATION IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT.
THIS PUBLICATION COULD INCLUDE TECHNICAL INACCURACIES OR TYPOGRAPHICAL ERRORS. CHANGES ARE PERIODICALLY ADDED TO THE INFORMATION HEREIN; THESE CHANGES WILL BE INCORPORATED IN NEW EDITIONS OF THE PUBLICATION. SUN MICROSYSTEMS, INC. MAY MAKE IMPROVEMENTS AND/OR CHANGES IN THE PRODUCT(S) AND/OR THE PROGRAM(S) DESCRIBED IN THIS PUBLICATION AT ANY TIME.
Contents
1. Informix Overview
  Introduction
  ODS Overview
  Virtual Processors: Dynamic Scalable Architecture
  An Informix Instance
  Private and Shared Memory
2. Database Layout
  Informix Layout basics
  Data Layout
  Rootdbs
  Logical Logs
  Online Physical Log
  Tables
  Spindle Count:
  Fragmentation
  Table Size
  Volume Management
  Availability
  Striping
  Interleave Factor
  Raw vs UFS
  Tuning Existing Layouts
  Indexes
  Building Indexes
  Loading data
3. Online Tuning
  I/O Tuning
  Logging
  Physical Logging
  Logical Logging
  Connecting to the Database
  Configuring CPUVPS
  Configuring Memory
  BUFFERS and LRUs
  LOCKS
4. System Tuning
  Sample /etc/system File
  Disk
  Memory
  CPU
5. Appendix A: Informix Scripts
  File: move_log.sh
6. Appendix B: Application Tuning
  Using sqexplain
  Database Procedures
  Application errors
  Deadlock and locking
  Using PDQ
  optcompind
Informix Overview
Introduction
Currently Informix offers 3 different versions of its engine. Version 7, also known as Online Dynamic Server (ODS), is its main OLTP engine. Its current revision is 7.2.x, with a planned release of 7.3 in early 1998. Version 8, also known as Extended Parallel Server (XPS), is its main Decision Support engine. Its current revision is 8.11, with a planned release of 8.2 in December 1997. Version 9, also known as Informix Universal Server (IUS), is a merge of 7.2 and the Universal Server from Illustra. IUS is Informix’s Object Relational offering, providing extended types and datablades.
This paper explains how to tune ODS for OLTP Workloads on Sun servers running Solaris 2.5.1 and above. Most of the tuning tips are also applicable to IUS in OLTP situations. There are three major sections dealing with data layout, generic Informix tuning issues and system tuning. In Appendix B we also provide a tutorial on Informix application tuning.
ODS Overview
Virtual Processors: Dynamic Scalable Architecture
The concept of virtual processors (VPs) underlies the entire structure of Informix ODS, and is called the Dynamic Scalable Architecture (DSA). ODS runs one Informix thread (rather than one process) for every user session connected to the database. Context for threads is maintained in shared memory, so the same thread can be serviced by different VPs if necessary, although by preference it remains in a single VP to reduce cache-line transfers at the hardware level.
Each VP can run many user threads plus internal threads to perform database I/O, logging I/O, page cleaning, administrative tasks and other work. Certain Informix utilities are served by their own special threads. VPs are divided into several classes depending on the type of work they do. All VPs in the same class share the same code and access the same data and processing queues in memory.
Figure 1 Informix ODS 7 Clients, Virtual Processors and Physical CPUs
The relationship of virtual processors, physical CPUs and clients is illustrated in Figure 1.
An Informix Instance
Every running Informix database is associated with an Informix instance. When a database is brought up, shared memory is allocated and the virtual processor processes are started. The combination of the shared memory and VPs is called an instance.
[Figure 1 labels: CPU 0 to CPU 3, Clients, Virtual Processors, Shared Memory]
Multiple databases can be created within one instance. Whenever a CREATE DATABASE command is executed, system catalogs are created to map the tables, indexes, views and other relational objects of a logically independent database. Informix applications and utilities always initiate a session by specifying which database within an instance they want to connect to.
Private and Shared Memory
Informix virtual processors each have some private memory plus access to global shared memory. The private memory holds the VP’s program text and private data, including private pointers into shared memory. Locks and latches (mutexes) are used to manage concurrent access to shared memory by all VPs.
Shared memory is divided into three major portions: the resident portion, the virtual portion and the message portion. The resident portion contains the buffer pool and several internal tables used only to track other objects in shared memory. The virtual portion contains large I/O buffers, session data, thread data, the dictionary cache, stored procedures cache, the sorting pool and a global pool for structures shared by many components of OnLine, especially messages from client applications. The message portion is only used to exchange messages with client programs executing on the same machine as the database and communicating via shared memory interprocess messages.
For more details of the architecture of ODS refer to the Administrator’s Guide, Volumes 1 and 2.
Database Layout
This chapter describes one of the most important aspects of tuning a database application: laying out a database on disk. A well thought out and tuned layout can do wonders for performance. On the other hand, all the performance tricks you can try on a runtime system won’t do any good if the database is poorly laid out to start with.
The primary tool for obtaining Informix statistics is onstat. Onstat interrogates the engine and dumps out the statistics that the latter has gathered. Online statistics are zeroed when the engine is brought up or with the onstat -z command. When tuning the system, wait until the environment is in a steady state, call onstat -z, let the system run for a short period (say 5 - 10 minutes) and then dump the required stats. This avoids inaccurate values for statistics such as I/O per second. onstat -a will dump all stats.
Informix Layout basics
The basic unit of data in Online is a page. In 7.x this is currently 2k in size, in 8.x it is 4k. Each page in 7.x can hold up to 255 rows of data. Multiple contiguous pages make up a chunk. Chunks are created using the onspaces utility and can be 2GB maximum; 8.x has a 4GB maximum. To be used in create statements chunks must be included in a dbspace. Dbspaces are made up of one or more chunks.
To create a dbspace, specify the first chunk it contains:
onspaces -c -d oli_dbs1 -p /links/DEV/olinei_41 -o 0 -s 500000
-c indicates create
-d is the dbspace name
-p is the physical device
-o is the offset
-s is the size in kilobytes
To add more chunks to the dbspace use the -a option:
onspaces -a oli_dbs1 -p /links/DEV/olinei_42 -o 0 -s 500000
As stated earlier the limit for any chunk is 2GB. Multiple chunks can be taken from the same device by using the offset parameter:
onspaces -a oli_dbs1 -p /links/DEV/olinei_42 -o 0 -s 500000
onspaces -a oli_dbs1 -p /links/DEV/olinei_42 -o 500001 -s 500000
We always recommend that the user use soft links when declaring devices in onspaces. Then if the controller number changes on the system the link can be moved.
The 2GB limit can be a major drawback with Informix, especially on larger 9GB disks, which require a minimum of 5 chunks to utilize the whole disk. The user can quickly run out of the 8 partitions provided by the format utility. Using Veritas can alleviate this problem as the user can declare as many plexes as required.
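The chunk arithmetic behind the 9GB example can be sketched as follows. This is an illustrative Python calculation only, not an Informix utility: the helper name `chunk_sizes_kb` is hypothetical, and we assume that a 9GB disk holds 9 x 1024 x 1024 KB and that sizes are expressed in kilobytes, as for the onspaces -s option.

```python
# Sketch: how many chunks does a disk need under the 2GB chunk limit?
# Assumptions (not from the paper): binary gigabytes, sizes in KB as for onspaces -s.
MAX_CHUNK_KB = 2 * 1024 * 1024  # 2GB limit for a 7.x chunk


def chunk_sizes_kb(disk_kb, max_chunk_kb=MAX_CHUNK_KB):
    """Return the chunk sizes (KB) needed to cover disk_kb, full chunks first."""
    sizes = []
    remaining = disk_kb
    while remaining > 0:
        size = min(max_chunk_kb, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


nine_gb = chunk_sizes_kb(9 * 1024 * 1024)
print(len(nine_gb), "chunks:", nine_gb)
```

Under these assumptions a 9GB disk yields four full 2GB chunks plus one 1GB chunk, which agrees with the minimum of 5 chunks quoted above.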
Once the dbspaces are created, onstat -d can be used to display their sizes and remaining free space. The output is in 2 sections, first dbspaces then chunks.
Dbspaces
address  number flags fchunk nchunks flags owner    name
7d778150 1      1     1      1       N     informix rootdbs
7e085f50 2      1     2      1       N     informix plogdbs
7e093f50 3      1     3      1       N     informix llogdbs1
7e175978 4      1     4      1       N     informix llogdbs2
7e175a38 5      1     5      1       N     informix llogdbs3
7e175af8 6      1     6      1       N     informix wdi_dbs
7e175bb8 7      1     7      4       N     informix cd_dbs1
......
7e17cd98 114    1     268    1       N     informix si_dbs12
114 active, 2047 maximum
Chunks
address  chk/dbs  offset size   free   bpages flags pathname
7d778210 1   1    500    950000 948181        PO-   /links/DEV/root_chunk
7d7ab588 2   2    500    950000 49947         PO-   /links/DEV/plog_chunk
7d7ab668 3   3    500    950000 49947         PO-   /links/DEV/llog_chunk1
7d7ab748 4   4    500    950000 49947         PO-   /links/DEV/llog_chunk2
7d7ab828 5   5    500    950000 49947         PO-   /links/DEV/llog_chunk3
7d7ab908 6   6    0      500000 490782        PO-   /links/DEV/wdi_1
7d7ab9e8 7   7    0      250200 47            PO-   /links/DEV/custd_1
7d7abac8 8   7    0      250200 97            PO-   /links/DEV/custd_2
7d7abba8 9   7    0      250200 97            PO-   /links/DEV/custd_3
7d7abc88 10  7    0      250200 97            PO-   /links/DEV/custd_4
.....
7e175898 268 114  0      250000 170497        PO-   /links/DEV/stocki_12
268 active, 2047 maximum
Notice how cd_dbs1, which is fchunk number 7, indicates 4 in its nchunks column. There are then 4 chunks, numbers 7 - 10, which have 7 as their dbs number.
The size and free fields, in the chunk section, are specified in 2k pages. When the chunk is initialized these two columns will be the same. As table space is allocated from the chunk the amount free is reduced. Online reserves a number of pages from the chunk for what is termed its bitmap pages. These pages indicate the free space in a chunk. In addition the first chunk in a dbspace has pages allocated for what is termed the tablespace. The result is that the first chunk in a dbspace will have less usable space.
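As a rough illustration of the page arithmetic, the sketch below converts one line of the Chunks section into kilobytes and a percentage used. It is plain Python written against the sample listing in this paper; the function names are hypothetical, and the field positions assume the bpages column is blank, as it is in the output shown above.

```python
PAGE_KB = 2  # ODS 7.x page size in kilobytes


def parse_chunk_line(line):
    """Parse one chunk row: address chk dbs offset size free flags pathname."""
    fields = line.split()
    return {
        "chunk": int(fields[1]),
        "dbspace": int(fields[2]),
        "size_pages": int(fields[4]),
        "free_pages": int(fields[5]),
        "path": fields[-1],
    }


def chunk_usage(chunk, page_kb=PAGE_KB):
    """Convert page counts to kilobytes and compute the percentage in use."""
    used_pages = chunk["size_pages"] - chunk["free_pages"]
    return {
        "size_kb": chunk["size_pages"] * page_kb,
        "free_kb": chunk["free_pages"] * page_kb,
        "pct_used": 100.0 * used_pages / chunk["size_pages"],
    }


wdi = parse_chunk_line("7d7ab908 6 6 0 500000 490782 PO- /links/DEV/wdi_1")
print(chunk_usage(wdi))
```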
Data Layout
We recommend using raw devices for Online data. Raw devices are accessed using kaio (kernel asynchronous I/O), which is the most efficient path. UFS also requires an extra copy through the kernel address space, consuming CPU cycles. Its aging algorithm is suboptimal for database applications, and the caching of inodes is extra overhead.
Because Veritas is an extra layer in the data access path there is a small penalty with its use. In situations such as striping below, however, its advantages outweigh this small performance degradation.
Online requires space for the following:
• System catalogs. These are held in a special dbspace called the root dbspace, which is created on Online initialization.
• Logical logs. These contain records of transactions in a database that is logged.
• Physical Log. This contains before images of modified pages which are used in rolling transactions back and forward.
• Tables. User tables can be placed by default in the root dbspace or in user created spaces.
• Indexes. Index data can be held with the table data or in their own separate dbspaces.
All data can be placed in the root dbspace but naturally this can lead to poor performance.
Rootdbs
The user must first select a partition / volume to use for Online’s root dbspace. A soft link should then be made to this partition and specified in the onconfig file, e.g.:
ROOTNAME rootdbs # Root dbspace name
ROOTPATH /dev/online_root # Path for device containing root dbspace
ROOTSIZE 20000 # Size of root dbspace (Kbytes)
The root dbspace is what is initialized when oninit -iy is performed. The size must be sufficient to hold the initial physical log and all of the logical logs plus overhead for the system catalogs. The logs are always created in the rootdbs initially and can be moved later. If the logs are moved the rootdbs will never be a hot disk. It can be placed on a disk with other data if the user is short of spindles. Usually all catalog data is cached and the rootdbs is rarely accessed after the first couple of minutes of operation.
Logical Logs
Most commercial installations require some level of logging to ensure the integrity of the data in a database. The logical log in Online contains logical log records. These records are required for a number of functions:
• Fast Recovery: If Online shuts down in an uncontrolled manner it uses the log records to recover all transactions that occurred since the last checkpoint.
• Transaction rollback: If during normal transaction processing a transaction must be rolled back (error, rollback command, etc.) Online uses the log records to reverse the changes made on behalf of the transaction.
• Data Restoration: During a data restore the user combines the backup tapes of the logical log files with the most recent Online dbspace backup tapes.
Online provides 3 logging modes: Buffered, Unbuffered and Ansi. Unbuffered and Ansi are the 2 safest modes. With these modes the logical log records are written to disk whenever any transaction executes a commit. This guarantees that all committed transactions survive a system failure. In Buffered mode the logical log buffer is written to disk only when it is full. On system failure a number of committed transactions may not have been written to disk.
Logging is specified on a per database basis using the ontape utility:
• ontape -A tpcc -s : Set Ansi logging on database tpcc
• ontape -B tpcc -s : Set Buffered logging on database tpcc
• ontape -U tpcc -s : Set Unbuffered logging on database tpcc
• ontape -N tpcc -s : Turn off logging on database tpcc
The number and initial size of the logical logs are specified in the onconfig file with the following parameters:
LOGFILES 3 # Number of logical log files
LOGSIZE 500 # Logical log size (Kbytes)
The user can see the state of the logical logs with the command onstat -l:
Logical Logging
Buffer bufused bufsize numrecs numpages numwrits recs/pages pages/io
L-1    0       16      227     9        3        25.2       3.0
Subsystem numrecs Log Space used
OLDRSAM   227     12664
address number flags   uniqid begin  size used %used
ab08990 1      U---C-L 7      300035 250  125  50.0
ab089ac 2      U-B---- 5      400035 250  250  100.00
ab089c8 3      U-B---- 6      500035 250  250  100.00
The size is specified in 2k pages. If the %used values are all 100%, the system will halt waiting for the logs to be backed up unless the TAPEDEV onconfig parameter is set to /dev/null.
After initialization the logs can be moved with the move_log.sh script given in Appendix A. If the system moves across 2 logical log boundaries it automatically triggers a checkpoint.
The flag fields of interest are C indicating the current logical log, U indicating used, B indicating backed up and L indicating that the log contains the most recent checkpoint record. The uniqid field increases every time a new logical log is started. The current log will have the highest number. Each time a log is completed a message is dumped to the message log:
<<INFORMIX-OnLine Server>>> Logical Log 21 Complete.
10:53:48 Logical Log 21 Complete.
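The flag and uniqid rules described above for the onstat -l log listing can be sketched in Python as follows. This is only an illustration built on the sample rows in this paper: the function name is hypothetical, only the four flag letters explained in the text are decoded, and any other flag position is ignored.

```python
# Meanings of the flag letters described in the text; other positions are ignored.
FLAG_MEANINGS = {
    "U": "used",
    "B": "backed up",
    "C": "current",
    "L": "contains most recent checkpoint",
}


def parse_log_line(line):
    """Parse one row of the logical log section of onstat -l."""
    address, number, flags, uniqid, begin, size, used, pct_used = line.split()
    return {
        "number": int(number),
        "flags": [FLAG_MEANINGS[ch] for ch in flags if ch in FLAG_MEANINGS],
        "uniqid": int(uniqid),
        "size_pages": int(size),
        "used_pages": int(used),
    }


sample = [
    "ab08990 1 U---C-L 7 300035 250 125 50.0",
    "ab089ac 2 U-B---- 5 400035 250 250 100.00",
    "ab089c8 3 U-B---- 6 500035 250 250 100.00",
]
logs = [parse_log_line(line) for line in sample]
# The current log is the one with the highest uniqid.
current = max(logs, key=lambda log: log["uniqid"])
print(current["number"], current["flags"])
```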
Online Physical Log
The physical log contains before-images of database pages. Before the engine modifies a page, the page is written to the physical log. These images are used in fast recovery to bring the database to a consistent state. After a failure Online uses the before-images in the physical log to restore all pages on disk to their state at the last checkpoint. This is the first phase of fast recovery. Next the before-images are combined with the logical log records stored since the checkpoint, restoring the data to consistency up to the last transaction commit. This is the second phase of fast recovery.
Because all modified pages are flushed to disk at a checkpoint, the physical log is emptied at that time. During bringup Online starts recovery from the last checkpoint record in the logical log.
As a safety feature a checkpoint is initiated when the physical log becomes 75% full. It is important to make the physical log big enough to avoid 2 scenarios:
• Continual checkpointing.
• Physical log overflow. This occurs when the log is filling up too fast for a checkpoint to get in and empty it. This will cause Online to shut down.
onstat -l also gives information on the physical log:
Physical Logging
Buffer bufused bufsize numpages numwrits pages/io
P-1 10 16 78633 4909 16.02
phybegin physize phypos phyused %used
200035 900000 875077 195578 21.73
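The %used figure in this output, and the 75% checkpoint trigger mentioned earlier, can be checked with a short calculation. The Python below is a minimal sketch; the function names are hypothetical and the threshold is simply the 75% value quoted in the text.

```python
CHECKPOINT_THRESHOLD_PCT = 75.0  # checkpoint trigger quoted in the text


def physical_log_pct_used(physize, phyused):
    """Percentage of the physical log in use (both values in pages)."""
    return 100.0 * phyused / physize


def checkpoint_due(physize, phyused, threshold=CHECKPOINT_THRESHOLD_PCT):
    """True once the physical log has reached the checkpoint threshold."""
    return physical_log_pct_used(physize, phyused) >= threshold


pct = physical_log_pct_used(physize=900000, phyused=195578)
print(round(pct, 2), checkpoint_due(900000, 195578))
```

For the sample output the result agrees with the 21.73 shown in the %used column, well below the threshold.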
Tables
Usually the majority of pages in a database are dedicated to user data. In Online an extent is the minimum amount of contiguous space that can be allocated for a table. Every permanent table has 2 extent sizes associated with it. The initial extent size is the number of kilobytes allocated to the table when first created. The next extent is the amount in kilobytes allocated when the initial and all other extents are full. The size of these extents is specified in the create table statement:
CREATE TABLE item (
i_id INTEGER,
i_im_id INTEGER,
i_name CHAR(24),
i_price DECIMAL(5,2),
i_data CHAR(50)
)
IN wdi_dbs
EXTENT SIZE 12000
NEXT SIZE 8000
wdi_dbs is the dbspace the table will be created in. After creation you can see from the onstat -d output that 12000/2 = 6000 pages will be taken from the free column of wdi_dbs. As the table fills, the free column will drop in extents of 8000/2 = 4000 pages.
It is good practice to have the minimum number of extents possible for a table. Multiple extents lead to fragmentation of the disk. Initialize one large chunk for the table and then allocate an extent greater than the chunk size. Because extents cannot cross a chunk boundary, Online will take the maximum it can allocate for the table, i.e. the entire chunk.
Online also has an internal mechanism to reduce the number of extents. After every 16 next extents it will double the size of next. In our example the 17th, 33rd and 49th extents will be 8000, 16000 and 32000 pages respectively.
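The extent arithmetic for the item table can be sketched as below. This is an illustrative Python model, not Informix code: the function name is hypothetical, and the exact point at which the next size doubles is inferred from the 17th, 33rd and 49th extent figures quoted above.

```python
PAGE_KB = 2  # ODS 7.x page size in kilobytes


def extent_pages(n, first_kb, next_kb, page_kb=PAGE_KB):
    """Size in pages of the n-th extent (n starts at 1) of a table.

    Assumption: the next size doubles at extents 17, 33, 49, ... as in the
    example in the text.
    """
    if n < 1:
        raise ValueError("extent numbers start at 1")
    if n == 1:
        return first_kb // page_kb
    doublings = (n - 1) // 16
    return (next_kb // page_kb) * 2 ** doublings


# EXTENT SIZE 12000 and NEXT SIZE 8000 from the item table definition.
for n in (1, 2, 17, 33, 49):
    print(n, extent_pages(n, first_kb=12000, next_kb=8000))
```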
Spindle Count:
Once the database size is determined, it would seem that calculating the number of disks required should be a simple task: divide the total database size by the size of each disk. Unfortunately, for OLTP applications, this strategy could yield very poor performance. The number of spindles required is usually much larger than the above calculation would dictate. However, this does not mean that we will be wasting disk space. The additional space is used for growing tables, filesystems etc.
To determine the number of spindles, it is crucial to understand the workload. Disk access is also closely tied to the size of the buffer cache. If your workload is such that rows updated by some transactions are re-used by others, it is advantageous to increase the size of the buffer cache, cutting down disk access. This scheme is best explained by means of an example.
In TPC-C, the stock table is the largest. For a 900 warehouse database, the total space requirement for this table is over 36 GB. If we distributed the data by space alone, we would need 18 2.1 GB drives. But let’s look at the table accesses a little more by understanding the workload. The Stock table is randomly accessed 20 times from the Neworder transaction (10 reads/10 updates) and 200 times from the Stocklevel transaction. However, the Stocklevel transaction reads the data that is accessed by the Neworder transaction and should hopefully be in the buffer cache. Therefore, the number of stock table accesses is 20 per Neworder transaction. For the 900 warehouse database, we hope to achieve 10,500 Neworder transactions/minute or 175/sec. This is a total of 175 * 20 or 3500 I/Os per second on the stock table. For good performance, each disk should be restricted to a response time of less than 20 ms, which we take here to mean about 40 I/Os per second per disk, so we will need 3500 / 40 or 88 disks. Notice that this is a far cry from the 18 drives we computed by looking at the space usage alone. Thus, for OLTP applications, it is extremely important to consider database access patterns before deciding on the number of spindles required.
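The two sizing calculations in the stock table example can be written out as a short Python sketch. The function names are hypothetical; the figure of 40 I/Os per second per disk is the divisor used in the text for a sub-20 ms response time, not a measured value.

```python
import math


def disks_for_capacity(table_gb, disk_gb):
    """Disks needed if only space is considered."""
    return math.ceil(table_gb / disk_gb)


def disks_for_io(tx_per_min, ios_per_tx, ios_per_disk_per_sec):
    """Disks needed to sustain the random I/O rate of the workload."""
    ios_per_sec = tx_per_min / 60 * ios_per_tx
    return math.ceil(ios_per_sec / ios_per_disk_per_sec)


by_space = disks_for_capacity(table_gb=36, disk_gb=2.1)
by_io = disks_for_io(tx_per_min=10500, ios_per_tx=20, ios_per_disk_per_sec=40)
print(by_space, by_io)
```

The larger of the two results is the one to plan for; in this example the I/O rate, not the space, determines the spindle count.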
Fragmentation
For small and lightly used tables a flat layout inside a single dbspace is sufficient. If these tables are greater than 2GB just add extra chunks to the dbspace. For hot tables however this scheme will lead to poor performance. These tables should be fragmented. Fragmentation lets the user place table rows in different dbspaces based on some distribution scheme. The advantage of fragmentation is that the optimizer can possibly eliminate whole sections of data when executing a query. In an OLTP environment this improves concurrency as it reduces contention on the underlying devices.
There are 2 types of distribution scheme, round-robin and expression-based. In round-robin Online uses an internal scheme to distribute the rows. It is generally a poor performer as no fragments can be eliminated.
An expression-based distribution scheme requires the user to define a rule and include it in the create table or create index statement. For more details see the Informix ODS Administrator’s Guide, Vol. 1.
We usually use range fragmentation for our hot table definitions, e.g.:
CREATE TABLE customer (
c_w_id SMALLINT,
c_d_id SMALLINT,
c_id INTEGER,
...
c_ytd_payment DECIMAL(12,2),
c_data CHAR(500)
)
FRAGMENT BY EXPRESSION
c_w_id <= 100 IN cd_dbs1,
c_w_id <= 200 AND c_w_id > 100 IN cd_dbs2,
c_w_id <= 300 AND c_w_id > 200 IN cd_dbs3,
c_w_id <= 400 AND c_w_id > 300 IN cd_dbs4,
c_w_id <= 500 AND c_w_id > 400 IN cd_dbs5,
c_w_id <= 600 AND c_w_id > 500 IN cd_dbs6,
c_w_id <= 700 AND c_w_id > 600 IN cd_dbs7,
c_w_id <= 800 AND c_w_id > 700 IN cd_dbs8,
c_w_id <= 900 AND c_w_id > 800 IN cd_dbs9,
REMAINDER IN cd_dbs10
EXTENT SIZE 500200
NEXT SIZE 500200
As rows are inserted into the table the value of c_w_id is checked and this determines in which dbspace to place the data.
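The routing rule expressed by the FRAGMENT BY EXPRESSION clause above can be mimicked in a few lines of Python. This is only a model of the expression for the customer table, with a hypothetical function name; it says nothing about how Online evaluates fragment expressions internally.

```python
def customer_dbspace(c_w_id):
    """Return the dbspace the customer fragment expression selects for c_w_id."""
    # Ranges of 100 warehouses each map to cd_dbs1 .. cd_dbs9.
    for i, upper in enumerate(range(100, 1000, 100), start=1):
        if c_w_id <= upper:
            return f"cd_dbs{i}"
    # REMAINDER IN cd_dbs10
    return "cd_dbs10"


for w in (1, 100, 101, 900, 901):
    print(w, customer_dbspace(w))
```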
Table Size
In order to size an Online database correctly we can make a rough calculation of the row length of both data and indexes and the number of rows in a 2k page. If the user has already created the table, the row size can be found with the SQL fragment:
select rowsize from systables where tabname = "table-name";
Each “normal” page has 28 bytes at the start taken up by the Page Header. At the end is a 4 byte timestamp. Comparison of this timestamp with one in the header determines if a page has been modified. Growing back from the timestamp are the slot table entries, 4 bytes for each row of data. A slot table entry contains the offset in the page and the length of the row. So the actual space a row takes up is rowsize + 4. From this we can determine the number of rows of a table a page can accommodate. Due to a 1 byte identifier in the slot table entry the maximum number of rows is 255.
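The page layout just described leads to a simple rows-per-page estimate, sketched here in Python. The function names are hypothetical and the calculation is only the rough one the text describes (28 byte header, 4 byte timestamp, 4 byte slot entry per row, 255 row limit); it ignores rows that span pages.

```python
import math

PAGE_BYTES = 2048      # 2k page in ODS 7.x
HEADER_BYTES = 28      # page header
TIMESTAMP_BYTES = 4    # trailing timestamp
SLOT_BYTES = 4         # slot table entry per row
MAX_ROWS_PER_PAGE = 255


def rows_per_page(rowsize):
    """Rough number of rows of the given size that fit in one data page."""
    usable = PAGE_BYTES - HEADER_BYTES - TIMESTAMP_BYTES
    if rowsize + SLOT_BYTES > usable:
        raise ValueError("row does not fit in a single page")
    return min(MAX_ROWS_PER_PAGE, usable // (rowsize + SLOT_BYTES))


def pages_needed(num_rows, rowsize):
    """Rough number of data pages needed for num_rows rows."""
    return math.ceil(num_rows / rows_per_page(rowsize))


# Example: a 306 byte row, the stock table row size used later in this section.
print(rows_per_page(306), pages_needed(4500000, 306))
```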
It is not recommended to make a row longer than a page. Following the chain of pages for each row accessed adds a lot of overhead and performance is degraded. Blobpages are also very poor performers.
If the table is already loaded oncheck can be used to determine the exact number of pages allocated to it:
oncheck -pT tpcc:tablename
This can take some time for large tables but can give some valuable information. It is useful to dump the oncheck data when the database is initially loaded and on a regular basis to see actual growth needs etc.
For each dbspace a report is generated, e.g.:
TBLspace Report for tpcc:informix.stock
Table fragment in DBspace s_ddbs01
Physical Address 200005
Creation date 03/31/97 19:16:07
TBLspace Flags 802 Row Locking
TBLspace use 4 bit bit-maps
Maximum row size 306
Number of special columns 0
Number of keys 0
Number of extents 4
Current serial value 1
First extent size 190000
Next extent size 190000
Number of pages allocated 760000
Number of pages used 750187
Number of data pages 750000
Number of rows 4500000
Partition partnum 2097154
Partition lockid 2097154
Extents
Logical Page Physical Page Size
0 200035 190000
190000 300003 190000
380000 400003 190000
570000 500003 190000
TBLspace Usage Report for tpcc:informix.stock
Type Pages Empty Semi-Full Full Very-Full
---------------- ---------- ---------- ---------- ---------- ----------
Free 9813
Bit-Map 187
Index 0
Data (Home) 750000
----------
Total Pages 760000
Unused Space Summary
Unused data slots 0
Unused bytes per data page 160
Total unused bytes in data pages 120000000
Home Data Page Version Summary
Version Count
0 (current) 750000
Volume Management
After the number of disks required is determined, you must consider how to manage them. In a large database, several hundred disks may be used and managing them can be a major task. Use of a Volume Manager like Solstice DiskSuite or Veritas Volume Manager can ease this task considerably.
Availability
There are trade-offs to be made between cost, availability and performance. How much downtime you can tolerate will help decide this. If you want your database to be impervious to a single disk failure, then you should consider RAID 1 or RAID 5 implementations. For the TPC-C workload, a fully mirrored database using RAID 1 shows a 10% performance degradation compared to a non-RAID database. Using RAID 5 further increases this degradation to 35%. The RSM2000 is a possible solution for RAID 5. A more complete description of the performance implications of RAID on OLTP workloads can be found in the whitepaper “Performance Evaluation of RAID With OLTP Workloads” at http://hot.eng/whitepapers. Note that if your workload performs many writes, the degradation may be more severe. Performance degradations of up to 50% are not uncommon.
Striping
Disk striping using RAID 0 is often used to configure disks for good performance. Striping helps spread the load across disks, eliminating hot spots in the database. The user can use Informix striping via fragmentation or Veritas striping. In some situations, primarily when there is skew in the access pattern of a table, Informix fragmentation is not the best solution.
As an example, let's take the stock table of the TPC-C database. Using Informix striping we initially lay out the data on 72 disks. Each dbspace is made up of 6 chunks from 6 different disks:
onspaces -c -d sd_dbs1 -p /device_links/stockd_1 -o 0 -s 463300
onspaces -a sd_dbs1 -p /device_links/stockd_2 -o 0 -s 463300
onspaces -a sd_dbs1 -p /device_links/stockd_3 -o 0 -s 463300
onspaces -a sd_dbs1 -p /device_links/stockd_4 -o 0 -s 463300
onspaces -a sd_dbs1 -p /device_links/stockd_5 -o 0 -s 463300
onspaces -a sd_dbs1 -p /device_links/stockd_6 -o 0 -s 463300
We run our benchmark and, using iostat/statit, discover that I/Os to the 6 disks are very uneven:
Disk I/O Statistics (per second)
Disk util% xfer/s rds/s wrts/s rdb/xfr wrb/xfr wtqlen svqlen srv-ms
c3t3d1 42.6 59.9 28.8 31.1 2048 2048 0.00 0.62 10.4
c3t3d2 51.7 81.0 34.9 46.1 2048 2048 0.00 0.85 10.4
c3t3d3 10.9 13.7 7.4 6.3 2048 2048 0.00 0.12 8.7
c3t3d4 14.1 18.0 9.5 8.5 2048 2048 0.00 0.16 8.8
c3t4d0 26.7 36.0 18.5 17.5 2048 2048 0.00 0.34 9.3
c3t4d1 38.7 55.3 27.4 27.9 2048 2048 0.00 0.55 9.9
The first 2 disks are hot, the second 2 cold and the last 2 medium. To even out the I/O we create 6 six-way Veritas stripes:
/etc/vx/bin/vxdisksetup -i c3t3d1
vxdg -g rootdg adddisk stkvol1=c3t3d1
....
/etc/vx/bin/vxdisksetup -i c3t4d1
vxdg -g rootdg adddisk stkvol6=c3t4d1
vxassist -g rootdg make stk1 1000m layout=stripe columns=6 stripeunit=16k stkvol1 stkvol2 stkvol3 stkvol4 stkvol5 stkvol6
....
vxassist -g rootdg make stk5 1000m layout=stripe columns=6 stripeunit=16k stkvol1 stkvol2 stkvol3 stkvol4 stkvol5 stkvol6
and take the 6 chunks from these new volumes. The I/O then becomes uniform:
c3t3d1 35.6 43.0 20.5 22.5 2048 2048 0.00 0.50 11.6
c3t3d2 36.2 43.4 20.7 22.7 2048 2048 0.00 0.50 11.6
c3t3d3 36.0 42.7 20.6 22.1 2048 2048 0.00 0.51 11.9
c3t3d4 36.6 43.6 21.0 22.6 2048 2048 0.00 0.51 11.6
c3t4d0 35.9 42.5 20.5 22.0 2048 2048 0.00 0.50 11.7
c3t4d1 35.6 43.6 20.6 22.9 2048 2048 0.00 0.50 11.4
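One way to quantify "uneven" versus "uniform" is to compare the busiest disk with the average. The sketch below does this for the xfer/s column of the two iostat samples above; the imbalance metric itself is an illustrative choice and is not taken from iostat, statit or Veritas.

```python
def imbalance(xfers_per_sec):
    """Ratio of the busiest disk's transfer rate to the mean rate (1.0 means even)."""
    mean = sum(xfers_per_sec) / len(xfers_per_sec)
    return max(xfers_per_sec) / mean

# xfer/s for the six disks, copied from the two samples above.
informix_striping = [59.9, 81.0, 13.7, 18.0, 36.0, 55.3]
veritas_striping = [43.0, 43.4, 42.7, 43.6, 42.5, 43.6]

print(round(imbalance(informix_striping), 2))
print(round(imbalance(veritas_striping), 2))
```

The total transfer rate is almost the same in both samples; only its distribution across the spindles changes.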
Interleave Factor
The interlace (or interleave) factor is the amount of contiguous space allocated on one spindle before the stripe moves on to the next. For example, if we specify an interlace factor of 16K, the volume manager will assign the first 16K bytes from the first disk, the next 16K from the next disk in the stripe, and so on. For OLTP workloads the interlace factor should be small. Interleave factors of 16K to 32K should work well for table and index data.
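The round-robin assignment described above can be sketched as a small mapping function. This is a simplified model under stated assumptions: it ignores any private regions or headers a real volume manager keeps on each disk, and the function name and defaults are hypothetical.

```python
def stripe_location(offset_bytes, stripe_unit=16 * 1024, columns=6):
    """Map a byte offset in a striped volume to (column index, offset on that column)."""
    unit_index = offset_bytes // stripe_unit   # which stripe unit the offset falls in
    column = unit_index % columns              # units are dealt out round-robin
    row = unit_index // columns                # full stripes that precede this unit
    return column, row * stripe_unit + offset_bytes % stripe_unit

print(stripe_location(0))           # first unit lands on the first disk
print(stripe_location(16 * 1024))   # next 16K lands on the second disk
```

With a small stripe unit, a run of adjacent 2K pages is spread over several spindles, which is the behaviour wanted for random OLTP access.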
Raw vs UFS
A large number of customers use UFS for their database files for convenience. From a performance perspective, raw devices will outperform UFS for OLTP workloads. We have measured a two-fold increase in performance for raw vs UFS.
Tuning Existing Layouts
Often we don’t have the luxury of laying out a database from scratch. In such cases, we need to monitor disk performance and tune as best we can. The first step is to collect disk I/O statistics during normal operation of the workload. System utilities such as sar and iostat can be used. If the volume manager being used is Veritas, the utility vxstat will provide disk statistics at a volume level, eliminating the need to map physical disks to logical tablespaces. Online also provides statistics such as the number of reads and writes per chunk. If the statistics show more than 40 I/Os per second or service times greater than 50 ms, the system may warrant tuning. Identify which tablespace is on the problem disks. If the tablespace contains multiple tables, more analysis is needed to determine which table is the culprit.
It may be possible to reduce disk activity by caching more data in memory. If you have sufficient memory, try increasing the number of BUFFERS in the onconfig file. If this doesn’t reduce the I/O bottleneck, it may be necessary to redistribute the data onto more disks. If the Informix chunks were all added using logical names (i.e. symbolic links to the actual physical devices, or Veritas volume names), this redistribution is fairly straightforward: shut down Online, re-create the volumes over a larger number of disks and copy the data from the old volumes to the new ones. If disk space is a constraint, or actual physical device names were used as datafile names, it may be necessary to export the table contents, then drop and re-create the tablespace before loading the data back in.
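The 40 I/Os per second and 50 ms thresholds can be applied mechanically to collected statistics. The sketch below assumes the per-disk figures have already been extracted into a dictionary; the function name and data layout are hypothetical, and the sample values are the xfer/s and srv-ms columns from the uneven iostat sample shown earlier.

```python
def disks_needing_attention(stats, max_ios=40.0, max_service_ms=50.0):
    """Return names of disks whose I/O rate or service time exceeds the thresholds.

    stats maps a disk name to (transfers per second, service time in ms).
    """
    return sorted(name for name, (ios, svc_ms) in stats.items()
                  if ios > max_ios or svc_ms > max_service_ms)

sample = {
    "c3t3d1": (59.9, 10.4), "c3t3d2": (81.0, 10.4), "c3t3d3": (13.7, 8.7),
    "c3t3d4": (18.0, 8.8), "c3t4d0": (36.0, 9.3), "c3t4d1": (55.3, 9.9),
}
print(disks_needing_attention(sample))
```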
Indexes
Indexes can greatly improve the performance of OLTP environments. Informix uses a B+ tree structure for indexes. All indexes start with a single root node. There may be a number of intermediate branch levels, and the index ends in leaf nodes. The index nodes contain rows called index items. Each index item consists of a key value (which may be a composite) and a rowid. The rowid represents either a row in another index page or, in the case of leaf nodes, the actual data row. The fields from the original table that make up the index are called the keys. Keys are chosen so that the optimizer will choose the index in a particular query. If the chosen keys allow all the results of a query to be returned without going to the data row, the index is said to cover the query.
With regard to performance there are a number of issues with indexes. The first is the depth of the tree. When accessing a data row using the index, each level in the tree requires a buffer read. This can affect the cache hit rate of the buffer pool and, in the worst case, require extra I/Os. In the case of an update, a lock is held on all rows involved in the index search. Locks on indexes can become quite hot if a lot of updates are occurring.
One potential solution is to fragment the index. Indexes on fragmented tables that don’t specify their own strategy are fragmented the same way as the table. The engine reserves space in the table’s dbspace for the index using an internal algorithm. These are known as attached indexes. Alternatively, the user can declare a separate index fragmentation scheme. For example, the table could be fragmented 4 ways:
CREATE TABLE orders (
o_id INTEGER,
o_c_id INTEGER,
o_d_id SMALLINT,
o_w_id SMALLINT,
o_carrier_id SMALLINT,
o_ol_cnt SMALLINT,
o_all_local SMALLINT,
o_entry_d DATETIME YEAR TO SECOND
)
FRAGMENT BY EXPRESSION
o_w_id <= 250 IN od_dbs1,
o_w_id <= 500 AND o_w_id > 250 IN od_dbs2,
o_w_id <= 700 AND o_w_id > 500 IN od_dbs3,
REMAINDER IN od_dbs4
EXTENT SIZE 267700
NEXT SIZE 267700
LOCK MODE ROW;
and fragment the index 2 ways
CREATE UNIQUE INDEX oi1 ON orders(o_id, o_w_id, o_d_id)
FRAGMENT BY EXPRESSION
o_w_id <= 500 IN oi1_dbs1,
REMAINDER IN oi1_dbs2;
Fragmentation has the effect of reducing the depth of the B+ tree. The optimizer can determine from the fragmentation strategy which fragment of the index to traverse. If the trees in the fragments are shallower than a single large index, fewer I/Os are required to reach the row. Index fragmentation does have a downside, however: it requires searching long linked lists to get to the required fragment. This can sometimes adversely affect performance. Even if an index is shallow, the transaction mix may make the disks hot. Fragmentation can help in this situation.
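To make the fragment selection concrete, the two FRAGMENT BY EXPRESSION clauses above can be modelled as plain functions of o_w_id. This is only a model of how the expressions partition the key range, not the engine's implementation, and it ignores NULL values; the function names are hypothetical.

```python
def orders_fragment(o_w_id):
    """dbspace holding an orders row, following the table's fragmentation expressions."""
    if o_w_id <= 250:
        return "od_dbs1"
    if o_w_id <= 500:
        return "od_dbs2"
    if o_w_id <= 700:
        return "od_dbs3"
    return "od_dbs4"  # REMAINDER fragment

def oi1_fragment(o_w_id):
    """dbspace holding the oi1 index entry for the same row."""
    return "oi1_dbs1" if o_w_id <= 500 else "oi1_dbs2"

for warehouse in (1, 251, 600, 900):
    print(warehouse, orders_fragment(warehouse), oi1_fragment(warehouse))
```

Because the index has its own expression, a lookup on o_w_id only needs to traverse one of the two smaller index trees, independently of which of the four table fragments holds the row.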
To determine the depth of a tree use the Online facility oncheck described earlier.
If there is an attached index on the table there will be an “Index” line with the number of pages allocated. For each attached index there will be output as follows:
Index Usage Report for index customer_index on tpcc:informix.customer
Average Average
Level Total No. Keys Free Bytes
----- -------- -------- ----------
1 1 66 972
2 66 62 1017
3 4135 62 1027
4 256411 116 31
----- -------- -------- ----------
Total 260613 116 47
Here we see that customer_index has a depth of 4. Notice how the branch pages have far more free space than the leaf pages. The number of rows in an index page is controlled by the FILLFACTOR parameter. FILLFACTOR can be set for the whole system in the onconfig file or overridden in the create/alter index statement. The default FILLFACTOR is 90%. FILLFACTOR only takes effect when the index is being built, not when it is being updated.
Set FILLFACTOR higher for static indexes or indexes that are updated but rarely change size. This will improve index performance by yielding a better cache hit rate and reduced I/O. Set FILLFACTOR lower than the default for indexes that are going to have a lot of inserts. This will give the index room to grow and reduce the amount of page splitting required. For heavily updated tables it might make sense to periodically drop the index and rebuild it with the required FILLFACTOR.
Another issue with indexes is the number to have on a table. There is no limit in Informix, but every index must be maintained. If a table is heavily updated or altered, each index must be modified for each operation.
We also recommend not choosing a varchar for the key in an index. Traversing an index requires a number of key comparisons, and varchars generally perform poorly in this situation.
Building Indexes
Online can build indexes serially or in parallel. Serial is the default operation and is often sufficient for small tables. Each chunk of the table is read in sequence and the data sorted. On machines with a larger number of CPUs, parallel builds are more efficient. A parallel build reads multiple disks into memory, sorts the data and writes the index out in parallel. To do this the user must enable Parallel Data Query (PDQ) and set the PSORT_NPROCS environment variable. PSORT_NPROCS restricts the number of sort threads that will be started in the engine (on a CPUVP basis). PDQ is enabled by setting MAX_PDQPRIORITY in the onconfig file and setting the PDQPRIORITY environment variable. An example script would be:
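The interaction between fill factor and tree depth can be estimated with a rough model. The sketch below assumes every page, branch or leaf, holds the same number of keys when full, which is a simplification (the report above shows different averages for branch and leaf pages); the function name and sample figures are hypothetical.

```python
import math

def estimate_index_levels(num_keys, keys_per_full_page, fillfactor=0.90):
    """Estimate the number of B+ tree levels (root included) for a freshly built index."""
    keys_per_page = max(2, int(keys_per_full_page * fillfactor))
    pages = math.ceil(num_keys / keys_per_page)    # leaf level
    levels = 1
    while pages > 1:
        pages = math.ceil(pages / keys_per_page)   # parent level above
        levels += 1
    return levels

# Hypothetical index: 1000 keys, 10 keys fit on a completely full page.
print(estimate_index_levels(1000, 10, fillfactor=1.0))
print(estimate_index_levels(1000, 10, fillfactor=0.5))
```

A lower fill factor leaves room for inserts but produces more pages, and in this model more levels, at build time. That is the trade-off described above.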
PSORT_NPROCS=10
PDQPRIORITY=100
export PSORT_NPROCS
export PDQPRIORITY
dbaccess tpcc index_second_customer
PDQ requires memory for operation, and the amount it can allocate is set by the DS_TOTAL_MEMORY variable in the onconfig file. Each index build will get DS_TOTAL_MEMORY / DS_MAX_QUERIES of memory to work with. So for the most efficient parallel build, set DS_TOTAL_MEMORY to the most you can allocate and DS_MAX_QUERIES to 1, so that the build gets all available memory.
It is also a good idea to set SHMVIRTSIZE high to avoid multiple shared segments being allocated during the index build (see the shared memory section in Informix Tuning).
During a parallel index build, onstat -g ath (show all threads) should show psortproc threads being scheduled:
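The memory quotient described above is simple to work through. In the sketch below the function name is hypothetical, the sample values are invented for illustration, and the unit is assumed to be kilobytes.

```python
def pdq_memory_per_query_kb(ds_total_memory_kb, ds_max_queries):
    """Memory available to one PDQ query: DS_TOTAL_MEMORY / DS_MAX_QUERIES."""
    if ds_max_queries < 1:
        raise ValueError("DS_MAX_QUERIES must be at least 1")
    return ds_total_memory_kb // ds_max_queries

# Hypothetical setting of 400000 KB for DS_TOTAL_MEMORY.
print(pdq_memory_per_query_kb(400000, 4))   # build shares memory with other queries
print(pdq_memory_per_query_kb(400000, 1))   # build gets the whole allocation
```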
311 cf51d20 eb61490 2 cond wait(packet_cond) 5cpu xchg_2.82
312 cf75de8 eb61834 2 cond wait(packet_cond) 3cpu xchg_2.83
313 cf50c58 eb61bd8 2 cond wait(packet_cond) 3cpu xchg_2.84
314 cf50d80 eb61f7c 2 ready 5cpu xchg_2.85
625 e9b7d58 30b267b8 2 ready 1cpu psortproc
626 e9c0680 30b26b5c 2 ready 3cpu psortproc
627 e9c0988 30b26f00 2 ready 5cpu psortproc
628 e9c0c90 30b272a4 2 ready 4cpu psortproc
629 e9c0f98 30b27648 2 ready 4cpu psortproc
Parallel build also uses big buffers so onstat -g iob should have output
INFORMIX-OnLine Version 8.20.UA2 -- On-Line -- Up 00:33:58 -- 3608456Kbytes
AIO big buffer usage summary:
class reads writes
pages ops pgs/op holes hl-ops hls/op pages ops pgs/op
fif 0 0 0.00 0 0 0.00 0 0 0.00
kio 1593640 53394 29.85 26 10 2.60 11470 2685 4.27
Finally, onstat -g iof (I/O to each chunk) should show parallel reads from more than one of the table chunks.
When DS memory is exhausted the index build will overflow to the temp dbspace. Temp is specified by the DBSPACETEMP variable in onconfig. If none is specified, /tmp is used and the sort data is written to a cooked file. This is suboptimal as kaio cannot be employed and the cooked file codepath is not optimized. Always create a number of temp spaces and set the DBSPACETEMP variable.
DBSPACETEMP tmpdbs1,tmpdbs2,tmpdbs3,tmpdbs4,tmpdbs5,tmpdbs6,tmpdbs7,tmpdbs8,tmpdbs9,tmpdbs10
# Default temp dbspaces
The spaces will be written to in a round-robin fashion. If TEMP becomes hot the user might consider striping using Veritas. A small interleave factor (say 16k or lower) should be chosen for the volumes.
Loading data
Informix provides a parallel loader for Online, which is beyond the scope of this document. For more information see the Guide to the High Performance Loader.
A simpler solution is to load into multiple fragments in parallel. Any table that is fragmented can be loaded in this fashion, which avoids contention on the bitmap page of the table and the thrashing inherent in two loaders going after the same tables.
Online Tuning
I/O Tuning
In Solaris, Online’s default method of I/O is kaio for all reads and writes. Each CPUVP has a special kaio thread that performs this task. When a normal thread yields in the engine the kaio thread is always scheduled next. The kaio thread uses aio_read and aio_write to submit outstanding requests to the OS and then uses aio_wait with a zero timeout to check if there are any completions. It passes on any completed I/O and then yields to the scheduler, which chooses the next thread to run.
onstat -g iov displays the activity of each kaio thread
AIO I/O vps:
class/vp s io/s totalops dskread dskwrite dskcopy wakeups io/wup
kio 0 i 0.0 43 36 7 0 89 0.5
kio 1 i 0.0 658 640 18 0 1489 0.4
The kaio thread employs an algorithm to try to coalesce I/O requests for adjacent pages. It then submits this larger I/O in what is called a big buffer, which is more efficient. Big buffers are also used in index building. To see big buffer usage use onstat -g iob:
AIO big buffer usage summary:
class reads writes
pages ops pgs/op holes hl-ops hls/op pages ops pgs/op
fif 0 0 0.00 0 0 0.00 0 0 0.00
kio 2803 2742 1.02 0 0 0.00 102 77 1.32
Normal threads submit reads and writes via the I/O queues. The kaio thread empties its queue each time it is scheduled. To see the status of the queues use onstat -g ioq.
32 Online Tuning—December 1997
AIO I/O queues:
q name/id len maxlen totalops dskread dskwrite dskcopy
kio 1 0 16 119642 58799 60843 0
kio 2 0 16 132167 59397 72770 0
kio 3 0 16 124531 59469 65062 0
kio 4 0 16 110924 59482 51442 0
kio 5 0 16 126967 59182 67785 0
kio 6 0 16 122345 58707 63638 0
kio 7 0 16 130828 61490 69338 0
The len field gives the current number of outstanding I/Os submitted and maxlen is a high-water indicator. The maximum is usually reached during buffer cleaning, when a cleaner thread chooses a number of buffers from an LRU (default 16) and writes them to disk. Look at dskread and dskwrite to ensure all kaio threads are doing roughly the same amount of I/O.
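Checking that the queues are evenly loaded can be scripted. The sketch below parses the kio lines of the onstat -g ioq sample above and reports how far apart the busiest and least busy queues are; the parsing assumes exactly the eight-column layout shown, and the function names and spread metric are hypothetical.

```python
IOQ_SAMPLE = """\
kio 1 0 16 119642 58799 60843 0
kio 2 0 16 132167 59397 72770 0
kio 3 0 16 124531 59469 65062 0
kio 4 0 16 110924 59482 51442 0
kio 5 0 16 126967 59182 67785 0
kio 6 0 16 122345 58707 63638 0
kio 7 0 16 130828 61490 69338 0
"""

def queue_totals(text):
    """Return {queue id: totalops} for each kio line of onstat -g ioq output."""
    totals = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 8 and fields[0] == "kio":
            totals[int(fields[1])] = int(fields[4])
    return totals

def spread(totals):
    """Gap between busiest and least busy queue, as a fraction of the mean."""
    values = list(totals.values())
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean

print(round(spread(queue_totals(IOQ_SAMPLE)), 3))
```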
To find out what chunks are receiving the I/O requests use onstat -g iof
AIO global files:
gfd pathname totalops dskread dskwrite io/s
3 root_chunk 93 45 48 0.0
4 plog_chunk 2986 0 2986 0.0
5 llog_chunk1 0 0 0 0.0
6 llog_chunk2 18411 5841 12570 0.1
7 llog_chunk3 0 0 0 0.0
8 /amir/DEV/wdi_1 172 133 39 0.0
9 custd_1 8578 5867 2711 0.0
10 custd_2 8789 6094 2695 0.1
11 custd_3 9106 6411 2695 0.1
.....
Informix generally recommends no more than 40 I/Os a second to a disk, but with today’s faster disks 56 - 60 can be OK. If the chunk is on a multi-disk stripe, however, the chunk can take 40 I/Os times the number of disks. For instance, if we have a 6-way stripe the I/Os can go up to 240 a second. On RSM or SSA arrays turning on the fast write cache can improve write performance.
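The per-chunk ceiling follows directly from the per-disk guideline. A minimal sketch, with a hypothetical function name and the 40 I/Os per second guideline as the default:

```python
def chunk_io_ceiling(stripe_columns, per_disk_ios=40):
    """Guideline I/O rate for a chunk striped over stripe_columns disks."""
    return stripe_columns * per_disk_ios

print(chunk_io_ceiling(6))                   # six-way stripe at 40 I/Os per disk
print(chunk_io_ceiling(6, per_disk_ios=60))  # same stripe on faster disks
```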
Logging
Physical Logging
As mentioned earlier, it is essential to make the physical log big enough to avoid continual checkpointing. The size limit is 2GB in 7.x and there can be only one physical log in an Online instance. If a 2GB log is still filling too fast the user will have to live with the checkpoints and try to reduce their duration instead.
The onconfig parameter PHYSBUFF (in kbytes) determines the I/O block size for the physical log. Writes to the physical log are double-buffered: one buffer is being filled in memory while the other is being written to disk. In update-intensive OLTP environments the physical log may become hot. In such situations, first increase PHYSBUFF to 128k to see if it helps performance. 128k is by default the largest block size that Solaris will not break up when writing to disk.
The user can also stripe the physical log, if possible, using a volume manager such as Veritas. Make sure the interleave factor is less than PHYSBUFF or the entire buffer will be written to just 1 disk in the stripe.
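The reason for keeping the interleave factor below PHYSBUFF can be seen by counting how many stripe columns one buffer write covers. This sketch assumes the write starts on a stripe-unit boundary; the function name and the sample sizes other than 128k are hypothetical.

```python
import math

def disks_touched(buffer_kb, interleave_kb, columns):
    """Number of stripe columns covered by one aligned write of buffer_kb."""
    units = math.ceil(buffer_kb / interleave_kb)   # stripe units spanned by the write
    return min(units, columns)

print(disks_touched(128, 16, 6))   # 128k buffer, 16k interleave, 6-way stripe
print(disks_touched(32, 64, 4))    # interleave larger than the buffer
```

When the interleave is larger than the buffer, the whole write lands on a single disk and the stripe gives no benefit for that I/O.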
Logical Logging
As with the physical log, the user should ensure that the logical logs are big enough to avoid constant checkpoints. A checkpoint is initiated when 2 logical logs are crossed. Having 16 logs of 500k each, therefore, is a bad scheme. Make the logical logs large (2GB is possible). If the environment is being archived to tape the user should provide enough log space for, say, an 8 hour work day. When all logs are full the system will halt, waiting for backup to complete.
Again, the logical logs can be striped, but they are not usually hot enough to warrant this as the actual I/Os tend to be sequential. They are usually mirrored, however. The user has 2 options here: volume manager mirroring or Informix mirroring. Informix mirroring is achieved using the onspaces command, either when the log is created
onspaces -c -d llog_dbs -p /dbs/llog_dbs0 -o 4 -s 500000 -m /dbs/lm_dbs0 4
or afterwards with the -m option
onspaces -m llog_ddbs01 -p /dev/ifmx/s132d15vol -o 5120
We have seen no difference in performance between Informix and Veritas log mirroring.
The onconfig parameter LOGBUFF sets the I/O size of the logical log. In BUFFERED logging this amount of data is written out on each I/O. In UNBUFFERED and ANSI logging this is the size of the log buffer. Multiple transactions write log records to the log buffer. When one transaction commits, the log buffer must be flushed. The amount of buffer actually written is determined by the amount of “piggybacking” achieved before a commit. The length of the transactions and their mix determine this piggybacking. The pages/io column in the onstat -l output indicates the amount of piggybacking achieved:
Logical Logging
Buffer bufused bufsize numrecs numpages numwrits recs/pages pages/io
L-1 0 16 2618502 97451 13371 26.9 7.3
Subsystem numrecs Log Space used
OLDRSAM 2618502 183907884
In this case each logical log write averaged 7.3 pages or 14.6k, roughly half the 32k log buffer. In high-volume, short-transaction environments the piggybacking will decrease as transactions are continually committing. Use the pages/io statistic to determine the size of the log buffer.
Connecting to the Database
When a database has been initialized the next step is getting users connected to the engine. In order to connect, either via an application or a tool such as dbaccess, a user must set the environment variable INFORMIXSERVER to indicate the instance of the server they wish to attach to (multiple instances can be present either on the same machine or on the network). This name must be present in the user’s sqlhosts file:
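The pages/io figure and its relation to the buffer size can be recomputed from the other columns of the onstat -l sample above. The sketch assumes the 2k page size stated elsewhere in this paper; the function name is hypothetical.

```python
PAGE_KB = 2  # page size used throughout this paper

def log_write_profile(numpages, numwrits, bufsize_pages):
    """Return (pages per write, KB per write, fraction of the log buffer used)."""
    pages_per_io = numpages / numwrits
    kb_per_io = pages_per_io * PAGE_KB
    return pages_per_io, kb_per_io, pages_per_io / bufsize_pages

# numpages, numwrits and bufsize taken from the L-1 line above.
pages, kbytes, fraction = log_write_profile(97451, 13371, 16)
print(round(pages, 1), round(kbytes, 1), round(fraction, 2))
```

If the fraction stays well below 1 under peak load, a larger LOGBUFF is unlikely to help; if writes regularly fill the buffer, a larger buffer may be worth trying.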
echo $INFORMIXDIR
INFORMIXSERVER=xtpcc
cat $INFORMIXDIR/etc/sqlhosts
xtpcc ontlitcp campi-1 7600
The default directory for sqlhosts is $INFORMIXDIR/etc but this can be changed by setting the INFORMIXSQLHOSTS environment variable. Communication can be over the network using tli/tcp or locally using a shared memory protocol called ipcshm. The hostname campi-1 above must also exist in the client’s /etc/hosts file.
The engine must then provide the service to the user. The INFORMIXSERVER name must be present in the onconfig file as either DBSERVERNAME or an alias in the DBSERVERALIASES list. Each DBSERVERNAME or DBSERVERALIASES entry must be present in the server’s sqlhosts file:
onconfig entries:
DBSERVERNAME rtpcc
DBSERVERALIASES xtpcc
sqlhosts file:
rtpcc olipcshm campi rtpcc
xtpcc ontlitcp campi-1 7600
In this example local connections use the server name rtpcc and remote connections use the server alias xtpcc. When using tli/tcp there must be an entry in the /etc/services file for the Online listener:
rtpcc 7600/tcp # Informix listener
Informix must have read permission for /etc/services. When the engine comes up, netstat -a will show whether the listener is operational:
TCP
Local Address Remote Address Swind Send-Q Rwind Recv-Q State
----------------- -------------------- ----- ------ ----- ------ -------
campi-1.rtpcc *.* 0 0 0 0 LISTEN
campi-1.rtpcc haxx3-1.33232 8760 0 8760 0 ESTABLISHED
*.rtpc *.* 0 0 8576 0 BOUND
When dbaccess or an application is connected over tli/tcp, netstat will show ESTABLISHED in the state field. When a connection terminates there will still be a netstat entry with BOUND in the state field.
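The naming rule described in this section, that DBSERVERNAME and every alias in DBSERVERALIASES need a matching sqlhosts entry, can be checked mechanically before starting the engine. The sketch below is a hypothetical helper, not an Informix utility; it assumes the four-column sqlhosts layout shown above and treats text after # as a comment.

```python
def parse_sqlhosts(text):
    """Return {server name: (nettype, host, service)} from sqlhosts text."""
    entries = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        # Raises ValueError if a line has fewer than four fields.
        name, nettype, host, service = line.split()[:4]
        entries[name] = (nettype, host, service)
    return entries

def missing_servers(dbservername, aliases, sqlhosts_text):
    """Names configured in onconfig that have no sqlhosts entry."""
    known = parse_sqlhosts(sqlhosts_text)
    return [name for name in [dbservername, *aliases] if name not in known]

SQLHOSTS = """\
rtpcc olipcshm campi rtpcc
xtpcc ontlitcp campi-1 7600
"""
print(missing_servers("rtpcc", ["xtpcc"], SQLHOSTS))
```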
onstat -g ses will show connections in Online
session #RSAM total used
id user tty pid hostname threads memory memory
22 dbbench 1 2098 haxx3-1 1 32768 27624
12 informix 5 18372 campi 1 32768 27400
The hostname field indicates a local or remote connection. For more information on the various connection options see the Administrator’s Guide, Volume 1.
By using different server aliases Online can listen on multiple networks. Each requires a listener in /etc/services and an entry in the sqlhosts file which specifies a different name:
onconfig
DBSERVERNAME thash
DBSERVERALIASES net2,net3 # List of alternate dbservernames
sqlhosts
thash ontlitcp campi-1 7600
net2 ontlitcp campi-2 7700
net3 ontlitcp campi-3 7800
/etc/hosts
#private nets
192.1.1.100 campi-1
192.1.2.100 campi-2
192.1.3.100 campi-3
By default Online spawns one poll thread for each nettype entry in the sqlhosts file. The NETTYPE onconfig parameter allows the user to allocate more than one poll thread and designate whether these run inline as part of the work of a CPUVP or as a separate process on a NETVP. onstat -g ath displays how many threads are polling:
Threads:
tid tcb rstcb prty status vp-class name
7 ca0b9678 0 2 running 1cpu sm_poll
8 ca0c3248 0 2 running 25tli tlitcppoll
9 ca0c3788 0 2 cond wait(arrived) 26tli tlitcppoll
Here we see one poll thread for shared memory (sm_poll) and two for tli/tcp. onstat -g glo will indicate whether NETVPS have been started. If the poll thread is running on a NETVP there will be a shm entry for ipcshm and a tli entry for tli/tcp.
Virtual processor summary:
class vps usercpu syscpu total
cpu 19 14.88 23.11 37.99
aio 1 0.06 0.25 0.31
tli 8 0.06 0.08 0.14
shm 8 0.06 0.03 0.09
lio 1 0.01 0.01 0.02
pio 1 0.02 0.00 0.02
adm 1 0.00 0.02 0.02
msc 1 0.00 0.01 0.01
total 40 15.09 23.51 38.60
Individual virtual processors:
26 18578 shm 0.01 0.01 0.02
33 18534 tli 0.01 0.01 0.02
The default operation is inline for ipcshm and a network VP for tli/tcp. In high transaction rate OLTP environments we have found it best to use NETVPS for polling. Using a CPUVP means this process must switch from processing queries to handling network operations. This has a detrimental effect on the cache of the processor. An option to try is to have one less CPUVP than the number of physical processors and one network VP. Bind the CPUVPS to physical processors. The NETVP will be scheduled on the spare CPU and will not starve for resources.
There is a known bug with the poll call in Solaris that is currently being fixed. When a lot of connections are made on a port, poll can perform extremely poorly. This is because the kernel keeps a linked list of connections that the poll system call must traverse. This is particularly bad in Baan environments where each Baan session starts multiple Online sessions.
onstat -g ntu, ntt, ntm, ntd, nss, nsc and nsd can give useful network statistics.
Configuring CPUVPS
All processing in Online is performed by CPUVPS. Informix recommends setting the number of CPUVPS to 1 less than the number of physical processors available. The user should always experiment with this; sometimes setting the number of CPUVPS greater than the number of physical processors can increase performance. This will only be true if there is idle time on the system. Use mpstat to determine if Online is using resources efficiently.
When Online is initializing it starts a single oninit process and, for each type of VP, forks another oninit. The state of the VPs can be seen with onstat -g glo.
The aio, pio, lio and msc VPS should have little or no CPU time allocated to them. Also, usercpu should dominate, as Online only really uses the OS for I/O, shared memory and timing (gettimeofday). We have seen at most 20% system time on a heavily loaded OLTP system.
For optimal performance it is better that the CPUVPS stay on the same physical CPU for as long as possible. This reduces the cache miss rate for the process. There are 2 parts to this: the first is binding the process to the CPU and the second is extending the amount of time each process has.
The CPUVP processes are bound to physical CPUs using the parameters AFF_SPROC, which indicates the physical number of the first processor to bind, and AFF_NPROCS, the number of processors to bind. If these are enabled the following messages appear in the message log:
10:01:58 Affinitied VP 1 to phys proc 1
10:01:58 Affinitied VP 3 to phys proc 4
...
If we run pbind on the system we see
process id 4940: 4
process id 4945: 1
...
Binding should reduce the amount of migration of the oninit processes from one physical CPU to another.
For details on modifying the dispatch table see http://hot.eng/dbe/tools/. Modifying the dispatch table gives the oninit processes a longer timeslice, moves them to the highest priority and keeps them at that priority. This reduces the number of times the processes are rescheduled and helps performance.
To see how many users are logged in use onstat -g ses. To see what threads are in the system use onstat -g ath:
Threads:
tid tcb rstcb prty status vp-class name
6 de0e31d8 de00e018 4 sleeping secs: 1 3cpu main_loop()
7 de0e3980 0 2 running 24tli tlitcppoll
8 de0e3e78 0 2 running 25tli tlitcppoll
9 de0e6478 0 3 sleeping forever 1cpu tlitcplst
10 de0e6b98 0 3 sleeping forever 1cpu tlitcplst
11 de0e71e8 de00e4bc 2 sleeping forever 7cpu flush_sub(0)
12 de0e73d8 de00e960 2 sleeping forever 19cpu flush_sub(1)
58 de0f5000 de01bed8 2 sleeping forever 11cpu flush_sub(47)
59 de0f5378 0 4 sleeping forever 22aio kaio
60 de0f5568 0 4 sleeping forever 1cpu kaio
61 de1090b0 de01c37c 3 sleeping forever 3cpu aslogflush
62 de109360 de01c820 2 sleeping secs: 30 4cpu btclean
80 de1189f8 0 4 sleeping forever 3cpu kaio
307 de4788f8 0 4 sleeping forever 17cpu kaio
311 df229df8 deec32a8 2 cond wait netnorm 1cpu sqlexec
312 dfcd4b10 deed037c 2 cond wait netnorm 1cpu sqlexec
313 dfcd51b0 deedcb08 2 cond wait netnorm 1cpu sqlexec
Common threads are flush_sub, which are cleaner threads; kaio, which are the kaio threads (notice they have the highest priority); aslogflush, which flushes the logical log buffer; btclean, which cleans up indexes; and sqlexec, which are user sessions.
Configuring Memory
All processes in an Online instance attach to a number of shared segments. Since 7.2.3 and Solaris 2.5.1 the size of the shared memory area can be up to approximately 3.7 GB. There are 3 segment types: resident, virtual and message.
There is one resident portion, of fixed size, which contains the buffer cache, locks, hash tables, log buffers etc. The size of this segment is determined by the parameters BUFFERS, LOCKS, PHYSBUFF and LOGBUFF. There is very little point in trying to manually determine the size of this segment as the amount each element takes can change with the release of Online. Simply set the desired parameters and see the size of the segment allocated. You can use onstat -g seg:
Segment Summary:
(resident segments are locked)
id key addr size ovhd class blkused blkfree
577 1387874305 a000000 -749838336 54952 R 432754 1
578 1387874306 de000000 131072000 2592 V 6275 9725
Total: - - 3676200960 - - 439029 9726
Class can be one of R (Resident), V (Virtual) or M (Message). The minus sign in the Resident output appears because the value is greater than 2GB; ipcs -a gives similar information but the segment size will be displayed correctly. The resident segment is the only one that can currently be locked down with ISM, by setting the RESIDENT flag to 1 in the onconfig file. For more information on ISM see http://hot.eng/dbe/tuning_and_faqs/ism.html. We have seen performance gains of up to 15% with ISM and always recommend its use.
The address at which the resident segment is placed is controlled by the SHMBASE variable in onconfig. This is usually set to 0x0A000000L which is 160MB up in the address space. Online places the resident segment this high to avoid the code and data segments, which are below. As a last resort this can be reduced if you are short of memory, but be warned that it can lead to odd behaviour.
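The negative size is what a value above 2GB looks like when printed as a signed 32-bit integer, so the true size can be recovered by reinterpreting it as unsigned. A short sketch using the figures from the onstat -g seg output above (the function name is hypothetical):

```python
def unsigned32(value):
    """Reinterpret a signed 32-bit integer as an unsigned one."""
    return value & 0xFFFFFFFF

resident = unsigned32(-749838336)   # size printed for the R segment
virtual = 131072000                 # size printed for the V segment
print(resident, resident + virtual)
```

The corrected resident size plus the virtual segment size gives 3676200960, the same value as the Total line of the report.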
The virtual portion of shared memory contains all other structures: big buffers, sort pools, active thread data, user session data, the database procedure cache, network message queues and many more. It is also used by PDQ for scans, aggregations etc. It can be composed of 1 or more actual shared memory segments. On startup the size of the first segment is controlled by the onconfig parameter SHMVIRTSIZE. When this initial segment is full, extra segments of size SHMADD will be added until the maximum shared memory limit is reached. The user can restrict the entire size of shared memory with the SHMTOTAL onconfig parameter; a value of 0 means unlimited size. If unlimited is specified and the system memory limit is reached, a query will abort. Reducing SHMVIRTSIZE can give extra memory to the BUFFER cache. The user must be careful here. Virtual memory structures associated with user connections are only allocated when the connection is established. If SHMVIRTSIZE is too small and there is no shared memory available in the system the connection will fail.
The amount of shared memory that each user requires varies greatly with what each session is doing. The Informix Performance Guide indicates that anything from 100k to 500k may be required per user. For a large number of users a TP monitor, such as Tuxedo, may be required to avoid memory exhaustion.
The message portion of the shared memory is for the ipcshm interface. It is used by ipcshm sessions to pass messages to and from the engine. It is usually small, its size being determined by the NETTYPE parameter for ipcshm in onconfig. The message segment is always placed at the end of the first virtual segment. An ipcshm client attaches at default address 80000. In older versions of Online the message segment can be misaligned and cause a severe performance problem. This manifests itself as up to 70% system time. A workaround in this situation is to change the client attach address with the environment variable INFORMIXSHMBASE.
Solaris is optimized for a maximum of 5 shared segments per process. For best performance, therefore, we recommend that the user determine the maximum size of the virtual segment for the running system and set SHMVIRTSIZE to this value. It is suboptimal to have Online add multiple small virtual segments. In some releases the maximum that SHMVIRTSIZE can be set to is 2GB. In these cases allocate the initial 2GB and then use onmode -a to add an extra virtual segment for the remainder of memory. Setting the shared memory values may require adjusting the Solaris shm values in /etc/system (see chapter 4).
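To see why a small SHMVIRTSIZE leads to many segments, the number of virtual segments needed for a given peak can be estimated. The sketch assumes SHMVIRTSIZE and SHMADD are expressed in kilobytes and that growth happens strictly in SHMADD-sized steps; the function name and sample values are hypothetical.

```python
import math

def virtual_segments(peak_virtual_kb, shmvirtsize_kb, shmadd_kb):
    """Number of virtual segments needed to hold peak_virtual_kb of virtual memory."""
    if peak_virtual_kb <= shmvirtsize_kb:
        return 1
    extra = peak_virtual_kb - shmvirtsize_kb
    return 1 + math.ceil(extra / shmadd_kb)

# Hypothetical system whose virtual portion peaks at 200000 KB.
print(virtual_segments(200000, 128000, 8192))   # undersized first segment
print(virtual_segments(200000, 200000, 8192))   # SHMVIRTSIZE set to the peak
```

Together with the resident segment, the first case far exceeds the 5 segments per process for which Solaris is optimized, while the second stays within it.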
To see how the memory is used in the system use onstat -g mem
Pool Summary:
name class addr totalsize freesize #allocfrag #freefrag
resident R a00e018 143630336 12144 2 2
res-buff R 12908018 -1222950912 14192 2 2
global V ca002018 11100160 10009648 1029 803
mt V ca006018 16039936 9671264 4178 637
rsam V ca036018 827392 29232 1390 8
aio V ca072018 22650880 3036976 2537 869
...
458 V ca9c2018 8192 3512 7 1
aio_fpf V caa16018 81920 16240 2 2
Blkpool Summary:
name class addr size #blks
global V ca004168 0 0
BUFFERS and LRUs
The buffer cache, configured with the BUFFERS onconfig parameter, is one of the most important resources in Online. Currently each buffer is 2KB and holds one page of data. The reason for any buffer cache in a database is to reduce the number of I/Os to disk. Disk access takes orders of magnitude longer than memory access. By caching a page in memory the hope is that it can be reused, thus avoiding a disk I/O. This leads to the concept of cache hit rate. The onstat -p command gives the read and write cache hit rates. %cached is calculated as the number of buffer reads that were satisfied from the cache divided by the total number of buffer reads.
Profile
dskreads pagreads bufreads %cached dskwrits pagwrits bufwrits %cached
2676 2719 3061 12.58 102 102 44 0.00
isamtot open start read write rewrite delete commit rollbk
10 3 2 2 0 0 0 0 0
ovlock ovuserthread ovbuff usercpu syscpu numckpts flushes
0 0 0 8.62 14.15 1 2
bufwaits lokwaits lockreqs deadlks dltouts ckpwaits compress seqscans
0 0 9 0 0 0 0 0
ixda-RA idx-RA da-RA RA-pgsused lchwaits
1 0 0 1 44
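The %cached figures in the sample above can be reproduced from the neighbouring counters. This is a minimal sketch, assuming the hit rate is computed as (buffer operations - disk operations) / buffer operations and that a negative result is reported as 0.00; both assumptions are inferred from the sample output rather than taken from Informix documentation.

```python
def cache_hit_pct(buffer_ops, disk_ops):
    """Percentage of buffer operations satisfied without a disk I/O."""
    if buffer_ops <= 0:
        return 0.0
    pct = 100.0 * (buffer_ops - disk_ops) / buffer_ops
    # Assumption: a negative ratio is clamped to zero, as the write
    # column of the sample (bufwrits 44, dskwrits 102) suggests.
    return round(max(pct, 0.0), 2)


read_hit = cache_hit_pct(3061, 2676)   # bufreads, dskreads from the sample
write_hit = cache_hit_pct(44, 102)     # bufwrits, dskwrits from the sample
print(read_hit, write_hit)
```

A read hit rate as low as the one in this sample is what would be expected just after startup, before the cache has warmed up.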
onstat -p is one of the most important Informix OLTP statistics. The meanings of the most relevant fields are as follows:
• dskreads: Physical reads from disk.
• pagreads: Number of pages read.
• bufreads: The number of reads from the buffer pool. This should always be significantly greater than dskreads or the system is critically short of memory.
• %cached: The read cache hit ratio.
• dskwrits: Physical writes to disk.
• pagwrits: Number of pages written.
• bufwrits: The number of writes to the buffer pool. This should always be significantly greater than dskwrits, but not to the same extent as bufreads exceeds dskreads.
• %cached: The write cache hit ratio. Generally speaking this is less than the read cache hit ratio in insert-intensive environments but can be higher in update-intensive environments.
• isamtot: Total number of ISAM calls.
• commit: Calls to iscommit. Informix states there is no link between this statistic and the number of COMMIT WORK calls, but it is a fairly good indicator of the number of successful transactions executed.
• rollbk: Number of rollbacks. If this starts to increase sharply there may be a lot of errors or deadlocks occurring in the system.
• ovlock: The number of times that Online attempted to exceed the maximum number of locks (LOCKS in onconfig). If this is non-zero there should be errors in the message file as well.
• usercpu: The total user CPU time.
• syscpu: System CPU. If this is high it may indicate a Solaris problem.
• numckpts: Number of checkpoints. If this is high the physical or logical logs might be too small or the checkpoint interval might be too short.
• bufwaits: The number of times a thread waited for a buffer. This might indicate too few LRUs, a number of hot pages or a transaction holding a buffer too long. Always try to keep this number low.
• lokwaits: The number of times a thread waited on a lock. Again, strive to keep this value low.
• lockreqs: The number of locks requested. Run a single transaction in isolation and use this statistic to size the lock requirements of the system.
• deadlks: Incremented every time a candidate is chosen and terminated to resolve a deadlock.
• seqscans: Incremented for each sequential scan. In most OLTP environments sequential scans should be avoided.
• lchwaits: Incremented each time a thread had to wait for a shared memory resource. A high number indicates a problem.
In Online, buffers are arranged into groups called Least Recently Used (LRU) queues. The number of LRUs in Online is specified using the LRUS onconfig parameter. Each LRU is in fact 2 queues, a free queue and a modified queue, and is assigned approximately BUFFERS / LRUS of the buffers in the system. On initialization all buffers are placed on the free queues. User threads take a buffer for use from the free queue and data is loaded into it from disk. Other sessions can share this data page; individual rows are locked when they are modified and remain locked until the transaction commits. If a buffer is modified it is placed on the modified queue.
To see the status of the LRUs use onstat -R:
64 buffer LRU queue pairs
# f/m length % of pair total
0 f 3278 69.6% 4708
1 m 1430 30.4%
2 f 3223 69.2% 4658
...
126 f 3329 70.9% 4698
127 m 1369 29.1%
92742 dirty, 300000 queued, 300000 total, 524288 hash buckets, 2048 buffersize
start clean at 25% (of pair total) dirty, or 1172 buffs dirty, stop at 24%
Modified buffers are placed at the head of the queue (hence the name LRU) and are flushed to disk in one of 3 ways: during a checkpoint, by a page cleaner or with a foreground write.
A user thread becomes a page cleaner when it places a buffer on a modified LRU queue and calculates that the percentage of buffers on this queue is greater than the onconfig parameter LRU_MAX_DIRTY. The thread locks the queue for a short period, selects 16 buffers to flush to disk and unlocks the queue again. The cleaner will continue to flush groups of 16 buffers to disk until the percentage in the modified queue is less than the onconfig parameter LRU_MIN_DIRTY. Buffers being flushed are locked until cleaning is complete and then they are placed on the free queue.
The cleaned buffer is not zeroed and is placed at the head of the free queue in the hope that, if another thread hashes to it, it will not have been reused. Clean buffers are taken from the tail of the free queue. In high throughput environments the gap between LRU_MAX_DIRTY and LRU_MIN_DIRTY should be kept small or threads will spend too long cleaning and will be unavailable for user work.
The value of LRU_MAX_DIRTY also directly affects the duration of a checkpoint in OLTP environments. Assuming this percentage of buffers is dirty at the time of the checkpoint, Online must flush LRU_MAX_DIRTY * (BUFFERS / 100) pages. If checkpoint time is a concern, reducing LRU_MAX_DIRTY will help.
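As a worked example of this formula, the sketch below uses the figures from the onstat -R sample shown earlier (300000 buffers, 64 queue pairs, cleaning starts at 25%). The function names are illustrative, and rounding the per-queue threshold to the nearest buffer is an assumption that happens to match the "1172 buffs dirty" line in that output.

```python
PAGE_SIZE_KB = 2  # each buffer holds one 2KB page, as stated earlier


def checkpoint_flush_pages(buffers, lru_max_dirty):
    """Pages to flush at a checkpoint: LRU_MAX_DIRTY * (BUFFERS / 100)."""
    return lru_max_dirty * buffers // 100


def per_queue_clean_threshold(buffers, lrus, lru_max_dirty):
    """Dirty buffers on one modified queue that trigger page cleaning."""
    return round(buffers / lrus * lru_max_dirty / 100)


pages = checkpoint_flush_pages(300000, 25)
print(pages, "pages, about", pages * PAGE_SIZE_KB, "KB to write")
print(per_queue_clean_threshold(300000, 64, 25), "dirty buffers per queue")
```

Dropping LRU_MAX_DIRTY from 25 to 10 in this configuration would cut the worst-case checkpoint write from 75000 pages to 30000.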
The onconfig parameter CLEANERS specifies the maximum number of threads that can be cleaning at any one time. When a modified queue is being cleaned the small m in onstat -R will be replaced with a capital M:
5 M 1342 28.8%
CLEANERS also affects the number of threads that will be initiated to complete a checkpoint. Each cleaner thread is given a chunk to clean. When it completes its work the next uncleaned chunk is assigned to it. This can affect the duration of a checkpoint as there can be tail off if CLEANERS is set incorrectly.
The temptation is to set CLEANERS as close to the number of chunks in a database as possible (the maximum value of CLEANERS is 128) to reduce checkpoint time. This can adversely affect OLTP performance during regular page cleaning. What tends to happen is that all LRUs are consumed at a roughly even rate. Initially they all reach LRU_MAX_DIRTY at approximately the same time and the cleaners kick in. If CLEANERS is 128, suddenly this many threads are cleaning and 2048 buffers (128 x 16) are locked, and actual user work takes a severe hit. The system can show more idle time as user threads wait for the I/O to complete.
An alternative to increasing CLEANERS to reduce checkpoint time is to determine whether any of the chunks are taking longer to flush and thus increasing the overall duration. Use iostat to identify these chunks and use striping to reduce the overall write time. The duration of a checkpoint can be determined from the message log:
22:40:49 Checkpoint Completed: duration was 44 seconds.
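To track checkpoint durations over time, the message log can be scanned for lines of this form. The sketch below assumes every checkpoint message follows exactly the format of the sample line above; the function name is hypothetical.

```python
import re

# Matches lines such as:
#   22:40:49 Checkpoint Completed: duration was 44 seconds.
CHECKPOINT_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2})\s+Checkpoint Completed:\s+duration was (\d+) seconds"
)


def checkpoint_durations(log_lines):
    """Return a list of (timestamp, seconds) for each checkpoint message."""
    results = []
    for line in log_lines:
        match = CHECKPOINT_RE.match(line.strip())
        if match:
            results.append((match.group(1), int(match.group(2))))
    return results


sample = ["22:40:49 Checkpoint Completed: duration was 44 seconds."]
print(checkpoint_durations(sample))
```

Comparing the durations before and after a change to CLEANERS or LRU_MAX_DIRTY shows whether the change actually helped.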
There is no hard and fast rule for configuring LRU queues. A thread must take a lock on the queue when taking a free buffer or returning a modified one, and so the main advantage of having more LRUs is spreading the heat on these locks. The minimum required, therefore, is the number of active threads in the system (locks are not held across thread switches), which is limited to the number of CPUVPs. A number of 128 should be fine in most situations.
A foreground write occurs when a thread needs to load data from disk and all the free LRU queues are empty. The thread will initiate a single I/O to write a modified buffer to disk. It must wait until the I/O is complete. Foreground writes should be avoided at all costs as they severely impact performance. They occur if the cleaners cannot keep up with the rate of buffer modification. onstat -R will show all the modified queues containing 100% of the buffers. To avoid this situation reduce LRU_MAX_DIRTY and/or increase the number of CLEANERS.
Use onstat -F to determine if foreground writes are occurring:
Fg Writes LRU Writes Chunk Writes
0 0 2
address flusher state data
ca038458 0 I 0 = 0X0
ca038898 1 I 0 = 0X0
LOCKS
In most OLTP environments locking is very important. The default mode for locking in Online is page level but the user can change this to row level using the LOCK MODE clause in a create/alter table statement. Row level locking naturally requires a lot more locks and can add some overhead, but for hot tables it is generally desirable. The maximum number of locks is determined by the LOCKS onconfig parameter. Its value must be determined by experimentation. Lock structures do not take much memory so the user has some scope to increase them. If a transaction cannot obtain enough locks an error will be written to the message log and the transaction will be aborted.
If a lot of transactions are trying to lock the same row or page, performance can be severely impacted. The transactions will spin waiting for the lock and eventually some may time out. The onconfig parameter TXTIMEOUT determines the amount of time a transaction will wait before it times out waiting for a resource. The user might want to set this low if there are long transactions and congestion needs to be indicated quickly.
Use onstat -g spi to determine if locks are getting hot:
Spin locks with waits:
Num Waits Num Loops Avg Loop/Wait Name
297 2428 8.18 vproc vp_lock, id = 1
206 1645 7.99 vproc vp_lock, id = 3
153 159 1.04 lockfr0
68 73 1.07 lockfr1
189 688 3.64 lockfr2
7 7 1.00 lockfr10
36 49 1.36 lockfr11
24 24 1.00 lockfr12
16 149 9.31 fast mutex, lru-3
15 72 4.80 fast mutex, lru-5
17 77 4.53 fast mutex, lru-7
1 500 500.00 fast mutex, lockhash[37444]
1 4 4.00 fast mutex, lockhash[63173]
1 5 5.00 fast mutex, lockhash[63174]
82 593 7.23 fast mutex, bhash[228039]
2 2 1.00 fast mutex, bhash[299083]
88 604 6.86 fast mutex, bhash[494215]
There are a number of common hot locks to look for. vproc vp_lock is a lock held on the VP for scheduling; lockfrn are the locks that control the linked lists of lock structures themselves; lru-n are the locks on the LRUs; lockhash are individual user level locks; and bhash are locks on buffers. If the Num Waits field is high for a lock but Avg Loop/Wait is low then the lock is being taken regularly for a short period. If Num Waits is low but Avg Loop/Wait is high then the lock is being held by individual threads for a long period.
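The two patterns described here can be picked out of the onstat -g spi rows mechanically. This is a minimal sketch: the function names and the thresholds (100 waits, 50 loops per wait) are illustrative assumptions, not Informix guidance, and should be tuned to the system being observed.

```python
def parse_spin_line(line):
    """Split one data row into (waits, loops, avg_loops, name)."""
    waits, loops, avg, name = line.split(None, 3)
    return int(waits), int(loops), float(avg), name.strip()


def classify_spin_lock(waits, avg_loops, many_waits=100, long_spin=50.0):
    """Label a lock using assumed thresholds for 'high' waits and loops."""
    if waits >= many_waits and avg_loops < long_spin:
        return "taken often, held briefly"
    if waits < many_waits and avg_loops >= long_spin:
        return "taken rarely, held long"
    if waits >= many_waits:
        return "taken often, held long"
    return "quiet"


rows = [
    "297 2428 8.18 vproc vp_lock, id = 1",
    "1 500 500.00 fast mutex, lockhash[37444]",
]
for row in rows:
    waits, loops, avg, name = parse_spin_line(row)
    print(name, "->", classify_spin_lock(waits, avg))
```

With the sample rows, the vp_lock entry falls in the first category and the lockhash entry in the second.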
If the lru locks are hot, increase the number of LRUs. If a particular bhash is hot then the user has a hot page and the application may need tuning. Use onstat -k to determine who is holding locks (see tuning the application in Appendix B).
System Tuning
In this section of the paper we discuss system and Solaris issues as related to Informix OLTP applications. We do not intend to cover all system issues at great length, but will touch on some key things to keep an eye on.
Sample /etc/system File
The following parameters in /etc/system should bring up an Informix database that can have up to a 3.86GB shared memory segment. For a larger number of users, you may have to increase the semaphore parameters.
set shmsys:shminfo_shmmax=4026531839
set shmsys:shminfo_shmseg=64
set shmsys:shminfo_shmmni=64
set semsys:seminfo_semmns=4000
set semsys:seminfo_semmnu=4000
set semsys:seminfo_semmsl=1000
set semsys:seminfo_semmni=2000
set semsys:seminfo_semume=2000
*
* The next 2 parameters should be used only if database is on raw devices
set bufhwm = 100
*
* For telnet connections, set pt_cnt
set pt_cnt = 1005
*
* Set the next parameter on sun4d systems only, to prevent minor faults at
* large number of users. Value depends on memory configured
set max_nprocs=16000
Setting bufhwm to 100 tells the kernel to reserve 100KB to keep track of the filesystem buffer cache. This will free up more memory for the shared memory segment.
Disk
One of the most important aspects of database system tuning is tuning the disk I/O subsystem well. We touched on this in Chapter 2. Use extended statistics from iostat -xc or sar -d to gather disk I/O statistics. The disk utilization (%b column from iostat or %busy column from sar) and the service time (svc_t column from iostat or avserv from sar) are the key statistics to monitor. Ideally, a data disk doing random I/Os should be less than 50% busy (40 I/Os per second) and have a service time of less than 50ms. Service times will vary depending on the type of disks being used, so these numbers are by no means absolute. The log disks can sustain more I/O (up to 60% busy) without proving to be a bottleneck. Beyond this, it is better to stripe the logs.
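These guidelines are easy to apply to collected samples. The sketch below checks one disk sample against the limits quoted above; the function name and the "data"/"log" role labels are illustrative, and the limits should be adjusted for the disks actually in use.

```python
# Guideline limits quoted in the text; treat them as starting points only.
BUSY_LIMIT_PCT = {"data": 50.0, "log": 60.0}
DATA_SERVICE_LIMIT_MS = 50.0


def disk_warnings(role, busy_pct, service_ms):
    """Return a list of warnings for one disk sample.

    role is "data" or "log"; busy_pct is %b (iostat) or %busy (sar);
    service_ms is svc_t (iostat) or avserv (sar).
    """
    warnings = []
    if busy_pct > BUSY_LIMIT_PCT[role]:
        warnings.append("%s disk is %.0f%% busy" % (role, busy_pct))
    # The text gives a service-time guideline for data disks only.
    if role == "data" and service_ms > DATA_SERVICE_LIMIT_MS:
        warnings.append("service time is %.0fms" % service_ms)
    return warnings


print(disk_warnings("data", 72.0, 61.0))
print(disk_warnings("log", 55.0, 20.0))
```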
Memory
The memory subsystem plays a key role in OLTP performance. Informix requires a large shared memory segment for good performance. In addition, memory is required to run user processes. As a general guideline, the kernel requires about 30MB.
Use sar -pg or vmstat to gather memory statistics. A sample vmstat output is shown in Table 1.
The key vmstat parameters are explained in this paragraph and the sar parameters are shown in parentheses. pi (ppgin) is the number of Kbytes/sec paged in by filesystem reads, po (ppgout) is the number of Kbytes/sec paged out to the filesystem, and sr (pgscan) is the number of pages scanned by the page daemon. If sr is consistently non-zero, it indicates a shortage of memory. On raw databases, pi and po should be 0; otherwise they may indicate paging.
Table 1 Sample vmstat output
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr s2 sd sd sd in sy cs us sy id
0 2 0 20600 134128 0 1 0 0 0 0 0 0 0 0 0 434 450 690 6 2 92
0 0 0 4985632 5082864 0 5 0 0 0 0 0 0 0 0 0 109 21 404 0 0 100
If you find you’re short of memory, you can reduce the kernel’s memory requirements by tuning certain parameters, especially on systems with large memory. Many kernel resources are tied to the value of maxusers and max_nprocs in /etc/system. The default value of these parameters depends on the amount of physical memory. max_nprocs can be set to the maximum number of processes ever expected on the system. Use caution though: if the system hits this limit, it will not be able to fork any more processes. maxusers is not directly related to the number of processes and must be experimentally lowered.
The user should also ensure the memory is interleaved correctly. To determine how the memory is interleaved use prtdiag -v and look under the Memory section:
/usr/platform/sun4u/sbin/prtdiag -v
========================= Memory =========================
Intrlv. Intrlv.
Brd Bank MB Status Condition Speed Factor With
--- ----- ---- ------- ---------- ----- ------- -------
0 0 1024 Active OK 60ns 16-way A
0 1 1024 Active OK 60ns 16-way A
2 0 1024 Active OK 60ns 16-way A
2 1 1024 Active OK 60ns 16-way A
4 0 1024 Active OK 60ns 16-way A
4 1 1024 Active OK 60ns 16-way A
6 0 1024 Active OK 60ns 16-way A
6 1 1024 Active OK 60ns 16-way A
8 0 1024 Active OK 60ns 16-way A
9 0 1024 Active OK 60ns 16-way A
10 0 1024 Active OK 60ns 16-way A
11 0 1024 Active OK 60ns 16-way A
12 0 1024 Active OK 60ns 16-way A
13 0 1024 Active OK 60ns 16-way A
14 0 1024 Active OK 60ns 16-way A
15 0 1024 Active OK 60ns 16-way A
Here we see we have 16GB of memory in high density 1GB SIMMs and are getting 16-way interleaving. If the user is restricted in the amount of memory that can be ordered for a system, it may be better to get low density SIMMs in order to achieve a better interleave factor.
CPU
CPU utilization is highly dependent on the workload. In general, the goal of tuning should be to reduce the time spent by a process in kernel mode. This can be achieved by better caching of the data in a larger BUFFERS cache to reduce I/O, sufficient memory to ensure that processes don’t get paged out, etc. System utilities like vmstat, sar and mpstat show CPU utilization. As Informix uses kaio with a zero timeout there should be little or no wt time in the mpstat output. In a fully-loaded system, for Informix OLTP workloads, we’ve seen usr/sys time ratios of 75/12 with 2% idle time. This is just a rough guideline, but if you see system times of 50%, something’s probably wrong.
Modifying the default TimeShare (TS) class dispatch table can help significantly when running a large number of Informix users. See the whitepaper, Supporting Many Database Users, at http://hot.eng/dbe/whitepapers for details on how to modify the dispatch table.
Appendix A : Informix Scripts
File: move_log.sh
## Bring down to Single user mode
echo "onmode -sy"
onmode -c
onmode -sy
sleep 30
## Add logical and physical Log files
echo "Adding 3 logical log files into llog_ldbs01"
onparams -a -d llog_ldbs01 -s 499000
onparams -a -d llog_ldbs02 -s 499000
onparams -a -d llog_ldbs03 -s 499000
## Take A checkpoint
sleep 30
echo "checkpoint"
onmode -c
## Take A Null Level 0 Backup
echo "Backup again"
ontape -s -L 0
## Switch Current LogFile Pointer To The New One (assume 3 times)
onmode -l
onmode -l
onmode -l
## Take A checkpoint
echo "checkpoint"
onmode -c
## Take A Null Level 0 Backup
echo "Backup again"
ontape -s -L 0
## Now Drop The initial logical log files ( presume 3 )
echo "Dropping initial log files"
onparams -d -l 1 -y
onparams -d -l 2 -y
onparams -d -l 3 -y
## Take A checkpoint To Free up the Old Logical Logs
echo "checkpoint"
onmode -c
## Take A Null Level 0 Backup
echo "Backup again"
ontape -s -L 0
Appendix B: Application Tuning
When tuning an application in an OLTP environment start with the tables that will be accessed and the SQL that will be executed on these tables. Generally speaking table scans should be avoided in OLTP except on temporary tables or if the cardinality of the table is extremely small. Avoid scans on ALL tables that are being modified concurrently.
To avoid scans we need to build one or more indexes on the table and ensure that these indexes are used in the queries we will perform on the table.
Using sqexplain
Once the index is built, test that the optimizer is choosing it for the query. This is achieved by setting explain on in your SQL code and running the query either in the application or via dbaccess:
set explain on;
SELECT COUNT(*) FROM customer
WHERE c_w_id = 286 AND c_d_id = 1 AND c_last = "BARESEANTI";
A file sqexplain.out will be produced in the execution directory and a Query Execution Plan will be written to it:
QUERY:
------
SELECT COUNT(*) FROM customer
WHERE c_w_id = 286 AND c_d_id = 1 AND c_last = "BARESEANTI"
Estimated Cost: 1
Estimated # of Rows Returned: 1
1) informix.customer: INDEX PATH
(1) Index Keys: c_last c_w_id c_d_id c_first (Key-Only) (Serial, fragments: 0)
Lower Index Filter: (informix.customer.c_w_id = 286 AND (informix.customer.c_d_id = 1 AND informix.customer.c_last = 'BARESEANTI') )
Here we see that an index is being chosen for the query. In some situations, even after an index has been built on the table, the optimizer indicates that it is not chosen:
QUERY:
------
SELECT d_name, d_street_1, d_street_2, d_city, d_state, d_zip
FROM district
WHERE d_w_id = 286 AND d_id = 1
Estimated Cost: 828
Estimated # of Rows Returned: 100
1) informix.district: SEQUENTIAL SCAN
Filters: (informix.district.d_w_id = 286 AND informix.district.d_id = 1 )
The optimizer bases its query execution plan (qep) on the statistics it has available to it. Statistics are gathered using the “update statistics” SQL statement. The user has 3 options: LOW, MEDIUM and HIGH. For LOW the smallest amount of information is gathered and no distributions on columns are gathered. For HIGH the distribution information is exact. For large tables this can take a long time, requiring scans for all columns specified.
For MEDIUM the data for distributions is obtained by sampling. This requires one scan of the data but is a lot faster than HIGH. One strategy for statistics gathering is to specify HIGH for smaller tables and MEDIUM for the rest. Statistics gathered with LOW can take only seconds to collect whereas MEDIUM can take minutes or hours.
The user should obtain qeps for all the major groups of SQL statements to be executed and ensure that the indexes are correct.
Database Procedures
Most database interactions are in a client server situation, the database engine being the server and the application being the client. This client server communication can be remote over a network or local on the same machine. If there is a lot of data, such as multiple intermediate result rows, passing back and forth between client and server, the user may consider using database procedures.
The advantage of procedures is the removal of the need for intermediate results to be passed back to the client. The disadvantage is that some of the processing that would otherwise be performed on the client is moved to the server. The user might try both alternatives to see which performs best.
For more information on database procedures see the Informix Guide to SQL. To test a database procedure the user can call it from a dbaccess session. For a procedure declared:
CREATE PROCEDURE payment (
did SMALLINT, -- pmt->d_id
cid INT, -- pmt->c_id
clast CHAR(16), -- pmt->c_last
c_did SMALLINT, -- pmt->c_d_id
c_wid SMALLINT, -- pmt->c_w_id
hamount NUMERIC(12,2), -- pmt->h_amount / 100
wid SMALLINT, -- pmt->w_id
byname INT, -- pmt->byname
hdate DATETIME YEAR TO SECOND -- pmt->pay_date
)
call the procedure:
database tpcc;
execute procedure informix.payment(6,123,"OUGHTABLEABLE",4,55,23.30,100,0,'1996-02-14 16:58:21');
Note it is important to get the format correct for any DATETIME parameters. The procedure itself can be debugged using a trace file. Add the following lines:
SET DEBUG FILE TO '/tmp/payment.trc';
TRACE ON;
This will write extensive amounts of debug data to the file /tmp/payment.trc, including the long form of any SQL errors found. A database procedure can also call any Solaris command using the SYSTEM command:
SYSTEM( "sleep 100" );
A sleep is useful to halt a procedure to determine its state, locks held, stack size etc.
Application errors
All Informix errors have 2 parts, an SQL error and an ISAM error. Use the Informix utility finderr to display the full text of both errors:
finderr 100
-100 ISAM error: duplicate value for a record with unique key.
A row that was to be inserted or updated has a key value that already
exists in its index. For C-ISAM programs, a duplicate value was
presented in the last call to iswrite, isrewrite, isrewcurr, or
isaddindex. Review the program logic and the input data. For SQL
products, a duplicate key value was used in the last INSERT or UPDATE.
Deadlock and locking
There are situations where an error may not be catastrophic in an application. Error 100 above, for instance, may just need some further intervention. Two other errors
-154 ISAM error: Lock Timeout Expired
-143 ISAM error: deadlock detected.
occur when the session is chosen by Online as a candidate to free a deadlock situation. The user can simply re-submit the SQL statement in the hope that the deadlock has indeed been cleared. In high OLTP situations deadlock timeouts often occur but an excessive number can indicate a bigger problem.
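The re-submission described here can be wrapped in a small retry helper. The following is a sketch under stated assumptions: DatabaseError and its isam_code attribute are hypothetical stand-ins for whatever exception the client library in use raises, and the attempt count and delay are arbitrary.

```python
import time

# ISAM errors from the text that are worth re-submitting.
RETRYABLE_ISAM_ERRORS = {-143, -154}


class DatabaseError(Exception):
    """Hypothetical driver exception carrying the ISAM error code."""

    def __init__(self, isam_code):
        super().__init__("ISAM error %d" % isam_code)
        self.isam_code = isam_code


def run_with_retry(statement, max_attempts=3, delay_seconds=0.0):
    """Call statement(); re-submit it if a deadlock or lock timeout occurs."""
    for attempt in range(1, max_attempts + 1):
        try:
            return statement()
        except DatabaseError as err:
            retryable = err.isam_code in RETRYABLE_ISAM_ERRORS
            if not retryable or attempt == max_attempts:
                raise
            time.sleep(delay_seconds)
```

Capping the number of attempts matters: if the statement keeps failing, the error should surface so that the excess of timeouts is noticed rather than hidden.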
Even if timeouts are not occurring, deadlock situations are one of the main performance problems in OLTP environments. The timeout value is set with the onconfig parameter DEADLOCK_TIMEOUT. The default is 60 seconds; the user might want to reduce this to get a quicker indication of problems.
If a lot of timeouts are occurring check the following:
• The application is not doing prepare statements on the fly
• All sessions do not try to lock the same row or page
• A session is not performing a table lock on a frequently accessed table
• All the indexes are created correctly and are being chosen by the optimizer, thus avoiding table scans
• The statistics are up to date
To determine what locks an application requires use onstat -k. This dumps all locks in the system:
Locks
address wtlist owner lklist type tblsnum rowid key#/bsiz
a11f070 0 cb7b5fd8 0 S 100002 203 0
a11f0a4 0 cb7ba3d8 0 S 100002 203 0
a120388 0 ca8badd8 0 S 100002 203 0
a1203f0 0 ca8bcfd8 0 S 100002 203 0
a121394 0 cb7c9618 a758c24 HDR+S 4100002 43c02 0
a1234b0 0 cb7c9618 a935bc0 HDR+S 700002 5e1303 0
a123f40 0 cb7c9618 a8915e4 HDR+S 2d00002 533d17 0
a1bef9c 0 cb7c9618 a893f54 HDR+S 2d00002 533d1d 0
a1bf414 0 cb7c9618 a1234b0 HDR+IS 4100002 0 0
a1c31d4 0 cb7c9618 a43ab80 HDR+SR 3700002 533d1b K- 1
a25ddb8 0 cb7c9618 a4d8990 HDR+SR 3700002 533d16 K- 1
a26179c 0 cb7c9618 a43c06c HDR+SR 4d00002 48816 K- 1
a304378 0 cb7c9618 a1c31d4 HDR+S 2d00002 533d1b 0
a4394f4 0 cb7c9618 a304378 HDR+SR 3700002 533d1c K- 1
a43ab80 0 cb7c9618 a3a0858 HDR+S 2d00002 533d1a 0
a43c06c 0 cb7c9618 a1bf414 HDR+SR 4d00002 43c02 K- 1
....
a616fbc 0 cb7c9618 a6b4c2c HDR+SR 3700002 533d19 K- 1
aa6ccec 0 cb7c9618 a4394f4 HDR+S 2d00002 533d1c 0
....
239 active, 200000 total, 65536 hash buckets
The important fields here are the type of lock held (S is a shared lock and X is an exclusive lock), the tblsnum, which is the partition number of the table, and the rowid. The rowid indicates the following:
• a rowid of zero is a table lock
• a rowid that ends in 2 zeros is a page lock
• all other rowids are row level locks on tables or indexes
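The three rules above translate directly into code. This is a minimal sketch with a hypothetical function name; it assumes the rowid column of onstat -k is printed in hexadecimal, so "ends in 2 zeros" means the low-order byte is zero.

```python
def lock_granularity(rowid_hex):
    """Classify an onstat -k rowid value as a table, page or row lock."""
    rowid = int(rowid_hex, 16)
    if rowid == 0:
        return "table"
    if rowid & 0xFF == 0:  # last two hex digits are zero
        return "page"
    return "row"


for value in ("0", "203", "43c02", "5e1300"):
    print(value, lock_granularity(value))
```

Applied to the sample output, the lock with rowid 0 on tblsnum 4100002 is a table-level lock, while the entries such as 203 and 43c02 are row-level locks.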
The tblsnum indicates the internal partition number that the lock is taken on. Use the following fragment of SQL to determine your partitions:
select a.tabname as Table,
HEX(a.partnum) as TablePn,
HEX(b.partn) as FragPn,
b.fragtype as FragType
from systables a , OUTER sysfragments b
where a.tabid = b.tabid
and a.tabid >99 ORDER BY 1,2,3;
This produces output:
table tablepn fragpn fragtype
orders 0x00000000 0x04100002 T
orders 0x00000000 0x04200002 T
....
orders 0x00000000 0x04D00002 I
orders 0x00000000 0x04E00002 I
fragpn maps to tblsnum in onstat -k; fragtype is T for a table and I for an index. From the onstat -k output above we see our code has taken a number of shared locks on both the table and index of the orders table.
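Matching the two outputs by hand is tedious because onstat -k prints tblsnum without the 0x prefix or leading zeros. The sketch below joins the two once both have been collected; the function names and the tuple layouts are assumptions made for illustration.

```python
def normalize_partnum(text):
    """Turn '4100002' or '0x04100002' into the same integer."""
    return int(text, 16)


def describe_locks(lock_rows, fragments):
    """Attach a table name and fragment kind to each lock.

    lock_rows: iterable of (lock_type, tblsnum) taken from onstat -k.
    fragments: iterable of (table, fragpn, fragtype) from the SQL above.
    """
    lookup = {
        normalize_partnum(fragpn): (table, fragtype)
        for table, fragpn, fragtype in fragments
    }
    kinds = {"T": "table", "I": "index"}
    described = []
    for lock_type, tblsnum in lock_rows:
        table, fragtype = lookup.get(normalize_partnum(tblsnum), ("unknown", "?"))
        described.append((lock_type, table, kinds.get(fragtype, "unknown")))
    return described


fragments = [("orders", "0x04100002", "T"), ("orders", "0x04D00002", "I")]
locks = [("HDR+S", "4100002"), ("HDR+SR", "4d00002"), ("S", "100002")]
print(describe_locks(locks, fragments))
```

Partitions that do not appear in the query result, such as 100002 here, are reported as unknown rather than guessed at.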
To determine what SQL a session is executing, first determine the Informix internal session number with onstat -g ses:
session #RSAM total used
id user tty pid hostname threads memory memory
455 dbbench 0 1606 haxx3-1 1 147456 140448
onstat -u can then be used to dump the statistics for the session:
Userthreads
address flags sessid user tty wait tout locks nreads nwrites
ca038018 ---P--D 1 informix - 0 0 0 3 41
ca038458 ---P--F 0 informix - 0 0 0 0 9136
.....
ca8a9118 ---P--D 19 informix - 0 0 0 0 0
cb7c6b98 ---P--- 455 dbbench 0 0 0 26 0 0
132 active, 384 total, 345 maximum concurrent
In a deadlock situation the wait field would show a lock that the session was waiting on; onstat -k can then be used to determine which session is holding that lock. onstat -p can also be used to determine how many locks each type of transaction required.
Once the session that is causing the deadlock is determined use onstat -g sql <session-no> to determine the SQL being executed:
INFORMIX-OnLine Version 7.24.UC1 -- On-Line -- Up 22:38:43 -- 3268344Kbytes
Sess SQL Current Iso Lock SQL ISAM F.E.
Id Stmt type Database Lvl Mode ERR ERR Vers
454 EXEC PROCEDURE tpcc RR Not Wait 0 0 7.24
Current statement name : slctcur
Current SQL statement :
execute procedure informix.order_status(0,3,5,234,"")
Last parsed SQL statement :
execute procedure informix.order_status(0,3,5,234,"")
The hot locks in the system as a whole can be seen using onstat -g spi (see Informix Tuning section).
Using PDQ
Occasionally users will use Parallel Database Query (PDQ) in OLTP environments. They can perform scans on small or temporary tables, often joining them with traditional indexes. In these situations memory must be allocated to PDQ using the DS_TOTAL_MEMORY onconfig parameter. The user must then achieve a balance between BUFFERS requirements and DS requirements within the memory available.
PDQ spawns a lot more Informix threads to perform its parallel work than a straight SQL session. Use onstat -g ath to determine the number of active threads, especially scan threads:
Threads:
tid tcb rstcb prty status vp-class name
2 a9e3e018 0 2 sleeping(Forever) 21lio lio vp 0
3 a9e3e2c8 0 2 sleeping(Forever) 22pio pio vp 0
....
188547 c01f6fc0 b31c3318 2 sleeping(Forever) 11cpu join_2.1
188558 b5d6ad78 c0c59858 2 sleeping(secs: 3) 6cpu scan_3.0
189086 b9f96fc8 b6557918 2 sleeping(secs: 3) 14cpu group_1.0
There is a limit to the number of threads a CPUVP (and a physical processor) can sustain before the overhead of thread switching degrades performance. We have seen optimal performance with 8 to 10 scan threads on a 250MHz processor. Unfortunately the lower bound of the onconfig parameter DS_MAX_SCANS is 10, but reducing this parameter can often increase performance in PDQ situations.
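Counting the scan threads in a long onstat -g ath listing is easier with a small script. This sketch assumes PDQ worker threads are named kind_n.m, as in the sample output above (join_2.1, scan_3.0, group_1.0); the function name is hypothetical.

```python
import re
from collections import Counter

# Assumption: PDQ worker threads are named <kind>_<n>.<m>.
PDQ_THREAD_NAME = re.compile(r"^([a-z]+)_\d+\.\d+$")


def count_pdq_threads(ath_lines):
    """Count PDQ worker threads by kind from onstat -g ath rows."""
    counts = Counter()
    for line in ath_lines:
        fields = line.split()
        # Data rows start with a numeric thread id; skip headers and "....".
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        match = PDQ_THREAD_NAME.match(fields[-1])
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "tid tcb rstcb prty status vp-class name",
    "2 a9e3e018 0 2 sleeping(Forever) 21lio lio vp 0",
    "188547 c01f6fc0 b31c3318 2 sleeping(Forever) 11cpu join_2.1",
    "188558 b5d6ad78 c0c59858 2 sleeping(secs: 3) 6cpu scan_3.0",
    "189086 b9f96fc8 b6557918 2 sleeping(secs: 3) 14cpu group_1.0",
]
print(count_pdq_threads(sample)["scan"], "scan thread(s)")
```

Dividing the scan count by the number of CPUVPs gives a figure that can be compared with the 8 to 10 threads per processor mentioned above.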
optcompind
Optcompind arises from “OPTimizer COMPare the cost of using INDices”. The comment in the onconfig file is as follows:
# OPTCOMPIND
# 0 => Nested loop joins will be preferred (where
# possible) over sortmerge joins and hash joins.
# 1 => If the transaction isolation mode is not
# "repeatable read", optimizer behaves as in (2)
# below. Otherwise it behaves as in (0) above.
# 2 => Use costs regardless of the transaction isolation
# mode. Nested loop joins are not necessarily
# preferred. Optimizer bases its decision purely
# on costs.
OPTCOMPIND 0 # To hint the optimizer
In OLTP environments we always set this variable to 0.