
Tuning Informix for OLTP Workloads

Sun Microsystems Computer Corporation, 2550 Garcia Avenue, Mountain View, CA 94043, U.S.A.

Technical White Paper

Denis Sheahan, Database Engineering, SMCC


1997 Sun Microsystems, Inc. 2550 Garcia Avenue, Mountain View, California 94043-1100 U.S.A.

All rights reserved. This product and related documentation are protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or related documentation may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any.

Portions of this product may be derived from the UNIX® and Berkeley 4.3 BSD systems, licensed from UNIX System Laboratories, Inc. and the University of California, respectively. Third-party font software in this product is protected by copyright and licensed from Sun’s Font Suppliers.

RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the United States Government is subject to the restrictions set forth in DFARS 252.227-7013 (c)(1)(ii) and FAR 52.227-19.

The product described in this manual may be protected by one or more U.S. patents, foreign patents, or pending applications.

TRADEMARKS
Sun, Sun Microsystems, Sun Microsystems Computer Corporation, the Sun logo, the Sun Microsystems Computer Corporation logo, are trademarks or registered trademarks of Sun Microsystems, Inc. UNIX and OPEN LOOK are registered trademarks of UNIX System Laboratories, Inc., a wholly owned subsidiary of Novell, Inc. All other product names mentioned herein are the trademarks of their respective owners.

All SPARC trademarks, including the SCD Compliant Logo, are trademarks or registered trademarks of SPARC International, Inc. SPARCstation, SPARCserver, SPARCengine, SPARCworks, and SPARCompiler are licensed exclusively to Sun Microsystems, Inc. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.

The OPEN LOOK® and Sun™ Graphical User Interfaces were developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun’s licensees who implement OPEN LOOK GUIs and otherwise comply with Sun’s written license agreements.

X Window System is a trademark and product of the Massachusetts Institute of Technology.

TPC-C Benchmark™ is a trademark of the Transaction Processing Performance Council.

Informix ODS 7 is a registered trademark of Informix Inc.

VERITAS, VxVM, VxVA, VxFS, and the VERITAS logo are registered trademarks of VERITAS Software Corporation.





Contents

1. Informix Overview
    Introduction
    ODS Overview
    Virtual Processors: Dynamic Scalable Architecture
    An Informix Instance
    Private and Shared Memory
2. Database Layout
    Informix Layout basics
    Data Layout
    Rootdbs
    Logical Logs
    Online Physical Log
    Tables
    Spindle Count:
    Fragmentation
    Table Size
    Volume Management
    Availability
    Striping
    Interleave Factor
    Raw vs UFS
    Tuning Existing Layouts
    Indexes
    Building Indexes
    Loading data
3. Online Tuning
    I/O Tuning
    Logging
    Physical Logging
    Logical Logging
    Connecting to the Database
    Configuring CPUVPS
    Configuring Memory
    BUFFERS and LRUs
    LOCKS
4. System Tuning
    Sample /etc/system File
    Disk
    Memory
    CPU
5. Appendix A: Informix Scripts
    File: move_log.sh
6. Appendix B: Application Tuning
    Using sqexplain
    Database Procedures
    Application errors
    Deadlock and locking
    Using PDQ
    optcompind


Informix Overview

Introduction

Currently Informix offers three versions of its engine. Version 7, also known as Online Dynamic Server (ODS), is its main OLTP engine. Its current revision is 7.2.x, with a planned release of 7.3 in early 1998. Version 8, also known as Extended Parallel Server (XPS), is its main decision support engine. Its current revision is 8.11, with a planned release of 8.2 in December 1997. Version 9, also known as Informix Universal Server (IUS), is a merge of 7.2 and the Universal Server from Illustra. IUS is Informix’s object-relational offering, providing extended types and DataBlades.

This paper explains how to tune ODS for OLTP workloads on Sun servers running Solaris 2.5.1 and above. Most of the tuning tips are also applicable to IUS in OLTP situations. There are three major sections dealing with data layout, generic Informix tuning issues and system tuning. In Appendix B we also provide a tutorial on Informix application tuning.

ODS Overview

Virtual Processors: Dynamic Scalable Architecture

The concept of virtual processors (VPs) underlies the entire structure of Informix ODS, and is called the Dynamic Scalable Architecture (DSA). ODS runs one Informix thread (rather than one process) for every user session connected to the database. Context for threads is maintained in shared memory, so the same thread can be serviced by different VPs if necessary, although by preference it remains in a single VP to reduce cache-line transfers at the hardware level.

Each VP can run many user threads plus internal threads to perform database I/O, logging I/O, page cleaning, administrative tasks and other work. Certain Informix utilities are served by their own special threads. VPs are divided into several classes depending on the type of work they do. All VPs in the same class share the same code and access the same data and processing queues in memory.


Figure 1 Informix ODS 7 Clients, Virtual Processors and Physical CPUs

The relationship of virtual processors, physical CPUs and clients is illustrated in Figure 1.

An Informix Instance

Every running Informix database is associated with an Informix instance. When a database is brought up, shared memory is allocated and the virtual processor processes are started. The combination of the shared memory and VPs is called an instance.


[Figure 1 diagram labels: Client, Virtual Processor, Shared Memory]


Multiple databases can be created within one instance. Whenever a CREATE DATABASE command is executed, system catalogs are created to map the tables, indexes, views and other relational objects of a logically independent database. Informix applications and utilities always initiate a session by specifying which database within an instance they want to connect to.

Private and Shared Memory

Informix virtual processors each have some private memory plus access to global shared memory. The private memory holds the VP’s program text and private data, including private pointers into shared memory. Locks and latches (mutexes) are used to manage concurrent access to shared memory by all VPs.

Shared memory is divided into three major portions: the resident portion, the virtual portion and the message portion. The resident portion contains the buffer pool and several internal tables used only to track other objects in shared memory. The virtual portion contains large I/O buffers, session data, thread data, the dictionary cache, the stored procedures cache, the sorting pool and a global pool for structures shared by many components of OnLine, especially messages from client applications. The message portion is only used to exchange messages with client programs executing on the same machine as the database and communicating via shared memory interprocess messages.

For more details of the architecture of ODS refer to the Administrator’s Guide, Volumes 1 and 2.



Database Layout

This chapter describes one of the most important aspects of tuning a database application: laying out a database on disk. A well thought out and tuned layout can do wonders for performance. On the other hand, all the performance tricks you can try on a runtime system won’t do any good if the database is poorly laid out to start with.

The primary tool for obtaining Informix statistics is onstat. Onstat interrogates the engine and dumps out statistics that the latter has gathered. Online statistics are zeroed when the engine is brought up or with the onstat -z command. When tuning the system, wait until the environment is in a steady state, call onstat -z, let the system run for a short period (say 5 - 10 minutes) and then dump the required stats. This avoids inaccuracies in statistics such as I/O per second. onstat -a will dump all stats.

Informix Layout basics

The basic unit of data in Online is a page. In 7.x this is currently 2k in size; in 8.x it is 4k. Each page in 7.x can hold up to 255 rows of data. Multiple contiguous pages make up a chunk. Chunks are created using the onspaces utility and can be 2GB maximum; 8.x has a 4GB maximum. To be used in create statements chunks must be included in a dbspace. Dbspaces are made up of one or more chunks.

To create a dbspace, specify the first chunk it contains:

onspaces -c -d oli_dbs1 -p /links/DEV/olinei_41 -o 0 -s 500000

-c indicates create

-d is the dbspace name

-p is the physical device

-o is the offset

-s is the size in kilobytes

To add more chunks to the dbspace use the -a option:

onspaces -a oli_dbs1 -p /links/DEV/olinei_42 -o 0 -s 500000

12 Database Layout—December 1997

As stated earlier the limit for any chunk is 2GB. Multiple chunks can be taken from the same device by using the offset parameter:

onspaces -a oli_dbs1 -p /links/DEV/olinei_42 -o 0 -s 500000

onspaces -a oli_dbs1 -p /links/DEV/olinei_42 -o 500001 -s 500000

We always recommend that the user use soft links when declaring devices in onspaces. Then if the controller number changes on the system the link can be moved.

The 2GB limit can be a major drawback with Informix, especially on larger 9GB disks, which require a minimum of 5 chunks to utilize the whole disk. The user can quickly run out of the 8 partitions provided by the format utility. Using Veritas can alleviate this problem as the user can declare as many plexes as required.
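The chunk arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names and the device path are hypothetical, and we assume the chunk limit is exactly 2GB (2097152 kilobytes) and that chunks are packed back to back, whereas the earlier example leaves a small gap between offsets.

```python
# Illustrative sketch only: split one raw device into chunks no larger than
# the 2GB limit and print the matching onspaces commands.
# MAX_CHUNK_KB is an assumption (exactly 2GB); check the limit for your release.
MAX_CHUNK_KB = 2 * 1024 * 1024


def plan_chunks(device_kb, max_chunk_kb=MAX_CHUNK_KB):
    """Return a list of (offset_kb, size_kb) pairs covering the whole device."""
    chunks = []
    offset = 0
    while offset < device_kb:
        size = min(max_chunk_kb, device_kb - offset)
        chunks.append((offset, size))
        offset += size
    return chunks


def onspaces_commands(dbspace, path, device_kb):
    """Build 'onspaces -a' command lines using the flags shown in the text."""
    return [
        f"onspaces -a {dbspace} -p {path} -o {offset} -s {size}"
        for offset, size in plan_chunks(device_kb)
    ]


if __name__ == "__main__":
    nine_gb_in_kb = 9 * 1024 * 1024
    # /links/DEV/disk9g is a hypothetical soft link name.
    for command in onspaces_commands("oli_dbs1", "/links/DEV/disk9g", nine_gb_in_kb):
        print(command)
```

For a 9GB device this yields five chunks, four of 2GB and one of 1GB, which matches the minimum of 5 chunks quoted above.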

Once the dbspaces are created onstat -d can be used to display their sizes and remaining free space. The output is in 2 sections, first dbspaces then chunks.


address  number flags fchunk nchunks flags owner    name
7d778150 1      1     1      1       N     informix rootdbs
7e085f50 2      1     2      1       N     informix plogdbs
7e093f50 3      1     3      1       N     informix llogdbs1
7e175978 4      1     4      1       N     informix llogdbs2
7e175a38 5      1     5      1       N     informix llogdbs3
7e175af8 6      1     6      1       N     informix wdi_dbs
7e175bb8 7      1     7      4       N     informix cd_dbs1

7e17cd98 114    1     268    1       N     informix si_dbs12
114 active, 2047 maximum


address  chk/dbs offset size   free   bpages flags pathname
7d778210 1   1   500    950000 948181        PO-   /links/DEV/root_chunk
7d7ab588 2   2   500    950000 49947         PO-   /links/DEV/plog_chunk
7d7ab668 3   3   500    950000 49947         PO-   /links/DEV/llog_chunk1
7d7ab748 4   4   500    950000 49947         PO-   /links/DEV/llog_chunk2
7d7ab828 5   5   500    950000 49947         PO-   /links/DEV/llog_chunk3
7d7ab908 6   6   0      500000 490782        PO-   /links/DEV/wdi_1
7d7ab9e8 7   7   0      250200 47            PO-   /links/DEV/custd_1
7d7abac8 8   7   0      250200 97            PO-   /links/DEV/custd_2
7d7abba8 9   7   0      250200 97            PO-   /links/DEV/custd_3
7d7abc88 10  7   0      250200 97            PO-   /links/DEV/custd_4

7e175898 268 114 0      250000 170497        PO-   /links/DEV/stocki_12
268 active, 2047 maximum

Notice how cd_dbs1, which is fchunk number 7, indicates 4 in its nchunks column. There are then 4 chunks, numbers 7 - 10, which have 7 as their dbs number.

The size and free fields, in the chunk section, are specified in 2k pages. When the chunk is initialized these two columns will be the same. As table space is allocated from the chunk the amount free is reduced. Online reserves a number of pages from the chunk for what is termed its bitmap pages. These pages indicate the free space in a chunk. In addition the first chunk in a dbspace has pages allocated for what is termed the tablespace. The result is that the first chunk in a dbspace will have less usable space.
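As a small worked example of reading the chunk section, the sketch below converts the size and free columns from 2k pages to kilobytes and reports how full a chunk is. The function name is our own, and the page counts are taken from the custd_1 and root_chunk rows of the sample output above.

```python
# Illustrative sketch: interpret the size/free columns of the onstat -d
# chunk section, which are expressed in 2k pages on 7.x.
PAGE_KB = 2  # page size in kilobytes for 7.x, as stated in the text


def chunk_usage(size_pages, free_pages, page_kb=PAGE_KB):
    """Return size and free space in KB plus the percentage of pages in use."""
    used_pages = size_pages - free_pages
    return {
        "size_kb": size_pages * page_kb,
        "free_kb": free_pages * page_kb,
        "pct_used": 100.0 * used_pages / size_pages,
    }


if __name__ == "__main__":
    # Values copied from the sample output: custd_1 and root_chunk.
    for name, size, free in [("custd_1", 250200, 47), ("root_chunk", 950000, 948181)]:
        usage = chunk_usage(size, free)
        print(f"{name}: {usage['size_kb']} KB total, {usage['pct_used']:.2f}% used")
```

In the sample output custd_1 is essentially full while root_chunk is almost empty, which is what we would expect once the logs have been moved out of the root dbspace.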

Data Layout

We recommend using raw devices for Online data. Raw devices are accessed using kaio, which is the most efficient path. UFS also requires an extra copy through the kernel address space, consuming CPU cycles. Its aging algorithm is suboptimal for database applications and the caching of inodes is extra overhead.

Because Veritas is an extra layer in the data access path there is a small penalty with its use. In situations such as striping below, however, its advantages outweigh this small performance degradation.

Online requires space for the following:

• System catalogs. These are held in a special dbspace called the root dbspace, which is created on Online initialization.

• Logical logs. These contain records of transactions in a database that is logged.

• Physical log. This contains before-images of modified pages, which are used in rolling transactions back and forward.

• Tables. User tables can be placed by default in the root dbspace or in user-created dbspaces.

• Indexes. Index data can be held with the table data or in their own separate dbspaces.

All data can be placed in the root dbspace but naturally this can lead to poor performance.


Rootdbs

The user must first select a partition or volume to use for Online’s root dbspace. A soft link should then be made to this partition and specified in the onconfig file, for example:

ROOTNAME rootdbs # Root dbspace name

ROOTPATH /dev/online_root # Path for device containing root dbspace

ROOTSIZE 20000 # Size of root dbspace (Kbytes)

The root dbspace is what is initialized when oninit -iy is performed. The size must be sufficient to hold the initial physical log and all of the logical logs plus overhead for the system catalogs. The logs are always created in the rootdbs initially and can be moved later. If the logs are moved the rootdbs will never be a hot disk. It can be placed on a disk with other data if the user is short of spindles. Usually all catalog data is cached and the rootdbs is rarely accessed after the first couple of minutes of operation.
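The sizing rule above (initial physical log, plus all logical logs, plus catalog overhead) can be written as a one-line check. This is a sketch under stated assumptions: the physical log size and the catalog overhead used below are placeholder figures chosen for illustration, not recommendations, while LOGFILES, LOGSIZE and ROOTSIZE are the sample onconfig values shown in this chapter.

```python
# Illustrative sketch: lower bound on ROOTSIZE (all values in kilobytes).
def min_rootsize_kb(physical_log_kb, logfiles, logsize_kb, catalog_overhead_kb):
    """Initial physical log + all logical logs + system catalog overhead."""
    return physical_log_kb + logfiles * logsize_kb + catalog_overhead_kb


if __name__ == "__main__":
    rootsize_kb = 20000          # ROOTSIZE from the sample onconfig
    needed = min_rootsize_kb(
        physical_log_kb=1000,    # assumed initial physical log size
        logfiles=3,              # LOGFILES from the sample onconfig
        logsize_kb=500,          # LOGSIZE from the sample onconfig
        catalog_overhead_kb=5000,  # assumed allowance for catalogs
    )
    print(f"need at least {needed} KB, configured {rootsize_kb} KB")
```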

Logical Logs

Most commercial installations require some level of logging to ensure the integrity of the data in a database. The logical log in Online contains logical log records. These records are required for a number of functions:

• Fast recovery: If Online shuts down in an uncontrolled manner it uses the log records to recover all transactions that occurred since the last checkpoint.

• Transaction rollback: If during normal transaction processing a transaction must be rolled back (error, rollback command, etc.) Online uses the log records to reverse the changes made on behalf of the transaction.

• Data restoration: During a data restore the user combines the backup tapes of the logical log files with the most recent Online dbspace backup tapes.

Online provides 3 logging modes: Buffered, Unbuffered and Ansi. Unbuffered and Ansi are the 2 safest modes. With these modes the logical log records are written to disk whenever any transaction executes a commit. This guarantees that all committed transactions survive a system failure. In Buffered mode the logical log buffer is written to disk only when it is full. On system failure a number of committed transactions may not have been written to disk.

Logging is specified on a per database basis using the ontape utility:

• ontape -A tpcc -s : Set Ansi logging on database tpcc

• ontape -B tpcc -s : Set Buffered logging on database tpcc

• ontape -U tpcc -s : Set Unbuffered logging on database tpcc

• ontape -N tpcc -s : Turn off logging on database tpcc

The number and initial size of the logical logs are specified in the onconfig file with the following parameters:

LOGFILES 3 # Number of logical log files

LOGSIZE 500 # Logical log size (Kbytes)

The user can see the state of the logical logs with the command onstat -l

Logical Logging
Buffer bufused  bufsize  numrecs  numpages numwrits recs/pages pages/io
  L-1  0        16       227      9        3        25.2       3.0

Subsystem    numrecs  Log Space used
OLDRSAM      227      12664

address  number flags   uniqid begin  size used %used
ab08990  1      U---C-L 7      300035 250  125  50.0
ab089ac  2      U-B---- 5      400035 250  250  100.00
ab089c8  3      U-B---- 6      500035 250  250  100.00

The size is specified in 2k pages. If the %used values are all 100% the system will halt, waiting for the logs to be backed up, unless the TAPEDEV onconfig parameter is set to /dev/null.

After initialization the logs can be moved with the move_log.sh script given in Appendix A. If the system moves across 2 logical log boundaries it automatically triggers a checkpoint.

The flag fields of interest are C indicating the current logical log, U indicating used, B indicating backed up and L indicating the log contains the most recent checkpoint record. The uniqid field increases every time a new logical log is started. The current log will have the highest number. Each time a log is completed a message is written to the message log:

<<INFORMIX-OnLine Server>>> Logical Log 21 Complete.

10:53:48 Logical Log 21 Complete.

Online Physical Log

The physical log contains before-images of database pages. Before the engine modifies a page it is written to the physical log. These images are used in fast recovery to bring the database to a consistent state. After a failure Online uses the before-images in the physical log to restore all pages on disk to their state at the last checkpoint. This is the first phase of fast recovery. Next the before-images are combined with the logical log records stored since the checkpoint, restoring the data to consistency up to the last transaction commit. This is the second phase of fast recovery.

Because all modified pages are flushed to disk at a checkpoint, the physical log is emptied at that time. During bringup Online starts recovery from the last checkpoint record in the logical log.

As a safety feature a checkpoint is initiated when the physical log becomes 75% full. It is important to make the physical log big enough to avoid 2 scenarios:

• Continual checkpointing.

• Physical log overflow. This occurs when the log is filling up too fast for a checkpoint to get in and empty it. This will cause Online to shut down.

onstat -l also gives info on the physical log:

Physical Logging
Buffer bufused  bufsize  numpages numwrits pages/io
  P-1  10       16       78633    4909     16.02

phybegin physize  phypos   phyused  %used
200035   900000   875077   195578   21.73
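To relate this output to the 75% checkpoint trigger described earlier, the following sketch recomputes %used from physize and phyused and shows how many pages remain before a checkpoint would be forced. The function name is ours, and the 75% threshold is simply the figure quoted in the text.

```python
# Illustrative sketch: physical log fill level versus the 75% checkpoint trigger.
def physical_log_status(physize_pages, phyused_pages, trigger_pct=75.0):
    """Return (%used, pages remaining before the trigger percentage is reached)."""
    pct_used = 100.0 * phyused_pages / physize_pages
    trigger_pages = physize_pages * trigger_pct / 100.0
    headroom = max(0.0, trigger_pages - phyused_pages)
    return pct_used, headroom


if __name__ == "__main__":
    # physize and phyused from the sample onstat -l output above.
    pct, headroom = physical_log_status(900000, 195578)
    print(f"{pct:.2f}% used, {headroom:.0f} pages before the 75% trigger")
```

If the headroom is regularly close to zero between checkpoints, the log is a candidate for the continual checkpointing scenario described above.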


Tables

Usually the majority of pages in a database are dedicated to user data. In Online an extent is the minimum amount of contiguous space that can be allocated for a table. Every permanent table has 2 extent sizes associated with it. The initial extent size is the number of kilobytes allocated to the table when first created. The next extent size is the amount in kilobytes allocated when the initial and all other extents are full. The size of these extents is specified in the create table statement:



i_im_id INTEGER,

i_name CHAR(24),

i_price DECIMAL(5,2),

i_data CHAR(50)


IN wdi_dbs




wdi_dbs is the dbspace the table will be created in. After creation you can see from the onstat -d output that 12000/2 = 6000 pages will be taken from the free column of wdi_dbs. As the table fills, the free column will reduce in 8000/2 = 4000 page extents.

It is good practice to have the minimum number of extents possible for a table. Multiple extents lead to fragmentation of the disk. Initialize one large chunk for the table and then allocate an extent greater than the chunk size. Because extents cannot cross a chunk boundary, Online will take the maximum it can allocate for the table, i.e. the entire chunk.

Online also has an internal mechanism to reduce the number of extents. After every 16 next extents it doubles the next extent size. In our example the 17th, 33rd and 49th extents will be 8000, 16000 and 32000 pages respectively.
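The doubling rule can be sketched as follows. The extent numbering is an assumption inferred from the example above (extents 2 to 16 use the base next size and the size doubles at extents 17, 33 and 49); the function name is ours.

```python
# Illustrative sketch of next-extent doubling, numbered so that it reproduces
# the 17th/33rd/49th figures quoted in the text (an assumption about counting).
def next_extent_pages(extent_number, next_pages=4000, double_every=16):
    """Size in pages of the given extent (2 or higher) after automatic doubling."""
    if extent_number < 2:
        raise ValueError("extent 1 is the initial extent, not a next extent")
    doublings = (extent_number - 1) // double_every
    return next_pages * 2 ** doublings


if __name__ == "__main__":
    for n in (2, 16, 17, 33, 49):
        print(n, next_extent_pages(n))
```

Here 4000 pages is the 8000 kilobyte next size from the example expressed in 2k pages.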

Spindle Count:

Once the database size is determined, it would seem that calculating the number of disks required should be a simple task. Simply divide the total database size by the size of each disk. Unfortunately, for OLTP applications, this strategy could yield very poor performance. The number of spindles required is usually much larger than what the above calculation would dictate. However, this does not mean that we will be wasting disk space. The additional space is used for growing tables, filesystems etc.

To determine the number of spindles, it is crucial to understand the workload. Disk access is also closely tied to the size of the buffer cache. If your workload is such that rows updated by some transactions are re-used by others, it is advantageous to increase the size of the buffer cache, cutting down disk access. This scheme is best explained by means of an example.

In TPC-C, the stock table is the largest. For a 900 warehouse database, the total space requirement for this table is over 36 GB. If we did a dumb distribution, we would need 18 drives of 2.1GB each. But let’s look at the table accesses a little more by understanding the workload. The Stock table is randomly accessed 20 times from the Neworder transaction (10 reads/10 updates) and 200 times from the Stocklevel transaction. However, the Stocklevel transaction reads the data that is accessed by the Neworder transaction and should hopefully be in the buffer cache. Therefore, the number of stock table accesses is 20 per Neworder transaction. For the 900 warehouse database, we hope to achieve 10,500 Neworder transactions/minute or 175/sec. This is a total of 175 * 20 or 3500 I/Os per second on the stock table. For good performance, each disk should be restricted to a response time of less than 20ms, which we take to limit it to about 40 I/Os per second, so we will need 3500 / 40, or 88, disks. Notice that this is a far cry from the 18 drives we computed by looking at the space usage alone. Thus, for OLTP applications, it is extremely important to consider database access patterns before deciding on the number of spindles required.
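The two ways of counting spindles in this example can be compared with a short calculation. The function names are ours; the inputs are the figures quoted above, including the 40 I/Os per second per disk used as the divisor.

```python
import math


# Illustrative sketch: spindle count by capacity versus by I/O rate.
def disks_for_capacity(table_gb, disk_gb):
    """Number of disks needed if only space is considered."""
    return math.ceil(table_gb / disk_gb)


def disks_for_io(transactions_per_minute, ios_per_transaction, ios_per_disk_per_sec):
    """Number of disks needed to keep each disk at or below its I/O budget."""
    ios_per_sec = transactions_per_minute / 60 * ios_per_transaction
    return math.ceil(ios_per_sec / ios_per_disk_per_sec)


if __name__ == "__main__":
    by_space = disks_for_capacity(36, 2.1)
    by_io = disks_for_io(10500, 20, 40)
    print(f"by capacity: {by_space} disks, by I/O rate: {by_io} disks")
```

With these inputs the capacity rule gives 18 disks and the I/O rule gives 88, the same figures as in the text.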


Fragmentation

For small and lightly used tables a flat layout inside a single dbspace is sufficient. If these tables are greater than 2GB just add extra chunks to the dbspace. For hot tables however this scheme will lead to poor performance. These tables should be fragmented. Fragmentation lets the user place table rows in different dbspaces based on some distribution scheme. The advantage of fragmentation is that the optimizer can possibly eliminate whole sections of data when executing a query. In an OLTP environment this improves concurrency as it reduces contention on the underlying devices.

There are 2 types of distribution scheme, round-robin and expression-based. In round-robin Online uses an internal scheme to distribute the rows. It is generally a poor performer as no fragments can be eliminated.

An expression-based distribution scheme requires the user to define a rule and include it in the create table or create index statement. For more details see Informix ODS Administrator’s Guide Vol. 1.

We usually use range fragmentation for our hot table definitions, e.g.

CREATE TABLE customer (

c_w_id SMALLINT,

c_d_id SMALLINT,



c_ytd_payment DECIMAL(12,2),

c_data CHAR(500)



c_w_id <= 100 IN cd_dbs1,

c_w_id <= 200 AND c_w_id > 100 IN cd_dbs2,

c_w_id <= 300 AND c_w_id > 200 IN cd_dbs3,

c_w_id <= 400 AND c_w_id > 300 IN cd_dbs4,

c_w_id <= 500 AND c_w_id > 400 IN cd_dbs5,

c_w_id <= 600 AND c_w_id > 500 IN cd_dbs6,


c_w_id <= 700 AND c_w_id > 600 IN cd_dbs7,

c_w_id <= 800 AND c_w_id > 700 IN cd_dbs8,

c_w_id <= 900 AND c_w_id > 800 IN cd_dbs9,



NEXT SIZE 500200

As rows are inserted into the table the value of c_w_id is checked and this determines in which dbspace to place the data.
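The routing performed by the range expressions above can be mimicked outside the database to check which dbspace a given warehouse id lands in. This sketch only models the nine expressions visible in the excerpt; the function name is ours, and returning None for values above 900 reflects that no expression shown here matches them.

```python
# Illustrative sketch: map c_w_id to a dbspace following the range
# expressions shown (c_w_id <= 100 in cd_dbs1, 101-200 in cd_dbs2, ...).
def dbspace_for_warehouse(c_w_id, width=100, fragments=9, prefix="cd_dbs"):
    """Return the dbspace name for c_w_id, or None if no shown expression matches."""
    if c_w_id > width * fragments:
        return None
    # The first expression is open-ended at the bottom (c_w_id <= 100).
    index = max(1, (c_w_id + width - 1) // width)
    return f"{prefix}{index}"


if __name__ == "__main__":
    for w in (1, 100, 101, 450, 900, 901):
        print(w, dbspace_for_warehouse(w))
```

A query restricted to a single c_w_id therefore touches only one of the nine dbspaces, which is the fragment elimination benefit described earlier.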

Table Size

In order to size an Online database correctly we can make a rough calculation of the row length of both data and indexes and the number of rows in a 2k page. If the table has already been created, the row size can be found with the SQL fragment:

select rowsize from systables where tabname = "table-name";

Each “normal” page has 28 bytes at the start taken up by the page header. At the end is a 4 byte timestamp. Comparison of this timestamp with one in the header determines if a page has been modified. Growing back from the timestamp are the slot table entries, 4 bytes for each row of data. A slot table entry contains the offset in the page and the length of the row. So the actual space a row takes up is rowsize + 4. From this we can determine the number of rows of a table a page can accommodate. Due to a 1 byte identifier in the slot table entry the maximum number of rows is 255.
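The page arithmetic just described can be sketched as below. The function names are ours; the constants are the 2k page, 28 byte header, 4 byte timestamp, 4 byte slot entry and 255 row limit from the text. The 306 byte row and 4,500,000 rows in the usage example are the stock table figures from the oncheck report later in this chapter.

```python
import math

# Illustrative sketch: rows per 2k page and data pages needed for a table.
PAGE_BYTES = 2048
HEADER_BYTES = 28
TIMESTAMP_BYTES = 4
SLOT_BYTES = 4
MAX_ROWS_PER_PAGE = 255


def rows_per_page(rowsize):
    """Whole rows that fit on one page, capped at the 255 slot limit."""
    usable = PAGE_BYTES - HEADER_BYTES - TIMESTAMP_BYTES
    return min(MAX_ROWS_PER_PAGE, usable // (rowsize + SLOT_BYTES))


def data_pages_needed(nrows, rowsize):
    """Data pages needed for nrows, assuming every page is filled completely."""
    per_page = rows_per_page(rowsize)
    if per_page == 0:
        raise ValueError("row does not fit on a single page")
    return math.ceil(nrows / per_page)


if __name__ == "__main__":
    print(rows_per_page(306), data_pages_needed(4500000, 306))
```

For the stock table this gives 6 rows per page and 750000 data pages, in line with the data page count in the oncheck report.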

It is not recommended to make a row longer than a page. There is a lot of overhead in following the chain of pages for each row accessed and performance is degraded. Blobpages are also very poor performers.

If the table is already loaded oncheck can be used to determine the exact number of pages allocated to it:

oncheck -pT tpcc:tablename


This can take some time for large tables but can give some valuable information. It is useful to dump the oncheck data when the database is initially loaded and on a regular basis to see actual growth needs etc.

For each dbspace a report is generated e.g.

TBLspace Report for tpcc:informix.stock

Table fragment in DBspace s_ddbs01

Physical Address 200005

Creation date 03/31/97 19:16:07

TBLspace Flags 802 Row Locking

TBLspace use 4 bit bit-maps

Maximum row size 306

Number of special columns 0

Number of keys 0

Number of extents 4

Current serial value 1

First extent size 190000

Next extent size 190000

Number of pages allocated 760000

Number of pages used 750187

Number of data pages 750000

Number of rows 4500000

Partition partnum 2097154

Partition lockid 2097154


Logical Page Physical Page Size

0 200035 190000

190000 300003 190000

380000 400003 190000

570000 500003 190000

TBLspace Usage Report for tpcc:informix.stock

Type                  Pages      Empty  Semi-Full       Full  Very-Full
---------------- ---------- ---------- ---------- ---------- ----------
Free                   9813
Bit-Map                 187
Index                     0
Data (Home)          750000

Total Pages          760000

Unused Space Summary

Unused data slots 0

Unused bytes per data page 160

Total unused bytes in data pages 120000000

Home Data Page Version Summary

Version Count

0 (current) 750000
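The figures in this report are internally consistent, and checking them is a quick way to learn how the fields relate. The sketch below is only an illustration: the dictionary keys and function names are our own, and the values are copied from the stock table report above.

```python
# Illustrative sketch: cross-check the fields of the oncheck -pT report above.
report = {
    "extent_sizes": [190000, 190000, 190000, 190000],
    "pages_allocated": 760000,
    "pages_used": 750187,
    "data_pages": 750000,
    "bitmap_pages": 187,
    "free_pages": 9813,
    "rows": 4500000,
    "unused_bytes_per_data_page": 160,
    "total_unused_bytes": 120000000,
}


def check_report(r):
    """Return a dict of named consistency checks, each True or False."""
    return {
        "extents_sum_to_allocated": sum(r["extent_sizes"]) == r["pages_allocated"],
        "used_is_data_plus_bitmap": r["data_pages"] + r["bitmap_pages"] == r["pages_used"],
        "free_is_allocated_minus_used": r["pages_allocated"] - r["pages_used"] == r["free_pages"],
        "unused_bytes_total": r["data_pages"] * r["unused_bytes_per_data_page"]
        == r["total_unused_bytes"],
    }


def rows_per_data_page(r):
    """Average number of rows stored per data page."""
    return r["rows"] / r["data_pages"]


if __name__ == "__main__":
    for name, ok in check_report(report).items():
        print(name, ok)
    print("rows per data page:", rows_per_data_page(report))
```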

Volume ManagementAfter the number of disks required is determined, you must now consider how to managethem. In a large database, several hundred disks may be used and managing them can be amajor task. Use of a Volume Manager like Solstice DiskSuite or Veritas Volume Manager canease this task considerably.

Volume Management 23


There are trade-offs to be made between cost, availability and performance. How muchdowntime you can tolerate will help decide this. If you want your database to be impervious toa single disk failure, then you should consider RAID 1 or RAID 5 implementations. For theTPC-C workload, a fully mirrored database using RAID 1 shows a 10% performancedegradation compared to a non-RAID database. Using RAID 5, further increases thisdegradation to 35%. The RSM2000 is a possible solution for RAID 5. A more completedescription of the performance implications of RAID on OLTP workloads can be found in thewhitepaper “Performance Evaluation of RAID With OLTP Workloads” athttp://hot.eng/whitepapers. Note that if your workload performs many writes, thedegradation may be more severe. Performance degradation of up to 50% are not uncommon.


Disk striping using RAID 0 is often used to configure disks for good performance. Stripinghelps spread the load across disks, eliminating any hot-spots in the database. The user canuse Informix striping via fragmentation or veritas striping. In some situations, primarily whenthere is skew in the access pattern of a table, Informix fragmentation is not the best solution.

As an example, let’s take the stock table of the TPC-C database. Using Informix striping we initially lay out the data on 72 disks. Each dbspace is made up of 6 chunks from 6 different disks:

onspaces -c -d sd_dbs1 -p /device_links/stockd_1 -o 0 -s 463300

onspaces -a sd_dbs1 -p /device_links/stockd_2 -o 0 -s 463300

onspaces -a sd_dbs1 -p /device_links/stockd_3 -o 0 -s 463300

onspaces -a sd_dbs1 -p /device_links/stockd_4 -o 0 -s 463300

onspaces -a sd_dbs1 -p /device_links/stockd_5 -o 0 -s 463300

onspaces -a sd_dbs1 -p /device_links/stockd_6 -o 0 -s 463300

We run our benchmark and, using iostat/statit, we discover that I/Os to the 6 disks are very uneven:

Disk I/O Statistics (per second)

Disk util% xfer/s rds/s wrts/s rdb/xfr wrb/xfr wtqlen svqlen srv-ms

c3t3d1 42.6 59.9 28.8 31.1 2048 2048 0.00 0.62 10.4


c3t3d2 51.7 81.0 34.9 46.1 2048 2048 0.00 0.85 10.4

c3t3d3 10.9 13.7 7.4 6.3 2048 2048 0.00 0.12 8.7

c3t3d4 14.1 18.0 9.5 8.5 2048 2048 0.00 0.16 8.8

c3t4d0 26.7 36.0 18.5 17.5 2048 2048 0.00 0.34 9.3

c3t4d1 38.7 55.3 27.4 27.9 2048 2048 0.00 0.55 9.9

The first 2 disks are hot, the second 2 cold and the last 2 medium. To even out the I/O we create six 6-way Veritas stripes:

/etc/vx/bin/vxdisksetup -i c3t3d1

vxdg -g rootdg adddisk stkvol1=c3t3d1


/etc/vx/bin/vxdisksetup -i c3t4d1

vxdg -g rootdg adddisk stkvol6=c3t4d1

vxassist -g rootdg make stk1 1000m layout=stripe columns=6 stripeunit=16k stkvol1 stkvol2 stkvol3 stkvol4 stkvol5 stkvol6


vxassist -g rootdg make stk5 1000m layout=stripe columns=6 stripeunit=16k stkvol1 stkvol2 stkvol3 stkvol4 stkvol5 stkvol6

and take the 6 chunks from these new volumes. The I/O then becomes uniform:

c3t3d1 35.6 43.0 20.5 22.5 2048 2048 0.00 0.50 11.6

c3t3d2 36.2 43.4 20.7 22.7 2048 2048 0.00 0.50 11.6

c3t3d3 36.0 42.7 20.6 22.1 2048 2048 0.00 0.51 11.9

c3t3d4 36.6 43.6 21.0 22.6 2048 2048 0.00 0.51 11.6

c3t4d0 35.9 42.5 20.5 22.0 2048 2048 0.00 0.50 11.7

c3t4d1 35.6 43.6 20.6 22.9 2048 2048 0.00 0.50 11.4
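One way to put a number on the unevenness is to compare the busiest disk with the average disk. The Python sketch below does this with the xfer/s column from the two listings above; the skew metric itself is an illustrative choice, not something iostat or statit reports.

```python
from statistics import mean

# xfer/s per disk, copied from the two listings above
before = [59.9, 81.0, 13.7, 18.0, 36.0, 55.3]
after = [43.0, 43.4, 42.7, 43.6, 42.5, 43.6]

def skew(xfers):
    """Ratio of the busiest disk to the average disk; 1.0 is perfectly even."""
    return max(xfers) / mean(xfers)

if __name__ == "__main__":
    print("Informix striping: %.2f" % skew(before))
    print("Veritas striping:  %.2f" % skew(after))
```

A value close to 1.0 after restriping indicates that no single spindle is carrying a disproportionate share of the load.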


Interleave Factor

Interlace or interleave factor is the amount of contiguous space allocated on one spindle before moving to the next. For example, if we specify an interlace factor of 16K, the volume manager will assign the first 16K bytes from the first disk, the next 16K from the next disk in the stripe and so on. For OLTP workloads, the interlace factor should be small. 16K to 32K interleave factors should work well for table and index data.
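The round-robin assignment described above can be sketched in Python as follows. This is a simplified model that assumes the volume starts at offset 0 on every disk; a real volume manager may reserve private regions or apply other offsets, and the function name is hypothetical.

```python
def stripe_location(offset, stripe_unit, columns):
    """Map a byte offset in a striped volume to (disk index, offset on disk)."""
    unit_index = offset // stripe_unit   # which stripe unit overall
    disk = unit_index % columns          # units rotate across the disks
    row = unit_index // columns          # complete stripes before this unit
    return disk, row * stripe_unit + offset % stripe_unit

if __name__ == "__main__":
    unit = 16 * 1024  # 16K interleave on a 6-way stripe
    for off in (0, unit, 5 * unit, 6 * unit + 100):
        print(off, stripe_location(off, unit, 6))
```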

Raw vs UFS

A large number of customers use UFS for their database files for convenience. From a performance perspective, raw will outperform UFS for OLTP workloads. We have measured a two-fold increase in performance for raw vs UFS.

Tuning Existing Layouts

Often we don’t have the luxury of laying out a database from scratch. In such cases, we need to monitor disk performance and tune as best we can. The first step is to collect disk I/O statistics during normal operation of the workload. System utilities such as sar and iostat can be used. If the volume manager being used is Veritas, then the utility vxstat will provide disk statistics at a volume level, eliminating the need to map physical disks to logical tablespaces. Online also provides statistics such as the number of reads and writes per chunk. If the statistics show more than 40 I/Os per second or service times greater than 50 ms, the system may warrant tuning. Identify which tablespace is on the problem disks. If the tablespace contains multiple tables, more analysis needs to be done to determine which table is the culprit.
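The two thresholds above are easy to apply mechanically once the statistics have been collected. The following Python sketch assumes the samples have already been extracted into (name, I/Os per second, service time in ms) tuples; the parsing of sar, iostat or vxstat output is not shown, and the function name is hypothetical.

```python
IOS_LIMIT = 40.0         # I/Os per second, threshold quoted above
SERVICE_MS_LIMIT = 50.0  # service time in milliseconds, threshold quoted above

def hot_disks(samples):
    """Return the names of disks that exceed either threshold.

    samples: iterable of (name, ios_per_sec, service_ms) tuples.
    """
    return [name for name, ios, svc in samples
            if ios > IOS_LIMIT or svc > SERVICE_MS_LIMIT]

if __name__ == "__main__":
    # xfer/s and srv-ms values from the uneven listing shown earlier
    samples = [("c3t3d1", 59.9, 10.4), ("c3t3d2", 81.0, 10.4),
               ("c3t3d3", 13.7, 8.7), ("c3t3d4", 18.0, 8.8),
               ("c3t4d0", 36.0, 9.3), ("c3t4d1", 55.3, 9.9)]
    print(hot_disks(samples))
```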

It may be possible to reduce disk activity by caching more data in memory. If you have sufficient memory, try increasing the number of BUFFERS in the onconfig file. If this doesn’t reduce the I/O bottleneck, then it may be necessary to redistribute the data onto more disks. If the Informix chunks were all added using logical names (i.e. symbolic links to the actual physical devices, or Veritas volume names), this redistribution is rather straightforward: shut down Online, re-create the volumes over a larger number of disks and copy the data over from the old volumes to the new ones. If disk space is a constraint or actual physical device names were used as datafile names, it may be necessary to export the table contents, then drop and re-create the tablespace before loading the data back in.



Indexes

Indexes can greatly improve the performance of OLTP environments. Informix uses a B+ tree structure for indexes. All indexes start with a single root node. There may be a number of intermediate branch levels, and the index ends in leaf nodes. The index nodes contain rows called index items. Each index item consists of a key value (which may be a composite) and a rowid. The rowid represents either a row in another index page or, in the case of leaf nodes, the actual data row. The fields from the original table that make up the index are called the keys. Keys are chosen so that the optimizer will choose the index in a particular query. If the keys are chosen so that all the results of a query can be returned without going to the data row, the index is said to cover the query.

With regard to performance there are a number of issues with indexes. The first is the depth of the tree. When accessing a data row using the index, each level in the tree requires a buffer read. This can affect the cache hit rate of the buffer pool and in the worst case require extra I/Os. In the case of an update, a lock is held on all rows involved in the index search. Locks on indexes can become quite hot if a lot of updates are occurring.

One potential solution is to fragment the index. Indexes on fragmented tables that don’t specify their own strategy are fragmented the same way as the table. The engine reserves space in the table’s dbspace for the index using an internal algorithm. These are known as attached indexes. Alternatively the user can declare a separate index fragmentation scheme; for example, the table could be fragmented 4 ways



o_c_id INTEGER,

o_d_id SMALLINT,

o_w_id SMALLINT,

o_carrier_id SMALLINT,

o_ol_cnt SMALLINT,

o_all_local SMALLINT,




o_w_id <= 250 IN od_dbs1,

o_w_id <= 500 AND o_w_id > 250 IN od_dbs2,


o_w_id <= 700 AND o_w_id > 500 IN od_dbs3,



NEXT SIZE 267700


and fragment the index 2 ways

CREATE UNIQUE INDEX oi1 ON orders(o_id, o_w_id, o_d_id)


o_w_id <= 500 IN oi1_dbs1,

REMAINDER IN oi1_dbs2;

Fragmentation has the effect of reducing the depth of the btree. The optimizer can determine from the fragmentation strategy which fragment of the index to traverse. If the trees in the fragments are shallower than a single large index, then fewer I/Os are required to reach the row. Index fragmentation does have a downside, however: it requires searching long linked lists to get to the required fragment. This can sometimes adversely affect performance. Even if an index is shallow, the transaction mix may make the disks hot. Fragmentation can help in this situation.
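To see why fragmenting can make each tree shallower, the following Python sketch estimates the number of levels from the row count and the average number of keys per page. It is a simplified model that assumes uniform page fill at every level; the row counts and keys-per-page values in the example are hypothetical, and real depths should be read from oncheck.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return (a + b - 1) // b

def btree_depth(rows, leaf_keys, branch_keys):
    """Estimate index levels given average keys per leaf and branch page."""
    if branch_keys < 2:
        raise ValueError("branch pages must hold at least 2 keys")
    pages = max(1, ceil_div(rows, leaf_keys))  # pages at the leaf level
    depth = 1
    while pages > 1:
        pages = ceil_div(pages, branch_keys)   # pages one level up
        depth += 1
    return depth

if __name__ == "__main__":
    print("single index:", btree_depth(1000000, 100, 50))
    print("one of 4 fragments:", btree_depth(250000, 100, 50))
```

With these hypothetical figures, splitting the index four ways removes one level, and therefore one buffer read per lookup.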

To determine the depth of a tree use the Online facility oncheck described earlier.

If there is an attached index on the table there will be an “Index” line with the number of pages allocated. For each attached index there will be output as follows:

Index Usage Report for index customer_index on tpcc:informix.customer

                  Average    Average
Level    Total   No. Keys  Free Bytes
----- -------- ---------- ----------
    1        1         66        972
    2       66         62       1017
    3     4135         62       1027
    4   256411        116         31
----- -------- ---------- ----------
Total   260613        116         47

Here we see the customer_index has a depth of 4. Notice how the branch pages have far more free space than the leaf pages. The number of rows in an index page is controlled by the FILLFACTOR parameter. FILLFACTOR can be set for the whole system in the onconfig file or overridden in the create/alter index statement. The default FILLFACTOR is 90%. FILLFACTOR only takes effect when the index is being built, not when it is being updated.

Set FILLFACTOR higher for static indexes or indexes that are updated but rarely change size. This will increase index performance by yielding a better cache hit rate and reduced I/O. Set FILLFACTOR lower than the default for indexes that are going to have a lot of inserts. This will give the index room to grow and reduce the amount of page splitting required. For heavily updated tables it might make sense to periodically drop the index and rebuild it with the required FILLFACTOR to help performance.

Another issue with indexes is how many to have on a table. There is no limit in Informix, but every index must be maintained. If a table is heavily updated or altered, each index must be modified for each operation.

We also recommend not choosing a varchar for the key in an index. Traversing an index requires a number of key comparisons and varchars generally perform poorly in this situation.

Building Indexes

Online can build indexes serially or in parallel. Serial is the default operation and for small tables is often sufficient. Each chunk of the table is read in sequence and the data sorted. On machines with a higher number of CPUs, parallel builds are more efficient. Parallel builds read multiple disks into memory, sort the data and write the index out in parallel. To do this the user must enable Parallel Data Query (PDQ) and set the PSORT_NPROCS environment variable. PSORT_NPROCS restricts the number of sort threads that will be started in the engine (on a CPUVP basis). PDQ is enabled by setting MAX_PDQPRIORITY in the onconfig file and setting the PDQPRIORITY environment variable. An example script would be:






dbaccess tpcc index_second_customer

PDQ requires memory for operation and the amount it can allocate is set by the DS_TOTAL_MEMORY variable in the onconfig file. Each index build will get DS_TOTAL_MEMORY / DS_MAX_QUERIES of memory to work with. So for the most efficient parallel build, set DS_TOTAL_MEMORY to the most you can allocate and DS_MAX_QUERIES to 1 so the build gets all the available memory.
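The division described above is simple enough to check by hand, but a small Python sketch makes the effect of DS_MAX_QUERIES explicit. The memory figure used in the example is hypothetical, and the function only restates the formula given in the text rather than modelling how the engine actually grants memory.

```python
def memory_per_build_kb(ds_total_memory_kb, ds_max_queries):
    """Memory available to one index build, per the formula above."""
    if ds_max_queries < 1:
        raise ValueError("DS_MAX_QUERIES must be at least 1")
    return ds_total_memory_kb // ds_max_queries

if __name__ == "__main__":
    total_kb = 400000  # hypothetical DS_TOTAL_MEMORY setting, in kbytes
    for queries in (1, 4, 8):
        print(queries, memory_per_build_kb(total_kb, queries))
```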

It is also a good idea to set SHMVIRTSIZE high to avoid multiple shared segments being allocated during the index build (see shared memory section in Informix Tuning).

During a parallel index build, onstat -g ath (show all threads) should show psortproc threads being scheduled.

311 cf51d20 eb61490 2 cond wait(packet_cond) 5cpu xchg_2.82

312 cf75de8 eb61834 2 cond wait(packet_cond) 3cpu xchg_2.83

313 cf50c58 eb61bd8 2 cond wait(packet_cond) 3cpu xchg_2.84

314 cf50d80 eb61f7c 2 ready 5cpu xchg_2.85

625 e9b7d58 30b267b8 2 ready 1cpu psortproc

626 e9c0680 30b26b5c 2 ready 3cpu psortproc

627 e9c0988 30b26f00 2 ready 5cpu psortproc

628 e9c0c90 30b272a4 2 ready 4cpu psortproc

629 e9c0f98 30b27648 2 ready 4cpu psortproc

Parallel build also uses big buffers so onstat -g iob should have output

INFORMIX-OnLine Version 8.20.UA2 -- On-Line -- Up 00:33:58 -- 3608456Kbytes

AIO big buffer usage summary:

class reads writes

pages ops pgs/op holes hl-ops hls/op pages ops pgs/op


fif 0 0 0.00 0 0 0.00 0 0 0.00

kio 1593640 53394 29.85 26 10 2.60 11470 2685 4.27

Finally, onstat -g iof (I/O to each chunk) should show parallel reads from more than one of the table chunks.

When DS memory is exhausted the index build will overflow to the temp dbspace. Temp is specified by the DBSPACETEMP variable in onconfig. If none is specified, /tmp is used and the sort data is written to a cooked file. This is suboptimal as kaio cannot be employed and the cooked file codepath is not optimized. Always create a number of temp spaces and set the DBSPACETEMP variable.


# Default temp dbspaces

The spaces will be written to in a round-robin fashion. If TEMP becomes hot the user might consider striping using Veritas. A small interleave factor (16k or lower) should be chosen for the volumes.

Loading data

Informix provides a parallel loader for Online, which is beyond the scope of this document. For more information see the Guide to the High Performance Loader.

A simpler solution is to load into multiple fragments in parallel. Any table that is fragmented can be loaded in this fashion, which avoids contention on the bitmap page of the table and the thrashing inherent in two loaders going after the same table.


Online Tuning

I/O Tuning

On Solaris, Online’s default method of I/O is kaio for all reads and writes. Each CPUVP has a special kaio thread that performs this task. When a normal thread yields in the engine the kaio thread is always scheduled next. The kaio thread uses aio_read and aio_write to submit outstanding requests to the OS and then uses aio_wait with a zero timeout to check if there are any completions. It passes on any completed I/O and then yields to the scheduler, which chooses the next thread to run.

onstat -g iov displays the activity of each kaio thread

AIO I/O vps:

class/vp s io/s totalops dskread dskwrite dskcopy wakeups io/wup

kio 0 i 0.0 43 36 7 0 89 0.5

kio 1 i 0.0 658 640 18 0 1489 0.4

The kaio thread employs an algorithm to try to coalesce I/O requests for adjacent pages. It then submits this larger I/O in what is called a big buffer, which is more efficient. Big buffers are also used in index building. To see big buffer usage use onstat -g iob

AIO big buffer usage summary:

class reads writes

pages ops pgs/op holes hl-ops hls/op pages ops pgs/op

fif 0 0 0.00 0 0 0.00 0 0 0.00

kio 2803 2742 1.02 0 0 0.00 102 77 1.32

Normal threads submit reads and writes via the I/O queues. The kaio thread empties its queue each time it is scheduled. To see the status of the queues use onstat -g ioq.

32 Online Tuning—December 1997

AIO I/O queues:

q name/id len maxlen totalops dskread dskwrite dskcopy

kio 1 0 16 119642 58799 60843 0

kio 2 0 16 132167 59397 72770 0

kio 3 0 16 124531 59469 65062 0

kio 4 0 16 110924 59482 51442 0

kio 5 0 16 126967 59182 67785 0

kio 6 0 16 122345 58707 63638 0

kio 7 0 16 130828 61490 69338 0

The len field gives the current number of outstanding I/Os submitted and maxlen is a high-water indicator. The max is usually reached during buffer cleaning, when a cleaner thread chooses a number of buffers from an LRU (default 16) and writes them to disk. Look at dskread and dskwrite to ensure all kaio threads are doing roughly the same amount of I/O.

To find out what chunks are receiving the I/O requests use onstat -g iof

AIO global files:

gfd pathname totalops dskread dskwrite io/s

3 root_chunk 93 45 48 0.0

4 plog_chunk 2986 0 2986 0.0

5 llog_chunk1 0 0 0 0.0

6 llog_chunk2 18411 5841 12570 0.1

7 llog_chunk3 0 0 0 0.0

8 /amir/DEV/wdi_1 172 133 39 0.0

9 custd_1 8578 5867 2711 0.0

10 custd_2 8789 6094 2695 0.1

11 custd_3 9106 6411 2695 0.1


Informix generally recommends no more than 40 I/Os a second to a disk, but with today’s faster disks 56 - 60 can be OK. If the chunk is on a multi-disk stripe, however, the chunk can take 40 I/Os times the number of disks. For instance, with a 6-way stripe the chunk can take up to 240 I/Os a second. On RSM or SSA arrays, turning on the fast write cache can improve write performance.
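The per-chunk budget follows directly from the per-disk guideline. The Python sketch below restates that arithmetic; the function names are hypothetical, and the per-disk figure is a parameter so that a higher value can be used for faster disks.

```python
PER_DISK_IOS = 40  # I/Os per second per disk, the guideline quoted above

def chunk_io_budget(stripe_columns, per_disk_ios=PER_DISK_IOS):
    """I/Os per second a chunk can absorb when striped over several disks."""
    return stripe_columns * per_disk_ios

def over_budget(io_per_sec, stripe_columns, per_disk_ios=PER_DISK_IOS):
    """True if the observed chunk I/O rate exceeds the guideline."""
    return io_per_sec > chunk_io_budget(stripe_columns, per_disk_ios)

if __name__ == "__main__":
    print(chunk_io_budget(6))   # budget for a 6-way stripe
    print(over_budget(250, 6))  # observed rate above the budget
```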



Physical Logging

As mentioned earlier, it is essential to make the physical log big enough to avoid continual checkpointing. The size limit is 2GB in 7.x and there can be only one physical log in an Online instance. If a 2GB log is still filling too fast the user will have to live with the checkpoints and try to reduce their duration instead.

The onconfig parameter PHYSBUFF (in kbytes) determines the I/O block size for the physical log. Writes to the physical log are double-buffered: one buffer is being filled in memory while the other is being written to disk. In update-intensive OLTP environments the physical log may become hot. In such situations, first increase PHYSBUFF to 128k to see if it helps performance. 128k is by default the largest block size that Solaris will not break up when writing to disk.

The user can also stripe the physical log if possible using a volume manager such as Veritas. Make sure the interleave factor is less than PHYSBUFF or the entire buffer will be written to just 1 disk in the stripe.
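The relationship between the interleave factor and the buffer size can be sketched in Python as follows. The sketch assumes each buffer write starts on a stripe-unit boundary, which will not always hold in practice, and the function name is hypothetical.

```python
def disks_touched(buffer_kb, interleave_kb, columns):
    """Number of stripe columns one aligned write of buffer_kb spans."""
    units = (buffer_kb + interleave_kb - 1) // interleave_kb  # ceiling
    return min(columns, units)

if __name__ == "__main__":
    # 128k PHYSBUFF on a 6-way stripe with different interleave factors
    for interleave in (16, 32, 128):
        print(interleave, disks_touched(128, interleave, 6))
```

With an interleave equal to the buffer size the whole write lands on one disk, which is the case the text warns against.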

Logical Logging

As with the physical log, the user should ensure that the logical logs are big enough to avoid constant checkpoints. A checkpoint is initiated when 2 logical logs are crossed. Having 16 logs of 500k each, therefore, is a bad scheme. Make the logical logs large (2GB is possible). If the environment is being archived to tape the user should provide enough log space for, say, an 8-hour work day. When all logs are full the system will halt waiting for backup to complete.


Again, the logical logs can be striped, but they are not usually hot enough to warrant this as the actual I/Os tend to be sequential. They are usually mirrored, however. The user has 2 options here: volume manager mirroring or Informix mirroring. Informix mirroring is achieved using the onspaces command either when the log is created

onspaces -c -d llog_dbs -p /dbs/llog_dbs0 -o 4 -s 500000 -m /dbs/lm_dbs0 4

or afterwards with the -m option

onspaces -m llog_ddbs01 -p /dev/ifmx/s132d15vol -o 5120

We have seen no difference in performance between Informix and Veritas log mirroring.

The onconfig parameter LOGBUFF sets the I/O size of the logical log. In BUFFERED logging this amount of data is written out on each I/O. In UNBUFFERED and ANSI logging this is the size of the log buffer. Multiple transactions write log records to the log buffer. When one transaction commits the log buffer must be flushed. The amount of buffer actually written is determined by the amount of “piggybacking” achieved before a commit. The length of the transactions and their mix determine this piggybacking. The pages/io column in the onstat -l output indicates the amount of piggybacking achieved

Logical Logging

Buffer bufused bufsize numrecs numpages numwrits recs/pages pages/io

L-1 0 16 2618502 97451 13371 26.9 7.3

Subsystem numrecs Log Space used

OLDRSAM 2618502 183907884

In this case each logical log write averaged 7.3 pages or 14.6k, roughly half the 32k log buffer. In high-volume, short-transaction environments the piggybacking will decrease as transactions are continually committing. Use the pages/io stat to determine the size of the log buffer.
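The derived columns in the onstat -l output can be recomputed from the raw counters, which is a useful cross-check when sizing LOGBUFF. The Python sketch below uses the counters from the listing above and assumes a 2k page, consistent with the 14.6k figure quoted; the function name is hypothetical.

```python
def log_write_stats(numrecs, numpages, numwrits, bufsize_pages, page_kb=2):
    """Return (recs/page, pages/io, kbytes/io, fraction of buffer used)."""
    recs_per_page = numrecs / numpages
    pages_per_io = numpages / numwrits
    kb_per_io = pages_per_io * page_kb
    buffer_fill = pages_per_io / bufsize_pages
    return recs_per_page, pages_per_io, kb_per_io, buffer_fill

if __name__ == "__main__":
    # numrecs, numpages, numwrits and bufsize from the onstat -l listing
    rpp, ppio, kb, fill = log_write_stats(2618502, 97451, 13371, 16)
    print("recs/page %.1f pages/io %.1f kbytes/io %.1f fill %.0f%%"
          % (rpp, ppio, kb, fill * 100))
```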

Connecting to the Database

When a database has been initialized the next step is getting users connected to the engine. In order to connect, either via an application or a tool such as dbaccess, a user must set the environment variable INFORMIXSERVER to indicate the instance of the server they wish to attach to (multiple instances can be present either on the same machine or on the network). This name must be present in the user’s sqlhosts file




cat $INFORMIXDIR/etc/sqlhosts

xtpcc ontlitcp campi-1 7600

The default directory for sqlhosts is $INFORMIXDIR/etc but this can be changed by setting the INFORMIXSQLHOSTS environment variable. Communication can be over the network using tli/tcp or locally using a shared memory protocol called ipcshm. The hostname campi-1 above must also exist in the client’s /etc/hosts file.

The engine must then provide the service to the user. The INFORMIXSERVER name must be present in the onconfig file as either DBSERVERNAME or an alias in the DBSERVERALIASES list. Each DBSERVERNAME or DBSERVERALIASES entry must be present in the server’s sqlhosts file

onconfig entries:



sqlhosts file:

rtpcc olipcshm campi rtpcc

xtpcc ontlitcp campi-1 7600

In this example local connections use the server name rtpcc and remote connections use the server alias xtpcc. When using tli/tcp there must be an entry in the /etc/services file for the Online listener.

rtpcc 7600/tcp # Informix listener

Informix must have read permission for /etc/services. When the engine comes up, netstat -a will show if the listener is operational


Local Address Remote Address Swind Send-Q Rwind Recv-Q State

----------------- -------------------- ----- ------ ----- ------ -------

campi-1.rtpcc *.* 0 0 0 0 LISTEN

campi-1.rtpcc haxx3-1.33232 8760 0 8760 0 ESTABLISHED

*.rtpc *.* 0 0 8576 0 BOUND

When dbaccess or an application is connected over tli/tcp, netstat will show ESTABLISHED in the state field. When a connection terminates there will still be a netstat entry with BOUND in the state field.
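When a connection fails, a quick check is whether the INFORMIXSERVER name actually resolves to an sqlhosts entry. The Python sketch below parses the four-column layout used in the sqlhosts examples earlier (server name, connection type, host, service). The field names, the treatment of # as a comment marker and the function name are assumptions made for illustration.

```python
from collections import namedtuple

SqlhostsEntry = namedtuple(
    "SqlhostsEntry", "dbservername nettype hostname servicename")

def parse_sqlhosts(text):
    """Return a dict of server name -> SqlhostsEntry for well-formed lines."""
    entries = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        fields = line.split()
        if len(fields) < 4:
            continue                          # skip incomplete lines
        entry = SqlhostsEntry(*fields[:4])
        entries[entry.dbservername] = entry
    return entries

if __name__ == "__main__":
    sample = "rtpcc olipcshm campi rtpcc\nxtpcc ontlitcp campi-1 7600\n"
    hosts = parse_sqlhosts(sample)
    print(hosts["xtpcc"].hostname, hosts["xtpcc"].servicename)
```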

onstat -g ses will show connections in Online


session #RSAM total used

id user tty pid hostname threads memory memory

22 dbbench 1 2098 haxx3-1 1 32768 27624

12 informix 5 18372 campi 1 32768 27400

The hostname field indicates a local or remote connection. For more information on the various connection options see the Administrator’s Guide, Volume 1.

By using different server aliases Online can listen on multiple networks. Each requires a listener in /etc/services and an entry in the sqlhosts file which specifies a different name

onconfig .


DBSERVERALIASES net2,net3 # List of alternate dbservernames


thash ontlitcp campi-1 7600

net2 ontlitcp campi-2 7700

net3 ontlitcp campi-3 7800


#private nets campi-1 campi-2 campi-3

By default Online spawns one poll thread for each nettype entry in the sqlhosts file. The NETTYPE onconfig parameter allows the user to allocate more than one poll thread and designate whether these run inline as part of the work of a CPUVP or as a separate process on a NETVP. onstat -g ath displays how many threads are polling


tid tcb rstcb prty status vp-class name

7 ca0b9678 0 2 running 1cpu sm_poll

8 ca0c3248 0 2 running 25tli tlitcppoll

9 ca0c3788 0 2 cond wait(arrived) 26tli tlitcppoll


Here we see one poll thread for shared memory (sm_poll) and two for tli/tcp. onstat -g glo will indicate if NETVPS have been started. If the poll thread is running on a NETVP there will be a shm entry for ipcshm and a tli entry for tli/tcp.

Virtual processor summary:

class vps usercpu syscpu total

cpu 19 14.88 23.11 37.99

aio 1 0.06 0.25 0.31

tli 8 0.06 0.08 0.14

shm 8 0.06 0.03 0.09

lio 1 0.01 0.01 0.02

pio 1 0.02 0.00 0.02

adm 1 0.00 0.02 0.02

msc 1 0.00 0.01 0.01

total 40 15.09 23.51 38.60

Individual virtual processors:

26 18578 shm 0.01 0.01 0.02

33 18534 tli 0.01 0.01 0.02

The default operation is inline for ipcshm and a network VP for tli/tcp. In high transaction rate OLTP environments we have found it best to use NETVPS for polling. Using a CPUVP means this process must switch from processing queries to handling network operations. This has a detrimental effect on the cache of the processor. An option to try is to have one less CPUVP than the number of physical processors and one network VP. Bind the CPUVPS to physical processors. The NETVP will be scheduled on the spare CPU and will not starve for resources.

There is a known bug with the poll call in Solaris that is currently being fixed. When a lot of connections are made on a port, poll can perform extremely poorly. This is because the kernel keeps a linked list of connections that the poll system call must traverse. This is particularly bad in Baan environments where each Baan session starts multiple Online sessions.

onstat -g ntu, ntt, ntm, ntd, nss, nsc and nsd can give useful network statistics.


Configuring CPUVPS

All processing in Online is performed by CPUVPS. Informix recommends setting CPUVPS to 1 less than the number of physical processors available. The user should always experiment with this; sometimes setting the number of CPUVPS greater than the number of physical processors can increase performance. This will only be true if there is idle time on the system. Use mpstat to determine if Online is using resources efficiently.

When Online is initializing it starts a single oninit process and, for each type of VP, forks another oninit. The state of the VPs can be seen with onstat -g glo

The aio, pio, lio and msc VPs should have little or no CPU time allocated to them. Also usercpu should dominate, as Online only really uses the OS for I/O, shared memory and timing (gettimeofday). We have seen at most 20% system time on a heavily loaded OLTP system.

For optimal performance it is better that the CPUVPS stay on the same physical CPU for as long as possible. This reduces the cache miss rate for the process. There are 2 parts to this: the first is binding the process to the CPU and the second is extending the amount of time each process has.

The CPUVP processes are bound to physical CPUs using the parameters AFF_SPROC, which indicates the physical number of the first processor to bind, and AFF_NPROCS, the number of processors to bind. If these are enabled the following messages appear in the message log.

10:01:58 Affinitied VP 1 to phys proc 1

10:01:58 Affinitied VP 3 to phys proc 4


If we run pbind on the system we see

process id 4940: 4

process id 4945: 1


Binding should reduce the amount of migration of the oninit processes from one physical CPU to another.


For details on modifying the dispatch table see http://hot.eng/dbe/tools/. Modifying the dispatch table gives the oninit processes a longer timeslice, moves them to the highest priority and keeps them at that priority. This reduces the number of times the processes are rescheduled and helps performance.

To see how many users are logged in use onstat -g ses. To see what threads are in the system use onstat -g ath


tid tcb rstcb prty status vp-class name

6 de0e31d8 de00e018 4 sleeping secs: 1 3cpu main_loop()

7 de0e3980 0 2 running 24tli tlitcppoll

8 de0e3e78 0 2 running 25tli tlitcppoll

9 de0e6478 0 3 sleeping forever 1cpu tlitcplst

10 de0e6b98 0 3 sleeping forever 1cpu tlitcplst

11 de0e71e8 de00e4bc 2 sleeping forever 7cpu flush_sub(0)

12 de0e73d8 de00e960 2 sleeping forever 19cpu flush_sub(1)

58 de0f5000 de01bed8 2 sleeping forever 11cpu flush_sub(47)

59 de0f5378 0 4 sleeping forever 22aio kaio

60 de0f5568 0 4 sleeping forever 1cpu kaio

61 de1090b0 de01c37c 3 sleeping forever 3cpu aslogflush

62 de109360 de01c820 2 sleeping secs: 30 4cpu btclean

80 de1189f8 0 4 sleeping forever 3cpu kaio

307 de4788f8 0 4 sleeping forever 17cpu kaio

311 df229df8 deec32a8 2 cond wait netnorm 1cpu sqlexec

312 dfcd4b10 deed037c 2 cond wait netnorm 1cpu sqlexec

313 dfcd51b0 deedcb08 2 cond wait netnorm 1cpu sqlexec

Common threads are flush_sub, which are cleaner threads; kaio, which are the kaio threads (notice they have the highest priority); aslogflush, which flushes the logical log buffer; btclean, which cleans up indexes; and sqlexec, which are user sessions.


Configuring Memory

All processes in an Online instance attach to a number of shared segments. Since 7.2.3 and Solaris 2.5.1 the size of the shared memory area can be up to approximately 3.7 GB. There are 3 segment types: resident, virtual and message.

There is one resident portion, of fixed size, which contains the buffer cache, locks, hash tables, log buffers etc. The size of this segment is determined by the parameters BUFFERS, LOCKS, PHYSBUFF and LOGBUFF. There is very little point in trying to manually determine the size of this segment as the amount each element takes can change with the release of Online. Simply set the desired parameters and see the size of the segment allocated. You can use onstat -g seg

Segment Summary:

(resident segments are locked)

id key addr size ovhd class blkused blkfree

577 1387874305 a000000 -749838336 54952 R 432754 1

578 1387874306 de000000 131072000 2592 V 6275 9725

Total: - - 3676200960 - - 439029 9726

Class can be one of R (Resident), V (Virtual) or M (Message). The minus sign in the Resident output is because the value is > 2GB; ipcs -a gives similar information but the segsize will be displayed correctly. The resident segment is the only one that can currently be locked down with ISM by setting the RESIDENT flag to 1 in the onconfig file. For more information on ISM see http://hot.eng/dbe/tuning_and_faqs/ism.html. We have seen performance gains of up to 15% with ISM and always recommend its use.
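The negative size can be converted back to the real value by reinterpreting it as an unsigned 32-bit number, on the assumption that the minus sign comes from a signed 32-bit display of the size. The Python sketch below does this for the resident segment in the listing above and checks the result against the Total line; the function name is hypothetical.

```python
def unsigned32(value):
    """Reinterpret a signed 32-bit value as an unsigned 32-bit value."""
    return value & 0xFFFFFFFF

if __name__ == "__main__":
    resident = unsigned32(-749838336)  # size shown for the R segment
    virtual = 131072000                # size shown for the V segment
    print(resident, resident + virtual)
```

The sum of the corrected resident size and the virtual size matches the 3676200960 reported on the Total line.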

The address at which the resident segment is placed is controlled by the SHMBASE variable in onconfig. This is usually set to 0x0A000000L, which is 160MB up in the address space. Online places the resident segment this high to avoid the code and data segments, which are below. As a last resort this can be reduced if you are short of memory, but be warned that it can lead to odd behaviour.

The virtual portion of shared memory contains all other structures: big buffers, sort pools, active thread data, user session data, database procedure cache, network message queues and many more. It is also used by PDQ for scans, aggregations etc. It can be composed of 1 or more actual shared memory segments. On startup the size of the first segment is controlled by the onconfig parameter SHMVIRTSIZE. When this initial segment is full, extra segments of size SHMADD will be added until the maximum shared memory limit is reached. The user can restrict the entire size of shared memory with the SHMTOTAL onconfig parameter; a value of 0 means unlimited size. If unlimited is specified and the system memory limit is reached, a query will abort. Reducing SHMVIRTSIZE can give extra memory to the buffer cache. The user must be careful here. Virtual memory structures associated with user connections are only allocated when the connection is established. If SHMVIRTSIZE is too small and there is no shared memory available in the system the connection will fail.

The amount of shared memory that each user requires varies greatly with what each session is doing. The Informix Performance Guide indicates that anything from 100k to 500k may be required per user. For a large number of users a TP monitor, such as Tuxedo, may be required to avoid memory exhaustion.

The message portion of the shared memory is for the ipcshm interface. It is used by ipcshm sessions to pass messages to and from the engine. It is usually small, its size being determined by the NETTYPE parameter for ipcshm in onconfig. The message segment is always placed at the end of the first virtual segment. An ipcshm client attaches at default address 80000. In older versions of Online the message segment can be misaligned and cause a severe performance problem. This manifests itself as up to 70% system time. A workaround in this situation is to change the client attach address with the environment variable INFORMIXSHMBASE.

Solaris is optimized for a maximum of 5 shared segments per process. For best performance, therefore, we recommend that the user determine the maximum size of the virtual segment for the running system and set SHMVIRTSIZE to this value. It is suboptimal to have Online add multiple small virtual segments. In some releases the maximum value SHMVIRTSIZE can be set to is 2GB. In these cases allocate the initial 2GB and then use onmode -a to add an extra virtual segment for the remainder of memory. Setting the shared memory values may require adjusting the Solaris shm values in /etc/system (see chapter 4).
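To judge whether a given SHMVIRTSIZE and SHMADD combination will produce many small segments, the following Python sketch counts the virtual segments needed to reach a target amount of virtual memory. It is a simplified model that ignores SHMTOTAL and any per-segment overhead; the sizes used in the example are hypothetical.

```python
def virtual_segments(needed_kb, shmvirtsize_kb, shmadd_kb):
    """Virtual segments allocated to provide needed_kb of virtual memory."""
    if needed_kb <= shmvirtsize_kb:
        return 1                                  # initial segment suffices
    extra_kb = needed_kb - shmvirtsize_kb
    return 1 + (extra_kb + shmadd_kb - 1) // shmadd_kb  # ceiling division

if __name__ == "__main__":
    # hypothetical: 200000k needed at peak
    print(virtual_segments(200000, 128000, 8192))  # small SHMADD increments
    print(virtual_segments(200000, 200000, 8192))  # SHMVIRTSIZE sized to fit
```

Sizing SHMVIRTSIZE to the observed peak keeps the count at one virtual segment, which is the outcome recommended above.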

To see how the memory is used in the system use onstat -g mem

Pool Summary:

name class addr totalsize freesize #allocfrag #freefrag

resident R a00e018 143630336 12144 2 2

res-buff R 12908018 -1222950912 14192 2 2

global V ca002018 11100160 10009648 1029 803

mt V ca006018 16039936 9671264 4178 637

rsam V ca036018 827392 29232 1390 8

aio V ca072018 22650880 3036976 2537 869


458 V ca9c2018 8192 3512 7 1

aio_fpf V caa16018 81920 16240 2 2


Blkpool Summary:

name class addr size #blks

global V ca004168 0 0

BUFFERS and LRUs

The buffer cache, configured with the BUFFERS onconfig parameter, is one of the most important resources in Online. Currently each buffer is 2k and each can hold 1 page of data. The reason for any buffer cache in a database is to reduce the number of I/Os to disk. Disk access is orders of magnitude slower than memory access. By caching a page in memory the hope is that it can be reused, thus avoiding a disk I/O. This leads to the concept of cache hit rate. The onstat -p command gives the read and write cache hit rate. %cached is calculated as the number of buffer reads that were already in cache divided by the total number of buffer reads.


dskreads pagreads bufreads %cached dskwrits pagwrits bufwrits %cached

2676 2719 3061 12.58 102 102 44 0.00

isamtot open start read write rewrite delete commit rollbk

10 3 2 2 0 0 0 0 0

ovlock ovuserthread ovbuff usercpu syscpu numckpts flushes

0 0 0 8.62 14.15 1 2

bufwaits lokwaits lockreqs deadlks dltouts ckpwaits compress seqscans

0 0 9 0 0 0 0 0

ixda-RA idx-RA da-RA RA-pgsused lchwaits

1 0 0 1 44
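
As a quick check of the %cached calculation, the following Python sketch reproduces the two hit rates in the sample output from the dskreads/bufreads and dskwrits/bufwrits counters. The function name is ours, and clamping a negative result to zero is an assumption inferred from the 0.00 shown for writes, not documented Online behaviour.

```python
def cache_hit_rate(disk_ops: int, buffer_ops: int) -> float:
    """Percentage of buffer operations that did not need a disk I/O.

    Assumption: a negative result is reported as 0.00, which matches the
    write-side figure in the sample output (102 dskwrits, 44 bufwrits).
    """
    if buffer_ops <= 0:
        return 0.0
    rate = 100.0 * (buffer_ops - disk_ops) / buffer_ops
    return max(rate, 0.0)


# Counters taken from the sample onstat -p output above.
read_rate = cache_hit_rate(disk_ops=2676, buffer_ops=3061)
write_rate = cache_hit_rate(disk_ops=102, buffer_ops=44)
print(f"read  %cached = {read_rate:.2f}")
print(f"write %cached = {write_rate:.2f}")
```

A read hit rate as low as the one in this sample is typical of an engine that has only just started and has not yet warmed its cache.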

onstat -p is one of the most important Informix OLTP stats. The meanings of the most relevant fields are as follows:


Raw I/O

• dskreads: Physical reads from disk.

• pagreads: Number of pages read.

• bufreads: The number of reads from the buffer pool. This should always be significantly greater than dskreads or the system is critically short of memory.

• %cached: The read cache hit ratio.

• dskwrits: Physical writes to disk.

• pagwrits: Number of pages written.

• bufwrits: The number of writes to the buffer pool. This should always be significantly greater than dskwrits, but not to the same extent as bufreads.

• %cached: The write cache hit ratio. Generally speaking this is less than the read cache hit ratio in insert intensive environments but can be higher in update intensive environments.

• isamtot: Total number of isam calls.

• commit: Calls to iscommit. Informix states there is no link between this stat and the number of COMMIT WORK calls but it is a fairly good indicator of the number of successful transactions executed.

• rollbk: Number of rollbacks. If this starts to increase sharply there may be a lot of errors or deadlocks occurring in the system.

• ovlock: The number of times that Online attempted to exceed the maximum number of locks (LOCKS in onconfig). If this is non-zero there should be errors in the message file as well.

• usercpu: The total user CPU time.

• syscpu: The total system CPU time. If this is high it may indicate a Solaris problem.

• numckpts: Number of checkpoints. If this is high the physical or logical logs might be too small or the checkpoint interval might be too short.

• bufwaits: The number of times a thread waited for a buffer. This might indicate too few LRUs, a number of hot pages or a transaction holding a buffer too long. Always try to keep this number low.

• lokwaits: The number of times a thread waited on a lock. Again strive to keep this value low.

• lockreqs: The number of locks requested. Measured while running a single isolated transaction, this stat can be used to size the lock requirements of the system.

• deadlks: Incremented every time a candidate is chosen and terminated to resolve a deadlock.

• seqscans: Incremented for each sequential scan. In most OLTP environments sequential scans should be avoided.

• lchwaits: Incremented each time a thread had to wait for a shared memory resource. A high number indicates a problem.

In Online, buffers are arranged into groups called Least Recently Used (LRU) queues. The number of LRUs in Online is specified using the LRUS onconfig parameter. Each LRU is in fact 2 queues, a free queue and a modified queue, and is assigned approximately BUFFERS / LRUS of the buffers in the system. On initialization all buffers are placed on the free queues. User threads take a buffer for use from the free queue and data is loaded into it from disk. Other sessions can share this data page; the individual rows are locked when they are modified until the transaction commits. If a buffer is modified it is placed on the modified queue.

To see the status of the lrus use onstat -R

64 buffer LRU queue pairs

# f/m length % of pair total

0 f 3278 69.6% 4708

1 m 1430 30.4%

2 f 3223 69.2% 4658


126 f 3329 70.9% 4698

127 m 1369 29.1%

92742 dirty, 300000 queued, 300000 total, 524288 hash buckets, 2048 buffersize

start clean at 25% (of pair total) dirty, or 1172 buffs dirty, stop at 24%
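
The BUFFERS / LRUS split and the cleaning thresholds reported at the bottom of onstat -R can be approximated with a few lines of Python. This is a sketch only: the function name is ours, the 25 and 24 percentages are read from the sample output, and the rounding that Online applies internally is assumed.

```python
def lru_pair_sizing(buffers: int, lrus: int,
                    max_dirty_pct: float, min_dirty_pct: float):
    """Return (buffers per queue pair, start-clean count, stop-clean count).

    Assumption: the thresholds are percentages of the average pair total,
    rounded to the nearest whole buffer.
    """
    per_pair = buffers / lrus
    start_clean = round(per_pair * max_dirty_pct / 100)
    stop_clean = round(per_pair * min_dirty_pct / 100)
    return per_pair, start_clean, stop_clean


# 300000 buffers in 64 queue pairs, cleaning between 25% and 24% dirty.
per_pair, start_clean, stop_clean = lru_pair_sizing(300000, 64, 25, 24)
print(f"average pair total : {per_pair:.1f} buffers")
print(f"start cleaning at  : {start_clean} dirty buffers")
print(f"stop cleaning at   : {stop_clean} dirty buffers")
```

The average of 4687.5 buffers per pair is consistent with the pair totals of 4708, 4658 and 4698 in the sample, and the start threshold agrees with the 1172 buffers reported by onstat -R.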

Modified buffers are placed at the head of the queue (hence the name LRU) and are flushed to disk in one of 3 ways: during a checkpoint, by a page cleaner, or with a foreground write.


A user thread becomes a page cleaner when it places a buffer on a modified LRU queue and calculates that the percentage of buffers on this queue is greater than the onconfig parameter LRU_MAX_DIRTY. The thread locks the queue for a short period, selects 16 buffers to flush to disk and unlocks the queue again. The cleaner will continue to flush groups of 16 buffers to disk until the percentage in the modified queue is less than the onconfig parameter LRU_MIN_DIRTY. Buffers being flushed are locked until cleaning is complete and then they are placed on the free queue.

The cleaned buffer is not zeroized and is placed at the head of the free queue in the hope that, if another thread hashes to it, it will not have been reused. Clean buffers are read from the tail of the free queue. In high throughput environments the gap between LRU_MAX_DIRTY and LRU_MIN_DIRTY should be kept small or threads will spend too long cleaning and will be unavailable for user work.

The value of LRU_MAX_DIRTY also directly affects the duration of a checkpoint in OLTP environments. Assuming this many buffers are dirty at the time of a checkpoint, Online must flush LRU_MAX_DIRTY * (BUFFERS / 100) pages. If checkpoint time is a concern, reducing LRU_MAX_DIRTY will help.
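
To put a number on this, the sketch below applies the formula to the configuration shown in the earlier onstat -R sample (300000 buffers, cleaning started at 25% dirty). The function name is ours, and the 2k page size is the buffer size stated in this section.

```python
PAGE_SIZE_BYTES = 2048  # each buffer holds one 2k page


def checkpoint_flush_pages(buffers: int, lru_max_dirty: int) -> int:
    """Worst-case pages to flush: LRU_MAX_DIRTY * (BUFFERS / 100)."""
    return lru_max_dirty * buffers // 100


pages = checkpoint_flush_pages(buffers=300000, lru_max_dirty=25)
megabytes = pages * PAGE_SIZE_BYTES / (1024 * 1024)
print(f"{pages} pages to flush, roughly {megabytes:.0f} MB of writes")

# Halving LRU_MAX_DIRTY roughly halves the checkpoint write volume.
print(checkpoint_flush_pages(buffers=300000, lru_max_dirty=12), "pages at 12%")
```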

The onconfig parameter CLEANERS specifies the maximum number of threads that can be cleaning at any one time. When a modified queue is being cleaned the small m in onstat -R will be replaced with a capital M

5 M 1342 28.8%

CLEANERS also affects the number of threads that will be initiated to complete a checkpoint. Each cleaner thread will be given a chunk to clean. When it completes its work the next uncleaned chunk is assigned to it. This can affect the duration of a checkpoint as there can be tail off if CLEANERS is set incorrectly.

The temptation is to set CLEANERS as close to the number of chunks in a database as possible (max value of CLEANERS is 128) to reduce checkpoint time. This can adversely affect OLTP performance during regular page cleaning. What tends to happen is that all LRUs are consumed at a roughly even rate. Initially they all reach LRU_MAX_DIRTY at approximately the same time and the cleaners kick in. If CLEANERS is 128, suddenly this many threads are cleaning and 2048 buffers (128 * 16) are locked, so actual user work takes a severe hit. The system can show more idle time as user threads wait for the I/O to complete.
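
The 2048 figure follows from each active cleaner locking one batch of 16 buffers. A small Python sketch, with names of our own choosing, shows how the number of simultaneously locked buffers grows with CLEANERS:

```python
BUFFERS_PER_BATCH = 16  # a cleaner selects 16 buffers at a time


def buffers_locked(active_cleaners: int) -> int:
    """Buffers locked at one instant if every cleaner holds one batch."""
    return active_cleaners * BUFFERS_PER_BATCH


for cleaners in (8, 32, 128):
    print(f"CLEANERS={cleaners:3d} -> {buffers_locked(cleaners)} buffers locked")
```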

An alternative to increasing CLEANERS to reduce checkpoint time is to determine whether any of the chunks are taking longer to flush and thus increasing the overall duration. Use iostat to identify these chunks and use striping to reduce the overall write time. The duration of a checkpoint can be determined from the message log

22:40:49 Checkpoint Completed: duration was 44 seconds.
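
To track checkpoint durations over time, the message log can be scanned for lines of this form. The Python sketch below assumes the line layout is exactly that of the sample above; the function name and the second sample line are hypothetical.

```python
import re

# Matches lines such as:
#   22:40:49 Checkpoint Completed: duration was 44 seconds.
CHECKPOINT_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2})\s+Checkpoint Completed:\s+duration was (\d+) seconds"
)


def checkpoint_durations(log_lines):
    """Return a list of (timestamp, seconds) for each completed checkpoint."""
    results = []
    for line in log_lines:
        match = CHECKPOINT_RE.match(line.strip())
        if match:
            results.append((match.group(1), int(match.group(2))))
    return results


sample_log = [
    "22:40:49 Checkpoint Completed: duration was 44 seconds.",
    "22:41:02 some unrelated message",  # hypothetical filler line
]
for timestamp, seconds in checkpoint_durations(sample_log):
    print(f"{timestamp}: checkpoint took {seconds}s")
```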

There is no hard and fast rule for configuring LRU queues. A thread must take a lock on the queues when taking a free buffer or returning a modified one, so the main advantage of having more LRUs is spreading the heat on these locks. The minimum required, therefore, is the number of active threads in the system (locks are not held across thread switches), which is limited to the number of CPUVPS. A value of 128 should be fine in most situations.

A foreground write occurs when a thread needs to load data from disk and all the free LRU queues are empty. The thread will initiate a single I/O to write a modified buffer to disk and must wait until the I/O is complete. Foreground writes should be avoided at all costs as they severely impact performance. They occur if the cleaners cannot keep up with the rate of buffer modification. onstat -R will show all the modified queues containing 100% of the buffers. To avoid this situation reduce LRU_MAX_DIRTY and/or increase the number of CLEANERS.

Use onstat -F to determine if foreground writes are occurring

Fg Writes LRU Writes Chunk Writes

0 0 2

address flusher state data

ca038458 0 I 0 = 0X0

ca038898 1 I 0 = 0X0


In most OLTP environments locking is very important. The default mode for locking in Online is page level but the user can change this to row level using the LOCK MODE clause in a create/alter table statement. Row level locking naturally requires a lot more locks and can add some overhead but is generally desirable for hot tables. The maximum number of locks is determined by the LOCKS onconfig parameter. Its value must be determined by experimentation. Lock structures do not take much memory so the user has some scope to increase them. If a transaction cannot obtain enough locks an error will be written to the message log and the transaction will be aborted.


If a lot of transactions are trying to lock the same row or page, performance can be severely impacted. The transactions will spin waiting for the lock and eventually some may time out. The onconfig parameter TXTIMEOUT determines the amount of time a transaction will wait before it times out waiting for a resource. The user might want to set this low if he has long transactions and wants a quick indication that there is congestion.

Use onstat -g spi to determine if locks are getting hot.

Spin locks with waits:

Num Waits Num Loops Avg Loop/Wait Name

297 2428 8.18 vproc vp_lock, id = 1

206 1645 7.99 vproc vp_lock, id = 3

153 159 1.04 lockfr0

68 73 1.07 lockfr1

189 688 3.64 lockfr2

7 7 1.00 lockfr10

36 49 1.36 lockfr11

24 24 1.00 lockfr12

16 149 9.31 fast mutex, lru-3

15 72 4.80 fast mutex, lru-5

17 77 4.53 fast mutex, lru-7

1 500 500.00 fast mutex, lockhash[37444]

1 4 4.00 fast mutex, lockhash[63173]

1 5 5.00 fast mutex, lockhash[63174]

82 593 7.23 fast mutex, bhash[228039]

2 2 1.00 fast mutex, bhash[299083]

88 604 6.86 fast mutex, bhash[494215]

There are a number of common hot locks to look for. vproc vp_lock is a lock held on the vp for scheduling; lockfrn are the locks that control the linked lists of lock structures themselves; lru-n are the locks on the LRUs; lockhash are individual user level locks; and bhash are locks on buffers. If the Num Waits field is high for a lock but Avg Loop/Wait is low then the lock is being taken regularly for a short period. If Num Waits is low but Avg Loop/Wait is high then the lock is being held by individual threads for a long period.
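
The two patterns just described can be picked out of onstat -g spi output mechanically. The Python sketch below parses the data lines of the sample and labels each lock; the function names and both thresholds (100 waits, 100 loops per wait) are arbitrary assumptions chosen for illustration and should be adjusted to the workload.

```python
def parse_spin_line(line: str) -> dict:
    """Split one data line into its four columns; the name may contain spaces."""
    waits, loops, avg, name = line.split(None, 3)
    return {"waits": int(waits), "loops": int(loops),
            "avg": float(avg), "name": name.strip()}


def classify(entry: dict, many_waits: int = 100, long_spin: float = 100.0) -> str:
    """Label a lock using the waits versus loops-per-wait reasoning above."""
    busy = entry["waits"] >= many_waits
    slow = entry["avg"] >= long_spin
    if busy and slow:
        return "taken often and held long"
    if busy:
        return "taken often, held briefly"
    if slow:
        return "taken rarely, held long"
    return "quiet"


sample = [
    "297 2428 8.18 vproc vp_lock, id = 1",
    "7 7 1.00 lockfr10",
    "1 500 500.00 fast mutex, lockhash[37444]",
]
for line in sample:
    entry = parse_spin_line(line)
    print(f"{entry['name']:<32} {classify(entry)}")
```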

If the lru locks are hot increase the number of LRUs. If a particular bhash is hot then the user has a hot page and the application may need tuning. Use onstat -k to determine who is holding locks (see tuning the application in Appendix B).


System Tuning

In this section of the paper we discuss system and Solaris issues as they relate to Informix OLTP applications. We do not intend to cover all system issues at great length, but will touch on some key things to keep an eye on.

Sample /etc/system File

The following parameters in /etc/system should bring up an Informix database that can have up to a 3.75GB shared memory segment (the shmmax value below is 4026531839 bytes). For a larger number of users, you may have to increase the semaphore parameters.

set shmsys:shminfo_shmmax=4026531839

set shmsys:shminfo_shmseg=64

set shmsys:shminfo_shmmni=64

set semsys:seminfo_semmns=4000

set semsys:seminfo_semmnu=4000

set semsys:seminfo_semmsl=1000

set semsys:seminfo_semmni=2000

set semsys:seminfo_semume=2000


* The next 2 parameters should be used only if database is on raw devices

set bufhwm = 100


* For telnet connections, set pt_cnt

set pt_cnt = 1005


* Set the next parameter on sun4d systems only, to prevent minor faults at

* large number of users. Value depends on memory configured

set max_nprocs=16000

Setting bufhwm to 100 tells the kernel to reserve 100KB to keep track of the filesystem buffer cache. This will free up more memory for the shared memory segment.

Disk

One of the most important aspects of database system tuning is tuning the disk I/O sub-system well. We touched on this in Chapter 2. Use extended statistics from iostat -xc or sar -d to gather disk I/O statistics. The disk utilization (%b column from iostat or %busy column from sar) and the service time (svc_t column from iostat or avserv from sar) are the key statistics to monitor. Ideally, a data disk doing random I/Os should be less than 50% busy (about 40 I/Os per second) and have a service time of less than 50ms. Service times will vary depending on the type of disks being used, so these numbers are by no means absolute. The log disks can sustain more I/O (up to 60% busy) without proving to be a bottleneck. Beyond this, it is better to stripe the logs.
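
These rules of thumb are easy to encode once the %b and svc_t values have been collected. The following Python sketch is illustrative only: the function name is ours, and applying the 50ms service time limit to log disks is an assumption, since the text only gives a separate busy limit for them.

```python
def disk_warnings(busy_pct: float, service_ms: float,
                  is_log_disk: bool = False) -> list:
    """Return a list of warnings for one disk, empty if it looks healthy."""
    warnings = []
    if is_log_disk:
        # Log disks can sustain up to 60% busy.
        if busy_pct > 60.0:
            warnings.append(f"log disk {busy_pct:.0f}% busy, consider striping the logs")
    elif busy_pct >= 50.0:
        # Data disks doing random I/O should stay below 50% busy.
        warnings.append(f"data disk {busy_pct:.0f}% busy, limit is 50%")
    if service_ms >= 50.0:
        warnings.append(f"service time {service_ms:.0f}ms, limit is 50ms")
    return warnings


print(disk_warnings(busy_pct=35, service_ms=20))
print(disk_warnings(busy_pct=72, service_ms=80))
print(disk_warnings(busy_pct=55, service_ms=20, is_log_disk=True))
```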

Memory

The memory sub-system plays a key role in OLTP performance. Informix requires a large shared memory segment for good performance. In addition, memory is required to run user processes. As a general guideline, the kernel requires about 30MB.

Use sar -pg or vmstat to gather memory statistics. A sample vmstat output is shown in Table 1.

The key vmstat parameters are explained in this paragraph, with the equivalent sar parameters shown in parentheses. pi (ppgin) is the number of Kbytes/sec paged in by filesystem reads, po (ppgout) is the number of Kbytes/sec paged out to the filesystem, and sr (pgscan) is the number of pages scanned by the page daemon. If sr is consistently non-zero, it indicates a shortage of memory. On raw databases, pi and po should be 0; otherwise they may indicate paging.

Table 1 Sample vmstat output

procs memory page disk faults cpu

r b w swap free re mf pi po fr de sr s2 sd sd sd in sy cs us sy id

0 2 0 20600 134128 0 1 0 0 0 0 0 0 0 0 0 434 450 690 6 2 92

0 0 0 4985632 5082864 0 5 0 0 0 0 0 0 0 0 0 109 21 404 0 0 100
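
The sr test can be automated by reading the data rows of vmstat. The Python sketch below indexes columns by position because the header repeats some names (sd, sy); it assumes the 22-column layout of Table 1, and it interprets "consistently non-zero" as non-zero in every sample, which is our reading rather than a fixed rule.

```python
# Column positions in the Table 1 layout (0-based).
PI_COLUMN, PO_COLUMN, SR_COLUMN = 7, 8, 11


def parse_vmstat_row(line: str) -> dict:
    """Extract pi, po and sr from one vmstat data line."""
    fields = [int(value) for value in line.split()]
    return {"pi": fields[PI_COLUMN], "po": fields[PO_COLUMN],
            "sr": fields[SR_COLUMN]}


def short_of_memory(rows) -> bool:
    """True if the page scanner was active in every sample."""
    scan_rates = [parse_vmstat_row(row)["sr"] for row in rows]
    return bool(scan_rates) and all(rate > 0 for rate in scan_rates)


table_1 = [
    "0 2 0 20600 134128 0 1 0 0 0 0 0 0 0 0 0 434 450 690 6 2 92",
    "0 0 0 4985632 5082864 0 5 0 0 0 0 0 0 0 0 0 109 21 404 0 0 100",
]
print("short of memory:", short_of_memory(table_1))
```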

If you find you’re short of memory, you can reduce the kernel’s memory requirements by tuning certain parameters, especially on systems with large memory. Many kernel resources are tied to the values of maxusers and max_nprocs in /etc/system. The default values of these parameters depend on the amount of physical memory. max_nprocs can be set to the maximum number of processes ever expected on the system. Use caution though: if the system hits this limit, it will not be able to fork any more processes. maxusers is not directly related to the number of processes and must be lowered experimentally.

The user should also ensure the memory is interleaved correctly. To determine how the memory is interleaved use prtdiag -v and look under the Memory section

/usr/platform/sun4u/sbin/prtdiag -v

========================= Memory =========================

Intrlv. Intrlv.

Brd Bank MB Status Condition Speed Factor With

--- ----- ---- ------- ---------- ----- ------- -------

0 0 1024 Active OK 60ns 16-way A

0 1 1024 Active OK 60ns 16-way A

2 0 1024 Active OK 60ns 16-way A

2 1 1024 Active OK 60ns 16-way A

4 0 1024 Active OK 60ns 16-way A

4 1 1024 Active OK 60ns 16-way A

6 0 1024 Active OK 60ns 16-way A

6 1 1024 Active OK 60ns 16-way A

8 0 1024 Active OK 60ns 16-way A

9 0 1024 Active OK 60ns 16-way A

10 0 1024 Active OK 60ns 16-way A

11 0 1024 Active OK 60ns 16-way A

12 0 1024 Active OK 60ns 16-way A

13 0 1024 Active OK 60ns 16-way A

14 0 1024 Active OK 60ns 16-way A

15 0 1024 Active OK 60ns 16-way A

Here we see we have 16GB of memory in high density 1GB SIMMs and are getting 16-way interleaving. If the user is restricted in the amount of memory he can order for a system it may be better to get low density SIMMs in order to achieve a better interleave factor.

CPU

CPU utilization is highly dependent on the workload. In general, the goal of tuning should be to reduce time spent by a process in kernel mode. This can be achieved by better caching of the data in a larger BUFFERS cache to reduce I/O, sufficient memory to ensure that processes don’t get paged out, etc. System utilities like vmstat, sar and mpstat show CPU utilization. As Informix uses kaio with a zero timeout there should be little or no wt time in the mpstat output. In a fully-loaded system, for Informix OLTP workloads, we’ve seen usr/sys time ratios of 75/12 with 2% idle time. This is just a rough guideline, but if you see system times of 50%, something’s probably wrong.

Modifying the default TimeShare (TS) class dispatch table can help significantly when running a large number of Informix users. See the whitepaper Supporting Many Database Users at http://hot.eng/dbe/whitepapers for details on how to modify the dispatch table.


Appendix A : Informix Scripts

File: move_log.sh

## Bring down to Single user mode

echo onmode -sy

onmode -c

onmode -sy

sleep 30

## Add logical and physical Log files

echo Adding 3 logical log files into llog_ldbs01

onparams -a -d llog_ldbs01 -s 499000

onparams -a -d llog_ldbs02 -s 499000

onparams -a -d llog_ldbs03 -s 499000

## Take A checkpoint

sleep 30

echo checkpoint

onmode -c

## Take A Null Level 0 Backup

echo Backup again

ontape -s -L 0

## Switch Current LogFile Pointer To The New One (assume 3 times)

onmode -l

onmode -l

onmode -l

## Take A checkpoint

echo checkpoint

onmode -c

## Take A Null Level 0 Backup

echo Backup again

ontape -s -L 0

## Now Drop The initial logical log files ( presume 3 )

echo Dropping initial log files

onparams -d -l 1 -y

onparams -d -l 2 -y

onparams -d -l 3 -y

## Take A checkpoint To Free up the Old Logical Logs

echo checkpoint

onmode -c

## Take A Null Level 0 Backup

echo Backup again

ontape -s -L 0


Appendix B: Application Tuning

When tuning an application in an OLTP environment, start with the tables that will be accessed and the SQL that will be executed on these tables. Generally speaking, table scans should be avoided in OLTP except on temporary tables or if the cardinality of the table is extremely small. Avoid scans on ALL tables that are being modified concurrently.

To avoid scans we need to build one or more indexes on the table and ensure that these indexes are used in the queries we will perform on the table.

Using sqexplain

Once the index is built, test that the optimizer is choosing it for the query. This is achieved by setting explain on in your SQL code and running the query either in the application or via dbaccess

set explain on;


WHERE c_w_id = 286 AND c_d_id = 1 AND c_last = "BARESEANTI";

A file sqexplain.out will be produced in the execution directory and a Query Execution Plan will be dumped into it




WHERE c_w_id = 286 AND c_d_id = 1 AND c_last = "BARESEANTI"

Estimated Cost: 1

Estimated # of Rows Returned: 1

1) informix.customer: INDEX PATH

(1) Index Keys: c_last c_w_id c_d_id c_first (Key-Only) (Serial,fragments: 0)

Lower Index Filter: (informix.customer.c_w_id = 286 AND (informix.customer.c_d_id = 1 AND informix.customer.c_last = 'BARESEANTI') )

Here we see that an index is being chosen for the query. In some situations, even after an index has been built on the table, the optimizer indicates that it is not chosen



SELECT d_name, d_street_1, d_street_2, d_city, d_state, d_zip

FROM district

WHERE d_w_id = 286 AND d_id = 1

Estimated Cost: 828

Estimated # of Rows Returned: 100

1) informix.district: SEQUENTIAL SCAN

Filters: (informix.district.d_w_id = 286 AND informix.district.d_id =1 )

The optimizer bases its query execution plan (qep) on the statistics it has available to it. Statistics are gathered using the “update statistics” SQL statement. The user has 3 options: LOW, MEDIUM and HIGH. For LOW the smallest amount of information is gathered and no distributions on columns are gathered. For HIGH the distribution information is exact. For large tables this can take a long time, requiring scans for all columns specified.

For MEDIUM the data for distributions is obtained by sampling. This requires one scan of the data but is a lot faster than HIGH. One strategy for statistics gathering is to specify HIGH for smaller tables and MEDIUM for the rest. Statistics gathered with LOW can take only seconds to collect whereas MEDIUM can take minutes or hours.

The user should obtain qeps for all the major groups of SQL statements to be executed and ensure that the indexes are correct
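
When many statements have to be checked, sqexplain.out can be scanned for sequential scans instead of being read by eye. The Python sketch below relies only on the plan lines shown in the two examples above ("INDEX PATH" and "SEQUENTIAL SCAN"); the function name is ours and other access methods are simply ignored.

```python
import re

# Matches plan lines such as "1) informix.district: SEQUENTIAL SCAN".
PLAN_LINE = re.compile(r"^\s*\d+\)\s+(\S+):\s+(SEQUENTIAL SCAN|INDEX PATH)\s*$")


def tables_scanned(explain_text: str) -> list:
    """Return the tables that the optimizer plans to scan sequentially."""
    scanned = []
    for line in explain_text.splitlines():
        match = PLAN_LINE.match(line)
        if match and match.group(2) == "SEQUENTIAL SCAN":
            scanned.append(match.group(1))
    return scanned


sample = """\
Estimated Cost: 1
1) informix.customer: INDEX PATH
Estimated Cost: 828
1) informix.district: SEQUENTIAL SCAN
"""
for table in tables_scanned(sample):
    print("sequential scan on", table)
```

In practice the text would be read from the sqexplain.out file in the directory where the query was run.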

Database Procedures

Most database interactions are in a client server situation, the database engine being the server and the application being the client. This client server communication can be remote over a network or local on the same machine. If there is a lot of data, such as multiple intermediate result rows, passing back and forth between client and server, the user may consider using database procedures.

The advantage of procedures is that intermediate results no longer need to be passed back to the client. The disadvantage is that some of the processing that would otherwise be performed on the client is moved to the server. The user might try both alternatives to see which performs best.

For more information on database procedures see the Informix Guide to SQL. To test a database procedure the user can call it from a dbaccess session. For a procedure declared:


did SMALLINT, -- pmt->d_id

cid INT, -- pmt->c_id

clast CHAR(16), -- pmt->c_last

c_did SMALLINT, -- pmt->c_d_id

c_wid SMALLINT, -- pmt->c_w_id

hamount NUMERIC(12,2), -- pmt->h_amount / 100

wid SMALLINT, -- pmt->w_id

byname INT, -- pmt->byname

hdate DATETIME YEAR TO SECOND -- pmt->pay_date


call the procedure

database tpcc;

execute procedure informix.payment(6,123,"OUGHTABLEABLE",4,55,23.30,100,0,'1996-02-14 16:58:21');
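
The last argument of the call is a DATETIME YEAR TO SECOND value, written as 'YYYY-MM-DD HH:MM:SS'. When test calls are generated from a program, a small helper avoids hand-typed mistakes. This is a minimal Python sketch; the helper name is hypothetical and is not part of any Informix API.

```python
from datetime import datetime


def informix_datetime_literal(moment: datetime) -> str:
    """Format a timestamp as a DATETIME YEAR TO SECOND literal body."""
    return moment.strftime("%Y-%m-%d %H:%M:%S")


pay_date = datetime(1996, 2, 14, 16, 58, 21)
literal = informix_datetime_literal(pay_date)
print(f"hdate argument: '{literal}'")
```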

Note it is important to get the format correct for any DATETIME parameters. The procedure itself can be debugged using a trace file. Add the following lines:

SET DEBUG FILE TO '/tmp/payment.trc';


This will dump extensive amounts of debug data into the file /tmp/payment.trc including the long form of any SQL errors found. A database procedure can also call any Solaris command using the SYSTEM command

SYSTEM( "sleep 100" );

A sleep is useful to halt a procedure to determine its state, locks held, stack size etc.

Application errors

All Informix errors have 2 parts, an SQL error and an ISAM error. Use the Informix utility finderr to dump the full text of both errors

finderr 100

-100 ISAM error: duplicate value for a record with unique


A row that was to be inserted or updated has a key value that already

exists in its index. For C-ISAM programs, a duplicate value was

presented in the last call to iswrite, isrewrite, isrewcurr, or

isaddindex. Review the program logic and the input data. For SQL

products, a duplicate key value was used in the last INSERT or UPDATE.

Deadlock and locking

There are situations where an error may not be catastrophic in an application. Error 100 above, for instance, may just need some further intervention. Two other errors

-154 ISAM error: Lock Timeout Expired

-143 ISAM error: deadlock detected.

occur when the session is chosen by Online as a candidate to free a deadlock situation. The user can simply re-submit the SQL statement in the hope that the deadlock has indeed been cleared. In high OLTP situations deadlock timeouts often occur but an excessive number can indicate a bigger problem.

Even if timeouts are not occurring, deadlock situations are one of the main performance problems in OLTP environments. The timeout value is set with the onconfig parameter DEADLOCK_TIMEOUT. The default is 60 seconds; the user might want to reduce this to get a quicker indication of problems

If a lot of timeouts are occurring check the following:

• The application is not doing prepare statements on the fly

• All sessions do not try to lock the same row or page

• A session is not performing a table lock on a frequently accessed table

• All the indexes are created correctly and are being chosen by the optimizer, thus avoiding table scans

• The statistics are up to date

To determine what locks an application requires use onstat -k. This dumps all locks in the system.


address wtlist owner lklist type tblsnum rowid key#/bsiz

a11f070 0 cb7b5fd8 0 S 100002 203 0

a11f0a4 0 cb7ba3d8 0 S 100002 203 0

a120388 0 ca8badd8 0 S 100002 203 0

a1203f0 0 ca8bcfd8 0 S 100002 203 0

a121394 0 cb7c9618 a758c24 HDR+S 4100002 43c02 0

a1234b0 0 cb7c9618 a935bc0 HDR+S 700002 5e1303 0

a123f40 0 cb7c9618 a8915e4 HDR+S 2d00002 533d17 0

a1bef9c 0 cb7c9618 a893f54 HDR+S 2d00002 533d1d 0

a1bf414 0 cb7c9618 a1234b0 HDR+IS 4100002 0 0

a1c31d4 0 cb7c9618 a43ab80 HDR+SR 3700002 533d1b K- 1

a25ddb8 0 cb7c9618 a4d8990 HDR+SR 3700002 533d16 K- 1

a26179c 0 cb7c9618 a43c06c HDR+SR 4d00002 48816 K- 1

a304378 0 cb7c9618 a1c31d4 HDR+S 2d00002 533d1b 0

a4394f4 0 cb7c9618 a304378 HDR+SR 3700002 533d1c K- 1

a43ab80 0 cb7c9618 a3a0858 HDR+S 2d00002 533d1a 0

a43c06c 0 cb7c9618 a1bf414 HDR+SR 4d00002 43c02 K- 1


a616fbc 0 cb7c9618 a6b4c2c HDR+SR 3700002 533d19 K- 1

aa6ccec 0 cb7c9618 a4394f4 HDR+S 2d00002 533d1c 0


239 active, 200000 total, 65536 hash buckets

The important fields here are the type of lock held (S is a shared lock and X is an exclusive lock), the tblsnum, which is the partition number of the table, and the rowid. The rowid indicates the following

• a rowid of zero is a table lock

• a rowid ending in 2 zeros is a page lock

• all other rowids are row level locks on tables or indexes
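
The three rowid rules above translate directly into code. The Python sketch below assumes that onstat -k prints the rowid in hexadecimal, as the sample values such as 43c02 suggest, so that "ends in 2 zeros" means the low byte is zero; the function name and the page-lock example value are hypothetical.

```python
def lock_granularity(rowid_hex: str) -> str:
    """Classify an onstat -k rowid as a table, page or row level lock."""
    rowid = int(rowid_hex, 16)
    if rowid == 0:
        return "table"
    if rowid & 0xFF == 0:  # last two hex digits are zero
        return "page"
    return "row"


for rowid in ("0", "5e1300", "203", "43c02"):
    print(f"rowid {rowid:>7} -> {lock_granularity(rowid)} lock")
```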

The tblsnum indicates the internal partition number that the lock is taken on. Use the following fragment of SQL to determine your partitions

select a.tabname as Table,

HEX(a.partnum) as TablePn,

HEX(b.partn) as FragPn,

b.fragtype as FragType

from systables a , OUTER sysfragments b

where a.tabid = b.tabid

and a.tabid >99 ORDER BY 1,2,3;

This produces output:

table tablepn fragpn fragtype

orders 0x00000000 0x04100002 T

orders 0x00000000 0x04200002 T


orders 0x00000000 0x04D00002 I

orders 0x00000000 0x04E00002 I

fragpn maps to tblsnum in onstat -k; fragtype is T for a table and I for an index. From the onstat -k output above we see our code has taken a number of shared locks on both the table and index of the orders table.
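
Matching the two outputs by hand is error prone because onstat -k prints tblsnum without the 0x prefix or leading zeros, while HEX() pads to eight digits. The Python sketch below compares the values numerically. The function name is ours, and the fragment list is typed in from the sample query output rather than fetched from the database.

```python
FRAGTYPE_NAMES = {"T": "table", "I": "index"}


def find_fragment(tblsnum: str, fragments):
    """Return (table name, object kind) for an onstat -k tblsnum, or None."""
    wanted = int(tblsnum, 16)
    for table, fragpn, fragtype in fragments:
        if int(fragpn, 16) == wanted:  # int() accepts the 0x prefix in base 16
            return table, FRAGTYPE_NAMES.get(fragtype, fragtype)
    return None


# Rows copied from the sample query output above.
fragments = [
    ("orders", "0x04100002", "T"),
    ("orders", "0x04200002", "T"),
    ("orders", "0x04D00002", "I"),
    ("orders", "0x04E00002", "I"),
]
for tblsnum in ("4100002", "4d00002"):
    print(tblsnum, "->", find_fragment(tblsnum, fragments))
```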

To determine what SQL a session is executing, first determine the Informix internal session number with onstat -g ses

session #RSAM total used

id user tty pid hostname threads memory memory

455 dbbench 0 1606 haxx3-1 1 147456 140448

onstat -u can then be used to dump the statistics for the session


address flags sessid user tty wait tout locks nreads nwrites

ca038018 ---P--D 1 informix - 0 0 0 3 41

ca038458 ---P--F 0 informix - 0 0 0 0 9136


ca8a9118 ---P--D 19 informix - 0 0 0 0 0

cb7c6b98 ---P--- 455 dbbench 0 0 0 26 0 0

132 active, 384 total, 345 maximum concurrent

In a deadlock situation the wait field would show a lock that the session was waiting on; onstat -k can then be used to determine which session is holding that lock. onstat -p can also be used to determine how many locks each type of transaction requires.

Once the session that is causing the deadlock is determined, use onstat -g sql <session-no> to determine the SQL being executed.

INFORMIX-OnLine Version 7.24.UC1 -- On-Line -- Up 22:38:43 -- 3268344Kbytes

Sess SQL Current Iso Lock SQL ISAM F.E.

Id Stmt type Database Lvl Mode ERR ERR Vers

454 EXEC PROCEDURE tpcc RR Not Wait 0 0 7.24

Current statement name : slctcur

Current SQL statement :

execute procedure informix.order_status(0,3,5,234,"")

Last parsed SQL statement :

execute procedure informix.order_status(0,3,5,234,"")

The hot locks in the system as a whole can be seen using onstat -g spi (see Informix Tuning section).

Using PDQ

Occasionally users will use Parallel Data Query (PDQ) in OLTP environments. They can perform scans on small or temporary tables, often joining them with traditional indexes. In these situations memory must be allocated to PDQ using the DS_TOTAL_MEMORY onconfig parameter. The user must then achieve a balance between BUFFERS requirements and DS requirements within the memory available.

PDQ spawns a lot more Informix threads to perform its parallel work than a straight SQL session. Use onstat -g ath to determine the number of active threads, especially scan threads.


tid tcb rstcb prty status vp-class name

2 a9e3e018 0 2 sleeping(Forever) 21lio lio vp 0

3 a9e3e2c8 0 2 sleeping(Forever) 22pio pio vp 0


188547 c01f6fc0 b31c3318 2 sleeping(Forever) 11cpu join_2.1

188558 b5d6ad78 c0c59858 2 sleeping(secs: 3) 6cpu scan_3.0

189086 b9f96fc8 b6557918 2 sleeping(secs: 3) 14cpu group_1.0

There is a limit to the number of threads a CPUVP (and a physical processor) can sustain before the overhead of thread switching degrades performance. We have seen optimal performance with 8 to 10 scan threads on a 250MHz processor. Unfortunately the lower bound of the onconfig parameter DS_MAX_SCANS is 10, but reducing this parameter can often increase performance in PDQ situations.

optcompind

Optcompind arises from “OPTimizer COMPare the cost of using INDices”. The comment in the onconfig file is as follows


# 0 => Nested loop joins will be preferred (where

# possible) over sortmerge joins and hash joins.

# 1 => If the transaction isolation mode is not

# "repeatable read", optimizer behaves as in (2)

# below. Otherwise it behaves as in (0) above.

# 2 => Use costs regardless of the transaction isolation

# mode. Nested loop joins are not necessarily

# preferred. Optimizer bases its decision purely

# on costs.

OPTCOMPIND 0 # To hint the optimizer

In OLTP environments we always set this variable to 0.