Oracle database performance diagnostics - before you begin

Oracle Database Performance Diagnostics

© Hemant K Chitale 2011. http://hemantoracledba.blogspot.com

Oracle Database “Performance diagnostics” : Before you begin

Hemant K Chitale

Introduction

What do you, as the DBA / Developer / System Administrator / Analyst / Performance Analyst / Application Manager, do when you get calls like:

1. The “system” is slow

2. The batch job is “hanging”

3. Users cannot login

Are these Database Performance issues? Always?

Where do you begin diagnostics? Do you jump into trace files, StatsPack / AWR, OS statistics, etc.?

This article is a primer on what you should be aware of *before* you begin looking at Oracle Trace Files, Explain Plans, Statistics and what-not.

The diagnostic process must be able to help the Oracle Database Performance Analyst identify:

a. Whether there really is an “issue”

b. How well the issue is defined and, if necessary, how to redefine it

c. Where the cause arises

d. What can be done to address the cause

Note: This article is NOT about how to use Oracle and OS methods to diagnose a performance issue and/or tune an SQL/Application/Schema/Database. It is about what you should be aware of before you begin.

Environmental Factors

Let’s begin with some basic factors:

1. Response Time

“Response Time” is what users (and application servers!) see. They do not see ‘consistent gets’ or ‘redo size’ or ‘enq: TX - row lock contention’.

User perception of a system’s usability is significantly impacted by Response Time. “Fit for use” (the application is usable) must co-exist with “fit for purpose” (the application does what it is supposed to do).

On the other hand, Response Time for a batch job can mean anything from the execution time of a single (significant) SQL call to the elapsed time of a key stage in the job.



2. Tiers

There are very many tiers through which a response reaches a user (or an application server, depending on who/what has “response issues”).

From the desktop, via a browser, over the internet/intranet to an application server, rewritten as an SQL call to the database, parsed and executed by the database, CPU and I/O cycles consumed to fetch, filter and compute values, round-trips between the application server and database server, formatting on the application server, latency down to the user’s desktop: very many tiers contribute to an application’s performance. Such tiers also exist in a batch job, where the round-trips between the application server and database server are often ignored.

3. Capacity

Each “component” (be it the User’s Desktop or the WAN Link or the App Server CPU or the App Server RAM etc. … down through the Tiers) has a defined Capacity – theoretical and practical. Within a database instance, also, there are capacity parameters – e.g. SGA sizing parameters, the processes parameter etc.

4. Usage

Usage of the available capacity of any component varies from time to time. Any tool that “measures” usage has to collect a snapshot of usage at a certain point in time. Multiple snapshots must be analyzed together.

5. Throughput

Throughput is the volume of “load” (Transactions/Queries/Rows/Users – each is a different facet of “load”) that is being serviced by the “system”.

6. Constraints

Capacity is a constraint. Concurrency is a constraint as well. Two users/processes/sessions may not be permitted to modify the same row/resource at the same time.

7. Serialisation

Because Capacity is not Unlimited and because there are Constraints (automatic/system/artificial/user-defined), there may well be some points in application code or database code or the operating system where serialisation occurs.

8. Requirements

Volume requirements, usability requirements and control requirements are defined by users / analysts and must be built into the “system”. Requirements also add to code complexity.

9. Scalability

Scalability of the system is its ability to handle additional workload without more than a proportional increase in component resource (CPU, RAM, I/O) usage. Scalability is adversely impacted by points of contention or serialisation in the requirements / design / code.

10. Non-Linearity

Many systems are non-linear. If a query that processes ten thousand rows that are always in memory and never overflows to disk for Group/Sort operations takes 1 second to run, it doesn’t necessarily follow that a hundred thousand rows would take 10 seconds. The hundred thousand row query may require multiple disk reads because not all rows are cached in memory and, furthermore, the Group/Sort operations may also overflow to disk.

11. Shared Resources

A database server may be configured to host multiple databases. The CPU and I/O load of one or more “other” databases may well be “interference” in the performance of a database under review. The “cost” of such “interference” must be computed and accounted for. Similarly, within a database, Batch reports may interfere with online queries. Also, when multiple schemas (e.g. for different “applications”) are provided for within a database, they share and contend for shared pool, library cache and buffer cache resources as well as for CPU and I/O.

These basic Factors apply to any System. They apply to Airports and Aeroplanes. They apply to Factories and Refineries. They apply to Hotels and Restaurants. They apply to Applications using Oracle Databases.

As an Oracle Database Performance Analyst (a DBA or a Developer or a System Administrator), it is necessary to be aware of these Factors.
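Factor 4 (Usage) says that multiple snapshots must be analyzed together. As a minimal sketch of what that can look like, assume each snapshot records a timestamp and a cumulative counter (for example, total physical reads so far). The `Snapshot` structure, its field names and the sample values are hypothetical and are not taken from any Oracle view or tool. The rate for each interval is the difference between consecutive snapshots divided by the elapsed time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One sample of a cumulative counter (hypothetical structure)."""
    taken_at_s: float  # seconds since some fixed starting point
    counter: int       # cumulative value, e.g. total physical reads so far


def usage_rates(snapshots):
    """Return the per-second rate for each interval between consecutive snapshots."""
    ordered = sorted(snapshots, key=lambda s: s.taken_at_s)
    rates = []
    for earlier, later in zip(ordered, ordered[1:]):
        elapsed = later.taken_at_s - earlier.taken_at_s
        if elapsed <= 0:
            continue  # skip duplicate timestamps rather than divide by zero
        rates.append((later.counter - earlier.counter) / elapsed)
    return rates


# Three made-up snapshots taken ten minutes apart.
samples = [Snapshot(0, 1_000), Snapshot(600, 7_000), Snapshot(1200, 31_000)]
rates = usage_rates(samples)
```

With these made-up samples, the second ten-minute interval shows four times the rate of the first, which a single snapshot would not reveal.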

Definition of “Issue”

The definition is the first step in the process.

First, identify the command/process/job that is at issue. Is it a daily task? How many components (see Factor 2 “Tiers”) does it involve? Do you / the team need to evaluate the capacity, usage and throughput of each of the Tiers? Can a specific Tier be identified as a constraint?

Typically for a performance issue, the best reference is “Time Taken”. How long does the particular command/process/job take to run? How long did it take to run on previous occasions? Was there any variance in run times on previous occasions?

Can a test system / test run be executed? Can the test be traced (end to end, from the user to system level waits and back to the user)? Can the production run be traced? Can both traces be compared? Remember: The test may not have the same level of components, capacity, throughput, usage and may have a different set of constraints.

When analysing the performance of a particular system/job/process/function, it is also important to differentiate between “short, sharp” queries and sessions and “medium to long running” batch jobs and reports. A system may have a mix of such operations.

Some of these questions may not need to be formally asked. The answers may be well known or documented (e.g. the components and capacity). Others may need to be discovered (e.g. previous response times, usage). Throughput and constraints may get identified only during the diagnostics phase (unless some of them are “well known” and documented).

A good definition of an issue might be:

Program “A”, which takes 15 minutes to run at (approximately) the same time every day (on the same server) for the same volume of data, has been taking 45 minutes for the last 2 days, although no change to program code or parameters has been made.

Another good definition might be:

Users are usually able to view the details on screen within 5 seconds of submitting the query, navigate through all the screens in 15 seconds and commit in 2 seconds, but the same query on the same data is now taking 25 seconds, 30 seconds and 5 seconds respectively, under the same user workload.

Another example might be:

We have exactly doubled the incoming data volume for the ETL job but processing time is now 5x with no other changes to the system.
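The ETL example can be put into numbers by comparing how much the time grew with how much the volume grew. The following is only an illustrative sketch: the function name is hypothetical and the absolute figures (1 million rows, 60 minutes) are invented, since the example states only the ratios. A result close to 1.0 would mean time grew in proportion to volume; a larger result suggests a non-linearity or a constraint worth investigating.

```python
def scaling_factor(volume_before, volume_after, time_before, time_after):
    """Ratio of time growth to volume growth; about 1.0 means proportional scaling."""
    if min(volume_before, volume_after, time_before, time_after) <= 0:
        raise ValueError("volumes and times must be positive")
    volume_growth = volume_after / volume_before
    time_growth = time_after / time_before
    return time_growth / volume_growth


# Volume doubled (1M -> 2M rows), processing time became 5x (60 -> 300 minutes).
# Only the 2x and 5x ratios come from the example; the absolute values are assumed.
factor = scaling_factor(1_000_000, 2_000_000, 60, 300)
```

Here `factor` is 2.5: each unit of incoming data now costs two and a half times as much processing time as before.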

Collection of Data

Use the “Questionnaire for Issue Definition” in the Appendix. Remember, not all questions need to be formally raised. Some information may be available from documentation. Some recursion may be necessary: questions or answers that were deemed “insignificant” during the first round of diagnostics may have to be revisited and reviewed. For example, early discussion may have assumed that the network was always stable, but testing or trace files may indicate that network round trips are significant, so the network component (“Tier”) may have to be revisited.



Some of the data collection may take time, e.g. running a trace and analyzing the trace file. You may need to prioritize which data is to be collected early while other collection can run “in the background”. Time Data should always be the first priority.

Time Data

Data about “Time taken to process/run the query/request/job/batch” should be in terms of Seconds or Minutes (where the time exceeds an hour).

Data about “Time for on-screen query” should be in terms of seconds.

Data about “SQL Execution time” should be in terms of Milliseconds, Seconds or Minutes.

Time data for previous runs (including min/avg/max) and test runs should also be collected.

When collecting data about different executions, ensure that the executions are comparable – e.g. at the same time of day, for the same volume.
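As a rough sketch of how min/avg/max for previous runs might be put to use, the snippet below summarizes a list of earlier run times and checks whether the current run falls outside that baseline. The function names, the 10% tolerance and the sample durations are assumptions made for illustration; the runs are assumed to be comparable (same time of day, same volume).

```python
from statistics import mean


def summarize_runs(durations_s):
    """Return (min, avg, max) for a list of comparable run durations in seconds."""
    if not durations_s:
        raise ValueError("need at least one previous run")
    return min(durations_s), mean(durations_s), max(durations_s)


def is_outside_baseline(current_s, durations_s, tolerance=0.10):
    """True if the current run exceeds the slowest previous run by more than `tolerance`."""
    _, _, slowest = summarize_runs(durations_s)
    return current_s > slowest * (1 + tolerance)


previous = [880, 900, 915, 905, 900]  # hypothetical daily runs of about 15 minutes
low, avg, high = summarize_runs(previous)
flagged = is_outside_baseline(45 * 60, previous)  # today's run took 45 minutes
```

Comparing against the slowest previous run rather than the average is a deliberate choice here: it avoids flagging a run that is merely within the variance already seen on previous occasions.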

Time Series Data

Time Series Data (as different from “Time Data”) is about plotting performance information and statistics over time and validating if a trend exists. If such a trend exists, it must be considered as a factor when evaluating and projecting load and performance. Such Time Series Data covers not only performance and response times but also volume and workload, concurrency and throughput.
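One simple way to check whether a trend exists is to fit a straight line through the observations and look at its slope. The sketch below uses an ordinary least-squares fit against the observation index; this technique is chosen here for illustration and is not prescribed by the article, and the weekly run times are invented.

```python
def trend_slope(values):
    """Least-squares slope of `values` against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    variance_x = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance_x


weekly_minutes = [15, 16, 18, 19, 22, 24]  # hypothetical weekly run times
slope = trend_slope(weekly_minutes)
```

For these sample values the slope is roughly 1.8 minutes per week, which would suggest steady growth rather than a one-off incident. A straight line is only a first approximation; a non-linear system may need a different model.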

Components (Tiers) data

Data about the Tiers involved should include:

a. Hardware Size (number of CPUs/Cores, CPU Speeds, RAM, HBAs, Network Interfaces)

b. Operating System and FileSystem types

c. OS performance counters – sar, vmstat, iostat, top, topas

d. Latency (min/avg/max)

Volume / Workload data

Data about Volume and Workload should include:

a. Number of concurrent, active users

b. Number and sizes of rows being processed

c. Number and sizes of batch jobs running concurrently

Such workloads impact throughput and concurrency.



Execution Plans, Statistics, Wait Events

Details about SQL Execution Plans and Execution Statistics (e.g. “consistent gets”) and Wait Events are to be collected and analyzed when it is determined that performance within the database needs to be reviewed. Let me emphasise: This is only after you have determined that the database and, in particular, a specific portion of the application needs to be reviewed. Do not jump into this too soon. I put this last in the list of data to be collected.

Interpreting the Data

The Time data must be interpreted to identify patterns. For example, has the job been taking progressively more time as the weeks/months have progressed? Does the job take more time on certain days or at certain times? Is there a correlation between the Time and the Volume? Can a report that is to be run every 30 minutes be allowed to take 10 minutes to run? Should the report OR the schema OR the data loads be redesigned to have the report run in less than 1 minute? Or should the frequency of the report be changed to run every 60 minutes?

Workload/Volume/Usage and Capacity/Throughput/Tiers data must be correlated. Does a 20% increase in Workload/Volume/Usage result in a 20% increase in CPU usage?
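As one possible way to check whether workload and resource usage move together, the sketch below computes a Pearson correlation coefficient over paired samples. The choice of Pearson correlation, the function name and the sample figures (active users and CPU % per interval) are all assumptions for illustration. Note that a high correlation only shows that the two series rise and fall together; it does not by itself confirm that a 20% rise in one produces a 20% rise in the other.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return covariance / sqrt(var_x * var_y)


active_users = [100, 120, 140, 160]  # hypothetical workload per interval
cpu_percent = [30, 36, 42, 48]       # hypothetical CPU usage for the same intervals
r = pearson(active_users, cpu_percent)
```

In this made-up data CPU usage rises exactly in step with the number of users, so the coefficient is 1 (within floating-point rounding). Real measurements would be noisier, and a value well below 1 would suggest that something other than this workload is driving CPU usage.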

Oracle Trace Files, Oracle Wait Statistics and Server Performance (sar, vmstat, ping latency) data must be reviewed to identify component resource utilisation. The key resources CPU, RAM and I/O are used to transfer data to the user. Therefore, it is necessary to correlate the usage of these resources to the volume of data. Does the query that fetches 100 rows without having to do any aggregation really need to do 1 million buffer gets?
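The buffer gets question can be reduced to a simple ratio of work done to rows returned. The sketch below is illustrative only: the function names are hypothetical, and the threshold of 1,000 gets per row is an arbitrary assumption, not an Oracle guideline. What counts as "too many" depends on the query and the schema.

```python
def gets_per_row(buffer_gets, rows_returned):
    """Buffer gets performed for each row returned to the caller."""
    if rows_returned <= 0:
        raise ValueError("rows_returned must be positive")
    return buffer_gets / rows_returned


def looks_excessive(buffer_gets, rows_returned, threshold=1_000):
    """True if the gets-per-row ratio exceeds an (assumed, arbitrary) threshold."""
    return gets_per_row(buffer_gets, rows_returned) > threshold


# The case from the text: 100 rows fetched, 1 million buffer gets.
ratio = gets_per_row(1_000_000, 100)
suspicious = looks_excessive(1_000_000, 100)
```

A ratio of 10,000 gets per row for a query with no aggregation is the kind of figure that justifies a closer look at the execution plan.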

Making Recommendations

What changes (schema, code, architecture) you recommend will depend, to a considerable degree, on your prior experiences and “confidence” level in the tools and methods used. Remember that your proposed changes may interact with and impact other environmental factors!

Identify which “environmental factors” are impinging on performance. Your recommendation should be able to address the factor.



A cardinal rule of Performance is “never do anything that is not necessary”. For example, when you review a user requirement, ask the questions “Is this requirement necessary? Has it already been met by some portion of the design that the user is not aware of? Should the data be duplicated?” Similarly, when reviewing a system, configuration or code (or a diagnostic trace), ask the questions “Is this component necessary? Is it duplicated? Is the same task being done repeatedly (e.g. a lookup on the same rows or a validation being done twice)?”

Managing Changes

Once the root cause for an issue is identified and recommendations are made, the steps of defining, creating, testing and migrating the change (or changes) required have to be carefully managed.

Some issues can be addressed by workarounds while others may require changes with long term impacts. However, workarounds themselves may have adverse consequences. A reasonable degree of confidence in the impact assessment is a requirement.

Appendix

Example Questionnaire for Issue Definition:

What is the command/process/job that is at issue? What is it called?

Is it a daily task?

How many components (see Factor 2 “Tiers”) does it involve? List each component.

Do you / the team need to evaluate the capacity, usage and throughput of each of the Tiers?

Can a specific Tier be identified as a constraint?

How long does the particular command/process/job take to run?

How long did it take to run on previous occasions?

Was there any variance in run times on previous occasions?

Can a test system / test run be executed?

Can the test be traced (end to end, from the user to system level waits and back to the user)?

Can the production run be traced?

Can both traces be compared?
