Oracle Performance Tuning


A map to the AWR report


Introduction

An average 11g AWR report spans 40 screens broken into approximately 50 sections.

That’s a lot, especially for someone who’s not very familiar with AWR reports, so I decided to make some sort of a map. The purpose is to show that this report has a

certain structure (which may not be obvious at first sight), and knowing this structure

can help extract the most essential information in the fastest way possible.

Types of sections

For simplicity, I break AWR report sections into the following categories:

1) basic (key information)

2) drill-down (provides details on a specific topic briefly covered in the basic section,

such as latches, enqueues etc.)

3) advisories (help find optimal values of parameters)

4) advanced (stuff that is not generally needed, but can be useful on certain occasions

— basically, everything not covered in 1-3).

Basic sections

Basic sections contain information that is most essential to understanding what the

database is going through performance-wise. In most cases, they need to be read and

analyzed in their entirety.

Here is a list:

1) Header (information about the instance, the host, beginning and end snapshots

found on the top of the report)

2) Load profile

3) Waits (“top 5 timed foreground events”)

4) Instance CPU


Drill-down sections

By far the most important of these is “top SQL ordered by executions/elapsed time/CPU time/reads/gets/parse calls/shared memory/versions”, which can be considered a drill-down into the information in the “load profile” and “top timed events” sections. For example, if the load profile shows an unusually high number of executions (e.g. much higher than the number of user calls), SQL ordered by executions will tell you exactly which SQL is responsible for that. If top timed events shows high disk I/O, then SQL ordered by reads may give some answers, etc.

Another useful drill-down section is “Background Wait Events”. If one of the top

foreground events suggests a problem with a background process (e.g. log buffer

space waits indicate a problem with LGWR) then it makes sense to study background

waits that may be relevant.

Other drill-down sections:

o event histograms (detailed distribution by time for timed events)

o latch activity (details for latch-related waits)

o segment stats (details for I/O related waits) etc.

Advanced sections

These include sections that are rarely needed: they apply in case of a special configuration (shared server sections) or special options (java pool) etc.

Advisories

These sections are very different from everything else in the AWR report — they don’t tell you about any existing or potential problems; rather, they tell you how certain statistics would change if certain parameters (mostly the sizes of various memory pools) were changed one way or the other. Nowadays undersized memory pools are not as common as they used to be

in 9i and earlier, so these sections are not needed very often. Go there only if you have

strong reasons to believe that changing these parameters is necessary to resolve an

existing problem.

Navigating from section to section

Generally, it’s advisable to read the report in its natural order (from top down):

1) header (RAC or standalone, duration of the snapshot, Oracle version, platform,

number of CPUs, memory) — just read it to understand what you’re dealing with.

Obviously, if you’re looking at an AWR of a familiar database then you won’t need it.


2) load profile (average active sessions, DB CPU, logical and physical reads, user calls,

executions, parses, hard parses, logons, rollbacks, transactions) — check if the

numbers are consistent with each other and with the general database profile

(OLTP/DWH/mixed)

3) top timed events — see where the database spends most of its time. This section, combined

with the load profile, essentially determines what you’ll be looking for in the rest of the

report

4) if CPU time shows up in the top 5 events with a significant percentage, then make

sure to look at host CPU usage to see if there is a risk of CPU starvation (more on this below)

5) go to top SQL to identify top resource consumers (pay special attention to the

resource which is likely to be scarce or the major source of delays — e.g. if there are

symptoms of CPU starvation, start with SQL ordered by CPU; if most of DB time falls on disk I/O wait events, then go to SQL ordered by reads, etc.)

6) depending on your findings so far, go to one of the drill-down sections, if necessary

7) if you have to (and if you know how to interpret your findings), look for any

additional information available in advanced sections

8) if in previous steps you have found hard evidence that tuning one of the memory parameters would resolve a performance problem, then go to the appropriate advisory section.

Since this is a very popular subject on the OTN forum, I decided to put together a few

points about analyzing AWR reports.

1. Choosing the time period for the AWR report

When troubleshooting a specific problem, one should try to choose the period as close to the duration of the incident as possible. Including snapshots beyond that period would dilute the symptoms of the problem. For example, if the incident occurred between 5:49 pm and 7:06 pm, then with hourly snapshots it's reasonable to pick 5 pm as the start snapshot and 8 pm as the end snapshot (the tightest window that covers the incident). Choosing, say, 3 pm and 9 pm instead would dilute the report with several more hours of normal running.

If the AWR report is generated to get a general feel for the database profile, then it’s preferable to choose a period of peak load, since potential performance bottlenecks are more likely to manifest themselves at such times. On the other hand, one should avoid any atypical activity (e.g. huge reports that are only run once a year) or any maintenance (e.g. an RMAN backup).

Of course, the AWR report period cannot span an instance restart.
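For what it's worth, here is one way to do this from SQL*Plus rather than through the interactive awrrpt.sql script: a sketch that locates the snapshots around a hypothetical evening incident and then generates the text report for exactly that range. The timestamps and bind variables are placeholders.

-- Find the snapshots that bracket the incident window (times are hypothetical)
SELECT snap_id, begin_interval_time, end_interval_time
  FROM dba_hist_snapshot
 WHERE begin_interval_time >= TIMESTAMP '2012-03-02 16:00:00'
   AND end_interval_time   <= TIMESTAMP '2012-03-02 21:00:00'
 ORDER BY snap_id;

-- Generate the text report for the chosen begin/end snapshot pair
SELECT output
  FROM TABLE(DBMS_WORKLOAD_REPOSITORY.awr_report_text(
               :dbid, :inst_num, :begin_snap_id, :end_snap_id));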


2. Choosing a baseline report

When using an AWR report to troubleshoot a specific issue, it is a good idea to generate a second report as a point of reference. When choosing start and end snapshots for such a report, one should take into account application workload periodicity. E.g. if Mondays are busier than other days of the week, then an incident that occurred on a Monday between 2 and 3 am should be compared to a similar period on another Monday, etc.

3. Most informative sections of the report

I find the following sections most useful:

o summary

o top 5 timed events

o top SQL (by elapsed time, by gets, sometimes by reads)

4. Things to look for

o general workload profile (redo per sec, transactions per sec)

o abnormal waits (first of all, concurrency and commit)

o clear leaders in the top SQL (suggestive of a plan-flip kind of performance issue)

5. Things to keep in mind when interpreting the report

It is important not to get obsessed by the ratios in the report, especially ones that you

don’t fully understand. Normally an AWR report doesn’t contain enough evidence for a full analysis of a performance problem; it’s just a starting point. The next logical step is to use higher-resolution tools to pinpoint the root cause of the problem, such as:

1) query AWR views (DBA_HIST%) directly

2) query ASH views (V$ACTIVE_SESSION_HISTORY,

DBA_HIST_ACTIVE_SESS_HISTORY) to link suspicious waits to specific sessions (a sketch follows after this list)

3) take a closer look at top SQL, using rowsource statistics and cardinality feedback

analysis; if necessary, use SQL extended trace
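As an illustration of point 2, a minimal ASH sketch; the wait event name and time window are placeholders, and with V$ACTIVE_SESSION_HISTORY each sample roughly represents one second of activity.

-- Which sessions (and which SQL) accumulated the most samples on a
-- suspicious wait event during the problem window
SELECT session_id, session_serial#, sql_id, COUNT(*) AS samples
  FROM v$active_session_history
 WHERE event = 'log file sync'
   AND sample_time BETWEEN TIMESTAMP '2012-03-02 17:00:00'
                       AND TIMESTAMP '2012-03-02 20:00:00'
 GROUP BY session_id, session_serial#, sql_id
 ORDER BY samples DESC;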

It is a bad idea to use AWR reports when the scope of a performance problem is limited

and known (and yet some people do that). E.g. if users complain about procedure

DOSOMETHING being slow, it’s fine to generate an AWR report to see if the database

is experiencing extra workload, or query AWR views to see if there are changes in the

way users call the procedure, but beyond that one needs to use more targeted tools: DBMS_PROFILER, rowsource statistics, SQL trace etc.

Another bad idea is to get obsessed by some obscure ratio not being perfect in the

AWR report, especially when users are generally happy with the performance. It is

quite common that people run an AWR report just in case, find something that


supposedly shouldn’t be there and then start to plan a potentially expensive and risky

fix for a problem that may not even exist.

For example, when people see log file related waits, they tend to jump to the conclusion that something needs to be done to the redo buffer immediately (of course, making it bigger is the first thing that comes to mind). Before doing anything, one should answer the following questions:

1. What is the size of the problem indicated by the suspicious wait event (‘wrong’ ratio, etc.)? Is it big enough to indicate a real problem? If you are already experiencing a problem — is the effect commensurate with its size? E.g. if everything in the database runs 5 times slower than normal and you see ‘buffer busy waits’ at 3% in the top-5 wait list, then

clearly buffer busy waits are irrelevant (even though everyone knows they’re bad and

shouldn’t be there… in a perfect world).

2. What is it linked to? Could it be a one-time thing? E.g. someone running a huge

report that only runs once a quarter, or uploading a huge amount of data that will only

happen once?

Introduction

The “Load profile” section of the AWR report contains some extremely useful information, and yet it is very often overlooked (often in favor of instance efficiency percentages, which are easier to read but much more likely to mislead). I decided to make some sort

of a short guide for it, describing how different statistics in it can be used to better

understand performance of a database.

Redo size

Everything that you do in a database is protected by redo. Redo is a collection of so-

called “change vectors” that tell Oracle how to repeat an operation on data if

necessary.  Even though SELECTs can also generate some redo, the main sources of

redo are (in roughly descending order): INSERT, UPDATE and DELETE. For INSERTs

and UPDATEs, the size of redo is close to the amount of data created or modified. For DELETEs, you only need to know the rowids of the deleted rows to repeat the operation,

so if the rows are “fat”, then the size of redo may be much smaller than the size of

deleted data.

High redo figures mean that either lots of new data is being saved into the database, or

existing data is undergoing lots of changes.

How high is high? Databases are not created equal, so there is no universal standard.

However, I find it useful to multiply redo per second by 86,400 (the number of seconds in a day) and compare the result to the size of the database — if the numbers are

within the same order of magnitude, then this would make me curious. Is the database

doubling in size every few days? Or is it modifying almost every row on a daily basis?

Or maybe there is something going on that I don’t know about?
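To make that comparison concrete, here is a small sketch of the arithmetic, using for illustration the 1,766 bytes/s redo figure that appears in the small load profile later in this document; the database size is approximated by the total size of its segments.

-- Daily redo volume: redo per second (from the load profile) x 86,400, in MB
SELECT ROUND(1766 * 86400 / 1024 / 1024) AS redo_mb_per_day FROM dual;

-- Approximate database size: total size of all segments, in MB
SELECT ROUND(SUM(bytes) / 1024 / 1024) AS db_segments_mb FROM dba_segments;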

What do you do if you find that redo generation is too high (and there is no business

reason for that)? Not much really — since there is no “SQL ordered by redo” in the

AWR report. Just keep an eye open for any suspicious DML activity. Any unusual

statements? Or usual statements executed more often than usual? Or statements producing more rows per execution than usual? Also, be sure to take a good look at the segment

statistics section (segments by physical writes, segments by DB block changes etc.) to

see if there are any clues there.

Logical reads, block changes, physical reads/writes

Logical reads is simply the number of blocks read by the database, including physical

(i.e. disk) reads, and block changes is fairly self-descriptive. These statistics tell you the nature of the database activity (read-mostly, write-mostly, a little bit of both) and its scale at the time of the report. They also give you an idea of how well data caching works in

the database (but you can also see that directly from the buffer cache hit ratio  in the

“instance efficiencies” section).

If you find those numbers higher than expected (based on usual numbers for this

database, current application workload etc.), then you can drill down to the “SQL by

logical reads” and “SQL by physical reads” to see if you can identify specific SQL

responsible.

User calls

A user call is when a database client asks the server to do something, like logon, parse,

execute, fetch etc. This is an extremely useful piece of information, because  it sets the

scale for other statistics (such as commits, hard parses etc.).

In particular, when the database is executing many times per user call, this could be

an indication of excessive context switching (e.g. a PL/SQL function in a SQL

statement called too often because of a bad plan). In such cases, looking into “SQL

ordered by executions” will be the logical next step.
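To illustrate the pattern, here is a contrived sketch (all object names are invented): a single user call whose plan evaluates a PL/SQL function once per scanned row, so every row adds one more recursive execution of the function's embedded SQL.

-- A lookup function that runs one SQL statement per call (hypothetical objects)
CREATE OR REPLACE FUNCTION get_region(p_cust_id NUMBER) RETURN VARCHAR2 IS
  l_region VARCHAR2(30);
BEGIN
  SELECT region INTO l_region FROM customers WHERE cust_id = p_cust_id;
  RETURN l_region;
END;
/

-- One user call; but if the optimizer applies the function before the more
-- selective date filter, it is evaluated once per scanned row, inflating
-- "executions" far beyond "user calls"
SELECT *
  FROM orders o
 WHERE get_region(o.cust_id) = 'EMEA'
   AND o.order_date >= DATE '2012-03-01';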

Parses and hard parses

A parse is analyzing a query’s text and, optionally, optimizing its plan. If plan optimization

is involved, it’s a hard parse, otherwise a soft parse.


As we all know, parsing is expensive (performance-wise). Excessive parsing can cause

very nasty performance problems (one moment your database seems fine, the next

moment it comes to a complete standstill). Another bad thing about excessive parsing

is that it makes troubleshooting of poorly performing SQL much more difficult.

How much hard parsing is acceptable? It depends on too many things, like number of

CPUs, the number of executions, how sensitive plans are to SQL parameters etc. But as a rule of thumb, anything below 1 hard parse per second is probably okay, and

everything above 100 per second suggests a problem (if the database has a large

number of CPUs, say, above 100, those numbers should be scaled up accordingly). It

also helps to look at the number of hard parses as % of executions (especially if you’re

in the grey zone).

If you suspect that excessive parsing is hurting your database’s performance:

1) check the “time model statistics” section (hard parse elapsed time, parse time elapsed

etc.)

2) see if there are any signs of library cache contention in the top-5 events

3) see if CPU is an issue.

If that confirms your suspicions, then find the source of excessive parsing (for soft

parsing, use “SQL by parse calls”; for hard parsing, use force_matching_signature) and

see if you can fix it.
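As a sketch of the hard-parsing drill-down: grouping V$SQL by FORCE_MATCHING_SIGNATURE collapses statements that differ only in literal values, so the largest families of one-off cursors float to the top (the threshold below is arbitrary).

-- Families of statements that would share a cursor if literals were bound
SELECT force_matching_signature,
       COUNT(*)        AS cursor_count,
       SUM(executions) AS total_executions,
       MIN(sql_id)     AS sample_sql_id
  FROM v$sql
 WHERE force_matching_signature <> 0
 GROUP BY force_matching_signature
HAVING COUNT(*) > 100
 ORDER BY cursor_count DESC;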

Sorts

Sort operations consume resources. Also, expensive sorts may cause your SQL to fail

because of running out of TEMP space.  So obviously, the less you sort, the better (and

when you do, you should sort in memory). However, I personally rarely find sort

statistics particularly useful: normally, if expensive sorts are hurting your SQL’s

performance, you’ll notice it elsewhere first.

Logons

Establishing a new database connection is also expensive (and even more expensive if auditing or logon triggers are involved). “Logon storms” are known to create very serious performance problems. If you suspect that a high number of logons is degrading your performance, check “connection management call elapsed time” in the “Time Model Statistics” section.
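Outside the report, the same figure can be read from the instance-level time model; a sketch (the values are cumulative since instance startup and stored in microseconds):

-- Time spent establishing and tearing down sessions, versus total DB time
SELECT stat_name, ROUND(value / 1e6) AS seconds
  FROM v$sys_time_model
 WHERE stat_name IN ('connection management call elapsed time', 'DB time');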

Executes

The executes statistic is very important for analyzing performance, but what I had to say about it I’ve already said above in the “user calls” and “parses and hard parses” sections.


Transactions

This is another extremely important statistic, both on the general (i.e. creating context

for understanding the rest of the report) and specific (troubleshooting performance

problems related to transaction control) levels. The AWR report provides information

about transactions and rollbacks, so the number of commits can be calculated as the difference between the two. Rollbacks are expensive operations, and can cause performance problems if used improperly (e.g. in tests, to revert the database to the original state after testing); this can be addressed either by reducing the number of rollbacks or by tuning the rollback segments. Rollbacks can also indicate that a branch of code is failing and is thus forced to roll back its results (this can go unnoticed if the resulting errors are not processed or rethrown properly).

Excessive commits can lead to performance problems via “log file sync” waits.

How many is excessive? Once again, this entirely depends on the database. Obviously,

OLTP databases commit more than DWH ones, and between OLTP databases the

numbers can vary by several orders of magnitude. For the databases that I have worked with, below 10-20 commits per second there never was a problem, and above 100-200 there almost always was (when in doubt, look at “top timed events”: if there are no “log file

sync” waits up there, then you’re probably okay!).
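For a quick check outside the report, the underlying counters are in V$SYSSTAT (cumulative since instance startup, so two readings taken a known interval apart are needed to get a per-second rate):

-- Cumulative commit and rollback counters since instance startup
SELECT name, value
  FROM v$sysstat
 WHERE name IN ('user commits', 'user rollbacks');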

Let’s start with some basic concepts. AWR reports deal with several kinds of time. The

simplest kind is the elapsed time: it’s just the interval of time between the start and

end snapshots. Another important quantity is DB time, which is defined as time in user

calls during that period. It can be (and for a busy system typically is) greater than the

elapsed time. However, the reason for that is not the number of CPUs as some experts

incorrectly state (apparently, they confuse it with the CPU time that we’ll discuss below); it’s that this time is a sum over all active user processes which are using

CPU or waiting for something. Note that it only counts time spent in user calls, i.e.

background processes are not included in that.

Another important quantity is database CPU time. It can also exceed the elapsed time,

because the database can use more than one CPU. Unfortunately, AWR reports use up

to 3 different names for it: CPU time, DB CPU, and CPU used by this session.

Normally, they should have close values, and differences can probably be attributed to

connection management (e.g. establishing or tearing down a session). And of course

“CPU used by this session” is an odd name for an instance-level metric, but that’s

understandable: it’s just a sum of a session-level metric over all sessions.


CPU time represents time spent on CPU and does not include time waiting for CPU.

Unfortunately, the latter quantity is not accessible via AWR (but there are indirect ways of extracting it via ASH).

Finally, CPU consumption in the host operating system can also be important for

troubleshooting high CPU usage. AWR provides these numbers in the “Operating System Statistics” section (as “BUSY_TIME” and “IDLE_TIME”; the units are centiseconds).

DB time and DB CPU define two important timescales: wait times should be measured

against the former, while CPU consumption during a certain activity (e.g. CPU time spent parsing) should be measured against the latter.

High CPU time

CPU usage is described by “CPU time” (or “DB CPU”) statistics. Somewhat

counterintuitively, an AWR report showing CPU time close to 100% in the top timed events section does not necessarily indicate a problem. It simply means that the database is busy using CPU to do work for its users. However, if CPU time (expressed in CPU seconds) becomes commensurate with the total CPU power available on the host (or shows consistent growth patterns), then it becomes a problem, and a serious one: this means that, at best, Oracle processes will spend a lot of time waiting in the CPU run queue. In

the worst case scenario, the host OS won’t have adequate resources to run and may

eventually hang.

Unfortunately, AWR reports only provide CPU time estimates either in absolute units

or as a percentage of DB time, but not in terms of the overall capacity. It’s not wrong:

you need to know what percentage of user calls falls on CPU time to see whether or

not it’s contributing appreciably to response times. But it’s not complete, because

when talking about resource usage you need to know what % of total resource

available is being used. Fortunately, it’s quite simple to calculate that:

DB CPU usage (% of CPU power available) = CPU time / NUM_CPUS / elapsed time

where NUM_CPUS is found in the Operating System statistics section. Of course, if

there are other major CPU users in the system, the formula must be adjusted

accordingly. To check that, look at OS CPU usage statistics either directly in the OS (using sar or another utility available on the host OS) or by looking at BUSY/(BUSY+IDLE) from the Operating System Statistics section and comparing it to the number above.
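Plugging in the numbers from Example 1 below (47,879 s of CPU time, 32 CPUs, a 30.04-minute snapshot) gives a quick sanity check of the formula:

-- DB CPU usage (% of host capacity) = CPU time / NUM_CPUS / elapsed time
SELECT ROUND(100 * 47879 / 32 / (30.04 * 60), 1) AS db_cpu_pct_of_host
  FROM dual;
-- roughly 83%, i.e. the database alone is consuming most of the host's CPU capacity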

If DB CPU usage is at 80-90% of the capacity (or 70-80% and growing), then you should try to reduce CPU usage or, if that is not possible, buy more CPU power before the system freezes.

To reduce high CPU usage one needs to find its source within the database. The first

thing to check is parsing, not only because this is a CPU-intensive activity, but also

because high parsing means lack of cursor sharing, which makes diagnostics very


difficult: each statement is parsed to its own sql_id, spreading database workload over

thousands of statements which only differ by parameter values. Of course, this makes

all “SQL ordered by” lists in the AWR report useless.

If parsing is reasonable, then one needs to look at the SQL statements consuming the most

CPU (“SQL ordered by CPU time” in the CPU section of the report) to see if there is

excessive logical I/O that could be reduced by tuning, or some expensive sorts that

could be avoided, etc. It could also be useful to check “segments by logical reads” to

see if partitioning or a different indexing strategy would help.

Unaccounted CPU time

Occasionally, CPU time may underestimate the actual CPU usage because of errors

and holes in database and OS kernel code instrumentation — then one needs to rely on

OS statistics to figure out how much of the OS CPU capacity the database is using.

In this case, when looking for the source of high CPU usage within the database, in

addition to OS tools (top, sar, vmstat etc.) one can use indirect indications of high CPU

consumption, such as:

- missing time in the “timed events” section (sum of percentages in top-5 significantly

below 100%)

- high parsing (ideally CPU usage during parsing should be accounted for in “CPU time”, but that’s not always the case)

- mutex-related waits, such as “cursor: pin S wait on X” etc. (either because of high parsing, or bugs, or both)

- logon storms (a high number of logons in a short time)

- resource manager events (“resmgr: cpu quantum”),

or look in ASH for sessions with the “ON CPU” state and see what they are doing.
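For the last point, a minimal ASH sketch (the time window is a placeholder): counting samples in the ON CPU state by SQL_ID shows which statements are burning the CPU.

-- Which SQL was most often sampled on CPU during the problem window
SELECT sql_id, COUNT(*) AS on_cpu_samples
  FROM v$active_session_history
 WHERE session_state = 'ON CPU'
   AND sample_time BETWEEN TIMESTAMP '2012-03-08 02:00:00'
                       AND TIMESTAMP '2012-03-08 04:30:00'
 GROUP BY sql_id
 ORDER BY on_cpu_samples DESC;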

Examples

Let’s consider a few examples.

Example 1

WORKLOAD REPOSITORY report for

DB Name         DB Id    Instance     Inst Num Release     RAC Host
------------ ----------- ------------ -------- ----------- --- ------------
xxxx         xxxxxxxxx   xxxx               1 10.2.0.4.0  NO  xxxxxxxxx

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     66607 02-Mar-12 12:00:52       648      19.6
  End Snap:     66608 02-Mar-12 12:30:54       639      21.4
   Elapsed:               30.04 (mins)
   DB Time:            3,436.49 (mins)


...
Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
resmgr: cpu quantum                 475,956     152,959    321   74.2  Scheduler
CPU time                                         47,879          23.2
db file sequential read           3,174,880      15,866      5    7.6 User I/O
db file scattered read              196,255       4,078     21    2.0 User I/O
log file sync                       157,730       4,579     29    4.4 Commit
...
-> Total time in database user-calls (DB Time): 104720.3s
...
Operating System Statistics           DB/Inst: ****/****  Snaps: 66607/66608

Statistic                                       Total
-------------------------------- --------------------
...
BUSY_TIME                                   5,707,941
IDLE_TIME                                           1
...
NUM_CPUS                                           32
-------------------------------------------------------------

This is a simple case: the report has “CPU starvation” written all over it. CPU time

(47,879s) — even though not the largest wait event in the database — is close to the

maximum capacity (32 x 30 min x 60 sec/min = 57,600). The top wait event (resmgr:

cpu quantum) indicates that the database user calls are spending most of their time

waiting for the Resource Manager to allocate CPU resource to them — that’s another

symptom of extreme CPU starvation. And finally, OS stats are confirming that CPU is

completely maxed out: 1 centisecond of idle time versus 5,707,941 busy!

Fortunately, SQL ordered by CPU time is just as unambiguous: it showed one SQL

statement responsible for 60.99% of DB time, and fixing it (it was a bad plan with poor

table ordering and millions of context switches because of PL/SQL function calls)

fixed the entire database.

Now let’s consider something less trivial.

Example 2

WORKLOAD REPOSITORY report for

DB Name         DB Id    Instance     Inst Num Release     RAC Host
------------ ----------- ------------ -------- ----------- --- ------------
xxxx         xxxxxxxxx   xxxx               1 10.2.0.5.0  NO  xxxxxxxxx

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     38338 08-Mar-12 02:00:40       673       6.7
  End Snap:     38339 08-Mar-12 04:29:22       760       5.6
   Elapsed:              148.70 (mins)
   DB Time:           77,585.95 (mins)
...
Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
cursor: pin S                  ############   2,072,643      2   44.5      Other
cursor: pin S wait on X          76,424,627     929,303     12   20.0 Concurrenc
latch free                            1,958     246,702 ######    5.3      Other
CPU time                                         58,596           1.3
log file sync                       746,839      44,076     59    0.9     Commit
          -------------------------------------------------------------
...
-> Total time in database user-calls (DB Time): 4655157.1s
...
Operating System Statistics

Statistic                                       Total
-------------------------------- --------------------
...
BUSY_TIME                                   6,327,511
IDLE_TIME                                      24,053
...
NUM_CPUS                                            7
          -------------------------------------------------------------

There are quite a few remarkable things in this report. And there is a good story to it,

too, but I’m hoping to make a separate post about it, so let’s focus on CPU stuff here.

The time period of the report spans 148 min, but DB time is 77,586 min, which means

that there were ~524 active sessions on the average. If we compare that to the number

of sessions (673/760 beginning/end), we can see that either the database was terribly busy, or, more likely, most of the sessions were waiting on something. The list of timed events confirms this: it shows massive mutex contention in the library cache.

Now let’s look at the CPU time here. It’s 58,596 s, or just 1.3% of DB time… negligible!

Or is it?… Let’s compare it to the total CPU time available: 148 minutes times 7 CPUs

times 60 seconds per minute equals 62,454 s — i.e. the database alone was responsible

for 93.7% of the CPU time during a 2.5 hour interval! More likely, it started off at a

moderate level, and then for a good portion of the interval stayed close to 100%, which

averaged to 93.7%.

If we look again at the wait events, we don’t find any explicit sign of CPU starvation. However, if we do the math, we can find an indirect indication: 44.5+20+5.3+1.3+0.9=72, so where did the remaining 28% go?… Also, cursor: pin S


wait on X and cursor: pin S are both mutex waits, which can burn CPU at a very high rate. This gives us a good idea of how the CPU is wasted (and if one

looks in ASH, one can find where exactly it happens, but that’s beyond the scope of

this post).

In this case, “SQL ordered by CPU time” was useless for finding the source of high

CPU usage, because many SQL statements were not using binds. The culprit was found

by looking in ASH (actually, that requires a bit of work, too, but I’m hoping to make

a separate post about it), and fixing it fixed the problem.

Let’s consider another case.

Example 3

WORKLOAD REPOSITORY report for

DB Name         DB Id    Instance     Inst Num Release     RAC Host
------------ ----------- ------------ -------- ----------- --- ------------
xxxx         xxxxxxxxx   xxxx               1 10.2.0.4.0  NO  xxxxxxxxx

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     33013 02-Apr-12 10:00:00       439      27.1
  End Snap:     33014 02-Apr-12 11:00:12       472      24.4
   Elapsed:               60.20 (mins)
   DB Time:              520.72 (mins)
...
Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
CPU time                                         15,087          48.3
db file sequential read          28,442,386       8,758      0   28.0   User I/O
enq: TX - row lock contention         1,459       3,633   2490   11.6 Applicatio
log file sync                        89,026       2,922     33    9.4     Commit
db file parallel write              169,289       2,783     16    8.9 System I/O
...
Operating System Statistics

Statistic                                       Total
-------------------------------- --------------------
...
BUSY_TIME                                   5,707,941
IDLE_TIME
...
NUM_CPUS                                           64

Here, CPU time is responsible for almost half of the DB time. This looks big. Does this

mean we should rush to buy more (or faster) CPUs? Probably not, since the CPU time

(15,087 s) is only a small fraction of the available CPU resource (64 CPUs x 60 mins x 60 s =


230,000s). OS stats also show that CPU is not a scarce resource on the system

(211,335s idle vs 19,831s busy).

Of course, this doesn’t mean that tuning SQL to reduce CPU consumption won’t help

here — it will; it just won’t have a global effect. Therefore, it would make sense to tune

based on business priority, not on the amount of CPU usage.

Conclusion

Troubleshooting high CPU usage with AWR reports can be tricky and may require

other tools (like ASH). While most waits are compared to DB time, CPU time should

also be compared to the total CPU capacity on the host.

In my previous post I described some sections that are typically useful when

interpreting AWR data. However, sometimes the answer comes from an unexpected

source. For example, the workload profile section of the report contains key

information for understanding what the database looks like, but it seldom gives a

direct answer to the problem (except for maybe excessive parsing and excessive

commits). But recently I came across a case when this section was enough to identify

the root cause of a non-trivial issue:

                 Per Second   Per Transaction

Redo size:     1,895,241.12 12,004.40

Logical reads: 832,945.54   5,275.85

Block changes: 11,937.82    75.61

Physical reads: 7,458.75    47.24

Physical writes: 759.33     4.81

User calls:      449.83     2.85

Parses:          225.18     1.43

Hard parses:      15.90     0.10

Sorts:           467.90     2.96

Logons:            1.38     0.01

Executes:    103,266.84     654.09

Transactions:    157.88

This excerpt came from an AWR report for a database that virtually froze with 100% CPU consumption on the box. The question was what was causing this high CPU consumption (the SAs had ruled out the other processes on the box).

Looking carefully at the numbers above, one could notice that the number of executes per second looks enormous. This becomes even more apparent when looking at the rate of user calls, which is a few orders of magnitude lower. These numbers, combined with the high CPU usage, are enough to make context switching the primary suspect: a SQL statement containing a PL/SQL function, whose embedded SQL gets executed hundreds of thousands of times per call of the outer statement.


Further investigation confirmed that it was indeed the case. There was a stats job

running shortly before the incident, leading to invalidation of the SQL plan, and the

new plan was calling the PL/SQL function at an early stage, before most rows were

eliminated.

The point I am trying to make is that one should try to maintain a good balance

between focusing on just a few key performance indicators and paying attention to

secondary details as well.

Load Profile

This section gives a glimpse of the database workload activity that occurred within the snapshot interval. For example, the load profile below shows that an average transaction generates about 18K of redo data, and the database produces about 1.8K redo per second.

 

Load Profile
~~~~~~~~~~~~                  Per Second       Per Transaction
                           --------------      ---------------
             Redo size:          1,766.20            18,526.31
         Logical reads:             39.21               411.30
         Block changes:             11.11               116.54
        Physical reads:              0.38                 3.95
       Physical writes:              0.38                 3.96
            User calls:              0.06                 0.64
                Parses:              2.04                21.37
           Hard parses:              0.14                 1.45
                 Sorts:              1.02                10.72
                Logons:              0.02                 0.21
              Executes:              4.19                43.91

The above statistics give an idea of the workload the database experienced during the time observed. However, they do not indicate what in the database is not working properly. For example, if there is a high number of physical reads per second, this does not mean that the SQLs are poorly tuned.

Perhaps this AWR report was built for a time period when large DSS batch jobs ran on the database. This workload information is intended to be used along with information from other sections of the AWR report in order to learn the details about the nature of the applications running on the system.  The goal is to get a correct picture of database performance.

The following list includes detailed descriptions for particular statistics:

Redo size: The amount of redo generated during the report interval.

Logical Reads: The sum of consistent gets and db block gets.


Block changes: The number of blocks modified during the sample interval.

Physical Reads: The number of requests for a block that caused a physical I/O operation.

Physical Writes: Number of physical writes performed.

User Calls: The number of calls (such as logon, parse, execute and fetch) issued by user clients.

Parses: The total of all parses; both hard and soft.

Hard Parses: The parses requiring a completely new parse of the SQL statement. These consume both latches and shared pool area.

Soft Parses: Soft parses are not listed but derived by subtracting the hard parses from parses. A soft parse reuses a previous hard parse; hence it consumes far fewer resources.

Sorts, Logons, Executes and Transactions: All self-explanatory.

Parse activity statistics should be checked carefully because they can immediately indicate a problem within the application. For example, if a database has been running for several days with a fixed set of applications, it should, over the course of time, have parsed most of the SQL issued by the applications, and these statistics should be near zero.

If there are high values of the Soft Parses or, especially, Hard Parses statistics, such values should be taken as an indication that the applications make little use of bind variables and produce large numbers of unique SQLs. However, if the database is used for development, high values of these statistics are not bad.

The following information is also available in the workload section:

  % Blocks changed per Read:    4.85    Recursive Call %:    89.89

 Rollback per transaction %:    8.56       Rows per Sort:    13.39

The % Blocks changed per Read statistic indicates that only 4.85 percent of all blocks are retrieved for update, and in this example the Recursive Call % statistic is extremely high at about 90 percent. However, this fact does not mean that nearly all SQL statements executed by the database are caused by parsing activity, data dictionary management, space management, and so on.

Remember, Oracle considers all SQL statements executed within PL/SQL programs to be recursive. If the applications make heavy use of stored PL/SQL programs, a high recursive call percentage is expected and is not a cause for concern. However, for applications that do not widely use PL/SQL, high recursive activity indicates the need to investigate its cause further.


It is also useful to check the value of the Rollback per transaction % statistic. This statistic reports the percentage of transactions rolled back. In a production system, this value should be low. If the output indicates a high percentage of transactions rolled back, the database expends a considerable amount of work rolling back changes, and this should be investigated further to see why the applications roll back so often.

If you have worked in IT long enough, it is hard to miss the acronym "AWR". AWR is short for Automatic Workload Repository, and the AWR report is probably the first thing out of a DBA's mouth at the mention of performance problems in your application. If you are like most people, your head will start spinning when you happen to glance at the report. You are not alone; most DBAs don't understand 90% of what is in the report or how to make sense of it. Most of the time DBAs tend to look at such reports with a preconceived bias, since they are looking for patterns they are familiar with, like full table scans, too much CPU use or too much disk I/O, and then slant their findings accordingly. So what is a layman with reasonable intelligence to do when faced with such a report, and how does one validate what the DBA is saying?

So here goes. Before we do anything, a little history. AWR is the pièce de résistance of what is called the Oracle Wait Interface (OWI), one of the features that sets Oracle apart from other databases. While evolving the Oracle engine over the years, Oracle realized the importance of measuring every touch point of a SQL statement as it progresses through the RDBMS engine. The OWI was the result; it was initially very cumbersome to read, analyze and diagnose. As releases of Oracle have come and gone, the OWI has been fine-tuned to the point that today it produces a neat report (by default every hour) recording all activity in the database and capturing every wait event the SQLs were subjected to. No special switch or extra software is required: from Oracle 10g onwards AWR is ready to go out of the box. The DBA can control the frequency of snapshot generation based on need, and also the retention period of the records, so that you can go back in time if needed.
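For reference, a sketch of how the snapshot interval and retention are typically adjusted (both values are in minutes; the numbers below are arbitrary examples):

-- Keep 30 days of snapshots, taken every 30 minutes
BEGIN
  DBMS_WORKLOAD_REPOSITORY.modify_snapshot_settings(
    retention => 30 * 24 * 60,   -- minutes
    interval  => 30);            -- minutes
END;
/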

OK, coming back to reading the AWR report: the first thing you want to establish is whether the issue is really caused by the DB. To do this, the best thing is to glance at the DB Time, which is reported at the very start of the report.

At the very bottom I have culled out 4 tables from the many you would encounter in an AWR report, to illustrate how you can make a fairly good inference by glancing at a few key data points instead of getting intimidated by the sea of data in an AWR report. We will refer to this data below for our analysis.

Let's start with the first table. Looking at the 180 mins of elapsed time (meaning this report covers 3 hrs), the application is spending roughly 320 mins in the DB. What this implies is that roughly 320/180 = 1.8 DB seconds are being spent for every elapsed second. Confusing? In a DB there are thousands of transactions at any given second, and servers have more than one CPU, so multiple transactions can run in parallel. For example, if 2 transactions each ran for a full second within the same second, the DB Time would be 2 seconds; 10 concurrent transactions in a second imply 10 DB seconds, and so on. This is why you see DB Time being more than the wall clock: in our case, 320 DB minutes in 180 wall-clock minutes. DB Time is the total time spent by sessions in the DB doing active work, which includes time spent on CPU, on I/O and on other waits. Consequently, the higher the DB Time for a given hour, the higher the load on the DB. So for a 60 min period, if you saw the DB Time as 600 mins, that implies a busier DB, because you are executing more transactions concurrently in any given minute.

Now let's move on to the second table. Here, if you look at the DB time spent per second, you will see that it is 1.8 DB seconds, meaning that on average there are about 1.8 sessions active in the database doing real work. For example, in our case a DB Time of 320 mins divided by a wall clock of 180 mins gives you roughly 1.8 active sessions per second. The higher the number of active sessions in a given second, the higher the load on the DB.

To cross-check, search for "user commits" in the report, or see Table 3 below. In the 3-hour period we had about 12,000 transactions; this, multiplied by the 1.6 DB seconds per transaction (column 3 of Table 2), gives you back the 320 DB mins spent by the DB executing SQLs. Obviously, you want the DB Time spent per transaction to be as small as possible.

Now we have to see if we can break this DB Time down into its components and see how the time is distributed: how many seconds did the SQL spend executing on the CPU, doing I/O, or waiting for a lock (enqueues, latches etc. are too complicated for now; just think of them all as being similar to locks, primarily used to control concurrent access to common objects like tables, rows, etc.). I am also excluding interconnect latency, network time etc. from our discussion for now.

First, search for "Top 5 Timed Foreground Events" in the report, or look at Table 4 below. Now look at the % DB Time column and pay attention to the rows that have a higher value for this column, since these are the prime drivers of DB Time. In the above example you can see that almost 40+28 = 68% of DB Time is consumed by the 2 top events. Both of these are I/O related. So now at least you know where to look: are your SQLs returning too many rows, is the I/O response on the server poor, is the DB not sized to cache enough data, etc.

The 3rd row in Table 4 indicates that 19% of DB Time is spent on row locks, meaning you have sessions wanting to change the same set of rows but unable to do so all at once until the holder of the lock finishes its change. This usually indicates a code problem: check for unnecessary access to the same rows, or a single-row table used to implement serialization; typically such applications update a master table at the start of a transaction and then go do a bunch of work before coming back and committing or rolling back the update on the master table. In apps that have a lot of sessions this will cause a backlog of waiting sessions because the locks are not released fast enough; eventually your app server will run out of connection threads and the whole thing stops.

Now, the 4th row in Table 4, DB CPU, is critical; in CPU-bound databases you will see this as the top event. There is a very easy way to see how much CPU is used by the DB. DB CPU was about 2,335 s, or 39 mins, for the whole 3 hours. So 39 mins out of a total DB Time of 320


mins is only about 12%, and we can conclude that in our example most of the DB Time is spent doing I/O.

Another interesting tidbit: look for "Host CPU" in the report to find the number of CPUs on the DB server:

Host CPU (CPUs: 6 Cores: 3 Sockets: ) 

So we have 6 CPUs, meaning in a 60 min hour we have 60 x 6 = 360 CPU mins; for 3 hours we have 1,080 CPU mins, and we used only 39 CPU mins, i.e. only 39/1080 = 3.6% of all available CPU on the box! Tiny indeed! If you had a CPU-bound DB, you would probably see DB CPU at more like 900-1000 mins, and that is not a good sign. It usually indicates contention for latches, or SQLs doing too many logical I/Os, or a lot of parsing due to the application not using bind variables, etc. More on these later, but at the very least I hope this write-up gives you the ability to quickly look at a few data points and infer what is ailing the performance of your database.