Understanding LGWR, Log File Sync Waits and Commit Performance

30
WWW.E2SN.COM Enterprise 2.0 Service Networks © 2009-2010 www.e2sn.com 1 Understanding LGWR, Log File Sync Waits and Commit Performance Tanel Põder http://blog.tanelpoder.com http://tech.e2sn.com

Transcript of Understanding LGWR, Log File Sync Waits and Commit Performance

Page 1: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

1

Understanding LGWR, Log File Sync Waits and

Commit Performance

Tanel Põder

http://blog.tanelpoder.com

http://tech.e2sn.com

Page 2: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

2

Intro: About me

• Tanel Põder

• http://tech.e2sn.com - My company and technical Oracle stuff

• http://blog.tanelpoder.com - Personal Blog (more tech stuff)

[email protected] - Questions & enquiries

• Consulting, Training, Seminars

[email protected]

• http://tech.e2sn.com/oracle-training-seminars

• Online seminars coming soon!

Page 3: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

3

Topics

• How does commit & log file sync work

• Overview

• Reasons for “too long” log file sync waits

• How to measure where’s the problem?

Page 4: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

4

Reasons for log file sync waits

• Commits wait for log file sync by default

• User commits

• There’s an user commits statistic in v$sesstat

• DDL

• Resulting recursive transactions commit

• Recursive data dictionary DML

• Rollbacks wait too!

• User rollbacks

• User/application issued a rollback command

• Transaction rollbacks

• We had an internal rollback (because of some failure)

• Space allocation/ASSM problems, cancelled queries, killed sessions

Page 5: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

5

Commit & log file sync flow – idealistic overview

Time1) User issues a

COMMIT

1) User issues a

COMMIT

FG proc.

LGWR

IO

2) Foreground

proc posts

LGWR

2) Foreground

proc posts

LGWR

3) LGWR issues

the physical

write syscall

3) LGWR issues

the physical

write syscall

5) LGWR posts

the FG proc.

5) LGWR posts

the FG proc.

4) The physcical

write syscall

completes

4) The physcical

write syscall

completes

6) Commit

complete

6) Commit

complete

log file parallel write

log file sync

Page 6: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

6

Log file sync performance -> disk IO speed

Time

FG proc.

LGWR

IO

The physical write IO

(log file parallel

write) takes most of

the time

The physical write IO

(log file parallel

write) takes most of

the time

Most of the log file sync time was spent waiting on log file

parallel write

The other components, scheduling latency, IPC were small

Page 7: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

7

Log file sync performance -> scheduling latency

Time

FG proc.

LGWR

IO

2) LGWR

waits in CPU

runqueue

2) LGWR

waits in CPU

runqueue

4) IO completes,

OS puts LGWR

to CPU

runqueue

4) IO completes,

OS puts LGWR

to CPU

runqueue

3) LGWR

submits the

IO, goes to

sleep

3) LGWR

submits the

IO, goes to

sleep

5) LGWR gets

onto CPU, posts

foreground

proc.

5) LGWR gets

onto CPU, posts

foreground

proc.

6) Foreground proc

gets posted, gets

onto CPU runqueue

6) Foreground proc

gets posted, gets

onto CPU runqueue

1) User issues a

COMMIT

1) User issues a

COMMIT

Page 8: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

8

Log file sync flow

1. Foreground process (FG) posts LGWR and goes to sleep

• The “log file sync” wait starts

• Posting is done via a semaphore operation on Unix/Linux

2. LGWR wakes up, gets onto CPU

• Issues the IO request(s)

• LGWR goes to sleep, waiting for “log file parallel write” wait

3. Hardware completes the IO and OS wakes up LGWR

• LGWR gets onto CPU

• Marks “log file parallel write” event complete and posts the FG

4. Foreground process is woken up by LGWR post

• Foreground process gets onto CPU and completes the “log file sync” wait

Page 9: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

9

Measuring LGWR "speed"

SQL> @snapper out 1 10 1096

-- Session Snapper v2.01 by Tanel Poder ( http://www.tanelpoder.com )

---------------------------------------------------------------------------------

SID, USERNAME , TYPE, STATISTIC , DELTA, HDELTA/SEC, %TIME, GRAPH

---------------------------------------------------------------------------------

1096, (LGWR) , STAT, messages sent , 12 , 12,

1096, (LGWR) , STAT, messages received , 10 , 10,

1096, (LGWR) , STAT, background timeouts , 1 , 1,

1096, (LGWR) , STAT, physical write total IO requests , 40, 40,

1096, (LGWR) , STAT, physical write total multi block request, 38, 38,

1096, (LGWR) , STAT, physical write total bytes, 2884608 , 2.88M,

1096, (LGWR) , STAT, calls to kcmgcs , 20 , 20,

1096, (LGWR) , STAT, redo wastage , 4548 , 4.55k,

1096, (LGWR) , STAT, redo writes , 10 , 10,

1096, (LGWR) , STAT, redo blocks written , 2817 , 2.82k,

1096, (LGWR) , STAT, redo write time , 25 , 25,

1096, (LGWR) , WAIT, LGWR wait on LNS , 1040575 , 1.04s, 104.1%, |@@@@@@@@@@|

1096, (LGWR) , WAIT, log file parallel write , 273837 , 273.84ms, 27.4%,|@@@ |

1096, (LGWR) , WAIT, events in waitclass Other , 1035172 , 1.04s , 103.5%,|@@@@@@@@@@|

-- End of snap 1, end=2010-03-23 12:46:04, seconds=1

Page 10: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

10

LGWR and Asynch IO

oracle@linux01:~$ strace -cp `pgrep -f lgwr`

Process 12457 attached - interrupt to quit

^CProcess 12457 detached

% time seconds usecs/call calls errors syscall

------ ----------- ----------- --------- --------- --------------

100.00 0.010000 263 38 3 semtimedop

0.00 0.000000 0 213 times

0.00 0.000000 0 8 getrusage

0.00 0.000000 0 701 gettimeofday

0.00 0.000000 0 41 io_getevents

0.00 0.000000 0 41 io_submit

0.00 0.000000 0 2 semop

0.00 0.000000 0 37 semctl

------ ----------- ----------- --------- --------- --------------

100.00 0.010000 1081 3 total

This system call is not

instrumented by wait

Interface!

This system call is not

instrumented by wait

Interface!

This is what the log

file parallel write wait

event measures – AIO

reaping duration

This is what the log

file parallel write wait

event measures – AIO

reaping duration

Page 11: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

11

Warning

• The (background processes) IO instrumentation has quite a

few bugs

• Different IO modes, sync, asynd, direct, buffered etc

• On some versions, the log file parallel write (and db file

parallel write) aren’t properly instrumented

• Version dependent, for example 11.2.0.1 on Linux

• When filesystemio_options = NONE, IO syscall waits are instrumented

ok (but you don’t wan to use this option)

• When filesystem_io_options = ASYNC, IO reaping waits are all very

short

• However there’s unaccounted time in LGWR’s wait profile

• When filesystem_io_options = SETALL, IO reaping waits are

instrumented properly

Page 12: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

12

Redo, commit related latches and “tuning”

• Redo related latches• redo allocation latches

• Protect allocating space in log buffer / RBA ranges in redolog stream

• redo copy latches

• Used only for keeping track of whether anyone’s copying data into redo log buffer

• …so that LGWR would know to wait for these memory copies to complete before it tries to write buffers to disk

• LGWR will wait for LGWR wait for redo copy wait event in such cases

• Used to be tuned by _log_simultaneous_copies

• Should we “tune” any of these?• No, we should fix only problems which exist

• In other words, if wait interface doesn’t show anyone waiting for them, then don’t bother “tuning” them!

Page 13: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

13

Instrumentation

• Wait Events:

• log file sync

• log file parallel write

• log file single write

• Performance Counters (V$SESSTAT, V$SYSSTAT)

• redo size

• redo writing time

• user commits

• user rollbacks

• transaction rollbacks

Page 14: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

14

Wait event: log buffer space

• (Not a commit problem)

• LGWR is too slow flushing redo log buffer contents to disk

• Either because too slow IO subsystem

• Or LGWR not getting enough (quality) CPU time

• Sometimes pops up due large (unplanned) transactions

• Of course, it can also be because of a too small log buffer

• Which is not the case anymore in modern days

• Log buffer is usually multiple MB due how it is allocated from SGA

• You shouldn’t even set the log_buffer parameter in 10g+

Page 15: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

15

Wait event: log file single write

• Single block redo IO is used mostly for logfile header block

reading/writing

• Log switch is the main cause

• Archiving as well as it updates log header

• Who wait: LGWR & ARCH

• Example of what LGWR does during a log switch:WAIT #0: nam='log file sequential read' ela= 12607 log#=0

block#=1

WAIT #0: nam='log file sequential read' ela= 21225 log#=1

block#=1

WAIT #0: nam='control file sequential read' ela= 358 file#=0

WAIT #0: nam='log file single write' ela= 470 log#=0 block#=1

WAIT #0: nam='log file single write' ela= 227 log#=1 block#=1

Page 16: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

16

LGWR trace warnings…

• Starting from 10.2.0.3 (or was it 10.2.0.4) LGWR is trying to behelpful and dump warnings when the actual log write IO takes too long:• New parameter: _side_channel_batch_timeout_ms

• timeout before shipping out the batched side channelmessages in milliseconds

• LGWR trace file:

*** 2010-03-10 11:36:06.759

Warning: log write time 690ms, size 19KB

*** 2010-03-10 11:37:23.778

Warning: log write time 52710ms, size 0KB

*** 2010-03-10 11:37:27.302

Warning: log write time 3520ms, size 144KB

Page 17: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

17

Log file sync in Statspack / AWR

• How much of the end-to-end response time goes to log file

sync?

• How big it is compared to the full response time of the end user?

• Log file sync may take 20% of your DB Time…

• …but the DB Time itself may take only 10% of the total end user

response time!

Page 18: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

18

Log file sync in Statspack / AWR

• If log file sync waits take a significant part of the response

time, look into the “Avg wait (ms)” column:

Top 5 Timed Events Avg %Total

~~~~~~~~~~~~~~~~~~ wait Call

Event Waits Time (s) (ms) Time

------------------------------------- ------------ ----------- ------ ------

PL/SQL lock timer 57,159 8,754 153 68.8

db file sequential read 61,258 1,562 25 12.3

log file sync 5,873 1,522 259 12.0

CPU time 463 3.6

direct path write 235,606 155 1 1.2

............

Avg %Total

%Tim Total Wait wait Waits Call

Event Waits out Time (s) (ms) /txn Time

------------------------ ------------ ---- ---------- ------ -------- ------

log file parallel write 13,292 0 108 8 0.9 .8

db file parallel write 3,470 0 39 11 0.2 .3

db file sequential read 257 0 4 16 0.0 .0

Page 19: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

19

Better breakdown of wait times

• V$EVENT_HISTOGRAM• Instead of a single wait time average, breaks wait times into buckets

SQL> select event, wait_time_milli,wait_count

2 from v$event_histogram

3 where event = 'log file parallel write';

EVENT WAIT_TIME_MILLI WAIT_COUNT

------------------------- --------------- ----------

log file parallel write 1 22677

log file parallel write 2 424

log file parallel write 4 141

log file parallel write 8 340

log file parallel write 16 1401

log file parallel write 32 812

log file parallel write 64 391

log file parallel write 128 21

log file parallel write 256 6

Page 20: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

20

Identifying scheduling latency

Easy on Solaris

• prstat –m

• Works since Solaris 8

• Timed_os_statistics

• OS Wait-cpu (latency) time in v$sesstat

On other OS’es, it can’t be directly measured with standard tools

• Indirectly, you can look into system wide average CPU

runqueue length and assume that LGWR was also queueing

• Not a too systematic approach, huh?

Page 21: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

21

Measure scheduling latency (Solaris)

• Reading thread-level microstate accounting data with prstat:• USR - % Time spent on CPU in user mode

• SYS - % Time spent on CPU in kernel mode

• TRP - % Time spent processing system traps (CPU traps)

• TFL/DFL - % Time spent processing text/data page faults

• LCK - % Time spent waiting for user locks

• SLP - % Time spent sleeping (other than user locks)

• LAT - % CPU scheduling latency

# prstat -mLp 5124

PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID

5124 oracle 4.2 1.6 0.0 0.6 0.0 0.0 94 0.1 412 20 3K 0 oracle/1

5124 oracle 0.0 0.6 0.0 0.0 0.0 99 0.0 0.0 44 3 90 0 oracle/11

5124 oracle 0.0 0.6 0.0 0.0 0.0 99 0.0 0.0 44 0 90 0 oracle/7

5124 oracle 0.0 0.6 0.0 0.0 0.0 99 0.7 0.0 46 0 90 0 oracle/9

5124 oracle 0.0 0.6 0.0 0.0 0.0 99 0.0 0.0 44 1 90 0 oracle/5

5124 oracle 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 24 0 7 0 oracle/2

5124 oracle 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 1 0 5 0 oracle/3

5124 oracle 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 1 0 5 0 oracle/10

5124 oracle 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 1 0 5 0 oracle/8

Page 22: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

22

Bugs, problems

• No instrumentation

• On some version/platform/IO configuration the wait interface doesn’t

record log file parallel write waits at all

• The same goes for db file parallel writes

• (I’ve noticed it on 9.2-10.2.0.x on Solaris for example)

• For LGWR you can use V$SESSTAT redo write time statistic instead

• It’s in centiseconds

• 1-second log file sync bug

• Most log file syncs took ~1 second to complete

• The posts sent back by LGWR were missed by foreground process

• Thus the FG always waited until the 1 second log file sync wait

timeout happened

Page 23: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

23

Tuning

• No need for tuning!• Log buffer is quite large by default

• All memory remaining in a granule after the allocation for fixed SGA is given to log buffer

• Oracle used to have a single redo log buffer until v9.0

• Redo allocation latch could become the ultimate contention point

• Since 9.2, Oracle can have the log buffer split into multiple buffers

• Each protected by a separate redo allocation latch

• From 10g, Oracle can keep lots of small private redo strands in shared pool

• Each protected by a separate redo allocation latch

• @rs.sql – Show redo strands available

Page 24: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

24

Evil tuning

• If you don’t care about the D in ACID (and want to

occasionally lose data for fun), then:

• 10gR1:

• commit_logging transaction commit log write behavior

• 10gR2:

• commit_write transaction commit log write behavior

• commit_wait transaction commit log wait behavior

• Old undocumented stuff

• _wait_for_sync wait for sync on commit MUST BE ALWAYS TRUE

• Old: Put redologs to /tmp (on Solaris) or in-memory disks (/dev/shm)

for duration of a migration/upgrade

• If your OS / server crashes, you’ll need to restore from a backup!

Page 25: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

25

Optimizations for working around bad applications

• Commit optimization

• In PL/SQL since Oracle 9i

• The log file sync is deferred until the end of the PL/SQL call!

SQL> exec while true loop update t set a=a+1 ; commit ;

end loop;

• No log file sync waits

• log buffer space / log file switch completion waits more likely!

Page 26: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

26

LGWR configuration – CPU

• Prevent priority decay

• Put LGWR into fixed priority scheduling class (FX60 on Solaris)

• LGWR should get onto CPU faster when waking up

• LGWR isn’t thrown off CPU as likely

• If LGWR is still experiencing significant scheduling latency

• You can put LGWR into a higher priority class

• You should not put LGWR into the highest real-time class

• Real time is tricky – your process can monopolize a CPU for itself

• You don’t want to make LGWR pre-empt the OS kernel!

• Note that Oracle sets some processes into higher priority by default:

• _high_priority_processes LMS*|VKTM

Page 27: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

27

LGWR configuration - IO

• Reduce the amount of work and waiting a “log file parallel

write” has to do

• Unbuffered concurrent IO

• Verify with truss/strace whether proper flags are used

• (O_DIRECT, O_DIO, O_CIO etc)

• Or use raw devices

• ASM is essentially a raw device

• Or ODM for some cases

• And optimize the whole IO hardware stack, of course!

• Note shat mid-large size storage arrays do have write cache built in

• So, moving redo log files to SSD may not give any advantage!

• Verify what’s your current “log file parallel write” latency using

v$event_histogram

Page 28: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

28

log file sync magic tuning super-secret!!!

COMMIT LESS!!!Commit when your business transaction ends, not after every

single update!

Page 29: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

29

Summary

• Application: Commit less!

• Ideally only when your logical business transaction ends

• Troubleshooting: Measure log file sync at session level detail

• CPU: If waits for log file sync are significant - see whether

LGWR gets:

• Enough (quality) CPU time

• Onto the CPU fast enough

• IO: See how much LGWR waits for log file parallel write event

• What’s the log file parallel write completion time

• V$EVENT_HISTOGRAM for better detail

Page 30: Understanding LGWR, Log File Sync Waits and Commit Performance

WWW.E2SN.COMEnterprise 2.0

Service Networks

© 2009-2010

www.e2sn.com

30

Thanks!

• Download slides from:

• http://tech.e2sn.com/oracle-slides-and-whitepapers

• Download Snapper from:

• http://tech.e2sn.com/oracle-scripts-and-tools/session-snapper

• Blog:

• http://blog.tanelpoder.com

• Email:

[email protected]