B35 Inside rac by Julian Dyke

83
1 DB Tech Showcase - Osaka May 2013 juliandyke.com © 2013 Julian Dyke Julian Dyke Independent Consultant Inside RAC

Transcript of B35 Inside rac by Julian Dyke

Page 1: B35 Inside rac by Julian Dyke

1

DB Tech Showcase - Osaka May 2013

juliandyke.com © 2013 Julian Dyke

Julian Dyke Independent Consultant

Inside RAC

Page 2: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 2

Agenda

OPS versus RAC Buffer Cache Global Cache Services

Page 3: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 3

RAC Overview

Public Network

Shared Storage

Node 1

Instance 1

Node 2

Instance 2

Node 3

Instance 3

Node 4

Instance 4

Private Network

(Interconnect)

Storage Network

Page 4: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 4

OPS versus RAC Oracle 8.0.6 and below

Instance 2

Node 2

OPS - Oracle 8.0.6 and below

Instance 1

Node 1 Interconnect

Shared Storage

Current Writes Consistent Reads

Current Reads All I/O uses shared storage

Enqueues only use interconnect

Page 5: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 5

Instance 2

Node 2

OPS - Oracle 8.1.5 to Oracle 8.1.7 - Cache Fusion Phase 1

Instance 1

Node 1 Interconnect

Shared Storage

Current Writes Consistent Reads

Current Reads Current I/O always uses shared storage

Consistent reads can use interconnect

OPS versus RAC Oracle 8.1.5 to Oracle 8.1.7

Page 6: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 6

Instance 2

Node 2

RAC - Oracle 9.0.1 and above - Cache Fusion Phase 2

Instance 1

Node 1 Interconnect

Shared Storage

Current Writes Consistent Reads

Current Reads Current I/O and

consistent reads can use interconnect

OPS versus RAC Oracle 9.0.1 and above

Page 7: B35 Inside rac by Julian Dyke

© 2006 Julian Dyke juliandyke.com 7

Head of Cold End

Head of Hot End

92

0

34

3

72

4

52

1

71

2

66

0

49

0

42

1

45

2

52

1

71

2

66

0

42

1

11

1

52

1

71

2

11

1

42

1

42

2

71

0

92

0

34

3

72

4

45

2

11

1

52

1

42

2

33

1

45

2

11

1

42

2

33

1

34

4

92

0

34

4

72

4

45

2

11

1

42

0

33

1

71

0

87

1

87

1

72

4

33

1

45

2

Read Block 42

Get first available buffer from cold end

Update buffer contents Insert buffer at head of cold end

Read Block 11

Get first available buffer from cold end

Update buffer contents Insert buffer at head of cold end

Read Block 42

Update touch count for block 42

Read Block 33

Move block 71 to head of hot end

Set touch count on block 71 to zero

Get first available buffer from cold end

Update buffer contents Insert buffer at head of cold end

Read Block 34

Update touch count for block 34

Read Block 87

Move block 42 to head of hot end

Set touch count on block 42 to zero

Get first available buffer from cold end

Update buffer contents Insert buffer at head of cold end

STOP

Block Number

Touch Count

Buffer Cache Single Block Reads

Page 8: B35 Inside rac by Julian Dyke

© 2006 Julian Dyke juliandyke.com 8

Head of Cold End

Head of Hot End

Read Block 1

Get first four available buffers from cold end Read next four blocks into buffers

1 2 3 4

Insert buffers at head of cold end

1 2 1 3 2 1 4 3 2 1

Move block 1 to cold end

1 2 1

Read Block 2

Move block 2 to cold end

2 1 3 2 1 3 4

Read Block 3

Move block 3 to cold end

Read Block 4

Move block 4 to cold end

Read Block 5

Get next four available buffers from cold end Read next four blocks into buffers Insert buffers at head of cold end Move block 5 to cold end

4 3 2 1 5

5 5 6

7 6

7 6 5

8

7 8 5 5 6 5 6 5 6 7 5 6 7 8

Read Block 6

Move block 6 to cold end

Read Block 7

Move block 7 to cold end

Read Block 8

Move block 8 to cold end

STOP

DB_FILE_MULTIBLOCK_READ_COUNT = 4

Buffer Cache Multi Block Reads

Page 9: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 9

Global Services Overview Resource

Object to which access must be controlled at instance level

Enqueue

Memory structure that serializes access to a resource

Global Resources Object to which access must be controlled at cluster level

Global Enqueue

Locks and enqueues which need to be consistent between all instances

Page 10: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 10

Global Services Overview Global Resource Directory (GRD)

Records current state and owner of each resource Contains convert and write queues Distributed across all instances in cluster Maintained by GCS and GES

Global Cache Services (GCS)

Implements cache coherency for database Coordinates access to database blocks for instances

Global Enqueue Services (GES)

Controls access to other resources (locks) including library cache and dictionary cache

Performs deadlock detection

Page 11: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 11

Global Cache Services Introduction Global Cache Services exist to implement Cache Fusion

Cache Fusion allows blocks to be updated by multiple

instances

Only one instance can have the updatable (current) version of a block GCS must ensure that only one instance can update a

block at any time

Many instances can have read-only (consistent read) versions of a block Instances can have multiple copies of same block at

different SCNs

Page 12: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 12

Global Cache Services 2 way Current Read

Instance 1

Instance 2

Instance 4

1318

Request shared resource

Instance 3

Resource Master

Instance 2 requests current read on block

Request granted

S N

Read request

Block returned

1318

1

2

3

4

STOP

Page 13: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 13

Global Cache Services 3-way Current Read

Instance 1

Instance 2

Instance 4

1318

Request exclusive resource

Instance 3

Resource Master

Instance 1 requests exclusive read on block

Transfer block to Instance 1 for exclusive access

S N Block and resource status

Resource status

1318

1

2

3

4

N

N

X

1320

STOP

Page 14: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 14

Global Cache Services 3-way Current Read (Dirty Block)

Instance 1

Instance 2

Instance 4

1318

Request block in exclusive mode

Instance 3

Resource Master

Instance 4 requests exclusive read on block

Transfer block to Instance 4 in exclusive mode

S N

Block and resource status

Resource status

1318

1 2

3

4 N N X

1320 N

N

X 1320 1323

STOP

Note that Instance 1 will create a past image (PI) of the dirty block

Page 15: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 15

Global Cache Services 3-way Current (Without Downgrade)

Instance 1

Instance 2

Instance 4

1318

Request block in shared mode

Instance 3

Resource Master

Instance 2 requests current read on block

Block and resource status

Resource status

1

3

4

N N X

1320 N

N

X 1320 1323

Transfer block to Instance 2 in shared mode

2

STOP

In Oracle 8.1.5 and above _fairness_threshold is used to avoid unnecessary lock conversions

Page 16: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 16

Global Cache Services 3-way Current (With Downgrade)

Instance 1

Instance 2

Instance 4

1318

Request block in shared mode

Instance 3

Resource Master

Instance 2 requests current read on block

Block and resource status

Resource status

1

3

4

N N X

1320 N X

1320 1323

Transfer block to Instance 2 in shared mode

2

S

S

STOP

In Oracle 8.1.5 and above _fairness_threshold is used to avoid unnecessary lock conversions

Page 17: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 17

Global Cache Services Past Images When an instance passes a dirty block to another instance it

Flushes redo buffer to redo log

Retains past image (PI) of block in buffer cache PI is retained until another instance writes block to disk Used to reduce recovery times

Recorded in V$BH.STATUS as PI

Based on X$BH.STATE (value 8 in Oracle 10.2)

Page 18: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 18

Global Cache Services Past Images

7128 7129 UPDATE t1 SET c1 = 7124; COMMIT;

UPDATE t1 SET c1 = 7129; COMMIT;

7123

Instance 1

7123 7124 7125 7126 7127

Buffer Cache

7124 7123

7125 7124

7126 7125

7127 7126

7128

7128 7127

Redo Log 1

Instance 2

Buffer Cache

7129 7128

UPDATE t1 SET c1 = 7125; COMMIT;

UPDATE t1 SET c1 = 7126; COMMIT;

UPDATE t1 SET c1 = 7127; COMMIT;

UPDATE t1 SET c1 = 7128; COMMIT; 7128

7123

Redo Log 2

7123

7128 7129 7129

7129

7129

Assume table t1 contains a single row in block 42

Instance 1 updates column to 7124 Block 42 is read from disk Undo/Redo written to

Redo Log 1 Block 42 is updated in buffer

cache Instance 1 updates column to

7125 Undo/Redo written to

Redo Log 1 Block 42 is updated in buffer

cache Instance 1 updates column to

7126 Undo/Redo written to

Redo Log 1 Block 42 is updated in buffer

cache Instance 1 updates column to

7127 Undo/Redo written to

Redo Log 1 Block 42 is updated in buffer

cache Instance 1 updates column to

7128 Undo/Redo written to

Redo Log 1 Block 42 is updated in buffer

cache Instance 2 updates column to

7129 GCS transfers block from Instance 1 to Instance 2

Instance 1 makes block 42 a Past Image block Undo/redo written to

Redo Log 2 Block 42 is updated in buffer

cache Instance 2 Crashes

Contents of buffer cache are lost DBWR has not written changes

to block 42 back to disk yet Instance 1 must perform recovery for Instance 2

Block 42 needs recovery Instance 1 uses Past Image Undo/redo is applied from

Redo Log 2 Block 42 is subsequently written

back to disk by DBWR

STOP

Page 19: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 19

Global Cache Services Wait Events Wait events show reads where messages have been

exchanged with other instances Can include:

gc cr grant 2-way gc cr block 2-way gc cr block 3-way gc cr multi block request gc current grant 2-way gc current block 2-way gc current block 3-way gc current multi block request

Page 20: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 20

Global Cache Services gc cr block 3-way wait event

Source Destination Description Bytes

RAC4 - Server RAC2 - LMS1 Request file 8 block 15 456

RAC2 - LMS1 RAC4 - Server OK 212

RAC2 - LMS1 RAC3 - LMS1 Send file 8 block 15 to RAC4 480

RAC3 - LMS1 RAC2 - LMS1 OK 212

RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 1 1500

RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 2 1500

RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 3 1500

RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 4 1500

RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 5 1500

RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 6 868

Page 21: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 21

Global Cache Services gc cr block 3-way wait event

RAC1

RAC2

RAC4

1318

RAC3

Resource Master

1,40 2,44

1,42 2,44

UPDATE t1 SET c2 = 50

WHERE c1 = 2;

1

2

3

4 5

10

6 7

8 9

1,42 2,44 1,42 2,44

Page 22: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 22

Global Cache Services gc cr block 2-way wait event 2-way Consistent Read

Source Destination Description Bytes

RAC4 - Server RAC3 - LMS1 Request file 6 block 69 400

RAC3 - LMS1 RAC4 - Server OK 212

RAC3 - LMS1 RAC4 - Server Block file 6 block 69 part 1 1500

RAC3 - LMS1 RAC4 - Server Block file 6 block 69 part 2 1500

RAC3 - LMS1 RAC4 - Server Block file 6 block 69 part 3 1500

RAC3 - LMS1 RAC4 - Server Block file 6 block 69 part 4 1500

RAC3 - LMS1 RAC4 - Server Block file 6 block 69 part 5 1500

RAC3 - LMS1 RAC4 - Server Block file 6 block 69 part 6 868

Page 23: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 23

Global Cache Services gc cr block 2-way wait event

RAC1

RAC2

RAC4

1318

RAC3

Resource Master

1,40 2,44

1,40 2,44

UPDATE t1 SET c2 = 50

WHERE c1 = 2;

1 2

3 4

5 6

7 8

1,40 2,44 1,40 2,44

STOP

Page 24: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 24

Global Cache Services gc current block 3-way wait event 3-way Current Read

Source Destination Description Bytes

RAC4 - Server RAC2 - LMS1 Request file 8 block 15 456

RAC2 - LMS1 RAC4 - Server OK 212

RAC2 - LMS1 RAC3 - LMS1 Send file 8 block 15 to RAC4 480

RAC3 - LMS1 RAC2 - LMS1 OK 212

RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 1 1500

RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 2 1500

RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 3 1500

RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 4 1500

RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 5 1500

RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 6 868

RAC4 - LMS1 RAC2 - LMS1 Received file 8 block 15 244

RAC2 - LMS1 RAC4 - LMS1 OK 212

Page 25: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 25

11

Global Cache Services gc current block 3-way wait event

RAC1

RAC2

RAC4

1318

RAC3

Resource Master

1,40 2,44

1,42 2,44

UPDATE t1 SET c2 = 50

WHERE c1 = 2;

1

2

3

4 5

10

6 7

8 9

1,42 2,44

12

UPDATE t1 SET c2 = 42

WHERE c1 = 1;

RAC3 saves past image of the dirty block until RAC4 writes the block to disk

1,42 2,44

1,42 2,50

STOP

Page 26: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 26

Global Cache Services gc cr grant 2-way wait event 2-way Consistent Read

Source Destination Description Bytes

RAC4 - Server RAC3 - LMS1 Request file 6 block 69 400

RAC3 - LMS1 RAC4 - Server OK 212

RAC3 - LMS1 RAC4 - Server Grant read file 6 block 69 276

RAC4 - Server RAC3 - LMS1 OK 212

Page 27: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 27

Global Cache Services gc cr grant 2-way wait event

RAC1

RAC2

RAC4

1318

RAC3

Resource Master

1,40 2,44 1,40 2,44

1,40 2,44

SELECT c2 FROM t1

WHERE c1 = 1;

1 2

5 6

3 4

STOP

Page 28: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 28

Global Cache Services gc cr multi block request wait event

Source Destination Description Bytes

RAC4 - Server RAC3 - LMS1 Request file 8 blocks 69-76 1872

RAC3 - LMS1 RAC4 - Server OK 212

RAC3 - LMS1 RAC4 - Server Grant file 8 blocks 69-76 to RAC4 772

RAC4 - Server RAC3 - LMS1 OK 212

Page 29: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 29

Global Cache Services gc cr multi block request wait event

RAC1

RAC2

RAC4

1318

RAC3

Resource Master

SELECT c2 FROM t1

WHERE c1 = 1;

1 2

5 6

3 4

1,40 2,44

1,40 2,44

1,40 2,44

1,40 2,44

1,40 2,44

1,40 2,44

1,40 2,44

1,40 2,44

1,40 2,44

1,40 2,44

1,40 2,44

1,40 2,44

1,40 2,44

1,40 2,44

1,40 2,44

STOP

Page 30: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 30

Global Cache Services gc cr multi block request wait event The following 10046/8 trace is for a gc cr multi block request

WAIT #2: nam='gc cr multi block request' ela= 722 file#=4 block#=248 class#=1 obj#=51866 tim=1169728375495574

WAIT #2: nam='db file scattered read' ela= 10437 file#=4 block#=244 blocks=5 obj#=51866 tim=1169728375506092

This trace can be misleading because: the gc cr multi block request specifies the LAST block in

the range the gc cr multi block request does not specify how many

blocks should be read the gc cr multi block request does not specify how many

blocks have been returned from another instance

Page 31: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 31

Global Cache Services Block Mastering Each block is mastered on one instance

Block DBA is reported by X$KJBR.KJBRNAME Names have the format:

[<block_number>][<file_number>][BL]

For example

[0x137][0x40000][BL]

Ordering by X$KJBR.KJBRNAME is difficult because the resource names do not collate when sorted e.g.:

is file# 4, block# 311

[0x12E][0x40000][BL]

[0x12F][0x40000][BL]

[0x13][0x40000][BL]

[0x130][0x40000][BL]

[0x131][0x40000][BL]

etc...

Page 32: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 32

Global Cache Services Block Mastering Some useful functions

CREATE OR REPLACE FUNCTION get_file_number (p_resource_name VARCHAR2) RETURN INTEGER IS pos1 INTEGER := INSTR (p_resource_name,'x',1,2); pos2 INTEGER := INSTR (p_resource_name,']',1,2); s VARCHAR2(30) := SUBSTR (p_resource_name,pos1+1,pos2-pos1-1); BEGIN RETURN TO_NUMBER (s,'XXXXXXXX') / 65536; END; /

CREATE OR REPLACE FUNCTION get_block_number (p_resource_name VARCHAR2) RETURN INTEGER IS pos1 INTEGER := INSTR (p_resource_name,'x',1,1); pos2 INTEGER := INSTR (p_resource_name,']',1,1); s VARCHAR2(30) := SUBSTR (p_resource_name,pos1+1,pos2-pos1-1); BEGIN RETURN TO_NUMBER (s,'XXXXXXXX'); END; /

Page 33: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 33

Global Cache Services Block Mastering In Oracle 10.2 block mastering is determined by

_lm_contiguous_res_count Specifies number of contiguous blocks that will hash to the

same HV bucket Defaults to 128 For example

Start End

0x080 0x0FF

0x180 0x1FF 0x280 0x2FF 0x380 0x3FF 0x480 0x4FF 0x580 0x5FF

etc etc

Start End

0x000 0x07F

0x100 0x17F 0x200 0x27F 0x300 0x37F 0x400 0x47F 0x500 0x57F

etc etc

Instance 0 Instance 1

Page 34: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 34

Global Cache Services Block Mastering The following table shows that masters are still assigned to

ranges of 128 contiguous blocks in a four-node cluster

Start Block End Block Master

0 127 1

128 255 2

256 383 2

384 511 3

512 639 3

640 767 3

768 895 1

896 1023 0

1024 1279 2

1280 1407 1

Page 35: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 35

Global Cache Services Dynamic Remastering In Oracle 9.2

documentation describes dynamic remastering not implemented in code

In Oracle 10.1

work at data file level very high threshold so difficult to test does occur on some customer sites

In Oracle 10.2 and above

works at segment level thresholds are relatively low

Page 36: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 36

Global Cache Services Dynamic Remastering Object remastering is recorded in V$GCSPFMASTER_INFO Instances are internally numbered 0, 1 etc Initially contains no rows After remastering object 52084 to instance 0

SELECT object_id, current_master, previous_master FROM v$gcspfmaster_info;

After remastering object 52084 to instance 1

Object ID Current Master Previous Master 52084 0 32767

Object ID Current Master Previous Master 52084 1 0

Page 37: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 37

Global Cache Services Dynamic Remastering In Oracle 10.2 and above, information about Dynamic

Remastering operations is also reported in the following fixed views X$KJDRMREQ

Dynamic Remastering Requests

X$KJDRMAFNSTATS File Remastering Statistics

X$KJDRMHVSTATS

Hash Value Statistics

Page 38: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 38

Global Cache Services Dynamic Remastering In Oracle 11.1 and above, Dynamic Remastering statistics are

reported in V$DYNAMIC_REMASTER_STATS

Column Name Data Type REMASTER_OPS NUMBER

REMASTER_TIME NUMBER

REMASTERED_OBJECTS NUMBER

QUIESCE_TIME NUMBER

FREEZE_TIME NUMBER

CLEANUP_TIME NUMBER

REPLAY_TIME NUMBER

FIXWRITE_TIME NUMBER

SYNC_TIME NUMBER

RESOURCES_CLEANED NUMBER

REPLAYED_LOCKS_SENT NUMBER

REPLAYED_LOCKS_RECEIVED NUMBER

CURRENT_OBJECTS NUMBER

Page 39: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 39

Global Cache Services Dynamic Remastering Dynamic remastering is coordinated by the LMD0 background

The LMD0 process background process includes limited details of dynamic remastering operations

Excessive dynamic remastering can cause instance freezes Observed in both Oracle 10.1 and 10.2 Oracle Support occasionally recommends that dynamic

remastering is disabled using the following parameters:

_gc_affinity_time = 0 _gc_undo_affinity=FALSE

Page 40: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 40

Thank you for listening

[email protected]

Page 41: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 41

Backup

Page 42: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 42

Interconnect Overview Instances communicate with each other over the interconnect

(network)

Information transferred between instances includes data blocks locks SCNs

Typically 1Gb Ethernet

UDP protocol Often teamed in pairs to avoid SPOFs

Can also use Infiniband

Fewer levels in stack

Other proprietary protocols are available

Page 43: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 43

Interconnect TCP/IP Five Layer Model All messages travel down through layers, across physical

layer then up again

5 Application

4 Transport

3 Network

2 Data Link

1Physical

5 Application

4 Transport

3 Network

2 Data Link

1Physical

Page 44: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 44

Interconnect TCP/IP Five Layer Model TCP/IP has a four or five layer model Five-layer model shown below

Layer TCP/IP Suite

5 Application DHCP, DNS, FTP, HTTP, SSH, NFS, NTP, SMTP, SNMP, TELNET, RPC, SOAP

4 Transport TCP, UDP

3 Network IP (IPv4, IPv6), ICMP, ARP, RARP

2 Data Link Ethernet, Token Ring, 802.11, Wi-Fi, FDDI, PPP

1 Physical 10BASE-T, 100BASE-T, 1000BASE-T, Optical Fibre, Twisted Pair

Four-layer model combines data link and physical layers

Page 45: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 45

Interconnect TCP/IP Transport Layer Transport Layer

Connection-oriented (TCP) Connectionless (UDP)

Ethernet

Physical Layer

IP

TCP UDP Clusterware RAC

Page 46: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 46

Interconnect Encapsulation

Ethernet Header

Ethernet Trailer

UDP Header

IP Header Data

UDP Header

IP Header Data

UDP Header Data

Data

4 bytes 14 bytes 20 bytes 8 bytes

MTU Size

Page 47: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 47

Oracle Clusterware Node Heartbeat Messages Sent to each node in cluster every second in both directions

Checks nodes are still members of cluster Sent by ocssd.bin using TCP well-known port 49895

Outgoing message is 134 bytes (80 byte payload) Incoming message is 66 bytes (12 byte payload)

Node 1

Node 3

Node 2

Node 4

Outgoing

Incoming

Page 48: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 48

Oracle Clusterware Node Status Messages Number of packets exchanged by a node is determined by

number of nodes in cluster Number of packets per node per hour is

(#nodes - 1) * 4 messages * 3600 seconds

Number of nodes Packets per hour 2 14,400 3 28,800 4 43,200 5 57,600 6 72,000 7 86,400 8 100,800

16 216,000 32 446,400

Page 49: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 49

Datafiles Controlfiles

Redo Logs

RAC Background Processes Overview

Redo Logs

DIAG

LMON

LCK0

LMD0

LMSn

PMON SMON

LGWR

CKPT

ARCn

SMON PMON

DBWR DBWR LGWR

Shared Pool

Buffer Cache

Instance 2

Shared Pool

Buffer Cache

Instance 1

DIAG

LMON

LCK0

LMD0

LMSn

CKPT

ARCn

Node 1 Node 2

Page 50: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 50

RAC Background Processes LMSn LMSn

Global Cache Service Process

Manage requests for data access across cluster

Up to 20 in Oracle 10.1 LMS0-LMS9 LMSa-LMSj

Up to 36 in Oracle 10.2

LMS0-LMS9 LMSa-LMSz

In Oracle 10.1 and above, number of GCS server processes can be configured using gcs_server_processes parameter Default value is 1 (single CPU system) Can also be configured using _lm_lms parameter

Page 51: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 51

RAC Background Processes LMSn In Oracle 10.2 and above

LMS processes run in real-time mode Remaining processes run in time-share mode

Check using: [oracle@server3 ~]$ ps -eo pid,user,opri,cmd | grep ora_lm

8596 oracle 75 ora_lmon_TEST1 8598 oracle 75 ora_lmd0_TEST1 8601 oracle 58 ora_lms0_TEST1

58 is real time; 75 or 76 is time share You can also check process scheduling policies using chrt

oracle@server3 ~]$ chrt -p 8601 # lms0 - Real Time pid 8601's current scheduling policy: SCHED_RR pid 8601's current scheduling priority: 1

[oracle@server3 ~]$ chrt -p 8596 # lmon - Time Share pid 8596's current scheduling policy: SCHED_OTHER pid 8596's current scheduling priority: 0

Page 52: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 52

RAC Background Processes LCK0 LCK0

Instance Enqueue Process Part of KCL (Kernel Cache Library)

Manages

instance resource requests cross-instance call operations

Assists LMS processes

Formerly known as lock process

One LCK0 process per instance

In 9.0.1 and below, number of lock processes may be

configurable using _gc_lck_procs parameter

Page 53: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 53

RAC Background Processes LMD0 LMD0

Global Enqueue Service Daemon

Manages requests for global enqueues Updates status of enqueues when granted to / revoked

from an instance

Responsible for deadlock detection

One LMD0 process per instance

In 8.1.7 and below number of lock daemons may be configurable using _lm_dlmd_processes parameter

Page 54: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 54

RAC Background Processes LMON LMON

Global Enqueue Service Monitor

One LMON process per instance

Monitors cluster to maintain global enqueues and resources

Manages instance and process expirations recovery processing for cluster enqueues

Page 55: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 55

RAC Background Processes DIAG DIAG - Diagnosability Process

Collects diagnostic data in the event of a failure

Creates subdirectories in BACKGROUND_DUMP_DEST directory

In Oracle 9.0.1 and above can be disabled using _diag_daemon parameter Do not try this on a production system

Page 56: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 56

Global Cache Services UDP Messages There are two types of message exchanged within RAC These are PROBABLY defined as follows

Synchronous These messages require an acknowledgement for each

packet In some cases the acknowledgement packet can be

larger than the original request e.g. SCN synchronization

Asynchronous

These messages do not require an individual acknowledgement for each packet e.g. block transfers between instances

Page 57: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 57

Global Cache Services Lock Modes Lock modes can be:

Null Another instance can hold an exclusive or shared lock

Shared Another instance can hold a shared lock but not an

exclusive lock Exclusive

No other instances can hold shared or exclusive locks

Locks can also be: Local

No other instance has held an exclusive lock Global

Another instance has held an exclusive lock in the past

Page 58: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 58

Global Cache Services Fairness Threshold Intended to prevent unnecessary lock downgrades when other

instances only require read-only copies

For write to read transfers Writing instance retains X lock Reading instance retains null lock

If _fairness_threshold reached then

Writing instance downgrades X lock to S lock Reading instance receives S lock

_fairness_threshold default value is 4

Page 59: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 59

Global Cache Services Lock Elements Lock elements are externalized in the V$LOCK_ELEMENT

dynamic performance view Based on X$LE

Additional information is available in the X$LE view

Past image buffers do not have a lock element

In OPS one lock element could manage a contiguous range of

blocks Still can in RAC using GC_FILES_PER_LOCK parameter Disables Cache Fusion

Page 60: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 60

Global Cache Services Lock Elements

Contain embedded GCS Client structures (KJBL)

Lock Element

GCS Client

Buffer Header

Lock Element

GCS Client

Buffer Header

Buffer Header

Lock Element

GCS Client

Buffer Header

Page 61: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 61

Global Cache Services Memory Structures

KJBR KJBR

KJBL

BH BH

LE

KJBL

LE

KJBL

GCS Client

GCS Shadow

GCS Resource

Block Header Lock

Element

GCS Shadow describes blocks

held by other instances, but

mastered locally

Page 62: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 62

Global Cache Services Memory Structures GCS Resources (KJBR)

Stored in segmented array Number of GCS resource structures determined by

_gcs_resources parameter Externalized in X$KJBR Number of free GCS resource structures in X$KJBRFX

GCS Enqueues (Clients / Shadows) (KJBL)

GCS clients embedded in lock elements GCS shadows stored in segmented array Number of GCS shadow structures determined by

_gcs_shadow_locks parameter Externalized in X$KJBL Number of free GCS shadow structures in X$KJBLFX

Page 63: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 63

Global Cache Services Dynamic Remastering Example

SELECT data_object_id FROM dba_objects WHERE owner = 'US01'AND object_name = 'T1'; OBJECT_ID --------- 52084

ORADEBUG LKDEBUG -m pkey 52084

To remaster object at current instance use:

All blocks now mastered by the current instance

To redistribute masters to all available instances use: ORADEBUG LKDEBUG -m dpkey 52084

Blocks mastered by both (all) instances again

Page 64: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 64

Global Cache Services Block Mastering In Oracle 10.1 and below block mastering is determined by a

hash function Algorithm applied to groups of 1289 contiguous blocks In two node cluster

Instance 0 has 645 blocks Instance 1 has 644 blocks etc

In three node cluster Instance 0 has 430 blocks Instance 2 has 215 blocks Instance 1 has 430 blocks Instance 2 has 214 blocks etc

Beware of small hot tables and indexes....

Page 65: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 65

Global Cache Services Dumps To dump the contents of the global cache use:

ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME GC_ELEMENTS LEVEL 1';

GLOBAL CACHE ELEMENT DUMP (address: 0x21fecd18): id1: 0x3591 id2: 0x10000 obj: 181 block: (1/13713) lock: SL rls: 0x0000 acq: 0x0000 latch: 0 flags: 0x41 fair: 0 recovery: 0 fpin: 'kdswh05: kdsgrp' bscn: 0x0.18a9c bctx: (nil) write: 0 scan: 0x0 xflg: 0 xid: 0x0.0.0

GCS CLIENT 0x21fecd60,1 sq[(nil),(nil)] resp[(nil),0x3591.10000] pkey 181 grant 1 cvt 0 mdrole 0x21 st 0x20 GRANTQ rl LOCAL master 1 owner 0 sid 0 remote[(nil),0] hist 0x7c history 0x3c.0x1.0x0.0x0.0x0.0x0. cflag 0x0 sender 2 flags 0x0 replay# 0 disk: 0x0000.00000000 write request: 0x0000.00000000 pi scn: 0x0000.00000000 msgseq 0x1 updseq 0x0 reqids[1,0,0] infop 0x0 pkey 181 hv 107 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 6, dom 0] kjga st 0x4, step 0.0.0, cinc 8, rmno 10, flags 0x0 lb 0, hb 0, myb 178, drmb 178, apifrz 0

Page 66: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 66

Global Cache Services Dumps Continued

GLOBAL CACHE ELEMENT DUMP (address: 0x237f4358): id1: 0x6a39 id2: 0x10000 obj: 74 block: (1/27193) lock: SL rls: 0x0000 acq: 0x0000 latch: 0 flags: 0x41 fair: 0 recovery: 0 fpin: 'kdswh05: kdsgrp' bscn: 0x0.26992 bctx: (nil) write: 0 scan: 0x0 xflg: 0 xid: 0x0.0.0

GCS SHADOW 0x237f43a0,1 sq[0x2ee64e8c,0x2eff3858] resp[0x2ee64e74,0x6a39.10000] pkey 74 grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL master 0 owner 0 sid 0 remote[(nil),0] hist 0x12a5 .....

GCS RESOURCE 0x2ee64e74 hashq [0x2ee61894,0x2ff57390] name[0x6a39.10000] pkey 74 grant 0x2eff3858 cvt (nil) send (nil),0 write (nil),0@65535 flag 0x0 mdrole 0x1 mode 1 scan 0 role LOCAL ..... GCS SHADOW 0x2eff3858,1 sq[0x237f43a0,0x2ee64e8c] resp[0x2ee64e74,0x6a39.10000] pkey 74 grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL master 0 owner 1 sid 0 remote[0x23fea160,1] hist 0x65f .....

GCS SHADOW 0x237f43a0,1 sq[0x2ee64e8c,0x2eff3858] resp[0x2ee64e74,0x6a39.10000] pkey 74 grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL master 0 owner 0 sid 0 remote[(nil),0] hist 0x12a5 .....

Page 67: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 67

Global Cache Services System Change Number In RAC clusters SCN must be maintained across all nodes in

cluster SCN propagation scheme differs according to version

In Oracle 10.1and below defaults to Lamport algorithm

Lamport in alert.log SCN piggy-backed on GCS/GES messages Recorded in redo log Default delay of 7 seconds

In Oracle 10.2 and above defaults to Broadcast on Commit

algorithm SCN negotiated immediately Apparently no delay

Page 68: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 68

Global Cache Services System Change Number System Change Number algorithm is determined by the

MAX_COMMIT_PROPAGATION_DELAY parameter

In Oracle 10.1 and below Initialization parameter specified in centriseconds Default value is 700 centiseconds (7 seconds) Specifies maximum time taken for a COMMIT on one node

to be reflected on other nodes in the cluster For some applications performing rapid updates and

queries of the same data from different instances, value must be set to 0 (Broadcast on commit)

Examples include: E-Business suite SAP

Page 69: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 69

Global Cache Services System Change Number In Oracle 10.2 and above

Default value of MAX_COMMIT_PROPAGATION_DELAY parameter is 0

SCN broadcast on commit method is used SCN updates are synchronized immediately

SCN is synchronized

after current read before block updated

This ensures correct SCN is written to block

Page 70: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 70

Global Cache Services Broadcast on Commit Ethernet broadcast is not used

SCN is synchronized by updating instance

Sends UDP SCN synchronization message to each remote instance

Remote instances respond with their current SCN

Another round of messages may be required if remote SCNs are more recent than local SCN

Synchronization occurs every time an instance needs a new SCN

Synchronization is always performed by the updating instance Number of messages = 4 x (number of instances - 1)

Page 71: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 71

Global Cache Services Broadcast on Commit In a 4-node cluster 12 messages are exchanged

Source Destination Description Bytes RAC4-LMS0 RAC1-LMS0 Send current SCN 192 RAC1-LMS0 RAC4-LMS0 OK 212 RAC4-LMS0 RAC2-LMS0 Send current SCN 192 RAC2-LMS0 RAC4-LMS0 OK 212 RAC4-LMS0 RAC3-LMS0 Send current SCN 192 RAC3-LMS0 RAC4-LMS0 OK 212 RAC1-LMS0 RAC4-LMS0 Send current SCN 192 RAC4-LMS0 RAC1-LMS0 OK 212 RAC2-LMS0 RAC4-LMS0 Send current SCN 192 RAC4-LMS0 RAC2-LMS0 OK 212 RAC3-LMS0 RAC4-LMS0 Send current SCN 192 RAC4-LMS0 RAC3-LMS0 OK 212

Page 72: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 72

Global Cache Service Read Consistency When a read consistent version of a block is requested it may

be necessary to apply undo to a more recent version of that block

Undo can be applied by LMSn background process in Remote instance Local instance

If undo applied by remote instance, any outstanding redo

must first be flushed from redo buffer of remote instance to redo log Can have significant performance impact on consistent

reads Particularly on extended clusters

Page 73: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 73

Global Cache Service Read Consistency Statistics on inter-instance consistent reads are reported in

V$CR_BLOCK_SERVER

Reports statistics for blocks served by local instances to remote instances including Number of consistent reads served Number of current reads served Number of data blocks served Number of undo blocks served Number of undo headers served Number of fairness down converts Number of log flushes Number of times light works rule invoked

Page 74: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 74

Global Cache Service Read Consistency In theory, once a block has been written to disk, the LMS

process will not attempt to read it again when responding to a consistent read request

Light Works Rule Prevents LMS processes from going to disk when

responding to CR requests for data, undo or undo segment blocks

Can prevent LMS process from completing its response to a CR request

Page 75: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 75

Global Cache Service Read Consistency Uncommitted changes MUST be flushed to the redo log before

the LMS process can ship a consistent block to another instance

Reading process must wait until redo log changes have been written to redo log by LMS process

Bad for standard RAC databases Reads must wait for redo log writes

Worse for extended / stretch RAC clusters

Increased latency of cross site disk communications

Page 76: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 76

Global Cache Service Read Consistency For each block on which a consistent read is performed, a

redo log flush must first be performed

Number of redo log flushes is recorded in the FLUSHES column of V$CR_BLOCK_SERVER

Redo log flush time is recorded in the gc cr block flush time statistic for the

LMS process will increase time taken to serve consistent block will increase time taken to perform consistent read

If LMS processes become very busy, consistent reads will

experience high wait times e.g. for a full table scan gc cr multi block request

Page 77: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 77

Global Cache Services Read Consistency

Committed transaction on RAC2 - All blocks still in buffer cache

110

109

108

108

Redo Buffer Redo Buffer

Buffer Cache Buffer Cache

RAC1 RAC2

Redo Log

1

2

3 110 110

STOP

Page 78: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 78

Global Cache Services Read Consistency

Committed transaction on RAC2 - Some blocks written to disk

110

109

108

Redo Buffer Redo Buffer

Buffer Cache Buffer Cache

RAC1 RAC2

Redo Log

1

3

2

110

110

4

110

110

STOP

Page 79: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 79

Global Cache Services Read Consistency

Uncommitted transaction on RAC2 - All blocks still in buffer cache

110

108

Redo Buffer Redo Buffer

Buffer Cache Buffer Cache

RAC1 RAC2

Redo Log

2

3 1

108 110

4

5

6

109

110

109

109

108 108

108 108

STOP

Page 80: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 80

Global Cache Services Read Consistency

Uncommitted transaction on RAC2 - Some blocks written to disk

Redo Buffer Redo Buffer

Buffer Cache Buffer Cache

RAC1 RAC2

Redo Log

3

2

1

110

4

6

8

110 5

7 110

110

109

110

109

109

108 108

108

STOP

Page 81: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 81

Global Cache Services Jumbo Frames By default Maximum Transmission Unit (MTU) is 1500 MTU includes

IP header UDP header Data

Requires six packets to transmit one 8192 byte block On some adapters MTU can be increased to around 9000

e.g. Intel PRO/1000 At command line

ifconfig eth1 mtu 9000 up

or in /etc/sysconfig/ifcfg-eth<x>

MTU=9000

Page 82: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 82

Global Cache Services Jumbo Frames Example - cost of sending on 8192 byte block MTU=1500 (default)

Frame# Ethernet Header

IP Header UDP Header

Data Ethernet Trailer

Total

1 14 20 8 1472 4 1518 2 14 20 8 1472 4 1518 3 14 20 8 1472 4 1518 4 14 20 8 1472 4 1518

5 14 20 8 1472 4 1518 6 14 20 8 840 4 886

Total 84 120 48 8200 24 8476

Frame# Ethernet Header

IP Header UDP Header

Data Ethernet Trailer

Total

1 14 20 8 8200 4 8246 Total 14 20 8 8200 4 8246

MTU=9000

Page 83: B35 Inside rac by Julian Dyke

© 2013 Julian Dyke juliandyke.com 83

Global Cache Services Jumbo Frames Not all network adapter drivers support jumbo frames

Particularly cheap ones....

All network adapters in private interconnect must have same MTU size

Switch must also be configured to support jumbo frames

Lots of bugs and compatibility issues e.g. Bug 4447620: RAC UDP MTU size restricted to 1500 or 9000

affects 10.1.0.5, 10.2,0.1 fixed in 10.2.0.2 and above