Lundhild-Understanding RAC Internals

Understanding RAC Internals

Barb Lundhild, Oracle Corporation, RAC Product Management

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.



Agenda

1. What are the major components of Oracle Clusterware and how do they interact?

2. Why does Oracle reboot nodes?

3. How does Oracle handle private interconnect failure and scalability?

4. When my public network fails, why do ASM and the db instance get shut down?

5. What exactly is the VIP, its purpose, and how does it work?

6. What is the purpose of ONS – is it required for anything other than FAN?

7. How does Oracle do load balancing across RAC instances?

What are the major components of Oracle Clusterware and how do they interact?


RAC 10g Architecture

[Diagram: Node 1 through Node n are joined by the public network and the cluster interconnect. Each node runs the operating system, Oracle Clusterware, ASM, a database instance, a listener, a VIP (VIP1 to VIPn) and a service. Shared storage holds the redo / archive logs of all instances and the database / control files (managed by ASM), plus the OCR and voting disks (raw devices).]

What does Clusterware provide?

[Diagram: Clusterware sits on top of the operating system and provides:]

• Group Membership

• High Availability Framework

• Process Monitor

• VIP

• Event Management


Oracle Clusterware 10g Architecture

[Diagram: Oracle Clusterware sits on top of the operating system and is made up of CSS, CRS, OPROC, VIP, RACG and EVM.]

Why does Oracle Clusterware reboot nodes?


Oracle Clusterware Group Membership and Heartbeats

• Cluster needs to know who is a member at all times

• Oracle Clusterware has 2 heartbeats:
  • Network heartbeat and Disk heartbeat

• If a node does not send a network heartbeat for <MissCount> (time in seconds), then the node is evicted from the cluster

• If the disk heartbeat (voting disk) is not updated within <I/O timeout>, then the node is evicted from the cluster
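The two eviction rules above can be sketched as a simple check. This is only an illustration in Python, not Oracle's CSS implementation: the function name, the timestamp arguments and the default values (30 s for MissCount and 200 s for the disk timeout, both figures quoted elsewhere in this deck) are assumptions made for the example.

```python
from typing import Optional


def eviction_reason(now: float,
                    last_network_heartbeat: float,
                    last_disk_heartbeat: float,
                    misscount: float = 30.0,
                    disk_timeout: float = 200.0) -> Optional[str]:
    """Return why a node would be evicted, or None if it stays a member.

    All times are in seconds. Parameter names and defaults are
    illustrative assumptions, not Oracle Clusterware identifiers.
    """
    if now - last_network_heartbeat > misscount:
        return "network heartbeat missed for longer than misscount"
    if now - last_disk_heartbeat > disk_timeout:
        return "voting disk not updated within the I/O timeout"
    return None


# A node whose last network heartbeat is 59 s old exceeds a 30 s misscount.
print(eviction_reason(now=100.0, last_network_heartbeat=41.0,
                      last_disk_heartbeat=99.0))
```

Either condition alone is enough for eviction, which is why the sketch returns on the first rule that fires.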

Heartbeat Failures

• Network Heartbeat

node(4) missed(59) checkin(s)
>2005-06-18 08:14:37.858 [3002575792]>WARNING: clssnmPollingThread: Eviction started for node 4, flags 0x000d, state 3, wt4c 0
>2005-06-18 08:14:41.985 [3047074736] >TRACE: clssnmHandleSync:

• Disk Heartbeat

CSSD]2005-10-11 15:56:23.668 [93645744] >WARNING: clssnmDiskPMT: long disk latency (45940 ms) to voting disk (0//dev/raw/raw1)


Oracle Clusterware Split Brain Resolution

• Split Brain Resolution:
  • Determine surviving subcluster
    • Sub-cluster with the largest number of nodes
    • If the sub-clusters are the same size, the sub-cluster with the lowest node number
  • IO Fencing via Stonith algorithm (remote power reset)

• Voting disk is used to detect and resolve network problems that could lead to a split-brain
  • Final arbiter of the status of configured nodes, either up or down, and delivers eviction notices

• Recommended to have at least 3 voting disks
  • Multiple voting disks supported in RAC 10g Release 2
  • Dynamic addition of voting disk in RAC 11g
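The survivor rule on this slide can be written down in a few lines. The sketch below is one reading of the two bullets (largest sub-cluster first, lowest node number as the tie-breaker); the function name and the representation of a sub-cluster as a set of node numbers are assumptions, and it says nothing about how Clusterware actually implements the decision.

```python
from typing import List, Set


def surviving_subcluster(subclusters: List[Set[int]]) -> Set[int]:
    """Pick the sub-cluster that survives a split.

    Largest sub-cluster wins; on a tie, the sub-cluster that contains
    the lowest node number wins. Every sub-cluster must be non-empty.
    """
    return max(subclusters, key=lambda nodes: (len(nodes), -min(nodes)))


# The larger group survives.
print(sorted(surviving_subcluster([{1, 2}, {3, 4, 5}])))
# Equal sizes: the group holding node 1 survives.
print(sorted(surviving_subcluster([{2, 3}, {1, 4}])))
```

Nodes outside the returned set are the ones that would be fenced.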

Oracle Clusterware Disk Heartbeat

• Disktimeout: maximum time (s) for voting file I/O to complete.

• 10g Release 1 and 10.2.0.1: I/O timeout was directly related to MissCount.
  • I.e. MissCount governed sensitivity of both heartbeats

• 10.2.0.2: more granular sensitivity via separation of network and disk heartbeats
  • Disktimeout parameter set for CSS, default = 200s

• Tune disktimeout for the Voting Disk storage solution
  • Be careful: some multipathing solutions require high disktimeout values


Changing MissCount

• IT IS NOT SUPPORTED TO REDUCE MISSCOUNT BELOW THE DEFAULT
  • Default varies somewhat by platform (30s or 60s)
  • Default = 600s if vendor clusterware is installed

• It should not be necessary to tune Disktimeout

How does Oracle handle private interconnect failure and scalability?


Private Interconnect

[Diagram: Node 1, Node 2 through Node n share the public network. Each node runs the operating system, Oracle Clusterware, ASM, a database instance, a listener, a VIP and a service. The cluster interconnect between the nodes runs through two switches, Switch 1 and Switch 2.]

Private Interconnect

• Network between the nodes of a RAC cluster MUST be private

• Supported links: GbE, IB (IPoIB: 10.2)

• Supported transport protocols:
  • Oracle Clusterware uses TCP
  • RAC: UDP, RDS (10.2.0.3)

• Use multiple or dual-ported NICs for redundancy and increase bandwidth with NIC bonding

• Large (Jumbo) Frames for GbE recommended


Interconnect Bandwidth

• Bandwidth requirements depend on
  • CPU power per cluster node
  • Application-driven data access frequency
  • Number of nodes and size of the working set
  • Data distribution between PQ slaves

• Typical utilization approx. 10-30% in OLTP
  • 10000-12000 8K blocks per sec to saturate 1 x Gb Ethernet (75-80% of theoretical bandwidth)

• Multiple NICs generally not required for performance and scalability
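The saturation figure can be checked with a little arithmetic. The sketch below counts block payload only and assumes 8K means 8192 bytes and a raw link rate of 10^9 bits per second; the function name is made up for the example, and protocol headers (Ethernet, IP, UDP) would add to the result.

```python
def gbe_utilization(blocks_per_sec: int,
                    block_size_bytes: int = 8192,
                    link_bits_per_sec: float = 1e9) -> float:
    """Fraction of raw link bandwidth consumed by block payload alone."""
    return blocks_per_sec * block_size_bytes * 8 / link_bits_per_sec


for rate in (10_000, 12_000):
    print(f"{rate} blocks/s -> {gbe_utilization(rate):.1%} of 1 Gb/s")
```

With these assumptions 10000 blocks per second works out to about 66% of the link and 12000 to about 79%, which is in line with the 75-80% quoted on the slide once per-packet overhead is added.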

IPC configuration

• Settings:
  • Socket receive buffers (256 KB – 1 MB)
  • Negotiated top bit rate and full duplex mode
  • NIC ring buffers
  • Ethernet flow control settings
  • CPU(s) receiving network interrupts

• Verify your setup:
  • CVU does checking
  • Load testing eliminates potential for problems


Interconnect Bonding

• Terminology: NIC Bonding, link aggregation, port trunking, NIC teaming, …

• Multiple physical links combined into a single logical link
  • Provides redundancy and/or scalability

• Logical link is provided to Oracle Clusterware and RAC

• Most operate at OSI Layer 2

• Different implementations on different platforms
  • Read the fine print
  • Generally recommend failover only (active/passive) configuration

Interconnect Bonding

• Some cluster managers provide support for multiple interconnects
  • Not required with Oracle Clusterware

• OS-Specific bonding
  • Solaris: IPMP, Sun Trunking
  • AIX: etherchannel
  • HP-UX: APA
  • Linux: NIC Bonding
  • Windows: NIC Teaming

• IB drivers inherently support failover and load balancing.


Interconnect Configuration

• OCR

[SYSTEM.css.interfaces.global.bond0.192|d168|d12|d0.1]
ORATEXT : cluster_interconnect
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_ALL_ACCESS, OTHER_PERMISSION : PROCR_READ, USER_NAME : oracle, GROUP_NAME : odba}

• RDBMS

SQL> select * from x$ksxpia;

ADDR           INDX    INST_ID P PICK NAME_KSXPIA     IP_KSXPIA
-------- ---------- ---------- - ---- --------------- -------------
58EC8340          0          1 Y OCR  bond0           192.168.12.1

• cluster_interconnects (init.ora for RAC)
  • Overrides clusterware setting
  • Supports load balancing, not failover
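The OCR key on this slide carries the interface name and its subnet. The sketch below pulls them apart under two assumptions that the slide itself does not state: that `|d` stands for a literal dot inside the subnet, and that the trailing `.1` component can be ignored. The function name is hypothetical.

```python
from typing import Tuple


def parse_interface_key(key: str) -> Tuple[str, str]:
    """Split an OCR interface key into (interface name, subnet).

    Assumption: '|d' is an escaped dot inside the subnet. The trailing
    numeric component is left uninterpreted.
    """
    prefix = "SYSTEM.css.interfaces.global."
    if not key.startswith(prefix):
        raise ValueError(f"unexpected key: {key}")
    name, encoded_subnet = key[len(prefix):].split(".")[:2]
    return name, encoded_subnet.replace("|d", ".")


print(parse_interface_key(
    "SYSTEM.css.interfaces.global.bond0.192|d168|d12|d0.1"))
```

Read this way, the key names interface bond0 on subnet 192.168.12.0, which matches the bond0 / 192.168.12.1 row returned by the x$ksxpia query.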

Operating System Dependency

• Block access latencies increase when CPU(s) busy and run queues are long

• Immediate LMS scheduling is critical for predictable block access latencies when CPU > 80% busy

• Fewer and busier LMS processes may be more efficient, i.e. monitor their CPU utilization

• Real Time or fixed priority for LMS is supported
  • Implemented by default with 10.2

• Do not put more instances on a server than half the number of CPUs


Misconfigured or Faulty Interconnect Can Cause:

• Dropped packets/fragments

• Buffer overflows

• Packet reassembly failures or timeouts

• Ethernet Flow control kicks in

• TX/RX errors

“lost blocks” at the RDBMS level, responsible for 64% of escalations

“Lost Blocks”: NIC Receive Errors

Db_block_size = 8K

ifconfig -a:

eth0      Link encap:Ethernet  HWaddr 00:0B:DB:4B:A2:04
          inet addr:130.35.25.110  Bcast:130.35.27.255  Mask:255.255.252.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95
          TX packets:273120 errors:0 dropped:0 overruns:0 carrier:0
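The receive counters in the ifconfig output above can be turned into an error rate for monitoring. The sketch below is an illustration only: the function name is made up, and it assumes the `name:value` layout shown on this slide.

```python
import re


def rx_error_rate(ifconfig_rx_line: str) -> float:
    """Return receive errors as a fraction of received packets."""
    fields = {name: int(value)
              for name, value in re.findall(r"(\w+):(\d+)", ifconfig_rx_line)}
    return fields["errors"] / fields["packets"]


line = "RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95"
print(f"RX error rate: {rx_error_rate(line):.6%}")
```

On a private interconnect the interesting signal is usually whether the error and frame counters keep growing between samples, so in practice the same parsing would be applied to two snapshots and the difference compared.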


“Lost Blocks”: IP Packet Reassembly Failures

netstat -s

Ip:
    84884742 total packets received
    …
    1201 fragments dropped after timeout
    …
    3384 packet reassembles failed

Finding a Problem with the Interconnect or IPC

Top 5 Timed Events                          Avg %Total
~~~~~~~~~~~~~~~~~~                         wait   Call
Event                   Waits    Time(s)   (ms)   Time Wait Class
------------------ ---------- ---------- ------ ------ ----------
log file sync         286,038     49,872    174   41.7 Commit
gc buffer busy        177,315     29,021    164   24.3 Cluster
gc cr block busy      110,348      5,703     52    4.8 Cluster
gc cr block lost        4,272      4,953   1159    4.1 Cluster
cr request retry        6,316      4,668    739    3.9 Other

[Callout on the slide: Should never be here]


What are the startup/shutdown sequence and dependencies?

Node Startup Sequence

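[Diagram: a single node stack with seven numbered startup steps (1 to 7). Components shown: Operating System, Oracle Clusterware, ASM, VIP1, Listener, Instance 1, Service. The transcript does not preserve which number belongs to which component.]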


Oracle Dependencies Prior to 10.2.0.3

[Diagram, two builds: Node 1 and Node 2 are joined by the public network and the cluster interconnect. Each node runs the operating system, Oracle Clusterware, ASM, a database instance, a listener, a VIP (VIP1, VIP2) and a service. Shared storage holds the redo / archive logs of all instances and the database / control files (managed by ASM), plus the OCR and voting disks (raw devices). In the second build VIP1 appears twice.]


Oracle Dependencies

[Diagram, two builds: Node 1 and Node 2 are joined by the public network and the cluster interconnect. Each node runs the operating system, Oracle Clusterware, ASM, a database instance, a listener, a VIP (VIP1, VIP2) and a service, and an additional listener is drawn next to the service. Shared storage holds the redo / archive logs of all instances and the database / control files (managed by ASM), plus the OCR and voting disks (raw devices). In the first build VIP1 appears twice; in the second build it appears once.]


What exactly is the VIP, its purpose, and how does it work?

Why does Oracle RAC 10g have a VIP?

• Protects database clients from long TCP/IP timeouts (can be >10 minutes)

• During normal operation, works the same as hostname

• During failure, it removes the network timeout from the connection request time: the client fails over immediately to the next address in the list

sales.us.acme.com =
  (DESCRIPTION=
    (ADDRESS_LIST=
      (LOAD_BALANCE=on)(FAILOVER=ON)
      (ADDRESS=(PROTOCOL=tcp)(HOST=sales1-vip)(PORT=1521))
      (ADDRESS=(PROTOCOL=tcp)(HOST=sales2-vip)(PORT=1521)))
    (CONNECT_DATA=
      (SERVICE_NAME=sales.us.acme.com)))
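The effect of FAILOVER=ON over an address list can be illustrated with a small client-side loop. This is a sketch of the idea only, not how Oracle Net is implemented: the function name and the injectable `connect` callable are assumptions, and the demonstration uses a stand-in connect function instead of real network calls.

```python
import socket
from typing import Callable, Sequence, Tuple

Address = Tuple[str, int]


def connect_with_failover(
        addresses: Sequence[Address],
        connect: Callable[[Address], object] = socket.create_connection):
    """Try each address in order and return the first successful connection.

    An address that answers with an immediate error (as a relocated VIP
    does) costs almost no time, so the next address is tried at once.
    """
    errors = []
    for address in addresses:
        try:
            return connect(address)
        except OSError as exc:
            errors.append((address, exc))
    raise ConnectionError(f"all addresses failed: {errors}")


def fake_connect(address: Address) -> str:
    # Stand-in for a real connect: pretend only sales2-vip has a listener.
    if address[0] != "sales2-vip":
        raise ConnectionRefusedError(f"{address[0]} refused the connection")
    return f"connected to {address[0]}"


print(connect_with_failover([("sales1-vip", 1521), ("sales2-vip", 1521)],
                            fake_connect))
```

Without a VIP, the first attempt would instead wait for the TCP/IP timeout before the loop could move on, which is the delay the slide says the VIP removes.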


Oracle RAC 10g VIP: The Details!

• One for each node in cluster

• Required for Oracle Clusterware installation

• IP and network name should not currently be in use

• Should be registered in DNS and be on the same subnet as public IP address

• Can use OS bonding to provide failover and load balancing on network interfaces on the node

• Configuration managed by VIPCA

• Note that netmask defaults to 255.255.255.0, rather than defaulting to netmask of underlying physical interface.

Oracle RAC VIP is DIFFERENT

• Only accepts connections when on its home node

• Failure on home node: relocates to another node in the cluster only to send an error back to the client (it will not be in the listener so it cannot accept connections!)

• You will only have one active RAC VIP per node (there may be others that have relocated due to failure!)
  • Independent of number of databases running in cluster


Oracle RAC 10g VIP

[root@pmrac1 root]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:12:79:D8:90:93
          inet addr:144.25.214.45  Bcast:144.25.215.255  Mask:255.255.252.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:5070815 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3064435 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:509963813 (486.3 Mb)  TX bytes:3621223517 (3453.4 Mb)
          Interrupt:25

eth0:1    Link encap:Ethernet  HWaddr 00:12:79:D8:90:93
          inet addr:144.25.214.47  Bcast:144.25.215.255  Mask:255.255.252.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:5762695 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5679252 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3400642002 (3243.1 Mb)  TX bytes:3166774792 (3020.0 Mb)
          Interrupt:25

[Callout on the slide: VIP, marking the eth0:1 alias]

Listener.ora

SID_LIST_LISTENER_PMRAC1 =
  (SID_LIST =
    (SID_DESC =
      (SID_NAME = PLSExtProc)
      (ORACLE_HOME = /u01/oracle/product/10gR2/asm)
      (PROGRAM = extproc)
    )
  )

LISTENER_PMRAC1 =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC1))
      (ADDRESS = (PROTOCOL = TCP)(HOST = pmrac1-vip)(PORT = 1521)(IP = FIRST))
      (ADDRESS = (PROTOCOL = TCP)(HOST = 144.25.214.45)(PORT = 1521)(IP = FIRST))
    )
  )

[Callout on the slide: VIP, marking the pmrac1-vip address]


Application VIPs

• New resource in Oracle RAC 10g Release 2

• Created as functional VIPs which can be used to connect to an application regardless of the node it is running on

• VIP is a dependent resource of the user registered application

• There can be many VIPs, one per User Application

Creating an Application VIP

• The usrvip script must run as root

• The default permissions need to be changed
  • As root…
    crs_setperm ApplicationVIP1 -o root

• Allow oracle user to execute this script
  • As root…
    crs_setperm ApplicationVIP1 -u user:oracle:r-x

• Start the VIP
  • As oracle…
    crs_start ApplicationVIP1


What is the purpose of ONS – is it required for anything other than FAN?

Oracle Notification Service (ONS)

• Publish/Subscribe Messaging System

• Allows both local and remote consumption

• Used by Fast Application Notification (FAN) to publish HA Events and Load Balancing Events

• Used by FAN clients to subscribe to events

• Automatically installed and configured by the installation of Oracle Clusterware

• DO NOT TURN OFF – Required by Oracle Clusterware and RAC


What is FAN?

• Fast Application Notification (FAN) is a RAC notification mechanism

• FAN HA Events: Notification of Up/Down for service, instance & node

• Load Balancing Advisory Events: Advise clients of current load for service and where to send connection requests

• Enable it, and Forget it.

FAN Clients

• HA Events: JDBC Implicit Connection Cache, OCI, ODP.NET Connection Pools, Listener, Server Side Callouts, CMAN

• Load Balancing Advisory Events: JDBC Implicit Connection Cache, ODP.NET Connection Pools, Listener, CMAN
  • New in RAC 11g – OCI Session Pools subscribe to Load Balancing Advisory Events to provide Runtime Connection Load Balancing


How does Oracle do load balancing across RAC instances?

Connection Load Balancing

[Diagram: an application server asks the listener, over the network, for service OLTP. The RAC database offers the service as OLTP1 on N1, OLTP2 on N2 and OLTP3 on N3.]


Connection Load Balancing

[Diagram: clients reach the listeners over the network; the listener's answer is "Connection made to OLTP1" on the RAC database.]

Connection Pools: How do you Load Balance?

[Diagram: an application connection pool holding many connections (each drawn as "c") to Real Application Clusters.]


Load Balancing Advisory

• Load Balancing Advisory is an advisory for balancing work across RAC instances.

• Load Balances at the transaction level (not connections!)

• Directs work to where services are executing well and resources are available.

• Adjusts distribution for different power nodes, different priority and shape workloads, changing demand.

• Stops sending work to slow, hung, failed nodes early.

Load Balancing Advisory

• Automatic Workload Repository

• Calculates goodness locally, forwards to master mmon

• Master mmon builds advisory for distribution of work

• Records advice to SYS$SERVICE_METRICS

• Posts FAN event to AQ, PMON, ONS
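One way a client could act on such an advisory is to hand out work in proportion to the share advised for each instance. The sketch below illustrates that idea only: the percentages are made up, the function name is hypothetical, and the real format of the advisory event is not shown in this deck. The instance names are borrowed from the Connection Load Balancing diagram.

```python
import random
from typing import Dict


def pick_instance(advice: Dict[str, float], rng: random.Random) -> str:
    """Choose an instance with probability proportional to its advised share."""
    instances = list(advice)
    weights = [advice[name] for name in instances]
    return rng.choices(instances, weights=weights, k=1)[0]


# Made-up advisory: share of new work each instance should receive.
advice = {"OLTP1": 50.0, "OLTP2": 30.0, "OLTP3": 20.0}
rng = random.Random(0)
counts = {name: 0 for name in advice}
for _ in range(10_000):
    counts[pick_instance(advice, rng)] += 1
print(counts)
```

An instance whose advised share drops to zero stops receiving new work, which mirrors the bullet about not sending work to slow, hung or failed nodes.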


View LBA FAN Event

Runtime Connection Load Balancing

• When application does “getConnection”, the connection given is the one that will provide the best service.

• Supported by Oracle JDBC and ODP.NET connection Pools (OCI Session Pools in RAC 11g!)

• Policy defined by setting GOAL on Service

• Need to have Connection Load Balancing


Load Balancing Advisory Enabled through Service Goal

• THROUGHPUT – Work requests are directed based on throughput.
  • Used when the work in a service completes at homogeneous rates. An example is a trading system where work requests are of similar lengths.

• SERVICE_TIME – Work requests are directed based on response time.
  • Used when the work in a service completes at various rates. An example is an internet shopping system where work requests are of various lengths.

• NONE – Default setting, turns off the advisory

Fast Connection Failover

• Fast and reliable high availability for connections in an Oracle Real Application Clusters 10g environment

• Enable it and forget it

• Application can make it transparent to user by trapping SQL Exception and retrying

• Supported by Oracle JDBC, OCI, and ODP.NET


FAN/FCF Client Integration: JDBC

• When DOWN signal received from RAC 10g
  • First pass: Connections are marked as down
  • Second pass: Aborts and removes connections that are marked as down
  • Routes new requests to surviving instances
  • Throws exception if application was in midst of transaction

• When UP signal received from RAC 10g
  • Creates new connections to new instances
  • Distributes new work requests evenly to all available instances
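The two-pass handling of a DOWN event can be sketched with a toy connection pool. This is not the Oracle JDBC implementation: the class names, fields and method name are invented for the illustration, and "throwing an exception" is modelled simply by returning the connections that were interrupted mid-transaction.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PooledConnection:
    instance: str
    in_transaction: bool = False
    marked_down: bool = False
    closed: bool = False


@dataclass
class ConnectionPool:
    connections: List[PooledConnection] = field(default_factory=list)

    def on_down_event(self, instance: str) -> List[PooledConnection]:
        """Handle a DOWN event; return connections interrupted mid-transaction."""
        # First pass: mark every connection to the failed instance.
        for conn in self.connections:
            if conn.instance == instance:
                conn.marked_down = True
        # Second pass: abort and remove the marked connections.
        interrupted, survivors = [], []
        for conn in self.connections:
            if conn.marked_down:
                conn.closed = True
                if conn.in_transaction:
                    interrupted.append(conn)
            else:
                survivors.append(conn)
        self.connections = survivors
        return interrupted


pool = ConnectionPool([PooledConnection("OLTP1", in_transaction=True),
                       PooledConnection("OLTP1"),
                       PooledConnection("OLTP2")])
interrupted = pool.on_down_event("OLTP1")
print(len(pool.connections), "connection(s) left,",
      len(interrupted), "interrupted mid-transaction")
```

Marking first and removing second means no new request can be handed a connection to the failed instance while the cleanup is still in progress.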

Q & A: Questions and Answers


Appendix

For More Information

http://search.oracle.com

or

otn.oracle.com/rac

REAL APPLICATION CLUSTERS


Useful Metalink Notes

• Note 342082.1 “How to Change Subnet Masks for VIPs”

• Note 294430.1 “CSS Timeout Computation in RAC 10g”

• Note 284752.1 “10g RAC: Steps To Increase CSS Misscount, Reboottime and Disktimeout”

• Note 291962.1 “Setting Up Bonding in SLES 9”

• Note 291958.1 “Setting Up Bonding in Suse SLES8”

• Note 298891.1 “Configuring Linux for the Oracle 10g VIP using bonding”

• Note 283107.1 “Configuring Solaris IP Multipathing (IPMP) for the Oracle 10g VIP”

OTN.ORACLE.COM/RAC

• Workload Management with Oracle Real Application Clusters (FAN, FCF, Load Balancing)

• Using standard NFS to support a third voting disk on a stretch cluster configuration on Linux

• Using Oracle Clusterware to Protect 3rd Party Applications

• RAC Sample Code Page: http://www.oracle.com/technology/sample_code/products/rac/index.html
