CRS-RAC Troubleshooting
Oracle Corporation
CRS & RAC Troubleshooting
Krishnadev Telikicherla
Cluster & Parallel Storage Technology
Oracle Corporation
Topics:
• Defining the Issue
• Creating a Timeline
• Hang or Slowdown
• Performance Issues
• Gathering Data
• Testcases
• Rediscovery
• Engaging Oracle Support
• Examples
Defining the Issue: Layers
• What layers are involved in the issue:
  – Oracle Clusterware
    - CRS daemon
    - CSS daemon
    - HangCheckTimer [Linux] / Oprocd (not Linux)
    - EVM
    - OCR
    - Voting
  – General RDBMS
  – Operating System
  – Hardware
Defining the Issue: Cause vs. Effects
• Causes:
  – Resource issues
  – Oracle issues
  – OS issues
• Effects:
  – Hangs/Spins
  – Instance Crashes and Evictions
  – Node Reboots and Evictions
  – Oracle Errors (ORA-600, ORA-7445, ORA-29740)
Defining the Issue: Description
• When describing the problem in an SR created via Metalink, use phrases that will help identify known issues, either in bugs or in Metalink content.
• In the body of the SR, be as detailed as possible about the environment.
• Nobody knows the system better than you.
• Talk to the sys-admin as well regarding OS/Network related issues.
Creating a Timeline
• A timeline helps identify the times to concentrate on when reviewing files.
• A timeline can be built by reviewing the files themselves once they are provided to Support, but doing so only slows resolution down.
• Timelines should include an ordering of causes and effects and should cover all participating nodes.
• Include specific times, e.g.:
  – At 3:00am PST we noticed that node2 was hanging.
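One way to assemble such a timeline is to merge per-node observations into a single time-ordered list. The following is a minimal Python sketch, not part of the original slides: the node names, timestamps, and messages are hypothetical, and it assumes all nodes report in the same time zone with synchronized clocks.

```python
from datetime import datetime

def build_timeline(events_by_node):
    """Merge per-node (timestamp, message) pairs into one time-ordered list."""
    merged = []
    for node, events in events_by_node.items():
        for stamp, message in events:
            # Timestamps are assumed to look like 'YYYY-MM-DD HH:MM:SS'.
            when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
            merged.append((when, node, message))
    # Order by time, then by node name so simultaneous events sort stably.
    merged.sort(key=lambda item: (item[0], item[1]))
    return merged

def format_timeline(timeline):
    """Render each entry as 'time [node] message'."""
    return [f"{when:%Y-%m-%d %H:%M:%S} [{node}] {message}"
            for when, node, message in timeline]

# Hypothetical observations, for illustration only.
events = {
    "node1": [
        ("2007-01-15 03:02:10", "CSS reports missed heartbeats from node2"),
        ("2007-01-15 03:04:30", "node2 evicted; reconfiguration starts"),
    ],
    "node2": [
        ("2007-01-15 03:00:00", "node2 noticed hanging"),
    ],
}

timeline_lines = format_timeline(build_timeline(events))
for line in timeline_lines:
    print(line)
```

Keeping the node name on every line makes it easier to see which node observed a cause and which observed the effect.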
Hang or Slowdown
• Differentiate between a database hang and a database slowdown
• Identify the extent of a hang
Is it a Hang or a Slowdown?
• Check:
  – System states to see if there is any change over a short period of time
  – V$SESSION_WAIT where wait_time=0
  – Overall machine load, including cpu, memory, swap, I/O
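The "any change over a short period of time" check can be illustrated by comparing two V$SESSION_WAIT snapshots taken a short time apart. The Python sketch below is an assumption-laden illustration, not an Oracle tool: the snapshots are hypothetical dictionaries keyed by SID, `seq` stands for the SEQ# column (which increments with each new wait), and the idle-event list is a small hypothetical sample.

```python
# Hypothetical sample of idle events to ignore; a real list would be longer.
IDLE_EVENTS = {"SQL*Net message from client", "rdbms ipc message"}

def classify_sessions(first, second, idle_events=IDLE_EVENTS):
    """Compare two snapshots and label each session from the first one."""
    verdicts = {}
    for sid, before in first.items():
        after = second.get(sid)
        if after is None:
            verdicts[sid] = "gone"  # session ended between snapshots
        elif before["event"] in idle_events and after["event"] in idle_events:
            verdicts[sid] = "idle"
        elif (before["wait_time"] == 0 and after["wait_time"] == 0
              and before["event"] == after["event"]
              and before["seq"] == after["seq"]):
            # Still in the very same wait: no progress between snapshots.
            verdicts[sid] = "stuck"
        else:
            verdicts[sid] = "progressing"
    return verdicts

# Hypothetical snapshots taken a minute apart.
snapshot_1 = {
    101: {"event": "enq: TX - row lock contention", "seq": 42, "wait_time": 0},
    102: {"event": "db file sequential read", "seq": 900, "wait_time": 0},
    103: {"event": "SQL*Net message from client", "seq": 7, "wait_time": 0},
}
snapshot_2 = {
    101: {"event": "enq: TX - row lock contention", "seq": 42, "wait_time": 0},
    102: {"event": "db file sequential read", "seq": 1350, "wait_time": 0},
    103: {"event": "SQL*Net message from client", "seq": 7, "wait_time": 0},
}

verdicts = classify_sessions(snapshot_1, snapshot_2)
print(verdicts)
```

If most non-idle sessions come back as "stuck", the situation looks more like a hang; if they are "progressing" but slowly, it looks more like a slowdown.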
Is it a Hang or a Slowdown?
• Single or multiprocess hang:
  – Usually characterized by a particular job hanging or not completing
  – Essentially the same as in single instance unless it’s internode parallel query.
• Instance hang: A single instance is unusable.
• Multi-instance or full database hang: Entire database is hung or not responding
Performance
• Single process or statement
• Instance
• Multi-Instance
Single Process or Single Statement
• Find the wait event
• 10046 level 12
  - oradebug setorapid <orapid>
  - oradebug event 10046 trace name context forever, level 12
  - oradebug tracefile_name
• Explain plan
• 10053 if plan problems are found
• V$SESSTAT
• Truss/trace/dbx/pstack if OS-related problems are suspected
Instance Slowdown
• Statspack / AWR
• OS performance statistics - cpu, memory, and I/O
• Characteristics:
  – Related to a particular job?
  – Certain time of day?
  – What’s changed?
Multi-Instance Slowdowns
• AWR from each node can be of use:
  – AWR collects instance specific data
  – Examine and correlate the reports
Multi-Instance Slowdowns
• In cases of extreme slowdowns:
  – systemstates on all nodes
  – V$SESSION_WAIT
  – Alert logs and any trace files
  – Process states, or stack traces if determined and applicable
Debugging Techniques
• v$session_wait
• System states from all nodes
• 10046 level 12 trace of the hung process
• ORADEBUG
• Lock layer and DLM tracing
• Get any traces:
  – DLM traces
  – Background processes, alert logs, and init.ora
  – User traces
Debugging and Diagnostics
• Performance issues or hangs:
  – Identify the resource being requested.
  – Identify who holds the resource.
ORADEBUG and Tools
• Hang analyze:
  – hanganalyze <level>
• Note: 301137.1 – OS Watcher User Guide
• Note: 135714.1 – Script to Collect RAC Diagnostic Information (racdiag.sql)
Gathering Data: Best Practices
• Single most important step
• There is never too much data, but including lots of useless data can increase the download time as well as the time needed to process the data.
• Always err on the side of getting too much data, but be aware of the impact on resolution time.
• Too little data increases resolution time more than too much data.
• Always include a readme.txt file that explains the contents of the provided files
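A readme.txt of this kind can be generated rather than typed by hand. The Python sketch below is one possible approach under stated assumptions: the file names and descriptions are hypothetical, and the output layout is an illustration rather than a format Oracle Support requires.

```python
import os
import tempfile

def write_readme(directory, descriptions):
    """Write readme.txt listing each file in `directory` with size and purpose."""
    lines = ["Contents of this upload:", ""]
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name == "readme.txt" or not os.path.isfile(path):
            continue
        size = os.path.getsize(path)
        # Flag files nobody described so they are noticed before upload.
        note = descriptions.get(name, "NO DESCRIPTION PROVIDED")
        lines.append(f"{name} ({size} bytes): {note}")
    readme_path = os.path.join(directory, "readme.txt")
    with open(readme_path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return readme_path

# Demonstration in a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as upload_dir:
    with open(os.path.join(upload_dir, "alert_node1.log"), "w") as f:
        f.write("hello")
    with open(os.path.join(upload_dir, "systemstate_node2.trc"), "w") as f:
        f.write("dump")
    descriptions = {"alert_node1.log": "alert log from node1 covering the hang"}
    with open(write_readme(upload_dir, descriptions)) as f:
        readme_text = f.read()

print(readme_text)
```

Files without a description are listed with a visible marker, so gaps in the explanation can be filled in before the data is sent.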
Gathering Data: Processes
• Always get stacks from processes that seem to be spinning, hanging or unresponsive:
  – oradebug
  – gdb
  – pstack
• ps and top info can be very useful when trying to determine whether a process exhibits issues such as memory leaks, spinning or hanging
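As a rough illustration of how repeated ps or top samples can point to these symptoms, the Python sketch below applies simple heuristics to a series of (CPU %, RSS in KB) readings for one process. The thresholds and sample values are hypothetical assumptions, and a flagged process still needs stack traces to confirm the diagnosis.

```python
def classify_process(samples, min_growth_kb=1024, spin_cpu=95.0):
    """Flag symptoms in time-ordered (cpu_percent, rss_kb) samples of one process."""
    findings = []
    if len(samples) < 2:
        return findings  # a single sample shows no trend
    cpu = [c for c, _ in samples]
    rss = [r for _, r in samples]
    # Memory that grows in every interval by a meaningful amount hints at a leak.
    grows = all(later > earlier for earlier, later in zip(rss, rss[1:]))
    if grows and rss[-1] - rss[0] >= min_growth_kb:
        findings.append("possible memory leak")
    # CPU pinned near 100% in every sample hints at a spin.
    if all(value >= spin_cpu for value in cpu):
        findings.append("possible spin")
    # No CPU at all may mean a hang, or simply an idle process.
    if all(value == 0.0 for value in cpu):
        findings.append("possibly hung or idle")
    return findings

# Hypothetical samples taken at regular intervals.
leaking = [(5.0, 100000), (6.0, 104000), (5.5, 109000)]
spinning = [(99.0, 50000), (100.0, 50000), (99.5, 50000)]

leak_findings = classify_process(leaking)
spin_findings = classify_process(spinning)
print(leak_findings, spin_findings)
```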
Gathering Data: RAC
• For instance evictions please review Metalink note 219361.1
• See Metalink note 203226.1: RAC Survival Kit: Real Application Clusters Troubleshooting and Information
• See Metalink note 289690.1: Data Gathering for Troubleshooting RAC and CRS issues
Gathering Data: Tools
• RDA – system and Oracle configuration information
• racdiag – modifiable SQL script for gathering RAC data. See Metalink note 135714.1 “Script to Collect RAC Diagnostic Information”
• OSW – OS Watcher gathers top, slabinfo, netstat and ps data over programmable intervals. See Metalink note 301137.1 “OS Watcher User Guide”
Gathering Data: CRS 10.2.0.x (continued)
• CRS and other resource issues:
  – ORA_CRS_HOME
    - log/<hostname>/cssd/oclsmon
    - log/<hostname>/cssd
    - log/<hostname>/client
    - log/<hostname>/crsd
    - log/<hostname>/evmd
    - log/<hostname>/racg
  – ORACLE_HOME (rdbms)
    - racg/dump
    - ORACLE_BASE/<db_name>/hdump
Gathering Data: Tools (continued)
• Starting with 10.2.0.1, $ORA_CRS_HOME/bin/diagcollection.pl collects all RAC-relevant files (run as root):

    oracle10@stnsp010> ./diagcollection.pl
    Production Copyright 2004, 2005, Oracle. All rights reserved
    Cluster Ready Services (CRS) diagnostic collection tool
    diagcollection
        --collect
            [--crs] For collecting crs diag information
            [--oh]  For collecting oracle home diag information
            [--ob]  For collecting oracle base diag information
            [--all] Default. For collecting all diag information
            NOTE:
            1. You can also do the following
               ./diagcollection.pl --collect --crs --oh
            2. ORA_CRS_HOME, ORACLE_HOME and ORACLE_BASE env variables need to be set.
        --clean        cleans up the diagnosability information gathered by this script
        --coreanalyze  extracts information from core files and stores it in a text file
Testcases
• Not always feasible
• If provided, can greatly influence resolution time
• When providing a testcase:
  – Include a readme file
  – Try to strip the testcase down to the minimal elements that are needed to reproduce the problem
• If at all possible, always try to build a testcase
• Testcases are your friends!
Rediscovery
• Expensive for a support organization
• Issue rediscovery is not always obvious
• Use Metalink to identify possible causes for issues as well as workarounds and patch availability
• Communicate new issues between DBAs
Engaging Oracle Support
• Try to be responsive to all TARs when they are set to CUS status. Delays inherently cause two problems:
  1. The issue loses momentum
  2. A new engineer may have to take over the issue
Examples
• 10.2.0.2 HP-UX/Itanium ServiceGuard, CRS, CFS and RAC
• Delays in reconfiguration
Examples
• 10.2.0.2 Linux CRS, RAC and ASM
• ORA-600[2103] and one instance crashed
Questions?