Maximizing Systems Availability - DOAG
Transcript of Maximizing Systems Availability - DOAG
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.
Maximizing Systems Availability: Best Practices & Lessons Learned
Dr.-Ing. Holger Leister, EMEA Systems Support, September 25, 2014
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
Agenda
1 Introduction
2 Architect Phase: Systems Architecture
3 Implement Phase: Installation & Configuration
4 Manage Phase: Operational Best Practices
SUPerG 2005, Washington D.C.
Tom Chalfant
High Availability
Availability is the degree to which an application, service, or function is accessible on demand. […] If a user cannot access the system, it is said to be unavailable. Generally, the term downtime is used to refer to periods when a system is unavailable. Users who want their systems to be always ready to serve them need high availability. A system that is highly available is designed to provide uninterrupted computing services during essential time periods, during most hours of the day, and most days of the week throughout the year; this measurement is often shown as 24x365.
Oracle® Database High Availability Overview
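To make the definition concrete, an availability target can be translated into a yearly downtime budget. The following is a minimal sketch, not taken from the Oracle documentation: the function name and the sample targets are assumptions chosen for illustration, and the year is taken as the 24x365 measurement quoted above.

```python
# Illustrative only: converts an availability target into a yearly downtime budget.
HOURS_PER_YEAR = 24 * 365  # the "24x365" measurement from the definition above


def downtime_budget_hours(availability_percent: float) -> float:
    """Hours per year a system may be unavailable at the given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


# Sample targets are assumptions, not SLA values from the presentation.
for target in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_budget_hours(target)
    print(f"{target}% -> {hours:.2f} h/year ({hours * 60:.1f} min/year)")
```

Each additional "nine" cuts the downtime budget by a factor of ten, which is why the objectives should be defined before choosing an architecture.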
Causes of Downtime
Source: Oracle® Database High Availability Overview
Unplanned Downtime
• Site failure
• Clusterwide failure
• Computer failure
• Storage failure
• Data corruption
• Human error
• Lost writes
• Hang or slowdown
Planned Downtime
• System and database changes
• Data changes
• Application changes
High-Availability: Trade-Off
Return on Investment
Risk
Complexity
Cost
High-Availability: 3 P's
People
Product
Process
High-Availability
• Define your objectives
– Trade-off: return on investment, cost, complexity, risk
• Cover 3 P's: people, product, process
• Follow best practices
Assume Nothing
Plan Ahead
Reuse Configurations
Get Trained
Test Everything
Document Everything
Setup Processes
Exploit External Resources
Minimize Risk
Keep It Simple
Architect Phase: Systems Architecture
Oracle Maximum Availability Architecture (MAA)
Architecture, Configuration, and Operational Practices = Maximize Availability
• Architecture – Enable
• Configuration – Optimize
• Operations – Maintenance for Stability and Availability
Systems Architecture: High Availability Architecture
• Tolerate failures such that processing continues with minimal or no interruption
• Be transparent to—or tolerant of—system, data, or application changes
• Provide built-in preventive measures
• Provide active monitoring and fast detection of failures
• Provide fast recoverability
• Automate detection and recovery operations
• Protect the data to minimize or prevent data loss
• Implement the operational best practices to manage your environment
• Achieve the goals set in Service Level Agreements (SLAs) for the lowest possible total cost of ownership
Oracle® Database High Availability Overview
Systems Architecture: Oracle Maximum Availability Architecture (MAA)
• MAA Best Practices
– Oracle Database
– Exadata Database Machine
...
• Case Studies
• Documentation
• Demonstrations
• Articles, Presentations
High Availability (HA) Service Level Tiers
Trade-off: Return on Investment, Risk, Complexity, Cost
Source: Oracle® Database High Availability Overview - 12c Release 1 (12.1)
• MAA technical white papers
• Oracle Database High Availability Overview, Chapter 6, "Operational Prerequisites to Maximizing Availability"
• Oracle Database High Availability Best Practices
Systems Architecture: Maximum Availability Architecture (MAA) – Database
• Comprehensive protection from failures: Server – Storage – Network – Site – Corruptions
Oracle® Database High Availability Overview - 12c Release 1 (12.1)
Systems Architecture: Redundancy Across Complete Stack
Minimize Single Points of Failure (SPOFs)
• Hardware
– Servers
– Storage
– Network
– Power
– Data center
• Software
– Applications
– Middleware
– Database
– RAC
– ASM
– Data Guard ...
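The effect of removing single points of failure can be estimated with basic availability arithmetic. The sketch below is an illustration under stated assumptions, not a method from the presentation: component failures are assumed to be independent, and the 99% per-component availability is a hypothetical value.

```python
def series_availability(parts):
    """All components are required, so their availabilities multiply."""
    result = 1.0
    for availability in parts:
        result *= availability
    return result


def redundant_availability(availability, copies):
    """The service survives while at least one of `copies` identical,
    independent components is up."""
    return 1.0 - (1.0 - availability) ** copies


# Hypothetical stack: server, storage, network, each 99% available.
single = series_availability([0.99, 0.99, 0.99])              # every layer is a SPOF
paired = series_availability([redundant_availability(0.99, 2)] * 3)  # every layer duplicated

print(f"single components: {single:.6f}")
print(f"redundant pairs:   {paired:.6f}")
```

Real failures are rarely fully independent (shared power, shared site, human error), so the redundant figure is an upper bound rather than a prediction.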
Systems Architecture: Engineered Systems – Database
• Expedited time to value
• Easier to manage and upgrade
• Lower cost of ownership
• Reduced change management risk
• One-stop support
• Extreme performance
Engineered systems shown: Exalytics, Exadata Database Machine, Exalogic Elastic Cloud, Database Appliance, SuperCluster, Big Data Appliance
Systems Architecture: Example – Exadata Architecture, Highly Redundant
• Redundant database servers, storage servers, InfiniBand switches
• Two Redundant Power Distribution Units (PDUs)
• Each database server, storage server, InfiniBand switch has redundant hot-swap power supplies
• Disk drives and fans are hot-swappable
• InfiniBand network highly redundant – can tolerate loss of entire switch or connections
• RAC provides protection against database server failures
• ASM provides protection against storage server failures
Implement Phase: Installation & Configuration
Installation & Configuration: Product Documentation – Site Requirements
• Prepare data center using official product documentation & specifications
– http://docs.oracle.com/
Installation & Configuration: Space Requirements
• Check access route (doors, ramps, elevators) for large systems
– Example: Exalogic Elastic Cloud Machine
Physical Dimensions
Height: 1998 mm
Width: 600 mm with side panels
Depth: 1200 mm
Access Route Requirements
Minimum door height: 2184 mm
Minimum door width: 1220 mm
Minimum elevator depth: 1575 mm
Maximum incline: 6 degrees
Minimum elevator, pallet jack, and floor loading capacity: 1134 kg
Maintenance Access Requirements
Rear maintenance: 914 mm
Front maintenance: 914 mm
Top maintenance: 914 mm
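A site survey can be checked against the access route requirements listed above before delivery is scheduled. This is a minimal sketch: the limits are the Exalogic example values from the slide, while the class, the function and the surveyed site values are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Limits as listed above for the Exalogic Elastic Cloud Machine example.
MIN_DOOR_HEIGHT_MM = 2184
MIN_DOOR_WIDTH_MM = 1220
MIN_ELEVATOR_DEPTH_MM = 1575
MAX_INCLINE_DEGREES = 6.0
MIN_LOAD_CAPACITY_KG = 1134


@dataclass(frozen=True)
class AccessRoute:
    """Narrowest/weakest points measured along the delivery route."""
    door_height_mm: int
    door_width_mm: int
    elevator_depth_mm: int
    incline_degrees: float
    load_capacity_kg: int


def access_route_problems(route: AccessRoute) -> List[str]:
    """Return one message per requirement the route fails to meet."""
    problems = []
    if route.door_height_mm < MIN_DOOR_HEIGHT_MM:
        problems.append(f"door height {route.door_height_mm} mm < {MIN_DOOR_HEIGHT_MM} mm")
    if route.door_width_mm < MIN_DOOR_WIDTH_MM:
        problems.append(f"door width {route.door_width_mm} mm < {MIN_DOOR_WIDTH_MM} mm")
    if route.elevator_depth_mm < MIN_ELEVATOR_DEPTH_MM:
        problems.append(f"elevator depth {route.elevator_depth_mm} mm < {MIN_ELEVATOR_DEPTH_MM} mm")
    if route.incline_degrees > MAX_INCLINE_DEGREES:
        problems.append(f"incline {route.incline_degrees} degrees > {MAX_INCLINE_DEGREES} degrees")
    if route.load_capacity_kg < MIN_LOAD_CAPACITY_KG:
        problems.append(f"load capacity {route.load_capacity_kg} kg < {MIN_LOAD_CAPACITY_KG} kg")
    return problems


# Hypothetical survey result: one door on the route is too low.
site = AccessRoute(2100, 1300, 1600, 4.0, 1500)
for problem in access_route_problems(site):
    print("FAIL:", problem)
```

For other systems the limits should be taken from the official product documentation rather than from this example.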
Installation & Configuration: Weight Requirements
• Check the weight requirements for raised floors
Installation & Configuration: Data Center Room Specifications
• Prepare final system location
– Floor cut-outs, maintenance areas, stabilization
– Redundant power sources
– Redundant network connections
– Example: Exalogic Elastic Cloud Machine
Installation & Configuration: Temperature and Humidity Requirements
• Prepare environmental conditions → best practices settings
– Example: Exalogic Elastic Cloud Machine
Installation & Configuration: Ventilation and Cooling Requirements
• Prepare cooling and airflow based on product specifications
– Example: Exalogic Elastic Cloud Machine
Strategies for Solving the Datacenter Space, Power, and Cooling Crunch: Sun Server and Storage Optimization Techniques
Installation & Configuration: Cleanliness Requirements
• Minimize contamination → Prevent downtime
– Subfloor Surface Cleaning
– Raised Floor Surface Cleaning
– Exterior Equipment Surface Cleaning
– Interior Server Cabinet Cleaning
– Anti-Static Floor Finishing …
• ISO 14644-1 Classification of Air Cleanliness – Cleanroom standards
Sun Microsystems Data Center Site Planning Guide Data Centers’ Best Practices
Installation & Configuration: Maximum Availability Architecture (MAA)
• MAA Best Practices
– Oracle Database
– Exadata Database Machine
...
• Case Studies
• Documentation
• Demonstrations
• Articles, Presentations
Installation & Configuration: General Deployment Advice – Example: Exadata
• Use the defaults as much as possible
– Use the tested configuration
– Avoid customizations and non-standard installs
• MAA best practices installed at deployment
– Customers should review and observe before any reconfiguration
• Verify Supported Configurations
– Oracle products are certified to work with specific versions of other Oracle products
– Apps in particular are sensitive
Installation & Configuration: Enterprise Installation Standards (EIS) Checklists
Manage Phase: Operational Best Practices
Operational Best Practices: High Availability and Performance Service Level Agreements
• Understand the attributes of high availability and various causes of downtime
• Understand and document your high availability and performance service-level agreements (SLAs)
Operational Best Practices: High Availability Architecture
• Implement and validate a high availability architecture that meets your SLAs
• Create an outage/solution matrix that maps to your SLAs
– Example: Exadata MAA Outage and Solution Matrix
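An outage/solution matrix can be kept as data and checked against the SLA. The sketch below is not the Exadata MAA matrix: the outage types are the unplanned downtime causes listed earlier, RAC, ASM and Data Guard are named as examples only, and all recovery times and the recovery time objective are hypothetical placeholders.

```python
from typing import Dict, List, NamedTuple

# Unplanned downtime causes as listed in the "Causes of Downtime" slide.
OUTAGE_TYPES = [
    "Site failure", "Clusterwide failure", "Computer failure", "Storage failure",
    "Data corruption", "Human error", "Lost writes", "Hang or slowdown",
]


class Entry(NamedTuple):
    solution: str
    expected_recovery_min: float  # placeholder estimate, to be replaced by test results


# Deliberately incomplete example matrix; the numbers are made up for illustration.
matrix: Dict[str, Entry] = {
    "Computer failure": Entry("RAC", 1.0),
    "Storage failure": Entry("ASM", 1.0),
    "Site failure": Entry("Data Guard failover", 30.0),
}


def sla_gaps(matrix: Dict[str, Entry], rto_min: float) -> List[str]:
    """List outage types with no documented solution or a recovery time above the RTO."""
    findings = []
    for outage in OUTAGE_TYPES:
        entry = matrix.get(outage)
        if entry is None:
            findings.append(f"{outage}: no solution documented")
        elif entry.expected_recovery_min > rto_min:
            findings.append(f"{outage}: {entry.solution} exceeds RTO of {rto_min} min")
    return findings


for finding in sla_gaps(matrix, rto_min=15):
    print(finding)
```

Keeping the matrix in a reviewable form makes it easy to see which outage types have not yet been mapped to a tested solution.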
Operational Best Practices: Testing
• Establish test practices and environment
– Test system should be a replica of the production MAA environment
• Validate high availability and performance SLAs
– Fault-injection testing to validate the expected high availability response and all automatic, automated, or manual repair solutions
– Backup and recovery procedures
– Data Guard
– Patching ...
• Document your testing
Operational Best Practices: Security
• Set up and use security best practices
– Reduces the chance of outages due to security breaches
– Oracle® Database Security Guide
Operational Best Practices: Change Management
• Establish change control procedures
– No changes in primary database before rigorous evaluation on test systems
– Document and follow change management process
– Review changes and get approval from change management team
Operational Best Practices: Software Upgrades & Patching
• Provide a plan to test and apply recommended patches and software periodically
– Check for any critical software issues
  • Perform health check (e.g., ORAchk, Exachk, …)
– Use proper patch testing and patching practices
  • Patching method: Rolling vs. Non-Rolling
• Validate all updates and changes on a test system
– MOS Note 888828.1 (Exadata):
“[…] It is not required or necessary to install every new patch release. A patch should be installed on a production system only after it has been validated in a proper test environment, and no less than one month after release to allow field experience to solidify. However, if the system requires a fix that is available in a newer version, then plans to adopt a newer version should be accelerated. […]”
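The rule quoted from the MOS note can be written down as a simple decision function. This is a sketch of one possible reading, not an Oracle tool: the function name, the return strings and the approximation of "one month" as 30 days are assumptions.

```python
from datetime import date, timedelta

# "No less than one month after release", approximated here as 30 days (assumption).
FIELD_EXPERIENCE = timedelta(days=30)


def patch_decision(release_date: date, today: date,
                   validated_on_test: bool, fixes_needed_issue: bool = False) -> str:
    """Decide how to treat a patch release for a production system."""
    if not validated_on_test:
        # Validation in a proper test environment always comes first.
        return "validate on test system first"
    if today - release_date >= FIELD_EXPERIENCE:
        return "eligible for production"
    if fixes_needed_issue:
        # The system requires a fix from the newer version: accelerate the plan.
        return "accelerate adoption"
    return "wait for field experience"


print(patch_decision(date(2014, 8, 1), date(2014, 9, 25), validated_on_test=True))
```

The decision still belongs to the change management process; the function only makes the waiting period and the test prerequisite explicit.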
Operational Best Practices: Data Guard
• Execute Data Guard role transitions
– MOS Notes 1304939.1, 1305019.1
– MAA Whitepapers
– Perform periodic switchover operations every three to six months
Operational Best Practices: Escalation Management
• Establish escalation management procedures
– Downtime can be prolonged if proper escalation policies are not followed and decisions are not made quickly
– Execute repair and failover operations first, then gather diagnostic data for Root Cause Analysis (RCA)
Operational Best Practices: Health Checks
• Execute database health checks periodically (e.g., ORAchk, Exachk, …)
– Frequency
  • After initial deployment
  • Before planned system maintenance
  • After planned system maintenance
  • At least once every three months
– ORAchk - Health Checks for the Oracle Stack (MOS Note ID 1268927.2)
  • Proactively scans for the most impactful problems across the various layers of your stack
  • Scope: Oracle database servers, Grid Infrastructure, Oracle databases, hardware, operating system, and RAC software
  • The non-intrusive tool audits configuration settings within the following categories: OS kernel parameters, OS packages, CRS/Grid Infrastructure, RDBMS, ASM, database parameters, …
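The frequency rules above can be turned into a reminder check. The sketch below does not invoke ORAchk or Exachk; it only decides whether a run is due. The function name, the flags and the approximation of "three months" as 91 days are assumptions.

```python
from datetime import date, timedelta
from typing import Optional

# "At least once every three months", approximated here as 91 days (assumption).
MAX_INTERVAL = timedelta(days=91)


def health_check_due(last_run: Optional[date], today: date,
                     maintenance_planned: bool = False,
                     maintenance_completed: bool = False) -> bool:
    """Return True if a health check should be run now."""
    if last_run is None:
        return True  # never run: covers the "after initial deployment" case
    if maintenance_planned or maintenance_completed:
        return True  # run before and after planned system maintenance
    return today - last_run >= MAX_INTERVAL


print(health_check_due(date(2014, 6, 1), date(2014, 9, 25)))
```

The caller is expected to clear the maintenance flags once the corresponding run has been recorded.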
Operational Best Practices: Health Checks (Exachk)
• Recommendation + Risk Analysis + Direction/Steps
• Installed versions validation
• Critical issue exposure
Operational Best Practices: Monitoring
• Configure Oracle Enterprise Manager monitoring infrastructure for High Availability
– Detect and react to performance and high availability related thresholds to avoid potential downtime
Operational Best Practices: Auto Service Request (ASR)
• Configure Auto Service Request (ASR) infrastructure
– MOS Note 1185493.1
• Comprehensive fault coverage: CPU, disk controllers, disks, flash cards, flash modules, InfiniBand, memory, system board, power supplies, fans
Operational Best Practices: Platinum Support – Engineered Systems
Operational Best Practices: MAA Best Practices
• Check the latest MAA best practices
– MAA Portal
  • MAA Best Practices
    – Oracle Database
    – Exadata Database Machine
  • Case Studies
  • Documentation
  • Articles, Presentations
– Oracle Database High Availability Best Practices documentation
  • High Availability Overview
  • High Availability Best Practices
    – Operational and configuration best practices
Operational Best Practices: Technical Development
Operational Best Practices: My Oracle Support (MOS)
• Refer to MOS knowledge articles, e.g.,
– Working Effectively With Oracle Support - Best Practices (Doc ID 166650.1)
  • Working with Support (How to open a Service Request? When to escalate?)
  • MOS
  • Support Policies
– Accessing the Oracle System Handbook on My Oracle Support (Doc ID 1227213.1)
– Oracle Premier Support: Get Proactive! (Doc ID 432.1)
– Tools and Training (Doc ID 67032.1)
... and many more ...
Operational Best Practices: Product Documentation
• Refer to product documentation: database, servers, storage, network, ...
MAA Summary and Key Takeaways
MAA best practices
Health checks and operational best practices
Proactive monitoring and management
Keep current with patch sets
Best protection is a strong defense
SUPerG Top Principles
Assume Nothing
Plan Ahead
Reuse Configurations
Get Trained
Test Everything
Document Everything
Setup Processes
Exploit External Resources
Minimize Risk
Keep It Simple
Resources
• Product documentation http://docs.oracle.com/
• Database http://www.oracle.com/technetwork/documentation/index.html#database
• OTN High Availability Portal http://www.oracle.com/goto/availability
• Maximum Availability Architecture (MAA) http://www.oracle.com/goto/maa
• MAA Blogs http://blogs.oracle.com/maa