Maximizing Systems Availability - DOAG

Copyright © 2014, Oracle and/or its affiliates. All rights reserved.

Maximizing Systems Availability: Best Practices & Lessons Learned

Dr.-Ing. Holger Leister, EMEA Systems Support

September 25, 2014

Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

Agenda

1. Introduction

2. Architect Phase: Systems Architecture

3. Implement Phase: Installation & Configuration

4. Manage Phase: Operational Best Practices

SUPerG 2005, Washington D.C.

Tom Chalfant

Introduction

High Availability

Availability is the degree to which an application, service, or function is accessible on demand. […] If a user cannot access the system, it is said to be unavailable. Generally, the term downtime is used to refer to periods when a system is unavailable. Users who want their systems to be always ready to serve them need high availability. A system that is highly available is designed to provide uninterrupted computing services during essential time periods, during most hours of the day, and most days of the week throughout the year; this measurement is often shown as 24x365.

Oracle® Database High Availability Overview
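To make the 24x365 measurement concrete, the following minimal sketch converts an availability percentage into the downtime it allows per year. The function name and the sample targets are illustrative assumptions, not values from the slides.

```python
HOURS_PER_YEAR = 24 * 365  # the "24x365" service window


def annual_downtime_hours(availability_percent: float) -> float:
    """Hours per year a service may be unavailable at the given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


# Illustrative targets only; they are not taken from the slides.
for target in (99.0, 99.9, 99.99):
    hours = annual_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} hours of downtime per year")
```

Each additional "nine" cuts the permitted downtime by a factor of ten, which is one way to frame the trade-off between availability objectives and cost.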

Causes of Downtime (Oracle® Database High Availability Overview)

Unplanned Downtime

• Site failure

• Clusterwide failure

• Computer failure

• Storage failure

• Data corruption

• Human error

• Lost writes

• Hang or slowdown

Planned Downtime

• System and database changes

• Data changes

• Application changes

High-Availability: Trade-Off

Return on Investment

Risk

Complexity

Cost

High-Availability: 3 P’s

People

Product

Process

High-Availability

• Define your objectives

– Trade-off: return on investment, cost, complexity, risk

• Cover 3 P’s: people, product, process

• Follow best practices

Assume Nothing

Plan Ahead

Reuse Configurations

Get Trained

Test Everything

Document Everything

Setup Processes

Exploit External Resources

Minimize Risk

Keep It Simple

Architect Phase: Systems Architecture

Oracle Maximum Availability Architecture (MAA): Architecture, Configuration, and Operational Practices = Maximize Availability

MAA: Architecture, Configuration Best Practices, Operational Best Practices

• Architecture – Enable

• Configuration – Optimize

• Operations – Maintenance for Stability and Availability

Systems Architecture: High Availability Architecture

• Tolerate failures such that processing continues with minimal or no interruption

• Be transparent to—or tolerant of—system, data, or application changes

• Provide built-in preventive measures

• Provide active monitoring and fast detection of failures

• Provide fast recoverability

• Automate detection and recovery operations

• Protect the data to minimize or prevent data loss

• Implement the operational best practices to manage your environment

• Achieve the goals set in Service Level Agreements (SLAs) for the lowest possible total cost of ownership

Oracle® Database High Availability Overview

Systems Architecture: Oracle Maximum Availability Architecture (MAA)

• MAA Best Practices

– Oracle Database

– Exadata Database Machine

...

• Case Studies

• Documentation

• Demonstrations

• Articles, Presentations

Systems Architecture: Oracle Maximum Availability Architecture (MAA)

High Availability (HA) Service Level Tiers

Trade-off: Return on Investment – Risk – Complexity – Cost

Oracle® Database High Availability Overview - 12c Release 1 (12.1)

• MAA technical white papers

• Oracle Database High Availability Overview, Chapter 6, "Operational Prerequisites to Maximizing Availability"

• Oracle Database High Availability Best Practices

Systems Architecture: Maximum Availability Architecture (MAA) – Database

• Comprehensive protection from failures: Server – Storage – Network – Site – Corruptions

Oracle® Database High Availability Overview - 12c Release 1 (12.1)

Systems Architecture: Redundancy Across Complete Stack

Minimize Single Points of Failure (SPOFs)

• Hardware

– Servers

– Storage

– Network

– Power

– Data center

• Software

– Applications

– Middleware

– Database

– RAC

– ASM

– Data Guard ...
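The effect of removing single points of failure across the stack can be estimated with basic probability. The sketch below assumes that components fail independently and uses made-up availability figures; the function names are hypothetical and the numbers are not Oracle data.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Every layer is required: the stack is up only if all layers are up."""
    return prod(availabilities)


def redundant(availability: float, copies: int) -> float:
    """Any one copy suffices: the group is down only if all copies are down."""
    return 1.0 - (1.0 - availability) ** copies


# Assumed figure for illustration: each layer is 99% available on its own.
single = series(0.99, 0.99, 0.99)  # server, storage, network without redundancy
paired = series(*(redundant(0.99, 2) for _ in range(3)))  # each layer duplicated

print(f"no redundancy:  {single:.6f}")
print(f"each layer x2:  {paired:.6f}")
```

Stacking layers in series lowers overall availability below that of any single layer, while duplicating each layer raises it again. Real failures are rarely fully independent, so treat the result as an upper bound.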

Systems Architecture: Engineered Systems – Database

• Expedited time to value

• Easier to manage and upgrade

• Lower cost of ownership

• Reduced change management risk

• One-stop support

• Extreme performance

Exalytics – Exadata Database Machine – Exalogic Elastic Cloud – Database Appliance – SuperCluster – Big Data Appliance

Systems Architecture: Example – Exadata Architecture (Highly Redundant)

• Redundant database servers, storage servers, InfiniBand switches

• Two redundant Power Distribution Units (PDUs)

• Each database server, storage server, and InfiniBand switch has redundant hot-swap power supplies

• Disk drives and fans are hot-swappable

• InfiniBand network is highly redundant – can tolerate loss of an entire switch or of connections

• RAC provides protection against database server failures

• ASM provides protection against storage server failures

Implement Phase: Installation & Configuration

Installation & Configuration: Product Documentation – Site Requirements

• Prepare data center using official product documentation & specifications

– http://docs.oracle.com/

Installation & Configuration: Space Requirements

• Check access route (doors, ramps, elevators) for large systems

– Example: Exalogic Elastic Cloud Machine

Physical Dimensions

Height: 1998 mm

Width: 600 mm with side panels

Depth: 1200 mm

Access Route Requirements

Minimum door height: 2184 mm

Minimum door width: 1220 mm

Minimum elevator depth: 1575 mm

Maximum incline: 6 degrees

Minimum elevator, pallet jack, and floor loading capacity: 1134 kg

Maintenance Access Requirements

Rear maintenance: 914 mm

Front maintenance: 914 mm

Top maintenance: 914 mm
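A site survey can be checked against the access route figures on this slide before delivery day. The sketch below is one possible way to do that; the limits are copied from the slide, while the `AccessRoute` type, the function name, and the sample measurements are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Limits copied from the access route requirements on this slide.
MIN_DOOR_HEIGHT_MM = 2184
MIN_DOOR_WIDTH_MM = 1220
MIN_ELEVATOR_DEPTH_MM = 1575
MAX_INCLINE_DEGREES = 6
MIN_LOAD_CAPACITY_KG = 1134


@dataclass
class AccessRoute:
    """Measurements taken along the delivery path (hypothetical structure)."""
    door_height_mm: float
    door_width_mm: float
    elevator_depth_mm: float
    incline_degrees: float
    load_capacity_kg: float


def route_problems(route: AccessRoute) -> List[str]:
    """Return a description of every requirement the route fails to meet."""
    problems = []
    if route.door_height_mm < MIN_DOOR_HEIGHT_MM:
        problems.append("door too low")
    if route.door_width_mm < MIN_DOOR_WIDTH_MM:
        problems.append("door too narrow")
    if route.elevator_depth_mm < MIN_ELEVATOR_DEPTH_MM:
        problems.append("elevator too shallow")
    if route.incline_degrees > MAX_INCLINE_DEGREES:
        problems.append("ramp too steep")
    if route.load_capacity_kg < MIN_LOAD_CAPACITY_KG:
        problems.append("load capacity too low")
    return problems


# Made-up survey values for illustration.
survey = AccessRoute(2000, 1250, 1600, 8, 1500)
print(route_problems(survey))
```

Always take the limits for a specific product from its current site planning documentation; the values here only mirror the example on the slide.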

Installation & Configuration: Weight Requirements

• Check the weight requirements for raised floors

Installation & Configuration: Data Center Room Specifications

• Prepare final system location: floor cut-outs, maintenance areas, stabilization

• Prepare final system location: redundant power sources

• Prepare final system location: redundant network connections

– Example: Exalogic Elastic Cloud Machine

Installation & Configuration: Temperature and Humidity Requirements

• Prepare environmental conditions → best practices settings

– Example: Exalogic Elastic Cloud Machine

Installation & Configuration: Ventilation and Cooling Requirements

• Prepare cooling and airflow based on product specifications

– Example: Exalogic Elastic Cloud Machine

Strategies for Solving the Datacenter Space, Power, and Cooling Crunch: Sun Server and Storage Optimization Techniques

Installation & Configuration: Cleanliness Requirements

• Minimize contamination → Prevent downtime

– Subfloor Surface Cleaning

– Raised Floor Surface Cleaning

– Exterior Equipment Surface Cleaning

– Interior Server Cabinet Cleaning

– Anti-Static Floor Finishing …

• ISO 14644-1 Classification of Air Cleanliness – Cleanroom standards

Sun Microsystems Data Center Site Planning Guide: Data Centers’ Best Practices

Installation & Configuration: Maximum Availability Architecture (MAA)

• MAA Best Practices

– Oracle Database

– Exadata Database Machine

...

• Case Studies

• Documentation

• Demonstrations

• Articles, Presentations

Installation & Configuration: General Deployment Advice – Example: Exadata

• Use the defaults as much as possible

– Use the tested configuration

– Avoid customizations and non-standard installs

• MAA best practices installed at deployment

– Customers should review and observe before any reconfiguration

• Verify Supported Configurations

– Oracle products are certified to work with specific versions of other Oracle products

– Apps in particular are sensitive

Installation & Configuration: Enterprise Installation Standards (EIS) Checklists

Manage Phase: Operational Best Practices

Operational Best Practices: High Availability and Performance Service Level Agreements

• Understand the attributes of high availability and various causes of downtime

• Understand and document your high availability and performance service-level agreements (SLAs)

Unplanned Downtime

• Site failure

• Clusterwide failure

• Computer failure

• Storage failure

• Data corruption

• Human error

• Lost writes

• Hang or slowdown

Planned Downtime

• System and database changes

• Data changes

• Application changes

Operational Best Practices: High Availability Architecture

• Implement and validate a high availability architecture that meets your SLAs

• Create an outage/solution matrix that maps to your SLAs

– Example: Exadata MAA Outage and Solution Matrix
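An outage/solution matrix can be kept as structured data so that gaps against the SLA are found mechanically. The following is a minimal sketch, not the Exadata MAA matrix: the outage types come from the unplanned downtime list in this deck, while the solutions, recovery times, and the `find_gaps` helper are hypothetical placeholders to be replaced with validated values.

```python
from typing import Dict, Tuple

# Outage types from the "Unplanned Downtime" list in this deck.
UNPLANNED_OUTAGES = [
    "Site failure", "Clusterwide failure", "Computer failure", "Storage failure",
    "Data corruption", "Human error", "Lost writes", "Hang or slowdown",
]

# Hypothetical entries: outage type -> (solution, expected recovery in minutes).
MATRIX: Dict[str, Tuple[str, int]] = {
    "Computer failure": ("RAC", 1),
    "Storage failure": ("ASM", 0),
    "Site failure": ("Data Guard failover", 5),
}


def find_gaps(matrix: Dict[str, Tuple[str, int]], sla_minutes: int) -> Dict[str, str]:
    """Return {outage type: reason} for each outage not covered within the SLA."""
    gaps = {}
    for outage in UNPLANNED_OUTAGES:
        if outage not in matrix:
            gaps[outage] = "no solution documented"
            continue
        solution, recovery_minutes = matrix[outage]
        if recovery_minutes > sla_minutes:
            gaps[outage] = f"{solution} needs {recovery_minutes} min, SLA allows {sla_minutes} min"
    return gaps


for outage, reason in find_gaps(MATRIX, sla_minutes=2).items():
    print(f"{outage}: {reason}")
```

Keeping the matrix in a reviewable form also gives fault-injection testing a checklist: every row is a scenario whose documented recovery time should be confirmed on the test system.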

Operational Best Practices: Testing

• Establish test practices and environment

– Test system should be a replica of the production MAA environment

• Validate high availability and performance SLAs

– Fault-injection testing to validate the expected high availability response and all automatic, automated, or manual repair solutions

– Backup and recovery procedures

– Data Guard

– Patching ...

• Document your testing

Operational Best Practices: Security

• Set up and use security best practices

– Reduces the chance of outages due to security breaches

– Oracle® Database Security Guide

Operational Best Practices: Change Management

• Establish change control procedures

– No changes in primary database before rigorous evaluation on test systems

– Document and follow change management process

– Review changes and get approval from change management team

Operational Best Practices: Software Upgrades & Patching

• Provide a plan to test and apply recommended patches and software periodically

– Check for any critical software issues

  • Perform health check (e.g., ORAchk, Exachk, …)

– Use proper patch testing and patching practices

  • Patching method: Rolling vs. Non-Rolling

• Validate all updates and changes on a test system

– MOS Note 888828.1 (Exadata):

“[…] It is not required or necessary to install every new patch release. A patch should be installed on a production system only after it has been validated in a proper test environment, and no less than one month after release to allow field experience to solidify. However, if the system requires a fix that is available in a newer version, then plans to adopt a newer version should be accelerated. […]”
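The guidance quoted from MOS Note 888828.1 can be read as a simple gate before a patch reaches production. The sketch below is one interpretation under stated assumptions: "one month" is approximated as 30 days, and a required fix is taken to waive the waiting period but never the test validation. The function and parameter names are hypothetical.

```python
from datetime import date, timedelta

MIN_FIELD_EXPOSURE = timedelta(days=30)  # assumption: "one month" taken as 30 days


def ready_for_production(
    release_date: date,
    today: date,
    validated_on_test: bool,
    contains_required_fix: bool = False,
) -> bool:
    """Decide whether a patch release may be scheduled for production."""
    if not validated_on_test:
        return False  # validation in a proper test environment is always required
    if contains_required_fix:
        return True  # assumption: an urgently needed fix accelerates adoption
    return today - release_date >= MIN_FIELD_EXPOSURE


# Made-up dates for illustration.
print(ready_for_production(date(2014, 8, 1), date(2014, 9, 25), validated_on_test=True))
print(ready_for_production(date(2014, 9, 10), date(2014, 9, 25), validated_on_test=True))
```

A real change process would add approval and rollback checks on top of this gate.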

Operational Best Practices: Data Guard

• Execute Data Guard role transitions

– MOS Notes 1304939.1, 1305019.1

– MAA Whitepapers

– Perform periodic switchover operations every three to six months

Operational Best Practices: Escalation Management

• Establish escalation management procedures

– Downtime can be prolonged if proper escalation policies are not followed and decisions are not made quickly

– Execute repair and failover operations first, then gather diagnostic data for Root Cause Analysis (RCA)

Operational Best Practices: Health Checks

• Execute database health checks periodically (e.g., ORAchk, Exachk, …)

– Frequency

  • After initial deployment

  • Before planned system maintenance

  • After planned system maintenance

  • At least once every three months

– ORAchk - Health Checks for the Oracle Stack (MOS Note ID 1268927.2)

  • Proactively scans for the most impactful problems across the various layers of your stack

  • Scope: Oracle database servers, Grid Infrastructure, Oracle databases, hardware, operating system, and RAC software

  • The non-intrusive tool audits configuration settings within the following categories: OS kernel parameters, OS packages, CRS/Grid Infrastructure, RDBMS, ASM, database parameters, …
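The four frequency rules above can be turned into a small reminder check. This is a sketch only: it does not call ORAchk or Exachk, it approximates "three months" as 90 days, and it treats a missing previous run as the initial deployment case. All names are hypothetical.

```python
from datetime import date, timedelta
from typing import List, Optional

MAX_INTERVAL = timedelta(days=90)  # assumption: "three months" taken as 90 days


def health_check_reasons(
    today: date,
    last_check: Optional[date],
    maintenance_pending: bool = False,
    maintenance_just_completed: bool = False,
) -> List[str]:
    """Return every rule from the slide that currently calls for a health check."""
    reasons = []
    if last_check is None:
        reasons.append("after initial deployment")
    elif today - last_check >= MAX_INTERVAL:
        reasons.append("at least once every three months")
    if maintenance_pending:
        reasons.append("before planned system maintenance")
    if maintenance_just_completed:
        reasons.append("after planned system maintenance")
    return reasons


# Made-up dates for illustration.
print(health_check_reasons(date(2014, 9, 25), date(2014, 5, 1), maintenance_pending=True))
```

An empty list means no rule currently requires a run; any other result names the rule that does.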

Operational Best Practices: Health Checks (Exachk)

Recommendation + Risk Analysis + Direction/Steps

Installed versions validation

Critical issue exposure

Operational Best Practices: Monitoring

• Configure Oracle Enterprise Manager monitoring infrastructure for High Availability

– Detect and react to performance and high availability related thresholds to avoid potential downtime

Operational Best Practices: Auto Service Request (ASR)

• Configure Auto Service Request (ASR) infrastructure

– MOS Note 1185493.1

• Comprehensive fault coverage: CPU, disk controllers, disks, flash cards, flash modules, InfiniBand, memory, system board, power supplies, fans

Operational Best Practices: Platinum Support – Engineered Systems

Operational Best Practices: MAA Best Practices

• Check the latest MAA best practices

– MAA Portal

  • MAA Best Practices: Oracle Database, Exadata Database Machine

  • Case Studies

  • Documentation

  • Articles, Presentations

– Oracle Database High Availability Best Practices documentation

  • High Availability Overview

  • High Availability Best Practices

– Operational and configuration best practices

Operational Best Practices: Technical Development

Operational Best Practices: My Oracle Support (MOS)

• Refer to MOS knowledge articles, e.g.,

– Working Effectively With Oracle Support - Best Practices (Doc ID 166650.1)

  • Working with Support (How to open a Service Request? When to escalate?)

  • MOS

  • Support Policies

– Accessing the Oracle System Handbook on My Oracle Support (Doc ID 1227213.1)

– Oracle Premier Support: Get Proactive! (Doc ID 432.1)

– Tools and Training (Doc ID 67032.1)

... and many more ...

Operational Best Practices: Product Documentation

• Refer to product documentation: database, servers, storage, network, ...

Summary

MAA Summary and Key Takeaways

MAA best practices

Health checks and operational best practices

Proactive monitoring and management

Keep current with patch sets

Best protection is a strong defense

SUPerG Top Principles

Assume Nothing

Plan Ahead

Reuse Configurations

Get Trained

Test Everything

Document Everything

Setup Processes

Exploit External Resources

Minimize Risk

Keep It Simple

Resources

• Product documentation: http://docs.oracle.com/

• Database: http://www.oracle.com/technetwork/documentation/index.html#database

• OTN High Availability Portal: http://www.oracle.com/goto/availability

• Maximum Availability Architecture (MAA): http://www.oracle.com/goto/maa

• MAA Blogs: http://blogs.oracle.com/maa
