The Modern Data Center Topology - The High Availability Mantra

1

Webinar 1: The Modern Data Center

Topology:

The High Availability Mantra

August 29, 2012 (Wednesday): 11:00 AM IST

2

About GreenField Software

Our Company

Incubated by a US$ 40 million Engineering Company

Founders:
• Shekhar Dasgupta, ex-MD, Oracle India
• Abhijit Sen, Director, UD Group

Our Mission

Pioneering Energy & Environment Management Software for Cost Savings & Energy Optimization

Our Solutions

• Data Center Infrastructure Management (DCIM)
• Sustainability Management for Manufacturing

Our Partners

3

Today’s Topics

• The Modern Data Center Overview

• The High Availability (HA) Mantra

• Operating Challenges

• A Solution

4

Modern Data Center

Overview

5

Multiple Classes of Data Centers

• Internet Data Center: used by external clients connecting from the Internet; supports the servers and devices required for B2C transaction-based applications (e-commerce).

• Extranet Data Center: provides support and services for external B2B partner transactions, accessed over secure VPN connections or private WAN links between the partner network and the enterprise extranet.

• Intranet Data Center: hosts applications and services mostly accessed by internal employees with connectivity to the internal enterprise network.

• Special Purpose Data Center: for specialized application areas such as Geological & Geophysical for the Oil & Gas industry.

These may or may not be inter-connected.

6

Common Objective: Business Continuity

• Disaster Recovery Data Center
  Each class may have a dedicated or shared DR center.
  Usually located separately from the primary data center.

• High Availability (HA) Data Center
  Each data center is provided with significant redundancies.
  The DR center comes into play only when a disaster strikes.
  Component or system failures within any DC should either be self-healing, or redundancies within the DC should take over.

• Insurance Against Power & Network Outages
  Reliability through multiple service providers.
  Internal back-ups.

• Securing the Data Center
  Against malicious hacking that can bring down the data center and impact business continuity.
  Implementing firewalls / virtual firewalls.

7

Common Complexity: Multitude of Assets

Multitude of Assets

Divided between two worlds: IT & Facilities

Includes Mission Critical Applications

Like a manufacturing operation

Raw Material: Power & Networks

Processing: Data

Output: Information Service

Needs: Asset Management, Resource Optimization, a la Manufacturing

8

The High Availability

Mantra

9

Today’s High Availability Data Center

Extreme Redundancies for 99.99% Uptime -> Higher Power Consumption

Huge Population of N+1/N+2 Equipment -> Asset Underutilization & Too Complex to Manage with Spreadsheets & Visio Tools

Chain of Inter-dependent Equipment -> Multiple Points of Failure

Growing Heat Loads, Carbon Emissions & e-waste -> Sustainability Issues

kW per Rack Increases as More Processing Capacity is Added -> Trade-offs: Need to Support More per Rack versus Extra Space & Heat Loads

High Availability is Inversely Proportional to Asset Utilization & Energy Efficiency
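The trade-off on this slide can be made concrete with basic reliability arithmetic. The sketch below is illustrative only: it assumes component failures are independent, and the chain, the 99.9% per-component figure and the function names are hypothetical rather than taken from the webinar. It shows that a chain of inter-dependent equipment is less available than any single link, that a redundant path recovers availability, and that the spare equipment caps average utilization.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a chain in which every element must be up."""
    return prod(availabilities)


def redundant_availability(unit_availability, copies):
    """Availability of `copies` independent units when one working unit is enough."""
    return 1 - (1 - unit_availability) ** copies


def max_utilization(required_units, spare_units):
    """Highest average utilization when load is spread evenly over N + spare units."""
    return required_units / (required_units + spare_units)


# Hypothetical chain: utility feed -> UPS -> PDU -> server, each 99.9% available.
single_path = series_availability([0.999] * 4)
dual_path = redundant_availability(single_path, copies=2)

print(f"single path availability:    {single_path:.6f}")
print(f"two independent paths:       {dual_path:.8f}")
print(f"max utilization at N+1, N=4: {max_utilization(4, 1):.0%}")
print(f"max utilization at 2N, N=4:  {max_utilization(4, 4):.0%}")
```

Real failures are often correlated (shared cooling, shared operators, shared software), so figures from a model like this are best read as an upper bound.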

10

When HA Fails - Tale of Two Disasters

RBS

Tech fault at RBS and Natwest freezes millions of UK bank balances

RBS and Natwest have failed to register inbound payments for up to three days, customers have reported, leaving people unable to pay for bills, travel and even food. The banks - both owned by RBS Group - have confirmed that technical glitches have left bank accounts displaying the wrong balances and certain services unavailable. There is no fix date available.

http://www.theregister.co.uk/2012/06/21/rbs_natwest_tech_glitch_banking_freeze/

Amazon

Amazon cloud outage takes down Netflix, Instagram, Pinterest, & more

With the critical Amazon outage, which is the second this month, we wouldn’t be surprised if these popular services started looking at other options, including Rackspace, SoftLayer, Microsoft’s Azure, and Google’s just-introduced Compute Engine. Some of Amazon’s biggest EC2 outages occurred in April and August of last year.

http://venturebeat.com/2012/06/29/amazon-outage-netflix-instagram-pinterest/

http://venturebeat.com/2012/06/28/google-compute-engine/

http://venturebeat.com/2011/04/23/amazons-outage-in-third-day-debate-over-cloud-computings-future-begins/

http://venturebeat.com/2011/08/09/amazon-ec2-outage/

Which Will Be The Next One?

11

What’s the High Availability Mantra?

Amazon data centers (built to Tier 4 standards, with an expected availability of 99.995%) have had two outages already in 2012, each lasting over 3 hours!

• Tier 3/Tier 4 just defined by hardware redundancies

• Glaring gaps in operating procedures to prevent fatal human errors

• Lack of purpose-built BCP software to predict failures

• Lack of chain of custody to detect root cause

Availability %               Downtime per year   Downtime per month*   Downtime per week
99% ("two nines")            3.65 days           7.20 hours            1.68 hours
99.5%                        1.83 days           3.60 hours            50.4 minutes
99.8%                        17.52 hours         86.4 minutes          20.16 minutes
99.9% ("three nines")        8.76 hours          43.2 minutes          10.1 minutes
99.95%                       4.38 hours          21.6 minutes          5.04 minutes
99.99% ("four nines")        52.56 minutes       4.32 minutes          1.01 minutes
99.999% ("five nines")       5.26 minutes        25.9 seconds          6.05 seconds
99.9999% ("six nines")       31.5 seconds        2.59 seconds          0.605 seconds
99.99999% ("seven nines")    3.15 seconds        0.259 seconds         0.0605 seconds

* Monthly figures assume a 30-day month.
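The figures in the table follow directly from the availability percentage. As a minimal sketch (the function name and the 365-day year / 30-day month conventions are assumptions made here for illustration), the allowed downtime for any availability target can be computed as:

```python
# Period lengths in seconds; a 365-day year and a 30-day month are assumed.
PERIOD_SECONDS = {
    "year": 365 * 24 * 3600,
    "month": 30 * 24 * 3600,
    "week": 7 * 24 * 3600,
}


def allowed_downtime_seconds(availability_percent, period):
    """Maximum downtime in seconds for the given availability over one period."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * PERIOD_SECONDS[period]


for target in (99.0, 99.9, 99.99, 99.995, 99.999):
    minutes_per_year = allowed_downtime_seconds(target, "year") / 60
    print(f"{target}% -> {minutes_per_year:.2f} minutes of downtime per year")
```

By the same arithmetic, the 99.995% figure quoted for Tier 4 allows roughly 26 minutes of downtime per year, so a single outage of over 3 hours exceeds the annual budget several times over.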

12

Delivering the High Availability Promise

Adequate Redundancies

• Are there any points of failure, besides power and external networks, that can impact uptime? (Not everything is N+1)

• What are my redundancy paths?

• Are the relationships & dependencies among critical assets clearly defined?

• Can I do an impact analysis on the outage/downtime of any equipment? Can I predict the cascading effect of such an outage on other assets/applications in the data center?
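The impact-analysis questions above can be explored with a small dependency model. The sketch below is hypothetical throughout: the asset names, the `REQUIRES` map and both function names are invented for illustration and do not describe any particular product. Each asset lists groups of alternatives; an asset stays up as long as at least one member of every group is up, which is how redundancy paths are represented.

```python
# Hypothetical dependency map. Each asset needs at least one live member
# from every group listed for it; a group with two members is a redundant pair.
REQUIRES = {
    "pdu-a": [{"ups-a"}],
    "pdu-b": [{"ups-b"}],
    "rack-01": [{"pdu-a", "pdu-b"}, {"crac-1"}],  # dual power feed, single cooling unit
    "server-db": [{"rack-01"}],
    "app-billing": [{"server-db"}],
}


def impacted_by(failed, requires=REQUIRES):
    """Return every asset that goes down as a consequence of the `failed` set."""
    down = set(failed)
    changed = True
    while changed:  # repeat until no further asset is taken down
        changed = False
        for asset, groups in requires.items():
            if asset not in down and any(group <= down for group in groups):
                down.add(asset)
                changed = True
    return down - set(failed)


def single_points_of_failure(target, requires=REQUIRES):
    """Assets whose individual failure is enough to bring down `target`."""
    assets = set(requires) | {a for groups in requires.values() for g in groups for a in g}
    return sorted(a for a in assets - {target} if target in impacted_by({a}, requires))


print(impacted_by({"ups-a"}))                   # one UPS lost: redundancy absorbs it
print(impacted_by({"crac-1"}))                  # cooling lost: cascades to the application
print(single_points_of_failure("app-billing"))
```

In this toy model the dual power feed protects the rack, while the single cooling unit is exactly the kind of "not everything is N+1" gap the slide asks about.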

Preventing Failures

• Can any failure be predicted to take proactive measures? Do I get alerts on threshold breaches so that I can take preventive actions before a failure happens?

• Is there a history of a Move-Add-Change (MAC) that I should be aware of?

• What is the impact of a MAC on space, power, cooling?

• Where can new devices/servers be best placed? Floor -> Rack -> Cage. How can this be determined from the current infrastructure and other dependencies so as to avoid a failure?

• How do I prevent a fatal human error?
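The placement question in the list above can be framed as a simple feasibility check on space, power and cooling. This is a minimal sketch under stated assumptions: the `Rack` fields, the rack names and the "most power headroom left" rule are hypothetical, and the heat load of a device is assumed to equal its electrical power draw.

```python
from dataclasses import dataclass


@dataclass
class Rack:
    name: str
    free_units: int             # free rack units (U)
    power_headroom_kw: float    # spare electrical capacity
    cooling_headroom_kw: float  # spare cooling capacity


def best_rack(racks, units, power_kw):
    """Return the feasible rack with the most power headroom left, or None."""
    feasible = [
        r for r in racks
        if r.free_units >= units
        and r.power_headroom_kw >= power_kw
        and r.cooling_headroom_kw >= power_kw  # assumption: heat load equals power draw
    ]
    return max(feasible, key=lambda r: r.power_headroom_kw - power_kw, default=None)


racks = [
    Rack("R1", free_units=4, power_headroom_kw=3.0, cooling_headroom_kw=0.5),
    Rack("R2", free_units=10, power_headroom_kw=2.0, cooling_headroom_kw=2.5),
    Rack("R3", free_units=1, power_headroom_kw=5.0, cooling_headroom_kw=5.0),
]

choice = best_rack(racks, units=2, power_kw=1.2)
print(choice.name if choice else "no rack can take this device")
```

A fuller model would also check the redundancy of the rack's power feeds and the impact on neighbouring equipment before approving the move.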

13

Operating Challenges

14

The High Availability Challenge

Lack of HA Management Tool

• IT assets tracked by Systems Management Tool
• Facilities assets tracked by BMS
• The two are not inter-operable: unable to determine the missing link for HA
• Unable to track redundancy paths
• HA fails if any equipment or software in the critical path fails
• HA fails if there’s a fatal human error
• Health and history of equipment, or previous MAC impact, not tracked

Asset Over Provisioning

• Too many assets; two classes of assets
• Absence of Software Portfolio (even if hardware assets are tracked)
• Move-Add-Change: decisions not based on simulations or analysis
• Absence of change management
• Absence of workflow approvals
• Unable to predict failures
• No chain of custody

Need to Predict Failures

15

Beyond HA: Infrastructure & Operational Challenges

Operational Problems

• Low-level asset tracking
• Underutilization of many computing resources
• Running of old, inefficient equipment
• Decisions not based on analysis
• Cooling not optimized
• Floor & rack space: non-optimal placement of equipment
• Increasing demand for rack space
• Absence of capacity planning

Energy Problems

• Higher power consumption & growing power bills
• Power use not monitored at device level
• Dissipation of enormous heat
• Creation of hot spots
• Drastic reduction in expected life of computing equipment
• Failure of a data center
• Increase in CO2 emissions

Need to Improve Energy & Operational Efficiencies

16

A Solution

17

Solution That Bridges the Gap Between IT & Facilities

Data Center Infrastructure Management (DCIM) Software

[Diagram: Data Center Infrastructure Management bridging IT System Performance Management (IT) and Building Management System (Facilities)]

18

Solution That Addresses The High Availability Challenge

Lack of HA Management Tool

• Single tool manages both IT & Facilities; a single window helps in better monitoring and management.
• Tracks redundancy paths & identifies Single Points of Failure across the DC ecosystem.
• Does trend analysis on device/application behavior & performance and predicts failures.
• Tracks MAC and prevents disruption due to unauthorized change.
• Change Management prevents downtime due to human errors.

Asset Over Provisioning

• Tracks and manages both IT and non-IT assets.
• Rationalizes the asset base and identifies assets for retirement, consolidation, replacement & repurposing.
• Tracks and records MAC of assets down to the component level.
• Provides Change & Work Flow Management for better manageability, control & chain of custody.
• Monitors performance trends of assets and predicts failures.

DCIM Helps to Predict Failures
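One simple way to picture trend-based failure prediction is to fit a straight line to recent sensor readings and estimate when it will cross an alert threshold. The sketch below is only an illustration of that idea, not a description of how any DCIM product works: the function name, the sample readings and the 27 °C threshold are all hypothetical.

```python
def hours_until_threshold(samples, threshold):
    """Estimate hours from the last sample until a least-squares line reaches `threshold`.

    `samples` is a list of (hour, value) pairs. Returns None when the fitted
    trend is flat or falling, or when there is too little data to fit a line.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in samples)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    if sxx == 0:
        return None
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    crossing_hour = (threshold - intercept) / slope
    return max(0.0, crossing_hour - samples[-1][0])


# Hypothetical rack inlet temperatures in degrees Celsius, one reading per hour.
readings = [(0, 24.0), (1, 24.5), (2, 25.0), (3, 25.5)]
eta = hours_until_threshold(readings, threshold=27.0)
print(f"estimated hours until threshold: {eta}")
```

A linear fit is the crudest possible trend model; it is shown here because it makes the "alert before the breach" idea easy to follow.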

19

Solution That Addresses Infra & Operational Challenges

Operational Problems

• In-depth asset tracking of IT & non-IT assets
• Identifies underutilized computing resources and recommends ways to optimize them
• Identifies old equipment and recommends replacement
• Enables decision making based on data and analysis
• Optimizes floor and rack space utilization
• Enables more accurate capacity planning based on real-time data rather than assumptions

Energy Problems

• Measures power consumption down to device level
• Identifies devices in the data center with a lower performance-per-watt rating and recommends improvement methods
• Optimizes cooling
• Measures PUE & DCiE of the data center and identifies inefficiencies
• Monitors the health of the data center continuously and compares it with global benchmarks
• Reports CO2 emissions

DCIM Improves Energy & Operational Efficiencies
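PUE (Power Usage Effectiveness) is total facility power divided by IT equipment power, and DCiE (Data Center infrastructure Efficiency) is its reciprocal expressed as a percentage. A minimal sketch of both metrics follows; the function names and the sample meter readings are hypothetical.

```python
def pue(total_facility_kw, it_equipment_kw):
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def dcie_percent(total_facility_kw, it_equipment_kw):
    """Data Center infrastructure Efficiency: IT power as a percentage of total power."""
    if total_facility_kw <= 0:
        raise ValueError("total facility power must be positive")
    return 100 * it_equipment_kw / total_facility_kw


# Hypothetical meter readings: 1,800 kW at the utility feed, 1,000 kW at the IT load.
total_kw, it_kw = 1800.0, 1000.0
print(f"PUE:  {pue(total_kw, it_kw):.2f}")
print(f"DCiE: {dcie_percent(total_kw, it_kw):.1f}%")
```

A PUE closer to 1.0 (DCiE closer to 100%) means less of the facility's power is spent on cooling, power conversion and other overheads.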

20

Anatomy of a DCIM Software: GFS Crane DC

Enables a More Efficient, Higher Availability & Greener Data Center

21

Thank You and Q&A

http://www.greenfieldsoft.com Email: [email protected]

22

Next Webinar:

Data Center Infrastructure Management: ERP for the Data Center Manager

September 26, 2012 (Wednesday): 11:00 AM IST
