Increasing Data Center Resilience While Lowering PUE

Increasing Data Center Resilience While Lowering PUE Nandini Mouli, Ph.D. President/Founder eSai LLC [email protected] www.esai.technology


Page 1:

Increasing Data Center Resilience While Lowering PUE

Nandini Mouli, Ph.D.

President/Founder, eSai LLC

[email protected]

www.esai.technology

Page 2:

Introduction – eSai LLC

• eSai LLC is a disadvantaged, woman-owned minority business focused on providing energy management solutions for federal and state government agencies

• Core Competencies:
  • Technical/Business Feasibility Studies, Energy Audits, Commissioning
  • Energy Conservation Measures
  • Evaluation, Validation and Measurement
  • Utility, Federal and State Grants

• Technologies:
  • Dynamic Pricing, Demand Response
  • Distributed Energy Services, Combined Heat and Power
  • Microgrid Integration
  • Building Management Systems

Experience in consulting and implementing clean energy programs to meet DOE, EPA and FEMP policies and programs.

Currently leading multiple projects to bring resiliency and energy conservation to federal agencies and private corporations.

Page 3:

Topics for Discussion

• What is Resilience?

• Why is Resilience critical for data centers?

• Dynamics of treating resilience

• Challenges to achieving data center resilience

• Some tools for achieving resilience

• What is DCIM?

• How is DCIM a resilience platform for:
  • Planning and implementation
  • Monitoring
  • Data Collection
  • Dash Board Visualization

• Getting the most out of DCIM tools

• Key Take-aways !!

Page 4:

What is Resilience?

• TechTarget’s Definition of Resilience: “the ability of a server, network, storage system, or an entire data center, to recover quickly and continue operating even when there has been an equipment failure, power outage or other disruption.”

• In the context of cyber security: “Resilience is the ability of a system to resist illegitimate activity and its ability to effect a speedy recovery”

Page 5:

Why is Resilience Critical for Data Centers?

• Forrester Research: Resilience is the #2 top priority for Facility Directors:
  • Carrier availability and density – 82%
  • Availability, resilience – 80%
  • Control over facility – 78%
  • Access to Cloud and other partners – 75%

• Lack of resilience is costly:
  • IBM Reputational Risk and IT Study: system outage is one of the top two IT risks that can harm an organization’s reputation.
  • 91% of data centers have experienced an unplanned data center outage in the past 24 months.
  • The average cost per minute of data center downtime has increased 38% from $7,908 in 2013 to $11,000 in 2015
  • Organizations which improve from “Laggard” to “Industry Average” levels of downtime can reduce losses ~$3 million/year.
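As a rough check on the downtime figures above, the sketch below recomputes the percentage increase in per-minute cost and prices a single outage. The 90-minute outage duration and the function names are hypothetical, chosen only for illustration; they do not come from the cited studies.

```python
# Per-minute downtime costs quoted on this slide (USD).
COST_PER_MINUTE_2013 = 7_908
COST_PER_MINUTE_2015 = 11_000


def percent_increase(old: float, new: float) -> float:
    """Return the relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


def outage_cost(minutes: float, cost_per_minute: float) -> float:
    """Estimate the cost of one outage, assuming a flat per-minute cost."""
    return minutes * cost_per_minute


increase = percent_increase(COST_PER_MINUTE_2013, COST_PER_MINUTE_2015)
# Hypothetical 90-minute outage priced at the 2015 rate.
example_cost = outage_cost(90, COST_PER_MINUTE_2015)

print(f"Increase 2013 -> 2015: {increase:.1f}%")
print(f"Estimated cost of a 90-minute outage: ${example_cost:,.0f}")
```

With the figures as quoted, the increase works out to roughly 39%, close to the 38% stated; the small gap presumably reflects rounding in the 2015 figure.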

Page 6:

Dynamics in Treating Resilience

• Achieving resilience used to mean redundancy:
  • Two (or more) of everything – servers, power supplies, generators, and even whole data centers

• But most of this duplicate equipment was never utilized.

• Waste of space and energy = Increased PUE

• Now, the trend: increase resilience without waste by selecting software instead of hardware
  • Fault tolerance built right into software

• Improve resilience through load balancing, virtualization, prediction and other techniques.
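PUE (Power Usage Effectiveness) is total facility power divided by IT equipment power, so overhead that does no useful IT work pushes it up. The sketch below illustrates that relationship. The kW values and the helper name `pue` are assumptions for illustration, not measurements from this talk.

```python
def pue(it_load_kw: float, overhead_kw: float) -> float:
    """Power Usage Effectiveness = total facility power / IT equipment power.

    overhead_kw covers everything that is not IT load: cooling,
    UPS and distribution losses, lighting, and so on.
    """
    if it_load_kw <= 0:
        raise ValueError("IT load must be positive")
    return (it_load_kw + overhead_kw) / it_load_kw


# Hypothetical facility with 500 kW of IT load.
# Duplicated, lightly loaded power and cooling gear adds overhead.
with_idle_redundancy = pue(it_load_kw=500, overhead_kw=450)
# Same IT load after retiring part of the unused duplicate infrastructure.
after_consolidation = pue(it_load_kw=500, overhead_kw=300)

print(f"PUE with idle redundancy: {with_idle_redundancy:.2f}")
print(f"PUE after consolidation:  {after_consolidation:.2f}")
```

A PUE of 1.0 would mean every watt entering the facility reaches IT equipment, so lower values indicate less overhead.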

Page 7:

Challenges To Achieving Data Center Resilience

Measurement of how vulnerable the data center system is to failure and fixing the potential problems leads to increased uptime;

However,

• Increase in the number of applications to be managed and backed up

• Organizations getting larger and more geographically dispersed

• Infrastructural ecosystems are more complex

• Decreasing hardware costs encourage organizations to keep backup and recovery in house, on systems that may be incompatible with the other network software needed to mitigate problems

• Increasing use of virtualization

• Frequency and intensity of natural disasters

Increasing risks

Page 8:

What Are Some Traditional Ways Of Achieving Resilience?

Current Methodologies: A conventional data center relies on manual response plans and human teams

Design Failure: Competent design firm, integration firm, construction companies and commissioning team

Catastrophic Failure: Comprehensive maintenance and operation program

Compounding Failure: Paying more attention to details of each and every possible failure mode

Human-error Failure: Having experienced staff and training everyone responsible. Continuous training and execution with a pilot/co-pilot approach to operation.

Page 9:

Modern Tools For Achieving Resilience

• A modern data center needs the II dashboard. Due to the complexity of the operations, IT and Facility management cannot rely on the human component alone to combat failures arising from a combination of two or three faults

• IT/Facility Management have to align themselves in using predictive means of disaster mitigation: DCIM

Page 10:

What is Data Center Infrastructure Management - DCIM?

• It is a software platform that helps operators safely manage the physical infrastructure and controls. It provides higher visibility and transparency into IT and facilities operations, and quick identification and resolution of potential problems before they occur

• Maximizes the efficient use of power, cooling, and space capacities now and in the future.

• Two core building blocks:
  • Asset Management
  • Monitoring

Page 11:

DCIM - A Resiliency Platform: Physical Infrastructure/ Controls

Moving from device-level monitoring in a traditional data center system to context-aware monitoring, so that actions can be performed to mitigate a risk.

Page 12:

DCIM-Planning and Implementation Platform

Planning tools and functions:

• Display impact of pending moves on power capacity and cooling distribution

• Graphical representations of IT equipment and its location in the rack

• Proactively manage within rack and floor tile weight limits

• Correlate data between CRAC units, the PDUs, and the UPSs. The entire chain is monitored.

• Simulate consequences of power and cooling device failure on IT equipment through “What If?” scenarios

• Generate recommended installation locations for rack-mount IT equipment. The selection will be based on available power, cooling, space capacity, and network ports
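The placement recommendation described in the last planning function can be sketched as a filter followed by a ranking. Everything here is hypothetical: the `Rack` and `Equipment` types, the sample values, and the ranking rule (prefer the rack that keeps the most power headroom) are assumptions for illustration, not the logic of any particular DCIM product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rack:
    name: str
    free_power_kw: float
    free_cooling_kw: float
    free_rack_units: int
    free_network_ports: int


@dataclass
class Equipment:
    power_kw: float
    rack_units: int
    network_ports: int


def fits(rack: Rack, item: Equipment) -> bool:
    """True if the rack has enough power, cooling, space and ports."""
    return (
        rack.free_power_kw >= item.power_kw
        # Assumption: every kW drawn must also be removed as heat.
        and rack.free_cooling_kw >= item.power_kw
        and rack.free_rack_units >= item.rack_units
        and rack.free_network_ports >= item.network_ports
    )


def recommend_rack(racks: List[Rack], item: Equipment) -> Optional[Rack]:
    """Pick the fitting rack that keeps the most power headroom afterwards."""
    candidates = [r for r in racks if fits(r, item)]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.free_power_kw - item.power_kw)


racks = [
    Rack("A01", free_power_kw=1.0, free_cooling_kw=3.0, free_rack_units=10, free_network_ports=8),
    Rack("A02", free_power_kw=4.0, free_cooling_kw=2.5, free_rack_units=4, free_network_ports=2),
    Rack("B07", free_power_kw=3.0, free_cooling_kw=3.5, free_rack_units=6, free_network_ports=4),
]
new_server = Equipment(power_kw=1.5, rack_units=2, network_ports=2)

choice = recommend_rack(racks, new_server)
print(choice.name if choice else "no rack has enough capacity")
```

A “What If?” check can reuse the same `fits` test: reduce a rack's free power or cooling to mimic a failed device and see which equipment no longer fits.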

Page 13:

DCIM – Monitoring and Automation Platform

• Alarming/Notification: DCIM sends out an alarm from the rack prior to a breaker tripping. Provides operator with the opportunity to make adjustments before shut-down

• Status: Notes are generated on minimum, maximum, and average usage over time for each rack

• Control: If a rack gets close to an overcapacity threshold, a predictive simulation can be triggered to determine the best way to alleviate the situation.

• Reports and graphs are generated to help diagnose the problem
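The alarming and status functions above can be sketched as a threshold check plus a usage summary. The 30 A breaker rating, the 80% warning level, and the sample readings are assumed values for illustration only.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical values: a 30 A rack circuit with a warning at 80% of rating.
BREAKER_RATING_AMPS = 30.0
WARNING_FRACTION = 0.8


def check_rack(current_amps: float) -> str:
    """Classify a rack current reading relative to the breaker rating."""
    if current_amps >= BREAKER_RATING_AMPS:
        return "CRITICAL"
    if current_amps >= BREAKER_RATING_AMPS * WARNING_FRACTION:
        return "WARNING"
    return "OK"


def usage_summary(readings: List[float]) -> Dict[str, float]:
    """Minimum, maximum and average usage over the collected readings."""
    return {"min": min(readings), "max": max(readings), "avg": mean(readings)}


readings = [18.0, 21.5, 24.5, 26.0]  # sample rack current readings in amps
for amps in readings:
    print(f"{amps:5.1f} A -> {check_rack(amps)}")
print(usage_summary(readings))
```

The warning level is what gives the operator time to act: the alarm fires while the rack is still below the point at which the breaker would trip.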

Page 14:

DCIM – Monitoring and Automation Platform (contd.)

Comparison of Primary and Secondary Functions

Different DCIM applications treat a given data center feature as either a primary or a secondary function.

Depending on the facility and its needs, care must be taken to select the right applications to include in the integrated platform suite

Page 15:

DCIM- Data Collection Platform

The data collection subset represents devices such as meters, power protection devices, embedded cards, programmable logic controllers (PLCs), sensors and other such devices.

The devices perform the fundamental function of gathering data and forwarding it to management software for processing.
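The gather-and-forward role of these devices can be sketched as a polling loop. The devices below are simulated with plain functions and the management software is a dictionary; the names `Reading`, `collect`, and `store` are hypothetical, and a real deployment would read devices over whatever protocol they actually expose.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Reading:
    device_id: str
    metric: str
    value: float
    unit: str


# A "device" here is just a function returning its current reading.
Device = Callable[[], Reading]


def collect(devices: List[Device], forward: Callable[[Reading], None]) -> int:
    """Poll each device once and forward its reading; return the count."""
    count = 0
    for read in devices:
        forward(read())
        count += 1
    return count


# Simulated devices standing in for a power meter and a temperature sensor.
devices: List[Device] = [
    lambda: Reading("pdu-1", "power", 3.2, "kW"),
    lambda: Reading("sensor-7", "inlet_temperature", 24.5, "C"),
]

# Stand-in for the management software: keep the latest value per device.
latest: Dict[str, Reading] = {}


def store(reading: Reading) -> None:
    latest[reading.device_id] = reading


forwarded = collect(devices, store)
print(forwarded, sorted(latest))
```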

Page 16:

DCIM- Dash Board Platform

Key performance indicators are at the operators’ fingertips with DCIM

When will I run out of power and what is the current cooling capacity?

What is my current server utilization?

Do I have any servers that can be retired and if so what are they?

The dashboard is the key centerpiece for aggregation of actionable data that can be shared quickly with decision-makers

The sample dashboard collects data across OT subsets and centralizes information for access anytime, anywhere, on any user interface: mobile, laptop, PC
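Two of the dashboard questions above reduce to simple calculations once the data is collected. The sketch below assumes linear load growth and a 5% average CPU utilization cutoff for retirement candidates; both assumptions, along with the sample figures and function names, are hypothetical.

```python
from typing import Dict, List, Optional


def months_until_power_limit(
    current_kw: float, capacity_kw: float, growth_kw_per_month: float
) -> Optional[float]:
    """Months until load reaches capacity, assuming linear growth.

    Returns None when the load is not growing.
    """
    if growth_kw_per_month <= 0:
        return None
    return max(0.0, (capacity_kw - current_kw) / growth_kw_per_month)


def retirement_candidates(
    avg_cpu_utilization: Dict[str, float], threshold: float = 0.05
) -> List[str]:
    """Servers whose average CPU utilization falls below the threshold."""
    return sorted(
        name for name, util in avg_cpu_utilization.items() if util < threshold
    )


# Hypothetical facility figures.
runway = months_until_power_limit(
    current_kw=640, capacity_kw=800, growth_kw_per_month=8
)
idle = retirement_candidates({"web-01": 0.42, "batch-03": 0.02, "old-db": 0.01})

print(f"Power runway: {runway:.0f} months")
print(f"Retirement candidates: {idle}")
```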

Page 17:

DCIM- Dash Board Platform (Contd.) – Another view

Page 18:

DCIM – Energy and Power Saving Platform

• DCIM provides an overview of facility energy use and cost and a complete breakdown of kW per device

• Cost savings realized from the Servers → Rack → Row → Room → Building and Beyond
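The per-device breakdown and the roll-up through rack and row can be sketched as a grouped sum. The device list, the tariff of $0.10 per kWh, and the assumption of a constant load all year are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

HOURS_PER_YEAR = 8760
TARIFF_USD_PER_KWH = 0.10  # hypothetical electricity price

# (row, rack, device, average draw in kW); all values are made up.
devices: List[Tuple[str, str, str, float]] = [
    ("row-1", "rack-A", "server-1", 0.5),
    ("row-1", "rack-A", "server-2", 0.25),
    ("row-1", "rack-B", "storage-1", 1.0),
    ("row-2", "rack-C", "server-3", 0.25),
]


def rollup(level: int) -> Dict[str, float]:
    """Sum device kW by row (level 0) or by rack (level 1)."""
    totals: Dict[str, float] = defaultdict(float)
    for record in devices:
        totals[record[level]] += record[3]
    return dict(totals)


def annual_cost(kw: float) -> float:
    """Yearly energy cost of a constant load at the assumed tariff."""
    return kw * HOURS_PER_YEAR * TARIFF_USD_PER_KWH


for row, kw in rollup(0).items():
    print(f"{row}: {kw:.2f} kW, about ${annual_cost(kw):,.0f} per year")
```

The same grouping extends upward to room and building by adding further keys to each record.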

Page 19:

DCIM Communication Platform

Page 20:

An Example of DCIM Integration

Page 21:

DCIM Offer in the Market: Suite and Non-Suite Providers

Page 22:

Comparison of Various DCIM Products

Page 23:

DCIM Market Trends

• Market is growing
  • From $240 million in 2011 to $1.2 billion in 2016

• Growth in the data center segment is very high as facilities and IT come together to think about the business

• Inhibitors to adoption:
  • Cost and functionality issues
  • Difficulty of creating and maintaining asset databases
  • Blind belief that a data center can be managed without software solutions

• Energy savings from well-managed data centers
  • Reduce operating expenses by 20%

Source: the 451 group
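The market figures on this slide imply a compound annual growth rate, which the short sketch below computes. The function name `cagr` is an illustrative choice; the two market sizes are the ones quoted on the slide.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    return (end_value / start_value) ** (1 / years) - 1


# Market size figures from the slide, in millions of USD.
market_2011 = 240
market_2016 = 1200

rate = cagr(market_2011, market_2016, years=2016 - 2011)
print(f"Implied DCIM market growth: {rate:.1%} per year")
```

A fivefold increase over five years corresponds to growth of roughly 38% per year.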

Page 24:

How To Get The Highest Benefit From DCIM?

• There is quite a variety of options. Care must be taken to ensure the best fit

• Look for a scalable, modular, standardized, pre-engineered, open communication architecture with a strong vendor support structure

• Agreement between facilities, IT, and management on operating parameters, metrics, and goals for the data center power and cooling systems and their management

• A review of existing processes and comparison to DCIM requirements

• New processes should be formally defined, with resources committed and specific owners assigned

Page 25:

Page 26:

Page 27:

Case Study Conclusions:

• Data centers are complex systems, changing constantly over time

• Monitoring and measurement of capacity is not enough

• Much lost capacity can be reclaimed using predictive modeling and state-of-the-art tools with the support of DCIM measurements

Page 28:

Key Take-Aways - DCIM Benefits

DCIM provides higher visibility, more control and improved automation

Decision Support and Information Management

• Asset Planning and Implementation

Monitoring, Measuring and Alerting

Management and Control

• Fault-tolerant (fail-over)

Software Services

Final outcome: A more reliable and efficient data center, with higher resilience and decreased PUE.

Page 29:

THANK YOU !!!

Contact:

Nandini Mouli, Ph.D.

President/Founder

eSai LLC

www.esai.technology

[email protected]

(443) 691 7664