Is your data center on the verge of a crisis?
© 2014 Uptime Institute
Julian Kudritzki, Chief Operating Officer
Uptime Institute
What Defines a Crisis?
2
Tour of Operational Computer Room
3
Looking for Clues
4
Tour of ‘Live’ Critical Spaces
5
Daily Practices Compromise Uptime, Safety, and Security
6
• Overtime hours exceeding 10%
• Voice mail boxes full
• Emails not responded to
• Email inbox size limit exceeded
• Meetings missed or routinely cancelled
• No time for training
• Shortage of qualified staff
• Personnel performing work outside their competency
• Everything is an emergency
• Personnel turnover
What Else Is Going On?
7
• Break-fix budget exceeded
• Maintenance budget exceeded
• Energy cost estimate exceeded or unknown
• Last-minute deployment requirements
• No organization chart
• No responsibilities matrix
• No records of maintenance activities
• No written policies & procedures
• No preventive maintenance schedule
• Back of the server looks like a spaghetti pot exploded
The Issues Add Up
8
• Cabling is not labeled or, worse, incorrectly labeled
• Equipment is not uniquely labeled
• Loads are consistently out of balance
• Capacities are not managed or tracked
• Deferred maintenance exceeds 10%
• Housekeeping: if it looks like a mess, it is a mess
Maybe you don’t have a crisis, but how do you know how well your data center operation compares to the rest of the industry?
The Issues Add Up
9
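Two of the warning signs listed above carry an explicit 10% threshold: overtime hours and deferred maintenance. A minimal sketch of how a site might check both each month follows. The `SiteMetrics` fields, the choice to measure overtime against regular hours, and the choice to measure deferred maintenance against scheduled PM tasks are all assumptions for illustration; the slides do not define how either ratio is calculated.

```python
from dataclasses import dataclass

# Threshold taken from the slides: 10% for both indicators.
THRESHOLD = 0.10


@dataclass
class SiteMetrics:
    """Hypothetical monthly figures for one site (field names are assumptions)."""
    regular_hours: float
    overtime_hours: float
    pm_tasks_scheduled: int
    pm_tasks_deferred: int


def warning_signs(m: SiteMetrics) -> list:
    """Return the names of the indicators that exceed the 10% threshold."""
    signs = []
    # Assumed definition: overtime hours as a fraction of regular hours.
    if m.regular_hours > 0 and m.overtime_hours / m.regular_hours > THRESHOLD:
        signs.append("overtime")
    # Assumed definition: deferred PM tasks as a fraction of scheduled PM tasks.
    if m.pm_tasks_scheduled > 0 and m.pm_tasks_deferred / m.pm_tasks_scheduled > THRESHOLD:
        signs.append("deferred_maintenance")
    return signs


if __name__ == "__main__":
    # 200 / 1600 = 12.5% overtime; 3 / 40 = 7.5% deferred maintenance.
    print(warning_signs(SiteMetrics(1600, 200, 40, 3)))
```

Tracking the ratios over several months, rather than a single snapshot, would show whether a site is drifting toward the threshold.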
Are you confident in your Facilities team’s capability to manage a technologically advanced and highly efficient design to your 24 x 7 uptime requirements?
• Can you easily replace any member of that team?
• Are you protected against poor operations practices migrating from older sites to higher criticality data centers?
• Do you have sites that operate in isolation, ignoring global corporate standards?
• Do you even have corporate global standards?
• If you outsource any aspect of your data center operations, how do you avoid losing responsibility and accountability?
• Do you manage an outsourcing contract... or direct an expert team?
Ask the Tough Questions
10
• Initial review
• Gap analysis against industry best practices
  - Staffing and Organization
  - Maintenance
  - Training
  - Planning, Coordination & Management
  - Operating Conditions
• Roadmap to operational excellence
• Plan changes
• Implement changes
• Monitor & refine
• Annual review
Path to Data Center Operations Success
11
Key Elements of Facilities Management
Staffing and Organization
• Staffing
• Qualifications
• Organization
Maintenance
• Preventive Maintenance (PM) Program
• Housekeeping Policies
• Maintenance Management System (MMS)
• Vendor Support
• Deferred Maintenance Program
• Predictive Maintenance
• Life-Cycle Planning
• Failure Analysis Program
12
Key Elements of Facilities Management
Training
• Data Center Staff
• Vendors
Planning, Coordination, and Management
• Site Policies
• Financial Management
• Reference Library
• Computer Room Management
Operating Conditions
• Load Management
• Operating Set Points
• Alternating Use of Infrastructure Equipment
13
Over the years, Uptime Institute has observed that management issues pose the largest risk to the uptime of physical infrastructure
• Inadequate staffing
• Ineffective or nonexistent maintenance and training programs
• Lacking processes and procedures
• Resulting in the majority of outages being caused by ‘human error’
No standard existed to help Owners/Operators determine
• Common language/vocabulary of data center operations
• Focus of data center management
• Resource allocation
• Resource requirements
Genesis of Industry Best Practices
14
Data Center Owners / Operators / End Users
• Increased availability and cost savings
• Multi-site consistency
• Benchmark for continuous monitoring and refinement
Colocation / Managed Services Sites
• All of the above plus…
• Customer assurance of consistency
• Competitive differentiator (attain & retain certification)
Industry Benchmark
• No need to rely on opinions and anecdotes
Value of Industry Best Practices
15
Uptime Institute has been conducting Operational Sustainability Reviews for approximately 3 years, based upon decades of site operations knowledge and experience:
• Operational Sustainability Certifications: Tier + Gold, Silver, or Bronze
• Management & Operations (M&O) Stamps of Approval
See http://uptimeinstitute.com/publications for Tier Standard: Operational Sustainability
Best Practices Reviews
16
Staffing
• Inadequate staffing
• Excessive overtime (over 10%)
• No escalation process
Qualification
• No list of required qualifications
• No experience with data center specific equipment
Organization
• Roles and responsibilities not documented
• Data center organization not integrated
Staffing and Organization Significant Findings
17
Preventive Maintenance (PM)
• No list of required PM activities
• PM activities not fully scripted
• No quality control process
Housekeeping
• Combustibles in the data center
• No documented housekeeping policy
Maintenance Management System (MMS)
• No list of equipment
• Missing critical data: warranty info, maintenance history, performance data, etc.
Maintenance Significant Findings
18
Vendor Support
• Contracts missing response times, call-in process, detailed SOW, or technician qualifications
Deferred Maintenance
• Unable to produce deferred maintenance report from MMS
Predictive Maintenance
• No predictive maintenance program
• Not comparing current results with previous results
Maintenance Significant Findings
19
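The predictive maintenance finding above, not comparing current results with previous results, can be illustrated with a short sketch. It compares two sets of readings and flags any that rose by more than a chosen percentage. The equipment names, the readings, and the 10% default are hypothetical; a real program would use limits from the equipment manufacturer or the site's own history.

```python
def flag_rising_readings(previous, current, max_rise_pct=10.0):
    """Return {reading name: percent change} for readings that rose
    by more than max_rise_pct since the previous inspection.

    Readings missing from `previous`, or with a previous value of zero,
    are skipped because no percent change can be computed for them.
    """
    flagged = {}
    for name, now in current.items():
        before = previous.get(name)
        if before is None or before == 0:
            continue
        change_pct = (now - before) / before * 100
        if change_pct > max_rise_pct:
            flagged[name] = round(change_pct, 1)
    return flagged


if __name__ == "__main__":
    # Hypothetical readings from two consecutive inspections.
    last_quarter = {"UPS-1 battery temp (C)": 25.0, "CRAC-2 vibration (mm/s)": 2.0}
    this_quarter = {"UPS-1 battery temp (C)": 25.5, "CRAC-2 vibration (mm/s)": 2.6}
    print(flag_rising_readings(last_quarter, this_quarter))
```

A comparison like this only works if earlier results are retained, which is one reason the maintenance history belongs in the MMS.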
Life-Cycle Planning
• No life-cycle plan
• Not using MMS data to develop plan
Failure Analysis
• No record of outages or near misses
Maintenance Significant Findings
20
Data Center Staff
• Undocumented On-the-Job Training (OJT) programs
• No formal qualification program
• No list of training required by position
• No formal training program with lesson plans, etc.
Vendors
• No briefing for escorted vendors
Training Significant Findings
21
Load Management
• Alarm settings not documented
• Alarms not set on PDUs to ensure maximum loads are not exceeded
Operating Set Points
• Cooling set points are not documented or not part of the Change Management Process
• Changing of set points is not controlled
Operating Conditions Significant Findings
22
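The load management finding about PDU alarms can be sketched as a simple threshold check. The function below flags every PDU whose measured load has reached a chosen fraction of its rating. The PDU names, the 30 A rating, and the 80% default are placeholders chosen for illustration; actual alarm levels should come from the site's documented settings.

```python
def pdus_in_alarm(loads_amps, rating_amps, alarm_fraction=0.8):
    """Return the sorted names of PDUs whose load meets or exceeds the alarm level.

    loads_amps:     mapping of PDU name to measured load in amps
    rating_amps:    rated capacity shared by all PDUs in the mapping
    alarm_fraction: fraction of the rating at which to alarm (assumed 0.8)
    """
    alarm_level = rating_amps * alarm_fraction
    return sorted(name for name, amps in loads_amps.items() if amps >= alarm_level)


if __name__ == "__main__":
    # Hypothetical measurements against a 30 A rating (alarm level 24 A).
    measured = {"PDU-A": 25.0, "PDU-B": 18.0, "PDU-C": 29.5}
    print(pdus_in_alarm(measured, rating_amps=30.0))
```

Keeping the alarm fraction as a named, documented parameter also addresses the first finding: alarm settings that exist only in someone's memory cannot be reviewed or placed under change control.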
Site Policies
• Missing Site Policies
• Especially Site Configuration Policy
Reference Library
• No process for keeping documents up-to-date
Capacity Management
• No process for forecasting future space, power, and cooling requirements
• No active tracking of cooling capacity
• Ineffective management of Cold Aisles / Hot Aisles
• Electrical power monitoring (balancing phases)
Planning, Coordination, and Management Significant Findings
23
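The phase balancing item above can be made concrete with a short calculation. The sketch below uses one common convention for imbalance: the largest deviation of any phase from the three-phase average, expressed as a percentage of that average. The slides do not specify a formula or an acceptable limit, so both the convention and the example currents are assumptions.

```python
def phase_imbalance_percent(phase_a, phase_b, phase_c):
    """Return the current imbalance across three phases, in percent.

    Computed as the largest deviation from the average phase current
    divided by that average. Returns 0.0 when there is no load.
    """
    currents = (phase_a, phase_b, phase_c)
    average = sum(currents) / 3
    if average == 0:
        return 0.0
    max_deviation = max(abs(amps - average) for amps in currents)
    return max_deviation / average * 100


if __name__ == "__main__":
    # Hypothetical phase currents in amps: the average is 100 A and the
    # largest deviation is 10 A.
    print(phase_imbalance_percent(90.0, 100.0, 110.0))
```

Logging this figure alongside the per-phase currents gives a trend that can be reviewed before new equipment is deployed, rather than after a breaker trips.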
Facilities
• Operate and maintain the critical facility infrastructure
• Support the installation of IT equipment (space, power, & cooling)
IT Management
• Operate and maintain IT hardware, software, applications, and network connectivity
• Manage the installation/de-installation of IT equipment
Security
• Access Control
• Physical Security
Typical Data Center Disciplines
24
Functionally Separate Organizations
• Corporate Real Estate (Facilities)
• IT
• Security
Communication between organizations was typically poor
• Data center activities conducted without coordination
• Poor future space, power, and cooling planning
No individual responsible for all aspects of operating a data center
Past Organizational Structures
25
Factors driving changes to organizational structure
• Rapid changes in technology and speed at which capacity must be brought online
• Increased costs associated with IT and Facilities
• Business objectives of continuous computing availability
Legacy organizations could not accommodate quickly evolving business requirements
• Slow to respond
• Not integrated
Evolving Organizational Structure
26
The value of industry best practices is in the process of continuous improvement
• Discovery leads to learning
• Learning leads to change
• Change leads to improvement
• Regular reviews lead to discovery
• Crises can be avoided
Summary
27
For more information contact: Julian Kudritzki
[email protected] 206.706.4143
Questions?