Essential elements of data center facility operations
Essential Elements of Data Center Facility Operations
Schneider Electric Data Center Science Center
Schneider Electric – Data Center Science Center WP 196 Presentation – February 2014
Data Center Science Center White Paper 196
70% of data center outages are directly attributable to human error according to the Uptime Institute’s analysis of their “abnormal incident” reporting (AIR) database. This figure highlights the critical importance of having an effective operations and maintenance (O&M) program. This presentation describes unique management principles and provides a comprehensive, high-level overview of the necessary program elements for operating a mission critical facility efficiently and reliably throughout its life cycle. Practical management tips and advice are also given.
Introduction
Importance of operations and maintenance (O&M) program
• Most facility outages are attributable to human (operator) error
• Majority of data center facility TCO is in OPEX, not CAPEX, and OPEX is where the greatest potential cost savings reside
• Largest portion of OPEX is energy costs, which are rising
• Drive for energy efficiency is reducing capacity safety margins and system redundancy, increasing the importance of proactive maintenance and data center infrastructure management (DCIM)
• High levels of facility automation and equipment performance data have created new opportunities for enhancing reliability while reducing costs, when properly managed
Mission Critical Mentality
● Focuses on risk mitigation
● Grasps interconnectedness of facility and IT systems
● Data center availability is paramount
● Highly complex, fast-paced changes in mission critical facility
● Challenging to manage
● Unique outside pressures
  ● Government regulations
  ● Customer audits
Failure is not an option
NOTE: In this paper, only system planning is covered. System planning refers to the power, cooling, racks,
and other support infrastructure systems. Planning related to the IT equipment is not discussed here.
Mission Critical Mentality
Code of Conduct
“Mission Critical Mindset” principles | Impact
Focused on risk mitigation in all operational and maintenance activities, work processes, and procedures | Proactively deals with all potential threats to system availability and worker/occupant safety
Acting with confidence and patience that is an outgrowth of careful planning and preparation | Prevents risks from becoming problems; enables faster response times and fewer errors if problems do arise
Analytical, process-driven approach to risk avoidance and problem solving | Helps identify and mitigate risk in complex environments; ensures predictable and safe operation
Comprehensive understanding of the function and interconnectedness of facility systems and components | Quickly identify and resolve potential threats or actual problems; avoid or reduce system downtime
Commitment to continuous learning and process improvement | Increases skills and operational efficiency to maintain an edge in a constantly changing environment
12 Essential Elements of an O&M Program
Environmental Health and Safety
● Key components include
  ● Injury, illness prevention
  ● Electrical safety
  ● Hazard analysis
  ● Hazard communication
12 Essential Elements of an O&M Program
Environmental Health and Safety
Key Program Attributes | Description
Safety plans and training | Written safety plans must be established that describe the safe work practices and procedures to be observed by all workers. Regular training on the program elements must also be conducted.
Hazard analysis | All operational procedures shall start with an analysis of the possible hazards involved. Risks must be identified and safety measures assigned.
Lockout/tagout procedures | Proper procedures to prevent the unexpected energizing or startup of machines or equipment, or the release of stored energy, shall be used when servicing or maintaining equipment.
Personal protective equipment (PPE) | Appropriate protective equipment should be provided, properly sized, stored, maintained, and utilized as required to mitigate identified safety hazards.
Hazardous material handling | Hazardous materials must be properly identified, labeled, stored, maintained, and used in conformance with manufacturer’s requirements, local laws, and ordinances.
Hazard communications program | Includes a list of hazardous chemicals, use of material safety data sheets (MSDS), proper labeling of all hazardous materials containers, and employee training on use of and protection from hazardous materials.
Compliance with all applicable health and safety laws and regulations | Requirements will likely vary by region and by level of government (e.g., local, state, federal).
12 Essential Elements of an O&M Program
Personnel Management
● Hiring and training
  ● Competent, team-oriented people with mission critical mentality
  ● Well-rounded team
● Develop staffing model
  ● Clearly defined roles and responsibilities
12 Essential Elements of an O&M Program
Emergency Preparedness and Response
● Develop emergency operating procedures (EOPs) for all high-risk failure scenarios
● Develop and rehearse escalation procedures
● Conduct regular scenario drills
● Formal failure analysis for significant facility events
See White Paper 199, “Data Center Emergency Preparedness and Response”, for more information.
12 Essential Elements of an O&M Program
Maintenance Management
● Key tasks
  ● Asset management
  ● Work order management
  ● Spare parts management
● Ensure continual power and cooling performance
● Improved reliability with
  ● Good asset intelligence
  ● Proactive preventative and predictive maintenance plan
● Results in
  ● More accurate maintenance budget forecasts
  ● Minimized TCO and downtime
12 Essential Elements of an O&M Program
Maintenance Management > Asset Management
● Accurate, consistent tracking of critical facility assets
● Computerized maintenance management system (CMMS)
  ● Record, track, and manage asset data and maintenance history
● Scope of service (SOS)
  ● Defines maintenance frequency, specific activities, number of man-hours
  ● Establishes standard for procurement of
    ● Service agreements
    ● Maintenance scheduling
    ● Procedure development
    ● Continuous program improvement
12 Essential Elements of an O&M Program
Maintenance Management > Asset Management
● Recommended asset management information
  ● Type: top-level classification (e.g., electrical, mechanical, fire system)
  ● Sub-type (e.g., PDU, UPS, CRAH)
  ● Text description of asset
  ● Make: asset manufacturer name
  ● Model: manufacturer model number
  ● Size or rating
  ● Location ID (room/area)
  ● Trade responsible for maintenance
  ● Manufacturer serial number
  ● Install date
  ● Warranty expiration date
  ● Date asset to be replaced
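The recommended fields above map naturally onto a single record type. The following is a minimal sketch, assuming a Python dataclass as a stand-in for a CMMS asset entry; the class name, field names, helper methods, and example values are all hypothetical and do not describe any particular CMMS product.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class AssetRecord:
    """One critical facility asset (hypothetical schema based on the slide's field list)."""
    asset_type: str            # top-level classification, e.g. "electrical"
    sub_type: str              # e.g. "PDU", "UPS", "CRAH"
    description: str           # free-text description of the asset
    make: str                  # manufacturer name
    model: str                 # manufacturer model number
    rating: str                # size or rating, kept as free text (e.g. "500 kVA")
    location_id: str           # room/area identifier
    trade: str                 # trade responsible for maintenance
    serial_number: str         # manufacturer serial number
    install_date: date
    warranty_expiration: Optional[date] = None
    replacement_date: Optional[date] = None   # date asset is to be replaced

    def under_warranty(self, today: date) -> bool:
        """True while the warranty expiration date has not passed."""
        return self.warranty_expiration is not None and today <= self.warranty_expiration

    def due_for_replacement(self, today: date) -> bool:
        """True once the planned replacement date has been reached."""
        return self.replacement_date is not None and today >= self.replacement_date


# Illustrative record; every value here is made up for the example.
example_ups = AssetRecord(
    asset_type="electrical", sub_type="UPS", description="Hypothetical UPS, room A",
    make="ExampleCo", model="X-1", rating="500 kVA", location_id="ROOM-A",
    trade="electrical", serial_number="SN-0001", install_date=date(2014, 2, 1),
    warranty_expiration=date(2017, 2, 1), replacement_date=date(2029, 2, 1),
)
```

Keeping warranty and replacement dates as explicit fields lets simple checks like these feed maintenance budget forecasts.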
12 Essential Elements of an O&M Program
Maintenance Management > Work Order Management
● Tool for service process management
● Allows work to be
  ● Correctly prioritized
  ● Assigned the right resources
  ● Completed on schedule
● Standalone ticketing system OR
● Integrated work order module in a CMMS or DCIM system
● Provides valuable information to facility personnel
12 Essential Elements of an O&M Program
Maintenance Management > Spare Parts Management
● Shortens mean time to recovery (MTTR)
● Inventory should include parts with lead times longer than acceptable downtime
● Maintain spare parts list
● Stock frequently used items
● Re-evaluate annually
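The stocking rule above can be sketched as a simple filter: keep a part on site if its procurement lead time exceeds the acceptable downtime, or if it is used frequently. This is an illustrative sketch only; the `Part` type, the function name, and the `frequent_use_threshold` default are assumptions, not values from the white paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Part:
    name: str
    lead_time_hours: float   # time to obtain the part from the supplier
    uses_per_year: int       # how often the part has been consumed


def parts_to_stock(parts: List[Part], acceptable_downtime_hours: float,
                   frequent_use_threshold: int = 4) -> List[str]:
    """Return names of parts that should be held in on-site inventory.

    A part qualifies if its lead time is longer than the acceptable
    downtime, or if it is used at least `frequent_use_threshold` times
    per year (the threshold is an assumed, site-specific setting).
    """
    return [
        p.name for p in parts
        if p.lead_time_hours > acceptable_downtime_hours
        or p.uses_per_year >= frequent_use_threshold
    ]
```

Rerunning the filter with updated lead times and usage counts is one way to carry out the annual re-evaluation.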
12 Essential Elements of an O&M Program
Change Management
● Method of Procedure (MOP) process
  ● Detailed checklist of specified tasks
● MOP helps control work activity along with
  ● Operational procedure development and review
  ● Risk analysis and communication
  ● Structured work practices
  ● Vendor/contractor supervision
12 Essential Elements of an O&M Program
Documentation Management
● Facilitates development of
  ● Accurate procedures
  ● Proper training
  ● Workplace safety
  ● Process improvement
● Document management software application
  ● System to keep critical infrastructure records organized, up-to-date
  ● Detailed checklist of specified tasks
● Manual process can also work
12 Essential Elements of an O&M Program
Training
● Establish training program that organizes operational and maintenance tasks into categories
  ● Mapped to capability levels: basic, intermediate, advanced
● Train and evaluate personnel to certify them
● Require annual recertification exams
● Ongoing education keeps personnel current
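One way to link certification level to tasking is a small eligibility check, sketched below. It assumes the three capability levels named above and treats a certification as valid for 365 days to reflect annual recertification; the function name, parameters, and validity window are hypothetical choices for illustration.

```python
from datetime import date
from enum import IntEnum


class Level(IntEnum):
    """Capability levels from the training program (ordered low to high)."""
    BASIC = 1
    INTERMEDIATE = 2
    ADVANCED = 3


def may_perform(cert_level: Level, cert_date: date, task_level: Level,
                today: date, validity_days: int = 365) -> bool:
    """Return True if a technician may be assigned a task.

    The technician's certification must be at or above the task's
    capability level and must not be older than `validity_days`
    (assumed here to model annual recertification).
    """
    is_current = 0 <= (today - cert_date).days <= validity_days
    return is_current and cert_level >= task_level
```

A check like this could run when a work order is assigned, so that an expired or insufficient certification blocks the assignment.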
12 Essential Elements of an O&M Program
Infrastructure Management
● System to match facility resources with changing IT requirements
  ● Prevent downtime
  ● Improve resiliency and response
  ● Reduce operating expenses
  ● Provide a sound basis for capacity planning decisions
● Three key tasks
  ● Facility monitoring
  ● Capacity management
  ● IT/Facilities integration
12 Essential Elements of an O&M Program
Quality Management
● Key components
  ● Quality Assurance (QA): Typified by process and procedure standardization
  ● Quality Control (QC): Quality checks, inspections, and audits
  ● Continuous Quality Improvement
12 Essential Elements of an O&M Program
Energy Management
● Energy is typically the single largest data center expense
● 3 core tasks of an effective energy management program
  ● Performance benchmarking
  ● Efficiency analysis
  ● Strategic energy sourcing
● Optimized energy sourcing
  ● Reduce exposure to price volatility
  ● Secure pricing that fits budget and business objectives
12 Essential Elements of an O&M Program
Financial Management
● Financial-related issues can impact facility’s day-to-day availability and resiliency
● Processes should focus on
  ● Purchasing
  ● Invoice matching
  ● Financial reporting/analysis
● Facility managers and purchasing department should maintain close relationship
12 Essential Elements of an O&M Program
Performance Monitoring and Review
● Regularly monitor and review facility performance
  ● Determines health and effectiveness of O&M program
  ● Shows where it is trending
● Quality process should incorporate facility KPIs
● Benefits
  ● Aligns operational activities with business goals
  ● Positive reinforcement for innovation and process improvement
Common Mistakes
Common Mistakes | Description
Maintenance program is not driven by metrics | Often the result of poor asset management; no linkage made between break/fix maintenance activities and preventative maintenance
Poor training | Training is not formalized and/or is not taken seriously; over-reliance on technician “shadowing”; no linkage between certification level and tasking
Ineffective change management | Inadequate risk analysis; poor or non-existent procedures; no defined process for performing critical work tasks
Failure to consistently test & evaluate skills | Existing skills/training level not formally evaluated; scenario drills are not employed; incident and drill results are not evaluated
Poor documentation | No coherent sequence of operations; drawings and schedules are outdated; lack of revision control and/or lack of digitization
Failure to develop and implement a quality control system | Lack of governance or resources to measure, monitor, and review performance
Stuck in manual mode | Failure to implement CMMS, EDMS, DCIM, etc.
Overconfidence | Assumption that future performance can be predicted by past experience
Facility Operations Services
Using Outside Vendors for O&M Programs
● Offer services for both existing and new data centers
  ● Advise on
  ● Develop
  ● Implement
  ● Operate
See White Paper 198, “How to Write an Effective RFP for Data Center Facility Operations Services”, for more information.
12 Essential Elements of an O&M Program
Performance Monitoring and Review > Recommended Facility KPIs
● Critical load uptime
● Load redundancy maintained
● Support system uptime
● Maintenance completion
● Staffing coverage
● Security policy conformance
● Emergency preparedness drills
● Emergency response procedure adherence
● Safety policy and procedure adherence
● Procedure development, management and use
● Quality control/improvement
● Training compliance
● Process improvement
● Operational reporting
● Proper event notification and escalation
● Timely and accurate cost reporting
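Two of the KPIs above, critical load uptime and maintenance completion, reduce to simple ratios. The sketch below shows one common way to compute them as percentages; the function names and the choice of hours as the unit are assumptions, and the white paper does not prescribe specific formulas.

```python
def uptime_percent(period_hours: float, downtime_hours: float) -> float:
    """Uptime as a percentage of the reporting period.

    Computed as 100 * (period - downtime) / period.
    """
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must be between 0 and period_hours")
    return 100.0 * (period_hours - downtime_hours) / period_hours


def completion_percent(completed: int, scheduled: int) -> float:
    """Share of scheduled maintenance tasks completed, as a percentage."""
    if scheduled <= 0:
        raise ValueError("scheduled must be positive")
    if not 0 <= completed <= scheduled:
        raise ValueError("completed must be between 0 and scheduled")
    return 100.0 * completed / scheduled
```

Tracking these values per reporting period, rather than as one lifetime figure, makes it easier to see where performance is trending.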
Conclusion
● Efficient Operations & Maintenance program
  ● Mitigates threats, effects of human error
● Focus on 12 essential elements of O&M program
● Must have facilities operation team with “mission critical” mindset
● Operational philosophy focuses on
  ● Risk mitigation
  ● Preparedness
  ● Standardized processes
  ● Continuous improvement
Resources
Facility Operations Maturity Model for Data Centers, White Paper 197
How To Write an Effective RFP For Data Center Facility Operations Services, White Paper 198
Data Center Emergency Preparedness and Response, White Paper 199
Classification of Data Center Infrastructure Management (DCIM) Tools, White Paper 104
How Data Center Infrastructure Management (DCIM) Software Improves Planning and Cuts Operational Costs, White Paper 107
Avoiding Common Pitfalls of Evaluating and Implementing DCIM Software, White Paper 170
Browse all APC white papers: whitepapers.apc.com
Browse all APC TradeOff Tools™: tools.apc.com