White Book of the High-Availability Operations of
Ping An Data Center
May 2018
Preface by the Authors
With more than a decade’s development, the data center of Ping An of China (“Ping An
Data Center”) boasts a well-established operations system that is compliant with multiple
international standards such as ITIL, ISO 9001, ISO 20000, ISO 27001, and M&O. Owing to the
conscientious efforts of the operations team toward meticulously following and continuously
improving the working rules and processes of the system, the data center has sustained a high
level of availability.
We express our special appreciation to the staff and vendors of the data center for their
relentless hard work for the maintenance of high-availability service.
This White Book of the High-Availability Operations of Ping An Data Center, which
embodies the experience of the operations team over more than a decade at sustaining the high
availability of the data center, is an endeavor of Ping An to fulfill its social responsibility, as
the book aims to summarize and share the excellent experience of Ping An Data Center in
developing and maintaining a high-availability Internet finance data center. We believe that data
centers in China, particularly those in the finance and banking sector, can benefit from the
experience shared here to improve their operations management and sustain high availability.
We hope that the book can serve to mobilize industry players and experts to make concerted
efforts for China’s development in the big-data age.
We would like to acknowledge the support of Zhong Jinghua, Leader of the China Data Center
Committee (CDCC), and Philip Hu, Managing Director - North Asia of Uptime Institute, as well as
the hard work of the compilation team of the book.
We will be grateful for feedback regarding any errors or omissions in the book.
Data Center of Ping An Technology (Shenzhen) Co., Ltd.
Preface by Zhong Jinghua
Ping An of China began to plan and construct a data center in Guanlan, Shenzhen in 2009. I
was fortunate to be appointed the chief designer of the project. Having been involved in the
entire construction process, I witnessed the great efforts made by the company to continuously
update its information technology to align closely with national strategies.
One of the first financial companies engaging in data center construction, Ping An has
acquired an in-depth understanding of data center construction and operations and fostered a
pool of data center experts. This enables the company to be well prepared for the Internet
Finance 3.0 age and contributes enormously to the healthy development of the data center
industry in China.
The life cycle of a data center consists of the following phases: requirement analysis,
planning and design, construction and installation, testing and acceptance, and operations
management. Operations management is the last and longest phase of the life cycle. For a data
center to be successful, the operations management phase is, in some sense, more important than
the construction phases. Operations management should be considered from the time of
commencement of a data center project, or the requirements for operations management should
be built into the design and construction phases. In this sense, the scope of operations
management covers the entire life cycle of a data center, or the entire process of providing the
data service support required for attaining the development goals of the business.
This book is the crystallization of the continuous efforts of Ping An staff in the spirit of
remaining true to our original aspiration and keeping our mission firmly in mind.
Covering the operations standardization, best practices, organization structure, security
management, and quality system of the data center, this book embodies the devotion of Ping An
staff to the data center and their diligent pursuit of science. I appreciate the hard work of the
compilation team of the book and hope that readers can benefit from the knowledge shared in the
book.
Zhong Jinghua
Leader of China Data Center Committee (CDCC)
May 2018
Preface by Philip Hu
The Uptime Institute Tier Standard: Topology has been developed for nearly two decades.
This Standard describes four classifications (Tier I to IV) to evaluate and differentiate data center
infrastructure in terms of availability. Since its creation, this system has been widely adopted for
the design and construction of data centers across the world.
Suppose someone says: I need a data center for business development. Another person will
turn and say: I will build one for you. However, they are possibly not referring to data centers of
the same output performance. I have said on many occasions that the life cycle of a data center is
characterized by short design and construction phases, anywhere between a few months and one
or two years, but a long operations phase—one decade or even longer. So the guiding principle
of the Tier standard is to design, construct, and manage the operations of data centers to achieve
specific business objectives.
Uptime Institute’s annual industry surveys show that approximately half of the in-house IT
organizations surveyed experienced business-impacting outages of their in-house data centers
in the prior 12 months. Nearly one third of the companies surveyed experienced outages of IT services
outsourced from colocation centers. Most of the outage events are attributed to operator errors,
which may have included program error, resource inadequacy, management deficiency, and
inappropriate decision-making. These failures are often attributed to operators for their untimely,
unsuccessful emergency response.
In most cases, however, such failures can be attributed to management decisions (for
example, design compromise, budget cut, staff reduction, vendor selection, and resource
allocation). Very often, an incident can be attributed to a time and space before the incident (a
causative incident). For example, one can question if a management decision has resulted in an
operator not being well prepared or adequately trained for the proper handling of the emergency
event in question.
With increasingly higher data service requirements from business functions, stakeholders
of data center technology and facilities are faced with the constant pressure of realizing values
while sustaining cost effectiveness and operation efficiency. Therefore, the data center
management & operations (M&O) certification provides guidance and framework as well as the
best practices for achieving effective management and operations of data centers.
The M&O standard established for data center management and operations is applicable to
all teams, departments, cultures, and practices within the organization. It addresses staffing,
organization and training, preventive maintenance, and operational conditions as well as
planning, management, and coordination of practices and resources. In this sense, the standard
provides useful information to not only data center operations teams but also service providers
and top managers, helping them carry out their roles and responsibilities.
I am glad to see this white book, an achievement both of the data center industry in China in
general and of Ping An Technology in particular, which has developed operations standards for the
in-house data center of Ping An Group. I expect that this book can provide substantial help to the
colleagues at the data center of the Ping An group.
Philip Hu
Managing Director - North Asia Uptime Institute
May 2018
CONTENTS
Chapter 1 Introduction
1.1 Purpose and scope
1.2 Brief overview
Chapter 2 Operations Standardization
2.1 Lean management: theories and methods
2.1.1 The concept of lean management
2.1.2 Lean management practices
2.2 IT infrastructure library (ITIL) framework for operations
2.2.1 Incident management
2.2.2 Problem management
2.2.3 Change management
2.3 Uptime Management & Operations (M&O) program
2.3.1 Staffing and organization
2.3.2 Maintenance management
2.3.3 Training management
2.3.4 Planning, coordination, and control
2.3.5 Operating conditions
Chapter 3 Security Management
3.1 Information security
3.2 Physical security management
3.2.1 Physical security configuration
3.2.2 Terminology and definition
3.2.3 Procedure
3.2.4 Site access registration system
3.2.5 Control of goods
3.2.6 Fire safety management system
3.3 Personnel safety management
3.3.1 Personnel safety training
3.3.2 Day-to-day operational safety management
Chapter 4 Staffing and Staff Development
4.1 Organizational structure
4.2 Roles and responsibilities
4.3 Staff training
4.3.1 New-employee training
4.3.2 Training plan
4.3.3 Training procedure
4.4 Staff development
4.4.1 Routine training
4.4.2 Special training
4.5 Vendor management
4.5.1 Vendor training
4.5.2 Service level agreement (SLA)
4.5.3 Vendor qualification
4.5.4 Vendor performance evaluation
Chapter 5 Best Practices of High-availability Operations
5.1 Routine check - overview
5.1.1 Routine check - basic requirements
5.1.2 Routine check - frequency and methods
5.1.3 Routine check of medium- and low-voltage switchgears
5.1.4 Routine check of uninterrupted power supplies (UPS)
5.1.5 Routine check of precision power distribution systems
5.1.6 Routine check of diesel generation systems
5.1.7 Routine check of heating, ventilation, and air conditioning (HVAC) systems
5.1.8 Routine check of firefighting systems
5.1.9 Routine check of security systems
5.1.10 Routine check of electronic monitoring systems
5.2 Preventive maintenance - overview
5.2.1 Preventive maintenance - general requirements
5.2.2 Checklists for preventive inspection, maintenance, and operation
5.2.3 Preventive maintenance - detailed schedules for key systems
5.3 Predictive maintenance - overview
5.3.1 Predictive maintenance - general requirements
5.3.2 Predictive maintenance - high-level plan
5.4 Emergency plan overview
5.4.1 Emergency drill plan
5.4.2 Emergency drill items
5.5 System availability check
5.5.1 Monthly check of data center facilities
5.5.2 Data center room environment check
5.5.3 Data center facilities operational information check
5.6 Life cycle management
5.6.1 Life cycle management - medium-voltage switchgears
5.6.2 Life cycle management - low-voltage switchgears
5.6.3 Life cycle management - transformers
5.6.4 Life cycle management - diesel generators
5.6.5 Life cycle management - uninterrupted power supplies (UPS)
5.6.6 Life cycle management - chilled-water units
5.7 Risk management
5.7.1 Acronyms and definitions
5.7.2 Risk identification and analysis
5.7.3 Risk mitigation plan
5.8 Asset management
5.8.1 Challenges of asset management
5.8.2 Systematic asset management
5.8.3 Developing a unique asset management system for the data center
5.8.4 Asset management system illustrated
5.8.5 On-site asset control
5.9 Day-to-day operations management
5.9.1 Challenges of day-to-day operations
5.9.2 Systematic day-to-day operations management
5.9.3 Integrated data center management system
Chapter 6 Operations Quality Assurance System
6.1 Internal audit
6.1.1 Internal audit at the data center level
6.1.2 Corporate internal audit
6.2 External audits
6.2.1 Audit for M&O certification renewal
6.2.2 ISO 9001 audit
6.2.3 ISO 27001 audit
6.2.4 ISO 20000 audit
Chapter 1 Introduction
1.1 Purpose and scope
Entering the “Finance + Internet” 3.0 age, Ping An has launched a strategic initiative to
further develop “Finance + Technology” and explore “Finance + Ecosystem” in the coming
decade. Aiming to become a world-leading technology-powered personal financial services
group, Ping An will focus on two industries, pan-financial assets and pan-health care, by
employing the four core enabling technologies: Artificial Intelligence, Blockchain, Cloud
computing, and Security in the five ecosystems of financial services, health care, auto
services, real-estate services, and smart city. As of 2017, the group boasted 436 million
Internet users. To improve technology-enabled customer service and enhance the
customer experience, the group needs a data center with greater capacity and better
performance.
To keep pace with the rapid development of “Internet + Finance,” Ping An has developed a
network of data center infrastructure facilities covering the entire geography of China, with the
core facilities located in Beijing, Shanghai, and Shenzhen. Ping An Data Center has been
constructed according to Class A of GB 50174, Code for Design of Electronic Information
System Room, with reference to Tier IV of the Uptime Institute Tier Standard, and equipped with the
most sophisticated high-availability equipment, thereby laying a good foundation for sustaining
the high availability of the data center. With more than a decade’s development, the data center
has accumulated abundant knowledge and experience in the planning, design, and operations of
data center facilities.
Data center operations involve practical management of changing environments. The
operations of Ping An Data Center have evolved from standardized operations to lean
operations and subsequently to customized services-oriented operations. Owing to the three-
stage evolution brought about by an operations team always ready for challenges and
constantly pursuing improvement, an operations model with unique characteristics has been
established in the data center. The operations team continues to explore ways to improve the
power usage effectiveness (PUE) and efficiency of the energy-saving and smart data center
while sustaining its high availability.
This white book is intended to share our experience in developing the standardized lean
operations system of Ping An Data Center, which would be helpful for other data centers to
improve their knowledge and operations capacity in order to sustain their high availability.
Our experience in translating specific requirements of international standards applicable to
data centers—such as ISO 9001 and M&O—into tangible operational activities is also
included in this book. We hope that our practical experience in this regard will be helpful for
data centers seeking certification to these standards.
The target audience of this white book includes managers of finance data centers,
telecommunications data centers, data centers of network operators, and company in-house
data centers as well as readers involved in the operations of data center infrastructure
facilities.
1.2 Brief overview
This white book is structured as follows:
Introduction
This chapter states the purpose of this white book: to summarize more than a decade of
experience in data center operations, both to safeguard the reliable operation of our
in-house data center as it grows and to share that experience with companies and
individuals in the industry, helping them establish operations systems that satisfy
their specific business requirements.
Operations standardization
This chapter begins with an introduction to lean management theories
and their application in data center operations from the perspective of operations
standardization, followed by a description of the IT infrastructure library (ITIL)
framework, including a detailed illustration of incident management, problem
management, and change management.
This chapter also includes our program for Uptime Institute M&O certification,
with its significance to data center operations illustrated in the following five aspects:
staffing and organization; maintenance; training; planning, coordination, and
management; operating conditions.
Security management
Finance data centers necessitate more stringent security control than data centers in
other industries, which is illustrated in this chapter from the three perspectives of
information, physical, and personnel security.
Staffing and staff development
This chapter describes the organization structure as well as the roles and
responsibilities defined for sustaining the high availability of Ping An Data Center.
This chapter also includes the training plan and training assessment system established
for ensuring that the data center staff is capable of fulfilling their job assignments.
This chapter ends with an introduction of the vendor management system, including
requirements for vendor qualification, service level agreements (SLA), and vendor
performance monitoring.
Best practices in high-availability operations
This chapter provides a detailed description of the following:
The frequency, contents, and requirements for the day-to-day check of various
infrastructure equipment of the data center;
The preventive maintenance of eight subsystems of the power-distribution system,
four subsystems of the heating, ventilation, and air conditioning (HVAC) system, and three subsystems of
the low-voltage system;
The purpose and significance of predictive maintenance as well as predictive
maintenance planning for the data center infrastructure;
The purpose and significance of data center reliability verification as well as
different types and methods of verification;
The life cycle management of medium-voltage switchgears, uninterrupted power
supplies (UPS), batteries, precision cooling units, and water chilling units, including the
procedures for their update, annual inspection, overhaul, renovation, and obsolescence;
The availability check and third-party functional verification of the data center;
Risk management, asset management, and on-site control.
Operations quality assurance system
This chapter illustrates the approaches to check the operations quality of the data
center, including ISO 9001 quality system management, internal audits by the corporate
security and at the data center level, and external audits for M&O certification and other
purposes.
Chapter 2 Operations Standardization
Data center operations involve two major tasks: 1) maintenance of every element of the data
center to sustain its stability and 2) timely detection and handling of incidents to minimize
downtime.
Centering on these two major tasks, the operations of Ping An Data Center have been
standardized by adopting the lean management methodology and incorporating the requirements
of international standards such as ISO 9001, ISO 27001, ISO 20000, ITIL, and M&O. The current
operations system, with its unique characteristics, is the result of the experience gained and
lessons learned during these efforts.
2.1 Lean management: theories and methods
2.1.1 The concept of lean management
Underlying the concept of lean management is a culture. It is the natural result of increasing
division of labor and quality requirements in our modern society. In modern management,
scientific management involves a three-stage evolution: 1) standardized management, 2) lean
management, and 3) personalized management.
2.1.2 Lean management practices
In the context of a data center, lean management is the process of breaking down the
objective of high availability into tangible actions with well-defined responsibilities. Thus, the
objective of high availability can be carried through to every element of the operation, and it
serves as a major driving force for the team to improve its execution.
Lean operations involve every person in the organization; for lean operations to be successful
in an organization, every person is both the object and subject of actions.
To realize lean management, the data center continuously fine-tuned the definition of roles
and responsibilities, configuration of the operations platform, equipment maintenance processes,
and customer services, by following the fundamental principle of precise, accurate, thorough, and
rigorous management. The efforts made toward lean management have resulted in better staff
qualifications and skills, more rigorous internal control, and improved stability and security of the
data center.
Precise management indicates an attitude of pursuing continuous improvement and
perfection of day-to-day tasks to maintain the optimal operation of the infrastructure and sustain
the high availability of the data center.
Accurate management indicates accurate and timely completion of tasks by carefully
following the standardized operations procedures. Accurate management also indicates
information accuracy—accurate physical status information of on-site equipment, accurate
identification and labeling, accurate clocks, accurate monitoring equipment data and operating
status, accurate instruments, accurate processes, and accurate manuals. This information is
necessary for risk identification and failure diagnostics and handling. Information accuracy has an
immediate impact on optimal equipment operation, timely failure handling, and prevention of
secondary failure resulting from human errors. The day-to-day maintenance of the infrastructure
equipment involves numerous tasks. Any change in the maintenance schedule is based on a
comprehensive risk analysis. Every maintenance task should be carried out in a timely manner
according to the pre-established maintenance schedule.
Thorough management indicates comprehensive and detailed definition of roles and
responsibilities for every operations task and detailed systems, specifications, and quality
assessment criteria as well as standardized manuals for maintenance, operation, and emergency
response, such that the security and reliability of the data center infrastructure can be ensured if
the manuals are followed step by step, even under the most adverse conditions.
Rigorous management indicates rigorous and strict execution and quality control of all
the tasks, processes, systems, and rules for the operations of the data center. For data center
operations, excessive rigor is better than a lack of rigor.
Strictly following the requirements of lean management, the operations team of Ping An
Data Center has established a unique operations system and keeps improving on it by
continuously reviewing its processes, systems, specifications, and human resources, in order to
explore its potential and sustain the high availability of the data center.
2.2 IT infrastructure library (ITIL) framework for operations
The operations of Ping An Data Center are managed with reference to ITIL processes. Based
on years of experience, the most widely applied modules in the ITIL framework have been adopted
in our operations, including incident management, problem management, change management,
service request management, asset management, and security management. Considering the
importance of security and asset management to data center operations, these two modules will be
detailed in Chapters III and V, respectively. The implementation of the incident management,
problem management, and change management modules in the data center will be described in this
chapter.
Incident management, problem management, and change management in Ping An Data Center
are all performed in its Service Bot system. The Service Bot system records information of
incidents, service requests, problems, and changes, including the serial number, reporter, time of
reporting, team in charge, person in charge, type of incident, source of incident, priority level,
detailed description, incident root cause analysis, solution, and other information of the handling
process.
An incident management form is generated in the Service Bot system, where an incident can be
escalated and tracked with reference to the interconnected parent incidents, problem records, service
requests, and change records.
The system tracks and records the status (newly created, assigned, being processed, pending
solution, resolved, or closed) and SLA information of incidents, problems, and changes, thereby
enabling a closed-loop control. The closing of an incident, problem, or change is subject to the
review and satisfaction assessment by the initiator.
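The status tracking described above can be pictured as a small state machine. The following is a minimal sketch only: the six statuses come from the text, but the allowed transitions and the names `Status`, `ALLOWED`, and `advance` are assumptions made for illustration and do not describe the actual Service Bot system.

```python
from enum import Enum


class Status(Enum):
    NEW = "newly created"
    ASSIGNED = "assigned"
    IN_PROGRESS = "being processed"
    PENDING = "pending solution"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Assumed transitions: the source lists the statuses but not the allowed moves.
ALLOWED = {
    Status.NEW: {Status.ASSIGNED},
    Status.ASSIGNED: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.PENDING, Status.RESOLVED},
    Status.PENDING: {Status.IN_PROGRESS, Status.RESOLVED},
    Status.RESOLVED: {Status.CLOSED, Status.IN_PROGRESS},
    Status.CLOSED: set(),
}


def advance(current: Status, target: Status, initiator_approved: bool = False) -> Status:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    # Closing is subject to review by the initiator, as described in the text.
    if target is Status.CLOSED and not initiator_approved:
        raise ValueError("closing requires review by the initiator")
    return target
```

Requiring the initiator's approval before the final transition is what makes the loop closed: a record cannot reach the closed status on the handler's word alone.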
2.2.1 Incident management
Incident management in the data center aims to restore normal system operation as quickly as
possible and prevent disruption to the business in case of incidents, by following the pre-
established internal incident management process and measures.
The incident management process established in Ping An Data Center covers the reporting,
registration, classification, handling process, escalation mechanism, response mechanism, and status
control of incidents, with the entire handling process tracked and recorded using the Service Bot
system.
2.2.1.1 Classification of incidents
(1) Warning alarm: Defining the concept and scope of alarms
(2) Failure: Defining the concept and scope of failures that may occur in the data center
(3) Level I failure: A failure having direct impact on the reliability of business operations,
with reference to SLA requirements
(4) Level II failure: A failure occurring to a single piece of critical equipment of the
infrastructure (according to a pre-defined critical equipment list)
(5) Level III failure: An incident that threatens the normal equipment operation and security in
the computer room but has resulted in no actual impact
(6) Urgent Incident Office Center (UIOC): The major incident management process for
addressing application-level severe failures caused by abnormal hardware or software
operation of data center facilities
2.2.1.2 Incident detection
The infrastructure operations team obtains alarm information about the infrastructure,
operating system, and data center facilities through routine checks, remote monitoring, mobile
phone text messages, and phone calls. Upon receiving an alarm, the person in charge should go to
the scene of the alarm to obtain comprehensive information about it. Any failure of the
infrastructure or operational environment should be immediately reported to the Infrastructure
Engineer on duty, who will decide the classification of the failure (Level I, II, or III) based on the
actual situation.
2.2.1.3 Reporting paths for the different levels of failures
A Level III failure is handled and followed up by the Infrastructure Engineer through
coordination with relevant technicians and service providers.
The Infrastructure Engineer should report a Level II failure within two minutes to the team
leader in charge, who will handle the failure and report the progress of failure handling to the
Management Representative in a timely manner.
The team leader in charge should report a Level I failure within two minutes to the
Management Representative, who will in turn report it within two minutes to the Data Center
Manager and update the progress of failure handling every two hours. The Data Center
Manager should circulate details of the failure to relevant leaders of the Company and decide
whether to initiate the UIOC process.
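The reporting paths above can be written down as data. This is a sketch under stated assumptions: the hops and the two-minute limits are taken from the text, while the table name `REPORTING_PATHS`, the function `reporting_deadlines`, and the idea of adding the limits up along the chain are illustrative choices, not part of the documented process.

```python
# Each hop: (reporter, recipient, deadline in minutes), taken from the text above.
REPORTING_PATHS = {
    "III": [],  # handled and followed up by the Infrastructure Engineer
    "II": [("Infrastructure Engineer", "Team Leader", 2)],
    "I": [
        ("Team Leader", "Management Representative", 2),
        ("Management Representative", "Data Center Manager", 2),
    ],
}


def reporting_deadlines(level: str) -> list[tuple[str, int]]:
    """Return (recipient, cumulative minutes) for each hop of a failure level.

    Minutes are counted from the moment the first reporter in the path
    learns of the failure (an assumption made for this sketch).
    """
    elapsed = 0
    result = []
    for _reporter, recipient, minutes in REPORTING_PATHS[level]:
        elapsed += minutes
        result.append((recipient, elapsed))
    return result
```

Under these assumptions, `reporting_deadlines("I")` shows that the Data Center Manager should be informed no later than four minutes after the team leader learns of a Level I failure.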
2.2.1.4 Failure handling
The Infrastructure Engineer is responsible for the response, classification, and reporting of
failures as well as the coordination of resources for failure handling.
For a Level III failure, the Infrastructure Engineer is responsible for 1) coordinating with
relevant employees for failure handling; 2) where necessary, notifying relevant service
providers for emergency response and repair within 30 minutes; 3) reporting the progress of
failure handling to the team leader.
For a Level II or I failure, the Infrastructure Engineer should go to the scene of the incident and
notify the team leader as soon as possible. She/he is also responsible for 1) notifying relevant
vendors/equipment manufacturers for failure handling and repair within 10 minutes; 2) if
failure handling makes no progress, urging the manufacturers to take
emergency actions (for example, providing back-up equipment); 3) reporting the progress of
failure handling to the team leader and Management Representative every two hours. For a
Level I failure, the Management Representative should report it to the Data Center Manager.
Upon successful handling of the failure, the failure incident should be recorded in the
management system, including a full description of the entire failure handling process.
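The time limits in this subsection lend themselves to a short calculation. The sketch below assumes that the clock starts when the failure is detected, which the text does not state explicitly; the names `VENDOR_NOTIFY_MINUTES`, `vendor_deadline`, and `progress_report_times` are hypothetical.

```python
from datetime import datetime, timedelta

# Notification windows stated in the text above, in minutes.
VENDOR_NOTIFY_MINUTES = {"I": 10, "II": 10, "III": 30}
PROGRESS_INTERVAL = timedelta(hours=2)  # stated for Level I and II failures


def vendor_deadline(level: str, detected_at: datetime) -> datetime:
    """Latest time by which vendors or service providers should be notified."""
    return detected_at + timedelta(minutes=VENDOR_NOTIFY_MINUTES[level])


def progress_report_times(detected_at: datetime, resolved_at: datetime) -> list[datetime]:
    """Two-hourly progress report times between detection and resolution."""
    times = []
    t = detected_at + PROGRESS_INTERVAL
    while t <= resolved_at:
        times.append(t)
        t += PROGRESS_INTERVAL
    return times
```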
2.2.2 Problem management
A problem is the root cause of one or more incidents. Problem management aims to identify
the root causes of incidents and prevent the occurrence of incidents by taking proactive actions to
identify and resolve problems before they can cause incidents. The management of a problem very
often involves a long time-cycle to diagnose and resolve its root cause based on appropriate
planning.
As problems are root causes of risks and incidents, they should be managed in association
with risks and incidents. A problem is ranked according to the risk ranking of Ping An Data
Center, to be detailed later, and is classified in the same way as incidents, as detailed above.
2.2.3 Change management
Change management aims to assess, approve, implement, and review every change in a
controlled manner in order to ensure the implementation of standardized methods and processes,
prevent unauthorized changes, minimize the risk and impact of emergency changes and related
emergency incidents, and maintain the traceability of changes.
The elements of change management include classification of changes, change management
process, definition of roles and responsibilities for change management as well as the initiation,
approval, implementation, and closing of changes and policies for normal approval and pre-
authorization of changes.
2.2.3.1 Definition of change management
Change management: the documented process of managing risky actions involved in the day-
to-day operations and maintenance of the infrastructure.
Change management aims to avoid risks associated with change implementation through a
standardized management process. The scope of change management covers annual routine
changes, incident-type changes, changes to the data center system structure, and changes to
equipment conditions, parameters, and configurations.
2.2.3.2 Change classification
A change to the infrastructure operations of Ping An Data Center is classified as Level I, II,
or III based on its impact.
Level I changes (or major changes) are those changes that pose major hazards to the power
distribution and HVAC systems of the data center or affect the security of the dual power
supply to racks, the overall cooling system of the computer room, the monitoring system, or
the fire fighting and security system.
Level II changes include maintenance-related changes and modifications to parameter
settings. Maintenance-related changes mainly include repairs to individual malfunctioning
equipment sets, alterations to individual equipment set configurations, and maintenance-type
incidents having no impact on the security of the dual power supply of IT power load.
Level III changes are mainly normal modifications to the parameters and alterations to the
operational conditions of individual equipment sets.
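One way to read the three change levels is as a simple decision rule. The sketch below is an interpretation, not the data center's actual rule set: the list of critical systems paraphrases the Level I description, and the `kind` labels used to separate Level II from Level III are hypothetical, since the text does not draw that boundary precisely.

```python
# Systems whose exposure makes a change Level I, paraphrased from the text above.
CRITICAL_SYSTEMS = {
    "power distribution",
    "HVAC",
    "dual power supply to racks",
    "overall cooling",
    "monitoring",
    "fire fighting and security",
}

# Hypothetical labels for Level II work on individual equipment sets.
LEVEL_II_KINDS = {"repair", "configuration alteration", "parameter setting"}


def classify_change(affected_systems: set[str], kind: str) -> str:
    """Return "I", "II", or "III" for a proposed change."""
    if affected_systems & CRITICAL_SYSTEMS:
        return "I"
    if kind in LEVEL_II_KINDS:
        return "II"
    return "III"
```

Checking the Level I condition first reflects the principle that the most severe applicable level wins; in practice the final classification would rest with the approver rather than with a lookup table.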
2.2.3.3 Definition of roles and responsibilities for change management
Figure 2.2-1 defines the roles and responsibilities of the Change Management Commission,
Daily Operations Manager/Bank IT Manager, Infrastructure Manager, engineers, monitoring
personnel, and technicians in the change management process.
2.2.3.4 Hierarchical change management
Fig. 2.2-1 Schematic illustration of the hierarchical change management
[Figure: flow of change initiation, change approval, change implementation, and updating on
change implementation among the Change Management Commission, the Daily Operations
Manager/Bank IT Manager, the Infrastructure Manager, engineers, and monitoring personnel and
technicians, for Level 1, Level 2, and Level 3 changes.]
2.2.3.5 Initiating a change
The change management system provides a detailed definition of change initiators and the
major elements of change management, including the type of change request form as well as the
basic information, justification, schedule, and classification of the requested change.
2.2.3.6 Approving a change
At this step of the change management process, the person responsible for the approval of the
change request assesses and checks the potential impact of the requested change and decides
whether to proceed with the requested change, in order to ensure that the requested change can be
implemented to satisfy business requirements while minimizing its impact on services.
2.2.3.7 Implementing a change
At this step of the process, an approved change is implemented in the production system
according to the schedule and procedure provided on the approved change request form. The
details of the on-site change implementation should be recorded.
2.2.3.8 Closing a change
This step aims to investigate whether the expected effect of a change has been realized,
verify the results of the change, and check whether correct and complete information has been
recorded on the change request form.
2.3 Uptime Management & Operations (M&O) program
The Uptime Institute M&O certification, a well-recognized certification in the international
data center industry, aims to help data centers improve their operations and management by
assessing a comprehensive set of indexes.
The major philosophy of M&O is to minimize human and equipment risks and improve the
availability of data centers by providing best practices obtained from the cases of data center
operations across the world.
Ping An Data Center passed the M&O certification in 2017–2018, with the highest score of
96.3, achieved through the shortest certification program among Chinese data centers. The M&O
certification is based on an assessment of five categories: staffing and organization; maintenance
management; training management; planning, coordination, and management; operating
conditions. The certification requires an overall score of 80 or above for the five categories and is
valid for a period of two years. The following is a description of our M&O certification program
according to the five categories.
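The two numerical conditions mentioned above (an overall score of 80 or above and a two-year validity period) can be expressed as a short check. This is a sketch for illustration only; the function name `certification_status` and the award date used in any example are hypothetical, and the real assessment is carried out by Uptime Institute, not by a formula.

```python
from datetime import date

PASS_SCORE = 80.0  # minimum overall score stated above
VALID_YEARS = 2    # validity period stated above


def certification_status(score: float, awarded: date, today: date) -> str:
    """Return "not certified", "valid", or "expired"."""
    if score < PASS_SCORE:
        return "not certified"
    try:
        expires = awarded.replace(year=awarded.year + VALID_YEARS)
    except ValueError:  # awarded on 29 February
        expires = awarded.replace(year=awarded.year + VALID_YEARS, day=28)
    return "valid" if today < expires else "expired"
```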
2.3.1 Staffing and organization
Appropriate staffing of qualified personnel is critical for achieving the long-term
performance objective of the data center. To achieve the uptime target for the data center,
adequate staffing and vendor support must be provided to carry out all the maintenance and
operating activities. All the employees in the data center must have the experience and technical
qualifications required to carry out the activities assigned to them and all the roles and
responsibilities must be defined, with their importance confirmed by the management.
2.3.1.1 Staffing
Ping An Data Center houses the systems and associated components required to run the core
business, and is expected to operate 24 × 7. The data center is provided with adequate staffing
required for this level of operations availability. A job description is established for each of the jobs.
A job description covers the recruitment requirements of education, experience, professional
competence, and core competence for the prospective job holders as well as the scope of
responsibilities, main responsibilities, and challenges and solutions of the job and the hierarchical
position of the job in the organizational structure, in order to ensure that any new employee of the
data center satisfies all the requirements and understands her/his roles and responsibilities.
A job responsibility matrix is developed for the 47 different roles defined for the operations of
the data center. The matrix briefly describes each task and indicates how each role participates
in it, in one of four ways: implementation, approval, support, or being informed. The matrix is
updated to reflect the latest changes in the roles and responsibilities assigned to
employees. This helps all employees understand their roles clearly and carry out their
assigned tasks in an orderly manner.
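A job responsibility matrix of this kind maps each task to the way each role takes part in it. The sketch below shows one possible data structure; the single task and the role names in it are invented for illustration and are not taken from the actual matrix of 47 roles.

```python
PARTICIPATION = {"implementation", "approval", "support", "informed"}

# task -> {role: participation}; the entries are illustrative only.
matrix = {
    "UPS preventive maintenance": {
        "Infrastructure Engineer": "implementation",
        "Infrastructure Manager": "approval",
        "Vendor Technician": "support",
        "Monitoring Personnel": "informed",
    },
}


def tasks_for(role: str, participation: str) -> list[str]:
    """List the tasks in which a role takes part in the given way."""
    if participation not in PARTICIPATION:
        raise ValueError(f"unknown participation type: {participation}")
    return [task for task, roles in matrix.items() if roles.get(role) == participation]
```

Keeping the matrix as data rather than prose makes it easy to answer questions such as "which tasks does this role approve?" whenever responsibilities change.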
A data center is a complicated, equipment-intensive facility. Ping An Data Center divides the
facility into 15 zones and assigns a person to be responsible for the equipment in each zone, with the
detailed responsibilities defined and documented. The person-in-charge for each zone is assigned on
a regular rotation basis, such that all the employees of the data center can gain a clear, in-depth
understanding of the facility.
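A regular rotation of zone responsibility can be generated with a round-robin rule. The text does not say how the rotation is scheduled, so the following is only one plausible scheme; `zone_owners` and its parameters are hypothetical names.

```python
def zone_owners(staff: list[str], zones: int, period: int) -> dict[int, str]:
    """Map zone number (1-based) to its person-in-charge for a rotation period.

    Each new period shifts every assignment by one position (round-robin).
    """
    if not staff:
        raise ValueError("staff list must not be empty")
    return {z: staff[(z - 1 + period) % len(staff)] for z in range(1, zones + 1)}
```

With 15 zones, calling `zone_owners(staff, 15, period)` for successive periods moves each employee through the zones in turn, which is the effect the rotation is meant to achieve.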
2.3.1.2 Personnel qualification
The operations of the data center involve day-to-day operating activities for the medium- and
low-voltage power distribution, cooling, and firefighting systems, elevator management, and work
above the ground. The employees involved in these tasks have been certified for operating the
medium- and low-voltage and HVAC systems by the State Administration of Work Safety, for
primary building (structure) fire-fighting by the Fire Department of the Ministry of Public Security,
and for elevator management (safety management for special equipment) by the Market and Quality
Supervision Commission of Shenzhen Municipality.
Personnel qualification management covers the collection and regular review of personnel
qualification information and follow-up with relevant employees for certification renewal/review, in
order to ensure the validity of all certificates.
2.3.1.3 Organizational Structure
An organizational chart of Ping An Data Center is available, clearly indicating the work
interfaces and reporting lines of the departments (Infrastructure, IT, Security Management,
Vendor Management, and Housekeeping) as well as the communication channels between the
different organizational functions.
2.3.2 Maintenance management
2.3.2.1 Preventive maintenance plan
Preventive maintenance plan: At the end of each year, the operations team of Ping An Data
Center prepares the next-year preventive maintenance plan by equipment type based on inputs from
equipment suppliers. The plan, which consists of more than 150 line-items, is approved by the
management and implemented strictly according to pre-established methods of procedure (MOPs).
The completion rate of the preventive maintenance plan is a key performance indicator (KPI) of
the data center, with the target set at 95%.
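The completion-rate KPI can be computed directly from the plan. The sketch below assumes that the rate is counted per line item (completed items divided by planned items); the text gives the 95% target but not the counting method, and the function names are hypothetical.

```python
TARGET = 0.95  # KPI target stated above


def pm_completion_rate(planned: int, completed: int) -> float:
    """Fraction of planned preventive maintenance line items completed."""
    if planned <= 0:
        raise ValueError("planned must be positive")
    return completed / planned


def meets_target(planned: int, completed: int) -> bool:
    """True if the completion rate reaches the 95% target."""
    return pm_completion_rate(planned, completed) >= TARGET
```

For a plan of 150 line items, this rule means that at least 143 must be completed in the year.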
2.3.2.2 Maintenance management system
An effective maintenance management procedure for tracking the status and results of all
maintenance activities
Providing tabulated information (brand, model, date of manufacturing, date of installation,
maintenance contract, and operating instructions) of all major equipment sets
An order of maintenance providing special tools and materials required for the preventive
maintenance (PM)
Saving data of equipment maintenance activities and their trends
List of critical spare parts and re-ordering points
Equipment list: a list of all critical equipment sets, including information of equipment sets,
their maintenance, and their critical parts. Equipment information includes the classification,
location, description, brand, function/model, date of installation, and serial number of equipment.
Equipment maintenance information includes the department in charge of maintenance, date of
equipment insurance, and contact person and phone number for maintenance. Information of
critical equipment parts is a list of information of critical parts by equipment set, with different
equipment sets having different critical parts.
Tool management: including specification for equipment calibration, a list of tools, and
records of tool calibration.
Management of critical spare parts: As different data center facilities have different types and
physical locations of equipment sets and different levels of vendor support, each facility defines
its own critical spare part list based on its own actual situation and performs regular checks against
the critical spare part list. The aim is to repair malfunctioning equipment sets as quickly as
possible, shorten the mean time to repair (MTTR), and minimize the impact on business.
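The link between repair time and availability can be made concrete with the standard steady-state relation, availability = MTBF / (MTBF + MTTR), where MTTR is the mean time to repair. The sketch below also shows a re-ordering point check for spare parts; the function names and the "at or below" rule are assumptions made for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def needs_reorder(on_hand: int, reorder_point: int) -> bool:
    """True when stock of a critical spare part is at or below its re-ordering point."""
    return on_hand <= reorder_point
```

Holding the right spare parts on site shortens repair time, and the formula shows why that matters: for the same failure rate, halving the repair time roughly halves the unavailability.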
2.3.2.3 Computer room housekeeping policy
Standard of data center housekeeping:
Tidy and clean computer room floor
Computer room free of flammable and combustible materials, tools for housekeeping,
personal belongings, and paper packaging
Tidy and clean computer room environment (IT computer room, power distribution room,
cooling station, and other functional areas)
2.3.2.4 Vendor support
Approved vendor list (for support under both normal and emergency conditions),
including names, contact persons, and contact information of vendors
SLAs, including clauses for scope, time, frequency, and response time of maintenance
and support as well as training needs
Vendor engagement process and qualified vendor service persons
2.3.2.5 Deferred maintenance procedure
The process for tracking and supervising deferred maintenance, including the initiation,
approval, implementation, and closing of deferred maintenance as well as analysis of
associated risks.
2.3.2.6 Life cycle planning
The procedure for the planning and financial control of the life-cycle-based replacement
of major equipment sets or components
2.3.2.7 Failure analysis policy
Equipment failure list (including the time of failure, equipment involved, failure analysis,
and lessons learned)
An effective process for identifying the root causes of problems and taking appropriate
corrective actions
2.3.3 Training management
2.3.3.1 Staff training
Onboarding training is provided for every new employee to ensure that they are technically
competent and understand the working systems. Training combines document-based presentations
and on-site drills to cover:
1) All processes, procedures, and policies for operations and management
2) Site Configuration Procedures (SCP)
3) Standard Operating Procedures (SOP)
4) Emergency Operating Procedures (EOP)
5) Maintenance Operating Procedures (MOP)
6) Maintenance Management System (MMS)
This also includes the training management procedure, which covers the curriculum, course
materials, and records of training, and the procedure for personnel qualification.
2.3.3.2 Vendor training
A list of training courses to be taken by vendors
Introduction to the process and procedure for vendors to provide on-site services
Vendor training is mandatory for every regular employee.
The training management procedure, covering the curriculum, course materials, and
records of training
2.3.4 Planning, coordination, and management
2.3.4.1 Computer room policy
The well-established procedures of the data center, including:
1) Equipment management policy of the data center (for example, the principle for
configuration changes and operating solutions under normal and emergency conditions)
2) Site Configuration Procedures (SCP)
3) Standard Operating Procedures (SOP)
4) Emergency Operating Procedures (EOP)
5) Change management (risk assessment and approval of requested changes)
2.3.4.2 Financial policy
The financial procedure for ensuring that an adequate fund is available for the data center
2.3.4.3 Document and data library
The following data and records must be maintained (kept at the data center or off-site):
1) As-built drawings
2) Documents for operations maintenance
3) Research results
4) Testing reports
5) Maintenance contracts and clauses
6) Documented automatic control procedures
The above data must be made readily available at the data center, maintained at the data
center in a centralized manner, and accessible to all employees. A procedure should be established
for the revision/update of the above data and should be made available to all employees of the
data center.
2.3.4.4 Capacity management
Capacity management includes the following processes:
1) Regular review and update of the used capacity of the data center in order to add new or
remove existing IT facilities as necessary;
2) Regular tracking of used rack, power, and cooling capacities, which is combined with the
prediction of the increasing demand for space, power, and cooling, air flow planning and
management, and power consumption analysis.
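The tracking and prediction steps above can be illustrated with two small calculations. This is a sketch assuming linear demand growth, which is a simplification chosen here and not a method described in the text; the function names are hypothetical, and the same functions apply equally to rack count, power, or cooling capacity.

```python
def utilization(used: float, capacity: float) -> float:
    """Used share of a capacity (racks, kW of power, or kW of cooling)."""
    return used / capacity


def periods_until_full(used: float, capacity: float, growth_per_period: float) -> float:
    """Periods left before capacity is reached, assuming linear growth."""
    if growth_per_period <= 0:
        return float("inf")
    return max(0.0, (capacity - used) / growth_per_period)
```

For example, 600 kW used out of 1,000 kW with demand growing by 50 kW per quarter leaves eight quarters of headroom under this assumption.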
2.3.5 Operating conditions
2.3.5.1 Load management
Procedure for ensuring that the actual load does not exceed the capacity when switching
between the primary and redundant paths.
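The load condition can be checked before any switchover. The sketch below assumes that, during a switch, a single path must be able to carry the entire load by itself; the function name and the capacity figures in the usage note are illustrative.

```python
def safe_to_switch(total_load_kw: float, path_capacities_kw: dict[str, float]) -> bool:
    """True if every single path could carry the whole load by itself."""
    return all(total_load_kw <= cap for cap in path_capacities_kw.values())
```

With path capacities of 500 kW (primary) and 450 kW (redundant), a 400 kW load passes the check while a 480 kW load fails it, because the redundant path alone could not carry it.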
2.3.5.2 Operating configurations
Critical configuration points are defined based on risk, availability, and cost.
Chapter 3 Security Management
Security management in data centers is broken down into the following three
categories: information security, physical security, and personnel safety. Financial data
centers require higher security standards than other types of data centers. The Ping An Data
Center has established a precise and accurate security management system to protect the
operations of its various components by following the ISO/IEC 27001:2005 Information Security
Management System, GB/T 21052-2007: Information security technology—Physical
security technical requirement for information system, ISO 9001, and M&O.
3.1 Information security
As the level of informatization increases across the world, the information security of
data centers has become a widespread concern, and many organizations are exploring
techniques to safeguard it. The Ping An Data Center has established a
systematic management system for information security by following the ISO/IEC 27001:2005
Information Security Management System standard, which has been adopted in most
countries. The rules for information confidentiality are as follows:
1) All the rules of Ping An Technology (Shenzhen) Co., Ltd. for computer information
and cyber security shall be followed.
2) No one may take any materials out of the computer room or disclose any
information stored in the computer room without permission, including
confidential documents, software copies, technical files, and other classified
data.
3) No one may disclose any secret information, classified information, or highly
confidential information (including data and documents) about the data center.
4) No one may disclose, share, or misappropriate server data such as account IDs,
passwords, and IP addresses.
5) Non-authorized personnel are not allowed to access the restricted area or use the IT
facilities in the data center; no one may use any IT facilities other than those
necessary for work; and no one may interfere with anybody else's work or the
operation of the data center.
6) Non-authorized personnel are not allowed to modify the operating system or
settings of the IT facilities (such as networks and servers).
7) No one may embezzle, alter, or sabotage the utilities in the data center.
8) An external person (such as a vendor or visitor) is required to sign a confidentiality
declaration prior to his/her first access to the data center and shall be subjected to a
security check by the administrators and security guards of the data center. Any
person violating the confidentiality rules shall be dealt with under the relevant rules of the
data center according to the severity of the violation. Where
the violation constitutes a crime, it shall be reported to the Legal and
Security department of the company so that legal liability can be pursued.
9) An employee is only allowed to use the office computer allocated to him and is
not allowed to alter the operating system installed by the IT administrator.
10) The password policy is mandatory for all employees: account IDs and passwords
must not be disclosed to others, the log-in password must be changed
every 90 days, and the allocated computer must be returned upon job rotation or
resignation.
11) Any work e-mailed to an external party must be copied to and approved by the
line manager. Any sensitive information (for example, account number, key, and
IP address) in emails and attachments must be appropriately shielded.
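The 90-day password rule in item 10 is easy to check automatically. The following is a minimal sketch; the function name `password_expired` is hypothetical, and treating a password as expired only after more than 90 full days is an interpretation of "every 90 days".

```python
from datetime import date, timedelta

MAX_PASSWORD_AGE = timedelta(days=90)  # rotation interval from rule 10


def password_expired(last_changed: date, today: date) -> bool:
    """True once more than 90 days have passed since the last change."""
    return today - last_changed > MAX_PASSWORD_AGE
```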
3.2 Physical security management
Physical security, referring to the security of the computer room as well as the
equipment and facilities of the data center, is the premise for safeguarding the information
system security of the data center. If the physical security of the computer room cannot be
safeguarded or there exist security hazards, then the security of the entire data center cannot
be realized.
The Ping An Data Center was constructed according to Class A of GB 50174 Code for
Design of Electronic Information System Room, which provides a solid foundation for the
physical security of the data center. In addition, a control system for different levels of access
has been incorporated into the day-to-day operations of the data center, including access
control, material control, and fire safety.
3.2.1 Physical security configuration
Physical security of the Ping An Data Center is organized into the following five access
levels: site, building, computer room, zone, and rack.
Site: Security guards are employed to perform access control at site entrances by
ensuring that employees and visitors display proper passes or identification before
entering. Patrolling is also part of the security guards’ duties.
Building: An access control system, material screening, and face recognition are
installed at the entrance to the data center building. In addition,
security guards are responsible for the administration of persons entering and
leaving the building.
Computer room: Access to the computer rooms of the data center is controlled by
face recognition, access card, and fingerprint verification.
Zone: A computer room is zoned for clients, where the zones are separated by wire
meshes and cold aisles. Access to the zones is separately controlled with the door
access control system to ensure that the zone is only accessible to pre-authorized
users.
Rack: The front and rear doors of a rack in the computer rooms are locked and can
only be unlocked by the pre-authorized users.
Surveillance cameras are installed in the data center building and computer rooms, and
footage from the last three months is retained for review.
Access records for control points in the computer room are kept for a period
of one year.
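The five access levels are nested: reaching a rack implies passing every level above it. The sketch below models that nesting under the assumption that a person must hold an authorization for each level from the site down to the target; the names `LEVELS` and `may_reach` are hypothetical and do not describe the actual access control system.

```python
LEVELS = ["site", "building", "computer room", "zone", "rack"]


def may_reach(granted: set[str], target: str) -> bool:
    """True if the person holds every level from the site down to the target."""
    needed = LEVELS[: LEVELS.index(target) + 1]
    return all(level in granted for level in needed)
```

A rack key alone is therefore not enough: without computer room and zone authorization, `may_reach` returns False for the rack.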
3.2.2 Terminology and definition
Permanent access
This level of access is granted to those employees who require permanent access to the data
center and is controlled by the means of access card, fingerprint, and iris information. The
facility administrator of the data center maintains a list of the persons with permanent access to
the data center, which is updated when an access is granted, or a granted access is canceled, and
is reviewed by the leader of the Data Center Infrastructure Management department.
Temporary access
This level of access is granted to those employees of Ping An Technology who do not
require permanent access to the data center but rather temporary access for work or external
parties who request temporary access to the facility. A person who has been granted a
temporary access through a pre-established procedure is required to enter and leave the data
center in the company of a person with permanent access.
IT facility zone of the data center
This is the zone for installing the IT facilities of the data center—racks for storage
devices, network devices, and servers, excluding other areas of the data center—for example,
rooms for infrastructure equipment, gas fire extinguishers, and uninterruptible power supplies
(UPS).
3.2.3 Procedure
Access control
Access control by zone: Access to the data center is further defined by zones according to
job responsibilities. That is, a person is only allowed to access those zones in the data center
that he requires to enter for job-related purposes.
Access application: Permanent access to the data center can be applied for according to the
data center access application procedure. The Administrator of Data Center Infrastructure
Management adds the approved access to the employee identity card of the applicant.
Changes to the granted access to the data center
When an employee with permanent access to the data center resigns or is assigned to
a different job or a different zone of the data center, his access needs to be deleted or updated
according to the data center access change approval procedure. The Administrator of Data
Center Infrastructure Management then applies the approved change to the employee
identity card.
Record of granted access to the data center
The unique permanent access to the data center granted to an employee (including his
employee identity card, fingerprint, and iris information) is recorded in the access control
system of the data center. The infrastructure engineer of the data center checks and updates
the system every month as per the granted accesses and retrieves from the system the list of
accesses and submits it to the data center manager for approval. A person with permanent
access to the data center shall fulfill his security commitment to the data center. He shall
retain his employee identity card in a proper manner and may not lend it to any other person.
If the employee identity card is lost, he shall report the loss immediately.
Temporary access to the data center
If an employee of Ping An Technology requires temporary access to the data center for a
certain time period for job-related purposes, he shall submit an application for access
according to the temporary access approval procedure of the data center. The application for
temporary access should indicate who will enter the data center, at what time, for what
authorized task, and on what object, as well as the coordination required to perform the
intended task (including risk assessment and risk mitigation plan). Upon the approval of the
temporary access, the applicant will be provided with a visitor identity card. Prior to entering
the working area of the data center, his relevant identity information will be recorded by the
data center administrator on duty, who will accompany him into the working area and collect
the visitor identity card when he leaves the data center.
External visitor to the data center
For an external visitor to access the data center, an employee of Ping An Technology
shall submit an application for the visitor’s access two days in advance. The application shall
be made via the relevant electronic access application system according to the data center
visitor access application procedure and should indicate the reason and time of the visit as
well as the specific zones of the data centers to be visited. An approved visitor shall visit the
data center in the company of a member of the Data Center Operations Team.
3.2.4 Site access registration system
To enter the data center, persons with only temporary access shall register at the security
post. For employees with temporary access, the registration shall be carried out in sequence;
for the group of external visitors or material handling operators, the registration process may
be completed under the name of a representative. A courier may be allowed to enter the
office area of the data center by showing his identity card without going through the
registration procedure; however, the courier must be accompanied by the visited
employee working in the computer room. A person granted temporary access to the data
center shall sign a non-disclosure agreement prior to his first entry into the data center.
The security guard on duty at the security post shall request persons with temporary
access to enter the following information in the visitor registration system: name,
company/department, time of visit, purpose of visit, zones to be visited, materials brought
along, and number of companions. For employees of Ping An with temporary access,
registration shall be carried out using an employee identity card; for external visitors,
registration procedure shall be executed by showing a valid personal identity certificate
(identity card, passport, social security card, or driver's license). A visitor card and visitor
registration form will then be issued, which should be carried by the visitor during the visit.
Upon leaving the data center, a person with temporary access shall return the visitor
card and visitor registration form. The visitor registration form shall be signed by the person
visited and the time of leaving should be indicated as well. The visitor shall record the time
of leaving the data center in the visitor registration system. The security guard on duty at the
security post shall check if the information provided in the visitor registration system is true
and complete.
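The completeness part of the security guard's check can be sketched in code. This is an illustration under stated assumptions: the `VisitorRegistration` class, its field names, and `registration_problems` are hypothetical and are not taken from the actual visitor registration system. Only completeness is checked here; whether the information is true still has to be verified by the guard.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

# Identity documents named in the text for external visitors.
EXTERNAL_ID_TYPES = {"identity card", "passport", "social security card", "driver's license"}
# Employees with temporary access register with their employee identity card.
EMPLOYEE_ID_TYPE = "employee identity card"


@dataclass
class VisitorRegistration:
    """Hypothetical record of the fields entered at the security post."""
    name: str
    company_or_department: str
    time_of_visit: datetime
    purpose: str
    zones: List[str]
    materials: List[str]
    companions: int
    is_employee: bool
    id_type: str
    time_of_leaving: Optional[datetime] = None


def registration_problems(reg: VisitorRegistration, leaving: bool = False) -> List[str]:
    """Return completeness problems; an empty list means none were found."""
    problems = []
    if not reg.name.strip():
        problems.append("name missing")
    if not reg.company_or_department.strip():
        problems.append("company/department missing")
    if not reg.purpose.strip():
        problems.append("purpose of visit missing")
    if not reg.zones:
        problems.append("zones to be visited missing")
    if reg.companions < 0:
        problems.append("number of companions invalid")
    accepted = {EMPLOYEE_ID_TYPE} if reg.is_employee else EXTERNAL_ID_TYPES
    if reg.id_type not in accepted:
        problems.append("identity document not accepted")
    if leaving:
        # On departure the time of leaving must also be recorded.
        if reg.time_of_leaving is None:
            problems.append("time of leaving missing")
        elif reg.time_of_leaving < reg.time_of_visit:
            problems.append("time of leaving precedes time of visit")
    return problems
```

Calling `registration_problems(reg, leaving=True)` when the visitor card is returned additionally checks that the time of leaving has been recorded.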
In cases where the visitor registration system is technically unavailable, the registration
should be carried out via the data center access registration form instead and the record shall
be filed and maintained according to the relevant record control procedures.
Cleaning workers and supervisors may enter the pre-authorized zones using their special
access cards, but shall not be allowed into unauthorized zones without the company of the data
center’s technician on duty.
Fig. 3.2-1 Flow chart depicting the access control of external visitors
Control of temporary access
1) For an external party to visit the data center or any other working area containing
sensitive information, an application should be submitted to the administration office of the
data center. Upon approval of the application, a permit will be issued by the administration
office, and the data center technician on duty will then lead the visitor into the data center in
the pre-established visit window. For an external party to enter the computer room (excluding
external parties delivering goods into the warehouse), the security guard on duty shall use the
walk-through metal detector for performing a security check of the visitor and the material
brought along by him.
2) Once allowed into the data center, the visitor shall prepare for the intended
maintenance task or rest in the designated area; no lingering is allowed in any other part
of the office area.
3) For a person with temporary access who will be involved in implementing changes to
the computer room, he is required to be prepared (for example, ensuring that the required IT
equipment and spare parts are identified and issued from the warehouse) before the daily
maintenance window (23:00 - 06:00).
4) Any person entering the data center shall check if the door to the data center has been
properly closed (the door should be normally closed). The data center technician on duty shall
check if the doors to computer rooms are properly closed and address any problem in a timely
manner.
5) No visitor shall enter any area that he has not been permitted to enter. Any violation of
this requirement will be reported to the customers concerned and the management of the data
center. The data center reserves the right to revoke the violator's access to the data center,
depending on the severity of the violation.
6) Photography or video recording using IT equipment is not allowed in the data center
without permission, except for job-related purposes by the employees. No person may take
any materials out of the data center. No person may take any software copies, technical files,
or any internal data classified as secret information or marked with higher levels of
confidentiality out of the data center or disclose them to any third party. A visitor to the data
center must sign a non-disclosure agreement as defined by the confidentiality management
system of the data center.
7) In cases where a person is allowed into the data center for hardware maintenance or
installation of any infrastructure or IT facility, or alteration to any optical fiber, network cable,
power socket, or power cable, the data center technicians on duty shall be notified of the
attempted maintenance or alteration, which shall then be performed under the supervision of
the technician on duty.
8) No person may alter any cable or floor in the data center without permission. In cases
where the power cables, network cables, or any other wiring in the data center is planned
to be extended, the data center planning administrator shall be notified of the planned
extension. The planning administrator will design the layout of the sockets and ports required
for the system extension, and the data center operations team will implement the planned
extension. No person may open the floor or alter the power or network cabling without
permission.
9) No external visitor may carry any baggage into the IT facility zone of the data center.
10) In cases where an external service person is involved in the maintenance of any IT
facility or equipment and has logged into a server for this purpose, the IT facility
administrator of the data center shall confirm that the external service person has logged out
of the server and has closed the log-in page before leaving the data center. Furthermore, the
service person shall go through the leaving procedure at the security post before leaving the
data center.
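The daily maintenance window named in item 3) above (23:00 - 06:00) wraps past midnight, which is easy to get wrong when it is checked programmatically. The sketch below shows one way to test whether a clock time falls inside such a window. The function name `in_maintenance_window` is hypothetical, and treating 06:00 itself as outside the window is an assumption, since the text does not state whether the end boundary is inclusive.

```python
from datetime import time

WINDOW_START = time(23, 0)  # start of the daily maintenance window
WINDOW_END = time(6, 0)     # end of the window (assumed exclusive)


def in_maintenance_window(t: time, start: time = WINDOW_START, end: time = WINDOW_END) -> bool:
    """Return True if clock time t lies in [start, end), allowing wrap past midnight."""
    if start <= end:
        # Window that does not cross midnight.
        return start <= t < end
    # Window that crosses midnight: late evening or early morning.
    return t >= start or t < end
```

A plain `start <= t < end` comparison would reject every time of day for this window, because the start is later on the clock than the end.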
3.2.5 Control of goods
The security guard is in charge of checking goods delivered into and out of the data
center as per the relevant goods control requirements.
No foods, beverages, or any other non-work related materials (including personal bags)
are allowed into the data center.
No combustible, flammable, fragile, polluting, or any other dangerous materials as well
as materials with strong magnetic fields that may interfere with IT facilities are allowed into
the data center.
All materials to be carried into the IT facility zone of the data center must be placed in
the baskets provided at the security post and should be subjected to security checks when
carried into and out of the data center. Personal belongings must be placed in the designated
lockers.
No personal notebooks or cameras are allowed into the data center without permission.
Notebooks and cameras are available at the data center upon requesting the data center
operations administrator (such a request may be made by filling a registration form for
borrowing tools from the data center).
For any non-personal belongings to be taken out of the data center, a gate pass shall be
prepared and approved. For any IT equipment with any magnetic media (a data security
concern) to be taken out of the data center, demagnetization treatment must be given to the
equipment by the operations administrator on duty and verified by the security guard on duty.
3.2.6 Fire safety management system
3.2.6.1 Regulations on fire and safety education and training
1) Regular training to employees on firefighting related laws, rules, and regulations.
2) Annual written examinations on firefighting knowledge and firefighting drills to
improve the firefighting and safety awareness and skills of the employees.
3.2.6.2 Regulations on fire hazard screening
1) Implement a responsibility system for fire prevention and safety (where the
responsibilities for fire prevention and control are defined for each job and included in the job
performance appraisal) and carry out regular fire and safety hazard screenings.
2) The firefighting facilities of the data center are maintained by a service provider, who
performs monthly fire hazard screenings and tracks the mitigation/elimination of identified
hazards.
3) Identified fire hazards shall be recorded by the inspector and signed by the parties
responsible for the mitigation/elimination.
3.2.6.3 Administrative regulations on emergency evacuation facilities
1) Escape routes and emergency exits shall be kept clear, shall not be occupied for any
other purposes, and shall not be installed with fences or any other barriers that may obstruct
evacuation.
2) Emergency escape signs and emergency lighting shall be provided according to
relevant national regulatory requirements.
3) Firefighting facilities such as fire doors, emergency evacuation signs, emergency
lighting, mechanical smoke-discharging and ventilation, and emergency broadcasting shall
be regularly inspected, tested, maintained, and serviced for normal operation.
3.2.6.4 Regulations on fire safety
1) A hot work permit shall be obtained for any operation involving open flames.
2) Prior to any hot work, the scene (within a radius of 5 m) shall be free of flammable
and combustible materials and shall be properly segregated. Moreover, it shall be equipped
with appropriate types and quantities of fire-extinguishing materials (which are available
from the security department and shall be returned immediately at the end of the hot
operation, along with a record prepared for reporting any material that was used during the
operation).
3) If hot work is attempted in a production area, the hot work permit shall be approved
by the line managers or above and the entire operation shall be supervised by the operations
team. For hot work being undertaken 2 m above the ground or higher, a person shall be
assigned specially for watching the operation and extinguishing any flames that may lead to a
fire.
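The conditions in items 1) to 3) above can be read as a checklist that depends on where and at what height the hot work takes place. The following sketch derives that checklist from the two inputs. The function name `hot_work_requirements` and the wording of the checklist entries are illustrative assumptions, not an official form of the data center.

```python
from typing import List

FIRE_WATCH_HEIGHT_M = 2.0  # hot work at or above this height needs a dedicated watcher


def hot_work_requirements(in_production_area: bool, height_m: float) -> List[str]:
    """Return the conditions to satisfy before and during a hot work operation."""
    requirements = [
        "hot work permit obtained",
        "scene within a 5 m radius free of flammable and combustible materials and segregated",
        "appropriate fire-extinguishing materials on hand",
    ]
    if in_production_area:
        requirements.append("permit approved by line manager or above")
        requirements.append("entire operation supervised by the operations team")
    if height_m >= FIRE_WATCH_HEIGHT_M:
        requirements.append("dedicated person assigned to watch and extinguish flames")
    return requirements
```

For example, hot work in a production area at a height of 2 m yields all six conditions, while ground-level work outside a production area yields only the first three.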
3.3 Personnel safety management
Personal safety of the operations team must be taken as a priority while sustaining the
normal operation of the data center. Ping An Data Center pays high attention to the personal
safety of the operations team and has incorporated personnel safety management into every
process of the data center.
3.3.1 Personnel safety training
Personal safety is included in both the pre-job training to new employees and on-the-job
training to the essential operations team.
A new employee shall complete the pre-job safety training and pass an examination
(with a minimum score of 80) at the end of the training during the probation period according
to the working instructions of data center on employee training. An employee must pass the
safety training examination to qualify for his job.
The safety training specialist prepares an annual safety training plan each December and
submits it to the management for approval. To minimize personal safety risks to the
operations team, the plan is based on the current safety training curriculum, the actual
operations situation, and the latest lessons learned from safety incidents that occurred both
inside and outside the company. Every member of the operations team shall take annual
safety training and pass an examination (with a minimum score of 80) at the end of the
training according to the working instruction of the data center on employee training.
The examination result is included as an index for the annual performance appraisal.
The training covers
1) electrical safety specifications;
2) HVAC safety specifications;
3) regulations on the use of facilities and tools;
4) regulations on accessing computer rooms;
5) reviews of safety incidents.
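The pass rule for the safety training examination (a minimum score of 80) is simple enough to express directly. The sketch below is illustrative only: the helper names are hypothetical, and a 0 to 100 scoring scale is assumed because the text gives the minimum score without stating the scale.

```python
from typing import Dict, List

MIN_PASSING_SCORE = 80  # minimum score stated for the safety training examination


def has_passed(score: float) -> bool:
    """Return True if the score meets the minimum passing score."""
    return score >= MIN_PASSING_SCORE


def not_yet_passed(scores: Dict[str, float]) -> List[str]:
    """Return, in sorted order, the names whose score is below the minimum."""
    return sorted(name for name, score in scores.items() if not has_passed(score))
```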
3.3.2 Day-to-day operational safety management
The operations team shall strictly follow safety specifications established by the data
center for the day-to-day operational activities (for example, electrical operations, operations
on HVAC systems, and use of facilities and tools).
3.3.2.1 Electrical safety specifications
1. An operator of electrical equipment must be physically fit (free from any disease
that may compromise personal safety during electrical operations as certified by a
doctor), equipped with appropriate electrical operation knowledge, certified for
electrical operation, and have skills for administering first aid in case of electrical
shocks as well as electrical fire prevention and extinguishing skills.
2. An electrical operation shall be performed by at least two persons, one for
operating and one for keeping watch. In cases where only one person is on
duty, he must be capable of working and handling incidents independently and is
permitted only to monitor equipment operations; he may not operate any
electrical equipment without another person keeping a careful watch for possible
dangers.
3. The operator of electrical equipment must wear insulating boots, and should wear
insulating gloves when accessing the housing or structure of a piece of equipment.
4. The switch or knife-switch that directly controls the power supply to the
electrical equipment being operated shall be switched off and attached with a
label indicating that the switch should not be turned on.
5. A power distribution device, irrespective of whether its instruments indicate a
voltage or not, shall be taken as live unless it has been confirmed discharged.
6. When a power outage is planned following a major change approval procedure, the
power outage shall be restricted to the approved scope, and may not be extended
without further approval.
7. The operations team shall carry out patrol inspections earnestly and carefully,
correctly update operation logs, and properly prepare records and reports in a
timely manner.
8. No operations administrator may take duty while intoxicated, be
involved in non-job-related affairs while on duty, or leave his
position without permission.
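The two-person rule in item 2 above can be sketched as a small function that maps the number of persons on duty to the electrical activities permitted. The function name and the activity labels "monitor" and "operate" are hypothetical, and the sketch assumes that every person counted meets the qualification requirements of item 1.

```python
from typing import Set


def permitted_electrical_activities(persons_on_duty: int) -> Set[str]:
    """Return the activities allowed for the given number of qualified persons on duty."""
    if persons_on_duty < 0:
        raise ValueError("persons_on_duty must not be negative")
    if persons_on_duty >= 2:
        # One person operates while another keeps watch.
        return {"monitor", "operate"}
    if persons_on_duty == 1:
        # A single person may only monitor equipment operations.
        return {"monitor"}
    return set()
```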
3.3.2.2 HVAC safety specifications
1. An operator of the HVAC systems must be physically fit (free of any disease that
may compromise personal safety during operation as certified by a doctor),
equipped with appropriate HVAC knowledge, and certified for HVAC operation.
2. An operation on the HVAC systems shall be performed by at least two persons,
one for operating and one for supervising.
3. Switching between the water cooling units shall be performed in the monthly
routine maintenance window; no switching is allowed without permission. In
cases where the primary cooling unit is malfunctioning, switching to the
redundant unit can only be performed after consent is obtained from the engineer on
duty.
4. The operations team shall carry out patrol inspections earnestly and carefully,
correctly update operation logs, and properly prepare records and reports in a
timely manner.
5. No operations administrator may enter the data center barefooted, stripped to the
waist, wearing short sleeves, shorts or slippers, or under the effect of alcohol
intoxication, fatigue, or a serious illness. Employees must behave formally and
appropriately in the data center.
3.3.2.3 Regulations on the use of facilities and tools
1. The operations team shall use tools according to the regulations on the use of
tools: carefully complete the tool use registration form, use tools in a cautious
manner, and return tools in a timely manner.
2. When performing a welding and cutting operation, the operations team shall
ensure that appropriate fire prevention measures are in place, follow the working
instruction for welding operations, and wear safety goggles and other personal
protective equipment.
3. For any operation being performed 2 m above the ground or higher, the operator
must wear a safety belt. Safety belts shall be regularly checked, verified for
proper strength before use, and may not be extended without permission. A safety
belt shall be tied to a support located higher than the object to be operated on; it is
not permissible to tie the safety belt to a support located lower than the object to
be operated on.
4. To operate on live parts, the operator must wear insulating gloves. Electric
instruments such as test pencils and multimeters shall be regularly checked for
electric performance.
5. When operating a hand-held electrical device (for example, a sander, cutter,
or screwdriver), the operator shall wear protective goggles and ensure that the
device is equipped with leakage protection. Any damaged device shall be
repaired by a specialist and can only be re-commissioned for use after its proper
functioning has been verified.
6. When a ladder is used for an operation above the ground, the ladder shall be
checked for robustness to prevent the operator from falling off and getting
injured.
Chapter 4 Staffing and Staff Development
4.1 Organizational structure
As the data center is fundamental and central to the Company’s IT infrastructure, establishing
an appropriate organizational structure for and clearly defining the functional roles of the data
center is of great significance in driving and guiding its effective, efficient, and secure operations
and meeting the Company’s business goals.
An appropriate organizational structure design facilitates the streamlined workflow, close
cooperation between departments, clear definition of role and responsibilities, and employee
motivation, thereby sustaining efficient operations of the data center where all the employees are
assigned appropriate tasks and are aligned to make concerted efforts toward a common goal.
The Uptime Institute Tier Standard: Operational Sustainability sets forth different staffing
requirements for data centers of different classifications. As shown in Table 4.1-1, the Uptime
Institute standards categorize data centers into four classifications (Tier I to IV, from low to
high) and specify a greater number of staff, better skills, and higher staff presence
requirements for higher classifications. A Tier
IV data center is expected to sustain a very high level of availability throughout the year, and
hence, it requires 24-hour presence of a technical specialist to oversee its operations, so that any
problem can be resolved immediately, or redundancy is readily available to sustain its operations.
Data center operational sustainability requirements by classification Table 4.1-1
As a key infrastructure of Ping An Group, the data center plays a fundamental role in the
Group’s core business and disaster recovery. Its operations are configured according to Tier IV,
with 7 x 24 staff presence in three shifts (rotating among five teams). Each shift is staffed with an
experienced engineer (shift leader) who is capable of timely handling of emergencies, in addition
to a Monitoring Specialist and three Technicians (for electric equipment, HVAC, and electronic
systems, respectively). The Monitoring Specialist at the Guanlan data center site is responsible for
centralized monitoring of all the sites of Ping An Data Center. The operations team must be
staffed according to the breadth and depth of the operations. For any important position in the team,
a back-up person is assigned in case of unavailability of the primary person.
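The per-shift staffing described above (one shift leader, one Monitoring Specialist, and three Technicians) lends itself to a simple roster check. The sketch below is an assumption-laden illustration: the role labels, the `REQUIRED_PER_SHIFT` table, and the `staffing_gaps` helper are hypothetical, and the text does not specify how the five teams rotate through the three shifts, so no rotation is modeled.

```python
from collections import Counter
from typing import Dict, List

# Roles required on each shift, as described in the text (labels are illustrative).
REQUIRED_PER_SHIFT = {
    "engineer (shift leader)": 1,
    "monitoring specialist": 1,
    "technician (electric equipment)": 1,
    "technician (HVAC)": 1,
    "technician (electronic systems)": 1,
}


def staffing_gaps(roster: List[str]) -> Dict[str, int]:
    """Return, per role, how many people are missing from the shift roster."""
    present = Counter(roster)
    return {
        role: needed - present[role]
        for role, needed in REQUIRED_PER_SHIFT.items()
        if present[role] < needed
    }
```

An empty result means the shift meets the minimum described staffing; back-up assignments for important positions would need a separate check.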
The operations team of Ping An Data Center is organized by three functional blocks, with the
function of each block further defined to establish a perfect operations system, as shown in the
table below.
Organizational structure of the data center operations team Table 4.1-2
Data center operations team
1) Day-to-day operations management (IT management): network operations, server operations,
software application operations, data storage operations, and cloud platform operations
2) Infrastructure management: electric systems operations, HVAC operations, firefighting
operations, and system monitoring operations
3) Building security and housekeeping: Security Department, Housekeeping Department, and
Logistics Department
4.2 Roles and responsibilities
With the continuous development of the Internet and information industries, data centers with
high-availability service and uptime become increasingly important. Consequently, it becomes
more critical to ensure the secure operations of data centers and the operations management of
data centers becomes increasingly more complicated and poses greater technical challenges. It is
important to define precisely the roles and responsibilities of the data center operations team,
which mainly consists of the following positions: Data Center Manager, Infrastructure Operations
Team Leader, Infrastructure Engineer, Infrastructure Monitoring Specialist, and Infrastructure
Technician.
Data Center Manager
The Data Center Manager assumes the overall responsibility for the data center and is
specifically responsible for
1. overall planning of the data center (capacity, energy efficiency, availability, and business
sustainability) to satisfy business requirements;
2. translating business requirements into requirements for the data center;
3. all day-to-day operations management;
4. planning, implementation, and continuous improvement of the operations system of the
data center;
5. establishing and implementing operating plans for the data center;
6. effectively controlling the operating cost of the data center;
7. driving improvement of the service capability of the data center;
8. managing the data center team;
9. reporting, tracking, and handling major incidents.
Infrastructure Operations Team Leader
The Infrastructure Operations Team Leader reports to the Data Center Manager and is
specifically responsible for
1. planning the infrastructure required in the data center to satisfy business requirements;
2. establishing, implementing, and improving the data center’s infrastructure service and
protection plans;
3. operation, maintenance, and service as well as regular and irregular patrol inspection of
facilities and equipment and ensuring that operation specifications and equipment repair
and maintenance procedures are followed;
4. assessment of service providers of the data center and acceptance inspection of
constructions at the data center;
5. review of major changes to facilities and taking timely action to improve the data
center’s equipment capacity;
6. supporting the Project Department to prepare and deploy operations sustainability plans;
7. reporting, tracking, and handling major incidents affecting the infrastructure;
8. improving the energy efficiency and overall equipment efficiency of the data center.
Infrastructure Engineer
The Infrastructure Engineer reports to the Infrastructure Operations Team Leader and is
specifically responsible for
1. maintaining secure operation of the infrastructure when on shift; carrying out a
comprehensive patrol inspection of the data center site during each shift to ensure normal
operation of equipment and facilities;
2. people management, including
a) coordination and management of the Monitoring Specialist and Technicians; overseeing
their working discipline, quality of performed tasks, and progress in carrying out their job
responsibilities, supervising their work, and providing coordination when necessary;
b) reviewing and confirming by signature the change-of-shift reports and patrol inspection
records prepared by the Technicians on his shift;
c) service provider management, including managing the tasks performed by service
providers and reviewing and confirming by signature service provider reports generated on
his shift;
3. failure handling, including constantly watching the status shown on the monitoring
system as well as e-mail and text-message alerts; upon receiving an alert or failure notice,
locating and handling the failure in a timely manner; for a Level II or more severe failure,
reporting it to the team leader and Management Representative immediately and updating
the latest progress in failure handling in a timely manner;
4. designing the sequence of the changes as necessary according to relevant plans or work
demands, initiating change requests accordingly, and implementing the planned changes;
5. keeping track of the operating conditions as well as technical data and files of major
equipment under his charge and ensuring that the equipment is in good operating
conditions; planning changes to remedy equipment defects and satisfy improvement
requirements according to pre-established procedures and initiating change requests
accordingly;
6. fulfilling major documentation tasks in time, including timely updating of documents
according to relevant specifications;
7. taking initiative to perform or participate in temporary tasks, e.g., organization and
coordination of training and drills, follow-up of changes, and preparation of operations
plans;
8. organizing relevant persons to support the construction, on-site management, and final
acceptance of projects;
9. handing over and taking over shifts according to the change-of-shift procedure;
10. fulfilling other tasks assigned by line managers.
Infrastructure Monitoring Specialist
The Infrastructure Monitoring Specialist reports to the Infrastructure Engineer and is
specifically responsible for
1. monitoring the operation of the data center infrastructure 7 x 24 through the monitoring
system;
2. checking the operating status of the data center through the monitoring system one round
each hour, including the water chilling, precision air-conditioning, and high- and low-
voltage power supply and distribution systems, UPS, STS, precision power switchgears,
power switchgears for air conditioning, and surveillance videos;
3. in case of an infrastructure failure, performing preliminary root cause analysis and
notifying the Infrastructure Engineer (who will coordinate with the Electric Technician and
Air Conditioning Technician to remedy the failure), or calling a conference call for failure
remedy, recording and updating the failure remedy progress, and issuing failure alerts;
4. in case of failure emergency response, reporting the failure as alerted by the monitoring
system through e-mail and interphone;
5. summarizing Level VI and more severe infrastructure alerts and remedial measures by
shift;
6. checking surveillance videos during each night shift, reporting any issues identified to
the operations team, and reporting and following up on incidents;
7. actively participating in drills, training, team meetings, and other team activities
organized by the Company to improve job skills and professional competence;
8. fulfilling other tasks assigned by line managers.
Infrastructure Technician
The Infrastructure Technician reports to the Infrastructure Engineer and is specifically
responsible for:
1. managing the data center infrastructure (including power supply and distribution, air
conditioning, and firefighting systems and environmental sanitation); performing patrol
inspections according to the pre-established specification and frequency, and reporting any
equipment defect identified to the Infrastructure Engineer in a timely manner;
2. repair of the data center’s building structures and decorations (stairways, passageways,
walls, floors, ceilings, and roofs); regular check of the walls and roofs of the data center
buildings for water leakage and seepage and peeling-off; regular check of the lighting and
emergency lighting of the data center to ensure their normal operation;
3. optimizing the operation of the central air-conditioning unit (for the new computer room)
and precision air-conditioning equipment of the data center and improving their operating
efficiency and energy efficiency as instructed by the Infrastructure Engineer;
4. timely handling of failures in the power supply and distribution, firefighting, air-
conditioning, and water supply and drainage equipment to ensure their normal operation as
instructed by the Infrastructure Engineer;
5. supporting maintenance providers to maintain the power supply and distribution,
firefighting, air-conditioning, and water supply and drainage equipment and following up
on outstanding issues;
6. keeping the environment of the equipment under his charge clean, and safekeeping the
materials, keys, and tools issued to him for working his shift;
7. maintenance and repair of building decorations, office furniture, doors, windows, door
locks, floors, carpets, painting, lightings, and indicator lights of the data center;
8. making steel structures, floor-supporting structures, and floor holes; repairing floors in
the data center, and overseeing the operation performed by constructors on floors to ensure
that floor-supporting structures remain intact, underfloor spaces are free of foreign
materials such as cable ties and cable scraps, and floors are properly reinstalled after the
operation;
9. acquainting himself with the layout of the holes in the data center and overseeing that the
holes affected by construction operation are properly sealed and secured;
10. supporting the on-site management and final acceptance of new projects;
11. supporting the data center access control: no outsiders may enter the data center
without permission, and outsiders entering for construction or failure remedy may do so
only after the employees in charge of infrastructure change control have arrived at
the site;
12. overseeing construction works when on duty to ensure that on-site construction
materials are arranged in an orderly manner; maintaining control of data center access to
ensure that equipment hardware is not affected by persons entering and leaving the data
center;
13. safekeeping materials—such as data, tools, and spare parts—in the data center;
checking the inventory of the materials at each change of shift and recording the final
inventory and changes to the inventory during the shift in the shift logbook;
14. actively participating in drills, training, team meetings, and other team activities
organized by the Company to improve job skills and professional competence;
15. updating the shift logbook according to the change-of-shift procedure with detailed,
accurate, and complete description of events;
16. fulfilling other tasks assigned by line managers.
4.3 Staff training
The incumbent and new employees for the data center site infrastructure operations shall
complete comprehensive rigorous training to ensure that they are equipped with the knowledge
and skills necessary for performing their respective jobs, such that the data center operations team
is competent for its roles, the data center is operated securely in an orderly and standardized
manner, and operation risk caused by human factors is minimized. The training includes the
following five categories: general training, procurement training, professional skills training,
training on the data center’s systems and procedures, and occupational qualification certification
training.
4.3.1 New-employee training
A new employee shall complete a two-month-long pre-job training program starting from the
on-board date. The training covers the basic elements of the data center operations, such as
operational safety, rules and regulations, working procedures, equipment operation, equipment
maintenance, and equipment emergency. The pre-job training is clearly specified by job,
including the instructors for the training courses. A new employee must pass the assessment for
all the training courses, such that he is qualified for his job. Table 4.3-1 shows the training
schedule.
Table 4.3-1 New-employee pre-job training schedule
A separate assessment shall be given for each training course, and the assessment is designed
considering the importance of each course; every new employee must pass the assessment for
every course to be qualified for his job.
4.3.2 Training plan
4.3.2.1 Training plan for Engineer
The Engineer, a core technical and managerial position in the data center, assumes various
managerial and technical responsibilities in the data center. The training plan for this position,
which is based on the job description and performance targets pre-established for this position,
covers
all the management policies, processes, and systems of the data center;
the system configuration structure and operating plan of the data center;
the operation, maintenance, and emergency response of the power supply and
distribution equipment of the data center;
the operation, maintenance, and emergency response of the HVAC equipment of the data
center;
the operation, maintenance, and emergency response of the firefighting electronic
equipment of the data center.
4.3.2.2 Training plan for Technician
The Technician, a core position for on-site safeguarding of the data center, is responsible for
7 x 24 patrol, on-site control, and on-site emergency response in the data center. The training plan
for this position, which is based on the job description and performance targets pre-established for
this position, covers
all the management policies, processes, and systems of the data center;
the system configuration structure and operating plan of the data center;
the operation, maintenance, and emergency response of the power supply and
distribution equipment of the data center;
the operation, maintenance, and emergency response of the HVAC equipment of the data
center;
the operation, maintenance, and emergency response of the firefighting electronic
equipment of the data center.
4.3.2.3 Training plan for Monitoring Specialist
The Monitoring Specialist serves as the 7 x 24 alert service desk covering multiple sites of the
data center and is responsible for issuing alerts and notifications from the back office. The training
plan for this position, which is based on the job description and performance targets pre-
established for this position, covers
the operation of the centralized power and environment monitoring system of the data
center;
the operation of the security systems of the data center;
the system configuration structure and operating plan of the data center;
the operation of the ServiceBot system of the data center;
the incident management procedure of the data center.
4.3.3 Training procedure
4.3.3.1 Sign-in for training
A sign-in record shall be available for every training course and shall indicate who is required
to attend and who has attended the training course. An employee must complete and pass all the
required training courses. Otherwise, he will be disqualified from his job.
A person shall be specially assigned to verify that a sign-in record is properly completed for each
training course and to subsequently send it to be filed together with other records of the training
process.
4.3.3.2 Training assessment
At the end of a training course, the person in charge of the training shall conduct an
assessment of the training attendants. All the training attendants shall complete and pass the
training assessment. Otherwise, they will be disqualified. An attendant who fails the first instance
of assessment is allowed a second chance. A person who fails the second instance of assessment
shall be considered disqualified for his current job. The disposition of a disqualified employee
includes reassignment.
Training assessments may be conducted in the form of written examination, interview, and
operating skills assessment. Records shall be available for all assessments and shall be maintained
together with other training records by a specially assigned person.
4.3.3.3 Training review
At the end of a training course, the person in charge of the training shall conduct a review of
the implemented training. The review shall cover the reasonableness of the training plan,
completeness of the training materials, training effect, and outcome of training attendant
assessment.
The person-in-charge shall modify and improve the training curriculum according to the
outcome of the training assessment and implement the changes in the future curriculum.
The required training courses of a new employee shall be monitored with a tracking sheet,
which shall be updated by the training instructor in a timely manner. When the new employee has
completed and passed all the required training courses, the tracking sheet is used to document his
qualification for his job and is included in the centrally managed personnel file.
4.4 Staff development
As a data center site grows in scale and becomes more sophisticated in system structure, it is
more challenging to sustain its operations. A systematic training program that is comprehensive
and rich in content helps the operations team plan the data center’s operations and services more
effectively, reduce cost, improve operations processes, and render better support to business
processes, thereby improving the quality of the overall business operations.
4.4.1 Routine training
Employee routine training is planned periodically. Each December, the Engineer prepares
an annual training plan, which is subsequently approved by the leader-in-charge for
implementation. The training curriculum also covers management policies and includes courses
aimed to improve the professional competence of employees and facilitate their career
development.
Training and drilling courses:
(1) management policies of the data center;
(2) annual infrastructure security training;
(3) high- and low-voltage power distribution systems;
(4) air-conditioning systems;
(5) technical training on the firefighting systems;
(6) UPS systems;
(7) water supply and drainage systems of the data center;
(8) BA systems.
4.4.2 Special training
To support the sustainable development of the data center operations and to introduce necessary
changes, special training courses are offered as needed to cover special events, processes, or
technologies. Such courses may be offered to employees or vendors, and the training instructors
may be provided by vendors or equipment manufacturers.
If any occupational qualifications are required for an employee on the operations team, he
will need to attend relevant third-party or national occupational qualification training courses and
pass occupational skill testing.
4.5 Vendor management
Vendors play an important role in the data center operations, and hence, their service persons
shall acquaint themselves with the site work, management policies, and technical requirements of
the data center. The service support persons from vendors may enter the data center for service
delivery only after they complete and pass the required training courses. Among them, those who
have passed the training courses are included in the Master List of Qualified Service Persons.
Vendor training shall be conducted at least annually. The training covers the
relevant management systems, working processes, and technologies of the data center. The vendor
service persons who fail the training may not enter the data center for service delivery.
4.5.1 Vendor training
Vendor training aims to acquaint vendor service persons with relevant management policies,
working procedures, and service requirements of the data center, so that they can provide services
that support the operations of the data center in a secure and effective manner. At the end of
the training, attendants are assessed for their understanding of the vendor service person
qualification requirements of the data center, the dos and don'ts when working on-site, systems for
controlling materials and persons entering and leaving the data center, and vendor service
requirements.
A minimum score of 80 is required to pass the assessment.
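The admission rule in 4.5 and 4.5.1 can be sketched as follows. This is only an illustration: the names `PASS_SCORE`, `ServicePerson`, `build_master_list`, and `may_enter` are hypothetical, and the sketch assumes scores on a 0-100 scale, which the text implies but does not state.

```python
from dataclasses import dataclass

PASS_SCORE = 80  # minimum assessment score stated in 4.5.1


@dataclass
class ServicePerson:
    name: str
    score: int  # vendor training assessment score (assumed 0-100 scale)


def build_master_list(people):
    """Return the names of service persons who passed the assessment."""
    return {p.name for p in people if p.score >= PASS_SCORE}


def may_enter(name, master_list):
    """Only persons on the master list may enter for service delivery."""
    return name in master_list


# Illustrative data, not taken from the white book.
people = [ServicePerson("A", 92), ServicePerson("B", 79), ServicePerson("C", 80)]
master_list = build_master_list(people)
```

A score of exactly 80 is treated as a pass here because the text calls 80 the minimum score.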
4.5.2 Service level agreement (SLA)
4.5.2.1 Power supply and distribution and UPS systems
Service response and commitment:
The vendor shall respond within 30 minutes of acknowledging the receipt of a failure
notification (by email, telegraph, telex, or telephone) from the data center and shall work
immediately to remedy the failure to safeguard the normal operation of the systems.
Level I failure: Any power distribution equipment failure that results in the failure of two or
more equipment sets (servers, storage devices, and switches), e.g., tripping of the main switch or
output switch of a power management module (PMM) cabinet, STS output failure, and air-
conditioning switchgear failure. The vendor shall arrive at the scene within one hour and remedy
the problem within two hours.
Level II failure: Any power distribution equipment failure that results in the failure of a single
equipment set in the data center, e.g., failure of a single circuit of a PMM cabinet and failure of a
single air-conditioning switch. The vendor shall arrive at the scene within two hours and remedy
the problems within four hours.
Level III failure: Any power distribution equipment failure that has not resulted in any failure
in other equipment in the data center and has no impact on the availability of the data center, e.g.,
abnormal display on a PMM cabinet, and abnormal communication with a PMM cabinet or
electricity meter. The vendor shall arrive at the scene within six hours and remedy the problem
within 12 hours.
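The three failure levels above are distinguished mainly by how many equipment sets fail as a consequence of the fault. A minimal sketch of that classification, assuming the number of affected equipment sets is already known (the function name `classify_failure` is hypothetical):

```python
def classify_failure(affected_sets: int) -> str:
    """Map the number of failed equipment sets to a failure level.

    Level I:   two or more equipment sets have failed.
    Level II:  exactly one equipment set has failed.
    Level III: no other equipment has failed and availability is unaffected.
    """
    if affected_sets < 0:
        raise ValueError("affected_sets must not be negative")
    if affected_sets >= 2:
        return "I"
    if affected_sets == 1:
        return "II"
    return "III"
```

In practice the level is a judgment made by the operations team; the sketch covers only the counting rule stated in the text.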
4.5.2.2 Air-conditioning systems
Service response and commitments:
The vendor shall respond within 30 minutes of acknowledging the receipt of a failure
notification (by email, telegraph, telex, or telephone) from the data center.
Level I failure: Any precision air-conditioning equipment failure or any precision chilled-
water pipe fracture that results in the failure or failed cooling of two or more precision air-
conditioning equipment sets, e.g., failure in the power supply to the precision air-conditioning
systems, fractured belt or malfunctioning ventilation fan of precision air-conditioning units, tripping
of air-conditioning switches, fractured chilled-water pipe, malfunctioning compressor of air-
cooled air-conditioners, and coolant leakage of air-cooled air-conditioners. The vendor shall arrive
at the scene within one hour and remedy the problem within two hours.
Level II failure: Any precision air-conditioning equipment failure or any precision chilled-
water pipe fracture in the data center that results in the failure or failed cooling of a single
precision air-conditioning equipment set in the data center, e.g., fractured belt or malfunctioning
ventilation fan of a single precision air-conditioning unit, malfunctioning compressor of an air-
cooled air-conditioner, and coolant leakage of an air-cooled air-conditioner. The vendor shall
arrive at the scene within two hours and remedy the problem within four hours.
Level III failure: A partial malfunctioning of the precision air-conditioning systems in the
data center that has not resulted in the failure of any other equipment or unavailability of cooling
in the data center and has no impact on the availability of the data center, e.g., abnormal display
on the air-conditioning systems and a malfunctioning humidifier. The vendor shall arrive at the
scene within six hours and remedy the problem within 12 hours.
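The response, arrival, and remedy limits in 4.5.2.1 and 4.5.2.2 are the same for both system groups, so a single lookup table can be used to check a vendor's performance on one failure. The sketch below is illustrative only. It assumes that all three limits are measured from the moment the vendor acknowledges the failure notification; the text does not say whether the remedy time runs from notification or from arrival. The names `SLA_LIMITS`, `RESPONSE_LIMIT`, and `sla_met` are hypothetical.

```python
from datetime import datetime, timedelta

# (arrival limit, remedy limit) per failure level, as listed in 4.5.2
SLA_LIMITS = {
    "I": (timedelta(hours=1), timedelta(hours=2)),
    "II": (timedelta(hours=2), timedelta(hours=4)),
    "III": (timedelta(hours=6), timedelta(hours=12)),
}
RESPONSE_LIMIT = timedelta(minutes=30)


def sla_met(level, notified, responded, arrived, remedied):
    """Return which SLA commitments were met for one failure.

    Assumption: every limit is measured from `notified`.
    """
    arrival_limit, remedy_limit = SLA_LIMITS[level]
    return {
        "response": responded - notified <= RESPONSE_LIMIT,
        "arrival": arrived - notified <= arrival_limit,
        "remedy": remedied - notified <= remedy_limit,
    }


# Illustrative Level I failure: on-time response and arrival, late remedy.
t0 = datetime(2018, 5, 1, 10, 0)
result = sla_met(
    "I",
    notified=t0,
    responded=t0 + timedelta(minutes=20),
    arrived=t0 + timedelta(minutes=50),
    remedied=t0 + timedelta(hours=2, minutes=30),
)
```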
4.5.3 Vendor qualification
Service persons from vendors must have obtained relevant occupational qualification
certifications issued by national authorities. A service person who does not have the above
qualifications may not enter a data center site for service delivery.
Requirements for vendor communication interface:
A vendor shall designate a liaison person and a back-up person as its communication
interface with the data center.
The vendor's liaison person shall be readily available for communication and shall be able to
provide quick support in case of emergency. The support can be provided remotely by
telephone or, where necessary, through on-site service within the time frame set forth in the SLA.
The vendor shall maintain at least one qualified person for providing on-site emergency
support to the data center.
Working procedure:
A maintenance event or change to the infrastructure of Ping An Data Center is initiated in the
form of a work order. The work order for a maintenance event or change to be performed by a
vendor is initiated by an employee of the data center. A work order must be duly approved prior
to implementation.
In cases where a vendor intends to delay a maintenance event, a written application for the
delay shall be provided to the data center three days in advance. The vendor may not delay the
maintenance without prior permission from the data center. The maximum delay allowed is 10
days.
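The delay rule above has two numeric conditions: at least three days' notice and a delay of no more than 10 days. A minimal sketch follows, assuming that the notice period is counted in calendar days before the originally planned date (the text does not define this precisely). The function name and constants are hypothetical, and passing this check does not replace the data center's permission, which is still required.

```python
from datetime import date, timedelta

MIN_NOTICE = timedelta(days=3)   # application must be filed this far ahead
MAX_DELAY = timedelta(days=10)   # maximum delay allowed


def delay_request_acceptable(applied_on, planned, proposed):
    """Check the two numeric conditions of a maintenance delay request."""
    notice_ok = planned - applied_on >= MIN_NOTICE
    delay = proposed - planned
    delay_ok = timedelta(0) < delay <= MAX_DELAY
    return notice_ok and delay_ok
```

For example, a request filed on 1 May to move maintenance from 4 May to 14 May satisfies both conditions, while one filed on 2 May for the same dates does not.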
4.5.4 Vendor performance evaluation
An infrastructure maintenance provider shall submit a maintenance service summary report to
Ping An Data Center every six months. The report shall be well-formatted and true in its content.
The maintenance provider shall also review the maintenance service provided in the whole year
and submit an annual maintenance service report by the last working day before the termination
date of the contract. The service quality of a maintenance provider is evaluated against the
services defined in the contract, and payment for services will be made according to the outcome
of the evaluation.
Chapter 5 Best Practices of High-availability Operations
5.1 Routine check - Overview
Ping An Data Center is required to sustain a very high level of availability. To ensure the
stable and reliable operation of the IT facilities of the data center, the operations team must
monitor the infrastructure of the data center on a 24 × 7 basis. A small defect may lead to a major
failure. A data center infrastructure failure can always be traced back to some identifiable defect,
and hence, it is very important to conduct routine checks to detect and remedy operational defects
in a timely manner.
Two types of routine checks are implemented in the data center: on-site periodic check of the
infrastructure by infrastructure technicians and engineers; real-time monitoring of the power
supply and distribution, HVAC, firefighting, and security systems as well as the operating
environment of the data center by the infrastructure monitoring specialist through the monitoring
system of the data center. These two types of routine check complement each other to minimize
the major infrastructure failure occurrence rate and sustain the high availability of the IT facilities
of the data center.
5.1.1 Routine check - basic requirements
Smell: the odor of electrical discharge and burning odor of overheating insulators.
Listen: the sound of electric sparks and mechanical vibrations, abnormal sound caused by
abnormally high voltage or current, and mechanical vibrations caused by water pumps and
ventilation fans.
Feel: the temperature and vibration of non-live parts of equipment.
Look: electric sparking, discoloration, deformation, dislocation, damage, oil seepage, water
seepage, relay actions, electricity meter readings, indication of instruments and signal lights,
and leakage, seepage, and dripping of pipes and valves.
5.1.2 Routine check - frequency and methods
Medium- and low-voltage switchgears, UPS, precision power distribution systems, diesel
generation systems, HVAC systems, and firefighting systems are checked every four hours
through manual on-site patrol inspection. Any anomaly identified should be immediately
reported to the infrastructure engineer and logged on the ServiceBot working platform to
facilitate follow-up and remedial actions (inspection data can be recorded and transmitted
using a software application running on tablets).
Security systems: The video recording of designated cameras is checked every eight hours;
the real-time video capturing of all cameras is checked every 24 hours. In addition, the real-
time video capturing of cameras is monitored through the data center’s monitoring system or
online video surveillance system, and the storage condition of videos is monitored through
the online video surveillance system. Any anomaly identified should be immediately reported
to the engineer and logged on the ServiceBot working platform to facilitate follow-up.
The electronic monitoring system is checked every two hours. The operating conditions of the
data center (including environment systems, power distribution systems, and security systems)
are monitored using the Data Center Surveillance Application. Any anomaly identified should
be immediately reported to the engineer and logged on the ServiceBot working platform to
facilitate follow-up.
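The check intervals stated above (4, 8, 24, and 2 hours) translate directly into a daily schedule. The sketch below is illustrative and assumes that every check cycle starts at the beginning of the day; the actual start times and shift boundaries are not given in the text. The names `CHECK_INTERVALS_HOURS`, `due_times`, and `daily_schedule` are hypothetical.

```python
from datetime import datetime, timedelta

# Check intervals in hours, as stated in 5.1.2
CHECK_INTERVALS_HOURS = {
    "on-site patrol inspection": 4,
    "video recording of designated cameras": 8,
    "real-time video of all cameras": 24,
    "electronic monitoring system": 2,
}


def due_times(start, window_hours, interval_hours):
    """Return the check times inside [start, start + window_hours)."""
    return [start + timedelta(hours=h) for h in range(0, window_hours, interval_hours)]


def daily_schedule(start):
    """Build a 24-hour schedule for every check type."""
    return {
        name: due_times(start, 24, interval)
        for name, interval in CHECK_INTERVALS_HOURS.items()
    }


schedule = daily_schedule(datetime(2018, 5, 1, 0, 0))
```

With these intervals, one day contains 6 patrol inspections, 3 recording checks, 1 full real-time video check, and 12 monitoring system checks.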
5.1.3 Routine check of medium- and low-voltage switchgears
1. Look: check medium- and low-voltage switchgear panels for abnormal display of
indicator lights and meters as well as warning lights; check the open/close status of medium- and
low-voltage switchgear circuit breakers against the required status for data center power
distribution.
2. Listen: check medium- and low-voltage switchgears for abnormal sound caused by partial
electrical discharge and abnormal vibration.
3. Smell: check medium- and low-voltage switchgears for odor of electrical discharge and
burning odor of overheating insulators.
4. Feel: check the non-live parts of medium- and low-voltage switchgears for abnormal
temperature and vibration.
5. Record the voltage and current values at medium-voltage incoming line switches, the
current values at feeder switches of medium-voltage transformers, and the voltage and current
values of incoming line main switches.
6. Input the above values into the mobile inspection app installed on the tablet. If an input
value is outside the preset limits, the app page will turn red, indicating an anomaly. Where
necessary, take photographs of any anomaly identified during the inspection, and upload them
onto the mobile inspection app, which is synchronized with the ServiceBot platform of the data
center, where a work order will be generated and processed to address the anomaly. The work
order will be closed when the anomaly is remedied and the remedy is verified.
7. The mobile inspection app can record the route and time of inspections, such that the
frequency and quality of routine check can be monitored.
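The out-of-limit check described in step 6 can be sketched as follows. This is not the actual inspection app or the ServiceBot interface, neither of which is documented here. The parameter names and limit values are invented for illustration, and the work order is represented as a plain dictionary.

```python
from dataclasses import dataclass


@dataclass
class Limit:
    low: float
    high: float


# Hypothetical preset limits; the real values are not given in the text.
LIMITS = {
    "incoming_voltage_kv": Limit(9.5, 10.5),
    "incoming_current_a": Limit(0.0, 400.0),
}


def out_of_limit(readings):
    """Return the names of readings that fall outside their preset limits."""
    flagged = []
    for name, value in readings.items():
        limit = LIMITS[name]
        if not (limit.low <= value <= limit.high):
            flagged.append(name)
    return flagged


def open_work_orders(readings):
    """Create one open work order record per out-of-limit reading."""
    return [
        {"item": name, "value": readings[name], "status": "open"}
        for name in out_of_limit(readings)
    ]


readings = {"incoming_voltage_kv": 10.8, "incoming_current_a": 120.0}
orders = open_work_orders(readings)
```

In the process described above, a work order stays open until the anomaly is remedied and the remedy is verified; the sketch models only the creation step.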
5.1.4 Routine check of uninterruptible power supplies (UPS)
1. Check if AC power input, bypass input, and power output switches are properly closed and
if indicator lights work normally; check circuit breaker protection units for warning indications.
2. Check UPS panels for warning messages and buzzer alarms.
3. Check for abnormal indication of the indicator lights on UPS panels, abnormal readings of
operating parameters, and new warning messages in the history record.
4. Check for abnormal operating sound or vibration; check electrical parts for burning odor.
5. Check the operating conditions of the fans installed on the housing; check if any filtering
screens are blocked.
6. Check the temperature and humidity of the UPS room and battery room for any out-of-
limit readings.
7. Check batteries for abnormal conditions (dirt, deformation, swelling, and liquid/acid
leakage); check the battery room for abnormal odor and sound.
8. Check battery packs for overheating connections and oxidized bolts.
9. Check the tools in the UPS room and battery room for missing/damaged items, integrity of
operating tips, and inappropriate marking and labeling.
5.1.5 Routine check of precision power distribution systems
1. Check the indicator lights on switchgear panels for flashing alarm lights; check
switchgears for abnormal sound and odor.
2. Check and record the readings of the electric parameters on switchgear panels; check if the
dual mains supply is properly indicated.
3. Check precision switchgear panels for warning messages and buzzer alarms.
4. Check the isolating transformers inside switchgears for abnormal vibration, overheating,
and burning odor.
5. Check radiator fans inside switchgears for abnormal operating conditions.
6. Check for missing or inappropriate operating tips, marking, and labeling.
5.1.6 Routine check of diesel generation systems
Routine check of diesel generator units
1. Check diesel generator local control panels for alerts; check if control mode selection
switches are switched to the “Remote” position.
2. Check the operating condition of output switchboards, compound switchboards, grounding
resistance cabinets, and dehumidifier-heaters.
3. Check component surfaces and piping connections for traces of oil and water leakage;
check the floor for water and oil stains; check for bite marks and other traces indicating the
presence of rats or other vermin.
4. Check the water level of cooling-water tanks; check the operating condition of the
cooling-water heaters.
5. Check engine oil level; check oil–water separators for water content at the bottom and
discharge the water from the bottom if necessary.
6. Check the charging voltage and current of charging panels; check batteries and start relays
for terminal oxidization and corrosion; check the charging of the emergency battery packs.
7. Check the oil level of daily oil tanks; check for oil seepage.
Routine inspection of the diesel generation low-voltage power distribution room
1. Check the display of compound switchboards; check if selection switches are switched to
the “automatic” position; check for warning indications and buzzer alarms.
2. Check the indicator lights on oil supply switchboard panels; check if selection switches are
switched to the pre-defined position (“Manual” for standby of the generator units and “Automatic”
for loaded operation); check the oil level of tanks (lower limit: 500 mm; upper limit: 900 mm).
3. Check direct current cabinets for abnormal parametric readings and alerts; check central
signal cabinets for alerts; check the heat radiation of power module cabinets.
4. Check the operating condition of power and lighting switchboards; check the lighting in
computer rooms.
Routine check of diesel generation high-voltage power distribution room
1. Check if switches are switched to the appropriate positions for diesel generators to
maintain their hot standby status.
2. Check the indicator lights of instrument REF615 on switchboards for alerts.
3. Check the operating condition of switchboard electrical heaters.
4. Check if dummy load controllers display any warning signals.
5. Check if the protection equipment and tools for high-voltage operation are stored in the
right place.
Routine check of diesel supply systems
1. Check and record the reading of the magnetic level meter on the outdoor oil tank
(specification: 200–1800 mm).
2. Check the oil tank valve well for ponding, settlement, and deformation.
3. Screen the oil tank area for fire hazards; check if proper lightning protection and
grounding measures are in place.
4. Check if emergency oil pumps and pipes are properly stored.
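The two oil level ranges quoted in this section (500 to 900 mm for the tanks checked from the low-voltage power distribution room, 200 to 1800 mm for the outdoor oil tank) can be checked with one small helper. The sketch assumes the limits are inclusive; the tank labels and the function name `level_status` are hypothetical.

```python
# Permitted level ranges in millimetres, taken from 5.1.6
LEVEL_RANGES_MM = {
    "oil supply tank": (500, 900),
    "outdoor oil tank": (200, 1800),
}


def level_status(tank, level_mm):
    """Classify a level reading as 'low', 'ok', or 'high' (inclusive limits)."""
    low, high = LEVEL_RANGES_MM[tank]
    if level_mm < low:
        return "low"
    if level_mm > high:
        return "high"
    return "ok"
```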
5.1.7 Routine check of heating, ventilation, and air conditioning (HVAC) systems
Routine check of precision air-conditioning units
1. Check the parametric readings and warning messages displayed on control panels.
2. Check if generators produce abnormal vibration and sound during operation.
Routine check of centrifugal chilled-water units
1. Check the parametric readings, alerts, and alarms displayed on the main unit control
panels.
2. Listen carefully to the operating sound of the main units.
3. Check the units for water and oil leakage.
4. Check the oil level of the main units (the reading of the level gage should be 1/3 at the
minimum when the main unit is shut down).
5. Check the refrigerant piping through sight glasses (the normal color observed through a
sight glass is green).
6. Check the differences between inlet and outlet water pressures of the chilled water and
cooling water piping of the main units (500 kPa at the minimum).
7. Check and record the percentage of operating current of the main units.
Routine check of circulating pumps and control cabinets for chilled-water units
1. Check and record the operating current of starter boxes.
2. Check the heat radiation of starter boxes for overheating/burning odor.
Routine check of cooling tower
1. Check the water level of the cold-water tray; check for scaling and deposit in the tray.
2. Check the operating condition of cooling tower fans.
3. Check the circulating water quality.
5.1.8 Routine check of firefighting systems
1. Check for fire alarm, fault, shielding, and monitoring messages displayed on the fire
alarm/gas extinguisher system control panel as well as buzzer alarms.
2. Check if “system normal” is displayed on the panel of the integrated firefighting/alarm
control cabinet; check the operating condition of the indicator lights on the control panel, fire
emergency telephone, fire emergency broadcasting, audio input, amplifier, and printer.
3. Check the manual control panel of the gas extinguisher system for warning lights and
buzzer alarms (the “manual” indicator light is on under normal conditions). Check the indicator
lights of manual/automatic gas extinguishing switches (the “Manual” indicator light is on under
normal conditions).
4. Check the indicator lights on the air sampler control panel (the power indicator light
should be normally on); check if any of the fault indicator lights is on.
5. Check if the pressure readings of IG541 gas cylinders (in the gas cylinder room) fall in the
green area; check the gas cylinder head valves and zone selection valves in the gas cylinder room;
check the magnetic valve control box in the gas cylinder room.
6. Check if the pressure readings of the heptafluoropropane fire extinguishing cylinders in
the diesel power distribution room fall in the green area.
7. Check if the pressure readings of fire extinguishers in the various areas fall in the green
area.
8. Check if the selection switches of the power control boxes for smoke exhaust fans, fire
pumps, sprinkler pumps, and jockey pumps are switched to the “Automatic” position.
5.1.9 Routine check of security systems
1. A checklist of all the cameras, ordered by their physical wiring, is prepared for checking by shift
(three shifts rotated in a cycle of one week). In each shift, the latest three days of video recordings of
a certain number of cameras are checked through the online video surveillance system, such that all
the storage devices, video encoders, and cameras are covered, video footage is confirmed to be
properly stored, and any anomaly can be quickly identified.
2. In the night shift, the real-time camera videos are also checked (including camera
identification description, system time, angle, and image resolution) through the data center’s
monitoring system or online video surveillance system.
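With three shifts per day rotated over one week, a cycle contains 21 shifts, and each camera must be assigned to one of them so that the whole checklist is covered once per cycle. The sketch below distributes a wiring-ordered camera list round-robin across the shifts. The round-robin choice and the names `assign_cameras`, `SHIFTS_PER_DAY`, and `DAYS_PER_CYCLE` are assumptions; the text does not say how cameras are grouped per shift.

```python
SHIFTS_PER_DAY = 3
DAYS_PER_CYCLE = 7


def assign_cameras(cameras):
    """Spread cameras over all shifts of one weekly cycle (round-robin)."""
    slots = SHIFTS_PER_DAY * DAYS_PER_CYCLE
    plan = [[] for _ in range(slots)]
    for index, camera in enumerate(cameras):
        plan[index % slots].append(camera)
    return plan


# Illustrative camera identifiers, not taken from the white book.
cameras = [f"CAM-{i:03d}" for i in range(50)]
plan = assign_cameras(cameras)
```

Round-robin keeps the per-shift workload within one camera of even and spreads neighbouring cameras in the wiring order across different shifts.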
5.1.10 Routine check of electronic monitoring systems
Table 5.1-1 Data center infrastructure monitoring checklist
System/equipment | Check items
Air-conditioning systems | Ambient temperature and humidity, inlet air temperature, return air temperature, and warnings
Power supply and distribution systems | Voltage, current, power factor, active power, and reactive power
Generators | Startup and shutdown conditions, current, voltage, load factor, and power supply to control systems
UPS systems | Input voltage and current, output voltage and current, frequency, power factor, load factor, temperature, and warnings
Firefighting systems | Alarms
Security and electronic monitoring systems | Operating conditions of door access systems, alarms, surveillance videos, and visitor record
1. Check if any devices are shielded on the “Security Period” page of the Data Center
Surveillance Application.
2. Check if any devices are disconnected for communication on the “Devices” page of the
Data Center Surveillance Application.
3. Check if there are any red alarms displayed on the Data Center Surveillance Application.
The operating status of devices is indicated on the monitoring system of the data center by color,
with blue or green color indicating normal operation, red indicating abnormal operation or alarm,
and grey indicating disconnected communication.
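The color convention above maps directly to a small lookup. The sketch below is illustrative; the device names are invented, and it does not reflect the actual data model of the Data Center Surveillance Application.

```python
from collections import Counter

# Color-to-status mapping as described in 5.1.10
COLOR_STATUS = {
    "blue": "normal",
    "green": "normal",
    "red": "alarm",
    "grey": "disconnected",
}


def summarize(device_colors):
    """Count devices per status."""
    return Counter(COLOR_STATUS[color] for color in device_colors.values())


def needs_attention(device_colors):
    """Return devices that are in alarm or have lost communication."""
    return sorted(
        device
        for device, color in device_colors.items()
        if COLOR_STATUS[color] != "normal"
    )


# Hypothetical device identifiers.
devices = {"UPS-1": "green", "CRAC-3": "red", "PMM-2": "grey", "GEN-1": "blue"}
```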
5.2 Preventive maintenance - overview
Preventive maintenance is planned for extending the service life and reducing the failure rate
of equipment. It aims to identify defects of equipment before they develop into major failures
through regular check and service.
Ping An Data Center has established annual, quarterly, and monthly preventive maintenance plans based
on equipment operating conditions and the recommendations of equipment suppliers. The
maintenance personnel are required to follow the maintenance process and carry out maintenance
activities in a timely manner according to the systematic characteristics of equipment. The records
and reports generated from maintenance activities should be objective, practical, and properly
filed. The operations team should regularly compile statistics on equipment operating conditions
and perform quantitative trend analysis. For any abnormal trend identified, the team issues a warning
and proposes and implements reactive and corrective actions to minimize the possibility of major
equipment failure.
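One simple form of quantitative trend analysis is a least-squares slope over recent readings, with a warning when the slope exceeds a chosen threshold. The text does not specify which method Ping An Data Center uses, so the sketch below is only an example; the sample data and the threshold are invented.

```python
def slope(values):
    """Least-squares slope of the values against their index (units per sample)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two readings are required")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def abnormal_trend(values, max_slope):
    """Flag a series whose upward slope exceeds the permitted maximum."""
    return slope(values) > max_slope


# Hypothetical daily temperature readings in degrees Celsius.
temps = [30.0, 30.5, 31.0, 31.5, 32.0]
```

Here the readings rise by 0.5 degrees per day, so a threshold of 0.2 degrees per day would trigger a warning.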
5.2.1 Preventive maintenance - general requirements
Ping An Data Center has established detailed maintenance operation procedures (MOPs) for
all infrastructure maintenance activities, including step-by-step description as well as the
person-in-charge and schedule for each maintenance activity. Equipment standard operation
procedures (SOPs) should be followed during maintenance. This is to ensure the smooth
completion of maintenance activities and avoid wrong equipment operation that may result in
major equipment failure or personal injury. For example, the switching of medium-voltage
switches, manual startup of generator units, and switching of a UPS to its bypass circuit must
follow the respective SOPs.
The annual preventive maintenance plan of the data center must be followed, and the target
completion rate for annual preventive maintenance is set at 95%.
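The 95% target can be checked with a simple ratio. The sketch assumes the completion rate is the number of completed maintenance tasks divided by the number of planned tasks, which the text does not define explicitly. The function names are hypothetical, and integer arithmetic is used for the comparison to avoid rounding effects.

```python
TARGET_PERCENT = 95  # annual preventive maintenance completion target


def completion_rate(planned, completed):
    """Completion rate as a percentage of planned tasks."""
    if planned <= 0:
        raise ValueError("planned must be positive")
    return 100 * completed / planned


def target_met(planned, completed):
    """Compare against the target using integers only."""
    if planned <= 0:
        raise ValueError("planned must be positive")
    return completed * 100 >= TARGET_PERCENT * planned
```

For example, completing 114 of 120 planned tasks meets the target, while 113 of 120 does not.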
5.2.2 Checklists for preventive inspection, maintenance, and operation (including but not
limited to the systems and equipment listed below)
Table 5.2-1 Data center infrastructure preventive inspection checklist
System/equipment | Functional check | Vulnerability check
Power supply and distribution systems | Power frequency voltage withstand test of circuit breakers, main circuit insulation resistance test of circuit breakers, transmission test and interlock test of switchgears, checking the primary and secondary circuits of switchgears, cleaning dust inside switchgears, checking if holes are properly plugged and sealed, and insulation, voltage withstand, and grounding tests of mains cables and transformers | Power rating test of circuit breakers, partial discharge test of switchgears, test of capacitors, checking lightning protection devices, checking cables and components for overheating
Generators | Checking operating parameters, checking the generator units for vibration and overheating | Checking startup batteries, oil level, cooling liquid level, and air suction and smoke exhaust channels
UPS systems | Checking components for overheating, checking batteries (appearance, liquid level, and wiring terminals) | Checking components and cables for overheating, checking the discharging time of batteries
Air-conditioning systems | High and low pressures (air cooling system), chilled-water pressure and temperature, cooling-water pressure and temperature (water cooling system), operating conditions of fans, dust | Hot spots in computer rooms, checking indoor units for water leakage, checking the operating conditions of outdoor fans, checking filtering screens
Firefighting systems | Pressures and expiration dates of air cylinders, checking sensors for contamination | Pilot cylinders, pipe switches, and air pressures
Security systems | Sensitivity of components, image sharpness (at different levels of illumination) | Sensitivity of components, monitoring blind angle
Table 5.2-2 Data center infrastructure preventive maintenance and operation checklist
System/equipment | Basic maintenance | Testing | Data operation
Power supply and distribution systems | Switching operations | Spare power automatic switching test, spare power automatic interlocking test | Backup of the logs of circuit breaker protection units
Generators | Replacing filtering devices, cleaning generator body | No-load test, loaded test, and switchover test | Backup of operating log, backup/deletion of alarm record
UPS systems | Cleaning the bypass circuit and the inside of the housing | Bypass test, battery discharge test | Backup of operating log, backup/deletion of alarm record
Air-conditioning systems | Startup and shutdown, cleaning/replacing filtering screen, cleaning/replacing humidifier system, cleaning condensers | Water leakage alarm test | Backup of operating log, backup/deletion of alarm record
Firefighting systems | Cleaning sensors | Startup test, testing sensors | Backup/deletion of alarm record
Security systems | Door access authorization | Sensitivity of components, image resolution (at different levels of illumination) | Export and backup of door access record, backup/deletion of surveillance videos, backup/deletion of alarm record
5.2.3 Preventive maintenance - detailed schedules for key systems
Preventive maintenance of medium- and low-voltage switchgears - general requirements
1. The preventive maintenance of medium- and low-voltage switchgears includes general
live-line check (semi-annual), spare power automatic switching logic test (annual), and switchgear
test and maintenance (every three years).
2. The above maintenance activities are carried out by engineers of switchgear manufacturers
according to the MOP and SOP of the data center.
3. As the most important means of maintaining the power supply and distribution systems of
the data center, preventive maintenance is intended to identify and remove safety hazards in
the operation of switchgears in a timely manner, extend equipment service life, and improve
system availability.
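The three maintenance intervals in item 1 above (semi-annual, annual, every three years) lend themselves to a simple due-date calculation. The sketch below is illustrative only: the dictionary keys, function names, and the day-of-month clamping rule are assumptions, not part of the data center's MOP or SOP.

```python
from datetime import date

# Intervals in months, taken from item 1 above.
INTERVAL_MONTHS = {
    "general live-line check": 6,
    "spare power automatic switching logic test": 12,
    "switchgear test and maintenance": 36,
}


def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole months.

    The day is clamped to 28 so the result is valid in every month
    (a simplifying assumption for this sketch).
    """
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    return date(year, month_index + 1, min(d.day, 28))


def next_due(activity: str, last_done: date) -> date:
    """Return the next due date for a maintenance activity."""
    return add_months(last_done, INTERVAL_MONTHS[activity])


if __name__ == "__main__":
    # Hypothetical last-completion date.
    for name in INTERVAL_MONTHS:
        print(name, next_due(name, date(2018, 5, 10)))
```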
Preventive maintenance checklist for medium-voltage switchgears
Table 5.2-3 General live-line check
Category | Maintenance item
Operating environment of power distribution room | Check and record the temperature and humidity of the power distribution room; check if the room is properly ventilated, cable ducts are properly sealed, and appropriate measures have been taken against vermin; check protection and operation tools
Load of switchgears | Record voltage and current values
Temperature of switchgears | Record the temperatures of the low-voltage chamber, rear panel, and front panel of switchgear
Condition of switchgear | Check the condition of the display panel of the protection unit, indicator lights (electrical heating, closing/opening of switches, energy storage, grounding switch, and high-voltage presence), relay plate, and low-voltage chamber lighting
Table 5.2-4 Spare power automatic switching logic test
Category | Test description
Automatic switching between mains lines | Disconnect one mains line and test the logic of automatic switching between the two medium-voltage mains lines (connected to one busbar)
Automatic switching between mains power and diesel generation power | Disconnect both mains lines and test the logic of automatic switching between mains power and diesel generation power
Table 5.2-5 Switchgear testing checklist
Category | Subcategory | Test items
Housing | Earthing of the housing | Test the integrity and resistance of the main earthing circuit
Housing | Switchgear main circuit | Test the resistance and voltage withstand (destructive test, not recommended unless necessary) of the main circuit
Housing | Lightning protection devices | Check and test lightning protection and monitoring devices
Housing | Current transformer | Calibrate polarity, transformation ratio, and excitation characteristic curve
Housing | Potential transformer | Test transformation ratio and no-load current
Housing | Protection relay | Rating test, protection and signaling function test
Housing | The five error-proof functions of the interlock mechanisms | Calibrate the mechanical and electrical interlock mechanisms
Housing | Low-voltage chamber secondary circuit insulation | Sensitivity of components, image sharpness (at different levels of illumination)
Circuit breaker | Main circuit | Test the main circuit resistance
Circuit breaker | Opening/closing coils | Test DC resistance and low-voltage operations
Circuit breaker | Maintenance of operating mechanisms | Adjustment, repair, lubrication, and other in-depth maintenance items (special-purpose solvent and lubrication grease); replacement of quick-wear parts
Circuit breaker | Insulation of the control component of circuit breaker | Test the insulation resistance of secondary components (opening/closing coil, auxiliary contact, relay, and energy-storage motor)
Circuit breaker | Integrity of vacuum interrupter | Voltage-withstand test (destructive test, not recommended unless necessary)
Special-purpose diagnostics and tests | Partial discharge test | Switchgear partial discharge test
Special-purpose diagnostics and tests | Operating behaviors of fuses | Preventive failure diagnostics and testing of fuses
Special-purpose diagnostics and tests | Mechanical behaviors of circuit breakers | Test mechanical behaviors of circuit breakers
Table 5.2-6 Switchgear maintenance checklist
Category | Sub-category | Maintenance
Busbar chamber | Cleaning | Clean main circuit and insulation parts with anhydrous alcohol
Busbar chamber | Bolt tightening torque calibration | Tighten busbar bolts with a torque of 70 N·m (the bolts should not move)
Busbar chamber | Maintenance of insulation parts | Check insulation plate, moving and fixed contacts box between main line, busbar, and housing for damage, electrical discharge, and flashover, and clean them with anhydrous alcohol
Cable chamber | Cleaning | Clean main circuit, insulation parts, cable heads, and transformers with anhydrous alcohol
Cable chamber | Bolt tightening torque calibration | Tighten cable bolts with the specified torque (the nuts should not be removed)
Cable chamber | Maintenance of insulation parts | Check wall bushings, insulation plates, cable heads, and transformers for damage, electrical discharge, and flashover, and clean them with anhydrous alcohol
Cable chamber | Maintenance of earthing switches | Check if earthing knife-switches operate normally; check the operation and position indication of interlock couplers; check if auxiliary contact switches operate normally; clean and lubricate contacts
Cable chamber | Maintenance of sealings | Check the ingress protection of the cable chamber against vermin and water vapor; improve the sealings where necessary
Trolley chamber | Cleaning | Clean contact boxes and curtain doors with anhydrous alcohol
Trolley chamber | Bolt tightening torque calibration | Check if fixed contacts are properly tightened; check the integrity of curtain door mechanism bolts and jump rings
Trolley chamber | Maintenance of insulation parts | Check contact boxes for damage, electrical discharge, and flashover, and clean them with anhydrous alcohol
Trolley chamber | Lubrication | Clean curtain door mechanisms and earthing trolley rails of fixed contact boxes with anhydrous alcohol
Low-voltage chamber | Functionality of secondary components | Secondary components should be functionally reliable and free of loose connection, electrical discharge, and ablation
Low-voltage chamber | Security of terminal wiring | Tighten terminal wiring; check terminals for ablation and loose connection
Circuit breaker | Maintenance of operating mechanisms | Check the inside of operating mechanisms for missing or damaged parts; clean and lubricate them where necessary
Circuit breaker | Secondary circuit | Check opening/closing coils, energy-storage motors, relays, and sensitive switches
Trolley chamber | Signal plates | Adjust or replace signal plates
Trolley chamber | Mechanical interlock mechanisms | Lubricate and check mechanical interlock mechanisms; check if they function reliably
Trolley chamber | Contacts and contact arms | Clean contact arms; clean, lubricate, and tighten moving contacts
Preventive maintenance checklist for low-voltage switchgears
1. The preventive maintenance of low-voltage switchgears includes general live-line check
(semi-annual), spare power automatic switching logic test (annual), and switchgear test and
maintenance (every three years).
2. The above maintenance activities are carried out by engineers of switchgear manufacturers
according to the MOP and SOP of the data center.
Table 5.2-7 General live-line check
Category | Maintenance item
Operating environment of power distribution room | Check and record the temperature and humidity of the power distribution room; check if the room is properly ventilated, cable ducts are properly sealed, and appropriate measures have been taken against vermin; check protection and operation tools
Load of switchgears | Record voltage and current values
Temperature of switchgears | Record the temperatures of the low-voltage chamber, rear panel, and front panel of switchgear
Condition of switchgear | Check the condition of the display panel of the protection unit, indicator lights (closing/opening of switches and energy storage)
Table 5.2-8 Spare power automatic switching logic test
Category | Test description
Spare power automatic switching | Disconnect one feeder line of the transformer and test the logic of automatic switching between the two low-voltage mains lines (connected to one busbar)
Table 5.2-9 Switchgear testing checklist
Category | Sub-category | Test description
Housing | General check | No paint peeling-off, no housing deformation, legible labeling on instrument dials, no abnormal condition inside the housing
Housing | Insulation resistance of main busbar and control circuit | Test with 500 VDC or 1000 VDC insulation resistance tester; minimum 1000 MΩ insulation resistance. Test to be conducted via the grounding method and secondary control function to be considered. Break grounding connections for the test
Housing | Grounding connections | Check the reliability of the system, cabinet, and board grounding connections against the specific grounding system requirements; grounding connection of output cables; the equal-potential grounding of cabinet doors
Housing | Busbar and cable connections | Check cable and busbar connections for overheating (using an infrared thermometer or imager); check major connections using a torque wrench against preset torque value
Housing | Mechanical function of drawer circuit | Check the indication of the drawer circuit; check if it can be pushed in and pulled out normally
Circuit breaker | General check | Check appearance (no overheating-caused contact oxidization, no traces of flashover outside the arc-extinguishing chamber, integrity of front panel, framework deformation, integrity of secondary terminals, legibility of secondary line labeling)
Circuit breaker | Phase-phase insulation and insulation between upper and lower ports | Test with 500 VDC insulation resistance tester (minimum 1000 MΩ insulation resistance required)
Circuit breaker | Contact wear (air circuit breaker) | Open the arc-extinguishing chamber cover and check the wear of phase contacts
Circuit breaker | Trip force (air circuit breaker) | Test the trip force of air circuit breaker actuators using a special-purpose tester
Circuit breaker | Mechanical operation | Test the following operations: rocking in and out, manual energy storage, and manual closing/opening; check the snap-in force of framework clamps
Circuit breaker | Interlock function | Check mechanical and electrical interlock function
Circuit breaker | Mechanical behaviors (air circuit breaker) | Test the current curve, energy storage speed, three-phase synchronization, contact resistance, bouncing, and over travel using a Prodia mechanical characteristics tester
Circuit breaker | Operating characteristics of protection units | Test the functionality of protection units and conduct selective analysis using a Proselect protection unit tester
Compensation capacitor | General check | Check capacitors for swelling and deformation; check connection cables for discoloring; check the appearance of contactors and series reactors; check if ventilation holes are plugged; check for dust deposited on dust screens
Compensation capacitor | Main incoming line harmonic (loaded) | Test total harmonic distortion rate and specific harmonic content using a power quality analyzer
Compensation capacitor | Controller configuration and alarm record | Check its measurement display, parameter setting, and alarm record
Compensation capacitor | Phase current of capacitor (live-line) | Test with clip-on ammeter while switching it on manually
Compensation capacitor | Operating condition of contactors during stepped switching | Observe contactor’s vibration and noise while it is being switched on and off
Compensation capacitor | Panel display during stepped switching | Observe the varying display of power factor, current, and step number during manual switching
Compensation capacitor | Startup of fans | Check functionality of fans while they are being manually switched on and off
Compensation capacitor | Temperature alarm devices | Test their operating condition manually
Compensation capacitor | Capacitance of capacitors (power off) | Test phase-phase capacitance of capacitors using a capacitance meter (the measurement should be higher than 90% of the theoretical value)
Compensation capacitor | Contactor circuit resistance | Test the contact resistance of each contactor by phase (power off, manual switching)
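Two of the acceptance criteria in Table 5.2-9 are numeric: a minimum insulation resistance of 1000 MΩ and a measured capacitance higher than 90% of the theoretical value. A minimal sketch of how such readings might be evaluated is shown below; the function names and sample readings are hypothetical, and only the two thresholds come from the table.

```python
# Thresholds taken from Table 5.2-9.
MIN_INSULATION_MOHM = 1000.0   # minimum insulation resistance, in megaohms
MIN_CAPACITANCE_RATIO = 0.90   # measured / theoretical must be higher than this


def insulation_ok(measured_mohm: float) -> bool:
    """True when the insulation resistance meets the 1000 MΩ minimum."""
    return measured_mohm >= MIN_INSULATION_MOHM


def capacitance_ok(measured: float, theoretical: float) -> bool:
    """True when the measured capacitance is higher than 90% of the theoretical value.

    Both arguments must use the same unit (for example microfarads).
    """
    if theoretical <= 0:
        raise ValueError("theoretical capacitance must be positive")
    return measured / theoretical > MIN_CAPACITANCE_RATIO


if __name__ == "__main__":
    # Hypothetical readings.
    print(insulation_ok(1500.0), capacitance_ok(95.0, 100.0))
```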
Table 5.2-10 Switchgear maintenance checklist
Category | Maintenance item | Maintenance description
Housing | Cleaning dust inside the cabinet | Clean dust with a vacuum cleaner; scrub insulators and cable connections with a dry cloth and anhydrous alcohol
Housing | Lubrication of clips for plug-in type functional units | Apply a small amount of conductive paste to connections (clips, silver-plated bars of the moving part, copper bar at the incoming side of the drawer)
Housing | Cleaning and lubricating mechanical parts | Clean the positioning mechanism, bearings, and sliding guide of drawer; lubricate the positioning mechanism only
Circuit breaker | Cleaning and lubricating exterior mechanisms | Clean and lubricate rock-in and -out mechanisms and interlock mechanisms
Circuit breaker | Disassembling air circuit breakers for maintenance | Disassemble energy-storage springs, opening/closing coils, energy-storage motors, secondary auxiliary contacts, and tripping units for comprehensive check, cleaning, and service; replace consumable parts
Circuit breaker | Cleaning and lubricating main contacts | Clean and lubricate contacts and clips on the main body and chassis
Circuit breaker | Tightening chassis bolts | Tighten the bolts for connecting the chassis to the cabinet
Circuit breaker | Replacing control unit batteries | Replace the batteries in the control unit
Compensation capacitor | Cleaning the inside of capacitance compensation cabinet | Clean dust with a vacuum cleaner; scrub insulators and cable connections with a dry cloth and anhydrous alcohol
Compensation capacitor | Tightening internal cables | Tighten primary and secondary connection cables
Compensation capacitor | Replacing ventilation hole dust screens | Replace dust screens and sealing rubber strips
Compensation capacitor | Cleaning and lubricating fuse seats | Clean contacts and clips on fuse seats, and apply a small amount of conductive paste
Compensation capacitor | Replacing failed parts and aged capacitors | Replace capacitors, fuses, and contacts that have failed testing
Preventive maintenance of diesel generation systems
1. No-load test: conducted monthly by the operations team of the data center to verify the
automatic startup and parallel operation functions of the generator units.
2. Single-unit dummy load test: conducted monthly jointly by the service provider and
operations team to verify the effective load capacity of the generator units.
3. Loaded test under parallel operation: conducted annually jointly by the service provider
and operations team to verify the automatic startup and parallel operation functions and effective
load capacity of the generator units.
4. Monthly preventive maintenance:
A. Check engine appearance: Check the fastenings of the engine’s coolant, fuel, and
smoke exhaust systems and tighten or replace them where necessary.
B. Check engine oil level: Pull out the engine oil level gauge after the generator units have
been shut down for five minutes and check whether the oil level is between the “L” (low) and
“H” (high) marks. Replenish engine oil if the oil level is below the “L” mark.
C. Check coolant level: Open the pressure cover of the cooling system and check the
coolant level. Replenish coolant (to below the coolant filling neck on the radiator) if the coolant
level is too low. Be sure not to replenish coolant until the coolant temperature decreases to below
50 ℃. Re-install the cooling system pressure cover after the replenishment.
D. Visual check of cooling fans: Visually check cooling fans for cracking, loose screws,
bent blades, and other anomalies. Liaise with the vendor to remedy any damage or anomaly.
E. Check the operating condition of engine coolant heaters. If the power supply to a heater
is normal but the temperature is too low, the heater may have stopped working. Repair any
malfunctioning heater promptly to restore normal operation.
F. Check engine’s air intake filter: The air filter indication meter is located on the air filter
assembly or between the assembly and the turbocharger. As dust accumulates on the filtering
element, the reading on the indication meter increases accordingly. Clean or replace the filtering
element when the reading exceeds the threshold.
G. Check air intake pipes for looseness: Check air intake pipes for cracking, piercing, or
loose clamping. Tighten or replace the loose parts where necessary to ensure no leakage in the air
intake system. Check the hoses under clamps for corrosion. Replace the hoses where necessary, to
prevent foreign materials from entering the engine.
H. If the diesel fuel system is equipped with an oil–water separator, drain the water inside
it as follows: Turn the water drain valve two turns counterclockwise. Wait until only clean fuel is
discharged from the oil–water separator. Close the water drain valve by turning it two turns
clockwise. Do not overtighten the valve, to avoid damaging the screw thread.
I. Where necessary, discharge sludge in fuel tanks as follows: Loosen the threaded oil drain
plug with a spanner. Drain the tank until only clean fuel is discharged from it. Close the
blowdown valve and reinstall the threaded plug.
J. Check storage batteries and DC startup systems: Check if storage battery terminals are
clean and securely wired. Clean and re-wire them where necessary. Check if wire harnesses of DC
systems are properly connected, and replace damaged harnesses. Check the connections between
storage batteries and AC chargers. Check charger belts visually for cracking and other anomalies.
5. Annual preventive maintenance:
A. Refer to the monthly preventive maintenance items above.
B. Replace engine oil and engine oil filters.
C. Clean daily fuel tanks, and replace fuel filters.
D. Replace coolant filters and air filters.
6. Preventive maintenance of diesel generators:
A. Check the underground fuel tank; check the water level in the inspection hole and drain
the water (biweekly).
B. Check if there is water in the underground fuel tank by drawing a fuel sample from its
bottom through the oil drain port (monthly).
C. Replace startup batteries and startup relays (biannually).
D. Replace the spare batteries in the integrated control cabinet (biannually).
E. Replace coolant (every three years).
F. Replace fuel in the underground fuel tank and clean and test the tank according to fuel
quality test results (every five years).
G. Perform in-depth maintenance and test generator units (every ten years). Scrap or
replace the units if their reliability is compromised or their main performance indexes cannot
satisfy the preset specification.
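The monthly oil-level and coolant checks in items 4.B and 4.C above follow simple decision rules. The sketch below encodes them under stated assumptions: the normalized gauge readings, the function names, and the returned strings are hypothetical. Only the five-minute wait, the “L” and “H” marks, and the 50 ℃ limit come from the text. The text does not say what to do when the oil level is above “H”, so that branch is an assumption.

```python
from dataclasses import dataclass

MIN_SHUTDOWN_MINUTES = 5.0        # wait after shutdown before reading oil level (from the text)
COOLANT_MAX_REFILL_TEMP_C = 50.0  # replenish coolant only below this temperature (from the text)


@dataclass(frozen=True)
class OilReading:
    minutes_since_shutdown: float
    level: float       # hypothetical normalized gauge reading
    low_mark: float    # position of the "L" mark on the same scale
    high_mark: float   # position of the "H" mark on the same scale


def oil_action(reading: OilReading) -> str:
    """Return the action suggested by the oil level check."""
    if reading.minutes_since_shutdown < MIN_SHUTDOWN_MINUTES:
        return "wait"
    if reading.level < reading.low_mark:
        return "replenish"
    if reading.level > reading.high_mark:
        return "investigate"  # assumption: the source does not cover this case
    return "ok"


def coolant_action(level_too_low: bool, coolant_temp_c: float) -> str:
    """Return the action suggested by the coolant level check."""
    if not level_too_low:
        return "ok"
    if coolant_temp_c >= COOLANT_MAX_REFILL_TEMP_C:
        return "wait"  # coolant must cool to below 50 degrees C first
    return "replenish"


if __name__ == "__main__":
    print(oil_action(OilReading(10.0, 0.1, 0.2, 0.8)), coolant_action(True, 65.0))
```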
Preventive maintenance of UPS
Preventive maintenance of UPS is carried out quarterly by service engineers of the original
manufacturer according to the MOP and SOP of the data center. Where conditions permit, the
preventive maintenance also includes more in-depth functional checks of the UPS systems,
performed quarterly or at longer intervals. These checks may involve switching operations of the
UPS and must not be performed without adequate protection measures in place.
1. Check input power quality (input voltage and frequency) and output power quality (output
voltage, frequency, and output waveform distortion factor).
2. Check if the power switchover time is shorter than the specified value.
3. Check if the transient output voltage drop during power change-over is smaller than the
specified value.
4. Check if the output harmonic distortion factor is smaller than the specified value.
5. Check if the floating charge voltage and charging current fall within the respective design
specifications.
6. Check the voltages of battery pack and single batteries.
7. Check the battery pack backup time as follows: Turn off the main circuit input switch,
discharge the batteries for 30 minutes, turn on the switch, and record the backup time.
8. Check if the battery pack outputs large transient current while starting up.
9. Check the internal resistance of battery packs. If the internal resistance exceeds the
specified value, perform an equalizing charge of the battery packs and then discharge them or
apply activation treatment.
10. Check the manual opening and closing of upstream and downstream switchgear circuit breakers.
11. Check current sharing between units under parallel operation and the parallel operation
changeover logic.
12. Shut down the UPS, check the tightness of its internal connections, and clean the dust on
key electrical parts.
13. Check the operating condition of radiator fans. Replace defective fans.
14. Simulate failures of the UPS systems to identify potential issues with the systems. This
helps prevent failures of the UPS systems when they are required to support operations. Ensure
that protection measures are put in place for the simulation.
A. Simulate mains power outage and observe if the UPS units switch to different working
modes normally.
B. Simulate mains power outage and record the discharging voltage curves of the battery
packs.
C. Simulate one of the parallel connected UPS units being down, and observe if the other
units work normally.
15. As recommended by the manufacturer, replace the AC and DC capacitors of UPS units
preventively after five years of service.
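Several of the checks above (items 2 to 5) compare a measurement with a specified value. The sketch below shows one way such comparisons could be recorded. The source does not give the specified values, so every limit in the example is a hypothetical placeholder, as are the class and field names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpsLimits:
    """Specified values; all numbers used with this class are hypothetical."""
    max_switchover_ms: float
    max_transient_drop_pct: float
    max_output_thd_pct: float
    float_voltage_min_v: float
    float_voltage_max_v: float


@dataclass(frozen=True)
class UpsMeasurement:
    switchover_ms: float
    transient_drop_pct: float
    output_thd_pct: float
    float_voltage_v: float


def failed_checks(m: UpsMeasurement, lim: UpsLimits) -> list:
    """Return the names of the checks whose measurement is out of specification."""
    failures = []
    # Items 2 to 4 require the measurement to be smaller than the specified value.
    if m.switchover_ms >= lim.max_switchover_ms:
        failures.append("power switchover time")
    if m.transient_drop_pct >= lim.max_transient_drop_pct:
        failures.append("transient output voltage drop")
    if m.output_thd_pct >= lim.max_output_thd_pct:
        failures.append("output harmonic distortion factor")
    # Item 5 requires the floating charge voltage to fall within the design range.
    if not lim.float_voltage_min_v <= m.float_voltage_v <= lim.float_voltage_max_v:
        failures.append("floating charge voltage")
    return failures


if __name__ == "__main__":
    limits = UpsLimits(10.0, 5.0, 3.0, 540.0, 545.0)   # hypothetical specification
    reading = UpsMeasurement(4.0, 6.2, 2.1, 542.0)     # hypothetical measurement
    print(failed_checks(reading, limits))
```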
Preventive maintenance of air-conditioning systems
1. The data center conducts preventive maintenance of the air-conditioning systems to ensure
their operating safety and stability and sustain their energy-saving performance.
2. Monthly preventive maintenance of chilled-water units:
A. Check, record, and analyze the operating conditions of the units.
B. Check the level and color of lubrication oil.
C. Check the lubricant supply and return circuits of the lubrication system, lubricant
temperature, and the operating condition of lubricant coolers.
D. Check the time differences of startup/shutdown between lubricant pumps and main
units.
E. Check for abnormal vibration and noise.
F. Check the temperature of output chilled water against the specified value.
G. Check the evaporating temperature and condensing temperature against inlet and
outlet chilled water and cooling water temperature differences.
H. Check for any leakage in the units.
I. Check motor current against actual electricity consumption.
J. Check the operating condition of guide vane actuators.
K. Check the control configuration of the units.
L. Analyze the operating condition of the units.
3. Annual preventive maintenance of chilled-water units
A. Check the evaporator and condenser pressures displayed on the control panel
against measurements.
B. Transfer refrigerant to the condenser, and discharge refrigerant oil from the
refrigerant oil filling valve.
C. Check oil system circuits and oil cooling systems, replace oil filters, and replenish
refrigerant oil.
D. Check refrigerant system circuits, and replace refrigerant filters.
E. Dehumidify and evacuate evaporators.
F. Balance refrigerant system pressure and check the housing of the units for pressure
leakage.
G. Test the insulation of compressor and pump motors.
H. Check the operating condition of guide vane actuators.
I. Check and clean startup cabinets.
J. Check parameters and automatic control of the units: condensing and evaporating
pressures, bearing temperature, motor coil temperature, oil bath temperature, inlet and outlet
chilled water temperatures, pressures, oil pressures, and oil pressure differences. Start up and
shut down guide vanes and start up oil pump to check oil pressure and output digital signals of
oil heating relays.
K. Start up the units for test run, and provide a work order for annual preventive
maintenance of the units based on the operating conditions.
4. Monthly preventive maintenance of computer room air conditioners
A. Check and record operating parameters of precision air conditioners; check controllers
for warning messages.
B. Check the tightness and wear of belts. Adjust or replace them where necessary.
C. Clean or replace air filtering screens.
D. Check the working condition of proportional control valves.
E. Check the discharge of chilled water and the outlet air of the units.
5. Monthly preventive maintenance of cooling tower
A. Check and record the operating current of the cooling tower.
B. Check the operating condition of the cooling tower. The fan blade rotation should be
balanced, without significant vibration or scraping against the cooling tower wall. The water tray
should be filled with an appropriate level of water.
C. Replenish the lubrication oil for fan reducers. Check belts and belt pulleys, and
adjust them where necessary.
D. Check water distribution devices and cooling tower water replenishment devices.
E. Check the condition of fillers for clogging or damage.
F. Check the cooling tower piping, framework, and ladder for corrosion.
6. Other preventive maintenance items for the cooling tower
A. Clean cooling tower tray and filler (quarterly).
B. Check motor insulation (annually).
C. Replace cooling tower filler (every five years or depending on the working condition
of the cooling tower filler).
7. Monthly maintenance of water pipe network and water quality
A. Check pipes and valves for water dripping and leakage. Check piping heat insulation
materials for traces of water dripping and leakage.
B. Check pipes for displacement, settlement, bending, and deformation, and report any
anomaly identified immediately.
C. Check valve surface for seepage and corrosion, and remedy any leakage identified.
Perform regular test operation of valves to ensure that they can be easily switched on and off.
D. Check pipe flanges for corrosion, looseness, and water dripping and leakage.
E. Check water system piping. Check pipes and accessories (flexible joints, check
valves, and water treaters) for surface defects and cracking. Check the joints for water seepage.
Take immediate action on any defect identified.
F. Remove rust on water pipes and valves and repaint them to maintain integrity of
painting (no peeling-off). Repair any insulation layer damage immediately.
G. Check pipe brackets for insecure installation, dislocation, or deformation. Check
wooden pipe carriers for corrosion and deformation.
H. Check if the cooling water is clean. Replace it where necessary. Analyze water
quality regularly, and add germicide, algicide, anti-sludging agent, and/or corrosion inhibitor to
the water where necessary.
I. Check the quality of softened water for the chilled-water system. Check the softened
water system.
J. Check the accuracy of pressure gauges and thermometers. Instrument dials should be
clear. Replace any damaged dials immediately.
K. Check the operating condition of float valves for cooling water replenishment and
chilled-water pressure-stabilization and replenishment devices.
L. Clean water piping filters when the difference between the pressures at the two ends of
a filter is greater than 0.05 MPa.
M. Ensure that appropriate anti-freezing measures are in place for outdoor piping in
winter.
N. Check the accuracy of pressure gauges and thermometers for water distributors and
collectors.
8. Preventive maintenance of circulating water pumps
A. Replenish lubrication oil (quarterly).
B. Check water pump sealing (quarterly). Repair any water leakage identified.
C. Test and calibrate the concentricity of couplings, and check coupling bolts and
rubber rings (annually). Replace damaged parts.
D. Tighten pump seat screws and perform antirust treatment to pumps (annually).
E. Service water pumps (annually), including a check of major parts such as the impeller,
sealing ring, and bearing. Clean the impeller and remove scale from the impeller water
channels.
9. Monthly maintenance of motors and power distribution and control systems
A. Motors should operate normally, with bearings well lubricated and insulation
resistance greater than 2 MΩ. All wiring connections should be secure, and the load current and
temperature rise should satisfy the respective specifications.
B. Check the operating conditions of frequency converters and soft starters (the
temperature rise should not exceed the specified value).
C. Electrical and control components should have clean surfaces, intact structures,
accurate operation, and fully working display and alarm functions.
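Two numeric criteria appear in the air-conditioning items above: water piping filters are cleaned when the pressure difference across a filter is greater than 0.05 MPa (item 7.L), and motor insulation resistance should be greater than 2 MΩ (item 9.A). The following minimal sketch applies both thresholds; the function names and sample readings are hypothetical.

```python
FILTER_DP_LIMIT_MPA = 0.05        # pressure difference that triggers cleaning (item 7.L)
MOTOR_MIN_INSULATION_MOHM = 2.0   # motor insulation resistance must be greater than this (item 9.A)


def filter_needs_cleaning(pressure_a_mpa: float, pressure_b_mpa: float) -> bool:
    """True when the pressure difference between the two ends of a filter exceeds 0.05 MPa."""
    return abs(pressure_a_mpa - pressure_b_mpa) > FILTER_DP_LIMIT_MPA


def motor_insulation_ok(measured_mohm: float) -> bool:
    """True when the motor insulation resistance is greater than 2 MΩ."""
    return measured_mohm > MOTOR_MIN_INSULATION_MOHM


if __name__ == "__main__":
    # Hypothetical readings in MPa and megaohms.
    print(filter_needs_cleaning(0.40, 0.32), motor_insulation_ok(5.0))
```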
5.3 Predictive maintenance - overview
To sustain the secure and stable operation of the data center, the operations management team
regularly monitors the infrastructure of the data center (power supply and distribution, UPS, diesel
generator, chilled water, and lightning protection and grounding systems) using various
instruments and professional third-party testing services. Predictive maintenance is one of the
major types of proactive maintenance used to sustain the secure operation of the data center. It
involves comprehensive trend analysis of data on temperature rise (measured by infrared
thermography), vibration, and the chemical composition of fuel and lubrication oil. The aim is to
diagnose the operating health of the component systems so that operations management personnel
can identify potential risks early and mitigate them in a timely, effective manner.
5.3.1 Predictive maintenance - general requirements
Establish and implement detailed annual predictive maintenance plans.
Measurement tools used for predictive maintenance should be regularly calibrated according
to the quality inspection department’s calibration procedure to maintain their measurement
accuracy.
Employ third-party testing professionals to test the systems and equipment of the data center
and produce relevant testing reports.
Predictive maintenance should be performed according to MOP and SOP to ensure equipment
and personnel safety during the maintenance.
Reports should be generated for completed predictive maintenance activities and include
trend analysis based on comparison with historic data.
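The trend analysis mentioned above can take many forms, and the source does not specify a method. As one illustrative possibility, the sketch below fits a least-squares line to a series of equally spaced historic readings (for example, the temperature rise of a busbar joint recorded at each inspection) and reports the slope. The function name and the sample values are hypothetical.

```python
def linear_trend(values: list) -> float:
    """Return the least-squares slope of the readings against their index.

    The result is the average change per inspection, assuming the readings
    are equally spaced in time.
    """
    n = len(values)
    if n < 2:
        raise ValueError("at least two readings are needed for a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical temperature-rise readings (in kelvin) from four inspections.
    readings = [10.0, 12.0, 14.0, 16.0]
    print(linear_trend(readings))
```

A persistently positive slope on such a series would suggest a developing problem worth investigating before the reading reaches an alarm limit; the choice of alert threshold is left to the operations team.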
5.3.2 Predictive maintenance - high-level plan
Table 5.3-1 Data center infrastructure predictive maintenance checklist
Component systems | Check item
Power supply and distribution systems | Test transformers, busbars, circuit breakers, and capacitors using infrared thermography; test the discharging of DC cabinet storage batteries
Generators | Test the chemical composition of fuel and lubrication oil; test electrical systems using infrared thermography; check mechanical vibration
UPS systems | Test them using infrared thermography
Air-conditioning systems | Test the chemical composition of refrigerant oil; test pipes for defect; check the mechanical vibration of refrigerators and water pumps
Computer room environment | Employ third-party professionals to test the dust load, electromagnetic radiation, noise, and lightning protection and grounding in the computer room
Lightning protection and grounding | Test the lightning protection and grounding of the building regularly according to the lightning protection test specification
5.4 Emergency plan overview
The operations team of the data center has established detailed, comprehensive
failure/incident emergency response procedures according to actual operating conditions. The
procedures are drilled regularly to improve the team's capacity to handle sudden failures
and incidents, which lays a foundation for sustaining the high availability of
the data center.
5.4.1 Emergency drill plan
Comprehensive emergency response procedures must be established proactively for potential
failures or anomalies. The operations team must become acquainted with the procedures.
Establish and implement annual emergency drilling plans.
Sand table (tabletop) exercise: The operations personnel gather around a sand table and state
verbally their respective responsibilities and the actions to be taken during emergencies.
Movement exercise: The emergency response personnel proceed to the simulated failure
scene and walk through the failure response procedure. They should be able to state the
failure response plan verbally, step by step.
5.4.2 Emergency drill items
Table 5.4-1 Emergency drilling for system/equipment failures
Drilling item | Drilling description
Low-voltage power distribution systems | Simulate the tripping of a transformer incoming line switch, and manually close the interconnection switch that has been interlocked for spare power automatic switching.
Medium-voltage power distribution and diesel generators | 1. Disconnect one line of the double-circuit mains power supply, and manually close the medium-voltage bus tie switch that has been interlocked for spare power automatic switching. 2. Simulate mains power outage and failed automatic startup of the diesel generators, and manually start the diesel generators for parallel operation.
Switching between primary and redundant power supply and air-conditioning systems | Switch between the primary and redundant power supply and air-conditioning systems to verify the high availability of the power supply systems of the data center.
Chilled-water systems (main unit failure) | Simulate the failure of the primary chilled-water unit, and switch quickly to the redundant unit.
UPS systems and precision switchgears failure | 1. Simulate the failure of a UPS system, and switch to the bypass circuit for power supply. 2. Simulate the failure of precision switchgear, and switch to the UPS systems to resume power supply.
Monitoring system | Simulate the failure of the primary monitoring server, and switch to the redundant monitoring server.
Air-conditioning system (water system anomaly) | Simulate a leakage in the chilled-water piping, close the chilled-water valves, switch precision air conditioners to air cooling mode, and check the heat dissipation capacity of the outdoor air conditioner units and temperature variation in the computer room.
Elevator emergency | Simulate the failure of an elevator, and rescue people in the elevator carriage.
Water supply and drainage systems | Simulate flooding in an underground space, and quickly drain the flooded space.
Firefighting system | 1. Simulate a fire in the data center, and test the automatic and manual gas fire-extinguishing procedure and the integrated fire alarm control. 2. Personnel emergency evacuation
5.5 System availability check
The operations team of the data center works toward further improving the availability of the
data center by regularly checking the operating environment and condition of the data center (for
example, parameter configurations of systems and equipment, control/alarm limits for critical
equipment, equipment information list, rack power distribution units (PDUs), and logic
relationship between switches) and employing third-party professionals to regularly inspect
computer rooms.
5.5.1 Monthly check of data center facilities
In addition to the routine check, a comprehensive monthly inspection of the data center
infrastructure is conducted to identify defects and opportunities for improvement, which are
subsequently logged in the ServiceBot system for remedy and tracking by engineers. A
defect/opportunity for improvement will be closed in the system when remedied or improved,
with the details of the remedies and corrective actions taken recorded in the system. This
contributes toward further improvement of the system and equipment availability.
5.5.2 Data center room environment check
A comprehensive monthly inspection of the working environment of the data center
infrastructure is conducted to identify opportunities for improvement, which are subsequently
logged in the ServiceBot system for remedy and tracking by the person-in-charge. Details of the
improvement actions taken are also recorded in the system.
5.5.3 Data center facilities operational information check
To facilitate fine-grained management of the data center infrastructure, regular checks and
updates are carried out for equipment operation settings, the open/closed status of switches, rack
PDUs and the corresponding operating-status labels for switches and equipment
(operating/standby), the detailed equipment list, equipment operation tips, monitoring/alarm limits,
and monitoring and alarm filters.
5.6 Life cycle management
The life cycle of a data center refers to the entire process from the initial demand for data center
construction to the end of its economic life. The life cycle can be divided into decision-making,
implementation, and operations and maintenance stages, and each of the stages can be further divided
into several sub-stages. The decision-making stage includes needs collection, planning, site
selection, and feasibility analysis. The implementation stage includes project design, construction,
acceptance, and hand-over. The operations stage covers the entire process from the completion of
basic construction and commissioning of the data center to the end of its economic life.
This chapter focuses on the equipment life cycle management at the operations stage of the
data center. Good equipment life cycle management is achieved by identifying equipment
operating risks and establishing risk mitigation plans. This not only reduces equipment failure rate
and improves the availability of the data center, but also extends the service life of the data center
and maximizes its benefit.
In terms of life cycle management of data center infrastructure, Ping An Data Center focuses
on medium- and low-voltage power distribution equipment, transformers, UPS, diesel generators,
and chilled-water units. The major activities in this regard include regular equipment checks,
replacement of wear-prone critical parts, and equipment retirement and replacement.
5.6.1 Life cycle management - medium-voltage switchgears
The critical parts of medium-voltage switchgear (including circuit breaker, busbar, and
cabinet housing) are subject to routine maintenance every six months and in-depth maintenance
every three years. The planned service life of circuit breakers is 15 years (or 10,000 operations).
In the 14th year of its service life, a circuit breaker shall be evaluated for its operating condition
and, where necessary, a proposal shall be initiated and a budget shall be set up to replace it in the
following year. The planned service life of busbars and cabinet housing is 20 years. In the 19th
year of their service life, a proposal shall be initiated and a budget shall be set up to retire and
replace them in the following year. Life cycle management and maintenance plans shall
be established for new replacement switchgear.
5.6.2 Life cycle management - low-voltage switchgears
The critical parts of low-voltage switchgear (including circuit breaker, busbar, cabinet
housing, and capacitance compensator) are subject to routine maintenance every six months and
in-depth maintenance every three years. The planned service life of circuit breakers is 15 years (or
30,000 operations). In the 14th year of the service life, a circuit breaker shall be evaluated for its
operating condition and, where necessary, a proposal shall be initiated and a budget shall be set up
to replace it in the following year. The planned service life of busbars and cabinet housing is 20
years. In the 19th year of their service life, a proposal shall be initiated and a budget shall be set up
to retire and replace them in the following year. Life cycle management and
maintenance plans shall be established for new replacement switchgear. The planned service life
of capacitance compensators is 5–8 years, shorter than that of other parts. Capacitance
compensators are replaced as recommended by the manufacturer or according to their operating
conditions. It is recommended to have a capacitance compensator replaced twice during the life
cycle of the switchgear.
5.6.3 Life cycle management - transformers
Transformers are subject to annual de-energized maintenance and preventive maintenance
every six years. The planned service life of transformers is 20 years. In the 19th year of the
service life, a transformer shall be evaluated for its operating condition, and a proposal shall be
initiated and a budget shall be set up to retire and replace it in the following year. Life
cycle management and maintenance plans shall be established for new replacement transformers.
5.6.4 Life cycle management - diesel generators
The engine oil, diesel fuel, and air filter elements of a diesel generator unit's lubrication, fuel,
and air filtering systems shall be replaced every year.
The coolant and cooling water filters of the cooling system shall be replaced every three years.
Startup batteries shall be replaced every two years.
The planned service life of diesel generator units is 15 years. In the 10th year of its service
life, a unit shall be evaluated for its operating condition to decide whether to continue its service. If it is
decided to continue its service, in the 14th year of the service life, a budget shall be set up to
retire and replace it in the following year.
5.6.5 Life cycle management - uninterruptible power supplies (UPS)
The AC and DC capacitors in a UPS are wear-prone parts and have a service life of five to six
years. They need to be replaced as recommended by the manufacturer; the general principle is
two replacements in the life cycle of a UPS.
UPS storage batteries shall be replaced according to their operating condition and the general
principle is at least one replacement in the life cycle of UPS.
The planned service life of a UPS is 20 years. In the 19th year of the service life, a proposal shall
be initiated and a budget shall be set up to retire it in the following year. Life cycle
management and maintenance plans shall be established for a new replacement UPS.
5.6.6 Life cycle management – chilled-water units
The oil filters, refrigerant drying and filtering devices, and refrigerant oil in the chilled-water
units need to be replaced every year.
The planned service life of chilled-water units is 15 years. In the 10th year of the service life,
a chilled-water unit shall be evaluated for its operating condition to decide whether to continue its
service. If it is decided to continue its service, in the 14th year of the service life, a budget shall be
set up to retire and replace it in the following year.
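The schedules in sections 5.6.1 to 5.6.6 lend themselves to a simple lookup table. The sketch below is illustrative only. The service-life figures are transcribed from the text above, but the class, the key names, and the convention that the commissioning year counts as service year 1 are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LifeCyclePlan:
    service_life_years: int         # planned service life
    evaluation_year: Optional[int]  # service year of the condition evaluation, if any
    budget_year: int                # service year in which the replacement budget is set up


# Figures from sections 5.6.1 to 5.6.6; the dictionary keys are hypothetical labels.
PLANS = {
    "mv_circuit_breaker": LifeCyclePlan(15, 14, 14),
    "mv_busbar_cabinet": LifeCyclePlan(20, None, 19),
    "lv_circuit_breaker": LifeCyclePlan(15, 14, 14),
    "lv_busbar_cabinet": LifeCyclePlan(20, None, 19),
    "transformer": LifeCyclePlan(20, 19, 19),
    "diesel_generator": LifeCyclePlan(15, 10, 14),
    "ups": LifeCyclePlan(20, None, 19),
    "chilled_water_unit": LifeCyclePlan(15, 10, 14),
}


def service_year(commissioned: int, current: int) -> int:
    """Service year, counting the commissioning year as year 1 (an assumption)."""
    return current - commissioned + 1


def actions_due(kind: str, commissioned: int, current: int) -> List[str]:
    """List the life cycle actions that fall due in the current calendar year."""
    plan = PLANS[kind]
    year = service_year(commissioned, current)
    actions = []
    if plan.evaluation_year == year:
        actions.append("evaluate operating condition")
    if plan.budget_year == year:
        actions.append("set up replacement budget for the following year")
    if year > plan.service_life_years:
        actions.append("planned service life exceeded")
    return actions
```

For example, `actions_due("diesel_generator", 2010, 2019)` falls in the 10th service year and returns only the evaluation action. Operation-count limits (10,000 or 30,000 operations for circuit breakers) and the yearly consumable replacements are left out to keep the sketch short.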
5.7 Risk management
The operations team of the data center actively manages the operating risks of the data
center. This helps the team make correct decisions, protect the security and
integrity of company assets, and achieve its performance goals, all of which are essential to the
operations of the data center.
5.7.1 Acronyms and definitions
The risk management of the data center refers to the management process to identify risks in
an environment and minimize the potential impact of the identified risks.
5.7.2 Risk identification and analysis
As the first important step of the risk management process, risk identification for the data
center records the risks in the computer room in a comprehensive risk analysis list. The
identified risks are then proactively analyzed for their potential impact and for the best measures
to mitigate that impact.
The listed risks are categorized as high, medium, or low. A high risk is an unacceptable
operating risk whose occurrence would leave the computer room unable to resume operation
quickly and cause serious loss to the company. Medium and low risks are tolerable and
controllable operating risks that threaten operational security only on a local scale.
Note: The risk identification and evaluation form is a living document and needs to be updated
regularly, as an operating risk may change and need to be reclassified, and new risks may arise as
relevant factors in the computer room evolve.
Table 5.7-1 Computer room operating risk analysis list
Risk classification | High | Medium | Low
Computer room security | Fire impacting the entire computer room | Fire impacting some of the computer room equipment | -
Computer room security | Leakage water pooling in a large area of the computer rooms | Water pooling in the main computer room | Leakage water pooling locally in the computer rooms
Computer room security | Collapse of the computer room building | Local damage of the computer room building | The structural integrity of the computer room threatened
Computer room security | - | Firefighting systems out of control | Firefighting system faults
Computer room security | - | Air-conditioning system failure or out of control | Abnormal temperature or humidity
Computer room security | - | Door access system out of control | Door access system fault
Computer room security | - | Computer room lighting system failure | Lighting system fault
Computer room security | - | Computer room monitoring system failure | Computer room monitoring system warning
Operational security | Core equipment failure | Major equipment failure | -
Operational security | Large-scale power outage in the computer room | Power supply fault | -
Operational security | Air-conditioning water system piping blow-up | Air-conditioning system failure in a single computer room | -
Operational security | Entire diesel generation system failure | Diesel generation unit failure | -
Operational security | Core network cable broken | Primary/redundant network cable broken | Local failure of network cable
Management and personnel safety | Sabotage | Severe operating error | General operating error
Management and personnel safety | Incomplete definition of management structure or responsibilities | Incomplete rules and regulations | Poor implementation of rules and regulations
Management and personnel safety | Personnel casualties | Personal injury | -
Property management | Damaged major equipment | Local damage of equipment | Equipment failure
Property management | Major equipment (data) missing | Equipment missing | Equipment components missing
Others | Power outage or network communication failure caused by lightning | Lightning | Lightning protection device failure
Others | - | Cable damaged by varmints | Presence of varmints
Others | Severe electromagnetic interference | General electromagnetic interference | -
5.7.3 Risk mitigation plan
The operating risks identified in the data center are tracked and controlled in the form of a
risk control list, where risk mitigation plans as well as the status of the planned actions are
recorded (a mitigated risk may be controlled as a generic issue). The risk control list includes the
following information:
Date of risk identification: The date on which a risk is identified.
Risk description: A description of the identified risk that helps the data center operations
team understand the risk.
Risk occurrence probability: Three levels of risk occurrence probability are defined: high,
medium, and low.
Risk impact: Three levels of risk impact are defined: high, medium, and low.
Risk severity: Three levels of risk severity are defined: high, medium, and low.
Risk owner: A person is specially designated for controlling and tracking a risk.
Risk control strategy: Risks are controlled through any of the following three strategies:
avoidance, mitigation, and acceptance. The specific control strategy for a risk is decided by
the risk owner according to the outcome of risk evaluation.
Risk mitigation plan: With the identified risks analyzed qualitatively and quantitatively and
prioritized, the owner of a specific risk develops a risk mitigation action plan according to the
operating condition of the data center.
Risk emergency plan: A plan is established for a quick response to the occurrence of each
specific risk and resuming normal operation. For an emergency plan to be comprehensive,
scientific, and effective, the following information for risk emergency response shall be
included: emergency reporting system and emergency response organization responsible for
mobilization, on-site coordination, and staffing (including technical professionals for risk
response).
Risk control status: The status of risk control can be closed or open. An open risk needs to be
tracked and regularly updated. A closed risk may be referenced for similar risks in the future.
Risk change record: Record of major actions taken and major progress made in risk control.
Risk update date: The date on which a risk in the risk control list is deleted or modified or the
date on which a new item is added to the list.
Approaches to close a risk: A risk may be closed if it is mitigated, changed to a generic issue,
or taken as it is.
Date of closing a risk: The date on which a risk is avoided, mitigated, or taken as it is and is
thereafter closed for control after actions are taken to cope with the risk.
Risk transfer: Risks with low probabilities can be transferred to insurance companies and
service providers by purchasing insurance and outsourcing equipment maintenance. For
example, purchasing property insurance can transfer some computer room risks (for example,
risks to the computer room building and the risk of fire) to insurance companies; outsourcing
computer room equipment maintenance can transfer the risk of equipment failures (for
example, UPS and precision air conditioners) to equipment maintenance service providers.
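The fields of the risk control list map naturally onto a record type. The following sketch is an assumption about how such a list could be represented, not a description of the system Ping An uses. The class and field names are hypothetical, and the sample entry is invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import List, Optional


class Level(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


class Strategy(Enum):
    AVOIDANCE = "avoidance"
    MITIGATION = "mitigation"
    ACCEPTANCE = "acceptance"


@dataclass
class RiskEntry:
    identified_on: date
    description: str
    probability: Level
    impact: Level
    severity: Level
    owner: str
    strategy: Strategy
    mitigation_plan: str
    emergency_plan: str
    is_open: bool = True
    change_log: List[str] = field(default_factory=list)
    updated_on: Optional[date] = None
    closed_by: Optional[str] = None  # approach used to close the risk
    closed_on: Optional[date] = None

    def record_change(self, note: str, when: date) -> None:
        """Log a major action or progress and refresh the update date."""
        self.change_log.append(note)
        self.updated_on = when

    def close(self, approach: str, when: date) -> None:
        """Close the risk; it stays in the list for future reference."""
        self.is_open = False
        self.closed_by = approach
        self.closed_on = when
        self.updated_on = when


def open_risks(register: List[RiskEntry]) -> List[RiskEntry]:
    """Risks that still need tracking and regular updates."""
    return [r for r in register if r.is_open]


# Invented sample entry.
entry = RiskEntry(
    identified_on=date(2018, 1, 10),
    description="Chilled-water pipe joint shows corrosion",
    probability=Level.MEDIUM,
    impact=Level.HIGH,
    severity=Level.HIGH,
    owner="facilities engineer",
    strategy=Strategy.MITIGATION,
    mitigation_plan="Replace the joint during the next maintenance window",
    emergency_plan="Close chilled-water valves and switch to air cooling",
)
register = [entry]
entry.record_change("Joint replaced", date(2018, 2, 1))
entry.close("mitigated", date(2018, 2, 1))
```

Severity is stored as its own field because the text defines three levels for it but does not say how it is derived from probability and impact.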
5.8 Asset management
5.8.1 Challenges of asset management
Ping An Group is one of China’s personal financial service groups with the most
comprehensive range of financial business licenses, the most extensive business scope, and the
most compact shareholding structure. Owing to the interaction between its diversified businesses,
its IT systems are tightly coupled and have a complicated infrastructure. To cope with its rapid
business development and frequent business changes, its IT facilities face the challenge
of accommodating approximately 100 changes a day. Owing to the nature of financial services,
Ping An Data Center is required to resume operation quickly after a failure occurs.
Therefore, it is essential to locate hardware failures and the affected applications in a large-scale IT
infrastructure (100,000+ units of equipment) in a timely manner. This in turn dictates highly
efficient asset management in the data center, which requires a customized tool that supports
systematic management.
5.8.2 Systematic asset management
5.8.2.1 Scope of asset management
Ping An Group comprises many business units and has a multitude of widely distributed
IT-related assets. Given this complexity, the data center and
the Group Asset Management Office have defined the scope of data center asset management as
the physical area of the Ping An-owned data center, and this definition has been officially published.
5.8.2.2 Asset issuance procedure
The assets of the data center include both operating and non-operating equipment units.
Commissioned equipment units are installed with application systems and can be monitored
automatically at the following three levels according to the company’s IT system management
specification: application level, operating system (OS) level, and hardware status level. An
unauthorized change to a commissioned equipment unit will trigger an alarm, which is monitored
by the asset management officer. However, there is no effective means for automatic monitoring
of noncommissioned equipment units, which are controlled through the asset issuance procedure.
The procedure is linked with the company’s financial system. If the procedure is not followed, the
expenses for acquiring the equipment cannot be processed for reimbursement and payment.
5.8.2.3 Asset management responsibility system
The position of the asset management officer is specially established in the data center for
asset management. The asset management officer is required to become acquainted with the
equipment classification system of the data center and work carefully, earnestly, and patiently to
manage assets according to the asset management system.
5.8.2.4 Asset obsolescence and disposition procedure
The data center retires and disposes of equipment that is no longer in use and has exceeded
its financial depreciation life. This is carried out twice a year according to the asset obsolescence
and disposition procedure established by the corporate asset management office. The timely
disposition of obsolete assets keeps the asset data up to date.
5.8.3 Developing a unique asset management system for the data center
5.8.3.1 The necessity to develop a unique asset management system
The number of assets in the data center grows by more than 10,000 units
a year. This rapid growth calls for a dedicated asset management system that fits the
situation of the data center, so that asset changes can be recorded and data collected and
analyzed using big-data technology. Ping An Data Center has now developed the Goods Receipt
System, the Integrated Data Center (IDC) Visual Management System, and the OPCM management
system to satisfy its management requirements. The systems have a PC version and a mobile
app version, enabling access in the office and mobile access while
working on-site in the computer room.
5.8.3.2 Top priorities in asset management system development
The top priority is the design of the configuration management database (CMDB) and
configuration items (CIs). Two considerations apply: 1) it is not advisable to
cover every configuration item during the design phase, because the data on CIs and the relations
between them change constantly, and covering them all would greatly increase the maintenance
effort; 2) it is not advisable to seek an all-round system that provides solutions at many levels
(data center, server, storage, network, and application), as this may result in no good solution for
any single failure.
A key challenge is the integration of off-the-shelf products into the asset management system.
The biggest issue with off-the-shelf products is that they are general-purpose and do not solve
the practical problems of the data center. Another potential issue is that the system
developer may not understand the requirements of data center operations, which may result in a
long development cycle and a system that falls far short of the operational needs.
5.8.3.3 The asset management system of the data center
To address the above issues, in 2016 Ping An Data Center developed OPCM, an asset management
system that fits its particular situation. The system was developed based on the
following two principles: 1) Streamline the CMDB and CIs to realize an asset management
system that has fewer but better functions. The target is a system that can manage 95%
of the day-to-day work, with the remaining 5% handled by on-site checks or by logging onto
the OS to check configurations (for example, the number of network cards in an equipment unit
and the MAC address of each network card). This is intended to keep the CMDB from growing too large. 2) To
address the issue that a system developed by personnel without operations knowledge is prone to
be unsuitable for operations, the data center operations team provided operations training to the
system development personnel. The OPCM system has now been commissioned and has proven
capable of facilitating the asset management of the data center as expected.
5.8.4 Asset management system illustrated
5.8.4.1 Total life cycle management of the assets of the data center
The figure below shows an example of total life cycle management of assets in the OPCM
system, a process that starts from asset acceptance.
Fig. 5.8-1
5.8.4.2 Equipment hardware configuration information management
Fig. 5.8-2
5.8.4.3 Equipment-application correlation management
Fig. 5.8-3
5.8.5 On-site asset control
5.8.5.1 Characteristics of on-site asset management
The on-site asset management of the data center covers two types of assets: 1) those that have
been commissioned in the data center and 2) those that have not been commissioned and are
stored in the warehouse. Quicker failure recovery is required of financial data centers, which
dictates quick acquisition of information about equipment configurations, applications running on
equipment, and persons in charge of the applications as well as configurations of spare equipment
stored in the warehouse when needed to replace failed equipment. This constitutes a special
challenge for equipment management in data centers. To cope with this challenge, the data center
applies QR code labels on commissioned equipment and has developed an app for mobile data
center management that runs on tablets and mobile phones.
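The kind of lookup that a scanned label could trigger is sketched below. Everything here is hypothetical: the source does not describe the OPCM data model, so the record fields, the in-memory dictionary, and the rule that a spare matches a failed unit when the model is identical are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Asset:
    serial: str
    model: str
    commissioned: bool
    location: str  # rack for commissioned units, warehouse position otherwise
    applications: Dict[str, str] = field(default_factory=dict)  # application -> person in charge


def scan(inventory: Dict[str, Asset], qr_payload: str) -> Optional[Asset]:
    """Resolve a scanned serial number to its asset record, if any."""
    return inventory.get(qr_payload.strip())


def matching_spares(inventory: Dict[str, Asset], failed: Asset) -> List[Asset]:
    """Spare units in the warehouse with the same model as the failed unit."""
    return [
        a for a in inventory.values()
        if not a.commissioned and a.model == failed.model
    ]


# Invented sample data.
inventory = {
    a.serial: a
    for a in [
        Asset("SN-0001", "server-type-a", True, "rack A01", {"billing": "owner-1"}),
        Asset("SN-0002", "server-type-a", False, "warehouse shelf 3"),
        Asset("SN-0003", "server-type-b", False, "warehouse shelf 4"),
    ]
}

failed = scan(inventory, "SN-0001")
spares = matching_spares(inventory, failed)
```

One scan thus yields the three pieces of information the text calls for: the equipment configuration, the applications and their owners, and the candidate spares.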
5.8.5.2 Introduction of the QR code technology used in the data center
As mobile technology advances, QR codes, which make life and work easier and more
convenient, have become increasingly popular. QR codes are
employed in the asset management of the data center. Two types of QR codes are used for asset
management: 1) those for assets identified with serial numbers and 2) those for assets identified
with asset descriptions. In the first case, the serial number of an asset is encoded into a QR code,
which is then printed out and affixed on or near the asset, whereas in the
second case, an asset description is generated according to the company’s pre-established
specification and then input into a QR code generator to create a QR code label. The figures
below show examples of QR codes.
Equipment QR code label
Fig. 5.8-4
Rack QR code label
Fig. 5.8-5
5.8.5.3 Application of QR code illustrated
Scan equipment QR code to acquire equipment information
Fig. 5.8-6
Scan rack QR code to acquire information about all the equipment units in the rack
Fig. 5.8-7
5.8.5.4 Asset obsolescence and disposition procedure
An asset that has reached the end of its service life or cannot continue service (an item in the list
of obsolete assets) shall be disposed of in a timely manner. This improves the power and space
efficiencies of the data center, reduces operations cost, and keeps the asset data clean. The asset
management officer of the data center is responsible for asset obsolescence and disposition. The
officer shall arrange at least two rounds of asset disposition a year, which is defined as one of the officer's KPIs.
The asset obsolescence and disposition process is as follows. The asset management officer
prepares a list of obsolete assets and emails it to the asset users for confirmation. If an asset is
confirmed to be obsolete, the asset management officer prepares an asset obsolescence request
and sends it to the end user, data center manager, departmental managers of the data center,
corporate asset management office, and finance department for approval. The asset management
officer then arranges for the asset to be auctioned and updates the asset
financial data in the corporate material system, and the data center updates the record in the OPCM.
The asset management officer then prepares an asset disposition end-of-availability (EOA)
request and sends it to the user, data center manager, departmental managers of the data center,
corporate asset management office, and finance department for approval. Once it is approved, the
auction winner is permitted to take the asset away. This completes the asset disposition process.
5.8.5.5 Asset inventory check
Even with the OPCM system implemented in the data center for asset management, human error
still produces “dirty” asset data. An asset inventory check is the only effective way to
identify and correct dirty data. There are two types of asset inventory checks implemented in the
data center: 1) quarterly self-check by the data center and 2) annual corporate asset inventory
check, which is conducted by the corporate asset management office for company-wide assets.
With these two types of asset inventory checks put in place, the asset data accuracy of the data
center is now higher than 99.8%.
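An inventory check amounts to reconciling recorded data against what is found on site. The sketch below shows one way to do this. The function name, the serial-to-location mapping, and in particular the accuracy formula are assumptions, as the source does not state how the 99.8% figure is calculated.

```python
def reconcile(recorded, observed):
    """Compare recorded asset locations with those found during an inventory check.

    Both arguments map serial number -> location.
    """
    missing = sorted(s for s in recorded if s not in observed)
    unrecorded = sorted(s for s in observed if s not in recorded)
    misplaced = sorted(
        s for s in recorded if s in observed and recorded[s] != observed[s]
    )
    correct = len(recorded) - len(missing) - len(misplaced)
    # Assumed definition: share of recorded assets found exactly where recorded.
    accuracy = correct / len(recorded) if recorded else 1.0
    return {
        "missing": missing,        # recorded but not found on site
        "unrecorded": unrecorded,  # found on site but absent from the records
        "misplaced": misplaced,    # found, but in a different location
        "accuracy": accuracy,
    }


# Invented sample data.
recorded = {"SN1": "R1", "SN2": "R2", "SN3": "R3", "SN4": "R4"}
observed = {"SN1": "R1", "SN2": "R9", "SN4": "R4", "SN5": "R5"}
result = reconcile(recorded, observed)
```

Each of the three lists corresponds to a different correction: locate or write off the missing asset, register the unrecorded one, and update the location of the misplaced one.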
5.9 Day-to-day operations management
5.9.1 Challenges of day-to-day operations
Ping An Data Center supports not only Ping An Group’s traditional financial services such as
insurance, banking, and investment but also Internet financial services such as Lufax, OneConnect,
and eWallet. The traditional financial services are mature but complicated in structure. This
dictates the support of a data center that is stable and quick in failure recovery. In addition, a data
center failure that is not recovered in the regulatory time frame must be reported to the regulator.
Therefore, for the data center to be able to support the traditional financial services, the top
priority is stability, i.e., the fewer the changes, the better. In contrast, the new Internet financial
services need a short time to market to capture market share, and frequent fixes, as problems may
emerge after a new service goes live. Therefore, the new Internet financial services require the data center
to support a short time to market and frequent changes. In addition, the traditional financial
services are incorporating more and more Internet service elements. This results in a complicated
business structure of the data center: the coexistence of old structures based on traditional “OEM”
products, new Internet structures based on Ping An’s financial clouds but correlated with
traditional OEM products, and new structures completely based on the Internet framework and
philosophy. This poses continuous new challenges to the data center. To satisfy the requirements
of both the traditional financial businesses and the new Internet financial businesses, the data center
must break down the requirements of the financial services it supports and
carry out fine-grained management of its operations.
5.9.2 Systematic day-to-day operations management
5.9.2.1 Zoned management
Ping An Data Center supports Ping An Group’s insurance, banking, and securities businesses,
and its support service shall satisfy the regulatory requirements of the China Insurance Regulatory
Commission, the China Banking Regulatory Commission, and the China Securities Regulatory
Commission, respectively; its support service to Internet financial services (for example, Lufax
and credit inquiry) shall satisfy the regulatory requirements of the People’s Bank of China. The
data center is also subject to annual inspections by the above regulators. To meet these
requirements, the data center has established a zoned service management system. Where physical
segregation is required by the regulator, zones are segregated into separate modules or by
physical barriers. Where physical segregation is not required, service zoning is
realized by concentrating a service in a separate rack and locking the rack. Fine-grained zoned
management of different services is realized by establishing different management systems
according to their different characteristics.
However, as technology advances and new business forms emerge, regulators may update
their regulatory requirements according to the latest situation. Thus, the data center needs to
closely follow changes in regulatory requirements for data centers and update its zoned
management system accordingly.
5.9.2.2 Service window and maintenance window
With the zoned management system, the data center is able to ensure that equipment
maintenance in one zone does not affect operations in any other zone. However, as the data
center hosts structures for both traditional financial businesses and Internet financial businesses,
its component systems are very often interconnected, and a minor change in one part of the data
center may affect the entire data center. To ensure that a change does not affect the major businesses
served, the data center has set up service and maintenance windows, which have been agreed to
by the relevant parties.
In the service window, no maintenance events or changes are allowed in order to ensure the
stable operation of business systems. Maintenance activities and changes can only be
implemented in the maintenance window. If a maintenance event or change for a business does
not impact any other businesses, it can be implemented in the pre-established maintenance
window; in cases of an event or change impacting several interrelated businesses, it can only be
implemented in a maintenance window that is acceptable to all the businesses. Thus, delicacy
management of routine maintenance for different businesses can be realized. In cases of a service
outage or severe vulnerability that may lead to a service outage in the service window,
maintenance is allowed in the service window but only after undergoing a rigorous approval
procedure. This is to provide flexibility in the time of emergency while preventing the abuse of
this emergency channel.
Table 5.9-1 Examples of service and maintenance windows
Business systems Service window Maintenance window
Insurance **:** - **:** **:** - **:**
Banking **:** - **:** **:** - **:**
Securities **:** - **:** **:** - **:**
Internet financing **:** - **:** **:** - **:**
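The window rules described above can be sketched as a small decision function. This is an illustrative sketch only: the `Window` class, the function name, and the window times are hypothetical (the actual times in Table 5.9-1 are not disclosed), and each window is assumed not to cross midnight.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Window:
    start: time
    end: time

    def contains(self, t: time) -> bool:
        # Assumes the window does not wrap past midnight.
        return self.start <= t < self.end


def change_allowed(t, affected, maintenance_windows, emergency_approved=False):
    """Return True if a change at time t may proceed.

    affected: names of all businesses the change impacts.
    maintenance_windows: dict mapping business name -> Window (hypothetical values).
    emergency_approved: True only after the rigorous emergency approval procedure.
    """
    # A change touching several businesses needs a time that lies inside
    # the maintenance window of every affected business.
    in_all_windows = all(maintenance_windows[b].contains(t) for b in affected)
    return in_all_windows or emergency_approved


# Hypothetical windows, for illustration only.
WINDOWS = {
    "insurance": Window(time(1, 0), time(5, 0)),
    "banking": Window(time(2, 0), time(4, 0)),
}
```

With these sample values, a change affecting both insurance and banking is acceptable only between 02:00 and 04:00, the overlap of the two maintenance windows.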
5.9.2.3 Business contingency plan
To sustain service availability, the data center has established a business contingency plan,
which provides differentiated contingency protection based on the criticality of businesses
serviced. For example, Class I systems (or traditional structures) are protected with both remote
and local backups. Furthermore, resource investment is differentiated based on the pre-established
recovery point objective (RPO) and recovery time objective (RTO), such that Class I systems are
capable of sustaining business continuity. For Internet-financial-service-oriented structures and
applications, multiple remote backups and double local backups are planned to ensure that Class I
systems are capable of sustaining business continuity. In addition, the corporate contingency
planning department carries out contingency drilling every year, to ensure that the data center can
sustain business system continuity.
5.9.2.4 Change management procedure
According to industry data, 70% of data center failures are caused by human errors. As a critical
component of the group’s IT system, the data center may impact the entire group’s business
systems if any of its parts fails. Therefore, the data center has implemented a rigorous control of
changes. Changes are categorized into the following categories according to their characteristics:
routine, normal, and major changes. Routine changes are initiated by the engineer on duty and
subject to approval of the reporting line manager. Normal changes are subject to review by the
engineer on duty and reporting line manager and approval of the departmental manager. For a
major change, the engineer on duty shall prepare an implementation plan, which is subject to
review by the reporting line manager and department manager and elaboration and approval of the
Change Approval Board (CAB) of the data center. Thus, delicacy management of changes can be
realized.
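The three approval chains described above can be written down as data. The following is a minimal sketch, not the data center's actual tooling; the enum, the role labels, and the `next_step` helper are names chosen for illustration.

```python
from enum import Enum


class ChangeType(Enum):
    ROUTINE = "routine"
    NORMAL = "normal"
    MAJOR = "major"


# Ordered sign-off steps per change category, following the text.
# Role labels are shorthand chosen for this sketch.
APPROVAL_CHAIN = {
    ChangeType.ROUTINE: [
        ("approve", "reporting line manager"),
    ],
    ChangeType.NORMAL: [
        ("review", "engineer on duty"),
        ("review", "reporting line manager"),
        ("approve", "departmental manager"),
    ],
    ChangeType.MAJOR: [
        ("review", "reporting line manager"),
        ("review", "departmental manager"),
        ("approve", "Change Approval Board"),
    ],
}


def next_step(change_type, completed_steps):
    """Return the next (action, role) pair, or None when fully approved."""
    chain = APPROVAL_CHAIN[change_type]
    if completed_steps >= len(chain):
        return None
    return chain[completed_steps]
```

Keeping the chains in a table rather than in branching code makes it easy to see that the required sign-off becomes heavier as the change category becomes more serious.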
Change management is one of the four core tasks in data center day-to-day management, the
other three being incident management, problem management, and configuration management,
which have already been covered in the previous chapters.
5.9.2.5 Equipment/system access authority classification system
Ping An Data Center runs the business systems of Ping An Group’s professional companies.
To ensure data security, the data center has implemented a system access authority classification
system. Specifically, the data center management personnel are only permitted to change the data
center’s equipment operating environment, physical wiring, and equipment location; hardware
management personnel have the authority to manage hardware only; operating system
management personnel have the authority to manage operating systems only; application
operations personnel have the authority at the application level only; development personnel do
not have access to production systems. Thus, an employee has the authority to manage the data
center components related to his work only, having no access to the entire system. In addition, the
system is managed by different functional units in different operating environments (development,
staging, production, and contingency). In cases where production data are required in a testing
environment, the data must be desensitized. With the access authority classification system,
personnel and environment authorities are minimized, such that intentional disclosure, tampering,
and embezzlement of user data can be minimized.
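The separation of duties above amounts to a least-privilege lookup. A minimal sketch follows, assuming each role maps to a fixed set of scopes; the role and scope identifiers are hypothetical labels, not names used by the data center.

```python
# Scope labels and role names are shorthand for this sketch.
ROLE_SCOPES = {
    "data_center_management": {"environment", "physical_wiring", "equipment_location"},
    "hardware_management": {"hardware"},
    "os_management": {"operating_system"},
    "application_operations": {"application"},
    "development": set(),  # no access to production systems
}


def may_manage(role: str, scope: str) -> bool:
    """Least-privilege check: unknown roles get no access."""
    return scope in ROLE_SCOPES.get(role, set())
```

Denying by default (an unknown role or scope yields `False`) reflects the stated aim that an employee can manage only the components related to his or her work.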
5.9.2.6 Information security management system
An outstanding data center needs to ensure not only operations stability but also information
security. Information security is particularly important for financial data centers. Ping An Data
Center has established two zero-tolerance objectives for information security: zero tolerance of
major regulatory compliance issues and zero tolerance of major information security issues. To
achieve this, the data center has established a document (file) classification system. Documents,
whether in hardcopy or electronic, are classified into the following categories: secret, classified,
and highly confidential. The position of document control officer is specially set for controlling
the documents of the data center. Defective hard disks that need to be taken out of the data center
shall be demagnetized or physically damaged to prevent data disclosure. For solid-state drives that
cannot be demagnetized to prevent data disclosure, the manufacturer has agreed contractually to
have them serviced without the need to return them to the manufacturer and the manufacturer’s
engineers cannot take them away from the data center—all defective drives are reclaimed and
physically destroyed by the data center in a centralized manner. Magnetic tapes for data backup
purposes must be written using encryption technologies. For such a magnetic tape to be
transferred for storage in a different site, it must be placed in a special-purpose magnetic tape
storage box, the box must be locked, and the handover form must be signed and locked in the box.
The box must be escorted during transportation by a qualified security company that has signed a
nondisclosure agreement with the company.
5.9.2.7 Audits of day-to-day operations
To assess its day-to-day operations, Ping An Data Center conducts an internal audit every
quarter. It also employs the corporate information security department and well-known
organizations such as BSI and Ernst & Young to audit its information security and day-to-day
operations systems. Issues identified in such audits must be remedied as part of the continual
improvement process, such that the effectiveness of the management system of the data center can
be sustained. The data center has now been certified to the ISO 9001, ISO 20000, ISO 27001, and
Uptime M&O standards.
5.9.3 Integrated data center management system
5.9.3.1 The necessity to develop a unique integrated data center (IDC) management system
Ping An Data Center operates more than 100,000 units of equipment and approximately 1,000
business systems. Manual labor alone cannot sustain stable operation of the data center. The
challenges can be summarized in the following five aspects:
1) With the multitude of equipment units and many systems running in the data center,
manual labor alone cannot satisfy the requirements of business system operations;
2) Different persons have different skill levels and, therefore, may yield different outcomes
for the same task;
3) The same person may yield different outcomes for the same task under different conditions,
states of mind, or times;
4) There is no effective way to pass experience from one person to another;
5) It is difficult to realize standardized operation.
Therefore, it is necessary to develop an effective data center management system, such that
standardized management can be realized. With the data center operations training provided by
Ping An Data Center, the development personnel have developed an IDC visual management
system and computer room management app, which contribute toward improved computer room
management efficiency and standardized delicacy management of the data center.
5.9.3.2 Delicacy management of Ping An Data Center
5.9.3.2.1 Integrated delicacy management
With the IDC visual management system, the data center can understand the real-time status
of used power and space resources and layout of business systems. In addition, big-data
technology is employed to analyze the historic data and development trends of business systems.
Thus, future requirements for rack resource by each business system can be predicted. This
facilitates proactive capacity expansion, flexible allocation of data center resources for business
systems in a holistic manner, and integrated delicacy management of the data center. The figure
below shows the operating condition of a module of the data center.
Fig. 5.9-1
5.9.3.2.2 Delicacy management by module
With the IDC visual management system, the data center can understand the used capacity
and power consumption of each rack in real time as well as the current condition and future trend
by data center site. Thus, delicacy management of the various modules of the data center can be
realized.
Fig. 5.9-2
5.9.3.2.3 Delicacy management by rack
With the IDC visual management system, the data center can understand the used capacity,
power consumption, and equipment operating condition in each rack in real time. This is
subsequently combined with characteristics analysis of each rack’s functional areas and each
business as well as big-data analysis. Given that each rack has a maximum power of 6 kW
and a height of 46 U, 18 servers can be placed in every rack in the VXLAN framework or in the
TOR DB framework, 15 servers in every rack in the GBD framework, and 16 servers in
every rack that hosts a financial cloud platform. Based on these data and the characteristics of
applications, each unit of each rack can be utilized to its full capacity, thereby facilitating delicacy
management at the rack level.
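The rack figures above imply an average power budget per server. The back-of-the-envelope calculation below assumes that the 6 kW limit is shared evenly among the servers and ignores switches and other in-rack equipment; that simplification is an assumption of this sketch, not a statement from the data center.

```python
RACK_POWER_W = 6000  # maximum power per rack, from the text
SERVERS_PER_RACK = {  # server counts per rack, from the text
    "VXLAN": 18,
    "TOR DB": 18,
    "GBD": 15,
    "financial cloud": 16,
}


def power_budget_per_server(framework: str) -> float:
    """Average watts available per server if the rack limit is shared evenly."""
    return RACK_POWER_W / SERVERS_PER_RACK[framework]


for name in SERVERS_PER_RACK:
    print(f"{name}: {power_budget_per_server(name):.0f} W per server")
```

Under that assumption, the budget is roughly 333 W per server for the 18-server layouts, 400 W for the GBD framework, and 375 W for the financial cloud platform.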
5.9.3.3 Equipment location automatic distribution system
Ping An Data Center has developed an equipment location automatic distribution system,
according to the characteristics of its server deployments, which arrive in small but frequent
batches, and the principle of fully utilizing space and existing wiring.
The design principle of the rack location automatic distribution system is as follows: the
feasibility of installing a server into a rack is based on the equipment specification (the rack space
and power capacity of the same equipment type), as well as the analysis of zoning, power
consumption, and available rack space.
Fig. 5.9-3
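The design principle of the distribution system can be sketched as a feasibility check followed by a selection rule. The sketch below is illustrative only: the `Rack` and `Equipment` fields, the function names, and the best-fit selection rule are assumptions, and free rack space is treated as a single number of units rather than as contiguous slots.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rack:
    name: str
    zone: str
    free_units: int       # free rack space in U
    free_power_w: float   # remaining power headroom in watts


@dataclass
class Equipment:
    model: str
    zone: str             # zone the equipment must be placed in
    height_units: int     # rack space needed by this equipment type
    power_w: float        # power drawn by this equipment type


def fits(rack: Rack, eq: Equipment) -> bool:
    """Feasibility check: zoning, rack space, and power must all allow it."""
    return (
        rack.zone == eq.zone
        and rack.free_units >= eq.height_units
        and rack.free_power_w >= eq.power_w
    )


def choose_rack(racks: List[Rack], eq: Equipment) -> Optional[Rack]:
    """Pick the feasible rack with the least free space left (best fit).

    Best fit is an assumption of this sketch, chosen to reflect the
    stated goal of fully utilizing space.
    """
    candidates = [r for r in racks if fits(r, eq)]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.free_units)
```

A rack in the wrong zone is rejected even when it has ample space and power, which mirrors the zoned management described in Section 5.9.2.1.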
Chapter 6 Operations Quality Assurance System
This chapter introduces approaches to test the operations quality of the data center, including
an internal audit by the security department of the group, an internal audit conducted in the form
of a crosscheck between different teams of the data center, and external audits for M&O, ISO
9001, ISO 27001, and ISO 20000 certification.
6.1 Internal audit
Internal audits, sometimes called first-party audits, are conducted by, or on behalf of, the
organization itself for management review and other internal purposes, and can form the basis for
an organization’s declaration of conformity. In many circumstances, and in small organizations in
particular, independence can be demonstrated by having internal audits conducted by personnel
not responsible for the activity being audited.
There are two types of internal audits in the data center: those at the data center level and
those at the corporate level.
6.1.1 Internal audit at the data center level
Internal audits of the data center are conducted quarterly in the form of a crosscheck between
different data center sites and between different teams of the same data center site. Internal audits
are conducted strictly according to the pre-established standard procedure, in order to review and
assess the conformity and effectiveness and ensure continuous effective operation of the quality
management system and provide input for quality system improvement.
Responsibilities
(1) Accountable Role in the data center: taking corrective actions against nonconformities
identified in internal audits.
(2) Internal Auditor: conducting internal audits against the Data Center Internal Audit
Checklist.
(3) Lead Internal Auditor: planning for internal audits, leading the internal audit team to audit
the quality management system, chairing opening and closing meetings for internal audits,
preparing internal audit reports, and following up on corrective actions.
(4) Management Representative: reviewing annual internal audit plans and audit reports,
submitting them to the Data Center Manager for approval.
(5) Data Center Manager: approving annual internal audit plans and internal audit reports.
Audit procedure
Audit plan
The Lead Internal Auditor shall prepare an annual audit plan and submit it for discussion at
the management review. The plan shall ensure that
(1) a minimum of four internal audits are conducted each year;
(2) all the requirements of ISO 9001 are covered in a period of one year;
(3) audits are focused on areas with frequent occurrence of nonconformities;
(4) audits are conducted independently or auditors are not responsible for the activity audited;
(5) audits are conducted in a timely manner for the occurrence of major quality defects or
major changes to the quality management system, including changes to documentation,
organization structure, operations procedures, and products (services);
(6) the schedule, frequency, and scope of audits are defined.
(7) The plan is subject to approval of the Data Center Manager. The Management
Representative shall communicate the plan to all personnel in the data center.
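Two of the plan requirements above, the minimum of four audits per year and auditor independence, lend themselves to an automated check. The sketch below is a hypothetical illustration; the `PlannedAudit` structure and the `plan_problems` function are not part of the procedure itself.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Set


@dataclass
class PlannedAudit:
    when: date
    audited_area: str
    auditors: Set[str]
    area_owners: Set[str]   # people responsible for the audited activity


def plan_problems(audits: List[PlannedAudit], year: int) -> List[str]:
    """Return a list of rule violations found in the annual plan."""
    problems = []
    in_year = [a for a in audits if a.when.year == year]
    if len(in_year) < 4:
        problems.append("fewer than four internal audits in the year")
    for a in in_year:
        # An auditor must not be responsible for the activity audited.
        conflicted = a.auditors & a.area_owners
        if conflicted:
            problems.append(
                f"{a.audited_area}: auditor(s) {sorted(conflicted)} "
                "are responsible for the audited activity"
            )
    return problems
```

An empty result means the plan passes these two checks; the remaining requirements, such as coverage of all ISO 9001 clauses, would need additional data and are left out of this sketch.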
Audit preparation
(1) The Management Representative shall establish an internal audit team and designate a
Lead Auditor one month in advance of a planned audit.
(2) The Lead Auditor is responsible for assignment among the auditor team. An internal
auditor should have no direct responsibility for the object (department or position) being
audited.
(3) The audit plan (prepared by the Lead Auditor and approved by the Management
Representative) should be communicated to the departments and persons to be audited at
least one week in advance. The audit plan should include the auditee, scope, date, and
criteria of the audit as well as the assignment among the auditors.
(4) If an auditee does not agree with the audit plan, he can request the audit team to change
the plan within two days of the receipt of the plan. Changes to the plan should be based
on mutual consultation.
(5) The Lead Auditor should ensure that the auditors use the latest version of the Data Center
Internal Audit Checklist for the audit.
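The lead times in the preparation steps above can be turned into concrete dates. The sketch below is an illustration only; it assumes that "one month" means 30 calendar days and that all periods are counted in calendar days, neither of which is stated in the procedure.

```python
from datetime import date, timedelta


def preparation_deadlines(audit_date: date) -> dict:
    """Latest dates for the preparation steps before a planned audit."""
    return {
        # "One month in advance" is approximated as 30 days (assumption).
        "appoint_team_and_lead_auditor": audit_date - timedelta(days=30),
        "communicate_audit_plan": audit_date - timedelta(days=7),
    }


def objection_deadline(plan_received: date) -> date:
    """Last day on which an auditee may request a change to the plan."""
    return plan_received + timedelta(days=2)
```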
Implementing an internal audit
Participants of an internal audit opening meeting include all the auditors, auditee
representatives, main auditee contacts, the Management Representative, and top managers (where
necessary). An opening meeting may not be necessary for a crosscheck between local teams but is
mandatory for a crosscheck between different data center sites. The opening meeting is chaired by
the Lead Internal Auditor and should cover:
(1) introduction of the auditors and the assignment among them (undertaken by the Lead
Auditor);
(2) restatement of the scope, criteria, and purpose of the audit;
(3) a brief introduction of the audit methodology;
(4) request for assistance required from the auditees;
(5) clarification on the audit plan.
On-site audit
(1) Internal auditors conduct the audit against the Data Center Internal Audit Checklist. They
may conduct the audit through sampling check of records, on-site observation, interview,
and check of documents.
(2) If any issue is identified during the audit, the auditor should confirm the issue with the
person-in-charge or operator and thereafter record it in the Data Center Internal Audit
Checklist. This is intended to facilitate the understanding and remedy of nonconformities.
(3) At the end of on-site audit (prior to the closing meeting), the Lead Auditor should conduct
an audit team meeting to summarize the audit findings and confirm the nonconformities
identified during the audit.
Closing meeting
Participants of a closing meeting include all the auditors, auditee representatives, main
persons involved in the audit, the Management Representative, and top managers (where
necessary). A closing meeting may not be necessary for a crosscheck between local teams but is
mandatory for a crosscheck between different data center sites. The closing meeting is chaired by
the Lead Internal Auditor. It is intended to provide a summary of the audit. A closing meeting
should cover the following aspects:
(1) restatement of the scope, criteria, and purpose of the audit;
(2) clarification on audit findings to the auditees;
(3) nonconformities identified during the audit and their supporting evidence;
(4) conclusions and proposals by the audit team;
(5) clarification on the corrective action process for nonconformities (undertaken by the Lead
Auditor).
Audit report
(1) The Lead Auditor should prepare an internal audit report for the audit. It is intended to
summarize the audit, statistically analyze the nonconformities, identify areas of concern
and opportunities for improvement, and propose areas to be focused on during the
subsequent audit.
(2) The Lead Auditor submits the report to the Management Representative and sends a copy
to the Data Center Manager.
(3) The Lead Auditor communicates the audit findings to the auditees.
(4) The Lead Auditor follows up on corrective actions.
(5) The auditees should provide corrective action plans for nonconformities and opportunities
for improvement identified during the audit. A corrective action plan should:
* be preventive in nature to avoid the occurrence of similar nonconformities;
* provide clear and practical actions, whose effectiveness is measurable;
* provide a timetable for each action to be taken.
(6) The Lead Internal Auditor uses the Data Center Internal Audit Checklist to track the
planned corrective actions. A corrective action will be closed when it is verified to be
effective. If a corrective action is not effective, the Lead Auditor should request the
person-in-charge to take another action. This process is defined in the Analysis and
Improvement Procedure.
(7) The Lead Auditor should update the Management Representative and Data Center
Manager on the status of the corrective actions and pay attention to the existence of
similar issues during the subsequent audit.
(8) The Lead Internal Auditor should hand over all the internal audit records to the Document
Controller, as defined in the Quality Record Control Procedure.
(9) The results of the internal audit should be included in management review, as defined in
the Management Review Procedure.
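The follow-up rule in item (6), closing an action once it is verified effective and otherwise requesting another action, can be sketched as a small state model. The class names, status labels, and `verify` method below are hypothetical and are not taken from the Analysis and Improvement Procedure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CorrectiveAction:
    description: str
    due: str                 # timetable entry, e.g. an ISO date string
    status: str = "open"     # "open", "closed", or "superseded"


@dataclass
class Nonconformity:
    summary: str
    actions: List[CorrectiveAction] = field(default_factory=list)

    def verify(self, effective: bool,
               replacement: Optional[CorrectiveAction] = None) -> None:
        """Record the Lead Auditor's verification of the latest action."""
        current = self.actions[-1]
        if effective:
            current.status = "closed"
        else:
            # An ineffective action must be followed by another action.
            if replacement is None:
                raise ValueError("an ineffective action needs a replacement action")
            current.status = "superseded"
            self.actions.append(replacement)

    @property
    def closed(self) -> bool:
        return bool(self.actions) and self.actions[-1].status == "closed"
```

Keeping superseded actions in the list preserves the history that the subsequent audit can consult when checking for similar issues.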
Reference documents:
<Data Center Internal Audit Checklist>
<Analysis and Improvement Procedure>
<Quality Record Control Procedure>
<Management Review Procedure>
6.1.2 Corporate internal audit
Corporate internal audits mainly cover information security, as shown in the table below.
Table 6.1-1 Checklist of data center data for internal audit
No. | Data type | Data description | Period covered | Remarks
1 | Data center environment | Data center construction planning and site selection as well as profile | |
2 | | Layout of the data center | |
3 | | Layout of lightning protection devices | |
4 | | Layout of smoke detectors and temperature sensors | |
5 | | Layout of water piping and leakage sensors | |
6 | | Layout of firefighting devices | |
7 | | Layout of surveillance cameras | |
8 | | Physical environment security evaluation reports | |
9 | | Layout of air-conditioning chilled-water pipes | |
10 | Data center management | Job descriptions of data center management positions | |
11 | | Service provider selection, management, and evaluation records | |
12 | | Equipment procurement contracts | |
13 | | Contracts with telecommunication operators | |
14 | | Checklist of data center equipment/assets | |
15 | | Applications, gate passes, and receipts related to equipment moving in and out of the data center | |
16 | | Equipment acceptance records | |
17 | | Equipment disposition records | |
18 | | Records of tapes received in and delivered out of the media room | |
19 | | Media checklist and inventory check records | |
20 | | Tape demagnetization records | |
21 | Visitor records | Granted data center accesses checklist | |
22 | | Applications for data center access | |
23 | | Records of deleted data center access | |
24 | | Data center access review records | |
25 | | Application for temporary data center access | |
26 | | Registration of data center visitors | |
27 | | Signed letters of confidentiality for data center access | |
28 | | Data center access system log/record | |
29 | Data center operations | Data center routine check records | |
30 | | Data center equipment patrol inspection records | |
31 | | Equipment maintenance and service records | |
32 | | Emergency exits opening/alarming records | |
33 | | List of issues with the data center | |
34 | | Problem/failure handling processes | |
35 | Operations system | Checklist of data center systems | |
36 | | Master list of granted accounts and accesses to operations systems | |
37 | Drilling reports | Firefighting drilling reports | |
38 | | Power outage drilling reports | |
40 | | Diesel generator drilling reports | |
41 | Management systems | ISO quality management documents and operation manuals | |
42 | | Service provider selection/management/evaluation standards/systems | |
43 | | Visitor registration procedure | |
44 | | Inspection standards for portable fire extinguishers | |
6.2 External audits
External audits include those generally called second- and third-party audits. Second-party
audits are conducted by parties having an interest in the organization, such as customers, or by
other persons on their behalf. Third-party audits are conducted by external, independent auditing
organizations such as those providing certification/registration of conformity with ISO 9001 or
ISO 14001.
The external audits of Ping An Data Center include those for M&O, ISO 9001, ISO 27001,
and ISO 20000 certification and certification renewal.
6.2.1 Audit for M&O certification renewal
The M&O standard provides an overall standardized management configuration for data
center operations and management from multiple dimensions, frameworks, perspectives, and
levels. The standard also includes detailed standardization requirements at the training, drilling,
planning, adjustment, and practical operation levels of the operations and management system, in
order to improve the management competency of operations personnel and sustain high service
levels of data centers.
The M&O certification is valid for two years. To maintain the certification, the data center is
subject to a certification renewal audit of its processes and systems every two years. The audit is
based on a scoring system, and a minimum score of 80 is required to pass the audit.
The audit covers 20 sub-categories in five categories as shown in the table below.
Table 6.2-1 M&O audit checklist
No. | Category | Sub-category | Required information
1 | Staffing and organization | Staffing | Staffing plan (number and responsibilities)
 | | | Escalation and call procedure (between internal parties and between the data center and vendors)
2 | | Qualification | Training certificates and records
 | | | Assignment of responsibilities (responsible area, training, and security)
3 | | Organization | Organizational chart, including the following information:
 | | | - Detailed organizational chart at the infrastructure level
 | | | - Detailed organizational chart at the data center level (infrastructure, IT, and security departments)
 | | | - Job descriptions for infrastructure-related positions
4 | Maintenance | Preventive maintenance | Checklist and timetable for preventive maintenance
 | | | Preventive maintenance methods
 | | | Work orders for preventive maintenance
 | | | Calibration of testing tools
 | | | Checklist of critical spare parts and points of order
 | | | Process for switching between redundant components
5 | | Housekeeping policy | Housekeeping policy for the main computer room
6 | | Maintenance management system | Completion rate of preventive maintenance
 | | | Open-loop and closed-loop working processes
7 | | Vendor support | Approved vendor list and SLA
8 | | Deferred maintenance plan | Deferred maintenance checklist
 | | | Deferred maintenance procedure
9 | | Predictive maintenance |
10 | | Life cycle management |
11 | | Failure analysis procedure | History record of power outages and corrective actions
12 | Training | Data center employee training | Tabulated training needs by job position
 | | | Training participation record
 | | | Training course syllabuses
13 | | Vendor training | Tabulated vendor training needs
 | | | Participation record
 | | | Training course syllabuses
14 | Planning, coordination, and control | Data center policy | Data center policy
 | | | - Standard operation procedures
 | | | - Emergency response procedures
 | | | - Configuration control procedures
15 | | Financial management | Development planning and budgeting procedure
16 | | Reference library | Library access control
 | | | Data update procedure
17 | | Main computer room management | Computer room planning and growth requirements
 | | | Power and cooling control procedure
 | | | IT facility commission and decommission control procedure
18 | Operating conditions | Load management | Load management policy
19 | | Operation configuration point | Operation configuration policy
20 | | Equipment rotation |
6.2.2 ISO 9001 audit
What is meant by an ISO 9001 certificate?
ISO 9001 specifies requirements for a quality management system when an organization
needs to demonstrate its ability to consistently provide products and services that meet customer
and applicable statutory and regulatory requirements, and aims to enhance customer satisfaction
through the effective application of the system, including processes for improvement of the
system and the assurance of conformity to customer and applicable statutory and regulatory
requirements.
What is not meant by an ISO 9001 certificate?
(1) Note that the requirements specified in ISO 9001 are for the quality management system
of an organization, not products or services of an organization. ISO 9001 certification
should enhance an organization’s confidence in consistently providing products and
services that satisfy customer and applicable statutory and regulatory requirements.
However, the certification does not guarantee that an organization has realized 100%
product compliance, although this is the permanent goal of an organization.
(2) ISO 9001 certification does not indicate an organization’s ability to provide high-quality
products or services or the certification of its products or services to the ISO standard or
any other standards or specifications.
Purpose, scope, and criteria of audit
The audit aims to ensure that the management system of an organization can effectively and
consistently satisfy the requirements of the management system standard, enhance the
organization’s confidence, demonstrate the organization’s ability to comply with legal, regulatory,
and contractual requirements as well as the organization’s pre-set targets, and confirm the
continuous effectiveness and suitability of proactive plans, through proactive, evidence-based
monitoring. It is applicable to the scope of the management standard. If an audit is part of a
multiple-site audit, the final recommendation for certification is based on the findings at all the
sites.
The scope of the audit includes an organization’s documented management system as
required in ISO 9001 as well as the locations and areas covered by the management system (to be
indicated in the audit plan).
Definition of audit findings:
(1) Nonconformity:
Non-fulfillment of a requirement
(2) Major nonconformity:
This indicates a nonconformity that compromises the ability of the management system to
realize an expected outcome. A nonconformity can be categorized as a major nonconformity
in any of the following conditions:
A nonconformity results in serious doubt about the effectiveness of process control or the
compliance of products or services with requirements;
Several minor nonconformities are related to the same requirement or issue and indicate
the existence of a systematic failure.
(3) Minor nonconformity:
This indicates a nonconformity that does not compromise the ability of the management
system to realize an expected outcome.
(4) Opportunity for improvement
This is an auditor’s evidence-based statement-of-fact about a weakness or latent defect of the
management system that, if not improved, may develop into a nonconformity in the future.
The certification organization can provide independent general information for process and
system improvement, including interpretation of the meaning and intent of the standard,
explanation about relevant theories, methods, techniques, and tools, and sharing of
non-confidential best practices of the industry. However, it does not provide specific solutions for
particular problems.
(5) Observation:
This is only applicable to certification programs where the certification organization is not
allowed to include opportunities for improvement in audit findings. An observation is an
auditor’s statement-of-fact about a weakness or latent defect of the management system that,
if not improved, may develop into a nonconformity in the future.
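The grading rules for nonconformities defined above can be sketched as follows. This is an illustration only: the `Finding` tuple, the function name, and in particular the `several` threshold are hypothetical, since the definitions do not say how many related minor nonconformities indicate a systematic failure.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Each finding: (requirement_id, compromises_expected_outcome)
Finding = Tuple[str, bool]


def grade_findings(findings: List[Finding], several: int = 3) -> Dict[str, str]:
    """Grade each requirement as 'major' or 'minor'.

    `several` is a hypothetical threshold for "several minor
    nonconformities related to the same requirement"; the source
    does not give a number.
    """
    minor_counts = Counter(req for req, serious in findings if not serious)
    grades: Dict[str, str] = {}
    for req, serious in findings:
        if serious or minor_counts[req] >= several:
            grades[req] = "major"
        else:
            # Do not downgrade a requirement already graded as major.
            grades.setdefault(req, "minor")
    return grades
```

In practice the judgment of whether repeated minor findings indicate a systematic failure rests with the auditor, so a fixed threshold like this one is best treated as a prompt for review rather than as a rule.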
6.2.3 ISO 27001 audit
The ISO 27001 standard for information security management systems is now the most
widely implemented information security management standard. It is developed from the British
standard BS 7799. Its latest version is ISO 27001:2013.
Ping An Data Center obtained ISO 27001 certification in 2008 and has maintained the
certification since then. It has undergone annual surveillance audits and certification renewal
audits (the certificate is valid for three years) by professional certification organizations.
The ISO 27001 certification has the following values:
(1) sustaining business capacity through the definition, evaluation, and control of risks;
(2) minimizing liabilities that may result from the breach of contracts and violation of legal
and regulatory requirements;
(3) improving business competitiveness and image by demonstrating compliance with the
international standard;
(4) clearly defining internal and external information access control to prevent information
misuse and loss;
(5) establishing a policy for the use of security tools;
(6) preventing the loss of technical know-how;
(7) enhancing information security awareness inside the organization;
(8) serving as evidence for public accounting audit.
The fact that the data center has maintained the certification demonstrates its commitment to
information security and its successful efforts in information security protection. More
importantly, the certification program contributes to better information security management
in the company.
6.2.4 ISO 20000 audit
ISO 20000 was developed by the International Organization for Standardization based on ITIL best
practices and BS 15000. Released on December 15, 2005, it is the first international standard for
IT service management systems. It aims to sustain IT service quality through management and
standardization of service processes, and organizations can be certified against it to
demonstrate their IT service capability and quality.
However, the value of ISO 20000 certification is not limited to satisfying IT service
requirements and enhancing service quality. The certification also has positive implications in
quantifying services, appraising employee performance, and evaluating the return of IT
investments.
The operations service system of Ping An Data Center has been certified to ISO 20000. This
indicates that the data center’s operations service management capacity has been recognized by
leading international authorities. The certification also contributes to better service management
of the data center and better service management awareness of its personnel. The better service
and operations management of the data center in turn facilitates the long-term business
development of Ping An Data Center.
The data center will maintain the ISO 20000 certification by undergoing annual audits and
certification renewal audits (every three years).