White Book of the High-Availability Operations of Ping An ...ftps.zhiding.cn/files/3/26086.pdf ·...

White Book of the High-Availability Operations of

Ping An Data Center

May 2018

Preface by the Authors

With more than a decade’s development, the data center of Ping An of China (“Ping An

Data Center”) boasts a well-established operations system that is compliant with multiple

international standards such as ITIL, ISO 9001, ISO 20000, ISO 27001, and M&O. Owing to the

conscientious efforts of the operations team toward meticulously following and continuously

improving the working rules and processes of the system, the data center has sustained a high

level of availability.

We express our special appreciation to the staff and vendors of the data center for their

relentless hard work for the maintenance of high-availability service.

This White Book of the High-Availability Operations of Ping An Data Center embodies more than a decade of the operations team's experience in sustaining the high availability of the data center. It is part of Ping An's effort to fulfill its social responsibility: the book aims to summarize and share the experience of Ping An Data Center in developing and maintaining a high-availability Internet finance data center. We believe that data centers in China, particularly those in the finance and banking sector, can benefit from the experience shared here to improve their operations management and sustain high availability.

We hope that the book can serve to mobilize industry players and experts to make concerted

efforts for China’s development in the big-data age.

We would like to acknowledge the support of Zhong Jinghua, Leader of the China Data Center Committee (CDCC), and Philip Hu, Managing Director - North Asia of Uptime Institute, as well as the hard work of the compilation team of the book.

We will be grateful for feedback regarding any errors or omissions in the book.

Data Center of Ping An Technology (Shenzhen) Co., Ltd.

Preface by Zhong Jinghua

Ping An of China began to plan and construct a data center in Guanlan, Shenzhen in 2009. I

was fortunate to be appointed the chief designer of the project. Having been involved in the

entire construction process, I witnessed the great efforts made by the company to continuously

update its information technology to align closely with national strategies.

As one of the first financial companies to engage in data center construction, Ping An has

acquired an in-depth understanding of data center construction and operations and fostered a

pool of data center experts. This enables the company to be well prepared for the Internet

Finance 3.0 age and contributes enormously to the healthy development of the data center

industry in China.

The life cycle of a data center consists of the following phases: requirement analysis,

planning and design, construction and installation, testing and acceptance, and operations

management. Operations management is the last and longest phase of the life cycle. For a data

center to be successful, the operations management phase is, in some sense, more important than

the construction phases. Operations management should be considered from the time of

commencement of a data center project, or the requirements for operations management should

be built into the design and construction phases. In this sense, the scope of operations

management covers the entire life cycle of a data center, or the entire process of providing the

data service support required for attaining the development goals of the business.

This book is the crystallization of the continuous efforts of Ping An staff in the spirit of

remaining true to our original aspiration and keeping our mission firmly in mind.

Covering the operations standardization, best practices, organization structure, security

management, and quality system of the data center, this book embodies the devotion of Ping An

staff to the data center and their diligent pursuit of science. I appreciate the hard work of the

compilation team of the book and hope that readers can benefit from the knowledge shared in the

book.

Zhong Jinghua

Leader of China Data Center Committee (CDCC)

May 2018

Preface by Philip Hu

The Uptime Institute Tier Standard: Topology has been developed for nearly two decades.

This Standard describes four classifications (Tier I to IV) to evaluate and differentiate data center

infrastructure in terms of availability. Since its creation, this system has been widely adopted for

the design and construction of data centers across the world.

Suppose someone says: I need a data center for business development. Another person will

turn and say: I will build one for you. However, they are possibly not referring to data centers of

the same output performance. I have said on many occasions that the life cycle of a data center is

characterized by short design and construction phases, anywhere between a few months and one

or two years, but a long operations phase—one decade or even longer. So the guiding principle

of the Tier standard is to design, construct, and manage the operations of data centers to achieve

specific business objectives.

Uptime Institute’s annual industry surveys show that approximately half of companies’ in-house IT organizations experienced outages of in-house data centers that affected the business in the prior 12 months. Nearly one third of companies experienced outages of IT services outsourced to colocation centers. Most of the outage events are attributed to operator errors, which may have included program error, resource inadequacy, management deficiency, and inappropriate decision-making. These failures are often blamed on operators for their untimely or unsuccessful emergency response.

In most cases, however, such failures can be attributed to management decisions (for

example, design compromise, budget cut, staff reduction, vendor selection, and resource

allocation). Very often, an incident can be traced to an earlier time and place (a

causative incident). For example, one can question if a management decision has resulted in an

operator not being well prepared or adequately trained for the proper handling of the emergency

event in question.

With increasingly higher data service requirements from business functions, stakeholders of data center technology and facilities face constant pressure to deliver value while sustaining cost effectiveness and operational efficiency. The data center Management & Operations (M&O) certification therefore provides guidance, a framework, and best practices for achieving effective management and operations of data centers.

The M&O standard established for data center management and operations is applicable to

all teams, departments, cultures, and practices within the organization. It addresses staffing,

organization and training, preventive maintenance, and operational conditions as well as

planning, management, and coordination of practices and resources. In this sense, the standard

provides useful information to not only data center operations teams but also service providers

and top managers, helping them carry out their roles and responsibilities.

I am glad to see this white book, which is an achievement of the data center industry in China in general and, in particular, of Ping An Technology in developing operations standards for the in-house data center of Ping An Group. I expect that this book can provide substantial help to colleagues at the data center of Ping An Group.

Philip Hu

Managing Director - North Asia, Uptime Institute

May 2018

CONTENTS

Chapter 1 Introduction
  1.1 Purpose and scope
  1.2 Brief overview
Chapter 2 Operations Standardization
  2.1 Lean management: theories and methods
    2.1.1 The concept of lean management
    2.1.2 Lean management practices
  2.2 IT infrastructure library (ITIL) framework for operations
    2.2.1 Incident management
    2.2.2 Problem management
    2.2.3 Change management
  2.3 Uptime Management & Operations (M&O) program
    2.3.1 Staffing and organization
    2.3.2 Maintenance management
    2.3.3 Training management
    2.3.4 Planning, coordination, and control
    2.3.5 Operating conditions
Chapter 3 Security Management
  3.1 Information security
  3.2 Physical security management
    3.2.1 Physical security configuration
    3.2.2 Terminology and definition
    3.2.3 Procedure
    3.2.4 Site access registration system
    3.2.5 Control of goods
    3.2.6 Fire safety management system
  3.3 Personnel safety management
    3.3.1 Personnel safety training
    3.3.2 Day-to-day operational safety management
Chapter 4 Staffing and Staff Development
  4.1 Organizational structure
  4.2 Roles and responsibilities
  4.3 Staff training
    4.3.1 New-employee training
    4.3.2 Training plan
    4.3.3 Training procedure
  4.4 Staff development
    4.4.1 Routine training
    4.4.2 Special training
  4.5 Vendor management
    4.5.1 Vendor training
    4.5.2 Service level agreement (SLA)
    4.5.3 Vendor qualification
    4.5.4 Vendor performance evaluation
Chapter 5 Best Practices of High-availability Operations
  5.1 Routine check - overview
    5.1.1 Routine check - basic requirements
    5.1.2 Routine check - frequency and methods
    5.1.3 Routine check of medium- and low-voltage switchgears
    5.1.4 Routine check of uninterrupted power supplies (UPS)
    5.1.5 Routine check of precision power distribution systems
    5.1.6 Routine check of diesel generation systems
    5.1.7 Routine check of heating, ventilation, and air conditioning (HVAC) systems
    5.1.8 Routine check of firefighting systems
    5.1.9 Routine check of security systems
    5.1.10 Routine check of electronic monitoring systems
  5.2 Preventive maintenance - overview
    5.2.1 Preventive maintenance - general requirements
    5.2.2 Checklists for preventive inspection, maintenance, and operation
    5.2.3 Preventive maintenance - detailed schedules for key systems
  5.3 Predictive maintenance - overview
    5.3.1 Predictive maintenance - general requirements
    5.3.2 Predictive maintenance - high-level plan
  5.4 Emergency plan overview
    5.4.1 Emergency drill plan
    5.4.2 Emergency drill items
  5.5 System availability check
    5.5.1 Monthly check of data center facilities
    5.5.2 Data center room environment check
    5.5.3 Data center facilities operational information check
  5.6 Life cycle management
    5.6.1 Life cycle management - medium-voltage switchgears
    5.6.2 Life cycle management - low-voltage switchgears
    5.6.3 Life cycle management - transformers
    5.6.4 Life cycle management - diesel generators
    5.6.5 Life cycle management - uninterrupted power supplies (UPS)
    5.6.6 Life cycle management - chilled-water units
  5.7 Risk management
    5.7.1 Acronyms and definitions
    5.7.2 Risk identification and analysis
    5.7.3 Risk mitigation plan
  5.8 Asset management
    5.8.1 Challenges of asset management
    5.8.2 Systematic asset management
    5.8.3 Developing a unique asset management system for the data center
    5.8.4 Asset management system illustrated
    5.8.5 On-site asset control
  5.9 Day-to-day operations management
    5.9.1 Challenges of day-to-day operations
    5.9.2 Systematic day-to-day operations management
    5.9.3 Integrated data center management system
Chapter 6 Operations Quality Assurance System
  6.1 Internal audit
    6.1.1 Internal audit at the data center level
    6.1.2 Corporate internal audit
  6.2 External audits
    6.2.1 Audit for M&O certification renewal
    6.2.2 ISO 9001 audit
    6.2.3 ISO 27001 audit
    6.2.4 ISO 20000 audit

Chapter 1 Introduction

1.1 Purpose and scope

Entering the “Finance + Internet” 3.0 age, Ping An has launched a strategic initiative to further develop “Finance + Technology” and explore “Finance + Ecosystem” in the coming decade. Aiming to become a world-leading technology-powered personal financial services group, Ping An will focus on two industries, pan financial assets and pan health care, by applying four core enabling technologies (Artificial Intelligence, Blockchain, Cloud computing, and Security) across five ecosystems: financial services, health care, auto services, real-estate services, and smart city. As of 2017, the group had 436 million Internet users. Improving innovation-enabled customer service and enhancing the customer experience requires a data center of greater capacity and better performance.

To keep pace with the rapid development of “Internet + Finance,” Ping An has developed a network of data center infrastructure facilities covering all of China, with the core facilities located in Beijing, Shanghai, and Shenzhen. Ping An Data Center was constructed according to Class A of GB 50174, Code for Design of Electronic Information System Room, with reference to Tier IV of the Uptime Institute Tier Standard, and is equipped with sophisticated high-availability equipment, laying a good foundation for sustaining the high availability of the data center. Over more than a decade of development, the data center has accumulated extensive knowledge and experience in the planning, design, and operations of data center facilities.

Data center operations involve the practical management of changing environments. The operations of Ping An Data Center have evolved from standardized operations to lean operations and subsequently to customized, service-oriented operations. Through this three-stage evolution, driven by an operations team that is always ready for challenges and constantly pursues improvement, the data center has established an operations model with its own characteristics. The operations team continues to explore ways to improve the power usage effectiveness (PUE) and efficiency of the energy-saving, smart data center while sustaining its high availability.
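
For readers unfamiliar with the metric, PUE is commonly defined as the ratio of total facility energy to the energy consumed by IT equipment, so a value closer to 1.0 indicates less overhead for cooling, power distribution, and other support systems. The sketch below illustrates that ratio only; the function name and the meter readings are hypothetical and are not taken from Ping An Data Center.

```python
def power_usage_effectiveness(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Return PUE = total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    if total_facility_kwh < it_equipment_kwh:
        # The facility total includes the IT load, so it cannot be smaller.
        raise ValueError("total facility energy cannot be less than IT equipment energy")
    return total_facility_kwh / it_equipment_kwh


if __name__ == "__main__":
    # Hypothetical monthly meter readings, for illustration only.
    pue = power_usage_effectiveness(total_facility_kwh=1500.0, it_equipment_kwh=1000.0)
    print(f"PUE = {pue:.2f}")
```

Both readings must cover the same measurement period for the ratio to be meaningful.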

This white book is intended to share our experience in developing the standardized lean operations system of Ping An Data Center, which we hope will help other data centers improve their knowledge and operations capacity in order to sustain high availability.

Our experience in translating specific requirements of international standards applicable to

data centers—such as ISO 9001 and M&O—into tangible operational activities is also

included in this book. We hope that our practical experience in this regard will be helpful for

data centers seeking certification to these standards.

The target audience of this white book includes managers of finance data centers,

telecommunications data centers, data centers of network operators, and company in-house

data centers as well as readers involved in the operations of data center infrastructure

facilities.

1.2 Brief overview

This white book is structured as follows:

Introduction

This chapter states the purpose of this white book, which summarizes more than a decade of our experience in data center operations. The book serves two goals: safeguarding the reliable operations of our in-house data center to support future incremental growth, and sharing that experience with companies and individuals in the industry to help them establish operations systems that satisfy their specific business requirements.

Operations standardization

This chapter begins with an introduction to lean management theories and their application in data center operations from the perspective of operations standardization, followed by a description of the IT Infrastructure Library (ITIL) framework, including a detailed illustration of incident management, problem management, and change management.

This chapter also includes our program for Uptime Institute M&O certification, with its significance to data center operations illustrated in the following five aspects: staffing and organization; maintenance; training; planning, coordination, and control; and operating conditions.

Security management

Finance data centers require more stringent security control than data centers in other industries. This chapter illustrates these requirements from three perspectives: information security, physical security, and personnel security.

Staffing and staff development

This chapter describes the organization structure as well as the roles and

responsibilities defined for sustaining the high availability of Ping An Data Center.

This chapter also includes the training plan and training assessment system established

for ensuring that the data center staff is capable of fulfilling their job assignments.

This chapter ends with an introduction to the vendor management system, including

requirements for vendor qualification, service level agreements (SLA), and vendor

performance monitoring.

Best practices in high-availability operations

This chapter provides a detailed description of the following:

The frequency, contents, and requirements for the day-to-day check of various

infrastructure equipment of the data center;

The preventive maintenance of eight subsystems of the power-distribution system,

four subsystems of the heating, ventilation, and air conditioning (HVAC) system, and three subsystems of

the low-voltage system;

The purpose and significance of predictive maintenance as well as predictive

maintenance planning for the data center infrastructure;

The purpose and significance of data center reliability verification as well as

different types and methods of verification;

The life cycle management of medium-voltage switchgears, uninterrupted power

supplies (UPS), batteries, precision cooling units, and water chilling units, including the

procedures for their update, annual inspection, overhaul, renovation, and obsolescence;

The availability check and third-party functional verification of the data center; and

Risk management, asset management, and on-site control.

Operations quality assurance system

This chapter illustrates the approaches used to check the operations quality of the data center, including ISO 9001 quality system management, internal audits at the corporate and data center levels, and external audits for M&O certification and other purposes.

Chapter 2 Operations Standardization

Data center operations involve two major tasks: 1) maintenance of every element of the data

center to sustain its stability and 2) timely detection and handling of incidents to minimize

downtime.

Centering on these two major tasks, the operations of Ping An Data Center have been standardized by adopting the lean management methodology and incorporating the requirements of international standards such as ISO 9001, ISO 27001, ISO 20000, ITIL, and M&O. The current operations system, with its own characteristics, is the result of the experience gained and lessons learned during these efforts.
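
As a rough illustration of how downtime relates to availability, the sketch below computes an availability percentage over a reporting period from a list of outage durations. This is a generic calculation under simple assumptions (non-overlapping outages, a single service), not the method used by Ping An Data Center; the function name and sample figures are hypothetical.

```python
from datetime import timedelta


def availability_percent(period: timedelta, outages: list[timedelta]) -> float:
    """Return the share of the period, in percent, during which service was up."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    downtime = sum(outages, timedelta(0))  # assumes outages do not overlap
    if downtime > period:
        raise ValueError("total downtime exceeds the reporting period")
    uptime = period - downtime
    return 100.0 * (uptime / period)  # timedelta / timedelta yields a float


if __name__ == "__main__":
    # Hypothetical figures: one year with two short incidents.
    year = timedelta(days=365)
    incidents = [timedelta(minutes=30), timedelta(minutes=20)]
    print(f"Availability: {availability_percent(year, incidents):.4f}%")
```

Shortening either the number of incidents (through maintenance) or their duration (through timely detection and handling) raises the resulting figure, which mirrors the two tasks described above.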

2.1 Lean management: theories and methods

2.1.1 The concept of lean management

Lean management is, at its core, a culture. It is the natural result of the increasing division of labor and rising quality requirements in modern society. Scientific management has evolved through three stages: 1) standardized management, 2) lean management, and 3) personalized management.

2.1.2 Lean management practices

In the context of a data center, lean management is the process of breaking down the objective of high availability into tangible actions with well-defined responsibilities. The objective can thus be implemented effectively down to every element of the data center, and it serves as a major driving force for the team to improve its execution.

Lean operations involve every person in the organization. For lean operations to succeed, every person must be both a subject who carries out lean management and an object to which it applies.

To realize lean management, the data center has continuously fine-tuned the definition of roles

and responsibilities, configuration of the operations platform, equipment maintenance processes,

and customer services, by following the fundamental principle of precise, accurate, thorough, and

rigorous management. The efforts made toward lean management have resulted in better staff

qualifications and skills, more rigorous internal control, and improved stability and security of the

data center.

Precise management indicates an attitude of pursuing continuous improvement and

perfection of day-to-day tasks to maintain the optimal operation of the infrastructure and sustain

the high availability of the data center.

Accurate management indicates accurate and timely completion of tasks by carefully

following the standardized operations procedures. Accurate management also indicates

information accuracy—accurate physical status information of on-site equipment, accurate

identification and labeling, accurate clocks, accurate monitoring equipment data and operating

status, accurate instruments, accurate processes, and accurate manuals. This information is

necessary for risk identification and failure diagnostics and handling. Information accuracy has an

immediate impact on optimal equipment operation, timely failure handling, and prevention of

secondary failure resulting from human errors. The day-to-day maintenance of the infrastructure

equipment involves numerous tasks. Any change in the maintenance schedule is based on a

comprehensive risk analysis. Every maintenance task should be carried out in a timely manner

according to the pre-established maintenance schedule.

Thorough management indicates comprehensive and detailed definition of roles and

responsibilities for every operations task and detailed systems, specifications, and quality

assessment criteria as well as standardized manuals for maintenance, operation, and emergency

response, such that the security and reliability of the data center infrastructure can be ensured if

the manuals are followed step by step, even under the most disadvantageous conditions.

Rigorous management indicates rigorous and strict execution and quality control of all

the tasks, processes, systems, and rules for the operations of the data center. For data center

operations, excessive rigor is better than a lack of rigor.

Strictly following the requirements of lean management, the operations team of Ping An

Data Center has established a unique operations system and keeps improving on it by

continuously reviewing its processes, systems, specifications, and human resources, in order to

explore its potential and sustain the high availability of the data center.

2.2 IT Infrastructure Library (ITIL) framework for operations

The operations of Ping An Data Center are managed with reference to ITIL processes. Based

on years of experience, the most widely applied modules in the ITIL framework have been adopted

in our operations, including incident management, problem management, change management,

service request management, asset management, and security management. Considering the

importance of security and asset management to data center operations, these two modules will be

detailed in Chapters 3 and 5, respectively. The implementation of the incident management,

problem management, and change management modules in the data center will be described in this

chapter.

Incident management, problem management, and change management in Ping An Data Center

are all performed in its Service Bot system. The Service Bot system records information of

incidents, service requests, problems, and changes, including the serial number, reporter, time of

reporting, team in charge, person in charge, type of incident, source of incident, priority level,

detailed description, incident root cause analysis, solution, and other information of the handling

process.

An incident management form is generated in the Service Bot system, where an incident can be

escalated and tracked with reference to the interconnected parent incidents, problem records, service

requests, and change records.

The system tracks and records the status (newly created, assigned, being processed, pending

solution, resolved, or closed) and SLA information of incidents, problems, and changes, thereby

enabling a closed-loop control. The closing of an incident, problem, or change is subject to the

review and satisfaction assessment by the initiator.
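The status tracking described above can be pictured as a small state machine. The following is only a minimal sketch: the six states come from the text, but the allowed transitions, the Ticket class, and its field names are assumptions made for illustration and do not describe how the Service Bot system is actually implemented.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    NEW = "newly created"
    ASSIGNED = "assigned"
    IN_PROGRESS = "being processed"
    PENDING = "pending solution"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Assumed transition rules: the source lists the states but not the allowed moves.
ALLOWED = {
    Status.NEW: {Status.ASSIGNED},
    Status.ASSIGNED: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.PENDING, Status.RESOLVED},
    Status.PENDING: {Status.IN_PROGRESS, Status.RESOLVED},
    Status.RESOLVED: {Status.CLOSED, Status.IN_PROGRESS},
    Status.CLOSED: set(),
}


@dataclass
class Ticket:
    """Hypothetical record for an incident, problem, or change."""
    serial_number: str
    reporter: str
    description: str
    status: Status = Status.NEW
    history: list = field(default_factory=list)

    def move_to(self, new_status, initiator_approved=False):
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"{self.status.value} -> {new_status.value} is not allowed")
        # Closed-loop control: closing is subject to review by the initiator.
        if new_status is Status.CLOSED and not initiator_approved:
            raise PermissionError("initiator review is required before closing")
        self.history.append((self.status, new_status))
        self.status = new_status
```

Keeping the transition table separate from the record makes it easy to audit which moves are permitted and to record every move in the history.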

2.2.1 Incident management

Incident management in the data center aims to restore normal system operation as quickly as

possible and prevent disruption to the business in case of incidents, by following the pre-

established internal incident management process and measures.

The incident management process established in Ping An Data Center covers the reporting,

registration, classification, handling process, escalation mechanism, response mechanism, and status

control of incidents, with the entire handling process tracked and recorded using the Service Bot

system.

2.2.1.1 Classification of incidents

(1) Warning alarm: Defining the concept and scope of alarms

(2) Failure: Defining the concept and scope of failures that may occur in the data center

(3) Level I failure: A failure having direct impact on the reliability of business operations,

with reference to SLA requirements

(4) Level II failure: A failure occurring to a single piece of critical equipment of the

infrastructure (according to a pre-defined critical equipment list)

(5) Level III failure: An incident that threatens the normal equipment operation and security in

the computer room but has resulted in no actual impact

(6) Urgent Incident Office Center (UIOC): The major incident management process for

addressing application-level severe failures caused by abnormal hardware or software

operation of data center facilities

2.2.1.2 Incident detection

The infrastructure operations team obtains alarm information about the infrastructure,

operating system, and data center facilities through routine checks, remote monitoring, mobile

phone text messages, and phone calls. Upon receiving an alarm, the person in charge should go to

the scene of the alarm to obtain comprehensive information about the alarm. Any failure of the

infrastructure or operational environment should be immediately reported to the Infrastructure

Engineer on duty, who will decide the classification of the failure (Level I, II, or III) based on the

actual situation.

2.2.1.3 Reporting paths for the different levels of failures

A Level III failure is handled and followed up by the Infrastructure Engineer through

coordination with relevant technicians and service providers.

The Infrastructure Engineer should report a Level II failure within two minutes to the team

leader in charge, who will handle the failure and report the progress of failure handling to the

Management Representative in a timely manner.

The team leader in charge should report a Level I failure within two minutes to the

Management Representative, who will in turn report it within two minutes to the Data Center

Manager and update the progress of failure handling every two hours. The Data Center

Manager should circulate details of the failure to relevant leaders of the Company and decide

whether to initiate the UIOC process.
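The reporting paths above can be summarized as a table of hops with deadlines. This is an illustrative sketch only: the role names and the two-minute limits come from the text, while the data layout, the function names, and the assumption that the first hop of a Level I failure follows the same two-minute rule as Level II are hypothetical.

```python
from datetime import timedelta

TWO_MIN = timedelta(minutes=2)

# (from_role, to_role, deadline) for each escalation hop, per failure level.
REPORTING_PATHS = {
    3: [],  # handled and followed up by the Infrastructure Engineer
    2: [("Infrastructure Engineer", "Team Leader", TWO_MIN)],
    1: [
        # Assumption: the first hop uses the same limit as a Level II failure.
        ("Infrastructure Engineer", "Team Leader", TWO_MIN),
        ("Team Leader", "Management Representative", TWO_MIN),
        ("Management Representative", "Data Center Manager", TWO_MIN),
    ],
}


def worst_case_notification_delay(level):
    """Sum of the deadlines along the reporting path of a failure level."""
    return sum((hop[2] for hop in REPORTING_PATHS[level]), timedelta())


def late_hops(level, actual_delays):
    """Return the (from_role, to_role) hops whose actual delay exceeded the deadline."""
    hops = REPORTING_PATHS[level]
    if len(actual_delays) != len(hops):
        raise ValueError("one delay per hop is required")
    return [(src, dst) for (src, dst, limit), d in zip(hops, actual_delays) if d > limit]
```

Under these assumptions, a Level I failure reaches the Data Center Manager within six minutes in the worst case.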

2.2.1.4 Failure handling

The Infrastructure Engineer is responsible for the response, classification, and reporting of

failures as well as the coordination of resources for failure handling.

For a Level III failure, the Infrastructure Engineer is responsible for 1) coordinating with

relevant employees for failure handling; 2) where necessary, notifying relevant service

providers for emergency response and repair within 30 minutes; 3) reporting the progress of

failure handling to the team leader.

For a Level II or I failure, the Infrastructure Engineer should go to the scene of the incident and

notify the team leader as soon as possible. She/he is also responsible for 1) notifying relevant

vendors/equipment manufacturers for failure handling and repair within 10 minutes; 2) if

there is still no successful progress in failure handling, urging the manufacturers to take

emergency actions (for example, providing back-up equipment); 3) reporting the progress of

failure handling to the team leader and Management Representative every two hours. For a

Level I failure, the Management Representative should report it to the Data Center Manager.

Upon successful handling of the failure, the failure incident should be recorded in the

management system, including a full description of the entire failure handling process.

2.2.2 Problem management

A problem is the root cause of one or more incidents. Problem management aims to identify

the root causes of incidents and prevent the occurrence of incidents by taking proactive actions to

identify and resolve problems before they can cause incidents. The management of a problem very

often involves a long time-cycle to diagnose and resolve its root cause based on appropriate

planning.

As problems are root causes of risks and incidents, they should be managed in association

with risks and incidents. A problem is ranked according to the risk ranking of Ping An Data

Center, detailed later, and is classified in the same way as incidents, as described above.

2.2.3 Change management

Change management aims to assess, approve, implement, and review every change in a

controlled manner in order to ensure the implementation of standardized methods and processes,

prevent unauthorized changes, minimize the risk and impact of emergency changes and related

emergency incidents, and maintain the traceability of changes.

The elements of change management include classification of changes, change management

process, definition of roles and responsibilities for change management as well as the initiation,

approval, implementation, and closing of changes and policies for normal approval and pre-

authorization of changes.

2.2.3.1 Definition of change management

Change management: the documented process of managing risky actions involved in the day-

to-day operations and maintenance of the infrastructure.

Change management aims to avoid risks associated with change implementation through a

standardized management process. The scope of change management covers annual routine

changes, incident-type changes, changes to the data center system structure, and changes to

equipment conditions, parameters, and configurations.

2.2.3.2 Change classification

A change to the infrastructure operations of Ping An Data Center is classified as Level I, II,

or III based on its impact.

Level I changes (or major changes) are those changes that pose major hazards to the power

distribution and HVAC systems of the data center or affect the security of the dual power

supply to racks, the overall cooling system of the computer room, the monitoring system, or

the firefighting and security system.

Level II changes include maintenance-related changes and modifications to parameter

settings. Maintenance-related changes mainly include repairs to individual malfunctioning

equipment sets, alterations to individual equipment set configurations, and maintenance-type

incidents having no impact on the security of the dual power supply of IT power load.

Level III changes are mainly normal modifications to the parameters and alterations to the

operational conditions of individual equipment sets.
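The three change levels can be expressed as a simple decision rule. The sketch below is a simplified reading of the definitions above, not the data center's actual procedure: the ChangeRequest fields and the reduction of each level to a single yes/no question are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    """Hypothetical change request with two assumed yes/no attributes."""
    title: str
    # True if the change endangers power distribution or HVAC, the dual power
    # supply to racks, overall cooling, monitoring, or firefighting and security.
    affects_critical_system: bool
    # True for a repair or reconfiguration of an individual equipment set.
    is_maintenance_or_repair: bool


def classify_change(req):
    """Return 1, 2, or 3 for Level I, II, or III."""
    if req.affects_critical_system:
        return 1
    if req.is_maintenance_or_repair:
        return 2
    return 3
```

For example, a routine setpoint adjustment that touches neither a critical system nor an equipment repair would fall through to Level III.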

2.2.3.3 Definition of roles and responsibilities for change management

Figure 2.2-1 defines the roles and responsibilities of the Change Management Commission,

Daily Operations Manager/Bank IT Manager, Infrastructure Manager, engineers, monitoring

personnel, and technicians in the change management process.

2.2.3.4 Hierarchical change management

Fig. 2.2-1 Schematic illustration of the hierarchical change management

[Figure: the roles (Change Management Commission; Daily Operations Manager/Bank IT Manager; Infrastructure Manager; Engineers; Monitoring personnel and technicians) plotted against change Levels 1 to 3, with each cell marked as change initiation, change approval, change implementation, or updating on change implementation.]

2.2.3.5 Initiating a change

The change management system provides a detailed definition of change initiators and the

major elements of change management, including the type of change request form as well as the

basic information, justification, schedule, and classification of the requested change.

2.2.3.6 Approving a change

At this step of the change management process, the person responsible for the approval of the

change request assesses and checks the potential impact of the requested change and decides

whether to proceed with the requested change, in order to ensure that the requested change can be

implemented to satisfy business requirements while minimizing its impact on services.

2.2.3.7 Implementing a change

At this step of the process, an approved change is implemented in the production system

according to the schedule and procedure provided on the approved change request form. The

details of the on-site change implementation should be recorded.

2.2.3.8 Closing a change

This step aims to investigate whether the expected effect of a change has been realized,

verify the results of the change, and check whether correct and complete information has been

recorded on the change request form.

2.3 Uptime Management & Operations (M&O) program

The Uptime Institute M&O certification, a well-recognized certification in the international

data center industry, aims to help data centers improve their operations and management by

assessing a comprehensive set of indexes.

The core philosophy of M&O is to minimize human and equipment risks and improve the

availability of data centers by providing best practices obtained from the cases of data center

operations across the world.

Ping An Data Center passed the M&O certification in 2017–2018, with the highest score of

96.3, achieved through the shortest certification program among Chinese data centers. The M&O

certification is based on an assessment of five categories: staffing and organization; maintenance

management; training management; planning, coordination, and management; operating

conditions. The certification requires an overall score of 80 or above for the five categories and is

valid for a period of two years. The following is a description of our M&O certification program

according to the five categories.

2.3.1 Staffing and organization

Appropriate staffing of qualified personnel is critical for achieving the long-term

performance objective of the data center. To achieve the uptime target for the data center,

adequate staffing and vendor support must be provided to carry out all the maintenance and

operating activities. All the employees in the data center must have the experience and technical

qualifications required to carry out the activities assigned to them and all the roles and

responsibilities must be defined, with their importance confirmed by the management.

2.3.1.1 Staffing

Ping An Data Center houses the systems and associated components required to run the core

business, and is expected to operate 24 × 7. The data center is provided with adequate staffing

required for this level of operations availability. A job description is established for each of the jobs.

A job description covers the recruitment requirements of education, experience, professional

competence, and core competence for the prospective job holders as well as the scope of

responsibilities, main responsibilities, and challenges and solutions of the job and the hierarchical

position of the job in the organizational structure, in order to ensure that any new employee of the

data center satisfies all the requirements and understands her/his roles and responsibilities.

A job responsibility matrix is developed for the 47 different roles defined for the operations of

the data center. The matrix provides a brief description of each task and indicates the four different

ways for each of the roles to participate in the task: implementation, approval, support, and

informed. The matrix is updated to reflect the latest changes in roles and responsibilities assigned to

the employees. This helps all employees understand their roles clearly and carry out their

assigned tasks in an orderly manner.
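A job responsibility matrix of this kind can be held as a mapping from tasks to roles. The sketch below is illustrative: the four participation types come from the text, whereas the task names, the role assignments, and the helper functions are hypothetical examples.

```python
from enum import Enum


class Participation(Enum):
    IMPLEMENTATION = "implementation"
    APPROVAL = "approval"
    SUPPORT = "support"
    INFORMED = "informed"


# Hypothetical excerpt: task -> {role: participation}. All entries are made up.
MATRIX = {
    "Monthly UPS inspection": {
        "Infrastructure Engineer": Participation.IMPLEMENTATION,
        "Infrastructure Manager": Participation.APPROVAL,
        "Monitoring personnel": Participation.INFORMED,
    },
    "Chiller filter replacement": {
        "Technician": Participation.IMPLEMENTATION,
        "Infrastructure Engineer": Participation.SUPPORT,
    },
}


def tasks_missing(matrix, required):
    """Return the tasks in which no role has the required participation."""
    return [task for task, roles in matrix.items() if required not in roles.values()]


def tasks_for_role(matrix, role):
    """Return {task: participation} for one role."""
    return {task: roles[role] for task, roles in matrix.items() if role in roles}
```

A check such as tasks_missing(MATRIX, Participation.APPROVAL) can flag tasks that have no approver after the matrix is updated.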

A data center is a complicated, equipment-intensive facility. Ping An Data Center divides the

facility into 15 zones and assigns a person to be responsible for the equipment in each zone, with the

detailed responsibilities defined and documented. The person-in-charge for each zone is assigned on

a regular rotation basis, such that all the employees of the data center can gain a clear, in-depth

understanding of the facility.

2.3.1.2 Personnel qualification

The operations of the data center involve day-to-day operating activities for the medium- and

low-voltage power distribution, cooling, and firefighting systems, elevator management, and work

at height. The employees involved in these tasks have been certified for operating the

medium- and low-voltage and HVAC systems by the State Administration of Work Safety, for

primary building (structure) fire-fighting by the Fire Department of the Ministry of Public Security,

and for elevator management (safety management for special equipment) by the Market and Quality

Supervision Commission of Shenzhen Municipality.

Personnel qualification management covers the collection and regular review of personnel

qualification information and follow-up with relevant employees for certification renewal/review, in

order to ensure the validity of all certificates.
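The regular review of certificates can be reduced to a date comparison. This is a minimal sketch, assuming a 90-day lead time for renewal follow-up; the lead time, the Certificate fields, and the function name are assumptions and are not stated in the source.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Certificate:
    """Hypothetical qualification record."""
    holder: str
    name: str
    expires_on: date


def needs_follow_up(certs, today, lead_time=timedelta(days=90)):
    """Return certificates that have expired or will expire within lead_time."""
    return [c for c in certs if c.expires_on - today <= lead_time]
```

Running this check on a fixed schedule gives the employees concerned enough time to renew before a certificate lapses.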

2.3.1.3 Organizational structure

An organizational chart of Ping An Data Center is available, clearly indicating the work

interfaces and reporting lines of the departments (Infrastructure, IT, Security Management,

Vendor Management, and Housekeeping) as well as the communication channels between the

different organizational functions.

2.3.2 Maintenance management

2.3.2.1 Preventive maintenance plan

Preventive maintenance plan: At the end of each year, the operations team of Ping An Data

Center prepares the next-year preventive maintenance plan by equipment type based on inputs from

equipment suppliers. The plan, which consists of more than 150 line-items, is approved by the

management and implemented strictly according to pre-established methods of procedure (MOP).

The completion rate of the preventive maintenance plan is a major key performance indicator (KPI) of

the data center, with the target set at 95%.
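The completion-rate KPI is a simple ratio of completed to planned line-items. The sketch below assumes that each line-item counts equally; the function names and the equal weighting are assumptions, and only the 95% target comes from the text.

```python
TARGET = 0.95  # target completion rate stated in the text


def pm_completion_rate(planned, completed):
    """Fraction of planned preventive maintenance line-items that were completed."""
    if planned <= 0:
        raise ValueError("planned must be positive")
    if not 0 <= completed <= planned:
        raise ValueError("completed must be between 0 and planned")
    return completed / planned


def meets_target(planned, completed, target=TARGET):
    return pm_completion_rate(planned, completed) >= target
```

With 160 planned line-items, for instance, 156 completed items meet the target and 150 do not.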

2.3.2.2 Maintenance management system

An effective maintenance management procedure for tracking the status and results of all

maintenance activities

Providing tabulated information (brand, model, date of manufacturing, date of installation,

maintenance contract, and operating instructions) of all major equipment sets

An order of maintenance providing special tools and materials required for the preventive

maintenance (PM)

Saving data of equipment maintenance activities and their trends

List of critical spare parts and re-ordering points

Equipment list: a list of all critical equipment sets, including information of equipment sets,

their maintenance, and their critical parts. Equipment information includes the classification,

location, description, brand, function/model, date of installation, and serial number of equipment.

Equipment maintenance information includes the department in charge of maintenance, date of

equipment insurance, and contact person and phone number for maintenance. Information of

critical equipment parts is a list of information of critical parts by equipment set, with different

equipment sets having different critical parts.

Tool management: including specification for equipment calibration, a list of tools, and

records of tool calibration.

Management of critical spare parts: As different data center facilities have different types and

physical locations of equipment sets and different levels of vendor support, each facility defines

its own critical spare part list based on its own actual situation and performs regular checks against

the critical spare part list. The aim is to repair malfunctioning equipment sets as quickly as

possible, shorten the mean time to repair (MTTR), and minimize the impact on business.
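The regular check against the critical spare part list and its re-ordering points can be sketched as follows. The part names and quantities in the usage note are made up, and treating a part as due for re-ordering when stock is at or below its re-ordering point is an assumed convention.

```python
def parts_to_reorder(stock, reorder_points):
    """Return the sorted names of parts whose stock is at or below the re-ordering point.

    stock and reorder_points both map part name -> quantity. A part that is
    missing from stock is treated as having zero units on hand.
    """
    return sorted(
        part for part, point in reorder_points.items() if stock.get(part, 0) <= point
    )
```

For example, with stock {"UPS fan": 2, "CRAH belt": 5} and re-ordering points {"UPS fan": 2, "CRAH belt": 3, "Breaker": 1}, the function flags the breaker and the UPS fan.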

2.3.2.3 Computer room housekeeping policy

Standard of data center housekeeping:

Tidy and clean computer room floor

Computer room free of flammable and combustible materials, tools for housekeeping,

personal belongings, and paper packaging

Tidy and clean computer room environment (IT computer room, power distribution room,

cooling station, and other functional areas)

2.3.2.4 Vendor support

Approved vendor list (for support under both normal and emergency conditions),

including names, contact persons, and contact information of vendors

SLAs, including clauses for scope, time, frequency, and response time of maintenance

and support as well as training needs

Vendor engagement process and qualified vendor service persons

2.3.2.5 Deferred maintenance procedure

The process for tracking and supervising deferred maintenance, including the initiation,

approval, implementation, and closing of deferred maintenance as well as analysis of

associated risks.

2.3.2.6 Life cycle planning

The procedure for the planning and financial control of the life-cycle-based replacement

of major equipment sets or components

2.3.2.7 Failure analysis policy

Equipment failure list (including the time of failure, equipment involved, failure analysis,

and lessons learned)

An effective process for identifying the root causes of problems and taking appropriate

corrective actions

2.3.3 Training management

2.3.3.1 Staff training

Onboarding training for every new employee to ensure that they are technically competent and

understand the working systems. Training based on document presentations and on-site drills

covers:

1) All processes, procedures, and policies for operations and management

2) Site Configuration Procedures (SCP)

3) Standard Operating Procedures (SOP)

4) Emergency Operating Procedures (EOP)

5) Maintenance Operating Procedures (MOP)

6) Maintenance Management System (MMS)

This also includes the training management procedure, which covers the curriculum, course

materials, and records of training, and the procedure for personnel qualification.

2.3.3.2 Vendor training

A list of training courses to be taken by vendors

Introduction to the process and procedure for vendors to provide on-site services

Vendor training is mandatory for every regular employee.

The training management procedure, covering the curriculum, course materials, and

records of training

2.3.4 Planning, coordination, and management

2.3.4.1 Computer room policy

The well-established procedures of the data center, including:

1) Equipment management policy of the data center (for example, the principle for

configuration changes and operating solutions under normal and emergency conditions)

2) Site Configuration Procedures (SCP)

3) Standard Operating Procedures (SOP)

4) Emergency Operating Procedures (EOP)

5) Change management (risk assessment and approval of requested changes)

2.3.4.2 Financial policy

The financial procedure for ensuring that an adequate fund is available for the data center

2.3.4.3 Document and data library

The following data and records must be maintained (kept at the data center or off-site):

1) As-built drawings

2) Documents for operations maintenance

3) Research results

4) Testing reports

5) Maintenance contracts and clauses

6) Documented automatic control procedures

The above data must be made readily available at the data center, maintained at the data

center in a centralized manner, and accessible to all employees. A procedure should be established

for the revision/update of the above data and should be made available to all employees of the

data center.

2.3.4.4 Capacity management

Capacity management includes the following processes:

1) Regular review and update of the used capacity of the data center in order to add new or

remove existing IT facilities as necessary;

2) Regular tracking of used rack, power, and cooling capacities, which is combined with the

prediction of the increasing demand for space, power, and cooling, air flow planning and

management, and power consumption analysis.
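Combining tracked usage with a demand forecast can be illustrated with a straight-line projection. This is only a sketch under a strong assumption of linear growth between the first and last sample; the source does not say which forecasting method is used, and the function name and units are hypothetical.

```python
def months_until_full(monthly_usage, capacity):
    """Project how many months remain until usage reaches capacity.

    monthly_usage is a list of evenly spaced monthly readings (for example kW
    of IT load or number of occupied racks). Growth is estimated as the average
    change per month between the first and last reading. Returns None when
    usage is flat or falling.
    """
    if len(monthly_usage) < 2:
        raise ValueError("at least two readings are required")
    growth = (monthly_usage[-1] - monthly_usage[0]) / (len(monthly_usage) - 1)
    if growth <= 0:
        return None
    return max(0.0, (capacity - monthly_usage[-1]) / growth)
```

The same function can be applied separately to rack space, power, and cooling, with the smallest result indicating the first constraint to be reached.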

2.3.5 Operating conditions

2.3.5.1 Load management

Procedure for ensuring that the actual load does not exceed the capacity when switching

between the primary and redundant paths.
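The load check behind this procedure can be sketched as follows: when one path is lost, the total load must fit within the capacity of the paths that remain. The path names, the units, and the function signature are assumptions for illustration.

```python
def can_fail_over(path_loads, path_capacities, failed_path):
    """Return True if the remaining paths can carry the total load.

    path_loads and path_capacities map path name -> value in the same unit
    (for example kW). The load of the failed path is assumed to transfer in
    full to the remaining paths.
    """
    total_load = sum(path_loads.values())
    remaining_capacity = sum(
        cap for name, cap in path_capacities.items() if name != failed_path
    )
    return total_load <= remaining_capacity
```

With loads of 300 and 250 on paths A and B and capacities of 600 and 500, losing path B is safe, but losing path A would overload path B.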

2.3.5.2 Operating configurations

Critical configuration points are defined based on risk, availability, and cost.

Chapter 3 Security Management

Security management in data centers is broken down into the following three

categories: information security, physical security, and personnel safety. Financial data

centers require higher security standards than other types of data centers. The Ping An Data

Center has established a precise and accurate security management system to protect the

operations of its various components by following the ISO 27001:2005 Information Security

Management System, GB/T 21052-2007: Information security technology—Physical

security technical requirement for information system, ISO 9001, and M&O.

3.1 Information security

As digitalization advances worldwide, the information security of

data centers has become a widespread concern, and many organizations are exploring

techniques to safeguard information security. The Ping An Data Center has established a

systematic management system for information security by following the ISO 27001:2005

Information Security Management System standard (which has been adopted in most

countries). The rules for information confidentiality are as follows:

1) All the rules of Ping An Technology (Shenzhen) Co., Ltd. for computer information

and cyber security shall be followed.

2) No one may take any materials out of the computer room or disclose any

information stored in the computer room without permission, including

confidential documents, software copies, technical files, and other classified

data.

3) No one may disclose any secret, classified, or highly confidential information

(including data and documents) about the data center.

4) No one may disclose, share, or misappropriate server data such as account IDs,

passwords, and IP addresses.

5) Non-authorized personnel are not allowed to access the restricted area or use the IT

facilities in the data center; no one may use any IT facilities other than those

necessary for work; and no one may interfere with anybody else's work or the

operation of the data center.

6) Non-authorized personnel are not allowed to modify the operating system or

settings of the IT facilities (such as networks and servers).

7) No one may misappropriate, alter, or sabotage the utilities in the data center.

8) An external person (such as a vendor or visitor) is required to sign a confidentiality

declaration prior to his/her first access to the data center and shall be subjected to a

security check by the administrators and security guards of the data center. Any

person violating the confidentiality rules shall be dealt with under the relevant rules of the

data center according to the severity of the violation. In cases where

the violation constitutes a crime, the violation shall be reported to the Legal and

Security department of the company so that legal liability can be pursued.

9) An employee may use only the office computer allocated to him/her and is

not allowed to alter the operating system installed by the IT administrator.

10) The password policy is mandatory for all employees: account IDs and

passwords must not be disclosed to others, the log-in password must be changed

every 90 days, and the allocated computer must be returned upon job rotation or

resignation.

11) Any work email sent to an external party must be copied to and approved by the

line manager. Any sensitive information (for example, account numbers, keys, and

IP addresses) in emails and attachments must be appropriately masked.
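Rule 11 on hiding sensitive values in outgoing emails can be supported by an automated pre-send check. The sketch below is a simplified illustration and not the data center's actual tooling: the regular expressions, the placeholder strings, and the function name are assumptions, and real messages would need broader patterns and a manual review.

```python
import re

# Dotted-quad IPv4 addresses such as 10.1.2.3 (no range validation).
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# "password: value", "account=value", "key : value" and similar pairs.
SECRET = re.compile(r"\b(password|account|key)\s*[:=]\s*\S+", re.IGNORECASE)


def mask_sensitive(text):
    """Replace IP addresses and labeled secrets with placeholders."""
    text = IPV4.sub("[IP REDACTED]", text)
    return SECRET.sub(lambda m: f"{m.group(1)}: [REDACTED]", text)
```

For example, mask_sensitive("Server 10.1.2.3 password: hunter2") returns "Server [IP REDACTED] password: [REDACTED]".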

3.2 Physical security management

Physical security, referring to the security of the computer room as well as the

equipment and facilities of the data center, is the prerequisite for safeguarding the information

system security of the data center. If the physical security of the computer room cannot be

safeguarded or security hazards exist, then the security of the entire data center cannot

be ensured.

The Ping An Data Center was constructed according to Class A of GB 50174 Code for

Design of Electronic Information System Room, which provides a solid foundation for the

physical security of the data center. In addition, a control system for different levels of access

has been incorporated into the day-to-day operations of the data center, including access

control, material control, and fire safety.

3.2.1 Physical security configuration

Physical security of the Ping An Data Center is configured by the following five access

levels: site, building, computer room, zone, and rack.

Site: Security guards are employed to perform access control at site entrances by

ensuring that employees and visitors display proper passes or identification before

entering. Patrolling is also part of the security guards’ duties.

Building: At the entrance to the data center building, an access control system,

material screening, and face recognition are installed. In addition to these facilities,

security guards are responsible for the administration of persons entering and

leaving the building.

Computer room: Access to the computer rooms of the data center is controlled by

face recognition, access card, and fingerprint verification.

Zone: A computer room is zoned for clients, where the zones are separated by wire

meshes and cold aisles. Access to the zones is separately controlled with the door

access control system to ensure that the zone is only accessible to pre-authorized

users.

Rack: The front and rear doors of a rack in the computer rooms are locked and can

only be unlocked by the pre-authorized users.

Surveillance cameras are installed in the data center building and computer rooms, and

surveillance video from the last three months is retained for review.

The record of accesses to control points in the computer room is maintained for a period

of one year.
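The five access levels can be modeled as an ordered hierarchy. The sketch below assumes that reaching an inner level requires a grant at that level and at every level outside it; the source lists the levels but does not state this rule explicitly, and the names AccessLevel and can_reach are hypothetical.

```python
from enum import IntEnum


class AccessLevel(IntEnum):
    """The five access levels, ordered from outermost to innermost."""
    SITE = 1
    BUILDING = 2
    COMPUTER_ROOM = 3
    ZONE = 4
    RACK = 5


def can_reach(target: AccessLevel, granted: set[AccessLevel]) -> bool:
    """Assumption: the target is reachable only if it and all outer levels are granted."""
    return all(level in granted for level in AccessLevel if level <= target)
```

Under this assumption, a rack grant alone is not sufficient: a missing zone grant blocks access even when all other levels are authorized.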

3.2.2 Terminology and definition

Permanent access

This level of access is granted to those employees who require permanent access to the data

center and is controlled by the means of access card, fingerprint, and iris information. The

facility administrator of the data center maintains a list of the persons with permanent access to

the data center, which is updated when an access is granted, or a granted access is canceled, and

is reviewed by the leader of the Data Center Infrastructure Management department.

Temporary access

This level of access is granted to those employees of Ping An Technology who do not

require permanent access to the data center but rather temporary access for work or external

parties who request temporary access to the facility. A person who has been granted a

temporary access through a pre-established procedure is required to enter and leave the data

center in the company of a person with permanent access.

IT facility zone of the data center

This is the zone for installing the IT facilities of the data center—racks for storage

devices, network devices, and servers, excluding other areas of the data center—for example,

rooms for infrastructure equipment, gas fire extinguishers, and uninterruptible power supplies

(UPS).

3.2.3 Procedure

Access control

Access control by zone: Access to the data center is further defined by zones according to

job responsibilities. That is, a person is only allowed to access those zones in the data center

that he requires to enter for job-related purposes.

Access application: Permanent access to the data center can be applied for according to the

data center access application procedure. The Administrator of Data Center Infrastructure

Management adds the approved access to the employee identity card of the applicant.

Changes to the granted access to the data center

When an employee with permanent access to the data center resigns or he is assigned to

a different job or different zone of the data center, his access needs to be deleted or updated

according to the data center access change approval procedure. The Administrator of Data

Center Infrastructure Management then updates the approved change into the employee

identity card.

Record of granted access to the data center

The unique permanent access to the data center granted to an employee (including his

employee identity card, fingerprint, and iris information) is recorded in the access control

system of the data center. The infrastructure engineer of the data center checks and updates

the system every month as per the granted accesses and retrieves from the system the list of

accesses and submits it to the data center manager for approval. A person with permanent

access to the data center shall fulfill his security commitment to the data center. He shall

retain his employee identity card in a proper manner and may not lend it to any other person.

In case the employee identity card is lost, he shall immediately report the loss.

Temporary access to the data center

If an employee of Ping An Technology requires temporary access to the data center for a

certain time period for job-related purposes, he shall submit an application for access

according to the temporary access approval procedure of the data center. The application for

temporary access should indicate who will enter the data center, at what time, for what

authorized task, and on what object, as well as the coordination required to perform the

intended task (including risk assessment and risk mitigation plan). Upon the approval of the

temporary access, the applicant will be provided with a visitor identity card. Prior to entering

the working area of the data center, his relevant identity information will be recorded by the

data center administrator on duty, who will accompany him into the working area and collect

the visitor identity card upon his taking leave of the data center.
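The information required in a temporary access application (who, when, what task, what object, and the supporting risk assessment and mitigation plan) lends itself to a simple completeness check. The sketch below is one possible representation; the class and field names are assumptions, since the source lists the required information but not a schema.

```python
from dataclasses import dataclass, fields


@dataclass
class TemporaryAccessApplication:
    # Field names are illustrative; they mirror the items listed in the text.
    applicant: str             # who will enter the data center
    entry_time: str            # at what time
    authorized_task: str       # for what authorized task
    target_object: str         # on what object
    risk_assessment: str       # part of the required coordination
    risk_mitigation_plan: str  # part of the required coordination


def missing_items(app: TemporaryAccessApplication) -> list[str]:
    """Return the names of fields that are empty or contain only whitespace."""
    return [f.name for f in fields(app) if not getattr(app, f.name).strip()]
```

An approver could reject any application for which missing_items returns a non-empty list before assessing the content itself.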

External visitor to the data center

For an external visitor to access the data center, an employee of Ping An Technology

shall submit an application for the visitor’s access two days in advance. The application shall

be made via the relevant electronic access application system according to the data center

visitor access application procedure and should indicate the reason and time of the visit as

well as the specific zones of the data centers to be visited. An approved visitor shall visit the

data center in the company of a member of the Data Center Operations Team.

3.2.4 Site access registration system

To enter the data center, persons with only temporary access shall register at the security

post. For employees with temporary access, the registration shall be carried out in sequence;

for the group of external visitors or material handling operators, the registration process may

be completed under the name of a representative. A courier may be allowed to enter the

office area of the data center by showing his identity card without going through the

registration procedure; however, the courier must be accompanied by the visited

employee working in the computer room. A person granted temporary access to the data

center shall sign a non-disclosure agreement prior to his first entry into the data center.

The security guard on duty at the security post shall request persons with temporary

access to enter the following information in the visitor registration system: name,

company/department, time of visit, purpose of visit, zones to be visited, materials brought

along, and number of companions. For employees of Ping An with temporary access,

registration shall be carried out using an employee identity card; for external visitors,

registration procedure shall be executed by showing a valid personal identity certificate

(identity card, passport, social security card, or driver's license). A visitor card and visitor

registration form will then be issued, which should be carried by the visitor during the visit.

Upon leaving the data center, a person with temporary access shall return the visitor

card and visitor registration form. The visitor registration form shall be signed by the person

visited and the time of leaving should be indicated as well. The visitor shall record the time

of leaving the data center in the visitor registration system. The security guard on duty at the

security post shall check if the information provided in the visitor registration system is true

and complete.

In cases where the visitor registration system is technically unavailable, the registration

should be carried out via the data center access registration form instead and the record shall

be filed and maintained according to the relevant record control procedures.

Cleaning workers and supervisors may enter the pre-authorized zones using their special

access cards, but shall not be allowed into unauthorized zones without the company of the data

center’s technician on duty.

Fig. 3.2-1 Flow chart depicting the access control of external visitors

Control of temporary access

1) For an external party to visit the data center or any other working area containing

sensitive information, an application should be submitted to the administration office of the

data center. Upon approval of the application, a permit will be issued by the administration

office, and the data center technician on duty will then lead the visitor into the data center in

the pre-established visit window. For an external party to enter the computer room (excluding

external parties delivering goods into the warehouse), the security guard on duty shall use the

walk-through metal detector for performing a security check of the visitor and the material

brought along by him.

2) Once allowed into the data center, the visitor shall prepare for the intended

maintenance task or take rest in the designated area; no lingering is allowed in any other part

of the office area.

3) For a person with temporary access who will be involved in implementing changes to

the computer room, he is required to be prepared (for example, ensuring that the required IT

equipment and spare parts are identified and issued from the warehouse) before the daily

maintenance window (23:00 - 06:00).

4) Any person entering the data center shall check if the door to the data center has been

properly closed (the door should be normally closed). The data center technician on duty shall

check if the doors to computer rooms are properly closed and address any problem in a timely

manner.

5) No visitor shall enter any area that he has not been permitted to enter. Any violation of

this requirement will be reported to the customers concerned and the management of the data

center. The data center reserves the right to revoke the violator's access to the data center,

depending on the severity of the violation.

6) Photography or video recording using IT equipment is not allowed in the data center

without permission, except for job-related purposes by the employees. No person may take

any materials out of the data center. No person may take any software copies, technical files,

or any internal data classified as secret information or marked with higher levels of

confidentiality out of the data center or disclose them to any third party. A visitor to the data

center must sign a non-disclosure agreement as defined by the confidentiality management

system of the data center.

7) In cases where a person is allowed into the data center for hardware maintenance or

installation of any infrastructure or IT facility, or alteration to any optical fiber, network cable,

power socket, or power cable, the data center technicians on duty shall be notified of the

attempted maintenance or alteration, which shall then be performed under the supervision of

the technician on duty.

8) No person may alter any cable or floor in the data center without permission. In cases

where the power cables, network cables, or any other wiring in the data center is planned

to be extended, the data center planning administrator shall be notified of the planned

extension. The planning administrator will design the layout of the sockets and ports required

for the system extension, and the data center operations team will implement the planned

extension. No person may open the floor or alter the power or network cabling without

permission.

9) No external visitor may carry any baggage into the IT facility zone of the data center.

10) In cases where an external service person is involved in the maintenance of any IT

facility or equipment and has logged into a server for this purpose, the IT facility

administrator of the data center shall confirm that the external service person has logged out

of the server and has closed the log-in page before leaving the data center. Furthermore, the

service person shall go through the leaving procedure at the security post before leaving the

data center.
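The daily maintenance window of 23:00 - 06:00 mentioned in item 3 crosses midnight, which makes a naive "start <= t < end" comparison incorrect. The following sketch shows one way to test whether a time falls inside such a window. Treating 06:00 itself as outside the window is an assumption, and the function name is hypothetical.

```python
from datetime import time

# Daily maintenance window from item 3 above.
WINDOW_START = time(23, 0)
WINDOW_END = time(6, 0)


def in_maintenance_window(t: time) -> bool:
    """Return True if t lies within 23:00-06:00, a window that wraps past midnight."""
    # Because the window crosses midnight, a time qualifies if it is at or
    # after the start (late evening) OR before the end (early morning).
    return t >= WINDOW_START or t < WINDOW_END
```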

3.2.5 Control of goods

The security guard is in charge of checking goods delivered into and out of the data

center as per the relevant goods control requirements.

No foods, beverages, or any other non-work related materials (including personal bags)

are allowed into the data center.

No combustible, flammable, fragile, polluting, or any other dangerous materials as well

as materials with strong magnetic fields that may interfere with IT facilities are allowed into

the data center.

All materials to be carried into the IT facility zone of the data center must be placed in

the baskets provided at the security post and should be subjected to security checks when

carried into and out of the data center. Personal belongings must be placed in the designated

lockers.

No personal notebooks or cameras are allowed into the data center without permission.

Notebooks and cameras are available at the data center upon requesting the data center

operations administrator (such a request may be made by filling a registration form for

borrowing tools from the data center).

For any non-personal belongings to be taken out of the data center, a gate pass shall be

prepared and approved. For any IT equipment with any magnetic media (a data security

concern) to be taken out of the data center, demagnetization treatment must be given to the

equipment by the operations administrator on duty and verified by the security guard on duty.
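The two rules for taking items out of the data center can be expressed as a small decision function. This is a sketch only; the function and parameter names are assumptions and do not reflect an existing system at the data center.

```python
def removal_requirements(is_personal_belonging: bool,
                         is_it_equipment_with_magnetic_media: bool) -> list[str]:
    """List the steps required before an item leaves the data center."""
    steps = []
    if not is_personal_belonging:
        # Non-personal belongings need a prepared and approved gate pass.
        steps.append("gate pass prepared and approved")
    if is_it_equipment_with_magnetic_media:
        # Magnetic media are a data security concern and must be demagnetized.
        steps.append("demagnetization by the operations administrator on duty")
        steps.append("verification by the security guard on duty")
    return steps
```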

3.2.6 Fire safety management system

3.2.6.1 Regulations on fire and safety education and training

1) Regular training to employees on firefighting related laws, rules, and regulations.

2) Annual written examinations on firefighting knowledge and firefighting drills to

improve the firefighting and safety awareness and skills of the employees.

3.2.6.2 Regulations on fire hazard screening

1) Implement a responsibility system for fire prevention and safety (where the

responsibilities for fire prevention and control are defined for each job and included in the job

performance appraisal) and carry out regular fire and safety hazard screenings.

2) The firefighting facilities of the data center are maintained by a service provider, who

performs monthly fire hazard screenings and tracks the mitigation/elimination of identified

hazards.

3) Identified fire hazards shall be recorded by the inspector and signed by the parties

responsible for the mitigation/elimination.

3.2.6.3 Administrative regulations on emergency evacuation facilities

1) Escape routes and emergency exits shall be kept clear, shall not be occupied for any

other purposes, and shall not be installed with fences or any other barriers that may obstruct

evacuation.

2) Emergency escape signs and emergency lighting shall be provided according to

relevant national regulatory requirements.

3) Firefighting facilities such as fire doors, emergency evacuation signs, emergency

lighting, mechanical smoke-discharging and ventilation, and emergency broadcasting shall

be regularly inspected, tested, maintained, and serviced for normal operation.

3.2.6.4 Regulations on fire safety

1) A hot work permit shall be obtained for any operation involving open flames.

2) Prior to any hot work, the scene (within a radius of 5 m) shall be free of flammable

and combustible materials and shall be properly segregated. Moreover, it shall be equipped

with appropriate types and quantities of fire-extinguishing materials (which are available

from the security department and shall be returned immediately at the end of the hot

operation, along with a record prepared for reporting any material that was used during the

operation).

3) If hot work is attempted in a production area, the hot work permit shall be approved

by the line managers or above and the entire operation shall be supervised by the operations

team. For hot work being undertaken 2 m above the ground or higher, a person shall be

assigned specially for watching the operation and extinguishing any flames that may lead to a

fire.
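The hot work rules above combine several thresholds (a 5 m cleared radius, a 2 m working height) with approval conditions. The sketch below gathers them into a single pre-work check. The class and field names are hypothetical, and applying the 2 m watcher rule to all hot work rather than only to production areas is an interpretation of the text.

```python
from dataclasses import dataclass


@dataclass
class HotWorkPlan:
    # Field names are illustrative only.
    has_permit: bool
    cleared_radius_m: float         # radius free of flammable/combustible materials
    extinguishers_on_site: bool
    in_production_area: bool
    approved_by_line_manager: bool  # line manager or above
    supervised_by_operations: bool
    height_m: float                 # working height above the ground
    fire_watcher_assigned: bool


def hot_work_violations(plan: HotWorkPlan) -> list[str]:
    """Return the hot work rules that the plan does not satisfy."""
    violations = []
    if not plan.has_permit:
        violations.append("hot work permit missing")
    if plan.cleared_radius_m < 5:
        violations.append("scene not cleared within a 5 m radius")
    if not plan.extinguishers_on_site:
        violations.append("fire-extinguishing materials missing")
    if plan.in_production_area:
        if not plan.approved_by_line_manager:
            violations.append("permit not approved by line manager or above")
        if not plan.supervised_by_operations:
            violations.append("operations team supervision missing")
    # Interpretation: the watcher rule applies to any hot work at 2 m or higher.
    if plan.height_m >= 2 and not plan.fire_watcher_assigned:
        violations.append("no person assigned to watch work at 2 m or higher")
    return violations
```

An empty result means that none of the listed rules is violated; it does not replace the permit approval itself.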

3.3 Personnel safety management

Personal safety of the operations team must be taken as a priority while sustaining the

normal operation of the data center. Ping An Data Center pays close attention to the personal

safety of the operations team and has incorporated personnel safety management into every

process of the data center.

3.3.1 Personnel safety training

Personal safety is included in both the pre-job training to new employees and on-the-job

training to the essential operations team.

A new employee shall complete the pre-job safety training and pass an examination

(with a minimum score of 80) at the end of the training during the probation period according

to the working instructions of data center on employee training. An employee must pass the

safety training examination to qualify for his job.

The safety training specialist prepares an annual safety training plan each December and

submits it to the management for approval. To minimize personal safety risks to the

operations team, the plan is based on the current safety training curriculum, the actual

operations situation, and the latest lessons learned from safety incidents that occurred both

inside and outside the company. Every member of the operations team shall take annual

safety training and pass an examination (with a minimum score of 80) at the end of the

training according to the working instruction of the data center on employee training.

The examination result is included as an index for the annual performance appraisal.

The training covers

1) electrical safety specifications;

2) HVAC safety specifications;

3) regulations on the use of facilities and tools;

4) regulations on accessing computer rooms;

5) reviews of safety incidents.

3.3.2 Day-to-day operational safety management

The operations team shall strictly follow safety specifications established by the data

center for the day-to-day operational activities (for example, electrical operations, operations

on HVAC systems, and use of facilities and tools).

3.3.2.1 Electrical safety specifications

1. An operator of electrical equipment must be physically fit (free from any disease

that may compromise personal safety during electrical operations as certified by a

doctor), equipped with appropriate electrical operation knowledge, certified for

electrical operation, and have skills for administering first aid in case of electrical

shocks as well as electrical fire prevention and extinguishing skills.

2. An electrical operation shall be performed by at least two persons, one for

operating and one for keeping vigilance. In cases where only one person is on

duty, he must be capable of working and handling incidents independently and is

only permitted for monitoring equipment operations, but not for operating any

electrical equipment without any person keeping a careful watch for possible

dangers.

3. The operator of electrical equipment must wear insulating boots, and should wear

insulating gloves when accessing the housing or structure of an equipment set.

4. The switch or knife-switch that directly controls the power supply to the

electrical equipment being operated shall be switched off and attached with a

label indicating that the switch should not be turned on.

5. A power distribution device, irrespective of whether its instruments indicate a

voltage or not, shall be taken as live unless it has been confirmed discharged.

6. When a power outage is planned following a major change approval procedure, the

power outage shall be restricted to the approved scope, and may not be extended

without further approval.

7. The operations team shall carry out patrol inspections earnestly and carefully,

correctly update operation logs, and properly prepare records and reports in a

timely manner.

8. An operations administrator may not take duty while intoxicated, may not

be involved in non-job-related affairs while on duty, and may not leave his

position without permission.

3.3.2.2 HVAC safety specifications

1. An operator of the HVAC systems must be physically fit (free of any disease that

may compromise personal safety during operation as certified by a doctor),

equipped with appropriate HVAC knowledge, and certified for HVAC operation.

2. An operation on the HVAC systems shall be performed by at least two persons,

one for operating and one for supervising.

3. Switching between the water cooling units shall be performed in the monthly

routine maintenance window; no switching is allowed without permission. In

cases where the primary cooling unit is malfunctioning, switching to the

redundant unit can only be performed after consent is obtained from the engineer on

duty.

4. The operations team shall carry out patrol inspections earnestly and carefully,

correctly update operation logs, and properly prepare records and reports in a

timely manner.

5. No operations administrator may enter the data center barefooted, stripped to the

waist, wearing short sleeves, shorts or slippers, or under the effect of alcohol

intoxication, fatigue, or a serious illness. Employees must behave formally and

appropriately in the data center.

3.3.2.3 Regulations on the use of facilities and tools

1. The operations team shall use tools according to the regulations on the use of

tools: carefully complete the tool use registration form, use tools in a cautious

manner, and return tools in a timely manner.

2. When performing a welding and cutting operation, the operations team shall

ensure that appropriate fire prevention measures are in place, follow the working

instruction for welding operations, and wear safety goggles and other personal

protective equipment.

3. For any operation being performed 2 m above the ground or higher, the operator

must wear a safety belt. Safety belts shall be regularly checked, verified for

proper strength before use, and may not be extended without permission. A safety

belt shall be tied to a support located higher than the object to be operated on; it is

not permissible to tie the safety belt to a support located lower than the object to

be operated on.

4. To operate on live parts, the operator must wear insulating gloves. Electric

instruments such as test pencils and multimeters shall be regularly checked for

electric performance.

5. When operating using a hand-held electrical device (for example, sander, cutter,

and screwdriver) the operator shall wear protective goggles and ensure that the

device is equipped with leakage protection. Any damaged device shall be

repaired by a specialist and can only be re-commissioned for use after its proper

functioning has been verified.

6. When a ladder is used for an operation above the ground, the ladder shall be

checked for robustness to prevent the operator from falling-off and getting

injured.

Chapter 4 Staffing and Staff Development

4.1 Organizational structure

As the data center is fundamental and central to the Company’s IT infrastructure, establishing

an appropriate organizational structure for and clearly defining the functional roles of the data

center is of great significance in driving and guiding its effective, efficient, and secure operations

and meeting the Company’s business goals.

An appropriate organizational structure design facilitates the streamlined workflow, close

cooperation between departments, clear definition of role and responsibilities, and employee

motivation, thereby sustaining efficient operations of the data center where all the employees are

assigned appropriate tasks and are aligned to make concerted efforts toward a common goal.

The Uptime Institute Tier Standard: Operational Sustainability sets forth different staffing

requirements for data centers of different classifications, with a greater number of staff and better

skills specified for higher classifications. As shown in Table 4.1-1, the Uptime Institute standards

categorize data centers into four classifications (Tier I to IV, from low to high), and specify

a greater number of staff and higher staff presence requirements for higher classifications. A Tier

IV data center is expected to sustain a very high level of availability throughout the year, and

hence, it requires 24-hour presence of a technical specialist to oversee its operations, so that any

problem can be resolved immediately, or redundancy is readily available to sustain its operations.

Data center operational sustainability requirements by classification Table 4.1-1

As a key infrastructure of Ping An Group, the data center plays a fundamental role in the

Group’s core business and disaster recovery. Its operations are configured according to Tier IV,

with 7 x 24 staff presence in three shifts (rotating among five teams). Each shift is staffed with an

experienced engineer (shift leader) who is capable of timely handling of emergencies, in addition

to a Monitoring Specialist and three Technicians (for electric equipment, HVAC, and electronic

systems, respectively). The Monitoring Specialist at the Guanlan data center site is responsible for

centralized monitoring of all the sites of Ping An Data Center. The operations team must be

staffed according to the breadth and depth of the operations. For any important position in the team,

a back-up person is assigned in case of unavailability of the primary person.
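The staffing pattern described above (three shifts per day, five rotating teams, five roles per shift) can be sketched as follows. The role list is taken from the text. The team names and the round-robin rotation rule are assumptions made purely for illustration, since the source does not describe how the five teams rotate.

```python
# Roles present on every shift, as described in the text.
SHIFT_ROLES = [
    "Shift leader (experienced engineer)",
    "Monitoring Specialist",
    "Technician (electric equipment)",
    "Technician (HVAC)",
    "Technician (electronic systems)",
]

SHIFTS_PER_DAY = 3
# Hypothetical team names; the source only states that there are five teams.
TEAMS = ["Team 1", "Team 2", "Team 3", "Team 4", "Team 5"]


def teams_on_day(day_index: int) -> list[str]:
    """Return the team covering each of the day's three shifts.

    Assumption: teams take consecutive shifts in simple round-robin order.
    """
    start = (day_index * SHIFTS_PER_DAY) % len(TEAMS)
    return [TEAMS[(start + i) % len(TEAMS)] for i in range(SHIFTS_PER_DAY)]
```

With five teams and three shifts per day, a rotation of this kind leaves two teams off duty on any given day, which is one way to provide rest and back-up capacity.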

The operations team of Ping An Data Center is organized by three functional blocks, with the

function of each block further defined to establish a complete operations system, as shown in the

table below.

Organizational structure of the data center operations team Table 4.1-2

Data center operations team

Day-to-day operations management (IT management): Network operations; Server operations; Software application operations; Data storage operations; Cloud platform operations

Infrastructure management: Electric systems operations; HVAC operations; Firefighting operations; System monitoring operations

Building security and housekeeping: Security Department; Housekeeping Department; Logistics Department

4.2 Roles and responsibilities

With the continuous development of the Internet and information industries, data centers with

high-availability service and uptime become increasingly important. Consequently, it becomes

more critical to ensure the secure operations of data centers and the operations management of

data centers becomes increasingly more complicated and poses greater technical challenges. It is

important to define precisely the roles and responsibilities of the data center operations team,

which mainly consists of the following positions: Data Center Manager, Infrastructure Operations

Team Leader, Infrastructure Engineer, Infrastructure Monitoring Specialist, and Infrastructure

Technician.

Data Center Manager

The Data Center Manager assumes the overall responsibility for the data center and is

specifically responsible for

1. overall planning of the data center (capacity, energy efficiency, availability, and business

sustainability) to satisfy business requirements;

2. translating business requirements into requirements for the data center;

3. all day-to-day operations management;

4. planning, implementation, and continuous improvement of the operations system of the

data center;

5. establishing and implementing operating plans for the data center;

6. effectively controlling the operating cost of the data center;

7. driving improvement of the service capability of the data center;

8. managing the data center team;

9. reporting, tracking, and handling major incidents.

Infrastructure Operations Team Leader

The Infrastructure Operations Team Leader reports to the Data Center Manager and is

specifically responsible for

1. planning the infrastructure required in the data center to satisfy business requirements;

2. establishing, implementing, and improving the data center’s infrastructure service and

protection plans;

3. operation, maintenance, and service as well as regular and irregular patrol inspection of

facilities and equipment and ensuring that operation specifications and equipment repair

and maintenance procedures are followed;

4. assessment of service providers of the data center and acceptance inspection of

constructions at the data center;

5. review of major changes to facilities and taking timely action to improve the data

center’s equipment capacity;

6. supporting the Project Department to prepare and deploy operations sustainability plans;

7. reporting, tracking, and handling major infrastructure incidents;

8. improving the energy efficiency and overall equipment efficiency of the data center.

Infrastructure Engineer

The Infrastructure Engineer reports to the Infrastructure Operations Team Leader and is

specifically responsible for

1. maintaining secure operation of the infrastructure when on shift; carrying out a

comprehensive patrol inspection of the data center site during each shift to ensure normal

operation of equipment and facilities;

2. people management, including

a) coordination and management of the Monitoring Specialist and Technicians; overseeing

their working discipline, quality of performed tasks, and progress in carrying out their job

responsibilities, supervising their work, and providing coordination when necessary;

b) reviewing and confirming by signature the change-of-shift reports and patrol inspection

records prepared by the Technicians on his shift;

c) service provider management, including managing the tasks performed by service

providers and reviewing and confirming by signature service provider reports generated on

his shift;

3. failure handling, including constantly watching the status shown on the monitoring

system as well as e-mail and text-message alerts; upon receiving an alert or failure notice,

locating and handling the failure in a timely manner; for a Level II or more severe failure,

reporting it to the team leader and Management Representative immediately and updating

the latest progress in failure handling in a timely manner;

4. designing the sequence of the changes as necessary according to relevant plans or work

demands, initiating change requests accordingly, and implementing the planned changes;

5. keeping track of the operating conditions as well as technical data and files of major

equipment under his charge and ensuring that the equipment is in good operating

conditions; planning changes to remedy equipment defects and satisfy improvement

requirements according to pre-established procedures and initiating change requests

accordingly;

6. fulfilling major documentation tasks in time, including timely updating of documents

according to relevant specifications;

7. taking initiative to perform or participate in temporary tasks, e.g., organization and

coordination of training and drills, follow-up of changes, and preparation of operations

plans;

8. organizing relevant persons to support the construction, on-site management, and final

acceptance of projects;

9. handing over and taking over shifts according to the change-of-shift procedure;

10. fulfilling other tasks assigned by line managers.

Infrastructure Monitoring Specialist

The Infrastructure Monitoring Specialist reports to the Infrastructure Engineer and is

specifically responsible for

1. monitoring the operation of the data center infrastructure 7 x 24 through the monitoring

system;

2. checking the operating status of the data center through the monitoring system one round

each hour, including the water chilling, precision air-conditioning, and high- and low-

voltage power supply and distribution systems, UPS, STS, precision power switchgears,

power switchgears for air conditioning, and surveillance videos;

3. in case of an infrastructure failure, performing preliminary root cause analysis and

notifying the Infrastructure Engineer (who will coordinate with the Electric Technician and

Air Conditioning Technician to remedy the failure), or calling a conference call for failure

remedy, recording and updating the failure remedy progress, and issuing failure alerts;

4. in case of failure emergency response, reporting the failure as alerted by the monitoring

system through e-mail and interphone;

5. summarizing Level VI and more severe infrastructure alerts and remedial measures by

shift;

6. checking surveillance videos during each night shift, reporting any issues identified to

the operations team, and reporting and following up on incidents;

7. actively participating in drills, training, team meetings, and other team activities

organized by the Company to improve job skills and professional competence;

8. fulfilling other tasks assigned by line managers.

Infrastructure Technician

The Infrastructure Technician reports to the Infrastructure Engineer and is specifically

responsible for:

1. managing the data center infrastructure (including power supply and distribution, air

conditioning, and firefighting systems and environmental sanitation); performing patrol

inspections according to the pre-established specification and frequency, and reporting any

equipment defect identified to the Infrastructure Engineer in a timely manner;

2. repair of the data center’s building structures and decorations (stairways, passageways,

walls, floors, ceilings, and roofs); regular check of the walls and roofs of the data center

buildings for water leakage and seepage and peeling-off; regular check of the lighting and

emergency lighting of the data center to ensure their normal operation;

3. optimizing the operation of the central air-conditioning unit (for the new computer room)

and precision air-conditioning equipment of the data center and improving their operating

efficiency and energy efficiency as instructed by the Infrastructure Engineer;

4. timely handling of failures in the power supply and distribution, firefighting, air-

conditioning, and water supply and drainage equipment to ensure their normal operation as

instructed by the Infrastructure Engineer;

5. supporting maintenance providers to maintain the power supply and distribution,

firefighting, air-conditioning, and water supply and drainage equipment and following up

on outstanding issues;

6. keeping the environment of the equipment under his charge clean, and safekeeping the

materials, keys, and tools issued to him for working his shift;

7. maintenance and repair of building decorations, office furniture, doors, windows, door

locks, floors, carpets, painting, lightings, and indicator lights of the data center;

8. making steel structures, floor-supporting structures, and floor holes; repairing floors in

the data center, and overseeing the operation performed by constructors on floors to ensure

that floor-supporting structures remain intact, underfloor spaces are free of foreign

materials such as cable ties and cable scraps, and floors are properly reinstalled after the

operation;

9. acquainting himself with the layout of the holes in the data center and overseeing that the

holes affected by construction operation are properly sealed and secured;

10. supporting the on-site management and final acceptance of new projects;

11. supporting the data center access control: no outsiders may enter the data center

without permission, and outsiders arriving for construction or failure remedy may enter the data

center only after the employees in charge of infrastructure change control have arrived at

the site;

12. overseeing construction works when on duty to ensure that on-site construction

materials are arranged in an orderly manner; maintaining control of data center access to

ensure that equipment hardware is not affected by persons entering and leaving the data

center;

13. safekeeping materials—such as data, tools, and spare parts—in the data center;

checking the inventory of the materials at each change of shift and recording the final

inventory and changes to the inventory during the shift in the shift logbook;

14. actively participating in drills, training, team meetings, and other team activities

organized by the Company to improve job skills and professional competence;

15. updating the shift logbook according to the change-of-shift procedure with detailed,

accurate, and complete description of events;

16. fulfilling other tasks assigned by line managers.

4.3 Staff training

The incumbent and new employees for the data center site infrastructure operations shall

complete comprehensive, rigorous training to ensure that they are equipped with the knowledge

and skills necessary for performing their respective jobs, such that the data center operations team

is competent for its roles, the data center is operated securely in an orderly and standardized

manner, and operation risk caused by human factors is minimized. The training includes the

following five categories: general training, procurement training, professional skills training,

training on the data center’s systems and procedures, and occupational qualification certification

training.

4.3.1 New-employee training

A new employee shall complete a two-month-long pre-job training program starting from the

on-board date. The training covers the basic elements of the data center operations, such as

operational safety, rules and regulations, working procedures, equipment operation, equipment

maintenance, and equipment emergency. The pre-job training is clearly specified by job,

including the instructors for the training courses. A new employee must pass the assessment for

all the training courses, such that he is qualified for his job. Table 4.3-1 shows the training

schedule.

Table 4.3-1 New-employee pre-job training schedule

A separate assessment shall be given for each training course, and the assessment is designed

considering the importance of each course; every new employee must pass the assessment for

every course to be qualified for his job.

4.3.2 Training plan

4.3.2.1 Training plan for Engineer

The Engineer, a core technical and managerial position in the data center, assumes various

managerial and technical responsibilities in the data center. The training plan for this position,

which is based on the job description and performance targets pre-established for this position,

covers

all the management policies, processes, and systems of the data center;

the system configuration structure and operating plan of the data center;

the operation, maintenance, and emergency response of the power supply and

distribution equipment of the data center;

the operation, maintenance, and emergency response of the HVAC equipment of the data

center;

the operation, maintenance, and emergency response of the firefighting electronic

equipment of the data center.

4.3.2.2 Training plan for Technician

The Technician, a core position for on-site safeguarding of the data center, is responsible for

7 x 24 patrol, on-site control, and on-site emergency response in the data center. The training plan

for this position, which is based on the job description and performance targets pre-established for

this position, covers

all the management policies, processes, and systems of the data center;

the system configuration structure and operating plan of the data center;

the operation, maintenance, and emergency response of the power supply and

distribution equipment of the data center;

the operation, maintenance, and emergency response of the HVAC equipment of the data

center;

the operation, maintenance, and emergency response of the firefighting electronic

equipment of the data center.

4.3.2.3 Training plan for Monitoring Specialist

The Monitoring Specialist serves as the 7 x 24 alert service desk covering multiple sites of the

data center and is responsible for issuing alerts and notifications from the backstage. The training

plan for this position, which is based on the job description and performance targets pre-

established for this position, covers

the operation of the centralized power and environment monitoring system of the data

center;

the operation of the security systems of the data center;

the system configuration structure and operating plan of the data center;

the operation of the Service Bot system of the data center;

the incident management procedure of the data center.

4.3.3 Training procedure

4.3.3.1 Sign-in for training

A sign-in record shall be available for every training course and shall indicate who is required

to attend and who has attended the training course. An employee must complete and pass all the

required training courses. Otherwise, he will be disqualified from his job.

A person shall be specially assigned to verify that a sign-in record is properly completed for a

training course and subsequently to send it to be filed together with other records of the training

process.

4.3.3.2 Training assessment

At the end of a training course, the person in charge of the training shall conduct an

assessment of the training attendants. All the training attendants shall complete and pass the

training assessment. Otherwise, they will be disqualified. An attendant who fails the first instance

of assessment is allowed a second chance. A person who fails the second instance of assessment

shall be considered disqualified for his current job. The disposition of a disqualified employee

includes reassignment.

Training assessments may be conducted in the form of written examination, interview, and

operating skills assessment. Records shall be available for all assessments and shall be maintained

together with other training records by a specially assigned person.

4.3.3.3 Training review

At the end of a training course, the person in charge of the training shall conduct a review of

the implemented training. The review shall cover the reasonableness of the training plan,

completeness of the training materials, training effect, and outcome of training attendant

assessment.

The person-in-charge shall modify and improve the training curriculum according to the

outcome of the training assessment and implement the changes in the future curriculum.

The required training courses of a new employee shall be monitored with a tracking sheet,

which shall be updated by the training instructor in a timely manner. When the new employee has

completed and passed all the required training courses, the tracking sheet is used to document his

qualification for his job and is included in the centrally managed personnel file.

4.4 Staff development

As a data center site grows in scale and becomes more sophisticated in system structure, it is

more challenging to sustain its operations. A systematic training program that is comprehensive

and rich in content helps the operations team plan the data center’s operations and services more

effectively, reduce cost, improve operations processes, and render better support to business

processes, thereby improving the quality of the overall business operations.

4.4.1 Routine training

Employee routine training is planned periodically. In each December, the Engineer prepares

an annual training plan, which is subsequently approved by the leader-in-charge for

implementation. The training curriculum also covers management policies and includes courses

aimed to improve the professional competence of employees and facilitate their career

development.

Training and drilling courses:

(1) management policies of the data center;

(2) annual infrastructure security training;

(3) high- and low-voltage power distribution systems;

(4) air-conditioning systems;

(5) technical training on the firefighting systems;

(6) UPS systems;

(7) water supply and drainage systems of the data center;

(8) BA systems.

4.4.2 Special training

To support the sustainable development of the data center operations and initiate the changes they

require, special training courses are offered irregularly to cover special events, processes, or

technologies. Such courses may be offered to employees or vendors, and the training instructors

may be provided by vendors or equipment manufacturers.

If any occupational qualifications are required for an employee on the operations team, he

will need to attend relevant third-party or national occupational qualification training courses and

pass occupational skill testing.

4.5 Vendor management

Vendors play an important role in the data center operations, and hence, their service persons

shall acquaint themselves with the site work, management policies, and technical requirements of

the data center. The service support persons from vendors may enter the data center for service

delivery only after they complete and pass the required training courses. Among them, those who

have passed the training courses are included in the Master List of Qualified Service Persons.

Vendor training shall be conducted on an annual basis as a minimum. The training covers the

relevant management systems, working processes, and technologies of the data center. The vendor

service persons who fail the training may not enter the data center for service delivery.

4.5.1 Vendor training

Vendor training aims to acquaint vendor service persons with relevant management policies,

working procedures, and service requirements of the data center, so that they can provide services

to support the secure operations of the data center in a secure and effective manner. At the end of

the training, attendants are assessed for their understanding of the vendor service person

qualification requirements of the data center, the dos and don'ts when working on-site, systems for

controlling materials and persons entering and leaving the data center, and vendor service

requirements.

A minimum score of 80 is required to pass the assessment.
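The admission rule described above (pass the training, appear on the Master List of Qualified Service Persons, retrain at least annually) can be sketched as a small check. This is an illustrative sketch only: the names `TrainingResult`, `build_master_list`, and `may_enter` are hypothetical, and the one-year validity period is an assumption inferred from the annual training requirement, not something the text states.

```python
from dataclasses import dataclass
from datetime import date, timedelta

PASS_SCORE = 80                 # minimum passing score stated in the text
VALIDITY = timedelta(days=365)  # assumption: annual training implies one-year validity


@dataclass(frozen=True)
class TrainingResult:
    person: str
    score: int
    assessed_on: date


def build_master_list(results):
    """Map each person to the date of their most recent passing assessment."""
    master = {}
    for r in results:
        if r.score >= PASS_SCORE:
            if r.person not in master or r.assessed_on > master[r.person]:
                master[r.person] = r.assessed_on
    return master


def may_enter(master, person, today):
    """A service person may enter only with a passing result that is still valid."""
    passed_on = master.get(person)
    return passed_on is not None and today - passed_on <= VALIDITY
```

A person who scored below 80, or whose last passing assessment is more than a year old under the assumed validity period, would be refused entry until retrained.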

4.5.2 Service level agreement (SLA)

4.5.2.1 Power supply and distribution and UPS systems

Service response and commitment:

The vendor shall respond within 30 minutes of acknowledging the receipt of a failure

notification (by email, telegraph, telex, or telephone) from the data center and shall work

immediately to remedy the failure to safeguard the normal operation of the systems.

Level I failure: Any power distribution equipment failure that results in the failure of two or

more equipment sets (servers, storage devices, and switches), e.g., tripping of the main switch or

output switch of a power management module (PMM) cabinet, STS output failure, and air-

conditioning switchgear failure. The vendor shall arrive at the scene within one hour and remedy

the problem within two hours.

Level II failure: Any power distribution equipment failure that results in the failure of a single

equipment set in the data center, e.g., failure of a single circuit of a PMM cabinet and failure of a

single air-conditioning switch. The vendor shall arrive at the scene within two hours and remedy

the problem within four hours.

Level III failure: Any power distribution equipment failure that has not resulted in any failure

in other equipment in the data center and has no impact on the availability of the data center, e.g.,

abnormal display on a PMM cabinet, and abnormal communication with a PMM cabinet or

electricity meter. The vendor shall arrive at the scene within six hours and remedy the problem

within 12 hours.

4.5.2.2 Air-conditioning systems

Service response and commitments:

The vendor shall respond within 30 minutes of acknowledging the receipt of a failure

notification (by email, telegraph, telex, or telephone) from the data center.

Level I failure: Any precision air-conditioning equipment failure or any precision chilled-

water pipe fracture that results in the failure or failed cooling of two or more precision air-

conditioning equipment sets, e.g., failure in the power supply to the precision air-conditioning

systems, fractured belt or malfunctioning ventilation fan of precision air-conditioning units, tripping

of air-conditioning switches, fractured chilled-water pipe, malfunctioning compressor of air-

cooled air-conditioners, and coolant leakage of air-cooled air-conditioners. The vendor shall arrive

at the scene within one hour and remedy the problem within two hours.

Level II failure: Any precision air-conditioning equipment failure or any precision chilled-

water pipe fracture in the data center that results in the failure or failed cooling of a single

precision air-conditioning equipment set in the data center, e.g., fractured belt or malfunctioning

ventilation fan of a single precision air-conditioning unit, malfunctioning compressor of an air-

cooled air-conditioner, and coolant leakage of an air-cooled air-conditioner. The vendor shall

arrive at the scene within two hours and remedy the problem within four hours.

Level III failure: A partial malfunctioning of the precision air-conditioning systems in the

data center that has not resulted in the failure of any other equipment or unavailability of cooling

in the data center and has no impact on the availability of the data center, e.g., abnormal display

on the air-conditioning systems and a malfunctioning humidifier. The vendor shall arrive at the

scene within six hours and remedy the problem within 12 hours.
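The response, arrival, and remedy windows in 4.5.2.1 and 4.5.2.2 are the same for each failure level, so they can be captured in one lookup table. The sketch below computes the deadlines for a given failure. It assumes that every window is counted from the moment the vendor acknowledges the notification, which the text does not state explicitly; the names `SLA_WINDOWS`, `sla_deadlines`, and `is_breached` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# (arrival window, remedy window) per failure level, as given in the SLA text.
SLA_WINDOWS = {
    "I": (timedelta(hours=1), timedelta(hours=2)),
    "II": (timedelta(hours=2), timedelta(hours=4)),
    "III": (timedelta(hours=6), timedelta(hours=12)),
}
RESPONSE_WINDOW = timedelta(minutes=30)


@dataclass(frozen=True)
class SlaDeadlines:
    respond_by: datetime
    arrive_by: datetime
    remedy_by: datetime


def sla_deadlines(level: str, acknowledged_at: datetime) -> SlaDeadlines:
    """Return the deadlines for a failure of the given level.

    Assumption: all windows start at the acknowledgment time.
    """
    arrive, remedy = SLA_WINDOWS[level]
    return SlaDeadlines(
        respond_by=acknowledged_at + RESPONSE_WINDOW,
        arrive_by=acknowledged_at + arrive,
        remedy_by=acknowledged_at + remedy,
    )


def is_breached(deadline: datetime, actual: datetime) -> bool:
    """True when the vendor acted after the deadline."""
    return actual > deadline
```

For a Level II failure acknowledged at 09:00, this yields a response deadline of 09:30, an arrival deadline of 11:00, and a remedy deadline of 13:00.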

4.5.3 Vendor qualification

Service persons from vendors must have obtained relevant occupational qualification

certifications issued by national authorities. A service person who does not have the above

qualifications may not enter a data center site for service delivery.

Requirements for vendor communication interface:

A vendor shall designate a liaison person and a back-up person as its communication

interface with the data center.

The vendor's liaison person shall be readily available for communication and shall be able to

provide quick support in case of emergency. The support can be provided remotely by

telephone or, where necessary, on-site within the time frame set forth in the SLA.

The vendor shall maintain at least one qualified person for providing on-site emergency

support to the data center.

Working procedure:

A maintenance event or change to the infrastructure of Ping An Data Center is initiated in the

form of a work order. The work order for a maintenance event or change to be performed by a

vendor is initiated by an employee of the data center. A work order must be duly approved prior

to implementation.

In cases where a vendor intends to delay a maintenance event, a written application for the

delay shall be provided to the data center three days in advance. The vendor may not delay the

maintenance without prior permission from the data center. The maximum delay allowed is 10

days.
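The delay rule above has two numeric limits: three days' notice and a maximum delay of 10 days. A minimal validation sketch follows. It assumes that "three days in advance" is measured against the originally scheduled maintenance date and that days are calendar days; the function name `check_delay_request` is hypothetical. Passing the check only means the request is eligible for review, since the data center's permission is still required.

```python
from datetime import date, timedelta
from typing import List

MIN_NOTICE = timedelta(days=3)   # written application due three days in advance
MAX_DELAY = timedelta(days=10)   # maximum delay allowed


def check_delay_request(scheduled: date, applied_on: date, new_date: date) -> List[str]:
    """Return the rule violations for a delay request (empty list if none)."""
    problems = []
    if scheduled - applied_on < MIN_NOTICE:
        problems.append("application submitted less than three days in advance")
    if new_date <= scheduled:
        problems.append("new date is not later than the scheduled date")
    elif new_date - scheduled > MAX_DELAY:
        problems.append("delay exceeds the 10-day maximum")
    return problems
```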

4.5.4 Vendor performance evaluation

An infrastructure maintenance provider shall submit a maintenance service summary report to

Ping An Data Center every six months. The report shall be well-formatted and true in its content.

The maintenance provider shall also review the maintenance service provided in the whole year

and submit an annual maintenance service report by the last working day before the termination

date of the contract. The service quality of a maintenance provider is evaluated against the

services defined in the contract, and payment for services will be made according to the outcome

of the evaluation.

Chapter 5 Best Practices of High-availability Operations

5.1 Routine check - Overview

Ping An Data Center is required to sustain a very high level of availability. To ensure the

stable and reliable operation of the IT facilities of the data center, the operations team must

monitor the infrastructure of the data center on a 24 × 7 basis. A small defect may lead to a major

failure. A data center infrastructure failure can always be traced back to some identifiable defect,

and hence, it is very important to conduct routine checks to detect and remedy operational defects

in a timely manner.

Two types of routine checks are implemented in the data center: on-site periodic check of the

infrastructure by infrastructure technicians and engineers; real-time monitoring of the power

supply and distribution, HVAC, firefighting, and security systems as well as the operating

environment of the data center by the infrastructure monitoring specialist through the monitoring

system of the data center. These two types of routine check complement each other to minimize

the rate of major infrastructure failures and sustain the high availability of the IT facilities

of the data center.

5.1.1 Routine check - basic requirements

Smell: the odor of electrical discharge and burning odor of overheating insulators.

Listen: the sound of electric sparks and mechanical vibrations, abnormal sound caused by

abnormally high voltage or current, and mechanical vibrations caused by water pumps and

ventilation fans.

Feel: the temperature and vibration of non-live parts of equipment.

Look: electric sparking, discoloration, deformation, dislocation, damage, oil seepage, water

seepage, relay actions, electricity meter readings, indication of instruments and signal lights,

and leakage, seepage, and dripping of pipes and valves.

5.1.2 Routine check - frequency and methods

Medium- and low-voltage switchgears, UPS, precision power distribution systems, diesel

generation systems, HVAC systems, and firefighting systems are checked every four hours

through manual on-site patrol inspection. Any anomaly identified should be immediately

reported to the infrastructure engineer and logged on the ServiceBot working platform to

facilitate follow-up and remedial actions (inspection data can be recorded and transmitted

using a software application running on tablets).

Security systems: The video recording of designated cameras is checked every eight hours;

the real-time video capturing of all cameras is checked every 24 hours. In addition, the real-

time video capturing of cameras is monitored through the data center’s monitoring system or

online video surveillance system, and the storage condition of videos is monitored through

the online video surveillance system. Any anomaly identified should be immediately reported

to the engineer and logged on the ServiceBot working platform to facilitate follow-up.

The electronic monitoring system is checked every two hours. The operating conditions of the

data center (including environment systems, power distribution systems, and security systems)

are monitored using the Data Center Surveillance Application. Any anomaly identified should

be immediately reported to the engineer and logged on the ServiceBot working platform to

facilitate follow-up.
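The check intervals listed in this section (4, 8, 24, and 2 hours) can be expressed as a simple schedule. The sketch below lists which checks fall due at a given hour. It assumes that all cycles start together at hour 0, which the text does not specify; the check labels and the function names `checks_due` and `rounds_per_day` are hypothetical.

```python
# Check intervals in hours, as stated in the text. The labels are illustrative.
CHECK_INTERVALS_HOURS = {
    "on-site patrol inspection": 4,
    "designated camera recordings": 8,
    "all-camera live video": 24,
    "electronic monitoring system": 2,
}


def checks_due(hour: int):
    """Return the checks due at a whole hour, counted from a common start at hour 0."""
    return sorted(
        name
        for name, interval in CHECK_INTERVALS_HOURS.items()
        if hour % interval == 0
    )


def rounds_per_day(name: str) -> int:
    """Number of rounds of a given check in 24 hours."""
    return 24 // CHECK_INTERVALS_HOURS[name]
```

Under this assumption the on-site patrol inspection runs six times a day and the electronic monitoring system check twelve times a day.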

5.1.3 Routine check of medium- and low-voltage switchgears

1. Look: check medium- and low-voltage switchgear panels for abnormal display of

indicator lights and meters as well as warning lights; check the open/close status of medium- and

low-voltage switchgear circuit breakers against the required status for data center power

distribution.

2. Listen: check medium- and low-voltage switchgears for abnormal sound caused by partial

electrical discharge and abnormal vibration.

3. Smell: check medium- and low-voltage switchgears for odor of electrical discharge and

burning odor of overheating insulators.

4. Feel: check the non-live parts of medium- and low-voltage switchgears for abnormal

temperature and vibration.

5. Record the voltage and current values at medium-voltage incoming line switches, the

current values at feeder switches of medium-voltage transformers, and the voltage and current

values of incoming line main switches.

6. Enter the above values into the mobile inspection app installed on the tablet. If an input

value is outside the preset limits, the app page will turn red, indicating an anomaly. Where

necessary, take photographs of any anomaly identified during the inspection, and upload them

onto the mobile inspection app, which is synchronized with the ServiceBot platform of the data

center, where a work order will be generated and processed to address the anomaly. The work

order will be closed when the anomaly is remedied and the remedy is verified.

7. The mobile inspection app can record the route and time of inspections, such that the

frequency and quality of routine check can be monitored.
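The flow in steps 6 and 7 (enter a reading, flag it if it falls outside the preset limits, open a work order, and close the order once the remedy is verified) can be sketched as follows. This is not the actual inspection app or the ServiceBot interface, neither of which is described in the text. The limit values, parameter names, and the names `WorkOrder`, `record_reading`, and `close_work_order` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical preset limits (low, high); the real limits are not given in the text.
LIMITS: Dict[str, Tuple[float, float]] = {
    "incoming_voltage_kv": (9.5, 10.5),
    "incoming_current_a": (0.0, 400.0),
}


@dataclass
class WorkOrder:
    parameter: str
    value: float
    photos: List[str] = field(default_factory=list)
    status: str = "open"


def record_reading(parameter: str, value: float,
                   photos: Optional[List[str]] = None) -> Optional[WorkOrder]:
    """Return None for an in-limit reading, or an open work order for an anomaly."""
    low, high = LIMITS[parameter]
    if low <= value <= high:
        return None
    return WorkOrder(parameter, value, list(photos or []))


def close_work_order(order: WorkOrder, remedied: bool, verified: bool) -> WorkOrder:
    """Close the order only when the anomaly is remedied and the remedy is verified."""
    if remedied and verified:
        order.status = "closed"
    return order
```

Requiring both flags before closing mirrors the text's condition that a work order is closed only when the anomaly is remedied and the remedy is verified.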

5.1.4 Routine check of uninterruptible power supplies (UPS)

1. Check if AC power input, bypass input, and power output switches are properly closed and

if indicator lights work normally; check circuit breaker protection units for warning indications.

2. Check UPS panels for warning messages and buzzer alarms.

3. Check for abnormal indication of the indicator lights on UPS panels, abnormal readings of

operating parameters, and new warning messages in the history record.

4. Check for abnormal operating sound or vibration; check electrical parts for burning odor.

5. Check the operating conditions of the fans installed on the housing; check if any filtering

screens are blocked.

6. Check the temperature and humidity of the UPS room and battery room for any out-of-

limit readings.

7. Check batteries for abnormal conditions (dirt, deformation, swelling, and liquid/acid

leakage); check the battery room for abnormal odor and sound.

8. Check battery packs for overheating connections and oxidized bolts.

9. Check the tools in the UPS room and battery room for missing/damaged items, integrity of

operating tips, and inappropriate marking and labeling.

5.1.5 Routine check of precision power distribution systems

1. Check the indicator lights on switchgear panels for flashing alarm lights; check

switchgears for abnormal sound and odor.

2. Check and record the readings of the electric parameters on switchgear panels; check if the

dual mains supply is properly indicated.

3. Check precision switchgear panels for warning messages and buzzer alarms.

4. Check the isolating transformers inside switchgears for abnormal vibration, overheating,

and burning odor.

5. Check radiator fans inside switchgears for abnormal operating conditions.

6. Check for missing or inappropriate operating tips, marking, and labeling.

5.1.6 Routine check of diesel generation systems

Routine check of diesel generator units

1. Check diesel generator local control panels for alerts; check if control mode selection

switches are switched to the “Remote” position.

2. Check the operating condition of output switchboards, compound switchboards, grounding

resistance cabinets, and dehumidifier-heaters.

3. Check component surfaces and piping connections for traces of oil and water leakage;

check the floor for water and oil stains; check for bite marks and other traces indicating the

presence of rats or other varmints.

4. Check the water level of cooling-water tanks; check the operating condition of the

cooling-water heaters.

5. Check engine oil level; check oil–water separators for water content at the bottom and

discharge the water from the bottom if necessary.

6. Check the charging voltage and current of charging panels; check batteries and start relays

for terminal oxidization and corrosion; check the charging of the emergency battery packs.

7. Check the oil level of daily oil tanks; check for oil seepage.

Routine inspection of the diesel generation low-voltage power distribution room

1. Check the display of compound switchboards; check if selection switches are switched to

the “automatic” position; check for warning indications and buzzer alarms.

2. Check the indicator lights on oil supply switchboard panels; check if selection switches are

switched to the pre-defined position (“Manual” for standby of the generator units and “Automatic”

for loaded operation); check the oil level of tanks (lower limit: 500 mm; upper limit: 900 mm).

3. Check direct current cabinets for abnormal parametric readings and alerts; check central

signal cabinets for alerts; check the heat radiation of power module cabinets.

4. Check the operating condition of power and lighting switchboards; check the lighting in

computer rooms.

Routine check of diesel generation high-voltage power distribution room

1. Check if switches are switched to the appropriate positions for diesel generators to

maintain their hot standby status.

2. Check the indicator lights of instrument REF615 on switchboards for alerts.

3. Check the operating condition of switchboard electrical heaters.

4. Check if dummy load controllers display any warning signals.

5. Check if the protection equipment and tools for high-voltage operation are stored in the

right place.

Routine check of diesel supply systems

1. Check and record the reading of the magnetic level meter on the outdoor oil tank

(specification: 200–1800 mm).

2. Check the oil tank valve well for ponding, settlement, and deformation.

3. Screen the oil tank area for fire hazards; check if proper lightning protection and

grounding measures are in place.

4. Check if emergency oil pumps and pipes are properly stored.
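The level limits quoted in the checklists above (500–900 mm for the tanks checked at the oil supply switchboard, and 200–1800 mm for the outdoor oil tank) lend themselves to a simple range check when readings are recorded. The sketch below is a minimal illustration; the dictionary keys, function name, and returned strings are hypothetical, and treating a reading exactly on a limit as acceptable is an assumption.

```python
# Level limits in millimetres, taken from the checklists above.
# The key names are hypothetical labels chosen for this example.
LIMITS_MM = {
    "supply_tank": (500, 900),    # tanks checked at the oil supply switchboard
    "outdoor_tank": (200, 1800),  # outdoor tank, magnetic level meter
}


def check_level(tank: str, reading_mm: float) -> str:
    """Classify a recorded level reading against the stated limits."""
    low, high = LIMITS_MM[tank]
    if reading_mm < low:
        return "below lower limit"
    if reading_mm > high:
        return "above upper limit"
    return "ok"  # assumption: readings on a limit are accepted
```

For example, `check_level("outdoor_tank", 150)` reports a reading below the lower limit, which would be recorded as an anomaly during the routine check.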

5.1.7 Routine check of heating, ventilation, and air conditioning (HVAC) systems

Routine check of precision air-conditioning units

1. Check the parametric readings and warning messages displayed on control panels.

2. Check if generators produce abnormal vibration and sound during operation.

Routine check of centrifugal chilled-water units

1. Check the parametric readings, alerts, and alarms displayed on the main unit control

panels.

2. Listen carefully to the operating sound of the main units.

3. Check the units for water and oil leakage.

4. Check the oil level of the main units (the reading of the level gauge should be 1/3 at the minimum when the main unit is shut down).

5. Check the refrigerant piping through sight glasses (the normal color observed through a

sight glass is green).

6. Check the differences between inlet and outlet water pressures of the chilled water and cooling water piping of the main units (500 kPa at the minimum).

7. Check and record the percentage of operating current of the main units.

Routine check of circulating pumps and control cabinets for chilled-water units

1. Check and record the operating current of starter boxes.

2. Check the heat radiation of starter boxes for overheating/burning odor.

Routine check of cooling tower

1. Check the water level of the cold-water tray; check for scaling and deposit in the tray.

2. Check the operating condition of cooling tower fans.

3. Check the circulating water quality.

5.1.8 Routine check of firefighting systems

1. Check for fire alarm, fault, shielding, and monitoring messages displayed on the fire

alarm/gas extinguisher system control panel as well as buzzer alarms.

2. Check if “system normal” is displayed on the panel of the integrated firefighting/alarm

control cabinet; check the operating condition of the indicator lights on the control panel, fire

emergency telephone, fire emergency broadcasting, audio input, amplifier, and printer.

3. Check the manual control panel of the gas extinguisher system for warning lights and

buzzer alarms (the “manual” indicator light is on under normal conditions). Check the indicator

lights of manual/automatic gas extinguishing switches (the “Manual” indicator light is on under

normal conditions).

4. Check the indicator lights on the air sampler control panel (the power indicator light

should be normally on); check if any of the fault indicator lights is on.

5. Check if the pressure readings of IG541 gas cylinders (in the gas cylinder room) fall in the green area; check the gas cylinder head valves and zone selection valves in the gas cylinder room; check the magnetic valve control box in the gas cylinder room.

6. Check if the pressure readings of the heptafluoropropane fire extinguishing cylinders in the diesel power distribution room fall in the green area.

7. Check if the pressure readings of fire extinguishers in the various areas fall in the green

area.

8. Check if the selection switches of the power control boxes for smoke exhaust fans, fire

pumps, sprinkler pumps, and jockey pumps are switched to the “Automatic” position.

5.1.9 Routine check of security systems

1. A checklist of all cameras, organized by their physical wiring, is prepared for checking by shift (three shifts rotated in a one-week cycle). In each shift, the latest three days of video recordings from a certain number of cameras are checked through the online video surveillance system, so that all memory devices, video coders, and cameras are covered, video footage is confirmed to be properly stored, and any anomaly can be quickly identified.

2. In the night shift, the real-time camera videos are also checked (including camera

identification description, system time, angle, and image resolution) through the data center’s

monitoring system or online video surveillance system.
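One way to read the shift-based camera checklist above is as a rotation that spreads all cameras over the shifts of a one-week cycle. The sketch below assumes three shifts per day over seven days (21 slots) and a plain round-robin split; this reading, the function name, and the parameters are assumptions made for illustration, and the checklist actually used at the data center is grouped by physical wiring.

```python
def build_rotation(cameras, shifts_per_day=3, days=7):
    """Spread cameras over the shifts of one cycle in round-robin order.

    Returns one list of cameras per shift slot, so that every camera is
    reviewed exactly once per cycle.
    """
    slots = [[] for _ in range(shifts_per_day * days)]
    for index, camera in enumerate(cameras):
        slots[index % len(slots)].append(camera)
    return slots
```

With 50 cameras, each of the 21 slots receives two or three cameras, and the union of all slots covers the full camera list.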

5.1.10 Routine check of electronic monitoring systems

Table 5.1-1 Data center infrastructure monitoring checklist

| System/equipment | Check items |
|---|---|
| Air-conditioning systems | Ambient temperature and humidity, inlet air temperature, return air temperature, and warnings |
| Power supply and distribution systems | Voltage, current, power factor, active power, and reactive power |
| Generators | Startup and shutdown conditions, current, voltage, load factor, and power supply to control systems |
| UPS systems | Input voltage and current, output voltage and current, frequency, power factor, load factor, temperature, and warnings |
| Firefighting systems | Alarms |
| Security and electronic monitoring systems | Operating conditions of door access systems, alarms, surveillance videos, and visitor record |

1. Check if any devices are shielded on the “Security Period” page of the Data Center

Surveillance Application.

2. Check if any devices are disconnected for communication on the “Devices” page of the

Data Center Surveillance Application.

3. Check if there are any red alarms displayed on the Data Center Surveillance Application.

The operating status of devices is indicated on the monitoring system of the data center by color,

with blue or green color indicating normal operation, red indicating abnormal operation or alarm,

and grey indicating disconnected communication.
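The color convention above amounts to a lookup from display color to device state. The following sketch shows how a list of devices needing attention could be derived from such a convention; the names `DeviceState`, `COLOR_TO_STATE`, and `devices_needing_attention` are hypothetical, and the code does not reflect the interface of the Data Center Surveillance Application.

```python
from enum import Enum


class DeviceState(Enum):
    NORMAL = "normal operation"
    ALARM = "abnormal operation or alarm"
    DISCONNECTED = "disconnected communication"


# Color convention as stated in the text above.
COLOR_TO_STATE = {
    "blue": DeviceState.NORMAL,
    "green": DeviceState.NORMAL,
    "red": DeviceState.ALARM,
    "grey": DeviceState.DISCONNECTED,
}


def devices_needing_attention(colors):
    """Return the devices whose displayed color is not a normal one.

    `colors` maps a device name to the color shown for it.
    """
    flagged = {}
    for device, color in colors.items():
        key = color.strip().lower()
        if key not in COLOR_TO_STATE:
            raise ValueError(f"unknown status color: {color!r}")
        state = COLOR_TO_STATE[key]
        if state is not DeviceState.NORMAL:
            flagged[device] = state
    return flagged
```

Rejecting unknown colors, instead of silently treating them as normal, keeps an unexpected display state from being overlooked.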

5.2 Preventive maintenance - overview

Preventive maintenance is planned to extend the service life and reduce the failure rate of equipment. Through regular checks and servicing, it aims to identify equipment defects before they develop into major failures.

Ping An Data Center has established annual, quarterly, and monthly preventive maintenance plans based on equipment operating conditions and the recommendations of equipment suppliers. The maintenance personnel are required to follow the maintenance process and carry out maintenance activities in a timely manner according to the systematic characteristics of equipment. The records and reports generated from maintenance activities should be objective, practical, and properly filed. The operations team should regularly compile statistics on equipment operating conditions and perform quantitative trend analysis. For any abnormal trend identified, they will issue a warning and propose and implement reactive as well as corrective actions to minimize the possibility of major equipment failure.
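The book does not specify how the quantitative trend analysis is performed. As one possible illustration, the sketch below fits a least-squares line to equally spaced readings and flags the series when the slope exceeds a limit. The function names and the `max_slope` parameter are hypothetical, and an appropriate limit would depend on the equipment and the parameter being tracked.

```python
def slope(values):
    """Least-squares slope of equally spaced readings (units per sample)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two readings are required")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def abnormal_trend(values, max_slope):
    """Flag a series whose fitted slope exceeds max_slope in either direction."""
    return abs(slope(values)) > max_slope
```

A steadily rising series such as monthly connection temperatures of 30, 31, 32, and 33 °C would be flagged with a limit of 0.5 °C per month, whereas readings that merely fluctuate around 30 °C would not.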

5.2.1 Preventive maintenance - general requirements

Ping An Data Center has established detailed maintenance operation procedures (MOPs) for

all infrastructure maintenance activities, including step-by-step description as well as the

person-in-charge and schedule for each maintenance activity. Equipment standard operation

procedures (SOPs) should be followed during maintenance. This is to ensure the smooth

completion of maintenance activities and avoid wrong equipment operation that may result in

major equipment failure or personal injury. For example, the switching of medium-voltage

switches, manual startup of generator units, and switching of a UPS to its bypass circuit must

follow the respective SOPs.

The annual preventive maintenance plan of the data center must be followed, and the target

completion rate for annual preventive maintenance is set at 95%.
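The 95% target can be tracked by comparing completed maintenance activities with planned ones. The sketch below is a minimal illustration: the function name and arguments are hypothetical, and counting each planned activity equally is an assumption, since the book does not say how the rate is weighted.

```python
TARGET_PERCENT = 95  # target completion rate stated above


def meets_target(planned: int, completed: int,
                 target_percent: int = TARGET_PERCENT) -> bool:
    """Check whether completed/planned reaches the target percentage.

    Integer arithmetic is used so that a rate of exactly 95% is not lost
    to floating-point rounding.
    """
    if planned <= 0:
        raise ValueError("planned must be a positive number of activities")
    if not 0 <= completed <= planned:
        raise ValueError("completed must be between 0 and planned")
    return completed * 100 >= target_percent * planned
```

For instance, completing 190 of 200 planned activities gives exactly 95% and meets the target, while 189 of 200 does not.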

5.2.2 Checklists for preventive inspection, maintenance, and operation (including but not

limited to the systems and equipment listed below)

Table 5.2-1 Data center infrastructure preventive inspection checklist

| System/equipment | Functional check | Vulnerability check |
|---|---|---|
| Power supply and distribution systems | Power frequency voltage withstand test of circuit breakers, main circuit insulation resistance test of circuit breakers, transmission test and interlock test of switchgears, checking the primary and secondary circuits of switchgears, cleaning dust inside switchgears, checking if holes are properly plugged and sealed, and insulation, voltage withstand, and grounding tests of mains cables and transformers | Power rating test of circuit breakers, partial discharge test of switchgears, test of capacitors, checking lightning protection devices, checking cables and components for overheating |
| Generators | Checking operating parameters, checking the generator units for vibration and overheating | Checking startup batteries, oil level, cooling liquid level, and air suction and smoke exhaust channels |
| UPS systems | Checking components for overheating, checking batteries (appearance, liquid level, and wiring terminals) | Checking components and cables for overheating, checking the discharging time of batteries |
| Air-conditioning systems | High and low pressures (air cooling system), chilled-water pressure and temperature, cooling-water pressure and temperature (water cooling system), operating conditions of fans | Hot spots in computer rooms, checking indoor units for water leakage, checking the operating conditions of outdoor fans, checking filtering dust screens |
| Firefighting systems | Pressures and expiration dates of air cylinders, checking sensors for contamination | Pilot cylinders, pipe switches, and air pressures |
| Security systems | Sensitivity of components, image sharpness (at different levels of illumination) | Sensitivity of components, monitoring blind angle |

Table 5.2-2 Data center infrastructure preventive maintenance and operation checklist

| System/equipment | Basic maintenance | Testing | Data operation |
|---|---|---|---|
| Power supply and distribution systems | Switching operations | Spare power automatic switching test, spare power automatic interlocking test | Backup of the logs of circuit breaker protection units |
| Generators | Replacing filtering devices, cleaning generator body | No-load test, loaded test, and switchover test | Backup of operating log, backup/deletion of alarm record |
| UPS systems | Cleaning the bypass circuit and the inside of the housing | Bypass test, battery discharge test | Backup of operating log, backup/deletion of alarm record |
| Air-conditioning systems | Startup and shutdown, cleaning/replacing filtering screen, cleaning/replacing humidifier system, cleaning condensers | Water leakage alarm test | Backup of operating log, backup/deletion of alarm record |
| Firefighting systems | Cleaning sensors | Startup test, testing sensors | Backup/deletion of alarm record |
| Security systems | Door access authorization | Sensitivity of components, image resolution (at different levels of illumination) | Export and backup of door access record, backup/deletion of surveillance videos, backup/deletion of alarm record |

5.2.3 Preventive maintenance - detailed schedules for key systems

Preventive maintenance of medium- and low-voltage switchgears - general requirements

1. The preventive maintenance of medium- and low-voltage switchgears includes general

live-line check (semi-annual), spare power automatic switching logic test (annual), and switchgear

test and maintenance (every three years).

2. The above maintenance activities are carried out by engineers of switchgear manufacturers

according to the MOP and SOP of the data center.

3. As the most important means for maintaining the power supply and distribution systems of

the data center, the preventive maintenance is intended to identify and remove safety hazards with

the operation of switchgears in a timely manner, extend equipment service life, and improve

system availability.
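The three maintenance intervals listed above (semi-annual, annual, and every three years) can be turned into due dates with a small helper. The sketch below is illustrative: the dictionary keys reuse the activity names from the list, while the function names and the rule of clamping to the last day of a shorter month are assumptions made for this example.

```python
import calendar
from datetime import date

# Intervals in months, as stated in the general requirements above.
INTERVAL_MONTHS = {
    "general live-line check": 6,
    "spare power automatic switching logic test": 12,
    "switchgear test and maintenance": 36,
}


def add_months(start: date, months: int) -> date:
    """Shift a date by whole months, clamping to the month's last day."""
    index = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(index, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def next_due(activity: str, last_done: date) -> date:
    """Next due date of an activity, counted from its last completion."""
    return add_months(last_done, INTERVAL_MONTHS[activity])
```

For example, a switchgear test and maintenance completed on 15 May 2018 would next fall due on 15 May 2021 under this rule.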

Preventive maintenance checklist for medium-voltage switchgears

Table 5.2-3 General live-line check

| Category | Maintenance item |
|---|---|
| Operating environment of power distribution room | Check and record the temperature and humidity of the power distribution room; check if the room is properly ventilated, cable ducts are properly sealed, and appropriate measures have been taken against varmints; check protection and operation tools |
| Load of switchgears | Record voltage and current values |
| Temperature of switchgears | Record the temperatures of the low-voltage chamber, rear panel, and front panel of switchgear |
| Condition of switchgear | Check the condition of the display panel of the protection unit, indicator lights (electrical heating, closing/opening of switches, energy storage, grounding switch, and high-voltage presence), relay plate, and low-voltage chamber lighting |

Table 5.2-4 Spare power automatic switching logic test

| Category | Test description |
|---|---|
| Automatic switching between mains lines | Disconnect one mains line and test the logic of automatic switching between the two medium-voltage mains lines (connected to one busbar) |
| Automatic switching between mains power and diesel generation power | Disconnect both mains lines and test the logic of automatic switching between mains power and diesel generation power |

Table 5.2-5 Switchgear testing checklist

| Category | Subcategory | Test items |
|---|---|---|
| Housing | Earthing of the housing | Test the integrity and resistance of the main earthing circuit |
| Housing | Switchgear main circuit | Test the resistance and voltage withstand (destructive test, not recommended unless necessary) of the main circuit |
| Housing | Lightning protection devices | Check and test lightning protection and monitoring devices |
| Housing | Current transformer | Calibrate polarity, transformation ratio, and excitation characteristic curve |
| Housing | Potential transformer | Test transformation ratio and no-load current |
| Housing | Protection relay | Rating test, protection and signaling function test |
| Housing | The five error-proof functions of the interlock mechanisms | Calibrate the mechanical and electrical interlock mechanisms |
| Housing | Low-voltage chamber secondary circuit insulation | Sensitivity of components, image sharpness (at different levels of illumination) |
| Circuit breaker | Main circuit | Test the main circuit resistance |
| Circuit breaker | Opening/closing coils | Test DC resistance and low-voltage operations |
| Circuit breaker | Maintenance of operating mechanisms | Adjustment, repair, lubrication, and other in-depth maintenance items (special-purpose solvent and lubrication grease); replacement of quick-wear parts |
| Circuit breaker | Insulation of the control component of circuit breaker | Test the insulation resistance of secondary components (opening/closing coil, auxiliary contact, relay, and energy-storage motor) |
| Circuit breaker | Integrity of vacuum interrupter | Voltage-withstand test (destructive test, not recommended unless necessary) |
| Special-purpose diagnostics and tests | Partial discharge test | Switchgear partial discharge test |
| Special-purpose diagnostics and tests | Operating behaviors of fuses | Preventive failure diagnostics and testing of fuses |
| Special-purpose diagnostics and tests | Mechanical behaviors of circuit breakers | Test mechanical behaviors of circuit breakers |

Table 5.2-6 Switchgear maintenance checklist

| Category | Sub-category | Maintenance |
|---|---|---|
| Busbar chamber | Cleaning | Clean main circuit and insulation parts with anhydrous alcohol |
| Busbar chamber | Bolt tightening torque calibration | Tighten busbar bolts with a torque of 70 N·m (the bolts should not move) |
| Busbar chamber | Maintenance of insulation parts | Check insulation plate, moving and fixed contacts box between main line, busbar, and housing for damage, electrical discharge, and flashover, and clean them with anhydrous alcohol |
| Cable chamber | Cleaning | Clean main circuit, insulation parts, cable heads, and transformers with anhydrous alcohol |
| Cable chamber | Bolt tightening torque calibration | Tighten cable bolts with the specified torque (the nuts should not be removed) |
| Cable chamber | Maintenance of insulation parts | Check wall bushings, insulation plates, cable heads, and transformers for damage, electrical discharge, and flashover, and clean them with anhydrous alcohol |
| Cable chamber | Maintenance of earthing switches | Check if earthing knife-switches operate normally; check the operation and position indication of interlock couplers; check if auxiliary contact switches operate normally; clean and lubricate contacts |
| Cable chamber | Maintenance of seals | Check the ingress protection of the cable chamber against varmints and water vapor; improve the seals where necessary |
| Trolley chamber | Cleaning | Clean contact boxes and curtain doors with anhydrous alcohol |
| Trolley chamber | Bolt tightening torque calibration | Check if fixed contacts are properly tightened; check the integrity of curtain door mechanism bolts and jump rings |
| Trolley chamber | Maintenance of insulation parts | Check contact boxes for damage, electrical discharge, and flashover, and clean them with anhydrous alcohol |
| Trolley chamber | Lubrication | Clean curtain door mechanisms and earthing trolley rails of fixed contact boxes with anhydrous alcohol |
| Low-voltage chamber | Functionality of secondary components | Secondary components should be functionally reliable and free of loose connection, electrical discharge, and ablation |
| Low-voltage chamber | Security of terminal wiring | Tighten terminal wiring; check terminals for ablation and loose connection |
| Circuit breaker | Maintenance of operating mechanisms | Check the inside of operating mechanisms for missing or damaged parts; clean and lubricate them where necessary |
| Circuit breaker | Secondary circuit | Check opening/closing coils, energy-storage motors, relays, and sensitive switches |
| Trolley chamber | Signal plates | Adjust or replace signal plates |
| Trolley chamber | Mechanical interlock mechanisms | Lubricate and check mechanical interlock mechanisms; check if they function reliably |
| Trolley chamber | Contacts and contact arms | Clean contact arms; clean, lubricate, and tighten moving contacts |

Preventive maintenance checklist for low-voltage switchgears

1. The preventive maintenance of low-voltage switchgears includes general live-line check

(semi-annual), spare power automatic switching logic test (annual), and switchgear test and

maintenance (every three years).

2. The above maintenance activities are carried out by engineers of switchgear manufacturers

according to the MOP and SOP of the data center.

Table 5.2-7 General live-line check

| Category | Maintenance item |
|---|---|
| Operating environment of power distribution room | Check and record the temperature and humidity of the power distribution room; check if the room is properly ventilated, cable ducts are properly sealed, and appropriate measures have been taken against varmints; check protection and operation tools |
| Load of switchgears | Record voltage and current values |
| Temperature of switchgears | Record the temperatures of the low-voltage chamber, rear panel, and front panel of switchgear |
| Condition of switchgear | Check the condition of the display panel of the protection unit, indicator lights (closing/opening of switches and energy storage) |

Table 5.2-8 Spare power automatic switching logic test

| Category | Test description |
|---|---|
| Spare power automatic switching | Disconnect one feeder line of the transformer and test the logic of automatic switching between the two low-voltage mains lines (connected to one busbar) |

Table 5.2-9 Switchgear testing checklist

| Category | Sub-category | Test description |
|---|---|---|
| Housing | General check | No paint peeling-off, no housing deformation, legible labeling on instrument dials, no abnormal condition inside the housing |
| Housing | Insulation resistance of main busbar and control circuit | Test with 500 VDC or 1000 VDC insulation resistance tester; minimum 1000 MΩ insulation resistance. Test to be conducted via the grounding method and secondary control function to be considered. Break grounding connections for the test |
| Housing | Grounding connections | Check the reliability of the system, cabinet, and board grounding connections against the specific grounding system requirements; grounding connection of output cables; the equal-potential grounding of cabinet doors |
| Housing | Busbar and cable connections | Check cable and busbar connections for overheating (using an infrared thermometer or imager); check major connections using a torque wrench against preset torque value |
| Housing | Mechanical function of drawer circuit | Check the indication of the drawer circuit; check if it can be pushed in and pulled out normally |
| Circuit breaker | General check | Check appearance (no overheating-caused contact oxidization, no traces of flashover outside the arc-extinguishing chamber, integrity of front panel, framework deformation, integrity of secondary terminals, legibility of secondary line labeling) |
| Circuit breaker | Phase-phase insulation and insulation between upper and lower ports | Test with 500 VDC insulation resistance tester (minimum 1000 MΩ insulation resistance required) |
| Circuit breaker | Contact wear (air circuit breaker) | Open the arc-extinguishing chamber cover and check the wear of phase contacts |
| Circuit breaker | Trip force (air circuit breaker) | Test the trip force of air circuit breaker actuators using a special-purpose tester |
| Circuit breaker | Mechanical operation | Test the following operations: rocking in and out, manual energy storage, and manual closing/opening; check the snap-in force of framework clamps |
| Circuit breaker | Interlock function | Check mechanical and electrical interlock function |
| Circuit breaker | Mechanical behaviors (air circuit breaker) | Test the current curve, energy storage speed, three-phase synchronization, contact resistance, bouncing, and over travel using a Prodia mechanical characteristics tester |
| Circuit breaker | Operating characteristics of protection units | Test the functionality of protection units and conduct selective analysis using a Proselect protection unit tester |
| Compensation capacitor | General check | Check capacitors for swelling and deformation; check connection cables for discoloring; check the appearance of contactors and series reactors; check if ventilation holes are plugged; check for dust deposited on dust screens |
| Compensation capacitor | Main incoming line harmonic (loaded) | Test total harmonic distortion rate and specific harmonic content using a power quality analyzer |
| Compensation capacitor | Controller configuration and alarm record | Check its measurement display, parameter setting, and alarm record |
| Compensation capacitor | Phase current of capacitor (live-line) | Test with clip-on ammeter while switching it on manually |
| Compensation capacitor | Operating condition of contactors during stepped switching | Observe the contactor’s vibration and noise while it is being switched on and off |
| Compensation capacitor | Panel display during stepped switching | Observe the varying display of power factor, current, and step number during manual switching |
| Compensation capacitor | Startup of fans | Check functionality of fans while they are being manually switched on and off |
| Compensation capacitor | Temperature alarm devices | Test their operating condition manually |
| Compensation capacitor | Capacitance of capacitors (power off) | Test phase-phase capacitance of capacitors using a capacitance meter (the measurement should be higher than 90% of the theoretical value) |
| Compensation capacitor | Contactor circuit resistance | Test the contact resistance of each contactor by phase (power off, manual switching) |
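Two of the test items above carry numeric acceptance criteria: a minimum insulation resistance of 1000 MΩ and a measured capacitance higher than 90% of the theoretical value. The sketch below shows how recorded measurements could be checked against these criteria; the function and constant names are hypothetical, and the comparison operators follow the wording of the table ("minimum" read as inclusive, "higher than" as strict).

```python
MIN_INSULATION_MOHM = 1000.0   # minimum insulation resistance, in megaohms
MIN_CAPACITANCE_RATIO = 0.90   # measured value must exceed 90% of theoretical


def insulation_ok(measured_mohm: float) -> bool:
    """True if the insulation resistance meets the stated minimum."""
    return measured_mohm >= MIN_INSULATION_MOHM


def capacitance_ok(measured: float, theoretical: float) -> bool:
    """True if the measured capacitance is higher than 90% of theoretical.

    Both values must be given in the same unit.
    """
    if theoretical <= 0:
        raise ValueError("theoretical capacitance must be positive")
    return measured > MIN_CAPACITANCE_RATIO * theoretical
```

A capacitor that measures 85 against a theoretical 100 in the same unit would fail this check and become a candidate for replacement during switchgear maintenance.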

Table 5.2-10 Switchgear maintenance checklist

| Category | Maintenance item | Maintenance description |
|---|---|---|
| Housing | Cleaning dust inside the cabinet | Clean dust with a vacuum cleaner; scrub insulators and cable connections with a dry cloth and anhydrous alcohol |
| Housing | Lubrication of clips for plug-in type functional units | Apply a small amount of conductive paste to connections (clips, silver-plated bars of the moving part, copper bar at the incoming side of the drawer) |
| Housing | Cleaning and lubricating mechanical parts | Clean the positioning mechanism, bearings, and sliding guide of drawer; lubricate the positioning mechanism only |
| Circuit breaker | Cleaning and lubricating exterior mechanisms | Clean and lubricate rock-in and -out mechanisms and interlock mechanisms |
| Circuit breaker | Disassembling air circuit breakers for maintenance | Disassemble energy-storage springs, opening/closing coils, energy-storage motors, secondary auxiliary contacts, and tripping units for comprehensive check, cleaning, and service; replace consumable parts |
| Circuit breaker | Cleaning and lubricating main contacts | Clean and lubricate contacts and clips on the main body and chassis |
| Circuit breaker | Tightening chassis bolts | Tighten the bolts for connecting the chassis to the cabinet |
| Circuit breaker | Replacing control unit batteries | Replace the batteries in the control unit |
| Compensation capacitor | Cleaning the inside of capacitance compensation cabinet | Clean dust with a vacuum cleaner; scrub insulators and cable connections with a dry cloth and anhydrous alcohol |
| Compensation capacitor | Tightening internal cables | Tighten primary and secondary connection cables |
| Compensation capacitor | Replacing ventilation hole dust screens | Replace dust screens and sealing rubber strips |
| Compensation capacitor | Cleaning and lubricating fuse seats | Clean contacts and clips on fuse seats, and apply a small amount of conductive paste |
| Compensation capacitor | Replacing failed parts and aged capacitors | Replace capacitors, fuses, and contacts that have failed testing |

Preventive maintenance of diesel generation systems

1. No-load test: conducted monthly by the operations team of the data center to verify the

automatic startup and parallel operation functions of the generator units.

2. Single-unit dummy load test: conducted monthly jointly by the service provider and

operations team to verify the effective load capacity of the generator units.

3. Loaded test under parallel operation: conducted annually jointly by the service provider

and operations team to verify the automatic startup and parallel operation functions and effective

load capacity of the generator units.

4. Monthly preventive maintenance:

A. Check engine appearance: Check the fastenings of the engine’s coolant, fuel, and

smoke exhaust systems and tighten or replace them where necessary.

B. Check engine oil level: Pull out the engine oil level gauge after the generator units are

shut down for five minutes and check if the oil level is between the “L” (low) and “H” (high)

marks. Replenish engine oil if the oil level is lower than the “L” mark.

C. Check coolant level: Open the pressure cover of the cooling system and check the

coolant level. Replenish coolant (to below the coolant filling neck on the radiator) if the coolant

level is too low. Be sure not to replenish coolant until the coolant temperature decreases to below

50 ℃. Re-install the cooling system pressure cover after the replenishment.

D. Visual check of cooling fans: Visually check cooling fans for cracking, loose screws,

bent blades, and other anomalies. Liaise with the vendor to remedy any damage or anomaly.

E. Check the operating condition of engine coolant heaters. If the power supply to a heater is

normal but the temperature is too low, the heater may have stopped working. Repair any

malfunctioning heater promptly to restore normal operation.

F. Check the engine’s air intake filter: The air filter indicator is located on the air filter

assembly or between the assembly and the turbocharger. As dust accumulates on the filtering

element, the reading on the indicator rises. Clean or replace the filtering element when the

reading exceeds the threshold.

G. Check air intake pipes for looseness: Check air intake pipes for cracking, piercing, or

loose clamping. Tighten or replace the loose parts where necessary to ensure no leakage in the air

intake system. Check the hoses under clamps for corrosion. Replace the hoses where necessary, to

prevent foreign materials from entering the engine.

H. If the diesel fuel system is equipped with an oil–water separator, drain the water inside

it as follows: Turn the water drain valve anticlockwise two turns. Wait until only clean fuel is

discharged from the oil–water separator. Close the water drain valve by turning it clockwise two

turns. Do not overtighten the valve, to avoid damaging the thread.

I. Where necessary, discharge sludge from fuel tanks as follows: Loosen the threaded oil drain

plug with a spanner. Drain the tank until only clean fuel is discharged from it. Close the

blow-down valve and refit the plug.

J. Check storage batteries and DC startup systems: Check if storage battery terminals are

clean and securely wired. Clean and re-wire them where necessary. Check if wire harnesses of DC

systems are properly connected, and replace damaged harnesses. Check the connections between

storage batteries and AC chargers. Check charger belts visually for cracking and other anomalies.

5. Annual preventive maintenance:

A. Refer to the monthly preventive maintenance items above.

B. Replace engine oil and engine oil filters.

C. Clean daily fuel tanks, and replace fuel filters.

D. Replace coolant filters and air filters.

6. Preventive maintenance of diesel generators:

A. Check the underground fuel tank; check the water level in the inspection hole and drain

the water (biweekly).

B. Check if there is water in the underground fuel tank by drawing a fuel sample from its

bottom through the oil drain port (monthly).

C. Replace startup batteries and startup relays (every two years).

D. Replace the spare batteries in the integrated control cabinet (every two years).

E. Replace coolant (every three years).

F. Replace fuel in the underground fuel tank and clean and test the tank according to fuel

quality test results (every five years).

G. Perform in-depth maintenance and test generator units (every ten years). Scrap or

replace the units if their reliability is compromised or their main performance indexes cannot

satisfy the preset specification.
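The intervals listed above can be kept in a small schedule table so that due dates are computed rather than remembered. The following Python sketch is illustrative only: the task names and intervals are taken from the list above, while the `Task` class, the `next_due` and `overdue` helpers, and the approximation of a month as 30 days and a year as 365 days are assumptions made for this example and do not describe the data center's actual tooling.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Task:
    name: str
    interval_days: int  # interval approximated in days (assumption)


# Intervals from the list above; month = 30 days and year = 365 days are
# simplifications made for this illustration.
SCHEDULE = [
    Task("Drain water from underground fuel tank inspection hole", 14),
    Task("Sample fuel from bottom of underground fuel tank", 30),
    Task("No-load test", 30),
    Task("Loaded test under parallel operation", 365),
    Task("Replace coolant", 3 * 365),
    Task("Replace fuel in underground fuel tank", 5 * 365),
    Task("In-depth maintenance and test of generator units", 10 * 365),
]


def next_due(task: Task, last_done: date) -> date:
    """Return the date on which the task next falls due."""
    return last_done + timedelta(days=task.interval_days)


def overdue(last_done: dict[str, date], today: date) -> list[str]:
    """Return the names of tasks whose due date is before today.

    Tasks with no recorded completion date are treated as overdue.
    """
    result = []
    for task in SCHEDULE:
        done = last_done.get(task.name)
        if done is None or next_due(task, done) < today:
            result.append(task.name)
    return result
```

Calling `overdue(history, date.today())` with a dictionary of last-completion dates returns the tasks that need scheduling. A real plan would use calendar-aware intervals and follow the MOP and SOP rather than fixed day counts.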

Preventive maintenance of UPS

Preventive maintenance of UPS is carried out quarterly by service engineers of the original

manufacturer according to the MOP and SOP of the data center. Where conditions permit, the

preventive maintenance also includes more in-depth functional checks of the UPS systems,

performed quarterly or at longer intervals. These checks may involve UPS switching operations

and must not be performed without adequate protection measures in place.

1. Check input power quality (input voltage and frequency) and output power quality (output

voltage, frequency, and output waveform distortion factor).

2. Check if the power switchover time is smaller than the specified value.

3. Check if the transient output voltage drop during power change-over is smaller than the

specified value.

4. Check if the output harmonic distortion factor is smaller than the specified value.

5. Check if the floating charge voltage and charging current fall within the respective design

specifications.

6. Check the voltages of battery pack and single batteries.

7. Check the battery pack backup time as follows: Turn off the main circuit input switch,

discharge the batteries for 30 minutes, turn on the switch, and record the backup time.

8. Check if the battery pack outputs large transient current while starting up.

9. Check the internal resistance of battery packs. If the internal resistance exceeds the

specified value, perform equalizing charge of the battery packs and thereafter discharge or treat

them with activation.

10. Check the manual opening and closing of upstream and downstream switchgear circuit breakers.

11. Check current sharing among units under parallel operation and the parallel operation

change-over logic.

12. Shut down the UPS, check the tightness of its internal connections, and clean the dust on

key electrical parts.

13. Check the operating condition of radiator fans. Replace defective fans.

14. Simulate failures of the UPS systems to identify potential issues with the systems. This

helps prevent failures of the UPS systems when they are required to support operations. Ensure

that protection measures are put in place for the simulation.

A. Simulate mains power outage and observe if the UPS units switch to different working

modes normally.

B. Simulate mains power outage and record the discharging voltage curves of the battery

packs.

C. Simulate the failure of one of the parallel-connected UPS units, and observe if the other

units work normally.

15. As recommended by the manufacturer, replace the AC and DC capacitors of UPS units

preventively after five years of service.
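Several of the checks above (items 2 to 4) compare a measurement with a "specified value". A minimal sketch of such a comparison is shown below. The document does not state the specified values, so the numbers in `LIMITS` are placeholders, and the `Limit` class and `failed_checks` function are hypothetical names introduced only for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    name: str
    maximum: float
    unit: str


# Placeholder limits only: the real specified values come from the UPS
# manufacturer's documentation and are NOT given in this document.
LIMITS = [
    Limit("switchover_time", 10.0, "ms"),
    Limit("transient_voltage_drop", 5.0, "%"),
    Limit("output_thd", 3.0, "%"),
]


def failed_checks(measured: dict[str, float]) -> list[str]:
    """Return a message for every measurement that is missing or not below its limit."""
    failures = []
    for limit in LIMITS:
        value = measured.get(limit.name)
        if value is None:
            failures.append(f"{limit.name}: no measurement recorded")
        elif value >= limit.maximum:
            failures.append(
                f"{limit.name}: {value} {limit.unit} is not below "
                f"{limit.maximum} {limit.unit}"
            )
    return failures
```

An empty list means every recorded value is below its limit. Treating a missing measurement as a failure is a design choice made here so that a skipped check cannot pass silently.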

Preventive maintenance of air-conditioning systems

1. The data center conducts preventive maintenance of the air-conditioning systems to ensure

their operating safety and stability and sustain their energy-saving performance.

2. Monthly preventive maintenance of chilled-water units:

A. Check, record, and analyze the operating conditions of the units.

B. Check the level and color of lubrication oil.

C. Check the lubricant supply and return circuits of the lubrication system, lubricant

temperature, and the operating condition of lubricant coolers.

D. Check the time differences of startup/shutdown between lubricant pumps and main

units.

E. Check for abnormal vibration and noise.

F. Check the temperature of output chilled water against the specified value.

G. Check the evaporating temperature and condensing temperature against inlet and

outlet chilled water and cooling water temperature differences.

H. Check for any leakage in the units.

I. Check motor current against actual electricity consumption.

J. Check the operating condition of guide vane actuators.

K. Check the control configuration of the units.

L. Analyze the operating condition of the units.

3. Annual preventive maintenance of chilled-water units

A. Check the evaporator and condenser pressures displayed on the control panel

against measurements.

B. Transfer refrigerant to the condenser, and discharge refrigerant oil from the

refrigerant oil filling valve.

C. Check oil system circuits and oil cooling systems, replace oil filters, and replenish

refrigerant oil.

D. Check refrigerant system circuits, and replace refrigerant filters.

E. Dehumidify and vacuum evaporators.

F. Balance refrigerant system pressure and check the housing of the units for pressure

leakage.

G. Test the insulation of compressor and pump motors.

H. Check the operating condition of guide vane actuators.

I. Check and clean startup cabinets.

J. Check parameters and automatic control of the units: condensing and evaporating

pressures, bearing temperature, motor coil temperature, oil bath temperature, inlet and outlet

chilled water temperatures, pressures, oil pressures, and oil pressure differences. Start up and

shut down guide vanes and start up oil pump to check oil pressure and output digital signals of

oil heating relays.

K. Start up the units for a test run, and issue a work order for annual preventive

maintenance of the units based on the operating conditions.

4. Monthly preventive maintenance of computer room air conditioners

A. Check and record operating parameters of precision air conditioners; check controllers

for warning messages.

B. Check the tightness and wear of belts. Adjust or replace them where necessary.

C. Clean or replace air filtering screens.

D. Check the working condition of proportional control valves.

E. Check the discharge of chilled water and the outlet air of the units.

5. Monthly preventive maintenance of cooling tower

A. Check and record the operating current of the cooling tower.

B. Check the operating condition of the cooling tower. The fan blades should rotate in

balance, without significant vibration or scraping against the cooling tower wall. The water tray

should be filled with an appropriate level of water.

C. Replenish the lubrication oil for fan reducers. Check belts and belt pulleys, and

adjust them where necessary.

D. Check water distribution devices and cooling tower water replenishment devices.

E. Check the condition of fillers for clogging or damage.

F. Check the cooling tower piping, framework, and ladder for corrosion.

6. Other preventive maintenance items for the cooling tower

A. Clean cooling tower tray and filler (quarterly).

B. Check motor insulation (annually).

C. Replace cooling tower filler (every five years or depending on the working condition

of the cooling tower filler).

7. Monthly maintenance of water pipe network and water quality

A. Check pipes and valves for water dripping and leakage. Check piping heat insulation

materials for traces of water dripping and leakage.

B. Check pipes for displacement, settlement, bending, and deformation, and report any

anomaly identified immediately.

C. Check valve surface for seepage and corrosion, and remedy any leakage identified.

Perform regular test operation of valves to ensure that they can be easily switched on and off.

D. Check pipe flanges for corrosion, looseness, and water dripping and leakage.

E. Check water system piping. Check pipes and accessories (flexible joints, check

valves, and water treaters) for visible defects and cracking. Check the joints for water seepage.

Take immediate action on any defect identified.

F. Remove rust on water pipes and valves and repaint them to maintain integrity of

painting (no peeling-off). Repair any insulation layer damage immediately.

G. Check pipe brackets for insecure installation, dislocation, or deformation. Check

wooden pipe carriers for corrosion and deformation.

H. Check if the cooling water is clean. Replace it where necessary. Analyze water

quality regularly, and add germicide, algicide, anti-sludging agent, and/or corrosion inhibitor to

the water where necessary.

I. Check the quality of softened water for the chilled-water system. Check the softened

water system.

J. Check the accuracy of pressure gages and thermometers. Instrument dials should be

clear. Replace any damaged dials immediately.

K. Check the operating condition of float valves for cooling water replenishment and

chilled-water pressure-stabilization and replenishment devices.

L. Clean water piping filters when the pressure difference across a filter exceeds

0.05 MPa.

M. Ensure that appropriate anti-freezing measures are in place for outdoor piping in

winter.

N. Check the accuracy of pressure gages and thermometers for water distributors and

collectors.

8. Preventive maintenance of circulating water pumps

A. Replenish lubrication oil (quarterly).

B. Check water pump sealing (quarterly). Repair any water leakage identified.

C. Test and calibrate the concentricity of couplings, and check coupling bolts and

rubber rings (annually). Replace damaged parts.

D. Tighten pump seat screws and perform antirust treatment to pumps (annually).

E. Service water pumps (annually), including a check of major parts such as the impeller,

sealing ring, and bearings. Clean the impeller and remove scaling from the impeller water

channels.

9. Monthly maintenance of motors and power distribution and control systems

A. Motors should operate normally, with bearings well lubricated and insulation

resistance greater than 2 MΩ. All wiring connections should be secure, and the load current and

temperature increase should satisfy the respective specifications.

B. Check the operating conditions of frequency converters and soft-start starters (the

temperature increase should not exceed the specified value).

C. Electrical and control components should have clean surfaces, be structurally intact,

operate accurately, and have fully working display and alarm functions.
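Two of the items above carry explicit numeric criteria: water piping filters are cleaned when the pressure difference across them exceeds 0.05 MPa (item 7.L), and motor insulation resistance should be greater than 2 MΩ (item 9.A). The sketch below encodes just these two rules. The function and constant names are hypothetical, and the readings are assumed to be entered in MPa and MΩ.

```python
FILTER_DP_LIMIT_MPA = 0.05   # from item 7.L above
MIN_INSULATION_MOHM = 2.0    # from item 9.A above


def filter_needs_cleaning(inlet_mpa: float, outlet_mpa: float) -> bool:
    """True when the pressure drop across the filter exceeds 0.05 MPa."""
    return (inlet_mpa - outlet_mpa) > FILTER_DP_LIMIT_MPA


def insulation_ok(resistance_mohm: float) -> bool:
    """True when the motor insulation resistance is greater than 2 MΩ."""
    return resistance_mohm > MIN_INSULATION_MOHM
```

For example, a filter with 0.30 MPa at the inlet and 0.24 MPa at the outlet would be flagged for cleaning, while one with 0.27 MPa at the outlet would not.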

5.3 Predictive maintenance - overview

To sustain the secure and stable operation of the data center, the operations management team

regularly monitors the infrastructure of the data center (power supply and distribution, UPS, diesel

generator, chilled water, and lightning protection and grounding systems) using various

instruments and professional third-party testing services. As one of the major types of proactive

maintenance activities to sustain the secure operation of the data center, predictive maintenance

involves comprehensive trend analysis of data about infrared temperature increase, vibration, and

chemical composition of fuel and lubrication oil, with the aim of diagnosing the operating health

of the component systems and facilitating the early identification and timely, effective mitigation

of potential risks with the systems by the operations management personnel.

5.3.1 Predictive maintenance - general requirements

Establish and implement detailed annual predictive maintenance plans.

Measurement tools used for predictive maintenance should be regularly calibrated according

to the quality inspection department’s calibration procedure to maintain their measurement

accuracy.

Employ third-party testing professionals to test the systems and equipment of the data center

and produce relevant testing reports.

Predictive maintenance should be performed according to MOP and SOP to ensure equipment

and personnel safety during the maintenance.

Reports should be generated for completed predictive maintenance activities and include

trend analysis based on comparison with historic data.
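The trend analysis mentioned in the last requirement can be as simple as fitting a straight line to successive readings of one measured quantity (for example, the infrared temperature rise of a busbar joint) and flagging a sustained increase. The sketch below uses an ordinary least-squares slope. It is an illustration under stated assumptions, not the method used by the data center: the function names are hypothetical, and the alert limit must be chosen by the operations team.

```python
from __future__ import annotations


def slope_per_day(days: list[float], values: list[float]) -> float:
    """Ordinary least-squares slope of values against days."""
    n = len(days)
    if n != len(values) or n < 2:
        raise ValueError("need at least two paired readings")
    mean_x = sum(days) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    if sxx == 0:
        raise ValueError("readings must be taken on different days")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, values))
    return sxy / sxx


def is_rising(days: list[float], values: list[float], limit_per_day: float) -> bool:
    """True when the fitted slope exceeds the chosen alert limit (an assumption)."""
    return slope_per_day(days, values) > limit_per_day
```

Here `days` holds the day offsets of the historic readings and `values` the corresponding measurements. A single outlier can distort a short series, so a flagged trend is a prompt for further inspection rather than a diagnosis.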

5.3.2 Predictive maintenance - high-level plan

Table 5.3-1 Data center infrastructure predictive maintenance checklist

Component systems | Check item

Power supply and distribution systems | Test transformers, busbars, circuit breakers, and capacitors using infrared thermography; test the discharging of DC cabinet storage batteries

Generators | Test the chemical composition of fuel and lubrication oil; test electrical systems using infrared thermography; check mechanical vibration

UPS systems | Test them using infrared thermography

Air-conditioning systems | Test the chemical composition of refrigerant oil; test pipes for defects; check the mechanical vibration of refrigerators and water pumps

Computer room environment | Employ third-party professionals to test the dust load, electromagnetic radiation, noise, and lightning protection and grounding in the computer room

Lightning protection and grounding | Test the lightning protection and grounding of the building regularly according to the lightning protection test specification

5.4 Emergency plan overview

The operations team of the data center has established detailed, comprehensive

failure/incident emergency response procedures according to actual operating conditions. The

procedures are regularly drilled to improve the capacity of the team to deal with sudden failures

and incidents. This helps build a foundation for sustaining the high availability of

the data center.

5.4.1 Emergency drill plan

Comprehensive emergency response procedures must be established proactively for potential

failures or anomalies. The operations team must become acquainted with the procedures.

Establish and implement annual emergency drilling plans.

Sand table exercise: The operations personnel gather around a sand table and report verbally

their respective responsibilities and actions to be taken during emergencies.

Movement exercise: The personnel for emergency response run to the failure simulation

scene and simulate the failure response procedure. They should be able to report verbally the

failure response plan step by step.

5.4.2 Emergency drill items

Table 5.4-1 Emergency drilling for system/equipment failures

Drilling item | Drilling description

Low-voltage power distribution systems | Simulate the tripping of a transformer incoming line switch, and manually close the interconnection switch that has been interlocked for spare power automatic switching.

Medium-voltage power distribution and diesel generators | 1. Disconnect one line of the double-circuit mains power supply, and manually close the medium-voltage bus tie switch that has been interlocked for spare power automatic switching. 2. Simulate mains power outage and failed automatic startup of the diesel generators, and manually start the diesel generators for parallel operation.

Switching between primary and redundant power supply and air-conditioning systems | Switch between the primary and redundant power supply and air-conditioning systems to verify the high availability of the power supply systems of the data center.

Chilled-water systems (main unit failure) | Simulate the failure of the primary chilled-water unit, and switch quickly to the redundant unit.

UPS systems and precision switchgears failure | 1. Simulate the failure of a UPS system, and switch to the bypass circuit for power supply. 2. Simulate the failure of precision switchgear, and switch to the UPS systems to resume power supply.

Monitoring system | Simulate the failure of the primary monitoring server, and switch to the redundant monitoring server.

Air-conditioning system (water system anomaly) | Simulate a leakage in the chilled-water piping, close the chilled-water valves, switch precision air conditioners to air cooling mode, and check the heat radiation capacity of the outdoor air conditioner units and temperature variation in the computer room.

Elevator emergency | Simulate the failure of an elevator, and rescue people in the elevator carriage.

Water supply and drainage systems | Simulate flooding in an underground space, and quickly drain the flooded space.

Firefighting system | 1. Simulate a fire in the data center, and test the automatic and manual gas fire-extinguishing procedure and the integrated fire alarm control. 2. Personnel emergency evacuation

5.5 System availability check

The operations team of the data center works toward further improving the availability of the

data center by regularly checking the operating environment and condition of the data center (for

example, parameter configurations of systems and equipment, control/alarm limits for critical

equipment, equipment information list, rack power distribution units (PDUs), and logic

relationship between switches) and employing third-party professionals to regularly inspect

computer rooms.

5.5.1 Monthly check of data center facilities

In addition to the routine check, a comprehensive monthly inspection of the data center

infrastructure is conducted to identify defects and opportunities for improvement, which are

subsequently logged in the ServiceBot system for remedy and tracking by engineers. A

defect/opportunity for improvement will be closed in the system when remedied or improved,

with the details of the remedies and corrective actions taken recorded in the system. This

contributes toward further improvement of the system and equipment availability.
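The cycle described above (log a finding, record the remedy, then close it) can be modeled as a small record type. The sketch below is hypothetical and does not represent the data model or interface of the ServiceBot system, which this document does not describe. The `Finding` class and its fields are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Finding:
    """A defect or opportunity for improvement found during inspection."""

    description: str
    logged_on: date
    actions: list[str] = field(default_factory=list)
    closed_on: Optional[date] = None

    def record_action(self, note: str) -> None:
        """Record a remedy or corrective action taken."""
        self.actions.append(note)

    def close(self, when: date) -> None:
        """Close the finding; at least one recorded action is required."""
        if not self.actions:
            raise ValueError("record the remedy before closing the finding")
        self.closed_on = when

    @property
    def is_open(self) -> bool:
        return self.closed_on is None
```

Refusing to close a finding that has no recorded action mirrors the requirement that details of the remedies and corrective actions are kept with the closed item.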

5.5.2 Data center room environment check

A comprehensive monthly inspection of the working environment of the data center

infrastructure is conducted to identify opportunities for improvement, which are subsequently

logged in the ServiceBot system for remedy and tracking by the person-in-charge. Details of the

improvement actions taken are also recorded in the system.

5.5.3 Data center facilities operational information check

To facilitate fine-grained management of the data center infrastructure, regular checks and

updates are carried out for equipment operation settings, opening/closing status of switches, rack

PDU and the corresponding operating status labeling for switches and equipment

(operating/standby), detailed equipment list, equipment operation tips, monitoring/alarm limits,

and monitoring and alarm filters.
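One way to carry out such a check is to compare the documented baseline with what is observed on site and list every difference. The sketch below does this for flat key-value records. The function name, the keys, and the values in the usage example are invented for illustration and are not taken from the data center's records.

```python
from __future__ import annotations


def find_drift(baseline: dict[str, object], observed: dict[str, object]) -> dict[str, tuple]:
    """Map each differing or missing key to (baseline value, observed value).

    A key absent from one side is reported with None for that side.
    """
    drift = {}
    for key in sorted(set(baseline) | set(observed)):
        expected = baseline.get(key)
        actual = observed.get(key)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift
```

For instance, if the baseline records a hypothetical alarm limit `chw_supply_temp_alarm_c` of 12.0 and the observed setting is 14.0, the result contains `{"chw_supply_temp_alarm_c": (12.0, 14.0)}`. Each reported difference is then either corrected on site or accepted by updating the baseline.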

5.6 Life cycle management

The life cycle of a data center refers to the entire process from the initial need for data center

construction to the end of its economic life. The life cycle can be divided into decision-making,

implementation, and operations stages, and each of the stages can be further divided

into several sub-stages. The decision-making stage includes needs collection, planning, site

selection, and feasibility analysis. The implementation stage includes project design, construction,

acceptance, and hand-over. The operations stage covers the entire process from the completion of

basic construction and commissioning of the data center to the end of its economic life.

This chapter focuses on the equipment life cycle management at the operations stage of the

data center. Good equipment life cycle management is achieved by identifying equipment

operating risks and establishing risk mitigation plans. This not only reduces equipment failure rate

and improves the availability of the data center, but also extends the service life of the data center

and maximizes its benefit.

In terms of life cycle management of data center infrastructure, Ping An Data Center focuses

on medium- and low- voltage power distribution equipment, transformers, UPS, diesel generators,

and chilled-water units. The major activities in this regard include regular equipment check,

replacement of quick-wear critical parts, and equipment obsolescence and replacement.

5.6.1 Life cycle management - medium-voltage switchgears

The critical parts of medium-voltage switchgear (including circuit breaker, busbar, and

cabinet housing) are subject to routine maintenance every six months and in-depth maintenance

every three years. The planned service life of circuit breakers is 15 years (or 10,000 operations).

In the 14th year of its service life, a circuit breaker shall be evaluated for its operating condition

and, where necessary, a proposal shall be initiated and a budget shall be set up to replace it in the

following year. The planned service life of busbars and cabinet housing is 20 years. In the 19th

year of the service life, a proposal shall be initiated and a budget shall be set up to have them

obsoleted and replaced in the following year. Life cycle management and maintenance plans shall

be established for new replacement switchgear.
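The circuit breaker rule above combines two limits, a planned life of 15 years or 10,000 operations, with an evaluation in the 14th year. A minimal sketch of that decision is given below. Treating the operations count as an independent end-of-life trigger and counting service years from the commissioning date are assumptions of this example, and the function names are hypothetical.

```python
from datetime import date

PLANNED_LIFE_YEARS = 15           # from the text above
PLANNED_LIFE_OPERATIONS = 10_000  # medium-voltage figure from the text above


def service_year(commissioned: date, today: date) -> int:
    """1-based year of service: the first twelve months are year 1."""
    years = today.year - commissioned.year
    if (today.month, today.day) < (commissioned.month, commissioned.day):
        years -= 1
    return years + 1


def breaker_status(commissioned: date, today: date, operations: int) -> str:
    """Classify a breaker by age and operation count (illustrative rule)."""
    year = service_year(commissioned, today)
    if year >= PLANNED_LIFE_YEARS or operations >= PLANNED_LIFE_OPERATIONS:
        return "replace"
    if year == PLANNED_LIFE_YEARS - 1:
        return "evaluate and budget for replacement"
    return "in service"
```

A breaker commissioned on 1 June 2005 enters its 14th year of service on 1 June 2018, at which point the function reports that it should be evaluated and budgeted for replacement.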

5.6.2 Life cycle management - low-voltage switchgears

The critical parts of low-voltage switchgear (including circuit breaker, busbar, cabinet

housing, and capacitance compensator) are subject to routine maintenance every six months and

in-depth maintenance every three years. The planned service life of circuit breakers is 15 years (or

30,000 operations). In the 14th year of the service life, a circuit breaker shall be evaluated for its

operating condition and, where necessary, a proposal shall be initiated and a budget shall be set up

to replace it in the following year. The planned service life of busbars and cabinet housing is 20

years. In the 19th year of the service life, a proposal shall be initiated and a budget shall be set up

to have them obsoleted and replaced in the following year. Life cycle management and

maintenance plans shall be established for new replacement switchgears. The planned service life

of capacitance compensators is 5–8 years, shorter than that of other parts. Capacitance

compensators are replaced as recommended by the manufacturer or according to their operating

conditions. It is recommended to have a capacitance compensator replaced twice during the life

cycle of the switchgear.

5.6.3 Life cycle management - transformers

Transformers are subject to annual de-energized maintenance and preventive maintenance

every six years. The planned service life of transformers is 20 years. In the 19th year of the

service life, a transformer shall be evaluated for its operating condition and a proposal shall be

initiated and a budget shall be set up to have it obsoleted and replaced in the following year. Life

cycle management and maintenance plans shall be established for new replacement transformers.

5.6.4 Life cycle management - diesel generators

The engine oil, diesel, and air filtering elements of a diesel generator unit’s lubrication, fuel,

and air filtering systems shall be replaced every year.

The coolant and cooling water filters of the cooling system shall be replaced every three years.

Startup batteries shall be replaced every two years.

The planned service life of diesel generator units is 15 years. In the 10th year of the service

life, a unit’s operating condition shall be evaluated to decide whether to continue its service. If it

is decided to continue its service, a budget shall be set up in the 14th year of the service life to

have the unit obsoleted and replaced in the following year.

5.6.5 Life cycle management - uninterruptible power supplies (UPS)

The AC and DC capacitors in a UPS are quick-wear parts and have a service life of five to six

years. They need to be replaced as recommended by the manufacturer and the general principle is

two replacements in the life cycle of UPS.

UPS storage batteries shall be replaced according to their operating condition and the general

principle is at least one replacement in the life cycle of UPS.

The planned life cycle of UPS is 20 years. In the 19th year of the service life, a proposal shall

be initiated and a budget shall be set up to have it obsoleted in the following year. Life cycle

management and maintenance plans shall be established for a new replacement UPS.

5.6.6 Life cycle management - chilled-water units

The oil filters, refrigerant drying and filtering devices, and refrigerant oil in the chilled-water

units need to be replaced every year.

The planned service life of chilled-water units is 15 years. In the 10th year of the service life,

a chilled-water unit shall be evaluated for its operating condition to decide whether to continue its

service. If it is decided to continue its service, in the 14th year of the service life, a budget shall be

set up to have it obsoleted and replaced in the following year.
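The planned service lives stated in sections 5.6.1 to 5.6.6 follow one pattern: a budget is set up in the year before the final year of planned life, and diesel generator units and chilled-water units are additionally reviewed in their 10th year. The sketch below turns this into calendar years. The figures are taken from those sections, while the function name and the simplification that service year 1 is the calendar year of commissioning are assumptions of this example.

```python
from __future__ import annotations

# Planned service life in years, as stated in sections 5.6.1 to 5.6.6.
PLANNED_LIFE_YEARS = {
    "circuit breaker": 15,
    "busbar and cabinet housing": 20,
    "transformer": 20,
    "diesel generator unit": 15,
    "UPS": 20,
    "chilled-water unit": 15,
}

# Equipment with a continue-or-retire review in the 10th year of service.
REVIEW_YEAR = {"diesel generator unit": 10, "chilled-water unit": 10}


def milestones(equipment: str, commissioned_year: int) -> dict[str, int]:
    """Return calendar years for the review, budget, and replacement milestones.

    Assumes service year N falls in calendar year commissioned_year + N - 1.
    """
    life = PLANNED_LIFE_YEARS[equipment]
    result = {
        "budget_year": commissioned_year + (life - 1) - 1,
        "replacement_year": commissioned_year + life - 1,
    }
    if equipment in REVIEW_YEAR:
        result["review_year"] = commissioned_year + REVIEW_YEAR[equipment] - 1
    return result
```

Under these assumptions, a chilled-water unit commissioned in 2005 is reviewed in 2014, budgeted in 2018, and replaced in 2019.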

5.7 Risk management

The operations team of the data center actively manages the operating risks of the data

center. This helps the team make correct decisions, protect the security and

integrity of company assets, and achieve its performance goals. This is significant for the

operations of the data center.

5.7.1 Acronyms and definitions

The risk management of the data center refers to the management process to identify risks in

an environment and minimize the potential impact of the identified risks.

5.7.2 Risk identification and analysis

As the first important step of the risk management process, the risk identification of the data

center involves identification of risks in the computer room in the form of a comprehensive risk

analysis list. The identified risks are subsequently proactively analyzed for their potential impact

and best measures to mitigate the impact.

Risk identification is conducted in the form of a risk analysis list. The identified risks are

then analyzed and classified as high, medium, or low. A high risk is an intolerable operating

risk whose occurrence would prevent the computer room from quickly resuming its operation and

cause serious loss to the company. Medium and low risks are tolerable and controllable operating

risks that threaten operational security only on a local scale.

Note: The risk identification and evaluation form is a live document and needs to be updated

regularly, as an operating risk may change and need to be reclassified and new risks may arise as

relevant factors in the computer room evolve.
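Because the list is a live document, it helps to store each risk with its classification so that entries can be reclassified and the list re-sorted at any time. The sketch below shows one possible representation. The three sample entries come from Table 5.7-1, while the `Risk` and `RiskLevel` types and the `by_severity` helper are hypothetical names introduced for this example.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import IntEnum


class RiskLevel(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Risk:
    category: str
    description: str
    level: RiskLevel


def by_severity(risks: list[Risk]) -> list[Risk]:
    """Return risks ordered from high to low, keeping input order within a level."""
    return sorted(risks, key=lambda r: r.level, reverse=True)


# Sample entries taken from Table 5.7-1.
SAMPLE = [
    Risk("Computer room security", "Lighting system fault", RiskLevel.LOW),
    Risk("Computer room security", "Fire impacting the entire computer room", RiskLevel.HIGH),
    Risk("Computer room security", "Fire impacting some of the computer room equipment", RiskLevel.MEDIUM),
]
```

Reclassifying a risk means replacing its entry with one at the new level, after which `by_severity` returns the list with high risks first.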

Table 5.7-1 Computer room operating risk analysis list

Risk

classification

High Medium Low

Computer room

security

Fire impacting the entire

computer room

Fire impacting some of the

computer room equipment

Leakage water pooling in a

large area of the computer

rooms

Water pooling in the main

computer room

Leakage water pooling locally in the

computer rooms

Collapse of the computer

room building

Local damage of the computer

room building

The structural integrity of the computer

room threatened

Firefighting systems out of control Firefighting system faults

Air-conditioning system failure or out of control Abnormal temperature or humidity

Door access system out of control Door access system fault

Computer room lighting system

failure

Lighting system fault

Computer room monitoring system

failure

Computer room monitoring system

warning

Operational

security

Core equipment failure Major equipment failure

Large-scale power outage

in the computer room

Power supply fault

Air-conditioning water

system piping blow-up

Air-conditioning system failure in

a single computer room

Entire diesel generation

system failure

Diesel generation unit failure

Core network cable broken Primary/redundant network cable

broken

Local failure of network cable

Management

and personnel

safety

Sabotage Severe operating error General operating error

Incomplete definition of

management structure or

responsibilities

Incomplete rules and regulations Poor implementation of rules and

regulations

Personnel casualties Personal injury

Property

management

Damaged major equipment Local damage of equipment Equipment failure

Major equipment (data)

missing

Equipment missing Equipment components missing

Others

Power outage or network

communication failure

caused by lightning

Lightning Lightning protection device failure

Cable damaged by varmints Presence of varmints

Severe electromagnetic

interference

General electromagnetic

interference

5.7.3 Risk mitigation plan

The operating risks identified in the data center are tracked and controlled in the form of a

risk control list, where risk mitigation plans as well as the status of the planned actions are

recorded (a mitigated risk may be controlled as a generic issue). The risk control list includes the

following information:

Date of risk identification: The date on which a risk is identified.

Risk description: A description of the identified risk that helps the data center operations

team understand the risk.

Risk occurrence probability: Three levels of risk occurrence probability are defined: high,

medium, and low.

Risk impact: Three levels of risk impact are defined: high, medium, and low.

Risk severity: Three levels of risk severity are defined: high, medium, and low.

Risk owner: The person specifically designated to control and track the risk.

Risk control strategy: Risks are controlled through any of the following three strategies:

avoidance, mitigation, and acceptance. The specific control strategy for a risk is decided by

the risk owner according to the outcome of risk evaluation.

Risk mitigation plan: With the identified risks analyzed qualitatively and quantitatively and

prioritized, the owner of a specific risk develops a risk mitigation action plan according to the

operating condition of the data center.

Risk emergency plan: A plan for responding quickly when a specific risk occurs and for resuming normal operation. To be comprehensive, sound, and effective, an emergency plan shall specify the emergency reporting system and the emergency response organization responsible for mobilization, on-site coordination, and staffing (including technical professionals for risk response).

Risk control status: The status of risk control can be closed or open. An open risk needs to be

tracked and regularly updated. A closed risk may be referenced for similar risks in the future.

Risk change record: A record of the major actions taken and the major progress made in risk control.

Risk update date: The date on which a risk in the risk control list is deleted or modified, or the date on which a new item is added to the list.

Approaches to close a risk: A risk may be closed if it is mitigated, changed to a generic issue, or accepted as is.

Date of closing a risk: The date on which a risk is closed for control after it has been avoided, mitigated, or accepted as is.

Risk transfer: Low-probability risks can be transferred to insurance companies and service providers by purchasing insurance and outsourcing equipment maintenance. For example, property insurance transfers some computer room risks (such as risks to the computer room building and the risk of fire) to insurance companies, and outsourcing computer room equipment maintenance transfers the risk of equipment failures (for example, of UPS units and precision air conditioners) to equipment maintenance service providers.
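The fields of the risk control list described above map naturally onto a record type. The following Python sketch is illustrative only: the class and field names are assumptions, not the structure of any Ping An system, and severity is stored as entered because the text does not say how it is derived from probability and impact.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import List, Optional


class Level(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


class Strategy(Enum):
    AVOIDANCE = "avoidance"
    MITIGATION = "mitigation"
    ACCEPTANCE = "acceptance"


@dataclass
class RiskEntry:
    """One row of a risk control list (hypothetical field names)."""
    identified_on: date
    description: str
    probability: Level
    impact: Level
    severity: Level          # recorded as given; no derivation rule is assumed
    owner: str
    strategy: Strategy
    mitigation_plan: str = ""
    emergency_plan: str = ""
    is_open: bool = True
    change_record: List[str] = field(default_factory=list)
    updated_on: Optional[date] = None
    closed_by: Optional[str] = None
    closed_on: Optional[date] = None

    def log(self, note: str, when: date) -> None:
        """Append to the change record and refresh the update date."""
        self.change_record.append(f"{when.isoformat()}: {note}")
        self.updated_on = when

    def close(self, approach: str, when: date) -> None:
        """Close the risk, recording how and when it was closed."""
        self.log(f"closed ({approach})", when)
        self.is_open = False
        self.closed_by = approach
        self.closed_on = when


def open_risks(risks):
    """Open risks only, most severe first, for regular tracking."""
    return sorted((r for r in risks if r.is_open),
                  key=lambda r: r.severity.value, reverse=True)
```

Closed entries stay in the list rather than being deleted, which matches the remark that a closed risk may be referenced for similar risks in the future.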

5.8 Asset management

5.8.1 Challenges of asset management

Ping An Group is one of China’s personal financial service groups with the most

comprehensive range of financial business licenses, the most extensive business scope, and the

most compact shareholding structure. Owing to the interaction between its diversified businesses,

its IT systems are tightly coupled and have complicated infrastructure. To cope with its rapid

business development and frequent business changes, its IT facilities are faced with the challenge

of accommodating approximately 100 changes a day. Because it serves financial businesses,

Ping An Data Center is required to resume operation quickly after a failure occurs.

Therefore, it is essential to locate a hardware failure, and the applications it affects, within a

large-scale IT infrastructure (100,000+ units of equipment) in a timely manner. This in turn requires

highly efficient asset management in the data center, supported by a customized tool for

systematic management.

5.8.2 Systematic asset management

5.8.2.1 Scope of asset management

Ping An Group involves many business units and has a multitude of IT-related assets that are

widely distributed. Considering the complicated asset management situation, the data center and

the Group Asset Management Office have defined the scope of data center asset management as

the physical area of the Ping An-owned data centers; this definition has been officially published.

5.8.2.2 Asset issuance procedure

The assets of the data center include both commissioned and noncommissioned equipment units.

Commissioned equipment units are installed with application systems and can be monitored

automatically at the following three levels according to the company’s IT system management

specification: application level, operating system (OS) level, and hardware status level. An

unauthorized change to a commissioned equipment unit will trigger an alarm, which is monitored

by the asset management officer. However, there is no effective means for automatic monitoring

of noncommissioned equipment units, which are controlled through the asset issuance procedure.

The procedure is linked with the company’s financial system. If the procedure is not followed, the

expenses for acquiring the equipment cannot be processed for reimbursement and payment.

5.8.2.3 Asset management responsibility system

The position of the asset management officer is specially established in the data center for

asset management. The asset management officer is required to become acquainted with the

equipment classification system of the data center and work carefully, earnestly, and patiently to

manage assets according to the asset management system.

5.8.2.4 Asset obsolescence and disposition procedure

The data center retires and disposes of equipment that is no longer in use and has exceeded its financial depreciation life. This is carried out twice a year according to the asset obsolescence and disposition procedure established by the corporate asset management office. Timely disposition of obsolete assets keeps the asset data current.

5.8.3 Developing a unique asset management system for the data center

5.8.3.1 The necessity to develop a unique asset management system

The number of assets in the data center grows by more than 10,000 units a year. This rapid growth calls for a dedicated asset management system that fits the situation of the data center, so that asset changes can be recorded and data collected and analyzed using big-data technology. Ping An Data Center has developed the Goods Receipt System, the Integrated Data Center (IDC) Visual Management System, and the OPCM management system to satisfy its management requirements. The systems have a PC version and a mobile app version, enabling access in the office environment and mobile access while working on-site in the computer room.

5.8.3.2 Top priorities in asset management system development

The top priority is the design of the configuration management database (CMDB) and its configuration items (CIs). Two considerations apply: 1) it is not advisable to cover every configuration item during the design phase, because the data on CIs and the relations between them change constantly, and keeping all of them current would demand a great deal of maintenance effort; 2) it is not advisable to seek an all-round system that provides solutions at many levels (data center, server, storage, network, and application), as this may result in no good solution for any single failure.

A key challenge is the integration of off-the-shelf products into the asset management system. The biggest issue with off-the-shelf products is that they are general-purpose and do not solve the practical problems of the data center. Another potential issue is that the system developer does not understand the requirements of data center operations, which may result in a long development cycle and a system that falls far short of operational needs.

5.8.3.3 The asset management system of the data center

To address the above issues, Ping An Data Center developed OPCM in 2016, an asset management system tailored to its particular situation. The system was developed on two principles. 1) Streamline the CMDB and CIs to obtain an asset management system that has fewer but better functions. The target is a system that can handle 95% of the day-to-day work, with the remaining 5% handled by on-site checks or by logging on to the OS to check configurations (for example, the number of network cards in an equipment unit and the MAC address of each network card). This keeps the CMDB from growing too large. 2) Because a system developed by personnel without operations knowledge is prone to be unsuitable for operations, the data center operations team provided operations training to the system development personnel. The OPCM system has now been commissioned and has proven capable of supporting the asset management of the data center as expected.

5.8.4 Asset management system illustrated

5.8.4.1 Total life cycle management of the assets of the data center

The figure below shows an example of total life cycle management of assets in the OPCM

system, a process that starts from asset acceptance.

Fig. 5.8-1

5.8.4.2 Equipment hardware configuration information management

Fig. 5.8-2

5.8.4.3 Equipment-application correlation management

Fig. 5.8-3

5.8.5 On-site asset control

5.8.5.1 Characteristics of on-site asset management

The on-site asset management of the data center covers two types of assets: 1) those that have been commissioned in the data center and 2) those that have not been commissioned and are stored in the warehouse. Financial data centers require faster failure recovery, which means that information about equipment configurations, the applications running on the equipment, and the persons in charge of those applications, as well as the configurations of spare equipment stored in the warehouse, must be quickly available when failed equipment needs to be replaced. This is a special challenge for equipment management in data centers. To cope with it, the data center applies QR code labels to commissioned equipment and has developed an app for mobile data center management that runs on tablets and mobile phones.

5.8.5.2 Introduction of the QR code technology used in the data center

As mobile technology advances, QR codes have become increasingly popular because they make life and work more convenient. The data center uses QR codes in its asset management. Two types of QR codes are used: 1) those for assets identified by serial number and 2) those for assets identified by asset description. In the first case, the serial number of an asset is encoded into a QR code, which is then printed and affixed in the vicinity of the asset. In the second case, an asset description is generated according to the company’s pre-established specification and then input into a QR code generator to create a QR code label. The figures below show examples of QR codes.

Equipment QR code label

Fig. 5.8-4

Rack QR code label

Fig. 5.8-5

5.8.5.3 Application of QR code illustrated

Scan equipment QR code to acquire equipment information

Fig. 5.8-6

Scan rack QR code to acquire information about all the equipment units in the rack

Fig. 5.8-7
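The two scan scenarios above (one equipment unit, or every unit in a rack) can be sketched as a lookup keyed by the text stored in the label. This is a minimal illustration under stated assumptions: the "EQ:" and "RACK:" prefixes, the Equipment fields, and the in-memory inventory are all hypothetical, and rendering the string as a QR image is left to whatever QR generator is in use.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

LABEL_KINDS = ("EQ", "RACK")  # hypothetical prefixes, not the company's specification


@dataclass(frozen=True)
class Equipment:
    serial: str
    rack: str
    applications: Tuple[str, ...]
    owner: str


def make_payload(kind: str, key: str) -> str:
    """Build the text to be encoded into a label."""
    if kind not in LABEL_KINDS:
        raise ValueError(f"unknown label kind: {kind}")
    return f"{kind}:{key}"


def parse_payload(payload: str) -> Tuple[str, str]:
    """Split scanned text back into (kind, key)."""
    kind, sep, key = payload.partition(":")
    if not sep or kind not in LABEL_KINDS or not key:
        raise ValueError(f"malformed payload: {payload!r}")
    return kind, key


def resolve(payload: str, inventory: Dict[str, Equipment]) -> List[Equipment]:
    """Return the equipment behind a scanned label.

    An equipment label yields at most one unit; a rack label yields
    every unit recorded in that rack, ordered by serial number.
    """
    kind, key = parse_payload(payload)
    if kind == "EQ":
        return [inventory[key]] if key in inventory else []
    return sorted((e for e in inventory.values() if e.rack == key),
                  key=lambda e: e.serial)
```

Each returned record carries the applications and the person in charge, which is the information the text says must be at hand during failure recovery.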

5.8.5.4 Asset obsolescence and disposition procedure

An asset that has reached the end of its service life or can no longer remain in service (an item in the list of obsolete assets) shall be disposed of in a timely manner. This improves the power and space efficiency of the data center, reduces operations cost, and keeps the asset data clean. The asset management officer of the data center is responsible for asset obsolescence and disposition and shall arrange at least two rounds of asset disposition a year, which is defined as one of the officer’s KPIs.

The asset obsolescence and disposition process is as follows. The asset management officer prepares a list of obsolete assets and emails it to the asset users for confirmation. If an asset is confirmed to be obsolete, the asset management officer prepares an asset obsolescence request and sends it to the end user, data center manager, departmental managers of the data center, corporate asset management office, and finance department for approval. The asset management officer then arranges for the asset to be auctioned. After the auction, the asset management officer updates the asset financial data in the corporate material system, and the data center updates the record in OPCM. The asset management officer then prepares an asset disposition end-of-availability (EOA) request and sends it to the user, data center manager, departmental managers of the data center, corporate asset management office, and finance department for approval. Once the request is approved, the auction winner is permitted to take the asset away. This completes the asset disposition process.
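The disposition sequence above behaves like a small state machine with two approval gates. The sketch below is one way to model it in Python; the step names and the rule that every listed approver must sign at each gate are assumptions made for illustration, not a description of OPCM.

```python
from enum import Enum


class Step(Enum):
    LISTED = 1                 # on the list of obsolete assets
    USER_CONFIRMED = 2         # asset user confirms obsolescence
    OBSOLESCENCE_APPROVED = 3  # obsolescence request approved
    AUCTIONED = 4
    RECORDS_UPDATED = 5        # financial data and asset record updated
    EOA_APPROVED = 6           # end-of-availability request approved
    REMOVED = 7                # auction winner takes the asset away


# Approvers named in the text for both requests.
APPROVERS = (
    "end user",
    "data center manager",
    "departmental managers",
    "corporate asset management office",
    "finance department",
)

GATED_STEPS = (Step.OBSOLESCENCE_APPROVED, Step.EOA_APPROVED)


class Disposition:
    def __init__(self, asset_id):
        self.asset_id = asset_id
        self.step = Step.LISTED
        self.history = [Step.LISTED]

    def advance(self, approvals=()):
        """Move to the next step; gated steps need every approver."""
        if self.step is Step.REMOVED:
            raise RuntimeError("disposition already complete")
        nxt = Step(self.step.value + 1)
        if nxt in GATED_STEPS:
            missing = [a for a in APPROVERS if a not in approvals]
            if missing:
                raise PermissionError(f"missing approvals: {missing}")
        self.step = nxt
        self.history.append(nxt)
        return nxt
```

Because steps can only advance one at a time, an asset cannot leave the site before the EOA gate has been passed.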

5.8.5.5 Asset inventory check

Even with the OPCM system implemented in the data center for asset management, human error still produces “dirty” asset data. An asset inventory check is the only effective way to identify and correct dirty data. The data center implements two types of inventory check: 1) a quarterly self-check by the data center and 2) an annual corporate asset inventory check, which the corporate asset management office conducts for company-wide assets. With these two checks in place, the asset data accuracy of the data center is now higher than 99.8%.
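An inventory check amounts to reconciling what the system records against what is found on-site. The sketch below shows one possible reconciliation in Python. The accuracy formula (share of recorded assets found where the record says they are) is an assumption; the text does not state how the 99.8% figure is calculated.

```python
def reconcile(recorded, observed):
    """Compare recorded and observed assets.

    Both arguments map a serial number to a location. Returns the
    discrepancies and an accuracy figure in the range 0..1.
    """
    missing = sorted(s for s in recorded if s not in observed)
    unrecorded = sorted(s for s in observed if s not in recorded)
    misplaced = sorted(
        s for s in recorded
        if s in observed and recorded[s] != observed[s]
    )
    dirty = len(missing) + len(misplaced)
    accuracy = 1.0 if not recorded else 1 - dirty / len(recorded)
    return {
        "missing": missing,        # in the system, not found on-site
        "unrecorded": unrecorded,  # found on-site, not in the system
        "misplaced": misplaced,    # found, but at a different location
        "accuracy": accuracy,
    }
```

Each of the three lists points to a different correction: investigate the missing unit, register the unrecorded one, or fix the recorded location.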

5.9 Day-to-day operations management

5.9.1 Challenges of day-to-day operations

Ping An Data Center supports not only Ping An Group’s traditional financial services such as insurance, banking, and investment but also Internet financial services such as Lufax, OneConnect, and eWallet. The traditional financial services are mature but complicated in structure, so they need a data center that is stable and quick in failure recovery. In addition, a data center failure that is not recovered within the regulatory time frame must be reported to the regulator. Therefore, for the traditional financial services the top priority is stability; that is, the fewer the changes, the better. In contrast, the new Internet financial services need a short time to market to capture market share, and they need frequent fixes because problems may surface after a new service goes live. They therefore require the data center to support a short time to market and frequent changes. In addition, the traditional financial services are incorporating more and more Internet service elements. The result is a complicated business structure in the data center: old structures based on traditional “OEM” products coexist with new Internet structures that are based on Ping An’s financial clouds but correlated with traditional OEM products, and with new structures based entirely on the Internet framework and philosophy. This poses continuous new challenges to the data center. To satisfy the requirements of both the traditional financial businesses and the new Internet financial businesses, the data center must break down the requirements of the financial services it supports and carry out delicacy management of its operations.

5.9.2 Systematic day-to-day operations management

5.9.2.1 Zoned management

Ping An Data Center supports Ping An Group’s insurance, banking, and securities businesses, and its support service shall satisfy the regulatory requirements of the China Insurance Regulatory Commission, the China Banking Regulatory Commission, and the China Securities Regulatory Commission, respectively; its support service to Internet financial services (for example, Lufax and credit inquiry) shall satisfy the regulatory requirements of the People’s Bank of China. The data center is also subject to annual inspections by these regulators. To meet this challenge, the data center has established a zoned service management system. Where the regulator requires physical segregation, zones are physically segregated into separate modules or by physical barriers. Where it does not, service zoning is realized by concentrating a service in a separate rack and locking the rack. Zoned delicacy management of different services is realized by establishing different management systems according to their different characteristics.

However, as technology advances and new business forms emerge, regulators may update

their regulatory requirements according to the latest situation. Thus, the data center needs to

closely follow changes in regulatory requirements for data centers and update its zoned

management system accordingly.

5.9.2.2 Service window and maintenance window

Zoned management ensures that equipment maintenance in one zone does not affect operation in any other zone. However, because the data center hosts structures for both traditional financial businesses and Internet financial businesses, its component systems are very often interconnected, and a minor change in one part of the data center may affect the entire data center. To ensure that a change does not affect the major businesses served, the data center has set up service and maintenance windows, which have been agreed to by the relevant parties.

No maintenance events or changes are allowed in the service window, which ensures the stable operation of business systems. Maintenance activities and changes can be implemented only in the maintenance window. A maintenance event or change for one business that does not affect any other business can be implemented in the pre-established maintenance window; an event or change that affects several interrelated businesses can be implemented only in a maintenance window that is acceptable to all of them. Thus, delicacy management of routine maintenance for different businesses can be realized. If a service outage, or a severe vulnerability that may lead to a service outage, arises in the service window, maintenance is allowed in the service window, but only after a rigorous approval procedure. This provides flexibility in an emergency while preventing abuse of the emergency channel.

Table 5.9-1 Examples of service and maintenance windows

Business systems | Service window | Maintenance window
Insurance | **:** - **:** | **:** - **:**
Banking | **:** - **:** | **:** - **:**
Securities | **:** - **:** | **:** - **:**
Internet financing | **:** - **:** | **:** - **:**
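The window rules in this section can be expressed as a simple check: a change may proceed only if the proposed time falls inside the maintenance window of every business it affects, unless the emergency channel has been approved. The sketch below is a hypothetical illustration. The actual window times are masked in the table, so the times here are invented placeholders, and the function names are not taken from any Ping An system.

```python
from datetime import time

# Placeholder maintenance windows (start, end); the real values are not published.
MAINTENANCE_WINDOWS = {
    "insurance": (time(1, 0), time(5, 0)),
    "banking": (time(2, 0), time(4, 0)),
    "securities": (time(22, 0), time(6, 0)),  # wraps past midnight
}


def in_window(t, window):
    """True if time t lies in [start, end), allowing a wrap past midnight."""
    start, end = window
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def change_allowed(t, impacted, emergency_approved=False):
    """Decide whether a change at time t may proceed.

    impacted lists every business the change affects. Outside the
    common maintenance window, only an approved emergency may proceed.
    """
    if not impacted:
        raise ValueError("at least one impacted business is required")
    if emergency_approved:
        return True
    return all(in_window(t, MAINTENANCE_WINDOWS[b]) for b in impacted)
```

With these placeholder windows, a change touching all three businesses is acceptable at 03:00 but not at 04:30, because 04:30 falls outside the banking window.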

5.9.2.3 Business contingency plan

To sustain service availability, the data center has established a business contingency plan,

which provides differentiated contingency protection based on the criticality of businesses

serviced. For example, Class I systems (or traditional structures) are protected with both remote and local backups. Furthermore, resource investment is differentiated according to the pre-established recovery point objective (RPO) and recovery time objective (RTO), so that Class I systems are capable of sustaining business continuity. For Internet-financial-service-oriented structures and applications, multiple remote backups and double local backups are planned to ensure that Class I systems are capable of sustaining business continuity. In addition, the corporate contingency planning department conducts contingency drills every year to ensure that the data center can sustain business system continuity.

5.9.2.4 Change management procedure

According to industry data, 70% of data center failures are caused by human error. As a critical component of the group’s IT system, the data center may affect the entire group’s business systems if any of its parts fails. Therefore, the data center exercises rigorous control over changes. Changes fall into three categories according to their characteristics: routine, normal, and major. Routine changes are initiated by the engineer on duty and are subject to approval by the reporting line manager. Normal changes are subject to review by the engineer on duty and the reporting line manager and to approval by the departmental manager. For a major change, the engineer on duty shall prepare an implementation plan, which is subject to review by the reporting line manager and the departmental manager and to deliberation and approval by the Change Approval Board (CAB) of the data center. Thus, delicacy management of changes can be realized.
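The three change categories differ only in the chain of sign-offs they require, so the rule can be written as a lookup table. The following Python sketch is an assumption-laden illustration: the role strings follow the text, but the requirement that sign-offs arrive strictly in the listed order is an added simplification.

```python
# Sign-offs per change category, in the order described in the text.
# The last entry in each chain is the approver; earlier entries are reviewers.
APPROVAL_CHAIN = {
    "routine": ["reporting line manager"],
    "normal": ["engineer on duty", "reporting line manager",
               "departmental manager"],
    "major": ["reporting line manager", "departmental manager",
              "Change Approval Board"],
}


def pending(category, signed):
    """Return the sign-offs still outstanding for a change.

    signed is the list of roles that have signed so far. Signing out
    of order (an assumption of this sketch) is rejected.
    """
    chain = APPROVAL_CHAIN[category]
    if list(signed) != chain[:len(signed)]:
        raise ValueError("sign-offs do not follow the approval chain")
    return chain[len(signed):]


def approved(category, signed):
    """A change is approved once nothing is pending."""
    return not pending(category, signed)
```

For a major change, the implementation plan prepared by the engineer on duty would be the document that moves along this chain.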

Change management is one of the four core tasks in data center day-to-day management, the

other three being incident management, problem management, and configuration management,

which have already been covered in the previous chapters.

5.9.2.5 Equipment/system access authority classification system

Ping An Data Center runs the business systems of Ping An Group’s professional companies.

To ensure data security, the data center has implemented a system access authority classification

system. Specifically, the data center management personnel are only permitted to change the data

center’s equipment operating environment, physical wiring, and equipment location; hardware

management personnel have the authority to manage hardware only; operating system

management personnel have the authority to manage operating systems only; application

operations personnel have the authority at the application level only; development personnel do

not have access to production systems. Thus, employees have the authority to manage only the data center components related to their own work and have no access to the entire system. In addition, the system is managed by different functional units in different operating environments (development, staging, production, and contingency). Where production data are required in a testing environment, the data must be desensitized. With the access authority classification system, personnel and environment authorities are kept to a minimum, so that intentional disclosure, tampering, and embezzlement of user data can be minimized.
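The access classification above is a least-privilege mapping from role to the single layer that role may manage in production, combined with masking of production data before it reaches a testing environment. The sketch below illustrates both ideas in Python. The role and layer identifiers, the set of sensitive fields, and the masking value are all hypothetical.

```python
# Production permissions per role; each role manages one layer only.
PRODUCTION_PERMISSIONS = {
    "data_center_management": {"facility"},   # environment, wiring, location
    "hardware_management": {"hardware"},
    "os_management": {"operating_system"},
    "application_operations": {"application"},
    "development": set(),                     # no production access
}


def may_manage(role, layer):
    """True only if the role is explicitly granted the layer."""
    return layer in PRODUCTION_PERMISSIONS.get(role, set())


def desensitize(record, sensitive=("name", "id_number", "phone")):
    """Mask sensitive fields before data leaves production.

    The field names are placeholders; a real rule set would come from
    the organization's data classification.
    """
    return {k: ("***" if k in sensitive else v) for k, v in record.items()}
```

Unknown roles fall through to an empty permission set, so access is denied by default.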

5.9.2.6 Information security management system

An outstanding data center needs to ensure not only operations stability but also information security. Information security is particularly important for financial data centers. Ping An Data Center has established two zero-tolerance objectives for information security: zero tolerance of major regulatory compliance issues and zero tolerance of major information security issues. To achieve this, the data center has established a document (file) classification system. Documents, whether in hard copy or electronic form, are classified into the following categories: secret, classified, and highly confidential. The position of document control officer has been established specifically to control the documents of the data center. Defective hard disks that need to be taken out of the data center shall be demagnetized or physically destroyed to prevent data disclosure. For solid-state drives, which cannot be demagnetized, the manufacturer has agreed contractually to service them without their being returned to the manufacturer, and the manufacturer’s engineers may not take them out of the data center. All defective drives are reclaimed and physically destroyed by the data center in a centralized manner. Magnetic tapes for data backup must be written using encryption technologies. Before such a tape is transferred for storage at a different site, it must be placed in a special-purpose magnetic tape storage box, the box must be locked, and the handover form must be signed and locked in the box. During transportation, the box must be escorted by a qualified security company that has signed a nondisclosure agreement with the company.

5.9.2.7 Audits of day-to-day operations

To assess its day-to-day operations, Ping An Data Center conducts an internal audit every quarter. It also engages the corporate information security department and well-known organizations such as BSI and Ernst & Young to audit its information security and day-to-day operations systems. Issues identified in such audits must be remedied as part of the continual improvement process, so that the effectiveness of the data center’s management system is sustained. The data center has now been certified to the ISO 9001, ISO 20000, ISO 27001, and Uptime M&O standards.

5.9.3 Integrated data center management system

5.9.3.1 The necessity to develop a unique integrated data center (IDC) management system

Ping An Data Center operates more than 100,000 units of equipment and approximately 1,000 business systems. Manual labor alone cannot sustain stable operation of the data center. The challenges can be summarized in the following five aspects:

1) With so many equipment units and systems running in the data center, manual labor alone cannot satisfy the requirements of business system operations;

2) Different persons have different skill levels and, therefore, may yield different outcomes for the same task;

3) The same person may yield different outcomes for the same task under different conditions, in different states of mind, or at different times;

4) There is no effective way to pass experience from one person to another;

5) It is difficult to realize standardized operation.

Therefore, it is necessary to develop an effective data center management system, so that standardized management can be realized. With the data center operations training provided by Ping An Data Center, the development personnel have developed an IDC visual management system and a computer room management app, which contribute to improved computer room management efficiency and standardized delicacy management of the data center.

5.9.3.2 Delicacy management of Ping An Data Center

5.9.3.2.1 Integrated delicacy management

With the IDC visual management system, the data center can see in real time the power and space resources in use and the layout of business systems. In addition, big-data technology is used to analyze the historical data and development trends of business systems, so that each business system’s future rack resource requirements can be predicted. This facilitates proactive capacity expansion, flexible and holistic allocation of data center resources to business systems, and integrated delicacy management of the data center. The figure below shows the operating condition of a module of the data center.

Fig. 5.9-1
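The text does not say which forecasting method the IDC visual management system uses. As a stand-in, the sketch below fits a least-squares straight line to a business system’s historical rack usage and extrapolates it, which is the simplest form of trend-based capacity prediction. The function name and the sample figures are illustrative assumptions.

```python
def linear_forecast(history, periods_ahead):
    """Extrapolate a least-squares line fitted to equally spaced samples.

    history holds rack counts for consecutive periods (oldest first).
    Returns the projected value periods_ahead periods after the last sample.
    """
    n = len(history)
    if n < 2:
        raise ValueError("at least two samples are required")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + periods_ahead)
```

For example, a system that used 10, 12, 14, and 16 racks over four consecutive quarters projects to 20 racks two quarters later, which gives the operations team lead time to expand capacity.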

5.9.3.2.2 Delicacy management by module

With the IDC visual management system, the data center can understand the used capacity

and power consumption of each rack in real time as well as the current condition and future trend

by data center site. Thus, delicacy management of the various modules of the data center can be

realized.

Fig. 5.9-2

5.9.3.2.3 Delicacy management by rack

With the IDC visual management system, the data center can understand the used capacity,

power consumption, and equipment operating condition in each rack in real time. This information is

then combined with an analysis of the characteristics of each rack’s functional area and each

business, as well as big-data analysis. Given that each rack has a maximum power of 6 kW

and a height of 46 U, 18 servers can be placed in every rack in the VXLAN framework or in the

TOR DB framework, 15 servers in every rack in the GBD framework, and 16 servers in every rack

that hosts the financial cloud platform. Based on these data and the characteristics of

applications, each unit of each rack can be utilized to its full capacity, thereby facilitating delicacy

management at the rack level.
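The rack figures quoted above imply an average power and space budget per server, which the short calculation below works out. It is a rough sketch only: it divides the 6 kW and 46 U limits evenly among the servers and ignores switches, power distribution units, and blank panels, which the white book does not itemize. The variable and function names are hypothetical.

```python
RACK_POWER_W = 6000   # maximum rack power stated in the text (6 kW)
RACK_HEIGHT_U = 46    # rack height stated in the text

# Servers per rack for each framework, as stated in the text.
SERVERS_PER_RACK = {
    "VXLAN": 18,
    "TOR DB": 18,
    "GBD": 15,
    "financial cloud": 16,
}

def per_server_budget(servers):
    """Average power (W) and space (U) available to each server in a full rack."""
    return RACK_POWER_W / servers, RACK_HEIGHT_U / servers

for framework, servers in SERVERS_PER_RACK.items():
    watts, units = per_server_budget(servers)
    print(f"{framework}: about {watts:.0f} W and {units:.2f} U per server")
```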

5.9.3.3 Equipment location automatic distribution system

Ping An Data Center has developed an equipment location automatic distribution system

based on the deployment characteristics of its servers, which arrive in small but frequent batches,

and on the principle of making full use of rack space and existing wiring.

The design principle of the equipment location automatic distribution system is as follows: the

feasibility of installing a server into a rack is determined from the equipment specification (the rack

space and power capacity required by the equipment type), together with an analysis of zoning, power

consumption, and available rack space.
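The feasibility check described above can be sketched as follows. This is an illustrative assumption, not the actual system: the class and function names are hypothetical, and choosing the feasible rack that ends up with the least free space is one possible reading of the "full use of space" principle. Contiguity of free rack units and wiring constraints are not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Rack:
    name: str
    zone: str            # functional zone the rack belongs to
    free_units: int      # unused rack space, in U
    free_power_w: float  # unused power budget, in watts

@dataclass
class EquipmentSpec:
    model: str
    height_u: int        # rack space required by this equipment type
    power_w: float       # power required by this equipment type

def can_install(rack: Rack, spec: EquipmentSpec, zone: str) -> bool:
    """Feasibility check on zoning, available space, and available power."""
    return (rack.zone == zone
            and rack.free_units >= spec.height_u
            and rack.free_power_w >= spec.power_w)

def assign_rack(racks: List[Rack], spec: EquipmentSpec, zone: str) -> Optional[Rack]:
    """Pick the feasible rack left with the least free space, then reserve it."""
    feasible = [r for r in racks if can_install(r, spec, zone)]
    if not feasible:
        return None  # no rack in this zone can take the equipment
    chosen = min(feasible, key=lambda r: r.free_units - spec.height_u)
    chosen.free_units -= spec.height_u
    chosen.free_power_w -= spec.power_w
    return chosen

# Hypothetical inventory and equipment type.
demo_racks = [
    Rack("A01", "DB", free_units=4, free_power_w=500.0),
    Rack("A02", "DB", free_units=2, free_power_w=800.0),
    Rack("B01", "WEB", free_units=10, free_power_w=3000.0),
]
server = EquipmentSpec("2U-server", height_u=2, power_w=400.0)
placed = assign_rack(demo_racks, server, zone="DB")
print(placed.name if placed else "no feasible rack")
```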

Fig. 5.9-3

Chapter 6 Operations Quality Assurance System

This chapter introduces approaches to test the operations quality of the data center, including

an internal audit by the security department of the group, an internal audit conducted in the form

of a crosscheck between different teams of the data center, and external audits for M&O, ISO

9001, ISO 27001, and ISO 20000 certification.

6.1 Internal audit

Internal audits, sometimes called first-party audits, are conducted by, or on behalf of, the

organization itself for management review and other internal purposes, and can form the basis for

an organization’s declaration of conformity. In many cases, and in small organizations in

particular, independence can be demonstrated by having internal audits conducted by personnel

who are not responsible for the activity being audited.

There are two types of internal audits in the data center: those at the data center level and

those at the corporate level.

6.1.1 Internal audit at the data center level

Internal audits of the data center are conducted quarterly in the form of a crosscheck between

different data center sites and between different teams of the same data center site. Internal audits

are conducted strictly according to the pre-established standard procedure in order to review and

assess the conformity and effectiveness of the quality management system, ensure its continued

effective operation, and provide input for its improvement.

Responsibilities

(1) Accountable Role in the data center: taking corrective actions against nonconformities

identified in internal audits.

(2) Internal Auditor: conducting internal audits against the Data Center Internal Audit

Checklist.

(3) Lead Internal Auditor: planning for internal audits, leading the internal audit team to audit

the quality management system, chairing opening and closing meetings for internal audits,

preparing internal audit reports, and following up on corrective actions.

(4) Management Representative: reviewing annual internal audit plans and audit reports,

submitting them to the Data Center Manager for approval.

(5) Data Center Manager: approving annual internal audit plans and internal audit reports.

Audit procedure

Audit plan

The Lead Internal Auditor shall prepare an annual audit plan and submit it for discussion at

the management review. The plan shall ensure that

(1) a minimum of four internal audits are conducted each year;

(2) all the requirements of ISO 9001 are covered in a period of one year;

(3) audits are focused on areas with frequent occurrence of nonconformities;

(4) audits are conducted independently or auditors are not responsible for the activity audited;

(5) audits are conducted in a timely manner for the occurrence of major quality defects or

major changes to the quality management system, including changes to documentation,

organization structure, operations procedures, and products (services);

(6) the schedule, frequency, and scope of audits are defined.

The plan is subject to the approval of the Data Center Manager. The Management

Representative shall communicate the plan to all personnel in the data center.

Audit preparation

(1) The Management Representative shall establish an internal audit team and designate a

Lead Auditor one month in advance of a planned audit.

(2) The Lead Auditor is responsible for assignment among the auditor team. An internal

auditor should have no direct responsibility for the object (department or position) being

audited.

(3) The audit plan (prepared by the Lead Auditor and approved by the Management

Representative) should be communicated to the departments and persons to be audited at

least one week in advance. The audit plan should include the auditee, scope, date, and

criteria of the audit as well as the assignment among the auditors.

(4) If an auditee does not agree with the audit plan, the auditee can ask the audit team to change

the plan within two days of receiving it. Changes to the plan should be based

on mutual consultation.

(5) The Lead Auditor should ensure that the auditors use the latest version of the Data Center

Internal Audit Checklist for the audit.
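The lead times in the preparation steps above can be turned into concrete dates with a small helper. This is a sketch under stated assumptions: "one month" is approximated as 30 calendar days, all periods are counted in calendar days rather than working days, and the function and key names are hypothetical.

```python
from datetime import date, timedelta

def preparation_deadlines(audit_date: date) -> dict:
    """Latest dates for the preparation steps of a planned audit."""
    return {
        # Team established and Lead Auditor designated one month in advance
        # (assumed here to mean 30 calendar days).
        "appoint_lead_auditor_by": audit_date - timedelta(days=30),
        # Audit plan communicated to auditees at least one week in advance.
        "communicate_plan_by": audit_date - timedelta(days=7),
    }

def objection_deadline(plan_received: date) -> date:
    """Last day on which an auditee can request a change to the plan."""
    return plan_received + timedelta(days=2)

# Hypothetical audit date.
deadlines = preparation_deadlines(date(2018, 6, 15))
for step, due in deadlines.items():
    print(step, due.isoformat())
print("objection_by", objection_deadline(deadlines["communicate_plan_by"]).isoformat())
```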

Implementing an internal audit

Participants of an internal audit opening meeting include all the auditors, auditee

representatives, main auditee contacts, the Management Representative, and top managers (where

necessary). An opening meeting may not be necessary for a crosscheck between local teams but is

mandatory for a crosscheck between different data center sites. The opening meeting is chaired by

the Lead Internal Auditor and should cover:

(1) introduction of the auditors and the assignment among them (undertaken by the Lead

Auditor);

(2) restatement of the scope, criteria, and purpose of the audit;

(3) a brief introduction of the audit methodology;

(4) request for assistance required from the auditees;

(5) clarification on the audit plan.

On-site audit

(1) Internal auditors conduct the audit against the Data Center Internal Audit Checklist. They

may conduct the audit through sampling check of records, on-site observation, interview,

and check of documents.

(2) If any issue is identified during the audit, the auditor should confirm the issue with the

person-in-charge or operator and thereafter record it in the Data Center Internal Audit

Checklist. This is intended to facilitate the understanding and remedy of nonconformities.

(3) At the end of on-site audit (prior to the closing meeting), the Lead Auditor should conduct

an audit team meeting to summarize the audit findings and confirm the nonconformities

identified during the audit.

Closing meeting

Participants of a closing meeting include all the auditors, auditee representatives, main

persons involved in the audit, the Management Representative, and top managers (where

necessary). A closing meeting may not be necessary for a crosscheck between local teams but is

mandatory for a crosscheck between different data center sites. The closing meeting is chaired by

the Lead Internal Auditor. It is intended to provide a summary of the audit. A closing meeting

should cover the following aspects:

(1) restatement of the scope, criteria, and purpose of the audit;

(2) clarification on audit findings to the auditees;

(3) nonconformities identified during the audit and their supporting evidence;

(4) conclusions and proposals by the audit team;

(5) clarification on the corrective action process for nonconformities (undertaken by the Lead

Auditor).

Audit report

(1) The Lead Auditor should prepare an internal audit report for the audit. It is intended to

summarize the audit, statistically analyze the nonconformities, identify areas of concern

and opportunities for improvement, and propose areas to be focused on during the

subsequent audit.

(2) The Lead Auditor submits the report to the Management Representative and sends a copy

to the Data Center Manager.

(3) The Lead Auditor communicates the audit findings to the auditees.

(4) The Lead Auditor follows up on corrective actions.

(5) The auditees should provide corrective action plans for nonconformities and opportunities

for improvement identified during the audit. A corrective action plan should:

* be preventive in nature to avoid the occurrence of similar nonconformities;

* provide clear and practical actions, whose effectiveness is measurable;

* provide a timetable for each action to be taken.

(6) The Lead Internal Auditor uses the Data Center Internal Audit Checklist to track the

planned corrective actions. A corrective action will be closed when it is verified to be

effective. If a corrective action is not effective, the Lead Auditor should request the

person-in-charge to take another action. This process is defined in the Analysis and

Improvement Procedure.

(7) The Lead Auditor should update the Management Representative and Data Center

Manager on the status of the corrective actions and pay attention to the existence of

similar issues during the subsequent audit.

(8) The Lead Internal Auditor should hand over all the internal audit records to the Document

Controller, as defined in the Quality Record Control Procedure.

(9) The results of the internal audit should be included in management review, as defined in

the Management Review Procedure.

Reference documents:

<Data Center Internal Audit Checklist>

<Analysis and Improvement Procedure>

<Quality Record Control Procedure>

<Management Review Procedure>

6.1.2 Corporate internal audit

Corporate internal audits mainly cover information security, as shown in the table below.

Table 6.1-1 Checklist of data center data for internal audit

| No. | Data type | Data description | Period covered | Remarks |
|---|---|---|---|---|
| 1 | Data center environment | Data center construction planning and site selection as well as profile | | |
| 2 | Data center environment | Layout of the data center | | |
| 3 | Data center environment | Layout of lightning protection devices | | |
| 4 | Data center environment | Layout of smoke detectors and temperature sensors | | |
| 5 | Data center environment | Layout of water piping and leakage sensors | | |
| 6 | Data center environment | Layout of firefighting devices | | |
| 7 | Data center environment | Layout of surveillance cameras | | |
| 8 | Data center environment | Physical environment security evaluation reports | | |
| 9 | Data center environment | Layout of air-conditioning chilled-water pipes | | |
| 10 | Data center management | Job descriptions of data center management positions | | |
| 11 | Data center management | Service provider selection, management, and evaluation records | | |
| 12 | Data center management | Equipment procurement contracts | | |
| 13 | Data center management | Contracts with telecommunication operators | | |
| 14 | Data center management | Checklist of data center equipment/assets | | |
| 15 | Data center management | Applications, gate passes, and receipts related to equipment moving in and out of the data center | | |
| 16 | Data center management | Equipment acceptance records | | |
| 17 | Data center management | Equipment disposition records | | |
| 18 | Data center management | Records of tapes received in and delivered out of the media room | | |
| 19 | Data center management | Media checklist and inventory check records | | |
| 20 | Data center management | Tape demagnetization records | | |
| 21 | Visitor records | Granted data center accesses checklist | | |
| 22 | Visitor records | Applications for data center access | | |
| 23 | Visitor records | Records of deleted data center access | | |
| 24 | Visitor records | Data center access review records | | |
| 25 | Visitor records | Application for temporary data center access | | |
| 26 | Visitor records | Registration of data center visitors | | |
| 27 | Visitor records | Signed letters of confidentiality for data center access | | |
| 28 | Visitor records | Data center access system log/record | | |
| 29 | Data center operations | Data center routine check records | | |
| 30 | Data center operations | Data center equipment patrol inspection records | | |
| 31 | Data center operations | Equipment maintenance and service records | | |
| 32 | Data center operations | Emergency exits opening/alarming records | | |
| 33 | Data center operations | List of issues with the data center | | |
| 34 | Data center operations | Problem/failure handling processes | | |
| 35 | Operations system | Checklist of data center systems | | |
| 36 | Operations system | Master list of granted accounts and accesses to operations systems | | |
| 37 | Drilling reports | Firefighting drilling reports | | |
| 38 | Drilling reports | Power outage drilling reports | | |
| 40 | Drilling reports | Diesel generator drilling reports | | |
| 41 | Management systems | ISO quality management documents and operation manuals | | |
| 42 | Management systems | Service provider selection/management/evaluation standards/systems | | |
| 43 | Management systems | Visitor registration procedure | | |
| 44 | Management systems | Inspection standards for portable fire extinguishers | | |

6.2 External audits

External audits include those generally called second- and third-party audits. Second-party

audits are conducted by parties having an interest in the organization, such as customers, or by

other persons on their behalf. Third-party audits are conducted by external, independent auditing

organizations such as those providing certification/registration of conformity with ISO 9001 or

ISO 14001.

The external audits of Ping An Data Center include those for M&O, ISO 9001, ISO 27001,

and ISO 20000 certification and certification renewal.

6.2.1 Audit for M&O certification renewal

The M&O standard provides an overall standardized management configuration for data

center operations and management from multiple dimensions, frameworks, perspectives, and

levels. The standard also includes detailed standardization requirements at the training, drilling,

planning, adjustment, and practical operation levels of the operations and management system, in

order to improve the management competency of operations personnel and sustain high service

levels of data centers.

The M&O certification is valid for two years. To maintain the certification, the data center is

subject to a certification renewal audit of its processes and systems every two years. The audit is

based on a scoring system, and a minimum score of 80 is required to pass the audit.

The audit covers 20 sub-categories in five categories as shown in the table below.

Table 6.2-1 M&O audit checklist

| No. | Category | Sub-category | Required information |
|---|---|---|---|
| 1 | Staffing and organization | Staffing | Staffing plan (number and responsibilities); escalation and call procedure (between internal parties and between the data center and vendors) |
| 2 | Staffing and organization | Qualification | Training certificates and records; assignment of responsibilities (responsible area, training, and security) |
| 3 | Staffing and organization | Organization | Organizational chart, including the following information: detailed organizational chart at the infrastructure level; detailed organizational chart at the data center level (infrastructure, IT, and security departments); job descriptions for infrastructure-related positions |
| 4 | Maintenance | Preventive maintenance | Checklist and timetable for preventive maintenance; preventive maintenance methods; work orders for preventive maintenance; calibration of testing tools; checklist of critical spare parts and points of order; process for switching between redundant components |
| 5 | Maintenance | Housekeeping policy | Housekeeping policy for the main computer room |
| 6 | Maintenance | Maintenance management system | Completion rate of preventive maintenance; open-loop and closed-loop working processes |
| 7 | Maintenance | Vendor support | Approved vendor list and SLA |
| 8 | Maintenance | Deferred maintenance plan | Deferred maintenance checklist; deferred maintenance procedure |
| 9 | Maintenance | Predictive maintenance | |
| 10 | Maintenance | Life cycle management | |
| 11 | Maintenance | Failure analysis procedure | History record of power outages and corrective actions |
| 12 | Training | Data center employee training | Tabulated training needs by job position; training participation record; training course syllabuses |
| 13 | Training | Vendor training | Tabulated vendor training needs; participation record; training course syllabuses |
| 14 | Planning, coordination, and control | Data center policy | Data center policy: standard operation procedures; emergency response procedures; configuration control procedures |
| 15 | Planning, coordination, and control | Financial management | Development planning and budgeting procedure |
| 16 | Planning, coordination, and control | Reference library | Library access control; data update procedure |
| 17 | Planning, coordination, and control | Main computer room management | Computer room planning and growth requirements; power and cooling control procedure; IT facility commission and decommission control procedure |
| 18 | Operating conditions | Load management | Load management policy |
| 19 | Operating conditions | Operation configuration point | Operation configuration policy |
| 20 | Operating conditions | Equipment rotation | |

6.2.2 ISO 9001 audit

What is meant by an ISO 9001 certificate?

ISO 9001 specifies requirements for a quality management system when an organization

needs to demonstrate its ability to consistently provide products and services that meet customer

and applicable statutory and regulatory requirements, and aims to enhance customer satisfaction

through the effective application of the system, including processes for improvement of the

system and the assurance of conformity to customer and applicable statutory and regulatory

requirements.

What is not meant by an ISO 9001 certificate?

(1) Note that the requirements specified in ISO 9001 are for the quality management system

of an organization, not products or services of an organization. ISO 9001 certification

should enhance an organization’s confidence in consistently providing products and

services that satisfy customer and applicable statutory and regulatory requirements.

However, the certification does not guarantee that an organization has realized 100%

product compliance, although this is the permanent goal of an organization.

(2) ISO 9001 certification does not indicate an organization’s ability to provide high-quality

products or services or the certification of its products or services to the ISO standard or

any other standards or specifications.

Purpose, scope, and criteria of audit

The audit aims to confirm, through proactive, evidence-based monitoring, that the management

system of an organization effectively and consistently satisfies the requirements of the

management system standard. It also aims to enhance the organization’s confidence, demonstrate

the organization’s ability to comply with legal, regulatory, and contractual requirements and to

meet its pre-set targets, and confirm the continued effectiveness and suitability of its proactive

plans. It is applicable to the scope of the management standard. If an audit is part of a

multiple-site audit, the final recommendation for certification is based on the findings at all the

sites.

The scope of the audit includes an organization’s documented management system as

required in ISO 9001 as well as the locations and areas covered by the management system (to be

indicated in the audit plan).

Definition of audit findings:

(1) Nonconformity:

Non-fulfillment of a requirement

(2) Major nonconformity:

This indicates a nonconformity that compromises the ability of the management system to

realize an expected outcome. A nonconformity can be categorized as a major nonconformity

in either of the following conditions:

* the nonconformity results in serious doubt about the effectiveness of process control or the

compliance of products or services with requirements;

* several minor nonconformities are related to the same requirement or issue and indicate

the existence of a systematic failure.

(3) Minor nonconformity:

This indicates a nonconformity that does not compromise the ability of the management

system to realize an expected outcome.

(4) Opportunity for improvement:

This is an auditor’s evidence-based statement-of-fact about a weakness or latent defect of the

management system that, if not improved, may develop into a nonconformity in the future.

The certification organization can provide independent general information for process and

system improvement, including interpretation of the meaning and intent of the standard,

explanation about relevant theories, methods, techniques, and tools, and sharing of non-

confidential best practices of the industry. However, it does not provide specific solutions for

particular problems.

(5) Observation:

This is only applicable to certification programs where the certification organization is not

allowed to include opportunities for improvement in audit findings. An observation is an

auditor’s statement-of-fact about a weakness or latent defect of the management system that,

if not improved, may develop into a nonconformity in the future.
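The second condition for a major nonconformity in definition (2), several minor nonconformities tied to the same requirement, lends itself to a simple screening step. The sketch below only flags candidates for the auditor to review, since deciding whether they indicate a systematic failure remains a judgment call. The threshold of three for "several", the function name, and the requirement identifiers are all assumptions made for illustration.

```python
from collections import Counter

SEVERAL = 3  # assumed threshold; the text does not quantify "several"

def escalation_candidates(findings):
    """Return requirements with enough minor findings to warrant a review.

    findings is a list of (requirement, grade) pairs, where grade is
    "major" or "minor".
    """
    minor_counts = Counter(req for req, grade in findings if grade == "minor")
    return sorted(req for req, count in minor_counts.items() if count >= SEVERAL)

# Hypothetical findings; the requirement identifiers are placeholders.
audit_findings = [
    ("REQ-A", "minor"),
    ("REQ-A", "minor"),
    ("REQ-B", "minor"),
    ("REQ-A", "minor"),
    ("REQ-C", "major"),
]
print(escalation_candidates(audit_findings))
```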

6.2.3 ISO 27001 audit

The ISO 27001 standard for information security management systems is now the most

widely implemented information security management standard. It was developed from the British

standard BS 7799. Its latest version is ISO 27001:2013.

Ping An Data Center obtained ISO 27001 certification in 2008 and has maintained the

certification since then. It has undergone annual surveillance audits and certification renewal

audits (the certificate is valid for three years) by professional certification organizations.

The ISO 27001 certification has the following values:

(1) sustaining business capacity through the definition, evaluation, and control of risks;

(2) minimizing liabilities that may result from the breach of contracts and violation of legal

and regulatory requirements;

(3) improving business competitiveness and image by demonstrating compliance with the

international standard;

(4) clearly defining internal and external information access control to prevent information

misuse and loss;

(5) establishing a policy for the use of security tools;

(6) preventing the loss of technical know-how;

(7) enhancing information security awareness inside the organization;

(8) serving as evidence for public accounting audit.

The fact that the data center has maintained the certification demonstrates its commitment to

information security and its successful efforts in information security protection. More

importantly, the certification program contributes to better information security management

in the company.

6.2.4 ISO 20000 audit

ISO 20000 was developed by the International Organization for Standardization based on ITIL best

practices and BS 15000. Released on December 15, 2005, it is the first international standard for

IT service management systems. Aiming to sustain IT service quality through management and

standardization of service processes, it is available for certification by organizations to

demonstrate their IT service capability and quality.

However, the value of ISO 20000 certification is not limited to satisfying IT service

requirements and enhancing service quality. The certification also has positive implications in

quantifying services, appraising employee performance, and evaluating the return of IT

investments.

The operations service system of Ping An Data Center has been certified to ISO 20000. This

indicates that the data center’s operations service management capacity has been recognized by

leading international authorities. The certification also contributes to better service management

of the data center and better service management awareness of its personnel. The better service

and operations management of the data center in turn facilitates the long-term business

development of Ping An Data Center.

The data center will maintain the ISO 20000 certification by undergoing annual audits and

certification renewal audits (every three years).