Hardware failures

Computing Facilities CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/ CF Hardware failures Wayne Salter on behalf of Olof Bärring


Transcript of Hardware failures

Page 1: Hardware failures

Computing Facilities

CERN IT Department

CH-1211 Geneva 23

Switzerland

www.cern.ch/it

CF

Hardware failures

Wayne Salter

on behalf of Olof Bärring

Page 2: Hardware failures


Outline

• Failures
  – What fails?
  – How often?
  – When?
• Repairs
  – How?
  – By whom?
  – How quickly?
• Conclusions

CERN IT facility

Page 3: Hardware failures


What fails? And how do we know?

• The only things we know for sure about hardware are:
  1. It will fail
  2. Some of it fails more often than the rest…
     • disk drives for instance
• Monitoring failures
  – Disks: assume fail-stop, but reality is more complex
  – At CERN we base our decision on SMART counters and failed media scans
• Monitoring ‘repairs’ rather than ‘failures’:
  – Vendor tickets (~4k in 2010-11)
  – Changes in the serial number inventory (~10k in 2010-11)
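Counting repairs through serial-number changes can be pictured as a comparison of two inventory snapshots. The following is a minimal sketch, not the tool used at CERN: the snapshot layout (a mapping from a slot identifier to the serial number seen in that slot), the function name `replaced_parts` and the sample serial numbers are all assumptions made for illustration.

```python
def replaced_parts(before, after):
    """Return {slot: (old_serial, new_serial)} for slots whose serial changed.

    `before` and `after` are hypothetical snapshots mapping a slot
    identifier (e.g. "host1:sda") to the serial number seen in that slot.
    Slots present in only one snapshot are ignored: they indicate added or
    retired hardware rather than a repair.
    """
    changes = {}
    for slot, old_serial in before.items():
        new_serial = after.get(slot)
        if new_serial is not None and new_serial != old_serial:
            changes[slot] = (old_serial, new_serial)
    return changes


# Illustrative data only; the serial numbers are made up.
monday = {"host1:sda": "WD-AAA111", "host1:sdb": "WD-BBB222"}
friday = {"host1:sda": "WD-AAA111", "host1:sdb": "WD-CCC333"}
changes = replaced_parts(monday, friday)
print(changes)
```

A count of such changes over time measures repairs rather than failures, so any part swapped for another reason is counted as well.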


Page 4: Hardware failures


Failure space

• CERN IT by numbers (14/9/2011)

Item                              Count
Number of systems                 8,792
Number of processors             14,972
Memory modules                   55,729
Number of HDDs                   62,023
Number of RAID controllers        3,607
Number of Fibre Channel ports       742
Number of 1G ports               16,773
Number of 10G ports                 622

Page 5: Hardware failures


How often?

• Monitoring changes in serial numbers gives an idea

[Figure: serial-number changes per month, 01-Apr-10 to 01-Dec-11; logarithmic y-axis from 1 to 10000; x-axis: Month; data labels: 1425 and 3886; annotation: Bulk campaigns]

Page 6: Hardware failures


How often?

• Monitoring changes in serial numbers gives an idea
  – Excluding campaigns: ~170 disks/month (5/day)

[Figure: serial-number changes per month excluding bulk campaigns, 01-Apr-10 to 01-Dec-11; y-axis from 0 to 300]

HDD failures/day: 5; hours/day: 24, so ~1 failure per 5 hrs

64,000 drives in the centre, so MTTF ≈ 64,000 × 5 hrs = 320,000 hrs

(Spec: 1.2 Mhrs)
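The MTTF estimate is plain arithmetic on the figures above and can be checked with a short sketch. The function names are illustrative, and the annualized failure rate uses the common approximation of hours per year divided by MTTF, which assumes a constant failure rate; that conversion is an addition here and does not appear on the slides.

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 24 * 365  # 8,760


def fleet_mttf_hours(drives, failures_per_day):
    """Observed MTTF: drive-hours accumulated per day divided by failures per day."""
    return drives * HOURS_PER_DAY / failures_per_day


def annualized_failure_rate(mttf_hours):
    """Approximate fraction of drives failing per year, assuming a constant rate."""
    return HOURS_PER_YEAR / mttf_hours


observed = fleet_mttf_hours(drives=64_000, failures_per_day=5)
print(f"observed MTTF: {observed:,.0f} h")
print(f"observed AFR:  {annualized_failure_rate(observed):.1%}")
print(f"spec AFR:      {annualized_failure_rate(1_200_000):.1%}")
```

Without rounding, 24 / 5 gives 4.8 hours between failures and an MTTF of 307,200 hours; the slide rounds the interval up to 5 hours, which gives 320,000. Either way the observed value is roughly a quarter of the specified 1.2 Mhrs.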

Page 7: Hardware failures


When?

Failure rates of hardware products typically follow a “bathtub curve” with high failure rates at the beginning (infant mortality) and the end (wear-out) of the lifecycle [1].

[1] http://www.usenix.org/events/fast07/tech/schroeder/schroeder.pdf

Page 8: Hardware failures


When?

Process and categorize 2010-11 vendor calls according to ‘warranty age’ when the call was opened

[Figure: Quarterly failure rate (0% to 40%) vs. warranty age (days, 0 to 1200); series: All failures - Disk servers, Disk failures - Disk servers, All failures - CPU servers, Disk failures - CPU servers; annotation: 10x disks to CPU servers]

Page 9: Hardware failures


When?

Quarterly disk failure rate normalized to number of disks

[Figure: Normalised disk failures, quarterly rate (0.0% to 1.6%) vs. warranty age (days, 0 to 1200); series: Normalized disk failures - CPU servers, Normalized disk failures - Disk servers; annotation: Early failures (infant mortality)]
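The normalisation can be sketched as failures per warranty-age quarter divided by the number of disks exposed in that quarter. This is a minimal sketch under stated assumptions, not the analysis behind the plot: the 91-day bin width, the function name `quarterly_rates`, the shape of both inputs and all sample numbers are hypothetical.

```python
from collections import Counter

QUARTER_DAYS = 91  # assumed bin width; the slides only say "quarterly"


def quarterly_rates(failure_ages_days, disks_per_quarter):
    """Failures per warranty-age quarter divided by disks exposed in that quarter.

    failure_ages_days: warranty age in days at which each failure call was opened.
    disks_per_quarter: hypothetical mapping quarter index -> number of disks
    that spent that quarter of their warranty inside the observation window.
    """
    failures = Counter(age // QUARTER_DAYS for age in failure_ages_days)
    return {
        quarter: failures.get(quarter, 0) / disks
        for quarter, disks in sorted(disks_per_quarter.items())
        if disks > 0
    }


# Made-up numbers for illustration, not CERN data.
ages = [10, 40, 80, 100, 400]
population = {0: 1000, 1: 1000, 4: 500}
rates = quarterly_rates(ages, population)
for quarter, rate in rates.items():
    print(f"quarter {quarter}: {rate:.2%}")
```

Dividing by the disk population is what makes disk servers and CPU servers comparable, since the two server types carry very different numbers of disks.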

Page 10: Hardware failures


When?

Other failure types
• Swappable: RAM, PSU, BBU, BMC, …
• Complex repairs: cabling, backplane, main board, … no clue…

[Figure: Swappable (RAM, PSU, ...), quarterly rate (0.0% to 6.0%) vs. warranty age (days, 0 to 1200); series: CPU servers, Disk servers]

[Figure: Complex repairs, quarterly rate (0.0% to 4.0%) vs. warranty age (days, 0 to 1200); series: CPU servers, Disk servers]

Page 11: Hardware failures


Repairs

[Diagram labels: Alarm; Vendor call; New sn: WD3342ABC]

Page 12: Hardware failures


By whom?

Vendor

Page 13: Hardware failures


How quickly?

• Two contract types
• ‘Normal’ only used for CPU servers

Type     Time to intervene    Repair time
Normal   24 working hours     40 working hours
Fast     4 working hours      12 working hours

[Figure: number of interventions (0 to 300) vs. calendar hours (0 to 208, in steps of 13); annotations: Repair target: 12 working hours; ~30%]
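The contract targets are expressed in working hours while the histogram is in calendar hours, so checking a repair against its target needs a conversion. The following is a minimal sketch of one: the working window (Monday to Friday, 09:00 to 17:00) is an assumption, since the slides do not define "working hours", and the function names and the sample call are hypothetical. Public holidays are ignored.

```python
from datetime import datetime, timedelta

# Assumed working window; the slides do not define "working hours".
WORK_START_HOUR = 9
WORK_END_HOUR = 17


def working_hours_between(start, end):
    """Count working hours (Mon-Fri, within the assumed window) from start to end."""
    total = timedelta()
    day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    while day <= end:
        if day.weekday() < 5:  # Monday=0 ... Friday=4
            window_open = day + timedelta(hours=WORK_START_HOUR)
            window_close = day + timedelta(hours=WORK_END_HOUR)
            overlap_start = max(start, window_open)
            overlap_end = min(end, window_close)
            if overlap_end > overlap_start:
                total += overlap_end - overlap_start
        day += timedelta(days=1)
    return total.total_seconds() / 3600


def missed_target(opened, repaired, target_working_hours=12):
    """True if the repair took longer than the contract's working-hour target."""
    return working_hours_between(opened, repaired) > target_working_hours


# Made-up call: opened Friday 15:00, repaired the following Tuesday 10:00.
opened = datetime(2011, 9, 16, 15, 0)
repaired = datetime(2011, 9, 20, 10, 0)
print(working_hours_between(opened, repaired))
print(missed_target(opened, repaired))
```

In this example 91 calendar hours elapse, but under the assumed window only 11 of them are working hours, so the 12-hour "Fast" target would still be met.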

Page 14: Hardware failures


Ongoing Improvements

• Tracking changes to servers
  – Keep current tools that report HW info

Controller 0: Vendor="Intel Corporation" Model="82801JI (ICH10 Family) SATA AHCI Controller" Location="/sys/devices/pci0000:00/0000:00:1f.2" BBU="None" Cache="None" Serial="None" Version="None" Driver="ahci" Type="sata"
Controller 0 Port 0: Vendor="WDC" Model="WD1002FBYS-02A6B0" Size="953869" Serial="WD-WMATV4729249" Version="03.00C06" Device="sda"
Controller 0 Port 1: Vendor="WDC" Model="WD1002FBYS-02A6B0" Size="953869" Serial="WD-WMATV8136033" Version="03.00C06" Device="sdb"
Controller 0 Port 2: Vendor="WDC" Model="WD1002FBYS-02A6B0" Size="953869" Serial="WD-WMATV4713233" Version="03.00C06" Device="sdc"
BIOS: Vendor="American Megatrends Inc." Version="080015 (07/20/2009)" smt="enabled"
BMC: Vendor="Winbond" Model="IPMI 2.0" IPMI Version="2.0" MAC="00:00:00:00:00:0A" Serial="" Version="1.12"
CPU 0: Vendor="GenuineIntel" Model="Intel(R) Xeon(R) CPU L5520 @ 2.27GHz" Cores="4" Speed="2270"
CPU 1: Vendor="GenuineIntel" Model="Intel(R) Xeon(R) CPU L5520 @ 2.27GHz" Cores="4" Speed="2270"
NIC 0: Vendor="Intel Corporation" Model="82574L Gigabit Network Connection" Location="/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0" MAC="00:00:00:00:00:00" Speed="1024000" Bus="pci" Media="ethernet" Version="1.9-0"
NIC 1: Vendor="Intel Corporation" Model="82574L Gigabit Network Connection" Location="/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0" MAC="00:00:00:00:00:0F" Speed="1024000" Bus="pci" Media="ethernet" Version="1.9-0"
RAM 0: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P2-DIMM1A" Type="Other" Serial="00000001"
RAM 1: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P2-DIMM1B" Type="Other" Serial="00000002"
RAM 2: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P2-DIMM2A" Type="Other" Serial="00000003"
RAM 3: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P2-DIMM2B" Type="Other" Serial="00000004"
RAM 4: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P2-DIMM3A" Type="Other" Serial="00000005"
RAM 5: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P2-DIMM3B" Type="Other" Serial="00000006"
RAM 6: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P1-DIMM1A" Type="Other" Serial="00000007"
RAM 7: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P1-DIMM1B" Type="Other" Serial="00000008"
RAM 8: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P1-DIMM2A" Type="Other" Serial="00000009"
RAM 9: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P1-DIMM2B" Type="Other" Serial="00000010"
RAM 10: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P1-DIMM3A" Type="Other" Serial="00000011"
RAM 11: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P1-DIMM3B" Type="Other" Serial="00000012"
Serial: "SDFGSDFG34DFGDFG345DFGDFG345"

  – Will store each server’s HW info as a document (HW inventory)
  – Key is a unique id stored in the BMC when the hardware is purchased
  – Change log, e.g. replaced parts, for each server
  – Goals:
    – Better accessibility and usability of data
    – Provide a base for a more comprehensive HW inventory tool
    – Systematic tracking of parts replacement due to failure
    – Trending and potential action (e.g. #disk replacements in last month > X)
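Two of the ideas above, storing the reported HW info as a per-server document and acting on a replacement trend, can be sketched briefly. This is an illustration under assumptions, not the planned CERN tool: it assumes component records of the form `Label: Key="value" ...` with plain double quotes, and the function names, the change-log entry layout, the 30-day window and the sample log are all hypothetical. The trailing `Serial: "..."` record has a different shape and is not handled.

```python
import re
from datetime import date, timedelta

# Matches Key="value"; keys may contain spaces (e.g. "Data Rate").
PAIR = re.compile(r'([\w ]+?)="([^"]*)"')


def parse_record(record):
    """Split 'Label: Key="value" ...' into (label, {key: value})."""
    label, _, rest = record.partition(": ")
    fields = {key.strip(): value for key, value in PAIR.findall(rest)}
    return label, fields


def build_document(records):
    """Build a per-server document: component label -> field mapping."""
    return dict(parse_record(record) for record in records)


def too_many_replacements(change_log, today, threshold, window_days=30):
    """True if disk replacements logged within the window exceed the threshold."""
    cutoff = today - timedelta(days=window_days)
    recent = [entry for entry in change_log
              if entry["part"] == "disk" and entry["date"] > cutoff]
    return len(recent) > threshold


records = [
    'Controller 0 Port 0: Vendor="WDC" Model="WD1002FBYS-02A6B0" '
    'Size="953869" Serial="WD-WMATV4729249" Version="03.00C06" Device="sda"',
    'RAM 0: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" '
    'Data Rate="1066" Location="P2-DIMM1A" Type="Other" Serial="00000001"',
]
doc = build_document(records)
print(doc["Controller 0 Port 0"]["Serial"])

# Hypothetical change log; dates and parts are made up.
log = [
    {"date": date(2011, 9, 1), "part": "disk"},
    {"date": date(2011, 9, 10), "part": "disk"},
    {"date": date(2011, 6, 1), "part": "disk"},
    {"date": date(2011, 9, 12), "part": "RAM"},
]
print(too_many_replacements(log, today=date(2011, 9, 14), threshold=1))
```

Keying each document by the unique id stored in the BMC, as the slide proposes, would let successive documents for the same server be compared to produce the change log.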

Page 15: Hardware failures


Conclusions

• Hardware fails
  – As expected
  – More often than expected
    • MTTF ~320 khours rather than 1.2 Mhours
  – When expected:
    • Effect of early failures (infant mortality) in the first year
    • No sign of wear-out at the end of the 3-year warranty
• Repairs are currently carried out by the vendor
  – Missed repair targets in ~30% of cases
  – Looking at a different model…


Page 16: Hardware failures


Questions?
