NERSC Reliability Data

NERSC Reliability Data NERSC - PDSI Bill Kramer Jason Hick Akbar Mokhtarani PDSI BOF, FAST08 Feb. 27, 2008


Page 1: NERSC Reliability Data

NERSC Reliability Data

NERSC - PDSI

Bill Kramer

Jason Hick

Akbar Mokhtarani

PDSI BOF, FAST08

Feb. 27, 2008

Page 2: NERSC Reliability Data

Production Systems Studied at NERSC

• HPSS: 2 High Performance Storage Systems

• Seaborg: IBM SP RS/6000, AIX, 416 nodes (380 compute)

• Bassi: IBM p575 POWER 5, AIX, 122 nodes

• DaVinci: SGI Altix 350 (SGI ProPack 4 64-bit Linux)

• Jacquard: Opteron Cluster, Linux, 356 nodes

• PDSF: Networked distributed computing, Linux

• NERSC Global File-system: Shared file-system based on IBM’s GPFS

Page 3: NERSC Reliability Data

Datasets

• Data were extracted from the problem tracking database, paper records kept by the operations staff, and vendors’ repair records

• Coverage is from 2001 - 2006, in some cases a subset of that period

• Preliminary results on systems availability and component failure were presented at the HEC FSIO Workshop last August

• Have done a more detailed analysis trying to classify the underlying causes of outage and failure

• Produced statistics for the NERSC Global File-system (NGF) and uploaded them to the CMU website. This is different from fsstats; used fsstats to cross-check results on some smaller directory trees on NGF

• Made workload characterizations of selected NERSC applications available. They were produced by IPM, a performance monitoring tool.

• Made trace data available for selected applications

• Results from a number of I/O related studies done by other groups at NERSC were posted to the website.

Page 4: NERSC Reliability Data

Results

• Overall systems availability is 96% - 99%

• Seaborg and HPSS have comprehensive data for the 6 year period and show availability of 97% - 98.5% (counting both scheduled and unscheduled outages)

• Disk drive failure rates for Seaborg follow a pattern consistent with “aging” and “infant mortality”, with an average of 1.2%

• Tape drive failure rates for HPSS show the same pattern; the average rate (~19%) is higher than the manufacturer's stated 3% MTBF figure
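The rates quoted above reduce to simple arithmetic. The following is a minimal sketch, not the method used in the NERSC analysis: the function names are illustrative, the counts in the usage lines are hypothetical, and the exponential-lifetime model used to turn an MTBF into an annual rate is an assumption, since the slides do not say how the manufacturer figure was derived.

```python
import math

HOURS_PER_YEAR = 8760  # non-leap year


def annual_replacement_rate(replaced: int, installed: int) -> float:
    """Fraction of the installed population replaced during one year."""
    if installed <= 0:
        raise ValueError("installed must be positive")
    return replaced / installed


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annualized failure rate implied by an MTBF.

    Assumes exponentially distributed lifetimes (constant hazard rate),
    which is a modelling assumption, not something stated in the slides.
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


if __name__ == "__main__":
    # Hypothetical counts: 19 drives replaced out of 100 installed.
    observed = annual_replacement_rate(19, 100)
    # Hypothetical MTBF chosen so that the implied rate is close to 3%.
    stated = afr_from_mtbf(287_600)
    print(f"observed {observed:.1%}, implied by MTBF {stated:.1%}")
```

Comparing an observed replacement rate with a rate derived from MTBF is only meaningful if both cover the same population and period, so the two numbers are best read as a rough consistency check.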

Page 5: NERSC Reliability Data

Average Annual Outage

Data since: Seaborg (2001), Bassi (Dec. 2005), Jacquard (July 2005), DaVinci (Sept. 2005), HPSS (2003), NGF (Oct. 2005), PDSF (2001)

Figure: Average Annual System-wide Outage. Y axis: Percent (0 to 4.5). Systems: Seaborg, Bassi, Jacquard, PDSF, DaVinci, HPSS, NGF. Series: Scheduled, Un-Scheduled.

Page 6: NERSC Reliability Data

Seaborg and HPSS Annual Outage

Figure: HPSS Annual Outage (Archive). X axis: Year (2001 to 2006). Y axis: Minutes (0 to 9000). Series: SW, SCH; SW, UNSCH; HW, SCH; HW, UNSCH.

Figure: Seaborg Annual Outage. X axis: Year (2001 to 2006). Y axis: Percent (0 to 4). Series: Scheduled, Unscheduled.

Figure: HPSS Annual Outage. X axis: Year (2001 to 2006). Y axis: Percent (0 to 2.5). Series: Scheduled, Unscheduled.

Figure: Seaborg Annual Outage. X axis: Year (2001 to 2006). Y axis: Minutes (0 to 18000). Series: HW, SCH; HW, UNSCH; SW, SCH; SW, UNSCH.
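The charts on this slide show the same outages in two units, minutes and percent of the year. A minimal sketch of that conversion follows, assuming outage records are available as (year, category, minutes) tuples; the record layout, the function names, and the sample values are hypothetical and are not taken from the NERSC data.

```python
from collections import defaultdict

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years are ignored here


def outage_percent_by_year(records):
    """Sum outage minutes per (year, category) and express them as percent of a year.

    records: iterable of (year, category, minutes) tuples.
    Returns {year: {category: percent_of_year}}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for year, category, minutes in records:
        totals[year][category] += minutes
    return {
        year: {cat: 100.0 * mins / MINUTES_PER_YEAR for cat, mins in cats.items()}
        for year, cats in totals.items()
    }


def availability_percent(percent_by_category):
    """Availability for one year, counting every outage category as downtime."""
    return 100.0 - sum(percent_by_category.values())


if __name__ == "__main__":
    # Hypothetical records, using the category labels from the chart legends.
    sample = [
        (2003, "HW, UNSCH", 2628.0),
        (2003, "HW, UNSCH", 2628.0),
        (2003, "SW, SCH", 5256.0),
    ]
    by_year = outage_percent_by_year(sample)
    print(by_year[2003], availability_percent(by_year[2003]))
```

Whether scheduled outages count against availability is a policy choice; this sketch counts both, matching the "scheduled and unscheduled" wording on the Results slide.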

Page 7: NERSC Reliability Data

Disk and Tape Drives Failure

Figure: HPSS Tape Drives Replaced. X axis: Year (2001 to 2006). Left Y axis: Number (0 to 20). Right Y axis: Percent (0 to 35). Series: Actual, Percent.

Figure: Seaborg Disks Replaced. X axis: Year (2001 to 2006). Left Y axis: Percent (0 to 1.8). Right Y axis: Number (0 to 80). Series: Percentage, Actual.

Page 8: NERSC Reliability Data

Extra Slides

Page 9: NERSC Reliability Data

Outage Classifications

Figure: Seaborg Category Outage. X axis: Minutes (0 to 18000). Categories: Accounting, Benchmark, Control Work Station, Dedicated test, Disk, File system, HW, HW Upgrade, LoadLeveler, OSF Power, Security, SW, SW Upgrade, Switch.

Figure: HPSS Outage by Category. X axis: Minutes (0 to 25000). Categories: SW, SCH; SW, UNSCH; HW, SCH; HW, UNSCH; HPSS UPGRADE; NETWORK; OSF POWER.

Page 10: NERSC Reliability Data

Seaborg Data

• IBM SP RS/6000, AIX 5.2

• 416 nodes; 380 compute nodes

• 4280 disk drives (4160 SSA, 120 Fibre Channel)

• The large number of disk failures in 2003 can be attributed to “aging” of older drives and “infant mortality” of newer disks

Figure: Number of ... X axis: Date Installed (Jan-01 to Nov-06). Y axis: Number (0 to 4500). Series: Total, SSA, FC.

Figure: Seaborg Disks Replaced. X axis: Year (2001 to 2006). Left Y axis: Percent (0 to 1.8). Right Y axis: Number (0 to 80). Series: Percentage, Actual.

Figure: Seaborg Disk Failure. X axis: Date (Jan-01 to Nov-06). Y axis: Number (0 to 12). Series: SSA, FASTT.

Page 11: NERSC Reliability Data

HPSS Data

• Two HPSS systems available at NERSC

• Eight tape silos with 100 tape drives attached

• Tape drives seem to show the same failure pattern as Seaborg’s disk drives: “aging” and “infant mortality”.

Figure: HPSS Tape Drives. X axis: Date Installed (Jan-01 to Nov-06). Y axis: Number (0 to 120). Series: Total, T9840A, T9940A, T9940B, T10KA.

Figure: HPSS Tape Drive Failure. X axis: Date (Jan-01 to Nov-06). Y axis: Number (0 to 3). Series: T9840A, T9940A, T9940B.

Figure: HPSS Tape Drives Replaced. X axis: Year (2001 to 2006). Left Y axis: Number (0 to 20). Right Y axis: Percent (0 to 35). Series: Actual, Percent.

Page 12: NERSC Reliability Data

NGF stats

Figure: NGF File size (11/28/2007). Y axis: count, log scale (1 to 10000000). X axis: file size in KB, in power-of-two buckets from 0 - 1 up to 536870912 - 1073741823.

Figure: NGF Directory Entries (11/28/2007). Y axis: count, log scale (1 to 1000000). X axis: entries per directory, in power-of-two buckets from 0 - 1 up to 131072 - 262143.
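The axis labels on both NGF charts pair lower bounds 0, 2, 4, 8, ... with upper bounds 1, 3, 7, 15, ..., which suggests power-of-two buckets. A minimal sketch of such a histogram follows; the function names are illustrative, the input is assumed to be non-negative integers (file sizes already expressed in whole KB, or entry counts per directory), and this is not the code used to produce the NGF statistics.

```python
from collections import Counter


def bucket_bounds(value: int) -> tuple[int, int]:
    """Return the inclusive (low, high) power-of-two bucket containing value.

    Buckets are 0-1, 2-3, 4-7, 8-15, ... as read off the chart axes.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    if value < 2:
        return (0, 1)
    low = 1 << (value.bit_length() - 1)  # largest power of two <= value
    return (low, 2 * low - 1)


def histogram(values) -> dict[tuple[int, int], int]:
    """Count values per bucket, returned in ascending bucket order."""
    counts = Counter(bucket_bounds(v) for v in values)
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Hypothetical sizes in KB, for illustration only.
    for (low, high), n in histogram([0, 1, 2, 3, 4, 7, 8, 1000]).items():
        print(f"{low:>6} - {high:<6} {n}")
```

Because the counts span several orders of magnitude, plotting them on a logarithmic axis, as the slides do, keeps the sparsely populated large buckets visible.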

Page 13: NERSC Reliability Data

Seaborg Outages

Page 14: NERSC Reliability Data

HPSS Outages