CERN Data Services Update
![Page 1: CERN Data Services Update](https://reader033.fdocuments.us/reader033/viewer/2022051620/568143f4550346895db07e4a/html5/thumbnails/1.jpg)
CERN Data Services Update
HEPiX 2004 / NeSC Edinburgh
Data Services team: Vladimír Bahyl, Hugo Caçote, Charles Curran, Jan van Eldik, David Hughes, Gordon Lee, Tony Osborne, Tim Smith
![Page 2: CERN Data Services Update](https://reader033.fdocuments.us/reader033/viewer/2022051620/568143f4550346895db07e4a/html5/thumbnails/2.jpg)
2004/05/26 CERN Data Services: [email protected] 2 of 19
Outline
- Data Services Drivers
- Disk Service
  - Migration to Quattor / LEMON
  - Future directions
- Tape Service
  - Media migration
  - Future directions
- Grid Data Services
![Page 3: CERN Data Services Update](https://reader033.fdocuments.us/reader033/viewer/2022051620/568143f4550346895db07e4a/html5/thumbnails/3.jpg)
Data Flows
- Tier-0 / Tier-1 for the LHC
- Data Challenges:
  - CMS: DC04 (finished); PCP05 (Autumn) +80; +170
  - ALICE ongoing: +137 TB
  - LHCb ramping up: +40 TB
  - ATLAS ramping up: +60 TB
- Fixed Target Programme:
  - NA48 at 80 MB/s: +200 TB
  - COMPASS at 70 MB/s (peak 120): +625 TB
  - nToF at 45 MB/s: +180 TB
  - NA60 at 15 MB/s: +60 TB
  - Testbeams at 1~5 MB/s (x 5)
- Analysis…
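As a rough cross-check of the fixed-target figures, the quoted rates and volumes can be turned into days of continuous data taking. This is only a sketch: it assumes the rates are sustained averages and that the units are decimal (1 TB = 10^6 MB), neither of which the slide states.

```python
# Rates (MB/s) and added volumes (TB) as quoted on the slide.
FIXED_TARGET = {
    "NA48": (80, 200),
    "COMPASS": (70, 625),
    "nToF": (45, 180),
    "NA60": (15, 60),
}

MB_PER_TB = 1_000_000    # assumption: decimal units
SECONDS_PER_DAY = 86_400


def days_to_accumulate(rate_mb_s: float, volume_tb: float) -> float:
    """Days of continuous writing needed to reach volume_tb at rate_mb_s."""
    return volume_tb * MB_PER_TB / rate_mb_s / SECONDS_PER_DAY


for name, (rate, volume) in FIXED_TARGET.items():
    days = days_to_accumulate(rate, volume)
    print(f"{name}: {days:.1f} days at {rate} MB/s for {volume} TB")
```

Under these assumptions NA48 would need about 29 days of uninterrupted running to write its 200 TB.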
![Page 4: CERN Data Services Update](https://reader033.fdocuments.us/reader033/viewer/2022051620/568143f4550346895db07e4a/html5/thumbnails/4.jpg)
Disk Server Functions
- AFS, 1%
- Oracle, 8%
- CASTOR: Experiment dedicated, 64%
- CASTOR: Infrastructure, 9%
- CASTOR: Public Services, 4%
- LCG, 14%
![Page 5: CERN Data Services Update](https://reader033.fdocuments.us/reader033/viewer/2022051620/568143f4550346895db07e4a/html5/thumbnails/5.jpg)
Generations
- 0th: Jumbos
- 1st & 2nd: 4U
- 3rd & 4th: 8U
![Page 6: CERN Data Services Update](https://reader033.fdocuments.us/reader033/viewer/2022051620/568143f4550346895db07e4a/html5/thumbnails/6.jpg)
Warranties
[Chart: number of disk servers (0 to 400) against date (Jan-00 to Jan-09), stacked by purchase batch (COGESTRA 450 and 500; ELONEX 500, 650, 700, 900, 1GHz, 1.1GHz, 2.0GHz and 2.4GHz; TECH 800; JTT 1GHz and 1.1GHz), grouped into 0th to 4th generation, with the out-of-warranty servers marked.]
![Page 7: CERN Data Services Update](https://reader033.fdocuments.us/reader033/viewer/2022051620/568143f4550346895db07e4a/html5/thumbnails/7.jpg)
Disk Servers: Jan 2004
- 370 EIDE Disk Servers
  - Commodity storage in a box
  - 544 TB of disk capacity
  - 6700 spinning disks
- Storage Configuration
  - HW Raid-1 mirrored for “maximum reliability”
  - ext2 file systems
- Operating systems
  - RH6.1, 6.2, 7.2, 7.3, RHES
  - 13 different kernels
- Application uniformity; CASTOR SW
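The January 2004 totals imply a simple per-server average. The sketch below only divides the quoted figures; it makes no assumption about whether the 544 TB is raw or mirrored capacity, since the slide does not say.

```python
# Figures quoted on the slide for January 2004.
SERVERS = 370
CAPACITY_TB = 544
DISKS = 6700


def per_server(total: float, servers: int = SERVERS) -> float:
    """Average amount of a resource per disk server."""
    return total / servers


print(f"{per_server(CAPACITY_TB):.2f} TB per server")
print(f"{per_server(DISKS):.1f} disks per server")
```

That works out to roughly 1.5 TB and 18 disks per box on average.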
![Page 8: CERN Data Services Update](https://reader033.fdocuments.us/reader033/viewer/2022051620/568143f4550346895db07e4a/html5/thumbnails/8.jpg)
Quattor-ising
- Motivation: Scale
  - Uniformity; Manageability; Automation
- Configuration Description (into CDB)
  - HW and SW; nodes and services
- Reinstallation
  - Production machines – min service interruption!
  - Eliminate peculiarities from CASTOR nodes
    - MySQL, web servers
  - Refocus root control
  - Quiescing a disk server ≠ draining a batch node!
- Gigabit cards gymnastics (ext2 -> ext3)
- Complete (except 10 RH6 boxes for Objectivity)
![Page 9: CERN Data Services Update](https://reader033.fdocuments.us/reader033/viewer/2022051620/568143f4550346895db07e4a/html5/thumbnails/9.jpg)
LEMON-ising
- MSA everywhere
  - Linux box monitoring and alarms
  - Automatic HW static checks
- Adding
  - CASTOR server specific
  - Service monitoring
- HW Monitoring
  - lm_sensors (see tape section)
  - smartmontools
    - smartd deployment
    - Kernel issues; firmware bugs; through 3ware controller
    - smartctl auto checks; predictive monitoring
  - IPMI investigations; especially remote access
    - Remote reset/power-on/power-off
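An automated SMART health check of the kind mentioned above might look like the following minimal sketch. It is not the team's actual tooling: the function names are hypothetical, and the device paths and port numbers are placeholders. It relies on `smartctl -H` for the overall health verdict and on `-d 3ware,N` to address a disk behind a 3ware controller.

```python
import subprocess
from typing import List, Optional


def smartctl_health_cmd(device: str, port: Optional[int] = None) -> List[str]:
    """Build a `smartctl -H` command; `port` selects a disk behind a 3ware card."""
    cmd = ["smartctl", "-H"]
    if port is not None:
        cmd += ["-d", f"3ware,{port}"]
    cmd.append(device)
    return cmd


def health_passed(output: str) -> bool:
    """Return True only if the overall-health line reports PASSED."""
    for line in output.splitlines():
        if "overall-health self-assessment test result" in line:
            return line.rsplit(":", 1)[1].strip() == "PASSED"
    return False  # no verdict found: treat as a failure to be investigated


def check_disk(device: str, port: Optional[int] = None) -> bool:
    """Run smartctl (needs root and smartmontools installed) and parse the verdict."""
    result = subprocess.run(
        smartctl_health_cmd(device, port),
        capture_output=True, text=True, check=False,
    )
    return health_passed(result.stdout)
```

Treating a missing verdict as a failure is a deliberate choice here: with kernel and firmware issues in play, an unreadable SMART status is itself worth an alarm.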
![Page 10: CERN Data Services Update](https://reader033.fdocuments.us/reader033/viewer/2022051620/568143f4550346895db07e4a/html5/thumbnails/10.jpg)
Disk Replacement
- Failure rate unacceptably high
  - 10 months to be believed
  - 4 weeks to execute
  - 1224 disks exchanged (out of 6700)
  - And the cages
- Western Digital; type DUA
  - Head instabilities
[Chart: % Broken Mirrors (0.0% to 4.5%), Dec-03 to May-04]
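To put the replacement campaign in proportion, the exchanged disks can be expressed as a share of the installed base. The sketch uses only the two numbers quoted on the slide.

```python
# Figures quoted on the slide.
EXCHANGED = 1224
INSTALLED = 6700


def fraction_exchanged(exchanged: int = EXCHANGED, installed: int = INSTALLED) -> float:
    """Share of the installed disks that were swapped out."""
    return exchanged / installed


print(f"{fraction_exchanged():.1%} of installed disks exchanged")
```

That is a little over 18% of all spinning disks, replaced within the four-week window.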
![Page 11: CERN Data Services Update](https://reader033.fdocuments.us/reader033/viewer/2022051620/568143f4550346895db07e4a/html5/thumbnails/11.jpg)
Disk Storage Futures
- EIDE Commodity storage in a box
  - Production systems
    - HW Raid-1 / ext3
  - Pilots (15 production systems)
    - HW Raid-5 + SW Raid-0 / XFS (see Jan Iven’s talk next)
- New tenders out…
  - 30TB SATA in a box
  - 30TB external SATA disk arrays
- New CASTOR stager (see Olof’s talk)
![Page 12: CERN Data Services Update](https://reader033.fdocuments.us/reader033/viewer/2022051620/568143f4550346895db07e4a/html5/thumbnails/12.jpg)
Tape Service
- 70 tape servers (Linux)
- (mostly) Single FibreChannel attached drives
- 2 symmetric robotic installations
  - 5 x STK 9310 Silos in each
- Drives
  - 9940B: 50
  - 9940A: 4
  - 9840: 20
  - 3590: 14
  - LTO: 6
- Media
  - 9940B: 14157
  - 9940A: 8889
  - 9840: 8149
  - 3590: 8639
![Page 13: CERN Data Services Update](https://reader033.fdocuments.us/reader033/viewer/2022051620/568143f4550346895db07e4a/html5/thumbnails/13.jpg)
Tape Server Temperatures
- lm_sensors package
  - General SMBus access and hardware monitoring.
- Used to access
  - LM87 chip
    - Fan speeds
    - Voltages
    - Int/Ext temperatures
  - ADM1023 chip
    - Int/Ext temperatures
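Once temperatures have been read from the sensor chips, turning them into alarms is a threshold comparison. The sketch below assumes the readings have already been collected into a dictionary; the sensor names, values and limits are purely illustrative and do not come from the slides.

```python
from typing import Dict


def over_threshold(readings: Dict[str, float], limits: Dict[str, float]) -> Dict[str, float]:
    """Return the readings that exceed their configured limit.

    Sensors without a configured limit are ignored rather than flagged.
    """
    return {
        name: value
        for name, value in readings.items()
        if name in limits and value > limits[name]
    }


# Illustrative values only (hypothetical sensor names and limits, in degrees C).
readings = {"int_temp": 71.0, "ext_temp": 38.5}
limits = {"int_temp": 65.0, "ext_temp": 45.0}
print(over_threshold(readings, limits))
```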
![Page 15: CERN Data Services Update](https://reader033.fdocuments.us/reader033/viewer/2022051620/568143f4550346895db07e4a/html5/thumbnails/15.jpg)
Media Migration
- To 9940B (mainly from 9940A)
  - 200GB – extra capacity avoids unnecessary acquisitions
  - Better performance – though hard to benefit in normal chaotic mode
  - Reduced errors; fewer interventions
- 1-2% of A tapes cannot be read (extremely slow) on B drives
  - Have not been able to return all A-drives
![Page 16: CERN Data Services Update](https://reader033.fdocuments.us/reader033/viewer/2022051620/568143f4550346895db07e4a/html5/thumbnails/16.jpg)
Tape Service Developments
- Removing tails…
  - Tracking of all tape errors (18 months)
  - Retiring of problematic media
  - Proactive retiring of heavily used media (>5000 mounts)
    - repack on new media
- Checksums
  - Populated when writing to tape
  - Verified when loading back to disk
  - 22% already after a few weeks
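The populate-on-write, verify-on-recall pattern for checksums can be sketched as follows. The slides do not name the algorithm, so Adler-32 (via Python's `zlib.adler32`) is an assumption made for illustration, and the function names are hypothetical.

```python
import zlib
from typing import Iterable


def adler32_of(chunks: Iterable[bytes]) -> int:
    """Compute a running Adler-32 over a stream of data chunks."""
    checksum = 1  # Adler-32 starting value
    for chunk in chunks:
        checksum = zlib.adler32(chunk, checksum)
    return checksum & 0xFFFFFFFF


def verify_recall(expected: int, chunks: Iterable[bytes]) -> bool:
    """Check data read back from tape against the checksum stored at write time."""
    return adler32_of(chunks) == expected


# On write: compute the checksum while streaming to tape and store it with the file entry.
stored = adler32_of([b"event data ", b"block 1"])

# On recall: recompute while streaming back to disk and compare.
print(verify_recall(stored, [b"event data block 1"]))
```

Because the checksum is computed incrementally, it can be filled in during the normal write stream without a second pass over the data, which is why coverage can grow gradually as files are rewritten.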
![Page 17: CERN Data Services Update](https://reader033.fdocuments.us/reader033/viewer/2022051620/568143f4550346895db07e4a/html5/thumbnails/17.jpg)
Water Cooled Tapes!
- Plumbing error!
- 5000 tapes disabled for a few days
  - 550 superficially wet
  - 152 seriously wet – visually inspected
![Page 18: CERN Data Services Update](https://reader033.fdocuments.us/reader033/viewer/2022051620/568143f4550346895db07e4a/html5/thumbnails/18.jpg)
Tape Storage Futures
- Commodity drive studies
  - LTO-2 (Collaboratively CASPUR/Valencia)
- Test and evaluate High-end drives
  - IBM 3592
  - STK NGD
- Other STK offerings
  - SL8500 robotics and silos
  - Indigo; managed storage, tape virtualisation
![Page 19: CERN Data Services Update](https://reader033.fdocuments.us/reader033/viewer/2022051620/568143f4550346895db07e4a/html5/thumbnails/19.jpg)
GRID Data Management
- GridFTP + SRM servers
  - (Former) Standalone / experiment dedicated
    - Hard to intervene; not scalable
  - New load-balanced 6 node Service
    - castorgrid.cern.ch
    - SRM modifications to support operation behind load balancer
    - GridFTP standalone client
- Retire ftp and bbftp access to CASTOR
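The slides do not describe how the load balancing behind the service alias is done. One common approach, shown here purely as an assumption, is to publish only the least-loaded healthy nodes behind the alias at any moment. The node names and load figures below are hypothetical.

```python
from typing import Dict, List, Set


def pick_alias_members(loads: Dict[str, float], healthy: Set[str], count: int = 2) -> List[str]:
    """Choose the `count` least-loaded healthy nodes to publish behind the alias.

    Ties are broken by node name so the result is deterministic.
    """
    candidates = [node for node in loads if node in healthy]
    return sorted(candidates, key=lambda node: (loads[node], node))[:count]


# Hypothetical six-node pool with one node taken out for intervention.
loads = {"node1": 0.9, "node2": 0.2, "node3": 0.5, "node4": 0.2, "node5": 0.7, "node6": 0.4}
healthy = {"node1", "node2", "node3", "node5", "node6"}
print(pick_alias_members(loads, healthy))
```

A scheme like this also addresses the "hard to intervene" problem of the former standalone servers: dropping a node from the healthy set removes it from service without clients changing the name they connect to.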