Petabyte Scale Data Challenge - Worldwide LHC Computing Grid
ASGC / Jason Shih
Computex, Jun 2nd, 2010
Outline
- Objectives & Milestones
- WLCG experiments and the ASGC Tier-1 Center
- Petabyte Scale Challenge
- Storage Management System
- System Architecture, Configuration and Performance
Objectives
- Build a sustainable research and collaboration infrastructure
- Support e-Science research in data-intensive sciences and applications that require cross-disciplinary, distributed collaboration
ASGC Milestones
- Operational since the deployment of LCG0 in 2002
- ASGC CA established in 2005 (IGTF accreditation in the same year)
- Tier-1 Center responsibility started in 2005
- The federated Taiwan Tier-2 center (Taiwan Analysis Facility, TAF) is also collocated at ASGC
- Representative of the EGEE e-Science Asia Federation since joining EGEE in 2004
- Providing Asia Pacific Regional Operation Center (APROC) services to the region-wide WLCG/EGEE production infrastructure since 2005
- Initiated the Avian Flu Drug Discovery Project in collaboration with EGEE in 2006
- EUAsiaGrid Project started in April 2008
LHC First Beam - Computing at the Petascale
- ATLAS: general purpose, pp, heavy ions
- CMS: general purpose, pp, heavy ions
- ALICE: heavy ions, pp
- LHCb: B-physics, CP violation
Size of the LHC Detectors
[Figure: ATLAS and CMS detectors compared with CERN Bldg. 40]
UNESCO Information Preservation debate, April 2007 - Jamie Shiers @ CERN
http://www.damtp.cam.ac.uk/user/gr/public/bb_history.html
Standard Cosmology
- Good model from 0.01 sec after the Big Bang
- Supported by considerable observational evidence
Elementary Particle Physics
- From the Standard Model into the unknown: towards energies of 1 TeV and beyond - the Terascale
Towards Quantum Gravity
- From the unknown into the unknown...
[Figure axes: time vs. energy, density, temperature]
WLCG Timeline
- First beam in the LHC, Sep. 10, 2008
- Severe incident after 3 weeks of operation (3.5 TeV)
ASGC - Introduction
- Large Hadron Collider (LHC)
- Avian Flu Drug Discovery; Grid Application Platform
- A Worldwide Grid Infrastructure
- Asia Pacific Regional Operation Center
- >250 sites, 48 countries; >68,000 CPUs; >25 PetaBytes; >10,000 users; >200 VOs; >150,000 jobs/day
- Best Demo Award at EGEE'07
- Lightweight Problem Solving Framework
- Most reliable T1 (98.83%); very highly performing and most stable site in CCRC08
- Max CERN/T1-ASGC point-to-point inbound: 9.3 Gbps
Collaborating e-Infrastructures
- "Production" = reliable, sustainable, with commitments to quality of service
- TWGRID
- EUAsiaGrid
- Potential for linking ~80 countries
WLCG Computing Model - The Tier Structure
- Tier-0 (CERN): data recording, initial data reconstruction, data distribution
- Tier-1 (11 countries): permanent storage, re-processing, analysis
- Tier-2 (~130 centres): simulation, end-user analysis
EGEE'07, Budapest, 1-5 October 2007 (Enabling Grids for E-sciencE, EGEE-II INFSO-RI-031688)
Disciplines: Archeology, Astronomy, Astrophysics, Civil Protection, Comp. Chemistry, Earth Sciences, Finance, Fusion, Geophysics, High Energy Physics, Life Sciences, Multimedia, Material Sciences, ...
Why Petabyte? Challenges
Why Petabyte?
- Experiment computing models
- Comparison with conventional data management
Challenges
- Performance: LAN and WAN activities
  - Sufficient bandwidth to and from the CPU farm
  - Eliminate uplink bottlenecks (switch tiers)
- Fast response to critical events
- Fabric infrastructure & service-level agreements
- Scalability and manageability
  - Robust DB engine (Oracle RAC)
  - Knowledge base and adequate administration (training)
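The uplink-bottleneck point can be made concrete with a quick oversubscription calculation. A minimal sketch; the port counts and link speeds below are illustrative assumptions, not ASGC's actual figures:

```python
def oversubscription(edge_ports, edge_gbps, uplink_ports, uplink_gbps):
    """Ratio of aggregate edge bandwidth to aggregate uplink bandwidth.
    A ratio well above 1 means the uplink becomes the bottleneck when
    many disk servers or worker nodes transfer at line rate."""
    return (edge_ports * edge_gbps) / (uplink_ports * uplink_gbps)

# Hypothetical rack: 48 GbE-attached nodes behind 2 x 10G uplinks
print(oversubscription(48, 1, 2, 10))   # 2.4 -> uplink-limited
# Adding two more 10G uplinks brings the ratio close to non-blocking
print(oversubscription(48, 1, 4, 10))   # 1.2
```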
Tier Model and Data Management Components
WLCG Experiment Computing Model

ATLAS T1 Data Flow
[Diagram: flows between Tier-0, the T1 tape system, disk buffer and disk storage, other Tier-1s, and each Tier-2; per-stream rates below]

Stream   File size     Rate      Files/day  Bandwidth  Volume
RAW      1.6 GB/file   0.02 Hz   1.7K       32 MB/s    2.7 TB/day
ESD2     0.5 GB/file   0.02 Hz   1.7K       10 MB/s    0.8 TB/day
AOD2     10 MB/file    0.2 Hz    17K        2 MB/s     0.16 TB/day
AODm2    500 MB/file   0.004 Hz  0.34K      2 MB/s     0.16 TB/day

Aggregates and inter-site flows:
- From Tier-0 (RAW + ESD2 + AODm2): 0.044 Hz, 3.74K files/day, 44 MB/s, 3.66 TB/day
- Within the T1 CPU farm (RAW, ESD 2x, AODm 10x): 1 Hz, 85K files/day, 720 MB/s
- Exchange with other Tier-1s: ESD2 at 0.02 Hz, 10 MB/s, 0.8 TB/day; AODm2 at 0.036 Hz, 3.1K files/day, 18 MB/s, 1.44 TB/day
- To each Tier-2: ESD1 at 0.02 Hz, 10 MB/s, 0.8 TB/day; AODm1/AODm2 at 0.04 Hz, 3.4K files/day, 20 MB/s, 1.6 TB/day
Plus simulation and analysis data flows.
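The per-stream rates in the ATLAS data flow are internally consistent: bandwidth is file size times file-creation rate, and the daily figures follow from 86,400 seconds per day. A small sketch reproducing the RAW stream (decimal units, as on the slide):

```python
def stream_rates(file_size_gb, rate_hz):
    """Derive files/day, bandwidth (MB/s) and daily volume (TB/day)
    from a file size and an average file-creation rate."""
    files_per_day = rate_hz * 86_400
    mb_per_s = file_size_gb * 1_000 * rate_hz
    tb_per_day = mb_per_s * 86_400 / 1_000_000
    return files_per_day, mb_per_s, tb_per_day

# RAW: 1.6 GB/file at 0.02 Hz -> ~1.7K files/day, 32 MB/s, ~2.7 TB/day
files, mbps, tbday = stream_rates(1.6, 0.02)
print(f"{files:.0f} files/day, {mbps:.0f} MB/s, {tbday:.2f} TB/day")
```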
WLCG Tier-1 - Defined Minimum Levels of Service
- The defined response time refers to the maximum delay before taking action.
- Mean time to repair the service is also crucial, but it is covered indirectly through the required availability targets.
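Availability targets translate directly into a monthly downtime budget, which is what the response-time and repair-time requirements ultimately protect. A quick illustration using the 98.83% reliability figure reported for ASGC:

```python
def downtime_budget_hours(availability, period_hours=720):
    """Maximum accumulated downtime (hours) consistent with an
    availability target over a period (default: a 30-day month)."""
    return (1 - availability) * period_hours

# 98.83% availability leaves roughly 8.4 hours/month of downtime
print(round(downtime_budget_hours(0.9883), 1))
```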
WLCG MoU & ASGC Resource Level - Pledged Resources and Projection
[Chart: CPU (KSI2k) and disk/tape (TB) pledges vs. installed capacity, 2005-2010]

Year      CPU (HEP2k6)  Disk (PB)  Tape (PB)
End 2009  29.5K         2.6        2.4
MoU 2009  20K           3.0        3.0
MoU 2010  28K           3.5        3.5
Data Management System
CASTOR v1 (CERN Advanced STORage)
- Satisfactorily served tens of thousands of requests/day per TB of disk cache
- Limitations: 1M files in cache; tape movement API not flexible
CASTOR v2
- DB-centric architecture
- Scheduling feature
- GSI and Kerberos
- Resource management and resource handling
CASTOR Configurations - Current Infrastructure
- Shared core services serving ATLAS and CMS
- Services: Stager, NS, DLF, Repack, and LSF
- DB clusters: two clusters (SRM and NS); 5 service DBs split into the two clusters; 5 Oracle instances
- Total capacity: 0.63 PB and 0.7 PB for CMS and ATLAS respectively
- Current usage: 63% and 44% for CMS and ATLAS
CASTOR Configurations (cont.) - Disk Cache
Disk pools & servers
- Performance (IOPS): with 0.5 kB IO size, 76.4K read and 54K write; both decrease slightly (~9%) when the IO size is increased to 4 kB
- 80 disk servers (6 more online by the end of the 3rd week of Oct.)
- Total capacity: 1.67 PB (0.3 PB allocated dynamically)
- Current usage: 0.79 PB (~58% usage)
- 14 disk pools (8 for ATLAS, 3 for CMS, and another three for bio, SAM, and dynamic allocation)
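The ~58% figure is consistent with measuring usage against only the statically allocated capacity (total minus the 0.3 PB dynamic share). A one-line check under that assumption:

```python
total_pb, dynamic_pb, used_pb = 1.67, 0.30, 0.79

usage_vs_total = used_pb / total_pb                   # ~47% against the raw total
usage_vs_static = used_pb / (total_pb - dynamic_pb)   # ~58%, matching the slide
print(f"{usage_vs_total:.0%} vs {usage_vs_static:.0%}")
```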
Disk Pool Configuration - T1 MSS (CASTOR)
[Chart: installed vs. free capacity (TB) and number of disk servers per pool: atlasGROUPDISK, atlasHotDisk, atlasMCDISK, atlasMCTAPE, atlasPrdD0T1, atlasPrdD1T0, atlasScratchDisk, atlasStage, biomedD1T0, cmsLTD0T1, cmsPrdD1T0, cmsWANOUT, dteamD0T0, Standby]
Distribution of Free Capacity - Per Disk Server vs. per Pool
[Chart: free capacity (TB) per disk pool, 0-250 TB scale, same pools as above]
Storage Server Generation - Drives vs. Total Capacity
[Chart: total capacity (TB) per storage generation vs. number of RAID subsystems]
CASTOR Configurations (cont.) - Core Service Overview

Service Type  OS Level         Release   Remark
Core          SLC 4.7/x86-64   2.1.7-19  Stager/NS/DLF
SRM           SLC 4.7/x86-64   2.7-18    3 head nodes
Disk Svr.     SLC 4.7/x86-64   2.1.7-19  80 in Q3 2009 (20+ in Q4)
Tape Svr.     SLC 4.7/32 + 64  2.1.8-8   x86-64 OS deployed
CASTOR Configurations (cont.) - CMS Disk Cache: Current Resource Level

Space Token / Disk Pool  Capacity/Job Limit  Disk Servers  Tape Pool/Capacity
cmsLTD0T1                278 TB/488          9             *
cmsPrdD1T0               284 TB/1560         13            -
cmsWanOut                72 TB/220           4             -

* Depends on tape family.
CASTOR Configurations (cont.) - ATLAS Disk Cache: Current Resource Level

Space Token       Cap/Job Limit  Disk Servers  Tape Pool/Cap.
atlasMCDISK       163 TB/790     8             -
atlasMCTAPE       38 TB/80       2             atlasMCtp/39 TB
atlasPrdD1T0      278 TB/810     15            -
atlasPrdD0T1      61 TB/210      3             atlasPrdtp/105 TB
atlasScratchDisk  28 TB/80       1             -
atlasHotDisk      2 TB/40        2             -
atlasGROUPDISK    19 TB/40       1             -
Total             950 TB/1835    46            -
IDC Collocation
- Facility installation completed on Mar 27th
- Tape system delayed until after Apr 9th: realignment, RMA for faulty parts
Storage Farm
- ~110 RAID subsystems deployed since 2003
- Supporting both the Tier-1 and Tier-2 storage fabric
- DAS connections to front-end blade servers
- Flexible switching of front-end servers based on performance requirements
- 4-8 Gb Fibre Channel connectivity
CASTOR Configurations (cont.) - Tape Pools

Tape Pool       Capacity (TB)/Usage  Drive Dedication  LTO3/4 Mixed
atlasMCtp       8.98/40%             N                 Y
atlasPrdtp      101/65%              N                 Y
cmsCSA08cruzet  15.6/46%             N                 N
cmsCSA08reco    5/0%                 N                 N
cmsCSAtp        639/99%              N                 Y
cmsLTtp         34.4/44%             N                 N
dteamTest       3.5/1%               N                 N
MSS Monitoring Services
- Standard Nagios probes: NRPE + customized plugins
- SMS to OSE/SM for all types of critical alarms
- Availability metrics
- Tape metrics (SLS): throughput, capacity & scheduler state per VO and disk pool
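A minimal sketch of the kind of customized NRPE plugin mentioned above: it compares a tape-throughput reading against warning/critical thresholds and returns the standard Nagios exit codes. The metric source and threshold values are hypothetical, not ASGC's actual SLA numbers:

```python
import sys

OK, WARNING, CRITICAL = 0, 1, 2   # standard Nagios plugin exit codes

def check_tape_throughput(mb_per_s, warn=50.0, crit=20.0):
    """Return (exit_code, status line) for a tape-throughput check.
    Throughput below `crit` is CRITICAL, below `warn` is WARNING;
    the thresholds are illustrative defaults."""
    perf = f"throughput={mb_per_s:.1f}MB/s;{warn};{crit}"
    if mb_per_s < crit:
        return CRITICAL, f"TAPE CRITICAL - {perf}"
    if mb_per_s < warn:
        return WARNING, f"TAPE WARNING - {perf}"
    return OK, f"TAPE OK - {perf}"

# NRPE would invoke this script with the measured throughput as an argument
if __name__ == "__main__" and len(sys.argv) > 1:
    code, line = check_tape_throughput(float(sys.argv[1]))
    print(line)
    sys.exit(code)
```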
MSS Tape System - Expansion/Upgrade Planning
- Before the incident: 8 LTO3 + 4 LTO4 drives; 720 TB with LTO3, 530 TB with LTO4
- May 2009: two LTO3 drives; MES of 6 LTO4 drives at the end of May; capacity: 1.3 PB (old, LTO3/4 mixed) + 0.8 PB (LTO4)
- New S54 model introduced mid-2009: 2K slots with the tier model; requires an ALMS upgrade and an enhanced gripper
- MES Q3 2009: 18 LTO4 drives; HA implementation resumes in Q4
Expansion Planning
2008
- 0.5 PB expansion of the tape system in Q2
- Met the MoU target in mid-Nov.
- 1.3 MSI2k per rack based on the recent E5450 processor
2009 Q1
- 150 SMP/QC blade servers
- RAID subsystems considering 2 TB per drive: 42 TB net capacity per chassis, 0.75 PB in total
2009 Q3-4
- 18 LTO4 drives - mid-Oct.
- 330 Xeon QC (SMP, Intel 5450) blade servers
- 2nd phase tape MES: 5 LTO4 drives + HA; 3rd phase tape MES: 6 LTO4 drives
- ETA for the 0.8 PB expansion delivery: mid-Nov.
Computing/Storage System Infrastructure
[Diagram: data center network and rack layout; recoverable components listed below]
- CASTOR2 disk servers and tape servers
- Core services: CE, RB, DPM, PX, BDII, etc.
- 4 x GE (SX) uplinks to the ASGC distribution switch in Rack #49 (links to Tier-1 servers)
- Blade centers: 64 x IBM HS20 blades (WN), 142 x IBM HS21 blades (WN), 20 x Quanta blades (WN)
- DC power: SMR 48V/100A with battery banks #1-#4
- 2 x GE (LX) to the 4F M160 (links to HK and JP Tier-2s)
- 2 x GE (LX) to the 4F Taipei GigaPoP 7609 (links to TW Tier-2s)

Data Center - C3 Archive Room
[Photo: ASGC CASTOR2 disk farm]
Throughput of the WLCG Experiments
- Throughput defined as job efficiency x number of running jobs
- Characteristics of the 4 LHC experiments show that the inefficiency is due to poor coding
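The throughput metric above is simply efficiency times occupancy, which is why code inefficiency costs as much as missing hardware. A trivial sketch with illustrative numbers (the 50% efficiency is hypothetical, not a measured value):

```python
def experiment_throughput(job_efficiency, running_jobs):
    """Effective throughput in 'useful job slots': the fraction of CPU
    time spent on useful work times the number of occupied slots."""
    return job_efficiency * running_jobs

# 10,000 running jobs at 50% efficiency deliver only 5,000 useful slots,
# so halving the inefficiency is worth as much as doubling the farm
print(experiment_throughput(0.5, 10_000))  # 5000.0
```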
Reliability from Different Perspectives
Summary
- Deployed a highly scalable data management system and a performance-driven storage infrastructure
  - Eliminating possible complexity of the SRM abstraction layer
  - Resource utilization, provisioning and optimization
- From proof of concept to production, the challenges remain: Data Challenge, Service Challenge, CCRC08, STEP09, etc.
- The motivation appears clear for medical, climate, and cosmological applications
- Operation-wide:
  - Robust database setup
  - Knowledge base for fabric infrastructure operations
  - Fast enough event processing and documentation
- Consider use cases beyond WLCG data management: commonality with many other disciplines in the EGEE infrastructure; actively participate in e-Science collaboration within the region