The ALICE Grid: The Beat of a Different Drum
L. Betev, P. Buncic, A. Peters, P. Saiz, S. Bagnasco, P. Mendez-Lorenzo, C. Cistoiu, C. Grigoras
Presented by F. Carminati
April 23, 2007, ACAT - Amsterdam
23/04/07 fca @ ACAT07
Trigger levels and data rates:
• Level 0 (special hardware): 8 kHz (160 GB/s)
• Level 1 (embedded processors)
• Level 2 (PCs): 200 Hz (4 GB/s)
• 30 Hz (2.5 GB/s)
• 30 Hz (1.25 GB/s): data recording & offline analysis

Detector: total weight 10,000 t, overall diameter 16.00 m, overall length 25 m, magnetic field 0.4 T

ALICE Collaboration: ~1/2 of ATLAS or CMS, ~2x LHCb; ~1000 people, 30 countries, ~80 institutes
The ALICE Grid (AliEn)

Timeline (2001-2007): Start -> first production (distributed simulation) -> Physics Performance Report (mixing & reconstruction), 10% Data Challenge -> WLCG integration, 20% Data Challenge

Evolving goals along the way: Functionality (Simulation) -> + Interoperability (Reconstruction) -> + Performance, Scalability, Standards (Analysis)

There are millions of lines of open source code dealing with Grid issues. Why not use them to build the minimal Grid that does the job?
• Fast development of a prototype, possible to restart from scratch, etc.
• Hundreds of users and developers
• Immediate adoption of emerging standards
• AliEn by ALICE (5% of code developed, 95% imported)
Middleware Services in AliEn (mapped onto the gLite middleware services)

• GAS: Grid Access Service
• WM: Workload Management
• DM: Data Management
• RB: Resource Broker
• TQ: Task Queue
• FPS: File Placement Service
• FTQ: File Transfer Queue
• PM: Package Manager
• ACE: AliEn CE (pull)
• FC: File Catalogue
• JW: Job Wrapper
• JA: Job Agent
• LRC: Local Replica Catalogue
• LJC: Local Job Catalogue
• SE: Storage Element
• CE: Computing Element
• SRM: Storage Resource Manager
• CR: Computing Resource (LSF, PBS, ...)
Evolution of the middleware stack:
• EDG + AliEn: experiment-specific services
• LCG: AliEn architecture + LCG code
• EGEE: experiment-specific services (AliEn for ALICE)
• Next: EGEE, ARC, OSG, ...
Design criteria

• Minimize intrusiveness
  – Limit the impact on the host computing centres
• Use delegation
  – Where possible acquire the "capability" to perform an operation; no need to verify the operation mode at each step
• Centralise information
  – Minimise the need to "synchronise" information sources
• Decentralise decisions
  – Minimise interactions and avoid bottlenecks
• Virtualise resources
• Automate operations
• Provide extensive monitoring
Job submission in LCG

The Optimizer splits the jobs held in the ALICE Job Catalogue according to their input files:
  Job 1 (lfn1, lfn2, lfn3, lfn4) -> Job 1.1 (lfn1), Job 1.2 (lfn2), Job 1.3 (lfn3, lfn4)
  Job 2 (lfn1, lfn2, lfn3, lfn4) -> Job 2.1 (lfn1, lfn3), Job 2.2 (lfn2, lfn4)
  Job 3 (lfn1, lfn2, lfn3)       -> Job 3.1 (lfn1, lfn3), Job 3.2 (lfn2)

Flow between the ALICE central services and the site:
• The user submits a job to the ALICE Job Catalogue.
• After matchmaking on close SEs and available software, the Computing Agent on the site VO-box (which also runs packman) submits a job agent through the LCG RB to the site CE, which executes the agent on a WN.
• The job agent checks its environment: if not OK, it dies with grace; if OK, it asks for and receives a workload from the Task Queue.
• The user job retrieves its workload, runs, sends the job result, updates the TQ and registers its output in the ALICE File Catalogue (lfn -> guid -> {SEs}).
VO-box monitoring

• The status of the VO-box, ALICE and WLCG services is monitored through MonALISA (ML)
• Sites are encouraged to check the status through these pages
• An alarm system has been established
• Standard SAM tests checking the availability of LCG services are incorporated in the VO-box
  – Available to Grid Support and ALICE (via ML)
Job submission

• Minimize intrusiveness
  – Job submission uses the existing Grid MW where possible, or goes directly to the CE otherwise
• Centralise information
  – Jobs are held in a single central queue handling priorities and quotas
• Decentralise decisions
  – Sites decide which jobs to "pull"
• Virtualise resources
  – Job agents provide a standard environment (job wrapper) across different systems
• Automate operations
• Provide extensive monitoring
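The pull model above can be sketched in a few lines: a single central queue holds the jobs with their requirements, and agents at the sites ask for work that matches what they can offer. `TaskQueue` and `JobAgent` are illustrative names, not the real AliEn interfaces.

```python
# Minimal sketch of the AliEn pull model: a central task queue holds all
# jobs; job agents running on the worker nodes pull work that matches
# their local capabilities, instead of a broker pushing jobs to sites.
# Names and fields are illustrative, not the real AliEn classes.

class TaskQueue:
    """Single central queue holding jobs with their requirements."""
    def __init__(self):
        self.jobs = []                   # list of (job_id, requirements)

    def submit(self, job_id, requirements):
        self.jobs.append((job_id, set(requirements)))

    def pull(self, site_capabilities):
        """Hand out the first job whose requirements the site satisfies."""
        for i, (job_id, req) in enumerate(self.jobs):
            if req.issubset(site_capabilities):
                return self.jobs.pop(i)[0]
        return None                      # nothing matches

class JobAgent:
    """Runs on a WN; provides a standard environment and pulls work."""
    def __init__(self, tq, capabilities):
        self.tq = tq
        self.capabilities = set(capabilities)

    def run_once(self):
        job_id = self.tq.pull(self.capabilities)
        if job_id is None:
            return "died with grace"     # no workload for this site
        return f"ran job {job_id}"

tq = TaskQueue()
tq.submit("job-1", {"AliRoot", "close-SE"})
tq.submit("job-2", {"AliRoot"})

agent = JobAgent(tq, {"AliRoot"})        # a site without a close SE
print(agent.run_once())                  # job-2 matches, job-1 does not
```

The matchmaking lives entirely in the central queue; the site only advertises what it has, which is what keeps the sites free to decide when to pull.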
The AliEn FC

• Hierarchical structure (like a UNIX file system); designed in 2001
  – Provides the mapping from LFN to PFN
  – Built on top of several distributed databases; another database can be added at any time
  – Directories can be moved to another table, transparently for the end user
  – Metadata catalogue on the LFN; triggers; GUID-to-PFN mapping in the central catalogue
• No "local catalogue"
  – Automatic PFN construction is possible: store only the GUID and the storage index, and the SE builds the PFN from the GUID
  – Two independent catalogues, LFN->GUID and GUID->PFN: databases can be added to either, and the LFN->GUID mapping could be dropped if no longer used
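The two-catalogue split above can be sketched as follows; the dictionary layout, the `resolve` helper and the PFN construction rule are illustrative assumptions, not the actual AliEn schema.

```python
# Sketch of the two independent AliEn catalogues: LFN -> GUID and
# GUID -> PFN, including the "automatic PFN" mode in which only the GUID
# and a storage index are kept and the SE derives the PFN itself.
# All names and the PFN layout are illustrative.

lfn_to_guid = {
    "/alice/user/p/psaiz/data.root": "6fa6e5e2-0001",
}

# Either an explicit PFN list, or just the storage element name (the SE
# then builds the PFN deterministically from the GUID).
guid_catalogue = {
    "6fa6e5e2-0001": {"se": "ALICE::CERN::SE", "pfns": None},
}

def se_build_pfn(se, guid):
    """Deterministic PFN construction: the SE derives the path from the GUID."""
    return f"root://{se.lower().replace('::', '.')}/{guid[:2]}/{guid}"

def resolve(lfn):
    guid = lfn_to_guid[lfn]               # catalogue 1: LFN -> GUID
    entry = guid_catalogue[guid]          # catalogue 2: GUID -> PFN
    if entry["pfns"] is None:             # automatic PFN construction
        return [se_build_pfn(entry["se"], guid)]
    return entry["pfns"]

print(resolve("/alice/user/p/psaiz/data.root"))
```

Because the two mappings never share a key space, each side can be sharded onto new databases independently, which is exactly what makes the "drop LFN->GUID if unused" option cheap.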
Benchmarks

Reading (times in seconds):

                     AliEn v2-12          AliEn v2-13
                     no cache   cache     no cache   cache
  List LFN           23         2.8       20         2
  List LFN (10)      23         1.5       20         1
  LFN -> GUID        24         3         20         2.5
  LFN -> PFN         106*       30*       70         5.5
  GUID -> PFN        143*       51*       52         2

• Tests done on a dual Pentium CPU at 3.4 GHz with 3.2 GB RAM
• DB, writers, readers and SOAP servers running on the same machine

Insertion:
• Users register their files in their home directories
• PackMan: definition of the packages (VO & user)
• Production user: registers data
• AliEn TaskQueue: registers the output of the jobs
Other features

• Size
  – LFN tables: 130 bytes/entry
  – GUID: 300 bytes/entry (InnoDB), 210 (MyISAM), 120 (no PFN)
  – Binary log files: 1000 bytes/entry! (needed for database replication; automatically cleaned by MySQL)
  – The current database could contain 7.5 billion entries!
• Two QoS classes for SEs
  – Custodial: the file has a low probability of disappearing
  – Replica: the file has a high probability of disappearing
  – The user specifies the QoS when registering a file
  – Still to do: quotas
• Entries in the LFN catalogue can have an expiration time
  – The entry will disappear regardless of the QoS of the SE and is removed from storage
  – A GUID not referenced by any LFN will also disappear
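The two expiration rules above amount to a small garbage-collection pass over the catalogues; the structures and the `garbage_collect` helper below are an illustrative sketch, not the AliEn implementation.

```python
# Sketch of the expiration rules: an LFN entry whose lifetime has passed
# is dropped regardless of the SE QoS, and any GUID that is no longer
# referenced by an LFN is dropped too (its file removed from storage).
# Structures and names are illustrative.
import time

lfn_catalogue = {
    "/alice/tmp/a.root": {"guid": "g-1", "expires": time.time() - 10},  # expired
    "/alice/sim/b.root": {"guid": "g-2", "expires": None},              # permanent
}
guid_catalogue = {"g-1": "pfn-1", "g-2": "pfn-2"}

def garbage_collect(now=None):
    now = time.time() if now is None else now
    # 1. drop expired LFN entries, whatever the QoS of their SE
    for lfn in [l for l, e in lfn_catalogue.items()
                if e["expires"] is not None and e["expires"] < now]:
        del lfn_catalogue[lfn]
    # 2. drop GUIDs that no LFN references any more
    referenced = {e["guid"] for e in lfn_catalogue.values()}
    for guid in [g for g in guid_catalogue if g not in referenced]:
        del guid_catalogue[guid]          # file also removed from storage

garbage_collect()
print(sorted(guid_catalogue))             # only g-2 survives
```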
File Catalogue v2-13

• LFN Catalogue: the LFN->GUID index is partitioned by directory
  – /, /alice, /alice/user/p/psaiz, /alice/simulation/2006, ...
• GUID Catalogue: the GUID->PFN index is partitioned by GUID creation time
  – 1-JAN-1970, 1-JAN-2006, 14-FEB-2007, 23-AUG-2008, ...
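The two partitioning schemes above can be sketched as table routers: longest-prefix match on the directory for LFNs, and a date range for GUIDs. Table names and dates are illustrative.

```python
# Sketch of the v2-13 split indexes: the LFN catalogue is partitioned by
# directory prefix, the GUID catalogue by GUID creation time, so either
# side can grow by adding databases independently. Illustrative only.
from datetime import datetime
import bisect

lfn_index = [           # longest matching prefix picks the LFN table
    ("/", "lfn_table_root"),
    ("/alice", "lfn_table_alice"),
    ("/alice/user/p/psaiz", "lfn_table_psaiz"),
    ("/alice/simulation/2006", "lfn_table_sim06"),
]

guid_index = [          # each table holds GUIDs created on or after its date
    (datetime(1970, 1, 1), "guid_table_0"),
    (datetime(2006, 1, 1), "guid_table_1"),
    (datetime(2007, 2, 14), "guid_table_2"),
    (datetime(2008, 8, 23), "guid_table_3"),
]

def lfn_table(lfn):
    best = max((p for p, _ in lfn_index if lfn.startswith(p)), key=len)
    return dict(lfn_index)[best]

def guid_table(created):
    dates = [d for d, _ in guid_index]
    return guid_index[bisect.bisect_right(dates, created) - 1][1]

print(lfn_table("/alice/user/p/psaiz/data.root"))   # lfn_table_psaiz
print(guid_table(datetime(2006, 6, 1)))             # guid_table_1
```

Partitioning GUIDs by creation time works because GUIDs are only ever created once: new entries always land in the newest table, so old tables can be archived or moved without rebalancing.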
Storage strategy

A WN talks to the VOBOX::SA and to an xrootd manager, which redirects to xrootd workers in front of the site storage:

• xrootd (worker) on plain disk: available
• xrootd (worker) on DPM (+ SRM): being deployed
• xrootd (worker) on CASTOR (+ SRM, MSS): prototype being validated
• xrootd emulation (worker) on dCache (+ SRM, MSS): being deployed

DPM, CASTOR and dCache are LCG-developed SEs.
Xrootd architecture

• The client asks the redirector (head node) to open file X
• The redirector asks the data servers of the cluster (A, B, C): "Who has file X?"
• Server C answers "I have"; the redirector replies "go to C", and the client opens file X on C
• Redirectors cache the file location: a second open of X is redirected to C immediately
• The client sees all servers as xrootd data servers
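The redirection steps above can be sketched as a single class; the `Redirector` name, its storage layout and the "go to C" strings are illustrative, not the xrootd protocol messages.

```python
# Sketch of the redirection flow: the client asks the redirector to open
# file X; the redirector polls its data servers ("Who has file X?"),
# caches the answer, and redirects the client ("go to C"). A second open
# of the same file is answered from the cache without polling.

class Redirector:
    def __init__(self, data_servers):
        self.data_servers = data_servers     # {server name: set of files}
        self.location_cache = {}             # file -> server name

    def open(self, filename):
        if filename in self.location_cache:  # 2nd open: cache hit
            return f"go to {self.location_cache[filename]}"
        for name, files in self.data_servers.items():  # "Who has file X?"
            if filename in files:                      # "I have"
                self.location_cache[filename] = name
                return f"go to {name}"
        raise FileNotFoundError(filename)

cluster = Redirector({"A": {"y"}, "B": set(), "C": {"x"}})
print(cluster.open("x"))    # polls the servers, answers "go to C"
print(cluster.open("x"))    # answered from the location cache
```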
xrootd serving several VOs

• The client holds a private key and a proxy in its security environment; the xrootd server holds the public key and its own proxy
• GSI authentication between client and xrootd server
• Catalogue authentication against the ALICE catalogue
Tag architecture

• Reconstruction produces, for each event (ev#guid), a set of tags (tag1, tag2, tag3, ...)
• An index builder turns the tags into a bitmap index
• An analysis selection queries the index and returns a list of ev#guid's
• The list is regrouped per file as guid#{ev1 ... evn} and dispatched:
  – Interactive: GRID/PROOF partitions #1 ... #N (when available)
  – Batch: jobs #1 ... #N
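The tag-selection chain above can be sketched with plain integers as bit vectors; the event list, tag names and `select` helper are illustrative, not the actual ALICE tag database.

```python
# Sketch of the tag/bitmap selection: reconstruction attaches tags to each
# event (keyed by ev#guid); an index builder turns each tag into a bitmap
# over the event list, so a selection is a bitwise AND; the result is
# regrouped per GUID for the analysis jobs. Illustrative names and data.

events = [  # (guid, event number, tags) as produced by reconstruction
    ("g1", 1, {"central"}), ("g1", 2, {"central", "muon"}),
    ("g2", 1, {"muon"}),    ("g2", 2, {"central", "muon"}),
]

# index builder: one bitmap (an int used as a bit vector) per tag
bitmaps = {}
for i, (_, _, tags) in enumerate(events):
    for tag in tags:
        bitmaps[tag] = bitmaps.get(tag, 0) | (1 << i)

def select(*tags):
    """AND the bitmaps, then regroup the surviving events per GUID."""
    hits = ~0
    for tag in tags:
        hits &= bitmaps.get(tag, 0)
    per_guid = {}
    for i, (guid, ev, _) in enumerate(events):
        if hits & (1 << i):
            per_guid.setdefault(guid, []).append(ev)
    return per_guid          # guid -> [ev1 ... evn]

print(select("central", "muon"))    # {'g1': [2], 'g2': [2]}
```

The AND over bitmaps is what makes multi-tag selections cheap: the per-event loop only runs once, at regrouping time, over the events that already passed.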
How to select data

• A dataset list is created via queries to the metadata (key/value pairs): run, file and tag metadata
• Run metadata
  – Stored as (directory) metadata in the File Catalogue
  – Contains parameters describing the conditions during the run
• File metadata
  – No physics information
  – Sanity, permissions & location of files
Distributed analysis

• A user job over many events selects its data set (ESDs, AODs) through a File Catalogue query
• The Job Optimizer splits it into sub-jobs 1 ... n, grouped by the SE location of the files
• The Job Broker submits each sub-job to the CE with the closest SE; each sub-job performs its CE and SE processing and produces an output file
• A file merging job combines output files 1 ... n into the final job output
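The optimizer step above is essentially a group-by on file location; the `split_by_se` helper and the SE names below are illustrative assumptions, not the AliEn optimizer itself.

```python
# Sketch of the job optimizer step: a user job over many input files is
# split into sub-jobs grouped by the SE holding each file, so every
# sub-job can be brokered to the CE closest to its SE. Illustrative.
from collections import defaultdict

dataset = {  # LFN -> storage element holding it (from the File Catalogue)
    "lfn1": "ALICE::CNAF::SE",
    "lfn2": "ALICE::CERN::SE",
    "lfn3": "ALICE::CNAF::SE",
    "lfn4": "ALICE::CERN::SE",
}

def split_by_se(dataset):
    groups = defaultdict(list)
    for lfn, se in sorted(dataset.items()):
        groups[se].append(lfn)
    # one sub-job per SE, to be submitted to the CE closest to that SE
    return [{"se": se, "inputs": lfns} for se, lfns in sorted(groups.items())]

for sub_job in split_by_se(dataset):
    print(sub_job)
```

Grouping by SE rather than splitting evenly is the design choice that moves the jobs to the data instead of the data to the jobs.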
Grid data challenge - PDC'06

• The longest-running Data Challenge in ALICE
  – A comprehensive test of the ALICE Computing model
  – Running for 9 months non-stop: approaching the data-taking regime of operation
  – Participating: 55 computing centres on 4 continents (6 Tier 1s, 49 Tier 2s)
• 7 MSI2k hours: 1500 CPUs running continuously
• 685K Grid jobs in total: 530K production, 53K DAQ, 102K user!
• 40M events, 0.5 PB generated, reconstructed and stored
• User analysis ongoing
• CPU share: 43% T1s, 57% T2s; T1 sites: CNAF, CCIN2P3, GridKa, RAL, SARA
• FTS tests T0 -> T1, Sep-Dec: design goal of 300 MB/s reached but not maintained; 0.7 PB of DAQ data registered
Monitoring, monitoring, monitoring...

http://pcalimonitor.cern.ch:8889/

• At each LCG site, ApMon sensors instrument the AliEn Job Agents, the AliEn CE and SE and the Cluster Monitor, reporting to a MonALISA service at the site
• At CERN, ApMon sensors additionally cover the AliEn TQ, IS, Optimizers and Brokers, the MySQL servers, the CastorGrid scripts and the API services, reporting to MonALISA @CERN
• LCG tools are monitored as well
• All MonALISA services feed the MonaLisa Repository, which keeps a long-history DB of the aggregated data
Aggregated parameters: rss, vsz, cpu time, run time, job slots, free space, nr. of files, open files, queued JobAgents, cpu ksi2k, job status, disk used, processes, load, netIn/netOut, jobs status, sockets, migrated MBytes, active sessions, MyProxy status
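The sensor side of this chain boils down to small, fire-and-forget datagrams. The sketch below is a simplified stand-in: the real ApMon sends XDR-encoded UDP packets, while this example uses JSON; the host, port and field names are assumptions.

```python
# Sketch of a monitoring sensor: each service sends its parameters (rss,
# cpu time, job slots, ...) as small UDP datagrams to a collector, which
# aggregates them into the repository. The real ApMon wire format is
# XDR-encoded; JSON is used here only for readability.
import json
import socket

def send_parameters(cluster, node, params, host="127.0.0.1", port=8884):
    """Fire-and-forget UDP send: monitoring must never block the job."""
    datagram = json.dumps({"cluster": cluster, "node": node, "params": params})
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(datagram.encode(), (host, port))
    sock.close()
    return datagram

sent = send_parameters("AliEn_JobAgent", "wn042.cern.ch",
                       {"rss": 181.4, "cpu_time": 9123, "job_slots": 2})
print(sent)
```

UDP is the natural transport here: losing an occasional sample is acceptable, whereas stalling a Grid job on a slow monitoring server is not.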
Back to the future...

• Once upon a time... (IBM VM-360 mainframe, 1988)
  – Static linking, running in a VM: perfect isolation!
• Then things changed
  – Unix, PCs, commodity computing, shared libraries, dynamic linking, plugins
  – Fuzzy application boundary!
• But now...
  – Memory and disk space are cheap
  – Virtual machines running on commodity hardware on an Open Source OS promise to deliver what we lost some time ago
• Why?
  – The infrastructure can evolve independently from the application
  – We can now start, stop, pause and migrate VMs
  – Software running inside a VM cannot affect the environment
  – Perfect process and file sandboxing
  – (Re)use of a lot of code which was previously in the system/kernel domain
Virtual Appliances

• Virtual Software Appliance = Application + Virtual Machine + simple UI, combining
  – A minimal operating environment
  – Specialized application functionality
• Designed to run under various virtualization technologies
  – VMware, Xen, Parallels, Microsoft Virtual PC, QEMU, User Mode Linux, coLinux, Virtual Iron, ...
• Alleviates the pains of deployment in a traditional server environment
  – Complex configuration
  – Maintenance
• Example: rPath, a software appliance company
Practical exercise: AliEn Appliance

Kernel + system devices + busybox (system tools) + ggbox + AliEn + external dependencies = Grid Appliance
AliEnX

• AliEn Linux: a minimal guest OS capable of running AliEn services and hosting Grid applications
  – http://alien.cern.ch/twiki/bin/view/AliEnX
  – http://alien.rpath.org
• Built using rPath tools (rBuilder and the Conary package manager)
• AliEn Appliance version 0.4
  – x86 mountable filesystem (Xen virtual appliance)
  – x86_64 mountable filesystem (Xen virtual appliance)
  – x86 VMware(R) ESX Server virtual appliance
  – x86 installable CD/DVD
  – x86_64 Parallels, QEMU (raw hard disk)
  – x86 Parallels, QEMU (raw hard disk)
• Already usable as a User Interface
  – Generic, can be customized for other purposes
  – To do: run Grid jobs in the VM
• Benchmark (3 GHz Pentium D, 1 GB RAM, AliRoot):

         Xen 3.0.3   Native
  Simu   193 s       191.5 s
  Reco   52 s        51 s
Use cases for Virtual Machines?

• Grid
  – Sandbox environment for job execution on the WN
  – Enhanced site security
• VO box
  – Enhanced scalability
• User interfaces
  – Separation of Grid and system environment
  – Lowering the Grid initiation threshold
• Specialized environments (PROOF/CAF)
  – Process migration
  – Kernel modules enabling fancy user-space file systems
  – P2P-like object sharing and caching
• Training setups
  – Make sure that everyone has the same environment when they walk into the training room
• Testing environments
  – Easy to set up, saving time and money
A cloud over the Grid?
http://www.rpath.com/corp/amazon.html
Conclusions

• AliEn has allowed ALICE to exploit its distributed computing resources, achieving different, potentially contradictory objectives
  – Maximum use of the existing Grid MW
  – A stable and uniform environment for processing and analysing ALICE data
  – A lean environment for developing and testing new technologies
• The AliEn MW has been tested in production, and we are confident it provides a solid framework for ALICE computing
• A promising area that we are now exploring with AliEn is VMs
  – Coming back as a viable technology
  – Potential benefits for users and resource providers
  – Technology and business model are catching up fast
  – They may not solve all our problems, but they can make solutions faster and easier