Report on HEPiX Meeting Spring ‘10 · 2010. 6. 3.
A short personal view
Thomas Finnern (DESY/IT)
HEPiX Spring 2010
Lisbon, Portugal
Thomas Finnern | Report on HEPiX Meeting Spring ‘10 | Page 2
"Less ash for your cloud"
> Most Americans in Lisbon
> Most Europeans on EVO
  - ~60-65 connections on average
What is HEPiX?
> A global organisation since 1991
> Unites IT system support staff, including system administrators, system engineers, and managers from the High Energy Physics (HEP) and Nuclear Physics laboratories and institutes
> BNL, CERN, DESY, FNAL, IN2P3, INFN, JLAB, NIKHEF, RAL, SLAC, TRIUMF and many others
> Semi-annual meetings are an excellent source of information for IT specialists in scientific computing
> http://www.hepix.org
HEPiX Meeting Event Outline
> Names and Numbers
  - LIP: Laboratório de Instrumentação e Física Experimental de Partículas
  - 105 participants (6 from DESY): 90 from European countries, 15 from other countries
  - Monday to Friday
> Sky crisis due to the Icelandic volcano Eyjafjallajökull [‘ɛɪja,fjatl̥a,jœkʏtl̥]
> Daily Keynotes
  - Convener: Borges, Goncalo
  - Portuguese infrastructure: Ibergrid, Grid Europe – South America
  - CPU-GPU clusters
  - Management information systems
> Site Reports
  - Convener: Michel Jouvin (LAL/IN2P3/GRIF)
> Technical Topics (Tracks)
  - Virtualisation: Convener Tony Cass (CERN)
  - Operating Systems & Applications: Convener Sandy Philpott (JLAB)
  - Monitoring & Infrastructure tools: Convener Helge Meinhard (CERN)
  - Storage and Filesystems: Convener Andrei Maslennikov (CASPUR)
  - Grid and WLCG: Convener Goncalo Borges (LIP)
  - Security & Networking: Convener Dr. David Kelsey (RAL)
  - Benchmarking
> Virtualization Working Group F2F
  - See: Extra Report
> DESY Talks (in order of appearance)
  - Evaluation of NFS v4.1 (pNFS) with dCache (FUHRMANN, Patrick)
  - Building up a high performance data centre with commodity hardware (HAUPT, Andreas)
  - DESY site report (FRIEBEL, Wolfgang)
  - Virtual Network and Web Services (An Update) (FINNERN, Thomas)
Site Reports (Michel Jouvin)
> RAL Site Report: BLY, Martin
  - Admin & Science Support, SCARF (HPC cluster), UPS, chillers & pumps, 2.25 MW, batch 3 GB/core
> BNL RHIC/ATLAS Computing Facility Site Report: HOLLOWELL, Christopher
  - Near NY, 100 Thumper/Thor (Solaris 10), 250 SL infrastructure servers
> CERN site report: Dr. MEINHARD, Helge
  - LHC status
  - CERN IT reorganisation
  - ITIL implementation ongoing, 100 kW missing, big procurement, Windows 7
  - Solaris and SPARC phased out
> LAL/GRIF Site Report: JOUVIN, Michel
> Jefferson Lab Site Report: PHILPOTT, Sandy
  - IPv6
  - CEBAF Upgrade Project
  - HPC InfiniBand clusters
  - HPC GPU cluster: 63 nodes, 200 GPUs
  - 200 TB Lustre 1.8.2
  - Fedora 32-bit -> CentOS 5.3 64-bit
> Site report from PDSF: SRINIVASAN, Jay(?)
> KIT Site Report: ALEF, Manfred(?)
> CSC Site Report: HAKALA, Tero
> PIC Site Report: MARTINEZ RAMIREZ, Francisco
> SLAC Site Report: MELEN, Randy
  - Change from HEP to photon science: process, + enterprise architect, …
  - 8200 batch cores (LSF), subcluster
  - InfiniBand, GPU (CUDA, OpenCL)
> DESY site report: FRIEBEL, Wolfgang
  - 1a: New directors, buildings
> PSI Site Report: Dr. FEICHTINGER, Derek
  - Bell
> GSI site report: Mr. HUHN, Christopher
> Petersburg Nuclear Physics Institute (PNPI) status report: SHEVEL, Andrey
  - Small site -> small cluster -> small support? -> cloud gateway at T3 level?
> Fermilab Site Report: Dr. CHADWICK, Keith
  - FermiCloud system in GCC (IaaS), procurement now
  - Nehalem with hyperthreading
  - ITIL transition, e.g. change management
  - Power outage at Feynman Center: downtime 2-4 hours
  - + 12 d: second outage due to a 1400 A 3-phase power breaker
  - -> running at 75 % … +++
  - Lessons learned: trust, communication, HA -> UA, network rescue
> The ATLAS Great Lakes Tier-2: Status and Plans: Dr. MCKEE, Shawn
> The Portuguese WLCG Tier2: status and issues: DAVID, Mario
> INFN Tier1 site report: SAPUNENKO, Vladimir
  - 1400 -> 2000 virtual machines
  - Castor -> GEMSS (StoRM, GPFS, TSM, GridFTP)
> Prague Tier-2 Site Report: SVEC, Jan
Virtualisation (Tony Cass)
> Update on HEPiX Working Group on Virtualisation: CASS, Tony
  - See: Extra Report
> Virtualization at CERN: a status report: SCHWICKERATH, Ulrich
  - PES
  - Batch: 3 different VM layouts
  - ISF and/or OpenNebula
  - "Golden nodes" as VM reference
  - Quattor/Lemon
  - Image distribution over 1 Gb network: scp, rtorrent, compressed images
  - Still under test: large scale
  - VM: 24 h lifetime only
  - No direct batch control of created image
> virtual machines over PBS: RODRIGUEZ ESPADAMALA, Marc
  - PIC
  - Start up VM within prolog
  - KVM snapshots
  - Need/want some add-ons in PBS for better VM support
  - DIRAC pilot jobs
> An Adaptive Batch Environment for Clouds: GABLE, Ian
  - HEP Legacy Data Project (BaBar), CANFAR (astrophysics): lots of individual environments
  - Nimbus Context Broker
  - SGE to Amazon EC2
  - (Eucalyptus)
  - -> Cloud Scheduler: combine the above without copying
  - Early experiences
  - Cloudscheduler.org
> CERN Virtual Infrastructure: VAN ELDIK, Jan
> Virtual Network and Web Services (An Update): FINNERN, Thomas
  - F5 load balancer
  - DESY web site
  - Infoscreen
> Virtualisation for Oracle databases and application servers: GARCIA FERNANDEZ, Carlos
  - Number of instances increases
  - Performance (- few %)
  - Live migration
  - JRockit: application directly on hypervisor
  - Migration from physical to virtual
  - Quattor-based
  - NFS-based /OVS -> /var/mount/ovs/<uuid>
  - "On the fly" and "golden images"
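The CERN status report above lists scp and rtorrent as ways to move compressed VM images over a 1 Gb network. The back-of-the-envelope sketch below shows why a peer-to-peer scheme can pay off at scale. It is not taken from the talk: the image size, the node count, and the idealised "holders double every round" model of peer-to-peer transfer are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def serial_copy_seconds(image_gb: float, nodes: int, link_gbit: float = 1.0) -> float:
    """One server copies the image to each node in turn (plain scp style)."""
    seconds_per_copy = image_gb * 8 / link_gbit  # GB -> Gbit, then / (Gbit/s)
    return nodes * seconds_per_copy


def swarm_copy_seconds(image_gb: float, nodes: int, link_gbit: float = 1.0) -> float:
    """Idealised peer-to-peer model (an assumption, not a BitTorrent simulation):
    in every round each machine that already holds the image serves one new
    node, so the number of holders doubles per round."""
    seconds_per_copy = image_gb * 8 / link_gbit
    rounds = math.ceil(math.log2(nodes + 1))  # server + nodes must all hold it
    return rounds * seconds_per_copy


if __name__ == "__main__":
    # Hypothetical numbers: 5 GB compressed image, 500 hypervisors, 1 Gbit/s links.
    print("serial:", serial_copy_seconds(5.0, 500), "s")
    print("swarm: ", swarm_copy_seconds(5.0, 500), "s")
```

With these assumed numbers the serial copy needs 500 transfers of 40 s each, while the idealised swarm finishes in 9 rounds. Real transfers will be slower than either figure because of protocol overhead and disk speed.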
Operating Systems & Applications (Sandy Philpott)
> Scientific Linux Status Report and Plenary Discussion: Mr. DAWSON, Troy
> Update on Scientific Linux: Troy Dawson, FERMI
  - Linux is like a wigwam: no Gates, no Windows, and Apache inside.
  - "We all know Linux is great...it does infinite loops in 5 seconds." --Linus
  - SL5 increasing
  - SL5.4 November 2009 incl. LiveCD
  - WIP:
      - SL4.9? 2010
      - SL5.5 in beta -> June 2010
      - Newer, faster, better distro servers
      - SL6: Koji build automation
      - SL6 pre-alpha after SL5.5
      - SL6.0 February 2011?
      - SL3.0.9 end: October 2010
      - SL4 goes into legacy mode
  - Spacewalk -> delta e-mails?
  - Discussion:
      - If CentOS, then in parallel or not
      - Extra RPMs without regular security updates in between SL release cycles
      - No extra kernel maintainable by Troy
  - Discussion: startup difficult: a couple of months <- documentation, personnel
> Current Windows 7 Status @ CERN: Michal Budzowski
  - Coming from Vista, but 6000 managed PCs (mostly XP)
  - Already 430 installations
  - Default 32 bit
  - Recommendation for 32 (64) bit:
                  Microsoft     CERN
      Memory      1 (2) GB      2 (4) GB
      Disk        16 (20) GB    60 (60) GB
      CPU         1 GHz         2 GHz
  - Legacy: old hardware gone 2012
  - Add-ons/changes:
      - Password-protected screensaver
      - Recently changed to search folder
      - International: language English (UK), location Switzerland, keyboard US, Euro
  - 2010Q2: Windows 7 default
  - 2010Q3: Phase out Vista
  - 2010Q4: Roadmap XP
> TWiki at CERN: Past, Present and Future: Mr. JONES, Pete
  - A wiki is a web page with an edit button
  - The simplest online database possible
  - Confluence, MediaWiki, PhpWiki, TikiWiki, TWiki
  - TWiki:
      - Since 1998
      - Open source
      - Perl-based: Linux + Apache
  - Since 2003 TWiki has been upgraded several times
  - March 2010: backend migrated from AFS -> NFS
  - 7,500 registered users (CMS, ATLAS, …)
  - 190 collaboration webs
  - 60,000 topics
  - 3,000,000 accesses / month
  - 50,000 updates / month
  - No anonymous write
  - Access control: by username or group and egroup
  - ENV(HTTP_ADFS_GROUP)
  - Issues:
      - Performance, SSO code change, web management
      - Google (search) does not see protected pages
      - CERN search soon also for protected data
      - Now Twiki.net instead of open source ….
  - Complements other IT services
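The TWiki access-control bullets above say that write access is decided by username or group and that group membership arrives in the environment variable HTTP_ADFS_GROUP. A minimal sketch of such a check follows. It is not CERN's code (TWiki itself is Perl-based): the semicolon-separated format of the variable, the group names, and the function names are all assumptions made for illustration.

```python
import os


def user_groups(environ=os.environ, sep=";"):
    """Return the set of groups found in HTTP_ADFS_GROUP.

    Assumption: the variable holds a list separated by `sep`; the slides do
    not state the real format.
    """
    raw = environ.get("HTTP_ADFS_GROUP", "")
    return {group.strip() for group in raw.split(sep) if group.strip()}


def may_edit(allowed_groups, environ=os.environ):
    """Allow editing only if the user is in at least one allowed group.

    A request without any group information is rejected, which matches the
    'no anonymous write' rule from the slides.
    """
    return bool(user_groups(environ) & set(allowed_groups))


if __name__ == "__main__":
    # Hypothetical request environment and group names.
    request_env = {"HTTP_ADFS_GROUP": "cms-members; atlas-readers"}
    print(may_edit(["cms-members"], request_env))
    print(may_edit(["it-admins"], request_env))
```

Passing the environment as a parameter keeps the check testable without touching the real process environment.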
Monitoring & Infrastructure tools (Helge Meinhard)
> Spacewalk and Koji: Troy Dawson
  - Spacewalk: install / maintain "channels"
  - @Fermi: test: group machines / set channels / separate users / …
  - Open-source Red Hat Satellite system
  - Koji for building (Red Hat/Fedora) distros (and SL6?)
  - Koji -> mock -> … -> rpms
  - Mash + Bodhi
> RAL Tier1 Quattor experience and Quattor outlook: COLLIER, Ian Peter
  - Quattor toolkit introduction
  - Profile/version control ….
  - Started with SL5: 680 servers + 130 Castor servers
  - The bigger the site, the more Quattor usage
  - Discussion: heterogeneous hardware difficult -> separate hardware and payload definitions
> Lavoisier: a way to integrate heterogeneous monitoring systems: L'ORPHELIN, Cyril
> Scientific Computing: first quantitative methodologies for a production environment: Mr. CIAMPA, Alberto
  - Cost evaluation: TCO? ROI?
> Lessons Learned from a Site-Wide Power Outage: BARTELT, John
  - SLAC: start 19 Jan. – 20 Jan.
  - Payroll/printing, light, coffee, communication, priorities, …
  - Documentation of dependencies
Storage and Filesystems (Andrei Maslennikov)
> CERN Lustre evaluation and storage outlook: BELL, Tim
  - 1.7 Beta, HSM, analysis space, project space, user home directories
  - Mandatory: strong authentication almost OK, no live data migration, backup OK, HA/redundancy OK, small files almost OK, HSM under development, no replication, no privilege delegation, no strong admin control
  - Additional: too strong coupling between client and server: versions, kernel, etc.
  - No Lustre at CERN for T0, analysis or AFS replacement; interest in the roadmap being fulfilled
  - Big storage: on disk and tape backup (write once, read never)
> High Performance Storage for LHC: Dr. DUELLMANN, Dirk
> LCLS Data Analysis Facility: WACHSMANN, Alf
  - Linac Coherent Light Source
  - Also with CFEL (DESY)
  - Lustre online + offline
> GEMSS: Grid Enabled Mass Storage System for LHC experiments: SAPUNENKO, Vladimir
  - CNAF (Italy)
  - StoRM/GPFS/TSM
> OpenAFS Performance Improvements: Linux Cache Manager and Rx RPC Library: ALTMAN, Jeffrey WILKINSON, Simon
  - Disk cache benefits over memory cache
  - Page cache improvements
  - Minimize data copies
  - OpenAFS Roadmap (ALTMAN)
      - 1.6 Summer 2010, 1.8, 2.0 Summer 2011, 2.2
      - 1.5 Windows production
      - 1.6: source code quality, Mac OS X
      - 1.6: …., NFS -> AFS translator
      - 1.6: Solaris 11
      - 1.8: … krb5, gss, x.509, SCRAM, …
> CC IN2P3: A way to combine heterogeneous monitoring systems
  - Lavoisier: a data source composition service
> Progress Report 2010 for HEPiX Storage Working Group: MASLENNIKOV, Andrei
  - Test facility @ KIT: high-end server + latest versions: CMS and ATLAS tests, questionnaire: 87 PB: 33 % Castor, 33 % dCache, N-client/N-server = 10 for 1 Gb server, AFS/Lustre, GPFS, dCache
> Evaluation of NFS v4.1 (pNFS) with dCache: FUHRMANN, Patrick
  - NFS 4.1 in pre-production quality, support with golden release, set_acl by user soon
  - Kernel 2.6.32 is the first to support NFS 4.1 (SL6+); local results stable and fast
> Building up a high performance data centre with commodity hardware: HAUPT, Andreas
  - Lustre, multi-batch cluster (batch, parallel, NAF), Dell, 30 cent/GB
> Lustre-HSM binding: LEIBOVICI, Thomas
  - CEA/France
  - HSM backend generic (not only HPSS, POSIX, …)
  - Oracle/SUN/CFS/CEA/other
  - V1 features: migrate data, free space, recover, policies, import, disaster recovery
  - RobinHood as policy engine
  - V2 features in progress: fine tuning …
Grid and WLCG (Goncalo Borges)
> The new Generations of CPU-GPU Clusters for e-Science: PROENÇA, Alberto
  - Paradigm change: computers are not becoming faster
  - Complex SMP + MPP systems
  - Single programs: messaging, load balancing, …
  - New chips with message passing between cores
  - Data parallelism: SIMD (single instruction, multiple data) to GPU (MIMD)
  - CUDA (Compute Unified Device Architecture): CPU (serial) + GPU (parallel) code
  - NVIDIA GPUs: G80, GT200, Fermi (512 cores)
  - Big installations: CSIRO (AUS), NCSA (USA)
  - Convener: Borges, Goncalo
> WLCG - evolving for the future: Dr. BIRD, Ian
  - T2/T3 discussion
> CESGA Experience with the Grid Engine batch system: Mr. FREIRE GARCIA, Esteban
  - Oracle Grid Engine
  - With ARCo
  - Large=400
  - Interconnect built-in + ssh
  - AllowUser=Admins
> CERN Grid Data Management Middleware plan for 2010: Oliver Keeble
  - Manageability, performance, standards (and therefore interoperability): SSL, NFS 4.1, HTTP/HTTPS
  - Full CERN support: FTS, DPM/LFC, gfal/lcg_util
> EGEE Site Deployment: The UMinho-CP case study: SÁ, Tiago (UMinho)
  - EGEE is a dynamic organism with requirements that constantly evolve over time. The deployment of UMinho-CP (an EGEE site supporting Civil Protection related activities) revealed new challenges, so...
  - Deployment: Rocks toolkit, not Quattor
Security & Networking (Dr. David Kelsey)
> Update on computer security: Dr. SCHWICKERATH, Ulrich
> IPv6 in HEP - a discussion: Dr. KELSEY, David
Benchmarking
> Preliminary Measurements of HEP-SPEC06 on the new multicore processors: Mr. MICHELOTTO, Michele
  - Intel Nehalem -> Westmere: 4 cores (8 logical CPUs) to 6 cores (12 logical CPUs)
  - AMD Istanbul -> Magny-Cours: 6 to 12 cores
  - Compiler ready?
> Hyperthreading influence on CPU performance: MARTINS, Joao
  - +20 % without I/O
  - +30 % with light I/O
  - Advantage is application-specific
  - Default OS CPU affinity not optimal for HT
  - Recommendation: no HT for now
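The hyperthreading talk above notes that the default OS CPU affinity is not optimal with HT. One common way to act on that is to place jobs on distinct physical cores before doubling up on sibling threads. The sketch below illustrates only that selection step and is not the method used in the talk. The topology dictionary is a made-up example (on Linux the real sibling lists can be read from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list), and the function names are hypothetical.

```python
import os

# Hypothetical 4-core machine with hyperthreading: logical CPU -> logical CPUs
# that share the same physical core.
EXAMPLE_TOPOLOGY = {
    0: (0, 4), 1: (1, 5), 2: (2, 6), 3: (3, 7),
    4: (0, 4), 5: (1, 5), 6: (2, 6), 7: (3, 7),
}


def one_cpu_per_core(siblings):
    """Pick the lowest-numbered logical CPU of every physical core."""
    seen_cores = set()
    chosen = []
    for cpu in sorted(siblings):
        core = frozenset(siblings[cpu])
        if core not in seen_cores:
            seen_cores.add(core)
            chosen.append(min(core))
    return sorted(chosen)


def pin_current_process(cpus):
    """Restrict the current process to `cpus`.

    os.sched_setaffinity is only available on some platforms (e.g. Linux),
    so the call is skipped elsewhere.
    """
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, set(cpus))


if __name__ == "__main__":
    print(one_cpu_per_core(EXAMPLE_TOPOLOGY))
```

On a real host, `pin_current_process(one_cpu_per_core(topology))` would keep a job off sibling threads; whether that helps depends on the application, as the talk points out.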
Summary
> State of the Art Virtualisation
> More Data
> More „Green Computing“
> More Consolidation
> ITIL (IT Infrastructure Library) is still coming (slowly)
> In fact the most challenging HEPiX meeting to organize!
  - Everything reorganized over the weekend
> Good side: EVO experience was a success!
  - Most of the registered people connected during the 5 days: ~60-65 connections on average
  - Smooth meeting despite being remote
  - Need to add coffee breaks and dinner to EVO!!!
> HEPiX continues to attract new sites and new people
Next Fall meeting (2010): Cornell University
> Ithaca, NY (south of Lake Ontario)
> http://maps.google.fr/maps?client=opera&rls=fr&q=ithaca+new+york&sourceid=opera&oe=utf-8&um=1&ie=UTF-8&hq=&hnear=Ithaca,+NY,+USA&gl=fr&ei=6PrqSvypBoPclAfJqfj_BA&sa=X&oi=geocode_result&ct=image&resnum=1&ved=0CAsQ8gEwAA
> 1st week of November (Nov. 1-5)
> Web site available by end of May