HEPiX Spring Meeting 2015
University of Oxford, UK
http://indico.cern.ch/event/346931/
Arne Wiebalck
Julien Leduc
Adam Krajewski
Wiebalck, Leduc, Krajewski: HEPiX Spring 2015 Summary
HEPiX
• Global organization of service managers and support staff providing computing facilities for the HEP community
• Participating sites include BNL, CERN, DESY, FNAL, IN2P3, INFN, NIKHEF, RAL, TRIUMF …
• Meetings are held twice per year
- Spring: Europe, Autumn: U.S./Asia
• Reports on status and recent work, work in progress & future plans
- Usually no showing-off, honest exchange of experiences
Outline
• 2015 Spring Meeting & General HEPiX News
• Site Reports (17)
• Grids, Clouds, and Virtualization (8)
• Storage and File systems (8)
• Computing and Batch (17)
• IT Facilities (2)
• End User Services & Operating Systems (10)
• Networking and Security (10)
• Basic IT Services (7)
• Closing remarks
Presenters: Arne, Julien, Adam
HEPiX Spring 2015
• Mar 23-27, 2015 at the Physics Department, Oxford University, UK
• 134 registered participants (record!)
- Many first timers again
- 75% from Europe, ~20 from 8 companies
- 45 different affiliations
• 83 contributions (+30%)
- Slots cut down to 25 min
- Ceph BoF, IPv6 tutorial
HEPiX Working Groups
• Benchmarking
- Awaiting SPEC CPUv6
- Suggestion of a “fast” benchmark (minutes)
- First test of a candidate provided by LHCb
Site Reports (1)
• 17 site reports: about half from T0/T1
• HTCondor continues to be very visible
- Many sites are considering a move (e.g. DESY or KISTI)
- Mostly due to scalability issues with current solutions
- Feedback from sites running it is very positive
- INFN renewed LSF contract “for the last time”
• Config mgmt: Puppet still gaining popularity
- Quattor flag held up by some (few) sites
- Ansible mentioned as well (NERSC) … (reminds me of Umeå 2009)
Site Reports (2)
• Storage: Ceph clearly dominating reports …
- Some sites well advanced (e.g. BNL, RAL, CERN)
- Many sites exploring what to do with Ceph
• … but Lustre (re)gains some popularity
- Beyond GSI & JLAB, sites are considering deployment (e.g. NIKHEF)
- Apparently sites see a need for a distributed file system (specialised ones? DESY considers moving out of AFS)
• SL vs. CentOS: not a hot topic
- No rivalry, sites do not worry
Site Reports (3)
• Monitoring: being redone at several sites
- With usual suspects: Flume, ES, Kibana, Grafana, …
• Cgroups started to be used more widely
- Issues on various batch installations (kernel panics)
• IHEP: per user but managed VMs
- no root access, no console access
- an option for lxplus++ ?
GSI’s Cube: “3-d” CC
- 6 floors (128 racks, 36k U)
- PUE < 1.1
- used for heating
- cable length
More details next time!
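PUE (power usage effectiveness) is total facility power divided by IT equipment power, so a PUE below 1.1 means that less than 10% of the IT load goes on top for cooling, power distribution and similar overheads. A minimal sketch of that arithmetic follows; the 1000 kW IT load and the 1080 kW total are purely illustrative assumptions, not GSI figures.

```python
def pue(total_facility_kw: float, it_kw: float) -> float:
    """Power usage effectiveness: total facility power / IT equipment power."""
    if it_kw <= 0:
        raise ValueError("IT power must be positive")
    return total_facility_kw / it_kw


def overhead_kw(it_kw: float, pue_value: float) -> float:
    """Non-IT power (cooling, distribution, ...) implied by a given PUE."""
    return it_kw * (pue_value - 1.0)


if __name__ == "__main__":
    it_load_kw = 1000.0  # hypothetical IT load, for illustration only
    print(f"PUE for 1080 kW total: {pue(1080.0, it_load_kw):.2f}")
    print(f"Overhead at PUE 1.1: {overhead_kw(it_load_kw, 1.1):.0f} kW")
```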
Virtualization (1)
• 8 talks, 2 from CERN
- Bruno: Cloud report & Heat
• OpenStack community within HEP growing
- Different approaches (e.g. IHEP only 3 images, RAL only 1 flavor)
- Mostly used for dev machines, some for services, few for compute (normal virtualization phase-in)
• “ATLAS on Amazon” (BNL/AWS)
- Practical feasibility of commercial clouds for ATLAS production, at full scale!
- Joint work with the AWS Scientific Computing Group
- Areas: compute (capacity?), networking (direct links?), storage (from “keep” to “delete”), … std vs scientific computing
- First test w/ 20k slots was economical, next test: 100k cores
http://indico.cern.ch/event/346931/session/9/contribution/20/material/slides/0.pdf
https://indico.cern.ch/event/346931/session/9/contribution/54/material/slides/1.pdf
Virtualization (2)
Storage and File systems (1)
• 8 talks
• 4 about Ceph
• Panel and BoF: Ask the Ceph experts
• Ceph as a building block for many services
• RACF at BNL
- CephFS has been in production since 2014Q3
- Awaiting RDMA support to ditch IBoE
• RAL
- Ceph as a large-scale object store to replace their Castor disk-only storage
- Using Xrootd and GridFTP plugins
- Reported on the experience of losing monitors: now using 3 physical monitors, physically distributed
- Going to erasure coding: 3 replicas are too expensive, looking for +30% HA overhead for Ceph storage
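The cost argument behind the move to erasure coding can be made concrete with a little arithmetic. With N-way replication every usable byte costs N - 1 extra bytes of raw capacity; with a k+m erasure code (k data chunks, m coding chunks) the extra cost is m/k. The sketch below assumes that "+30% overhead" refers to extra raw capacity per usable byte, and the 10+3 profile is a hypothetical choice that happens to give 30%; neither is stated as RAL's actual configuration.

```python
def replication_overhead(copies: int) -> float:
    """Extra raw capacity per usable byte for N-way replication."""
    if copies < 1:
        raise ValueError("need at least one copy")
    return float(copies - 1)


def erasure_overhead(k: int, m: int) -> float:
    """Extra raw capacity per usable byte for a k+m erasure code."""
    if k < 1 or m < 0:
        raise ValueError("need k >= 1 and m >= 0")
    return m / k


if __name__ == "__main__":
    # 3 replicas: 200% extra raw capacity
    print(f"3 replicas: +{replication_overhead(3):.0%}")
    # Hypothetical 10+3 profile: 30% extra raw capacity
    print(f"EC 10+3:    +{erasure_overhead(10, 3):.0%}")
```

In this example both layouts survive the loss of any two devices holding a given object (the 10+3 code tolerates three), but the erasure-coded pool needs 1.3 bytes of raw disk per usable byte instead of 3.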
Storage and File systems (2)
• Distributed File systems:
- GPFS: DESY Petra III data taking and analysis infrastructure is moving to GPFS after detector upgrades (DESY <=> IBM partnership)
- BeeGFS experience: DESY wants to use this as a replacement for GPFS and Lustre
- Formerly FhGFS from Fraunhofer, renamed in 2014
- Project will become open source with commercial support available
Storage and File systems (3)
• DESY experimenting with HGST open Ethernet drive to build a dCache cluster
• Each disk:
- Runs Linux (2GB RAM, disk is sda, network is eth0), 60/4U enclosure
• They recompiled dCache pool code and run it directly on disks
• Future tests: reuse this HW to test Ceph deployment on disks
Computing and Batch (1)
• 17 talks
- 8 benchmarking + 9 batch systems
• Commissioning cloud resources
- Several simple metrics (wallclock, CPU usage, data stage-in time, cvmfs software setup time) allow quick commissioning of cloud resources
- Stable cloud is easy to integrate in production
- Lots of effort to optimize performance
Computing and Batch (2)
• BNL remote evaluation of HW
- Need to speed up acquisition processes (partnership with vendors)
- Long acquisition processes <=> money lost
• Beyond HS06/fast benchmark
- Candidates (SPEC CPUv6, multithreaded Geant4), mandatory compiler flags (-O2?)...
- Fast benchmark: LHCb fast benchmark, HS06/LHCb ratio between 1.2 and 1.6 (but can go >2)
Computing and Batch (3)
• Alternate CPUs:
- Intel Atom Avoton, Tegra K1 (ARM 32bit) extensively tested
- ARM 64bit software support is improving
- Working on integration in CERN environment (PXE boot, puppet, koji...)
• Test platforms available through CERN techlab
https://twiki.cern.ch/twik/bin/viewauth/IT/TechLab
Computing and Batch (4)
• Univa GE is popular
- Only one to support DRMAA2 standard now
• HTCondor is more popular
- Very large reactive community
- Lots of additional tools developed by communities (HEP, HTCondor)
• Monitoring CPU and memory usage with cgroups
- Batch schedulers can isolate jobs in cgroups
- Makes it possible to understand resource utilization per type of job (analysis, reconstruction,...) => refine scheduling policies
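As an illustration of the cgroups-based monitoring described above, the sketch below reads the per-job accounting counters that the cgroup v1 `cpuacct` and `memory` controllers expose as plain files (`cpuacct.usage` in nanoseconds, `memory.usage_in_bytes`, `memory.max_usage_in_bytes`). It assumes cgroup v1 and that the batch system places each job in its own cgroup; the directory layout in the usage note is hypothetical and depends on the scheduler and site configuration.

```python
from pathlib import Path


def read_counter(path: Path) -> int:
    """Read a single integer counter from a cgroup accounting file."""
    return int(path.read_text().strip())


def job_usage(cpuacct_dir: Path, memory_dir: Path) -> dict:
    """Collect CPU time and memory usage for one job's cgroup (cgroup v1).

    cpuacct_dir and memory_dir are the job's directories under the
    cpuacct and memory controller hierarchies respectively.
    """
    cpu_ns = read_counter(cpuacct_dir / "cpuacct.usage")
    return {
        "cpu_seconds": cpu_ns / 1e9,
        "memory_bytes": read_counter(memory_dir / "memory.usage_in_bytes"),
        "peak_memory_bytes": read_counter(memory_dir / "memory.max_usage_in_bytes"),
    }
```

A collector could call `job_usage(Path("/sys/fs/cgroup/cpuacct/batch/job_1234"), Path("/sys/fs/cgroup/memory/batch/job_1234"))` periodically (both paths are made-up examples), tag the result with the job type, and aggregate per type to inform scheduling policies.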
IT Facilities (1)
• 2 talks from CERN
• Recent operational issues at CERN
- 14/10/16 power incident (+ Murphy's law)
IT Facilities (2)
• Another operational incident: dust on tape
- Thanks to the vendor, impact was limited
- Development of a homemade dust sensor to monitor dust inside tape libraries at CERN
End User Services & OS (1)
• 10 talks total, 5 from CERN
- Andreas: CERN Search and Social for the Enterprise Web Experience
- Thomas: Evolutions in the CERN Conferencing Services Landscape
- Arne: CERN CentOS 7 Update
- Nils: Update on software collaboration services at CERN
- Nils: Status of volunteer computing at CERN
• HEP Software Foundation
- Collaboration started for HEP software/computing efforts (kickoff meeting April 2014, first workshop January 2015)
- Objectives: sharing expertise, catalyzing common SW projects, promoting collaboration in new developments
- Website: http://hepsoftwarefoundation.org
• Scientific Linux Current Status
- Development continues, SL 7.1 released on April 10th 2015
- Researching containerization possibilities: Docker image, Scientific Linux Project Atomic distro
End User Services & OS (2)
• SciDB at NERSC
- Testbed evaluation
- Cluster of ~20 nodes, normally 100 GB to 1 TB data, even 20+ TB
- Happy with the results, decided to go with a production-level cluster
• Lustre at the Sanger Institute
- 11 Lustre volumes, 6 PB storage
- Problems analyzing storage usage
- Solved by implementing an efficient, parallel file tree walker using MPI
• Zimbra at DESY
- Replacement of UNIX mail and Microsoft Exchange
Networking and Security (1)
• 10 talks in total, 2 from CERN:
- Adam: Effects of packet loss and delay on TCP performance
- Romain: Computer Security Update + phishing demonstration
• IPv6 Working Group
- Lots of sites still not IPv6-ready (especially T2)
- Testing and deploying dual-stack services if performance is sufficient
- Dual-stack perfSONAR should be provided in 2015
• perfSONAR
- Network and Transfer Metric Working Group started in May 2014
- OSG datastore (community data store for all perfSONAR metrics) to enter production in Q3 2015
- Integrating perfSONAR with FTS and experiments to optimize transfers
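To give a feel for why packet loss and delay matter so much for TCP, as covered in the talk listed above, a widely used back-of-envelope model is the Mathis et al. approximation for steady-state throughput: roughly (MSS / RTT) · C / sqrt(p), with C ≈ sqrt(3/2). Whether the talk relied on this model is not stated in this summary, so the sketch below is an illustrative assumption, and the RTT and loss values are hypothetical.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float,
                          c: float = math.sqrt(3 / 2)) -> float:
    """Approximate upper bound on steady-state TCP throughput in bit/s.

    Mathis et al. approximation: (MSS / RTT) * C / sqrt(p).
    """
    if rtt_s <= 0 or not (0 < loss_rate <= 1):
        raise ValueError("need rtt_s > 0 and 0 < loss_rate <= 1")
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    # Hypothetical paths: a LAN-like 1 ms RTT and a WAN-like 100 ms RTT,
    # both with a 0.01% packet loss rate and a 1460-byte MSS.
    for rtt in (0.001, 0.100):
        mbps = mathis_throughput_bps(1460, rtt, 1e-4) / 1e6
        print(f"RTT {rtt * 1000:.0f} ms: ~{mbps:.0f} Mbit/s")
```

Under this model throughput falls linearly with RTT and with the square root of the loss rate, which is why a loss rate that is harmless on a LAN can cripple a long-distance transfer.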
Networking and Security (2)
• WLCG Cloud Traceability Working Group
- Looking into incident traceability in emerging cloud computing environments
- Best practices for gathering additional logging information in cloud frameworks, configuring VMs etc.
• Operational Security in the EGI and WLCG
- Security policies: reporting vulnerabilities is essential
- Only 8 incidents last year, quite successful prevention
- Now re-working policies to face cloud computing technology threats
• OSSEC at Scotgrid Glasgow
- Visualizing with Elasticsearch / Logstash / Kibana
Basic IT Services (1)
• 7 talks, 3 from CERN:
- Alberto: Configuration management at CERN: Status and directions
- Francisco: Towards a modernisation of CERN’s telephony infrastructure
- Andrei: Updates from Database Services at CERN
• Config Management at RACF
- Deployed Puppet Server in production: catalog compilation avg 1.97 sec -> 1.00 sec
- Looking into Jenkins CI for testing pending production changes
- MCollective in testing, plans to put it in production
• MCollective at DESY
- Successfully deployed in production for the following use cases: steering Puppet agent runs, querying the infrastructure, small parallel-ssh tasks (e.g. package updates)
- Performance problems caused by SSH key plugin, now fixed
Basic IT Services (2)
• Subtlenoise by Lancaster University
- Small framework to leverage acoustics during monitoring shifts
- “produces low-impact but information-rich soundscapes in realtime”
- https://github.com/ptrlv/subtlenoise
• Update on Quattor
- Still in development, Quattor 15.2.0 released March 23rd 2015
- ~15 institutes participating, over 2500 commits on GitHub in 2014
- Active community
HEPiX Board News
• Next meetings
- Autumn 2015: BNL (US) Oct 12 – 16 (to be held jointly with the WLCG GDB)
- Spring 2016: DESY Zeuthen (DE) April 18-22
- Autumn 2016: U.S. West Coast candidates, but also other proposals
• Discussions about swapping the European/US location cycle
Questions?