Transcript of “Data Preservation in HEP” (Indico event 731584, attachment 1654419)
Data Preservation in HEP:
The Next 3 (2?) Years
CERN IT-MM, June 2018
These slides and associated material:
https://indico.cern.ch/event/731584/
International Collaboration for Data Preservation and
Long Term Analysis in High Energy Physics
Slide 3
Overview
1. DPHEP “2020 vision” – brief reminder
2. Status of CERN Certification and Outlook
3. “EIROforum” TWG on Long-Term Data Preservation
4. PV2020@CERN [ Preservation & Value adding ]
5. European Strategy update – ESPP2020
Not covered: ARCHIVER, ESCAPE etc. (although Certification relevant here too)
Slide 4
DPHEP 2020 Vision – Reminder
• DPHEP Blueprint published in May 2012 – to some extent a “cry for help”
Urgent action is needed for LTDP in HEP
The preservation of the full capacity to do analysis is recommended such that new scientific output is made possible using the archived data
• Current ESPP (May 2013):
• …data preservation and distributed data-intensive computing should be maintained and further developed.
“Open Data” was not part of Blueprint. CMS policy: May 2012@CHEP
Slide 5
What does DPHEP do? DPHEP is a Collaboration with signatures from the main HEP
laboratories and some funding agencies worldwide.
• It has established a "2020 vision", whereby:
+ All archived data – e.g. that described in the DPHEP Blueprint, including LHC data – should be easily findable and fully usable by the designated communities, with clear (Open) access policies and possibilities to annotate further ( = P + V );
+ Best practices, tools and services should be well run-in, fully documented and sustainable; built in common with other disciplines, based on standards;
+ There should be a DPHEP portal, through which data / tools are accessed;
+ Clear targets & metrics to measure the above should be agreed between Funding Agencies, Service Providers and the Experiments.
Vision presented to the ICFA meeting in Feb 2013, which issued a “statement”
Slide 6
Path to “Vision” – some milestones
• “Full Costs of Curation” workshop – inspired by 4C – January 2014
• Bit preservation “cost model”, good understanding of costs (P+M) vs “value” – adopted HEP-wide (and beyond), e.g. Data Rescue talk at PV2018
• First “Collaboration Workshop” (after signatures of CA) – June 2015 – DPHEP Status Report
• Led to LEP data on EOS – 3 copies just at CERN!
• Common reporting format for all HEP expts (DMP+SWOT)
• ISO 16363 training – June 2015 – CERN + T1s
iPRES 2016 “CERN Services for LTDP” paper
• Also CERNLIB documentation update, GPHIGS license etc.
2018: ICFA report; Request to host BaBar data; OPERA ingest; ESRIN visit; PV2020 agreement; ISO 16363 self-assessment; HSF-CWP
Meets and (greatly) exceeds ESPP; good progress towards “vision”
“Standards” include:
• FAIR DMPs; TDRs
+ OAIS & related
Slide 7
Built on 3 "pillars":
1. The data itself (“bits” – state-of-the-art bit preservation, e.g. in a “Trustworthy Digital Repository” (TDR));
2. Documentation (services like Zenodo, B2SHARE);
together with the necessary
3. Software + environment (CernVM / CVMFS)
• Services for all 3 areas exist and are mature but change on fully independent timescales
We need flexible (not static) bridges between them
LTDP in HEP (iPRES 2016 paper)
See https://cds.cern.ch/record/2195937/files/iPRES2016-CERN_July3.pdf
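Pillar 1, bit preservation, in practice means recording fixity information (checksums) at ingest and periodically re-reading (“scrubbing”) the archive against it. A minimal Python sketch of the idea – purely illustrative, with invented function names, and not how CERN’s EOS/CTA services actually implement fixity checking:

```python
import hashlib
import json
from pathlib import Path

def checksum(path: Path, algo: str = "sha256", chunk: int = 1 << 20) -> str:
    """Stream a file through a hash in 1 MiB chunks (archive files are large)."""
    h = hashlib.new(algo)
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def record_manifest(archive: Path, manifest: Path) -> None:
    """At ingest: write a checksum for every file in the archive tree."""
    entries = {str(p.relative_to(archive)): checksum(p)
               for p in sorted(archive.rglob("*")) if p.is_file()}
    manifest.write_text(json.dumps(entries, indent=2))

def scrub(archive: Path, manifest: Path) -> list[str]:
    """Periodically: re-read every file, return names that no longer match."""
    entries = json.loads(manifest.read_text())
    return [name for name, expected in entries.items()
            if checksum(archive / name) != expected]
```

Real systems (tape plus disk, multiple copies) do this at scale and repair a failed file from a replica rather than merely reporting it.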
Slide 8
Certification – a “sine qua non” of LTDP
• In May 2017 I wrote a note to the IT-MM on a strategy for certification as a TDR (attached)
• From then until end 2017, worked on draft responses to the 109 metrics in ISO 16363
• These have now been submitted to “stage 1” offsite audit and a contract signed with PTAB
• Feedback is expected shortly – work on OAIS update has delayed this but iteration is to be expected, if not major revisions
More details on the ISO audit process can be found in the PV2018 paper (attached)
“Sterling work” according to WLCG GDB chair
Slide 9
ISO 16363 (is right for CERN)
• Was developed and is maintained by the same people as OAIS (ISO 14721) – the “space community”
• Much closer to us than e.g. the humanities
• CoreTrustSeal, which came from DSA+WDS, follows the same breakdown: but it is not as thorough
• Satisfying ISO 16363 should “automatically” mean satisfying CoreTrustSeal (e.g. BSc vs Oxbridge)
• European Framework for Audit and Certification of Digital Repositories presents the main methodologies as a “hierarchy”, along with an MoU
• Others pursuing ISO 16363 include the EU Publications Office, the US Library of Congress & some “secret” ones
Open Archival Information System – an archive (systems + people)
Slide 10
Who does it benefit?
• Funding agencies, who can better judge if the money they are providing will be used according to their requirements
• e.g. FAIR DMPs, which call for preservation & re-use
• Data users to be able to determine the “trustworthiness” of the data (user surveys)
• Producers (e.g. LHC experiments) to understand how and what a repository does to preserve their data
The data of most CERN experiments is already lost!
• By number of experiments, not by volume
• CERN Greybook: 776 completed experiments, ~20 active
• “Preserved”: LEP(3/4), LHC(4)
O(10) vs O(1000)
Slide 11
Certification Areas
3. Organisational Infrastructure
• IMHO we are quite strong here, but some of the descriptions may well not be clear to people outside HEP (such as the auditors)
4. Digital Object Management
• Here we are (very) weak WRT the standard. We don’t in general have AIPs etc. but we have proven that we can “ingest” data (e.g. OPERA, maybe BaBar) TOGETHER WITH the experiment
• IMHO (cf. EOSC Pilot) a generic TDR could NOT
5. Infrastructure and Security Risk Management
• Probably OK, although some elements, e.g. Business Continuity, are still work-in-progress
e-group DPHEP-CERN-Certification
Slide 12
Example Feedback (APARSEN)
3. Organizational Infrastructure
• Currently, <SITE> does not formally document all
changes to its operations, procedures, software
and hardware.
4. Digital Object Management
• The process for converting SIPs to AIPs and the
corollary mapping history between them was unclear.
5. Infrastructure and Security Risk Management
• <SITE> has no technology watch.
• <SITE> has no risk register.
Slide 13
What Happens Next?
• It is hard to make a concrete plan without the first written feedback
• Still target an on-site audit in 2019, with Certification by 2020 – earlier if possible
• This will need the presence of a number of experts
• In IT, most likely from DI, ST, CDA & WLCG
• “Surveillance audits” would typically follow in 2021 & 2022
• Ideally, I should be involved in 1, if not 2, of these
Even 1 may no longer be possible if there are further delays (for whatever reason)!
• In parallel, the “motivation” for certification can be expected to increase (cf Science Europe w/s, FAIR action plan)
Slide 14
Slide 15
EIROforum WG on LTDP
• As mentioned above, the “space community” defined and maintains the LTDP standards
• Very active mailing list: hope to update OAIS in 2020!
• Initiated the “PV” conference series, all but one (@DCC) have been at “space institutes”
• Triggered by a visit from ESRIN (they now want to learn from CERN / HEP!): technical & topical meetings will be held with “EIROforum” institutes and similar, e.g. DLR, [ ARCHIVER procurers etc. ]
Complementary to PV but much more hands-on
• Topics could include archive i/f, tape strategies, portals, s/w preservation, certification etc.
• First meeting at CERN(?) after the summer
Yet another “success story” IMHO…
Slide 16
PV2020@CERN
Just to be clear, I think that this is a great opportunity!
• Typically a 2.5 day meeting, probably late April / early May, plenaries + 2 parallel tracks + posters
• Can have some co-located events: some good suggestions from the closing talk at RAL
• E.g. show-casing open data from different disciplines to school kids etc.
• (We are not leaders in this area!)
• Target 150 – 200 attendees from all continents and many scientific disciplines
And another! We are definitely “on the map” WRT LTDP
Slide 17
Goals (as presented at RAL)
• Attract more scientific communities
• Broaden information exchange, sharing of
experiences, tools and even services
• Keep in step with (or ahead of) funding agencies /
policy makers in their push for LTDP & OD
• (Discussion at end) Suggestions came here
Slide 18
PV2020 Organisation
• Would prefer to avoid need for a sponsor (and sales talks)
• PV2018 registration was GBP 160 without conference dinner
• Should be able to cover coffee / lunch breaks, welcome “apero”, plus conference bag for this (or less…)
• Session chairs come from programme committee (usual suspects plus some new ones)
• Abstract submission / reviews done using EasyChair which works quite well (assume Indico for agenda & badges)
• Proceedings (4 pp per talk) published before meeting
• On-site visits? Would take organisation & guides but likely to be very popular
Quite a few invited talks were poor, 1 or 2 excellent!
Panel session went well – something to repeat?
• Local organisers at RAL seemed to be very stressed. Why?
Would be good to have DG / Directorate level support
Slide 19
ESPP 2020
• Clear from the HSF CWP that “bit preservation” (with an acceptably low error rate) is considered “solved”
• Services around the key “pillars” of LTDP in HEP exist, are mature and well supported:
• Bit preservation; CVMFS/CernVM for s/w + environment; Invenio-based solutions for documentation [EOSC service]
• Focus now on “new” areas. Those discussed in Naples include:
• Re-use & reproducibility (always a goal but unclear how)
• Handling changes in access protocols
New and changing requirements from FAs will need to be considered
Input in 2012 was not well coordinated and sometimes contradictory
Slide 20
Beyond 2020
• The elaboration of a post-2020 DPHEP Vision, its implementation and that of new directives in the 2020 ESPP need to be addressed by someone else (non-CERN needs?)
• Ideally, they would start well before this, getting increasingly involved in ISO 16363 (re-)certification, PV2020 preparation, the EIROforum LTDP WG, H2020 projects(?) and any other activities deemed important
OAIS (ISO 14721) updated in 2020, ISO 16363 and other updates will follow
Slide 21
Questions for Run3 management
1. Does CERN wish to continue with DPHEP Project Management? (See letter from former DRC)
• In principle would need to be approved by the DPHEP Collaboration Board and ICFA in 2020 or before
2. Does CERN / HEP wish to continue to collaborate with other disciplines / policy makers / funding agencies? (At the same level as now? More?)
• We benefit – they benefit – we all benefit, e.g. costs & benefits, technical solutions, “knowledge is more than documentation” etc
3. Does CERN wish to maintain Certification?
4. Should this activity – if retained – be in IT?
5. Did you get the DPHEP CA from the former DRC?
Slide 22
Comments…
• By 2020, we should be able to…
Implement the DPHEP 2020 vision
Whilst taking account of the evolving landscape during this period (FAIR+DMPs+TDRs+EOSC etc)
Obtain ISO 16363 certification for CERN as a TDR for all LTDP activities
Bring the leading scientific LTDP conference to CERN
Run a WG with other major scientific organisations on LTDP – including those that “wrote the book”
• Change the way people think of LTDP?
Slide 23
ARCHIVER
• It is unlikely that this can succeed (in
attracting suppliers) without a good
understanding of OAIS & TDRs
• This includes agreement on SIPs & DIPs
(the conversion to AIPs is up to the supplier)
• We are also likely to insist that suppliers are
certified to some agreed standard
• Collaboration with ESRIN will be important
to help specify interface
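The OAIS terms above translate into a concrete packaging question: what exactly is in the SIP an experiment hands to a supplier? A minimal, BagIt-flavoured sketch in Python – purely illustrative, with invented names and layout, not any agreed ARCHIVER or OAIS package format:

```python
import hashlib
import io
import json
import tarfile
from pathlib import Path

def build_sip(data_dir: Path, metadata: dict, out: Path) -> Path:
    """Bundle a payload directory into one tarball, together with a per-file
    SHA-256 manifest (fixity) and a descriptive-metadata record."""
    manifest = {str(p.relative_to(data_dir)): hashlib.sha256(p.read_bytes()).hexdigest()
                for p in sorted(data_dir.rglob("*")) if p.is_file()}
    with tarfile.open(out, "w:gz") as tar:
        for rel in manifest:                          # the payload itself
            tar.add(data_dir / rel, arcname=f"sip/data/{rel}")
        for name, doc in (("manifest-sha256.json", manifest),
                          ("metadata.json", metadata)):
            raw = json.dumps(doc, indent=2).encode()
            info = tarfile.TarInfo(f"sip/{name}")     # in-memory members
            info.size = len(raw)
            tar.addfile(info, io.BytesIO(raw))
    return out
```

The supplier’s conversion of such a package to an AIP (and back to a DIP on request) is then their responsibility – exactly the split described above.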
Slide 24
Summary
• 2020 is a key date for many aspects of LTDP
• And the horizon for some non-CERN projects
• There is a lot to do – even without additional H2020 projects or new ideas from FAs
• Some preparation for 2020+ (Run 3 and beyond) is now required
On-going certification should help ensure LTDP remains a reality at CERN for decades(?) to come (LHC, HL-LHC, HE-LHC)
Slide 26
Slide 27
F.A.I.R. Data Management
• Increasing emphasis on FAIR DMPs, including
preservation, sharing, reproducibility etc.
FAIR now also includes s/w, but not yet build systems, verification procedures & environment
• IMHO not yet fully understood (some claim otherwise) – we see (ir)regular changes in how we find data and what protocol(s) we use to access it
• This can be a problem over periods < 1 decade
Only solution we know of: find the effort to
migrate (problem for legacy projects / data)
FAIR = Findable, Accessible, Inter-operable, Re-usable
Slide 28
Expert Group on FAIR
Sandra Collins, National Library of Ireland
Françoise Genova, Observatoire Astronomique de Strasbourg
Natalie Harrower, Digital Repository of Ireland
Simon Hodson, CODATA, Chair of the Group
Sarah Jones, Digital Curation Centre, Rapporteur
Leif Laaksonen, CSC-IT Center for Science
Daniel Mietchen, Data Science Institute, Univ. of Virginia
Ruta Petrauskaité, Vytautas Magnus University
Peter Wittenburg, Max Planck Computing & Data Facility
Slide 29
Collaboration
1. Through technology:
• Large Tape Users’ Group
• Invenio, Zenodo, B2SHARE (INSPIREHEP)
• CVMFS / CernVM
2. Through projects:
• e*, E* and H*
3. Through services:
• Obi-wan Zenodo
• CVMFS repository for “lost” experiments
• Possible hosting of 2 PB of BaBar@SLAC data
• 70 TB of OPERA data (CERN “recognised” expt)
4. Through workshops & conferences:
• e.g. EIROforum technical WG on LTDP
• More on Thursday… (?)
BABAR needs Help! BABAR in Numbers
• BABAR data actively being analyzed and high-impact papers published (see slide 2). Expect this to continue at least through 2021.
• SLAC management plans to stop hosting BABAR computing in February 2020, at which time the tapes with data will be ejected.
• DOE support ended in 2017, now running on international common funds (OCF).
• Looking for possibility of support and long term data preservation at
– CERN,
– GridKa (BABAR site for analysis and XRootD federated dataset main redirector),
– University of Victoria (BABAR site for analysis, documentation, and tools support).
• BABAR lightweight VMs come with the latest software release and xrootd client included, running under the most common virtual machine players. Just add the data via the GridKa main XRootD redirector.
• 2 PB of data on T10k-D tapes
– raw, processed, Monte Carlo
– Unique dataset at the Y(3S) resonance (no plan (yet?) to run at the Y(3S) @ Belle II)
• Full environment enclosed in VMs (SL5,SL6)
• ~1TB of documentation, repositories, and dataset information (DBs, cvs, wiki, html)
– Internal documents archived on INSPIRE
• 574 papers, ~10 papers/year over the past 3 years
• 231 members (semi-frozen author list)
– Including PhD students in Canada, Germany, Israel, Italy, Russia, US
– Associated theorists mine data to test new ideas
• ~20 analyses on track, ~10 more in the pipeline
– Continue to have new analyses every year including joint BABAR -Belle analyses
• Students analyze BABAR data while working on Belle II and other experiments in construction/commissioning phase
Slide 31
ISO 16363 certification of CERN
• ISO 16363 follows the OAIS breakdown:
3. Organisational Infrastructure;
4. Digital Object Management;
5. Infrastructure and Security Risk Management.
• Many of the elements in 3) and 5) are covered by existing (and documented) CERN practices
• Some “weak” areas – being addressed – include disaster preparedness / recovery (together with EIROforum)
• On-going “stage 1” external audit to highlight those areas requiring attention
• May just be a question of documentation, e.g. CERN is not going to change its financial practices (MTP etc.) as a result of ISO 16363!
Slide 32
Who does it benefit?
• Funding agencies, who can better judge if the money they are providing will be used according to their requirements
• e.g. FAIR DMPs, which call for preservation & re-use
• Data users to be able to determine the “trustworthiness” of the data (user surveys)
• Producers (e.g. LHC experiments) to understand how and what a repository does to preserve their data
The data of most CERN experiments is already lost!
• By number of experiments, not by volume
• CERN Greybook: 776 completed experiments, ~20 active
• “Preserved”: LEP(4), LHC(4)
O(10) vs O(1000)
Slide 33
HEP Community White Paper
• Focuses on the challenges of the next decade or so (LHC Run 3, HL-LHC Run 4)
• Massive increase in data rates and computational needs – way beyond technology predictions
• “bit preservation with an acceptably low error rate can now be considered a solved problem”
• Main areas of work now:
• Analysis capture (incl. workflows) and reproducibility
• “Open Data” at multi-PB scale and beyond
• Trying to do this in collaboration with others (e.g. RDA)
Does “Open Data” mean zero or low latency?
• People assume so – enormous implications!
Slide 34
Services are (just) services
• No matter how fantastic our { TDRs, PID services, Digital Library, Software repository } etc. are, they are there to support the users
Who have to do the really hard work!
E.g. write the software and documentation, acquire and analyse the data, write the scientific papers
• However, getting the degree of public recognition as at the Higgs discovery day was a target e-KPI!
Computing was thanked in the same way as the LHC & experiments
Slide 35
What is the future?
• Some hope that it may be possible to separate
long-term preservation of data at the bit level
from domain-specific aspects
• The former could benefit from economies of
scale and specialised knowledge in running
multi-PB / EB archives
The latter will continue to need expert
knowledge to revalidate on a regular basis
• Drive to reduce overhead through "domain
protocols" for DMPs
Bottom line: be collaborative to drive down costs
Slide 36
Input to next ESPP
• Certification as a “Very Trustworthy Digital
Repository” – exabytes & decades & changes
• Open Data – clarification(s); resources
• Reproducibility & Re-use
Resilience to and handling of change(s)
Slide 37
29 years of LEP – what does it tell us?
► Major migrations are unavoidable but hard to foresee!
► Data is not just “bits”, but also documentation, software + environment + “knowledge”
► “Collective knowledge” is particularly hard to capture
► Documentation “refreshed” after 20 years (1995) – now in Digital Library in PDF & PDF/A formats (was Postscript)
► Today’s “Big Data” may become tomorrow’s “peanuts”
► 100TB per LEP experiment: immensely challenging at the time; now “trivial” for both CPU and storage
► With time, hardware costs tend to zero
► O(CHF 1000) per experiment per year for archive storage
► Personnel costs tend to O(1 FTE) >> CHF 1000!
► Perhaps as little now as 0.1 – 0.2 FTE per LEP experiment to keep data + s/w alive (new analyses “cost extra”)
See DPHEP Workshop on “Full Costs of Curation”, January 2014:
https://indico.cern.ch/event/276820/
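The personnel-vs-storage point above can be made concrete with a back-of-envelope calculation; the cost per FTE-year below is an assumption added purely for illustration (it is not on the slide):

```python
# Numbers from the slide, plus one loudly assumed figure.
STORAGE_CHF_PER_EXPT_YEAR = 1_000   # O(CHF 1000) archive storage per experiment
FTE_PER_EXPT = 0.15                 # mid-point of the 0.1 - 0.2 FTE range
CHF_PER_FTE_YEAR = 150_000          # ASSUMED all-in personnel cost, illustrative

personnel_chf = FTE_PER_EXPT * CHF_PER_FTE_YEAR
ratio = personnel_chf / STORAGE_CHF_PER_EXPT_YEAR
print(f"personnel ~ CHF {personnel_chf:,.0f}/year, about {ratio:.0f}x the storage bill")
```

Even with a very different assumed FTE cost, personnel dominates storage by at least an order of magnitude, which is the slide’s point.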
Slide 38
Conclusions
• We are well on the way to implementing our 2020 vision using “standard” services
• VTDR & PIDs, Digital Libraries & DOIs, s/w preservation
• Services that are – or should be – offered in the EOSC*
But they are not “holistic” – “mind the gap(s)”
• And they will change over time – whatever people (especially in IT) pretend!
• Beware of "grey-backed gorillas"
==> Constant effort is needed – like with a bike
Slide 40
What Makes HEP Different?
• We throw away most of our data before it is even recorded – “triggers”
• Our detectors are relatively stable over long periods of time (years) – not “doubling every 6 or 18 months”
• We make “measurements” – not “observations”
• Our projects typically last for decades – we need to keep data usable during at least this length of time (but not necessarily “forever”)
• We have shared “data behind publications” for more than 30 years… (HEPData)
13th January 2014, A. Valassi – Objectivity Migration (slide 41)
ODBMS migration – overview (300TB)
A triple migration!
• Data format and software conversion from Objectivity/DB to Oracle
• Physical media migration from StorageTek 9940A to 9940B tapes
Took ~1 year to prepare; ~1 year to execute
Could never have been achieved without extensive system, database and application support!
Two experiments – many software packages and data sets
• COMPASS raw event data (300 TB)
Data taking continued after the migration, using the new Oracle software
• HARP raw event data (30 TB), event collections and conditions data
Data taking stopped in 2002, no need to port the event-writing infrastructure
In both cases, the migration was during the “lifetime” of the experiment
System integration tests validated read-back from the new storage
BABAR Highlights and Press Releases
November 2017
Dataset:
Y(4S): 433/fb
Y(3S): 30/fb
Y(2S): 14/fb
Off resonance: 10%
Y(1S) accessed via Y(2S,3S) → Y(1S) π+π–
June 2017