Science Demonstrator Panel Session 1 on Life Sciences

PanCancer Science Demonstrator - Sergei Yakneen, EMBL

www.eoscpilot.eu

The European Open Science Cloud for Research pilot project is funded by the European Commission, DG Research & Innovation under contract no. 739563

The Science Challenge

- Collect Next Generation Sequencing Data from several cohorts of cancer patients generated at multiple sequencing centres and across multiple cancer types.

- Reanalyze the data using a uniform and consistent data processing pipeline utilizing established best practices from the International Cancer Genome Consortium (ICGC).

- Analyze the integrated data set to identify patterns of germline and somatic mutation that act across cancer types in a PanCancer fashion.

The Science Demonstrator

- Utilize Butler, a cloud-based large-scale scientific workflow framework developed in the context of ICGC’s Pancancer Analysis of Whole Genomes project, to perform a coordinated data analysis across multiple clouds.
  - Code - https://github.com/llevar/butler
  - Paper - https://doi.org/10.1101/185736

- Perform automated repeatable deployments and configuration of the entire processing infrastructure at three academic cloud computing environments.
  - EMBL-EBI Embassy Cloud
  - ComputeCanada West Cloud
  - Cyfronet

- Deliver a large dataset (>50 TB) to each cloud computing centre.

- Use Butler to run PanCancer pipelines and monitor progress.

Successes

|          | EMBL/EBI Embassy | Compute Canada | Cyfronet |
|----------|------------------|----------------|----------|
| vCPU     | 1000 | 1000 | 700 |
| RAM      | 4 TB | 4 TB | 2.6 TB |
| Disk     | 1 PB | 150 TB | 200 TB |
| Data     | 448 samples from 224 prostate cancer donors | 422 samples from 211 pediatric brain tumour donors | 2081 samples from 1000 Genomes Project |
| Raw data | 71 TB | 62 TB | 50 TB |
| Status   | Alignment and variant calling completed | Alignment and variant calling completed | Alignment completed |

- Developed configurations for each cloud - https://github.com/llevar/eosc_pilot

- Developed extensive documentation and examples - https://butler.readthedocs.io/en/latest/

- Developed Butler self-healing capabilities.

- Performed data staging via Cyfronet Onedata.

Issues

- Biggest issue encountered by the SD was the initial shortage of resources for operating at “cloud scale”.
  - Used 20% of the data set that was utilized for PCAWG
  - < 0.5% of the data set for the 100k Genomes Project

- Repeatable provisioning of large clusters of VMs.
  - >10% of provisioning jobs experience failures

- Data movement and staging.
  - A 50 TB data set takes up to two weeks to move between locations
  - Genomics data requires encryption and network security measures
  - Shared access to network-accessible storage creates processing bottlenecks.
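
For a sense of scale, the two-week figure for a 50 TB data set can be converted into an effective transfer rate. The sketch below is plain arithmetic, assuming decimal terabytes (1 TB = 10^12 bytes) and a continuous 14-day transfer; the function name is illustrative and not part of any tool mentioned here.

```python
def effective_rate(total_bytes: float, days: float) -> tuple[float, float]:
    """Return (megabytes per second, megabits per second) for a sustained transfer."""
    seconds = days * 24 * 60 * 60
    bytes_per_second = total_bytes / seconds
    return bytes_per_second / 1e6, bytes_per_second * 8 / 1e6


# Assumption: 50 TB means 50e12 bytes, moved over 14 full days.
mb_s, mbit_s = effective_rate(50e12, 14)
print(f"{mb_s:.1f} MB/s, {mbit_s:.0f} Mbit/s")
```

Under these assumptions the sustained rate is roughly 41 MB/s, about a third of a gigabit per second.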

Lessons Learned

- Effectively supporting life sciences use cases like cancer genomics will require A LOT of resources.

- Diverse data-sets have diverse data handling requirements, thus it is better to provide a variety of tools to make solutions with rather than a single “solution”.

- Automated detection and resolution of issues with infrastructure (a la Butler self-healing) are imperative for effective operation at cloud-scale.
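
The idea of automated detection and resolution can be illustrated with a small retry loop. This is a generic sketch and not Butler's implementation; `run_with_healing`, `provision` and the remediation callback are hypothetical names introduced only for this example.

```python
from typing import Callable


def run_with_healing(task: Callable[[], str],
                     remediate: Callable[[Exception], None],
                     max_attempts: int = 3) -> str:
    """Run task; on failure call remediate and retry, up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as error:
            if attempt == max_attempts:
                raise  # give up and surface the last failure
            remediate(error)  # e.g. tear down and recreate a broken VM
    raise ValueError("max_attempts must be at least 1")


# Toy usage: a provisioning step that fails twice before succeeding.
state = {"failures_left": 2}
repairs: list[str] = []


def provision() -> str:
    if state["failures_left"] > 0:
        state["failures_left"] -= 1
        raise RuntimeError("VM failed to boot")
    return "provisioned"


result = run_with_healing(provision, lambda error: repairs.append(str(error)))
print(result, len(repairs))
```

The remediation step is kept as a callback so that different failure types can be handled by different repair actions.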

EGA – FAIR Genomic Datasets

Tony Wildish on behalf of Nino Spataro and the EGA-CRG team

The Science Challenge

The principal objectives of our SD are:

i. Test the feasibility of data reproducibility in genomics

ii. Prove the possibility to remaster genomic datasets

iii. Render genomic datasets more FAIR

The Science Demonstrator

How we did it:

◆ Implementing portable containerized genomic pipelines

◆ Using a language enabling scalable and reproducible scientific workflows (Nextflow, available at: https://www.nextflow.io/)

◆ Storing the pipelines in a public repository together with metadata describing each pipeline step and the tools and versions used

Successes

✓ Genomic pipelines portability

Pipelines were successfully implemented and executed in a third-party infrastructure.

✓ Genomic pipelines FAIRification

Pipelines were deposited jointly with metadata describing the variables relevant for pipeline description and re-use.

Pipelines available at: https://dockstore.org/workflows/github.com/CRG-CNAG/EOSC-Pilot

✓ Feasibility of reproducibility and remastering in genomics

Overall, 97.38% of the obtained variants are shared and 99.66% of the called genotypes agreed perfectly.
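
The slide does not say how these two percentages were computed. The sketch below shows one plausible way to derive such figures, under stated assumptions: a variant is identified by chromosome, position, reference and alternate allele; "shared" is measured against the union of both call sets; and genotype agreement is measured only over shared variants. The data is made up for illustration.

```python
Variant = tuple[str, int, str, str]  # (chromosome, position, ref allele, alt allele)


def concordance(original: dict[Variant, str],
                rerun: dict[Variant, str]) -> tuple[float, float]:
    """Return (fraction of variants shared, fraction of shared genotypes that agree)."""
    shared = original.keys() & rerun.keys()
    union = original.keys() | rerun.keys()
    if not union:
        return 0.0, 0.0
    shared_fraction = len(shared) / len(union)
    if not shared:
        return shared_fraction, 0.0
    agreeing = sum(1 for variant in shared if original[variant] == rerun[variant])
    return shared_fraction, agreeing / len(shared)


# Toy call sets (made-up values, not data from the demonstrator).
original = {("1", 100, "A", "G"): "0/1",
            ("1", 200, "C", "T"): "1/1",
            ("2", 50, "G", "A"): "0/1"}
rerun = {("1", 100, "A", "G"): "0/1",
         ("1", 200, "C", "T"): "0/1",
         ("3", 10, "T", "C"): "0/1"}

shared_fraction, genotype_agreement = concordance(original, rerun)
print(shared_fraction, genotype_agreement)
```

Other definitions are possible, for example measuring shared variants against the original call set only, and they would give different numbers for the same data.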

Issues

✓ Original versions of some software were unavailable

Solved by using the closest available version

✓ Size of the selected dataset to replicate

Solved by limiting replication to a subset of the original data

Time-consuming understanding of original pipelines

The absence of consolidated standards to store and describe the original pipelines slowed down the pipeline implementation process

Lessons Learned

➢ Reproducibility is a time-consuming task on both the implementation and computational side.

➢ Universal methods to describe pipelines are required along with long-term repositories to keep the whole experiment reproducible.

➢ A FAIR-compliant semantic repository on which to represent objects and their relationships is missing in the EOSC ecosystem.

➢ Open science is still not perceived as a scientific obligation by scientific stakeholders. Continuous training and education is required to form a new generation of scientists.

CryoEM

Carlos Oscar Sorzano (CSIC)

The Science Challenge

The CryoEM demonstrator aims to improve the reproducibility of image processing workflows by producing a Scipion workflow file that describes every image processing step. This allows the same results to be reproduced in full when the data is reprocessed outside the microscope facility. The description can also be uploaded to public databases, so that other users can understand the process followed to obtain a given structure.

The Science Demonstrator

• Adapt Scipion (an image processing workflow engine) to thoroughly report all inputs, outputs, and parameters used in a JSON file, so that the same processing can be reproduced.

• Adapt Scipion to reproduce an already existing workflow, producing exactly the same results as in the first run.

• Connect Scipion to a public database (Electron Microscopy Data Bank) to allow users to submit their results automatically.

• Allow other users to visualize the workflow performed by other scientists.
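
A workflow report of the kind described in the first bullet could look roughly like the following. The field names, step names and parameters are illustrative assumptions and do not reflect Scipion's actual file format; the sketch only shows that recording inputs, outputs and parameters per step is enough to reload and sanity-check a workflow.

```python
import json

# Hypothetical workflow description: one entry per processing step.
workflow = [
    {
        "step": "import_movies",
        "inputs": [],
        "parameters": {"sampling_rate": 1.0},
        "outputs": ["movies"],
    },
    {
        "step": "align_movies",
        "inputs": ["movies"],
        "parameters": {"bin_factor": 2},
        "outputs": ["micrographs"],
    },
]


def check_inputs(steps: list[dict]) -> bool:
    """Check that every step's inputs were produced by an earlier step."""
    available: set[str] = set()
    for step in steps:
        if not set(step["inputs"]) <= available:
            return False
        available.update(step["outputs"])
    return True


# Round-trip through JSON, as a stand-in for writing and re-reading the file.
serialized = json.dumps(workflow, indent=2, sort_keys=True)
restored = json.loads(serialized)
print(restored == workflow, check_inputs(restored))
```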

Successes

Issues

1. Create a public repository of acquisition metadata and image processing workflows for new acquisitions, as a temporary repository until the data is finally analyzed and deposited in the standard public databases (EMDB and EMPIAR).

2. Create an authentication policy such that biologists leaving an EM facility can continue the image processing on some of the EOSC cloud machines.

Lessons Learned

• There is a big gap between technological advances and their adoption by EU facilities and scientists, much of it due to funding:

  • Local resources for stream processing

  • Existence of temporary repositories

  • Access to high-end computer clusters

• There is a gap between open science promotion and the obligation of facilities to keep and disclose publicly funded data.

Bioimaging

Beatriz Serrano-Solano

Jean-Karim Hériché

The Science Challenge

▸ Biological images contain more information than described in their original publications.

▸ Re-analyzing the images with machine learning algorithms can extract new knowledge from these unexploited resources.

The Science Demonstrator

Successes

Issues

Lessons Learned

▸ EOSC Ecosystem

  ▸ Technical

    ▸ Lack of high-performance file system

    ▸ Lack of big memory machines (1 TB of RAM)

  ▸ Services

    ▸ User-unfriendly deployment and set-up (e.g. ElastiCluster)

    ▸ Inadequate training

It would have been more efficient to use the local HPC

Photon and Neutron

Michael Schuh, DESY

The Science Challenge

Data

● Volume of hundreds of PBs

● Fast data ingest, tens of GB/s per detector

● File creation at kHz rates

Computing

● Fast resources for immediate online analysis, monitoring running experiments

● Highly specialized offline analysis frameworks used in physics, chemistry, materials science, biology, nanotechnology

Policy

● Data Management Plans

● Sharing of FAIR data, methods, results between users, sites and communities

● Control access during data embargos

● Persistence, long term archival

Images: desy.de/~twhite/crystfel, cid.cfel.de/research/femtosecond_crystallography

The Science Demonstrator

Motivation: Data sets too large to take home

○ Execute codes on cloud resources close to the data, avoid downloading large amounts of data to user systems

Solution: IaaS and PaaS

○ No stack implementation by the user

○ Efficient resource management

○ Prepare federation of DESY OpenStack as EOSC resource

CaaS

○ Libraries for containerized software, tools and functions

○ Run user defined software stacks

○ Container orchestration

FaaS

○ Containers as cloud functions

Service oriented architecture with cloud computing technologies

Successes

Automated data processing

● Data comes in, FaaS automatically triggered

  ○ Create derived data

  ○ Extract metadata

Interactive data analysis

● Share and re-use complete workflows

● Jupyter Notebooks as graphical frontend, run anywhere from EOSC to small remote system

● Notebooks and functions published and continuously integrated via GitLab/Docker
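
The automated-processing pattern on this slide (data arrives, a function is triggered, metadata is extracted) can be sketched without reference to any particular FaaS platform. The handler registry, decorator and file name below are hypothetical; a real deployment would rely on the platform's own trigger mechanism rather than a direct function call.

```python
import hashlib
import os
import tempfile
from typing import Callable

Handler = Callable[[str], dict]
handlers: list[Handler] = []


def on_new_file(func: Handler) -> Handler:
    """Register func to be called for every newly arrived file."""
    handlers.append(func)
    return func


@on_new_file
def extract_metadata(path: str) -> dict:
    """Derive simple metadata (name, size, checksum) from a file."""
    with open(path, "rb") as handle:
        data = handle.read()
    return {
        "name": os.path.basename(path),
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def ingest(path: str) -> list[dict]:
    """Simulate the trigger: run every registered handler on the new file."""
    return [handler(path) for handler in handlers]


# Toy usage with a temporary file standing in for detector output.
with tempfile.TemporaryDirectory() as directory:
    sample = os.path.join(directory, "frame_0001.dat")
    with open(sample, "wb") as handle:
        handle.write(b"\x00" * 1024)
    results = ingest(sample)

print(results[0]["name"], results[0]["size_bytes"])
```

Keeping each handler a small, stateless function of the file path is what makes it straightforward to package as a container and run as a cloud function.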

Issues

● Fully integrated template solutions (Magnum/Heat, TOSCA) for scaling COE clusters (Docker Swarm, Kubernetes, Mesos) are still cumbersome.

  ○ EOSC can do a great job in facilitating this with a good cluster-on-demand service as an open science solution

● Cloud Functions (FaaS) have proven to be a good solution for short-running functions and micro-services. Integration with present HPC and HTC systems is still undefined; request routing based on job profile needs research.

  ○ Submitting into present HPC clusters

  ○ Virtualizing HPC clusters in the EOSC on demand

● Many licenses are not aware of new container distribution channels and deployments as cloud functions, as a service.

● An integrated AAI solution is needed, both technically and policy-wise

● Will EOSC provide cloud application building blocks?

  ○ Container registries

  ○ Message hubs

  ○ GitLab

  ○ JupyterHub

Lessons Learned

● Scaling highly specialized scientific applications takes effort: splitting into micro-services, containerizing, cloud deployments.

  ○ Strengthen co-development between cloud, infrastructure, platform DevOps and software developers as well as data analysts.

● User interaction feels different with graphical applications; window forwarding from cloud resources is often low-performing.

  ○ Clearly define where batch, headless, API-ready and GUI applications are in focus.

● Fully templated virtualized HPC cluster solutions are still to emerge; the same holds for native deployments and for container clusters.

  ○ EOSC to provide collaborative templates as know-how as well as cluster-on-demand solutions.

  ○ EOSC to provide sufficient resources for large-scale deployments suitable for big data.