ELIXIR: Data Challenges in the Life...

European Life Sciences Infrastructure for Biological Information www.elixir-europe.org ELIXIR: Data Challenges in the Life Sciences e-IRG workshop, Athens, 9-10 June 2014 Andrew Smith ELIXIR Hub

Page 1: ELIXIR: Data Challenges in the Life Sciences · e-irg.eu/documents/10920/260645/elixir_e-irg_andy_final.pdf · 2014-12-02

European Life Sciences Infrastructure for Biological Information www.elixir-europe.org

ELIXIR: Data Challenges in the Life Sciences

e-IRG workshop, Athens, 9-10 June 2014

Andrew Smith ELIXIR Hub

Page 2:

ELIXIR’s mission

To build a sustainable European infrastructure for biological information, supporting life science research and its translation to:

• medicine

• environment

• bioindustries

• society

Page 3:

The potential

Page 4:

Genome-wide analysis of crop plants

• Population growth and climate change are major challenges to food security.

• Traditional routes to crop improvement are too slow to keep up with this increase in demand.

• Understanding plant genomes helps us identify which species will be most tolerant to drought, salt and pests while still providing optimum nutrition.

Page 5:

Matching the treatment to the cancer

• One in 10 women in the EU-27 will develop breast cancer before the age of 80.

• If we can identify patterns of genes that are active in different tumours, we can diagnose and treat cancers earlier.

Page 6:

The challenges

Page 7:

Growing data

[Figure: archive size in bytes (log scale, 1E+08 to 1E+16) plotted against date (2002 to 2016) for EGA, ENA, PRIDE, MetaboLights and ArrayExpress, with annotated doubling times of 12, 18, 4 and 3 months.]
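To make the doubling times on the growth chart concrete, the sketch below converts a doubling time into a yearly growth factor and into the time needed for a thousand-fold increase. It assumes steady exponential growth; the function names are illustrative and do not come from the slides, and the slides do not say which doubling time belongs to which archive.

```python
import math


def growth_factor(doubling_months: float, elapsed_months: float) -> float:
    """Multiplicative growth after elapsed_months, given a fixed doubling time."""
    return 2.0 ** (elapsed_months / doubling_months)


def months_to_grow(factor: float, doubling_months: float) -> float:
    """Months needed to grow by `factor`, given a fixed doubling time."""
    return doubling_months * math.log2(factor)


# Doubling times as annotated on the chart (in months).
for doubling in (18, 12, 4, 3):
    per_year = growth_factor(doubling, 12)
    to_1000x = months_to_grow(1000, doubling)
    print(f"{doubling:>2}-month doubling: x{per_year:.2f} per year, "
          f"x1000 in {to_1000x:.0f} months")
```

Under this assumption, a 3-month doubling time means 16-fold growth in a year, against roughly 1.6-fold at an 18-month doubling time.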

Page 8:

The data challenge: geography

• Data production is increasing at sites across Europe

• European Illumina sales up 20% in 2013

Source: http://omicsmaps.com

Page 9:

Data resources in life science

• Diverse

• Many

• Dispersed

~1800 molecular biology data resources, by category:

• Genomics Databases (non-vertebrate) (17.9%)
• Protein sequence databases (12.9%)
• Human Genes and Diseases (9.8%)
• Structure Databases (9.7%)
• Metabolic and Signaling Pathways (9.3%)
• Nucleotide Sequence Databases (8.8%)
• Human and other Vertebrate Genomes (7.1%)
• Plant databases (7.1%)
• RNA sequence databases (4.9%)
• Microarray and other Gene Expression Databases (4.5%)
• Other Molecular Biology Databases (3.3%)
• Immunological databases (1.8%)
• Organelle databases (1.6%)
• Proteomics Resources (1.2%)
• Cell biology (0.2%)

Source: Nucleic Acids Research annual Database Issue and the NAR online Molecular Biology Database Collection in 2012. MY Galperin, GR Cochrane – Nucleic Acids Research, 2011

Page 10:

Users are global

Source: EMBL-EBI Live Data Map

Page 11:

The policy drivers: Open Access to data

• Open access to life science data is essential for advances in many areas of research

• It provides a valuable path to discovery, one that in many other areas of research is limited by commercial confidentiality

• National funders increasingly require researchers to make data open

• EC’s H2020 pilot on Open Research Data and Data Management Plans

Page 12:

The response

Page 13:

Infrastructure for Life Sciences

• Tools: Access, Search, Analysis

• Standards: Formats, Ontologies, Guidelines

• Data: Integration, Optimization, Privacy

• Compute: Storage, Network, Computing

• Training

Page 14:

ELIXIR’s structure

• Tools

• Standards

• Data

• Compute

• Training

• Industry

Page 15:

ELIXIR Nodes

Page 17:

Training

For Big Data to become huge, however, there are still hurdles to leap. For one thing, the tools to analyse data are not yet good enough. And people with the skills to analyse data are scarce and will become scarcer. By 2018 there will be a “talent gap” of between 140,000 and 190,000 people, …

Page 18:

ELIXIR pilots addressed key challenges in biomedical research

1. Cloud computing “Embassy cloud”: Access reference data in a virtual environment – work as though you are at EMBL-EBI or SIB, Switzerland

2. Authentication & Authorisation: Improved methods and processes for access to clinical data

Page 19:

Identifying new drug targets

ELIXIR pilot: Interoperability of high-resolution protein data at EMBL-EBI and HPA, Sweden

The Human Protein Atlas portal is a publicly available database with millions of high-resolution images showing the spatial distribution of proteins in 46 different normal human tissues and 20 different cancer types, as well as 47 different human cell lines.

Page 20:

European ELIXIR Data - “LightPath” (EBI / CSC)

• To explore the replication of large-scale (petabyte-scale) archives to remote sites

• To create a separate source of data files for challenging data I/O projects

• Selection of pilot data transfer technology between EBI and CSC

• Established a dedicated light path between datacenters in London and Kajaani

• Development of a model for future I/O needs in the life sciences in Europe

Page 21:

Cross-site VM Operation - pilot

• Perform analysis via cloud infrastructures and VMs

• Transfer VMs between computing centers to allow researchers to perform analyses that they could not otherwise do locally

• Supported by 5 NRENs and in collaboration with

Page 22:

Cross-site VM Operation

[Diagram: three sites (CSC, EMBL-EBI, University of Groningen), each with data, analysis tools and computation, running VMs. Labelled resources: Chipster (200GB), NBIC Galaxy (50GB), GoNL (60TB), ENA (3.2PB). The sites are connected by 1GB lightpaths over the Funet, Janet and SURFnet networks.]
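A rough calculation shows why moving small VMs to large datasets can be more practical than moving the datasets. The sketch below estimates bulk transfer times for the sizes shown on this slide. It assumes that "1GB lightpath" means a link of 1 gigabit per second, that sizes use decimal units, and that the link is fully used with no protocol overhead; none of these assumptions is stated on the slide, and the function name is illustrative.

```python
GB, TB, PB = 1e9, 1e12, 1e15   # decimal units (assumption)
LINK_BITS_PER_S = 1e9          # assumed reading of "1GB lightpath": 1 Gbit/s


def transfer_days(size_bytes: float, link_bits_per_s: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move size_bytes over a link used at the given efficiency."""
    seconds = size_bytes * 8 / (link_bits_per_s * efficiency)
    return seconds / 86_400


# Sizes as labelled on the slide.
datasets = {
    "NBIC Galaxy": 50 * GB,
    "Chipster": 200 * GB,
    "GoNL": 60 * TB,
    "ENA": 3.2 * PB,
}

for name, size in datasets.items():
    days = transfer_days(size, LINK_BITS_PER_S)
    print(f"{name:<12} {days * 24:>10.2f} hours ({days:.2f} days)")
```

Under these assumptions the 50GB and 200GB tool images move in minutes, the 60TB dataset takes several days, and the 3.2PB archive would take most of a year.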

Page 23:

European Research Infrastructures

• ICT: e-infrastructures

• LS: life sciences

Page 24:

Knowledge exchange workshop

Discussion of big data challenges in life sciences

Focus on a few representative domains

Looking 5 years ahead

Jointly identify potential solutions to our problems

[Diagram: data at the centre, linking ICT e-infrastructures (transfer, computation, storage) with LS life sciences (physical facilities, scientific information).]

Page 25:

Thank you