E biothon workshop 2014 04 15 v1

e-Biothon

V. Breton ([email protected]), LPC Clermont-Ferrand, IdGC

CNRS-IN2P3, http://france-grilles.fr

Credit: N. Bard, A. Franc, JF Gibrat

Extreme Performance Computational Science workshop, Tokyo, April 15th 2014


Table of contents


• What are the computing challenges of life sciences?

• France Grilles: a multidisciplinary distributed e-infrastructure for science

• E-Biothon: an HPC platform for research in life sciences

Generalities on sequencing

• Genome = DNA sequence (4 nucleotides: A, C, G, T)
– Smallest non-viral genome: Carsonella ruddii (0.16 Mbp)
– Largest genome: Polychaos dubium (670 Gbp)

Sanger technology: 500 bp sequences

454 technology: 10^5 reads of 450 to 600 bp

Illumina technology: 10^6 reads of 100 bp

Current projects (Tara): 10^7 reads of 100 to 400 bp
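As a rough order-of-magnitude check on the figures above, the raw volume of a read set is the number of reads times the read length. The sketch below assumes the read counts are 10^5, 10^6 and 10^7 (superscripts as they most plausibly appeared on the slide); the function names and the dictionary layout are illustrative only.

```python
def total_bases(n_reads, read_len_bp):
    """Total number of sequenced bases in a read set."""
    return n_reads * read_len_bp


def size_range_mbp(n_reads, min_len_bp, max_len_bp):
    """Lower and upper bound of the read-set volume, in megabases (Mbp)."""
    return (total_bases(n_reads, min_len_bp) / 1e6,
            total_bases(n_reads, max_len_bp) / 1e6)


# (number of reads, shortest read, longest read); counts are assumed orders of magnitude
DATASETS = {
    "454": (10**5, 450, 600),
    "Illumina": (10**6, 100, 100),
    "Tara": (10**7, 100, 400),
}

if __name__ == "__main__":
    for name, (n_reads, lo, hi) in DATASETS.items():
        low, high = size_range_mbp(n_reads, lo, hi)
        print(f"{name}: {low:.0f} to {high:.0f} Mbp")
```

Under these assumptions, a single Tara-scale read set reaches the gigabase range, which is the change in scale the slide points at.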

Explosion of data set size

Data analysis? Algorithms? Heuristics?

Tara @ http://oceans.taraexpeditions.org/

Evolution of sequencing techniques

Data production is distributed

2558 high-throughput « Next Generation » sequencers in the world, located in 920 centers (only 10 with more than 15 machines)

Source: omicspmaps.com

Data production grows faster than Moore’s law

Sequencing scenarios

• Interest in a new genome requires assembly
– The process of taking a large number of short DNA sequences and putting them back together to create a representation of the original
– Algorithms based on read overlapping benefit from large RAM (1 TB) -> HPC
• Working with a reference genome requires comparative analysis
– Alignment algorithms (BLAST) find regions of local similarity between sequences
– Phylogeny algorithms (PhyML) build evolutionary relationships between genomes
– Comparative analyses are easily parallelized at data level -> HTC
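To make "algorithms based on read overlapping" concrete, here is a minimal sketch of a greedy overlap assembler. It is a toy illustration, not the method of any particular assembler: it assumes error-free reads from a single strand, and the names `overlap` and `greedy_assemble` are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two reads with the largest suffix/prefix overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        # All-against-all comparison: this is the memory- and time-hungry step.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: the remaining contigs stay separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "CGTTAG"]))
```

Even in this toy version every read is compared with every other read and all of them are held in memory at once, which hints at why overlap-based assembly of millions of reads favours large shared-memory (HPC) machines.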

Summary

• Life sciences have specific computational challenges
– Data production grows faster than Moore’s law
– Permanent need to compare new data with existing data
• Life sciences needs can be addressed on multidisciplinary IT infrastructures (e-infrastructures)
– HPC resources are best suited to genome assembly
– Grid/cloud HTC resources are well suited to comparative analysis
• Life sciences are among the main users of the French national grid/cloud production infrastructure

France Grilles

• Is a Scientific Interest Group…
– Created in 2010 by 8 partners: CEA, CNRS, CPU, INRA, INRIA, INSERM, MESR, RENATER
– … to steer and coordinate the national strategy in the fields of grids and clouds
• Vision:
– Build and operate a national distributed computing infrastructure open to all sciences and to developing countries


France Grilles model

• France Grilles does not own the resources
– Resources are owned by user communities
• France Grilles provides a framework
– To share resources, expertise and know-how
– To promote innovation and initiatives
– To foster collaboration at national and international levels
– To reach out to the long tail of users


France Grilles resources

France-Grilles backbone: LCG-France

France-Grilles spine: CC-IN2P3


EGI from 2010 to 2013

2010-2013: from 14 regional to 34 operations centres in 53 countries, from 188,000 jobs/day with 80,000 cores on 250 Resource Centres to 1,200,000 jobs/day with 430,000 cores on 337 Resource Centres

Technologies
• Grids
• Clouds
• Desktops

Talk by S. Newhouse, Madrid, Sept. 2013

France Grilles, a partner of EGI

Provide a common framework to all user communities

Provide an open environment for fruitful disciplinary and multidisciplinary research


Over 1500 scientific publications, June 2010 – April 2014

Web portal

Users

479 registered users in Nov 2013 (175 in France)

Most used robot certificate in EGI (http://go.egi.eu/wiki.robot.users)

Neuro-image analysis

Cancer therapy simulation

Prostate radiotherapy plan simulated with GATE (L. Grevillot and D. Sarrut)

Image simulation

Echocardiography simulated with FIELD-II (O. Bernard et al.)

Modeling and optimization of distributed computing systems

Acceleration yielded by non-clairvoyant task replication (R. Ferreira da Silva et al.)

Brain tissue segmentation with Freesurfer

Scientific applications

Infrastructure

Supported by the EGI infrastructure

Uses the biomed VO (most used EGI VO for life sciences in 2013)

VIP accounts for ~25% of biomed's activity

VIP consumes ~50 CPU years every month

DIRAC

France-Grilles

Application as a service

File transfer to/from grid

Virtual Imaging Platform: http://www.creatis.insa-lyon.fr/vip


Collaborations with dedicated life sciences infrastructures

• Institut Français de Bioinformatique (computing and storage resources at IDRIS)

• France Génomique (computing and storage resources at TGCC)

• France Life Imaging (infrastructure for biomedical imaging)

• E-Biothon


• Telethon: annual fundraising event run by French media for the French Muscular Dystrophy Association (AFM)

• From Telethon to Decrypthon
– Computing infrastructure (IBM)
– Research projects (CNRS)
– Human resources (AFM)

• From Decrypthon to E-Biothon

E-Biothon: history

e-Biothon: an HPC platform for research in life sciences


[Diagram labels: user support; Blue Gene/P machines; technical support; Blue Gene/P operation; web access portal]

E-Biothon: infrastructure


• 2 IBM Blue Gene/P racks with 200 TB of storage
– 2x1024 4-core nodes
– up to 28 TFlops peak performance
• SysFera-DS web access to computing resources
• 2 modes:
– Standard (MPI)
– HTC (1024 independent tasks in parallel)
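The hardware figures above can be cross-checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the constant names are illustrative, and the per-core figure is derived from the quoted peak performance rather than taken from any IBM specification.

```python
RACKS = 2
NODES_PER_RACK = 1024
CORES_PER_NODE = 4
PEAK_TFLOPS = 28.0      # quoted peak for the whole platform
HTC_TASKS = 1024        # independent tasks in HTC mode

total_nodes = RACKS * NODES_PER_RACK
total_cores = total_nodes * CORES_PER_NODE

# Peak performance per core, in GFlops (1 TFlops = 1000 GFlops)
peak_gflops_per_core = PEAK_TFLOPS * 1000 / total_cores

if __name__ == "__main__":
    print(f"{total_nodes} nodes, {total_cores} cores")
    print(f"about {peak_gflops_per_core:.2f} GFlops peak per core")
    print(f"{total_nodes // HTC_TASKS} nodes per HTC task if tasks were spread over the full platform")
```

The last line is purely hypothetical: the slide does not say how HTC tasks are mapped onto nodes.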

The E-Biothon vision is to offer a service to user communities in life sciences

• 2013-2014: first 3 projects
– Jean-François Gibrat et al. (MIGALE platform, INRA Jouy-en-Josas)
– Olivier Gascuel, Stéphane Guindon and Vincent Lefort (CNRS Montpellier)
– Yec’han Laizet, Philippe Chaumeil, Jean-Marc Frigerio, Stéphanie Mariette, Sophie Gerber, Alain Franc (INRA BioGeCo – Bordeaux)
• > 2014: open call for projects (IFB)

Studying the synteny over a wide range of microbial genomes


• Definition: similar blocks of genes in the same relative positions in the genome

• Interest: Study of synteny can show how the genome is cut and pasted in the course of evolution

• The MIGALE team at INRA designed an analysis pipeline to compute synteny between 2 genomes and store it in a database

• E-Biothon impact: change in scale - capacity to compute synteny between 2000 complete bacterial genomes (7 million comparisons)
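Genome-against-genome comparisons are independent of each other, so they can be dealt out to separate workers without any communication between them. The sketch below shows one simple way to do this, round-robin over unordered genome pairs; it is not the MIGALE pipeline, the function name is hypothetical, and the way the actual pipeline counts a "comparison" is not detailed in the source.

```python
from itertools import combinations


def split_pairs(genomes, n_workers):
    """Distribute all unordered genome pairs round-robin over `n_workers` buckets."""
    buckets = [[] for _ in range(n_workers)]
    for idx, pair in enumerate(combinations(genomes, 2)):
        buckets[idx % n_workers].append(pair)
    return buckets


if __name__ == "__main__":
    genomes = ["g1", "g2", "g3", "g4", "g5"]   # placeholder genome identifiers
    for worker, pairs in enumerate(split_pairs(genomes, n_workers=3)):
        print(worker, pairs)
```

Each bucket can then be run as one independent task, which is the kind of workload an HTC mode with many parallel tasks is meant for.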

PhyML

Phylogenetics is the study of evolutionary relationships among groups of organisms

PhyML is software that estimates maximum likelihood phylogenies from alignments of nucleotide or amino acid sequences

PhyML's original publication in 2007 is the most cited in environment and ecology (> 6000 citations).

E-Biothon impact: change in scale in the resources made available to PhyML users

Characterizing biodiversity

According to botanical classification, biodiversity is organized into species, genera, families and orders: is this confirmed by the distances between sequences?

Study of biodiversity in French Guiana

16000 different tree species in the Amazonian forest (≈ 300 in Europe)

More biodiversity in 10000 m² of forest in French Guiana than in Europe

Decrypthon added value

Change in scale (from local Mesocenter in Bordeaux)

Millions of reads

Exact distance computation without heuristics (alignment scores)

Terabytes of data produced every week
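To illustrate what an exact, non-heuristic alignment score involves, here is a minimal dynamic-programming sketch. The source does not say which algorithm or scoring scheme is used; Smith-Waterman local alignment with a linear gap penalty and the parameter values below are assumptions chosen only for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences `a` and `b`.

    Fills the full dynamic-programming table row by row (no heuristic
    pruning), keeping only the previous row in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The cost is proportional to the product of the two sequence lengths for every pair of reads, so applying it to millions of reads without heuristics is what calls for a change in scale in computing resources.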

Conclusion

• Both HPC and HTC resources are increasingly needed to address life sciences data and computing challenges:
– As sequencing technologies keep evolving, data production grows faster than Moore’s law and is increasingly distributed
– Biological data need to be constantly compared to each other (phylogenetics, comparative genomics)
• France is developing complementary HPC and HTC infrastructures for life sciences
– Institut Français de Bioinformatique, France Génomique
– E-Biothon: an HPC platform for research in life sciences
– France Grilles: a multidisciplinary grid/cloud production infrastructure

2558 Next Generation Sequencers in the world

Are life sciences specific w.r.t computing?

What is specific to life sciences?
- As sequencing technologies keep evolving, data production grows faster than Moore’s law
- Biological data need to be constantly compared to each other (phylogenetics, comparative genomics)

What is not specific?
- Data production is distributed
- Multiscale modeling
