E biothon workshop 2014 04 15 v1
e-Biothon
V. Breton ([email protected]), LPC Clermont-Ferrand, IdGC
CNRS-IN2P3, http://france-grilles.fr
Credit: N. Bard, A. Franc, JF Gibrat
Extreme Performance Computational Science workshop, Tokyo, April 15th 2014
Table of contents
• What are the computing challenges of life sciences?
• France Grilles: a multidisciplinary distributed e-infrastructure for science
• E-Biothon: an HPC platform for research in life sciences
Generalities on sequencing
• Genome = DNA sequence (4 nucleotides: A, C, G, T)
– Smallest non-viral genome: Carsonella ruddii (0.16 Mbp)
– Largest genome: Polychaos dubium (670 Gbp)
• Sanger technology: 500 bp sequences
• 454 technology: 10^5 reads of 450 to 600 bp
• Illumina technology: 10^6 reads of 100 bp
• Current projects (Tara): 10^7 reads of 100 to 400 bp
Explosion of data set size
Data analysis? Algorithms? Heuristics?
Tara @ http://oceans.taraexpeditions.org/
Evolution of sequencing techniques
Data production is distributed
2558 high-throughput "Next Generation" sequencing facilities in the world, located in 920 centers (only 10 with more than 15 machines)
Source: omicspmaps.com
Data production grows faster than Moore’s law
Sequencing scenarios
• Interest in a new genome requires assembly
– The process of taking a large number of short DNA sequences and putting them back together to create a representation of the original
– Algorithms based on read overlapping benefit from large RAM (1 TB) -> HPC
• Working with a reference genome requires comparative analysis
– Alignment algorithms (BLAST) find regions of local similarity between sequences
– Phylogeny algorithms (PhyML) build evolutionary relationships between genomes
– Comparative analyses are easily parallelized at data level -> HTC
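To make the idea of assembly by read overlapping concrete, here is a minimal Python sketch. It is an illustration only, not the algorithm of any production assembler: the function names, the `min_len` threshold and the toy reads are assumptions made for this example. It repeatedly merges the two reads with the longest suffix/prefix overlap. Comparing every read with every other read is what makes the approach demanding on real data sets.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Greedily merge the pair of reads with the longest overlap until none is left."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        # All-against-all comparison: quadratic in the number of reads.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # remaining reads do not overlap: several contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    toy_reads = ["ATGCGT", "CGTTAC", "TACGGA"]  # hypothetical toy data
    print(greedy_assemble(toy_reads))
```

On the three toy reads, the overlaps "CGT" and "TAC" chain the reads into a single contig.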
Summary
• Life sciences have specific computational challenges
– Data production grows faster than Moore's law
– Permanent need to compare new data with existing data
• Life sciences needs can be addressed effectively on multidisciplinary IT infrastructures (e-infrastructures)
– HPC resources are best suited to genome assembly
– Grid/cloud HTC resources are well suited to comparative analysis
• Life sciences are among the main users of the French national grid/cloud production infrastructure
France Grilles
• Is a Scientific Interest Group…
– Created in 2010 by 8 partners: CEA, CNRS, CPU, INRA, INRIA, INSERM, MESR, RENATER…
– To steer and coordinate the national strategy in the fields of grids and clouds
• Vision:
– Build and operate a national distributed computing infrastructure open to all sciences and to developing countries
France Grilles model
• France Grilles does not own the resources
– Resources owned by user communities
• France Grilles provides a framework
– To share resources, expertise and know-how
– To promote innovation and initiatives
– To foster collaboration at national and international levels
– To reach out to the long tail of users
France Grilles resources
France-Grilles backbone: LCG-France
France-Grilles spine: CC-IN2P3
EGI from 2010 to 2013
2010-2013: from 14 regional to 34 operations centres in 53 countries
• From 188,000 jobs/day with 80,000 cores on 250 Resource Centres
• To 1,200,000 jobs/day with 430,000 cores on 337 Resource Centres
Technologies
• Grids
• Clouds
• Desktops
Talk by S. Newhouse, Madrid, Sept. 2013
France Grilles, a partner of EGI
Provide a common framework to all user communities
Provide an open environment for fruitful disciplinary and multidisciplinary research
Over 1500 scientific publications, June 2010 – April 2014 (chart, logarithmic scale from 1 to 1000)
Virtual Imaging Platform: http://www.creatis.insa-lyon.fr/vip
Web portal
• Application as a service
• File transfer to/from grid
Users
• 479 registered users in Nov 2013 (175 in France)
• Most used robot certificate in EGI (http://go.egi.eu/wiki.robot.users)
Scientific applications
• Neuro-image analysis: brain tissue segmentation with Freesurfer
• Cancer therapy simulation: prostate radiotherapy plan simulated with GATE (L. Grevillot and D. Sarrut)
• Image simulation: echocardiography simulated with FIELD-II (O. Bernard et al.)
• Modeling and optimization of distributed computing systems: acceleration yielded by non-clairvoyant task replication (R. Ferreira da Silva et al.)
Infrastructure
• Supported by EGI infrastructure
• Uses biomed VO (most used EGI VO for life sciences in 2013)
• VIP accounts for ~25% of biomed's activity
• VIP consumes ~50 CPU years every month
• DIRAC, France-Grilles
Collaborations with dedicated life sciences infrastructures
• Institut Français de Bioinformatique (computing and storage resources at IDRIS)
• France Génomique (computing and storage resources at TGCC)
• France Life Imaging (infrastructure for biomedical imaging)
• E-Biothon
E-Biothon: history
• Telethon: every year, fundraising by French media for the French Muscular Dystrophy Association (AFM)
• From Telethon to Decrypthon
– Computing infrastructure (IBM)
– Research projects (CNRS)
– Human resources (AFM)
• From Decrypthon to E-Biothon
e-Biothon: an HPC platform for research in life sciences
E-Biothon: infrastructure
• Blue Gene/P machines
• Blue Gene/P operation
• Technical support
• User support
• Web access portal
• 2 Blue Gene/P IBM racks with 200 TB storage
– 2x1024 4-core nodes
– Up to 28 TFlops peak performance
• SysFera-DS web access to computing resources
• 2 modes:
– Standard (MPI)
– HTC (1024 independent tasks in parallel)
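As an illustration of what the HTC mode lends itself to, the sketch below splits an all-against-all comparison into independent batches, one per task. It is a minimal example under stated assumptions: the function name `split_into_tasks` and the genome identifiers are hypothetical, and it does not use or describe the SysFera-DS interface. Because no batch depends on another, each one can run as a separate task.

```python
from itertools import combinations


def split_into_tasks(items, n_tasks):
    """Deal all unordered pairs of `items` round-robin into `n_tasks` batches.

    Each batch is independent of the others, so the batches can be
    processed in parallel without any communication.
    """
    batches = [[] for _ in range(n_tasks)]
    for idx, pair in enumerate(combinations(items, 2)):
        batches[idx % n_tasks].append(pair)
    return batches


if __name__ == "__main__":
    genomes = [f"genome_{i}" for i in range(100)]  # hypothetical identifiers
    batches = split_into_tasks(genomes, 1024)      # 1024 tasks, as in HTC mode
    sizes = [len(b) for b in batches]
    print(sum(sizes), min(sizes), max(sizes))
```

Round-robin dealing keeps batch sizes within one pair of each other. That is a reasonable default when every comparison costs about the same, but a different split would suit comparisons of very uneven cost.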
E-Biothon vision is to offer a service to the user communities in life sciences
• 2013-2014: first 3 projects
– Jean-François Gibrat et al. (MIGALE platform, INRA Jouy-en-Josas)
– Olivier Gascuel, Stéphane Guindon and Vincent Lefort (CNRS Montpellier)
– Yec’han Laizet, Philippe Chaumeil, Jean-Marc Frigerio, Stéphanie Mariette, Sophie Gerber, Alain Franc (INRA BioGeCo, Bordeaux)
• > 2014: open call for projects (IFB)
Studying the synteny over a wide range of microbial genomes
• Definition: similar blocks of genes in the same relative positions in the genome
• Interest: Study of synteny can show how the genome is cut and pasted in the course of evolution
• The MIGALE team at INRA designed an analysis pipeline to compute synteny between 2 genomes and store it in a database
• E-Biothon impact: change in scale, with the capacity to compute synteny between 2000 complete bacterial genomes (7 million comparisons)
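The definition of synteny above can be illustrated with a small Python sketch. This is not the MIGALE pipeline, whose details are not given in these slides. It is a simplified example that assumes each genome is an ordered list of gene identifiers, that orthologous genes share the same identifier, and that each gene occurs once per genome. The function name and the `min_genes` threshold are hypothetical. Only blocks in the same orientation are detected.

```python
def synteny_blocks(genome_a, genome_b, min_genes=2):
    """Return maximal runs of genes that are consecutive in both genomes."""
    # Position of every gene in genome_b (assumes each gene occurs once).
    pos_b = {gene: i for i, gene in enumerate(genome_b)}
    blocks, current = [], []
    for gene in genome_a:
        extends = (
            gene in pos_b
            and current
            and pos_b[gene] == pos_b[current[-1]] + 1
        )
        if extends:
            current.append(gene)
        else:
            if len(current) >= min_genes:
                blocks.append(current)
            current = [gene] if gene in pos_b else []
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    a = ["g1", "g2", "g3", "g4", "g5", "g6"]
    b = ["g4", "g5", "g6", "x", "g1", "g2", "g3"]  # same blocks, rearranged
    print(synteny_blocks(a, b))
```

In the toy example the two genomes share the blocks g1-g2-g3 and g4-g5-g6, but in a different order. This is the kind of "cut and paste" rearrangement that synteny analysis reveals.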
PhyML
Phylogenetics is the study of evolutionary relationships among groups of organisms
PhyML is software that estimates maximum likelihood phylogenies from alignments of nucleotide or amino acid sequences
The original PhyML publication (2007) is the most cited in environment and ecology (> 6000 citations).
E-Biothon impact: change in scale in the resources made available to PhyML users
Characterizing biodiversity
According to botanical theory, biodiversity is organized into species, genera, families and orders: is this confirmed by the distance between sequences?
Study of biodiversity in French Guiana: 16000 different tree species in the Amazonian forest (≈ 300 in Europe)
More biodiversity in 10000 m² of forest in French Guiana than in Europe
Decrypthon added value
Change in scale (from local Mesocenter in Bordeaux)
Millions of reads
Exact distance computation without heuristics (alignment scores)
Terabytes of data produced every week
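To show what an exact, non-heuristic alignment score involves, here is a minimal Python sketch of Smith-Waterman local alignment scoring by full dynamic programming. The scoring parameters (match +2, mismatch -1, gap -2) and the function name are assumptions chosen for the example. The slides do not specify the scoring scheme used in the project, nor how scores are converted into distances.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences `a` and `b`.

    Fills the full dynamic-programming table row by row (no heuristic
    shortcut), keeping only two rows in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never goes below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("ACGT", "TTACGTTT"))
```

The cost is proportional to the product of the two sequence lengths for every pair of reads. With millions of reads, this explains why exact computation calls for a change in scale.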
Conclusion
• Both HPC and HTC resources are increasingly needed to address life sciences data and computing challenges:
– As sequencing technologies keep evolving, data production grows faster than Moore's law and is increasingly distributed
– Biological data need to be constantly compared to each other (phylogenetics, comparative genomic analysis)
• France is developing complementary HPC and HTC infrastructures for life sciences
– Institut Français de Bioinformatique, France Génomique
– E-Biothon: an HPC platform for research in life sciences
– France Grilles: a multidisciplinary grid/cloud production infrastructure
2558 Next Generation Sequencers in the world
Are life sciences specific w.r.t. computing?
What is specific to life sciences:
- As sequencing technologies keep evolving, data production grows faster than Moore's law
- Biological data need to be constantly compared to each other (phylogenetics, comparative genomic analysis)
What is not specific?
- Data production is distributed
- Multiscale modeling