Metagenomic data analysis From reads to biomarkers
Meteor, MetaOMineR and future…
Nicolas Pons
Emmanuelle Le Chatelier Magali Berland
2015-09-25 – Bioinformatique du Centre de Jouy-en-Josas
Human intestinal microbiota: a forgotten organ
from “sterile” humans at birth to 2kg of microorganisms,
out of the 100 trillion cells in the human body, only 1 in 10 is human.
“education” of innate immune defenses
Immune system colonization resistance
terminal differentiation of mucosa
epithelial “homeostasis”
Gut interface
food degradation
vitamin production
Metabolism
energy extraction
90 % microbes
10 % human cells
A major role in health and disease
Bach JF, N Eng J Med 2002
Scientific objectives
Compare multiple samples from the same study / project: stratification of the individuals & personalized medicine; health vs. disease biomarkers
Mine multiple projects for novel discoveries and associations
sick healthy
Analyse a single sample to identify organisms / genes / functions present
Most microorganisms are unknown and uncultivable…
a largely unknown ecosystem
< 30% cultivable
huge data
500-1000 dominant bacterial species
Metagenomics genes inventory
and quantification
sample collection
sequencing reference construction
gene profiling
sequence mapping onto gene catalogue
stool sample
total DNA library preparation
SoLiD/Proton/Illumina sequencing
short sequences 30-50 million
reference gene catalogue
gene counts (matrix of genes × individuals 1-4)
bioinformatics / statistics analyses
preprocessing / normalization & dimension reduction
Quantitative metagenomics pipeline
catalogue structuring
relation with clinical data
Identify clinically relevant groups &
ecosystems
build and test prediction models
reference catalogue
Adapted from Gevers, et al, 2012, Plos One
MetaHit
1.5Tb
MetaHit HMP
and others
5.1M genes
9.9M genes
3.9M genes
3.3M genes
Multiplication of the catalogues
Qin et al Nature, 2010
New assembly needed for specific study
LiverC 5.4M
Baby 1M
Quantitative metagenomics pipeline
SOMA, MGS canopy, PAMA
MGS CANOPY: http://git.dworzynski.eu/mgs-canopy-algorithm
METEOR APP: IDDN.FR.001.420008.000.R.P.2013.000.30000
MetaOMiner APP: IDDN.FR.001.220005.000.R.P.2014.000.10000
Meteor specifications
• Input data: SOLiD, Illumina, Proton
• Quality and adaptor/barcode/contaminant filtering
• Mapping against very large references (>10M genes)
• Several kinds of gene abundance measures
• Gene abundance normalization
• Output data (report and abundance table): AdvantageDB, text file, Excel
• Experiment branch system with workflow
• Traceability
• Developed in Delphi (Windows)
• Command-line program (cluster) and GUI (desktop)
• Distributed with user-friendly interfaces for managing samples, reference catalog, profiles, workflows…
APP : IDDN.FR.001.420008.000.R.P.2013.000.30000
Quick overview of Meteor
Metagenomic sample: millions of reads are produced by Illumina, SOLiD or Ion Proton sequencers.
Quality controls: cleaning and filtering are performed with alienTrimmer and embedded filters.
Mapping: the reads are mapped with bowtie 1 or 2 onto reference catalogues composed of several million genes.
Counting: the gene abundance is estimated and the counting profiles of hundreds of individuals are aggregated.
Indexing: data are indexed on an ISAM system (Indexed Sequential Access Method) managed by an embedded NoSQL database (AdvantageDB, Sybase).
Gene Profiling with
reads filtering (Quality, Barcodes, Contaminants)
clean reads
mapping (with mismatch)
How to measure gene abundance?
Total    = 7, 7, 3, 0, 4
Unique   = 7, 6, 1, 0, 3
Shared   = 7, 6 + 1/3, 1 + 1/3, 0, 3 + 1/3
i-Shared = 7, 6 + 0.6, 1 + 0.1, 0, 3 + 0.3
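The counting modes above can be illustrated with a small sketch. This is not Meteor's code: the function name, the input layout (one list of gene indices per read) and the equal-split fallback for genes without unique reads are assumptions made for illustration. The toy input is chosen to reproduce the Unique, Shared and i-Shared rows of the slide, where one read maps equally well to genes 2, 3 and 5.

```python
def count_reads(n_genes, read_hits):
    """Return (unique, shared, i_shared) counts per gene.

    read_hits: one list per read, holding the indices of the genes it maps to.
    """
    # Pass 1: reads that map to exactly one gene.
    unique = [0.0] * n_genes
    for hits in read_hits:
        if len(hits) == 1:
            unique[hits[0]] += 1

    # Pass 2: distribute multi-mapped reads on top of the unique counts.
    shared = list(unique)
    i_shared = list(unique)
    for hits in read_hits:
        if len(hits) < 2:
            continue
        weight_sum = sum(unique[g] for g in hits)
        for g in hits:
            # "Shared": each candidate gene receives an equal fraction.
            shared[g] += 1 / len(hits)
            # "i-Shared": fraction proportional to the gene's unique count
            # (assumed fallback: equal split if no candidate has unique reads).
            if weight_sum > 0:
                i_shared[g] += unique[g] / weight_sum
            else:
                i_shared[g] += 1 / len(hits)
    return unique, shared, i_shared


# Toy input: 7, 6, 1, 0 and 3 unique reads on genes 1..5 (indices 0..4),
# plus one read shared between genes 2, 3 and 5.
reads = [[0]] * 7 + [[1]] * 6 + [[2]] * 1 + [[4]] * 3 + [[1, 2, 4]]
unique, shared, i_shared = count_reads(5, reads)
print(unique)    # unique counts per gene
print(shared)    # equal split of the shared read
print(i_shared)  # split weighted by unique counts
```

With this input the shared read adds 1/3 to each of the three genes in the "Shared" mode and 0.6, 0.1 and 0.3 in the "i-Shared" mode, because the three genes hold 6, 1 and 3 of the 10 unique reads.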
Sample data acquisition
Sample data indexing and QC
Mapping Mapping indexing
Counting Profiling
700 GByte 1.8 TByte 3.5 TByte 4.0 TByte 4.2 TByte 4.4 TByte 4.5 TByte
Meteor benchmark
Processing time and storage provision for a set of 200 samples (50M reads) – ref 3.3M genes (cluster HPC Windows 2008RC2 16 nodes/192 cores with ProActive scheduler)
1 day 2 days
demonstration
Create a new project
Sample management
Add a new sample
Result is a text file with metadata used for traceability through a complete meteor run
The new sample
Counting configuration
Workflow editing
Launch counting
Build the abundance table by aggregating the sample counting results
Reference preparation directly from FASTA file or with iMOMi Genome Studio
raw matrix (rows: reference catalogue genes; columns: individuals)

Genes     Ind 1  Ind 2  Ind 3  Ind 4  Ind 5  Ind 6  Ind 7
1             0     36      2      0     43    106   1250
2             0     27    193      0     44    103      8
3             0     31      0      0      0      0      0
4           152     59    282      1      0      0      0
5           115      0      0      1      0     29      2
6            90    783     26      0      2      0      0
7           104   1616      0      0      0      0      5
8             0     82      0      0      0      0      0
9             2      0      0      0      0      0      0
10           23    239   1302     10      0    190      0
11           30    183    900     13      0    172      0
12           27    228   1120      6      0    324      0
13          103      0      0      0      0      0      0
14            0     30    269      0      0      0      0
15            0      0      0      0      0     95      0
16         1250   6002    468    607    492    141   8023
17            0      0      0      0      0      0      0
18            0      9    108      0      0     55      0
19            0      0      0      3      0      0      0
...
3300000       0     36      2      0     43    106   1250
Abundance table is now ready for MetaOMineR…
MetaOmics mining in R
~ 150 functions
Edi Prifti & Emmanuelle Le Chatelier
Preprocessing
Ecosystems
Phylogenetic annotation
Analysis
Functional annotation
Data integration
R-packages suite for
Metagenomic analysis of human gut
+ additional data
packages related to the catalogs
83 31
98 25
Diagnostic tool in liver cirrhosis
Qin N. et al. Nature 2014 Zhejiang University, Hangzhou, China & MGP
diagnosis
• by biopsy in 40%
• by clinical symptoms or imaging in 60%
[Raw count matrix: reference catalogue genes 1 to 5400000 (rows) × individuals (columns)]
5.4M genes catalogue
[Figure: heatmap of samples H50, L70, L51, L71, L26, LV8, H56, L86, H33, HV8, H3, HV3, L55, L73, H6, L11, H16, H22, H71, H72; Color Key and Histogram (Count), value scale 0.2 to 1]
raw matrix
hierarchical clustering of the samples
contaminations – mix ?
sample coherence
hierClust() (momr) 124K genes
Preprocessing
a. Data cleaning
b. Downsizing
c. Normalization
d. MGS elaboration: construction, quality testing and cleaning, visualization
e. Data reduction: vectorisation
[Figure: heatmap of all samples (H, HV, L and LV labels); Color Key and Histogram (Count), value scale 0.2 to 1]
filt.hierClust() (momr)
0'33"
83 (+31) 98 (+25)
Technical bias : sequencing depth
Bias due to differing sequencing depths: need for downsizing!
Min. 9167000, Median 31860000, Max. 123100000
20M 9M
rho = 0.48
raw matrix
reads 9M 12M 41M 25M 13M 15M 12M
downsizing 11M 15M
genes : 1 2 3 1 2 3
// downsizeMatrix(9M): 200'
// downsizeGC(9M): 13' (momr)
// downsizeGC.all (9-20M): 210'
computeUpsizedGC()
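Downsizing, as described above, brings every sample to the same sequencing depth before gene counts are compared. The sketch below is a minimal illustration and not the momr implementation: `downsize_counts` is a hypothetical name, and it assumes that downsizing means drawing reads at random without replacement. The input is the start of the Ind 2 column of the raw matrix.

```python
import random


def downsize_counts(counts, depth, seed=0):
    """Randomly draw `depth` reads without replacement from one sample.

    counts: read count per gene for a single individual.
    Returns the downsized count per gene.
    """
    # One entry per read, labelled with the gene it was counted on.
    pool = [gene for gene, c in enumerate(counts) for _ in range(c)]
    if depth > len(pool):
        raise ValueError("sample has fewer reads than the target depth")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    downsized = [0] * len(counts)
    for gene in rng.sample(pool, depth):
        downsized[gene] += 1
    return downsized


# First eight genes of individual 2 in the raw matrix (2634 reads in total).
ind2_counts = [36, 27, 31, 59, 0, 783, 1616, 82]
downsized = downsize_counts(ind2_counts, depth=1000)

# Gene count (GC): number of genes still detected after downsizing.
gene_count = sum(1 for c in downsized if c > 0)
print(downsized, gene_count)
```

Expanding every read into a list is only workable for toy data. At the depths quoted on the slide (9M to 20M reads per sample) a sampling routine that works directly on the counts would be needed.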
[Figure: gene counts correlation, all: 884 ind; axes mean9 and mean11, both from 0e+00 to 1e+06]
GC_9M, R² = 0.9996
GC_7M, R² = 0.9977
GC_5M, R² = 0.9926
GC_3M, R² = 0.9786
down3 down4 down5 down7 down9
GC ... GC 11M
upsizing
[Figure: boxplots of gene counts (GC) by status, H vs. LC]
GC unique ~ status, pval = 0.00012
GC shared ~ status, pval = 1.6e−05
GC shared down 9M ~ status, pval = 6.1e−12
raw: pval 1.6e-5; downsized: pval 6.1e-12
downsized raw matrix
Normalization (RPKM): $\dfrac{N_{\text{reads}} / \text{size}_{\text{kb}}}{\sum \left( N_{\text{reads}} / \text{size}_{\text{kb}} \right)}$
→ frequency matrix
normRPKM() 2'20" (momr)
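The normalization formula of this slide can be written out as a short sketch. It is an illustration of the formula only, not the code of `normRPKM()` in momr: the function name `norm_rpkm_like`, the list-of-rows layout and the gene sizes are all hypothetical. The counts are the first four genes and first three individuals of the raw matrix.

```python
def norm_rpkm_like(counts, gene_sizes_bp):
    """Divide counts by gene size (kb), then scale each sample to sum to 1.

    counts: one row per gene, one column per individual.
    gene_sizes_bp: gene lengths in base pairs, same order as the rows.
    """
    n_samples = len(counts[0])
    # Nreads / size_kb for every gene and every individual.
    density = [
        [c / (size / 1000) for c in row]
        for row, size in zip(counts, gene_sizes_bp)
    ]
    # Sum of Nreads / size_kb over all genes, per individual.
    col_sums = [sum(row[j] for row in density) for j in range(n_samples)]
    return [
        [d / col_sums[j] if col_sums[j] else 0.0 for j, d in enumerate(row)]
        for row in density
    ]


raw = [
    [0, 36, 2],
    [0, 27, 193],
    [0, 31, 0],
    [152, 59, 282],
]
sizes_bp = [900, 1200, 600, 1500]  # hypothetical gene lengths
freq = norm_rpkm_like(raw, sizes_bp)
for row in freq:
    print(row)
```

Each column of the result sums to 1, which is why the output is a frequency matrix rather than a count matrix.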
frequency matrix (rows: reference catalogue genes; columns: individuals)

Genes     Ind1     Ind2     Ind3     Ind4     Ind5     Ind6     Ind7
1         0.0E+00  3.6E-08  2.0E-09  0.0E+00  4.3E-08  1.1E-07  1.3E-06
2         0.0E+00  2.7E-08  1.9E-07  0.0E+00  4.4E-08  1.0E-07  8.0E-09
3         0.0E+00  3.1E-08  0.0E+00  0.0E+00  0.0E+00  0.0E+00  0.0E+00
4         1.5E-07  5.9E-08  2.8E-07  9.1E-10  0.0E+00  0.0E+00  0.0E+00
5         1.2E-07  0.0E+00  0.0E+00  1.4E-09  0.0E+00  2.9E-08  2.0E-09
6         9.0E-08  7.8E-07  2.6E-08  0.0E+00  2.4E-09  0.0E+00  0.0E+00
7         1.0E-07  1.6E-06  0.0E+00  0.0E+00  0.0E+00  0.0E+00  5.0E-09
8         0.0E+00  8.2E-08  0.0E+00  0.0E+00  0.0E+00  0.0E+00  0.0E+00
9         1.6E-09  0.0E+00  0.0E+00  0.0E+00  0.0E+00  0.0E+00  0.0E+00
10        2.3E-08  2.4E-07  1.3E-06  1.0E-08  0.0E+00  1.9E-07  0.0E+00
11        3.0E-08  1.8E-07  9.0E-07  1.3E-08  0.0E+00  1.7E-07  0.0E+00
12        2.7E-08  2.3E-07  1.1E-06  5.6E-09  0.0E+00  3.2E-07  0.0E+00
13        1.0E-07  0.0E+00  0.0E+00  0.0E+00  0.0E+00  0.0E+00  0.0E+00
14        0.0E+00  3.0E-08  2.7E-07  0.0E+00  0.0E+00  0.0E+00  0.0E+00
15        0.0E+00  0.0E+00  0.0E+00  0.0E+00  0.0E+00  9.5E-08  0.0E+00
16        1.3E-06  6.0E-06  4.7E-07  6.1E-07  4.9E-07  1.4E-07  8.0E-06
17        0.0E+00  0.0E+00  0.0E+00  0.0E+00  0.0E+00  0.0E+00  0.0E+00
18        0.0E+00  9.4E-09  1.1E-07  0.0E+00  0.0E+00  5.5E-08  0.0E+00
19        0.0E+00  0.0E+00  0.0E+00  3.4E-09  0.0E+00  0.0E+00  0.0E+00
...
5400000   0.0E+00  3.6E-08  2.0E-09  0.0E+00  4.3E-08  1.1E-07  1.3E-06
Clustering genes into MetaGenomic Species (MGS)
intestinal gut bacteria
abundance profiles of gene catalogue
organize the catalogue by co-variance
1 1 0 1 1 1 1 0 1
3 3 0 3 3 1 3 0 1
0 2 0 0 0 1 2 0 1
2 0 0 2 2 1 0 0 1
0 0 1 0 0 0 0 1 0
0 0 2 0 0 2 0 2 2
high-throughput sequencing
DNA extraction
0 0 1 0 0 0 0 1 0
MGS canopy
Nielsen*, Almeida* et al., Nature Biotech. 2014
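The idea of organizing the catalogue by co-variance can be sketched with a deliberately simplified greedy grouping. This is not the MGS canopy algorithm of Nielsen, Almeida et al.: the seed choice, the Pearson threshold of 0.9 and the gene names below are assumptions for illustration only. Genes whose abundance profiles rise and fall together across individuals end up in the same group.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat profile carries no co-variance signal
    return cov / sqrt(var_a * var_b)


def group_by_covariance(profiles, min_r=0.9):
    """Greedy grouping: take a seed gene, attach every remaining gene
    whose profile correlates with the seed at min_r or more, repeat."""
    remaining = list(profiles)
    clusters = []
    while remaining:
        seed = remaining[0]
        members = [
            g for g in remaining
            if g == seed or pearson(profiles[seed], profiles[g]) >= min_r
        ]
        clusters.append(members)
        remaining = [g for g in remaining if g not in members]
    return clusters


# Hypothetical abundance profiles over five individuals.
profiles = {
    "geneA": [1, 1, 0, 1, 1],
    "geneB": [3, 3, 0, 3, 3],
    "geneC": [0, 0, 1, 0, 0],
    "geneD": [0, 0, 2, 0, 0],
    "geneE": [0, 2, 0, 0, 1],
}
clusters = group_by_covariance(profiles)
print(clusters)
```

On real catalogues of millions of genes an all-against-seed scan like this one would be far too slow, which is one reason a dedicated algorithm is used.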
frequency matrix
plotBarcode() (momr)
300 as column
50 genes / MGS as row
MGS visualization : the MGS barcode
MGS cleaning (moclust)
computeFilteredVectors() (momr)
//
a. Catalogs annotation: Blast queries and treatment
b. Clusters annotation: gene list, MGS visualization
superkingdom, phylum, class, order, family, genus, species
great variability between catalogs: from 3% (baby) to 30% NA; from 6% (mouse) to 90% (baby) species
tool to better assign / correct assignments
(mophyl) //
taxonomy of the MGS / MGU according to the first 20 hits
Phylogenetic annotation
a. Sample stratification: LGC/HGC, upsizing, enterotyping
b. Feature selection: correlation, association, bootstrapping
c. MGS/MGU selection
d. Modelisation: model building, model testing
ranksum q < 10^-4
75 245 genes
83 H 98 LC
66 MGS
38 Healthy MGS
28 LC MGS
// testRelations() 30' (momr)
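The rank-sum test used for feature selection can be sketched as follows. This is a generic Wilcoxon rank-sum test with a normal approximation, written for illustration and not taken from `testRelations()`: the function name and the abundance values are hypothetical, ties get average ranks without a variance correction, and the multiple-testing step that turns p-values into the q-values quoted on the slide is not shown.

```python
from math import erfc, sqrt


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p) for the hypothesis that x and y share one distribution.
    """
    combined = sorted((v, group) for group, values in enumerate((x, y)) for v in values)
    n = len(combined)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Tied values all receive the average of the ranks they span.
        j = i
        while j + 1 < n and combined[j + 1][0] == combined[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    w = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    n1, n2 = len(x), len(y)
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    return z, erfc(abs(z) / sqrt(2))


# Hypothetical abundances of one gene in healthy (H) and cirrhosis (LC) samples.
healthy = [1, 2, 3, 4, 5, 6, 7, 8]
cirrhosis = [11, 12, 13, 14, 15, 16, 17, 18]
z, p = rank_sum_test(healthy, cirrhosis)
print(z, p)
```

In a real analysis the test would be run once per gene (or per MGS), and the resulting p-values would be adjusted before applying a threshold such as q < 10^-4.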
projectOntoMGS() (momr)
taxonomy of the MGS / MGU
oral species
28 LC MGS
38 Healthy MGS
MGS taxonomy
Co-occurrence microbial network
LC species abundance (Low / High) vs. severity scores: MELD p<1e-5, CTP p<3e-4
7 model MGS: + - - - - - -
7 MGS score
Model building (1-N MGS): discovery (83+98), AUC = 0.952
Model testing: validation (31+25), AUC = 0.937
(mopred)
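A score built from a handful of MGS and evaluated by AUC can be sketched as below. This is not the mopred code: the score is assumed to be a signed sum of MGS abundances using the sign pattern shown on the slide (+ - - - - - -), the direction "higher score means LC" is an assumption, and all abundances are hypothetical.

```python
def mgs_score(abundances, signs):
    """Signed sum of MGS abundances for one sample (assumed score form)."""
    return sum(s * a for s, a in zip(signs, abundances))


def auc(scores_pos, scores_neg):
    """Probability that a positive sample outscores a negative one
    (ties count for one half); equals the area under the ROC curve."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


signs = [+1, -1, -1, -1, -1, -1, -1]  # sign pattern from the slide

# Hypothetical abundances of the 7 model MGS in each sample.
lc_samples = [[5, 0, 1, 0, 0, 1, 0], [4, 1, 0, 0, 1, 0, 0]]
healthy_samples = [[0, 2, 1, 3, 1, 2, 1], [1, 1, 2, 1, 0, 1, 1]]

lc_scores = [mgs_score(s, signs) for s in lc_samples]
healthy_scores = [mgs_score(s, signs) for s in healthy_samples]
model_auc = auc(lc_scores, healthy_scores)
print(lc_scores, healthy_scores, model_auc)
```

The same `auc` function applied to a held-out set of samples corresponds to the model testing step: the signs are fixed on the discovery cohort and only the scores are recomputed on the validation cohort.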
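[Figure: Venn diagram of specific and shared genes between the 3.3M (124 samples), 3.9M (396 samples) and 9.9M (1267 samples) catalogues; values as extracted: 0.7, 0.3, 1.3, 2.2, 1.9, 2.0, 1.1, 6.2, 0.2, 0.4, 3.7]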
a. Omics links: phenotypes, clinicals, metagenomics, metabolomics, metalipidomics, ...
b. Catalogs bridging: genes, MGS
Catalogs bridging makes it possible to mine multiple projects done on different catalogs (bridge)
phenoPairwiseRelations() (momr)
Data integration
MetaOmics mining in R
momr 1.1 deposited at CRAN
mopred, mecos, mophyl, moclust, bridge
future developments with other MGP analysts
many time-consuming steps and size limits! >> Parstream solution
23/09/2015 Magali Berland / Bioinformatics half-day at Jouy-en-Josas / MetaGenoPolis
http://mgps.eu/index.php?id=ibs-tools
Data processing bottleneck
• 2.5 TB / week of primary data to process
• Compute cluster running non-stop since June
• Forecast: full load for the next 10 months
• Example of the MetaCardis project: 4000 samples of 5 GB
Outlook: processing massive data
Distributed database
Massive parallelization of queries
Brings together primary processing and data analysis
Enables cross-referencing of data between projects
Exploration of the data from various angles: flexibility of analyses
Acknowledgements
Anne-Sophie Alvarez
Jean-Michel Batto
Magali Berland
S. Dusko Ehrlich
Franck Gauthier
Ndeye A. Gaye
Amine Ghozlane
Marie Jeammet
Emmanuelle Le Chatelier
Pierre Léonard
Nicolas Maziers
Florian Plaza-Onate
Nicolas Pons
Etienne Ruppé
Florence Thirion
Kevin Weiszer
Acknowledgements
Vincent Ducrot Dany Tello Sébastien Monot Tarik Saidani Victor Arslan
Denis Caromel Vladimir Bodnartchouk Fabien Viale
Eric Mahé
Mihai Pop Mathieu Almeida
Equipe BAC Equipe IFE
Thanks for your attention