Metagenomic data analysis From reads to biomarkers Meteor, MetaOMineR and future… Nicolas Pons Emmanuelle Le Chatelier Magali Berland 2015-09-25 – Bioinformatique du Centre de Jouy-en-Josas

Page 1: Metagenomic data analysis From reads to biomarkersmigale.jouy.inra.fr/sites/...crj_sept2015_npons_elechat_mberland.pdf · Quick overview of Meteor Indexing Quality controls Counting

Metagenomic data analysis From reads to biomarkers

Meteor, MetaOMineR and future…

Nicolas Pons

Emmanuelle Le Chatelier Magali Berland

2015-09-25 – Bioinformatique du Centre de Jouy-en-Josas

Page 2

Human intestinal microbiota: a forgotten organ

from “sterile” humans at birth to 2 kg of microorganisms,

out of the 100 trillion cells in the human body, only 1 in 10 is human.

Immune system
• “education” of innate immune defenses
• colonization resistance

Gut interface
• terminal differentiation of mucosa
• epithelial “homeostasis”

Metabolism
• food degradation
• vitamin production
• energy extraction

90 % microbes, 10 % human cells

Page 3

A major role in health and disease

Bach JF, N Eng J Med 2002

Page 4

Scientific objectives

• Compare multiple samples from the same study / project: stratification of the individuals & personalized medicine, health vs. disease biomarkers (sick vs. healthy)
• Mine multiple projects for novel discoveries and associations
• Analyse a single sample to identify organisms / genes / functions present

Page 5

Most microorganisms are unknown and uncultivable…

a largely unknown ecosystem

< 30% cultivable

huge data

500-1000 dominant bacterial species

Metagenomics: gene inventory and quantification

Page 6

Quantitative metagenomics pipeline

• sample collection: stool sample, total DNA
• sequencing: library preparation, SOLiD/Proton/Illumina sequencing, 30-50 million short sequences
• reference construction: reference gene catalogue
• gene profiling: sequence mapping onto the gene catalogue, giving gene counts (genes × individuals 1, 2, 3, 4)
• bioinformatics / statistics analyses: preprocessing / normalization & dimension reduction, catalogue structuration, relation with clinical data

Identify clinically relevant groups & ecosystems; build and test prediction models

Page 7

reference catalogue

Adapted from Gevers, et al, 2012, Plos One

MetaHit

1.5Tb

MetaHit HMP

and others

5.1M genes

9.9M genes

3.9M genes

3.3M genes

Multiplication of the catalogues

Qin et al Nature, 2010

New assembly needed for specific study

LiverC 5.4M

Baby 1M

Page 8

Quantitative metagenomics pipeline

Same stages as the previous pipeline slide (sample collection, sequencing, reference construction, gene profiling, bioinformatics / statistics analyses), here annotated with the tools involved: SOMA, MGS canopy, PAMA.

MGS CANOPY: http://git.dworzynski.eu/mgs-canopy-algorithm
METEOR APP: IDDN.FR.001.420008.000.R.P.2013.000.30000
MetaOMineR APP: IDDN.FR.001.220005.000.R.P.2014.000.10000

Page 9

Meteor specifications
• Input data: SOLiD, Illumina, Proton
• Quality and adaptor/barcode/contaminant filtering
• Mapping against very large references (>10M genes)
• Several kinds of gene abundance measures
• Gene abundance normalization
• Output data (report and abundance table): AdvantageDB, text file, Excel
• Experiment branch system with workflow
• Traceability
• Developed in Delphi (Windows)
• Command-line program (cluster) and GUI (desktop)
• Distributed with user-friendly interfaces for managing samples, reference catalog, profiles, workflows…

APP : IDDN.FR.001.420008.000.R.P.2013.000.30000

Page 10

Quick overview of Meteor

Metagenomic sample: millions of reads are produced by Illumina, SOLiD or Ion Proton sequencers.

Quality controls: cleaning and filtering are performed with alienTrimmer and embedded filters.

Mapping: the mapping of the reads is performed with bowtie 1 or 2 on reference catalogues composed of several million genes.

Counting: the gene abundance is estimated and the counting profiles of hundreds of individuals are aggregated.

Indexing: data are indexed on an ISAM system (Indexed Sequential Access Method) managed by an embedded NoSQL database (AdvantageDB, Sybase).

Page 11

Gene Profiling with

reads filtering (Quality, Barcodes, Contaminants)

clean reads

mapping

How to measure gene abundance?

Counts for five genes (left to right):

Total     7   7         3         0   4
Unique    7   6         1         0   3
Shared    7   6 + 1/3   1 + 1/3   0   3 + 1/3
i-Shared  7   6 + 0.6   1 + 0.1   0   3 + 0.3

mismatch
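The Unique, Shared and i-Shared rows above can be reproduced with a small sketch. It assumes (this is an interpretation of the slide, not Meteor's code) that a read mapping to several genes is split equally among them for "Shared", and split in proportion to the genes' unique counts for "i-Shared". The function name `count_genes` and the read representation are hypothetical.

```python
def count_genes(n_genes, read_hits):
    """Toy gene counting. `read_hits` holds, for each read, the list of
    gene indices it maps to (hypothetical representation)."""
    unique = [0.0] * n_genes
    for hits in read_hits:
        if len(hits) == 1:
            unique[hits[0]] += 1

    shared = list(unique)
    i_shared = list(unique)
    for hits in read_hits:
        if len(hits) < 2:
            continue
        # "Shared" (assumed): a multi-mapped read is split equally
        for g in hits:
            shared[g] += 1 / len(hits)
        # "i-Shared" (assumed): split in proportion to the unique counts
        denom = sum(unique[g] for g in hits)
        for g in hits:
            weight = unique[g] / denom if denom > 0 else 1 / len(hits)
            i_shared[g] += weight
    return unique, shared, i_shared


# Unique reads per gene as on the slide (7, 6, 1, 0, 3),
# plus one read mapping to genes 2, 3 and 5 (indices 1, 2, 4).
reads = [[0]] * 7 + [[1]] * 6 + [[2]] * 1 + [[4]] * 3 + [[1, 2, 4]]
unique, shared, i_shared = count_genes(5, reads)
print(unique)
print([round(v, 2) for v in shared])
print([round(v, 2) for v in i_shared])
```

With these toy reads the shared read adds 1/3 to each of its three genes in the Shared row, and 0.6, 0.1 and 0.3 in the i-Shared row, since the unique counts of those genes are 6, 1 and 3.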

Page 12

Sample data acquisition

Sample data indexing and QC

Mapping Mapping indexing

Counting Profiling

700 GByte 1.8 TByte 3.5 TByte 4.0 TByte 4.2 TByte 4.4 TByte 4.5 TByte

Meteor benchmark

Processing time and storage provision for a set of 200 samples (50M reads), reference of 3.3M genes (HPC cluster, Windows 2008RC2, 16 nodes / 192 cores, with ProActive scheduler)

1 day 2 days

Page 13

demonstration

Page 14

Create a new project

Page 15

Sample management

Page 16

Add a new sample

The result is a text file with metadata, used for traceability throughout a complete Meteor run

Page 17

The new sample

Page 18

Counting configuration

Page 19

Workflow editing

Page 20

Launch counting

Page 21

Build the abundance table by aggregating the sample counting results

Page 22

Page 23

Reference preparation directly from FASTA file or with iMOMi Genome Studio

Page 24

raw matrix: reference catalogue genes (rows) × individuals (columns)

Genes    Ind 1  Ind 2  Ind 3  Ind 4  Ind 5  Ind 6  Ind 7
1        0      36     2      0      43     106    1250
2        0      27     193    0      44     103    8
3        0      31     0      0      0      0      0
4        152    59     282    1      0      0      0
5        115    0      0      1      0      29     2
6        90     783    26     0      2      0      0
7        104    1616   0      0      0      0      5
8        0      82     0      0      0      0      0
9        2      0      0      0      0      0      0
10       23     239    1302   10     0      190    0
11       30     183    900    13     0      172    0
12       27     228    1120   6      0      324    0
13       103    0      0      0      0      0      0
14       0      30     269    0      0      0      0
15       0      0      0      0      0      95     0
16       1250   6002   468    607    492    141    8023
17       0      0      0      0      0      0      0
18       0      9      108    0      0      55     0
19       0      0      0      3      0      0      0
…
3300000  0      36     2      0      43     106    1250

Abundance table is now ready for MetaOMineR…
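As a rough illustration of the aggregation step, the sketch below merges per-sample counting results into one genes × individuals table. The input layout (one dictionary of gene id to count per sample) and the function name `build_abundance_table` are assumptions made for the example; Meteor's actual file formats are not described in these slides.

```python
def build_abundance_table(sample_counts, n_genes):
    """Merge per-sample {gene_id: count} dictionaries (hypothetical layout)
    into rows of counts, one row per gene and one column per sample."""
    samples = list(sample_counts)
    rows = [
        [sample_counts[s].get(gene, 0) for s in samples]  # absent gene = 0
        for gene in range(1, n_genes + 1)
    ]
    return samples, rows


# First five genes of Ind 1 and Ind 2 from the raw matrix on this slide
per_sample = {
    "Ind 1": {4: 152, 5: 115},
    "Ind 2": {1: 36, 2: 27, 3: 31, 4: 59},
}
samples, rows = build_abundance_table(per_sample, n_genes=5)
for gene, row in enumerate(rows, start=1):
    print(gene, row)
```

Genes that a sample never hit are filled with 0, which is why the real matrix is mostly zeros.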

Page 25

MetaOmics mining in R

R-packages suite for metagenomic analysis of human gut

~ 150 functions

Edi Prifti & Emmanuelle Le Chatelier

Modules: Preprocessing, Ecosystems, Phylogenetic annotation, Analysis, Functional annotation, Data integration

+ additional data packages related to the catalogs

Page 26

Diagnostic tool in liver cirrhosis

Qin N. et al. Nature 2014, Zhejiang University, Hangzhou, China & MGP

Cohort: 83 (+31) healthy (H) and 98 (+25) liver cirrhosis (LC) individuals

diagnosis
• by biopsy in 40%
• by clinical symptoms or imaging in 60%

raw matrix: gene counts, 5.4M genes catalogue (rows) × individuals (columns)

Page 27

Preprocessing

a. Data cleaning
b. Downsizing
c. Normalization
d. MGS elaboration: construction, quality testing and cleaning, visualization
e. Data reduction: vectorisation

raw matrix: hierarchical clustering of the samples
• contaminations – mix ?
• sample coherence

hierClust() (momr), 124K genes
filt.hierClust() (momr): 0'33"

[Figure: two clustered heatmaps of the samples (labels prefixed H, HV, L and LV), each with a color key and histogram (Value from 0.2 to 1, Count)]

Page 28

Technical bias: sequencing depth

83 (+31) 98 (+25)

Bias due to different sequencing depths: need for downsizing!

Sequencing depth (reads):
Min.      Median     Max.
9167000   31860000   123100000

20M 9M

rho = 0.48

Page 29

Preprocessing, step b: Downsizing

raw matrix: gene counts, 5.4M reference catalogue genes (rows) × individuals (columns)
reads: 9M 12M 41M 25M 13M 15M 12M
downsizing: 11M 15M
genes: 1 2 3

// downsizeMatrix(9M) 200'
// downsizeGC(9M) 13' (momr)
// downsizeGC.all (9-20M) 210'
computeUpsizedGC()

[Figure: gene counts correlation, all: 884 ind; axes mean9 and mean11; series down3, down4, down5, down7, down9; GC_9M, R² = 0.9996; GC_7M, R² = 0.9977; GC_5M, R² = 0.9926; GC_3M, R² = 0.9786; GC ... GC 11M]

upsizing
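A minimal sketch of what downsizing means, assuming it is a random draw of a fixed number of reads without replacement from each sample (the usual rarefaction approach). This is not the momr implementation of downsizeMatrix() or downsizeGC(); the function name `downsize_counts` is hypothetical, and expanding counts into one entry per read only suits toy data.

```python
import random


def downsize_counts(counts, depth, seed=0):
    """Draw `depth` reads without replacement from one sample's gene counts."""
    total = sum(counts)
    if depth > total:
        raise ValueError("sample has fewer reads than the target depth")
    rng = random.Random(seed)
    # One list entry per read, labelled with the index of its gene
    reads = [gene for gene, n in enumerate(counts) for _ in range(n)]
    kept = rng.sample(reads, depth)
    downsized = [0] * len(counts)
    for gene in kept:
        downsized[gene] += 1
    return downsized


# First ten genes of Ind 2 from the raw matrix (2873 reads in total)
ind2 = [36, 27, 31, 59, 0, 783, 1616, 82, 0, 239]
down = downsize_counts(ind2, depth=1000)
print(sum(ind2), sum(down))
print(down)
```

After downsizing every sample has the same number of reads, so low-abundance genes are detected (or missed) with comparable chances across samples, which is what makes gene counts comparable.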

Page 30

Preprocessing, step b: Downsizing

raw matrix: gene counts, 5.4M reference catalogue genes (rows) × individuals (columns)
reads: 9M 12M 41M 25M 13M 15M 12M
downsizing: 11M 15M
genes: 1 2 3

// downsizeGC(9M) 13' (momr)
// downsizeGC.all (9-20M) 210'
computeUpsizedGC()
// downsizeMatrix(9M) 200'

[Figure: three plots of gene counts (GC) by status (H, LC)]
• GC unique ~ status, pval = 0.00012
• GC shared ~ status, pval = 1.6e−05
• GC shared down 9M ~ status, pval = 6.1e−12

raw (pval 1.6e-5) vs. downsized (pval 6.1e-12): ?

Page 31

Preprocessing, step c: Normalization

downsized raw matrix: gene counts, 5.4M reference catalogue genes (rows) × individuals (columns)

Normalization RPKM: $\dfrac{N_{reads} / size_{kb}}{\sum ( N_{reads} / size_{kb} )}$

≠ frequency matrix

normRPKM() 2'20" (momr)

frequency matrix: reference catalogue genes (rows) × individuals (columns)

Genes    Ind1     Ind2     Ind3     Ind4     Ind5     Ind6     Ind7
1        0.0E+00  3.6E-08  2.0E-09  0.0E+00  4.3E-08  1.1E-07  1.3E-06
2        0.0E+00  2.7E-08  1.9E-07  0.0E+00  4.4E-08  1.0E-07  8.0E-09
3        0.0E+00  3.1E-08  0.0E+00  0.0E+00  0.0E+00  0.0E+00  0.0E+00
4        1.5E-07  5.9E-08  2.8E-07  9.1E-10  0.0E+00  0.0E+00  0.0E+00
5        1.2E-07  0.0E+00  0.0E+00  1.4E-09  0.0E+00  2.9E-08  2.0E-09
6        9.0E-08  7.8E-07  2.6E-08  0.0E+00  2.4E-09  0.0E+00  0.0E+00
7        1.0E-07  1.6E-06  0.0E+00  0.0E+00  0.0E+00  0.0E+00  5.0E-09
8        0.0E+00  8.2E-08  0.0E+00  0.0E+00  0.0E+00  0.0E+00  0.0E+00
9        1.6E-09  0.0E+00  0.0E+00  0.0E+00  0.0E+00  0.0E+00  0.0E+00
10       2.3E-08  2.4E-07  1.3E-06  1.0E-08  0.0E+00  1.9E-07  0.0E+00
11       3.0E-08  1.8E-07  9.0E-07  1.3E-08  0.0E+00  1.7E-07  0.0E+00
12       2.7E-08  2.3E-07  1.1E-06  5.6E-09  0.0E+00  3.2E-07  0.0E+00
13       1.0E-07  0.0E+00  0.0E+00  0.0E+00  0.0E+00  0.0E+00  0.0E+00
14       0.0E+00  3.0E-08  2.7E-07  0.0E+00  0.0E+00  0.0E+00  0.0E+00
15       0.0E+00  0.0E+00  0.0E+00  0.0E+00  0.0E+00  9.5E-08  0.0E+00
16       1.3E-06  6.0E-06  4.7E-07  6.1E-07  4.9E-07  1.4E-07  8.0E-06
17       0.0E+00  0.0E+00  0.0E+00  0.0E+00  0.0E+00  0.0E+00  0.0E+00
18       0.0E+00  9.4E-09  1.1E-07  0.0E+00  0.0E+00  5.5E-08  0.0E+00
19       0.0E+00  0.0E+00  0.0E+00  3.4E-09  0.0E+00  0.0E+00  0.0E+00
…
5400000  0.0E+00  3.6E-08  2.0E-09  0.0E+00  4.3E-08  1.1E-07  1.3E-06
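The normalization formula on this slide can be written as a few lines of code. The sketch below is an illustration of the formula only, not the normRPKM() function from momr: each count is divided by the gene length in kb, then by the sum of these length-corrected counts over all genes of the sample. The function name `norm_frequency` and the gene lengths are made up for the example.

```python
def norm_frequency(counts, gene_sizes_bp):
    """(Nreads / size_kb) / sum(Nreads / size_kb) for one sample."""
    per_kb = [n / (size / 1000) for n, size in zip(counts, gene_sizes_bp)]
    total = sum(per_kb)
    if total == 0:
        return [0.0] * len(counts)  # empty sample: nothing to normalize
    return [value / total for value in per_kb]


# Two genes with the same read count; the second gene is twice as long
freq = norm_frequency(counts=[10, 10], gene_sizes_bp=[1000, 2000])
print(freq)
```

The shorter gene gets twice the frequency of the longer one for the same number of reads, and the frequencies of a sample sum to 1.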

Page 32

Clustering genes into MetaGenomic Species (MGS)

intestinal gut bacteria

DNA extraction

high throughput sequencing

abundance profiles of gene catalogue

organize the catalogue by co-variance

MGS canopy

[Figure: small matrices of gene abundance profiles, grouped so that genes with co-varying profiles fall into the same cluster]

Nielsen*, Almeida* et al., Nature Biotech. 2014
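To make "organize the catalogue by co-variance" concrete, here is a deliberately simplified sketch: genes whose abundance profiles across individuals correlate strongly with a seed gene are grouped together. It is not the MGS canopy algorithm of Nielsen, Almeida et al. (which, among other things, refines the cluster profiles iteratively); the Pearson threshold of 0.9, the function names and the toy profiles are assumptions.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation; returns 0.0 when one profile is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def greedy_clusters(profiles, min_corr=0.9):
    """Take the first unassigned gene as seed and attach every remaining
    gene whose profile correlates with the seed at or above min_corr."""
    remaining = list(profiles)
    clusters = []
    while remaining:
        seed = remaining[0]
        members = [seed] + [
            g for g in remaining[1:]
            if pearson(profiles[seed], profiles[g]) >= min_corr
        ]
        clusters.append(members)
        remaining = [g for g in remaining if g not in members]
    return clusters


# Toy profiles over four individuals: g1-g3 co-vary, g4-g5 co-vary
profiles = {
    "g1": [1, 3, 0, 2],
    "g2": [2, 6, 0, 4],
    "g3": [1, 3, 0, 2.1],
    "g4": [5, 0, 4, 0],
    "g5": [10, 0, 8, 1],
}
clusters = greedy_clusters(profiles)
print(clusters)
```

Genes that belong to the same organism are expected to rise and fall together across individuals, which is why correlation of abundance profiles can recover species-level groups without a reference genome.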

Page 33

Preprocessing, step d: MGS elaboration (construction, quality testing and cleaning, visualization)

frequency matrix: reference catalogue genes (rows) × individuals (columns)

MGS visualization: the MGS barcode
plotBarcode() (momr)
300 as column
50 genes / MGS as row

MGS cleaning (moclust)

Preprocessing, step e: Data reduction (vectorisation)
computeFilteredVectors() (momr) //

Page 34

Phylogenetic annotation (mophyl) //

a. Catalogs annotation: Blast queries and treatment
b. Clusters annotation: gene list, MGS visualization

superkingdom, phylum, class, order, family, genus, species

great variability between catalogs: from 3% (baby) to 30% NA; from 6% (mouse) to 90% (baby) species

tool to better assign / correct the assignment

taxonomy of the MGS / MGU according to the first 20 hits

Page 35

Analysis

a. Sample stratification: LGC/HGC, upsizing, enterotyping
b. Feature selection: correlation, association, bootstrapping
c. MGS/MGU selection
d. Modelling: model building, model testing

83 H, 98 LC
ranksum q < 10^-4: 75 245 genes
66 MGS: 38 Healthy MGS, 28 LC MGS

// testRelations() 30' (momr)
projectOntoMGS() (momr)

taxonomy of the MGS / MGU
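The "ranksum q" selection can be illustrated with a standard-library sketch: a two-sided Wilcoxon rank-sum (Mann-Whitney) test per gene, followed by a multiple-testing adjustment. Several points are assumptions: the normal approximation is used without tie or continuity correction, the q-values are taken to be Benjamini-Hochberg adjusted p-values, and the function names are hypothetical. This is not the testRelations() code from momr.

```python
from math import erfc, sqrt


def rank_sum_pvalue(x, y):
    """Two-sided rank-sum p-value, normal approximation, midranks for ties."""
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return erfc(abs(z) / sqrt(2))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed meaning of 'q')."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues


# Toy abundances of one gene in healthy (H) and liver cirrhosis (LC) samples
p_separated = rank_sum_pvalue([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
p_mixed = rank_sum_pvalue([1, 3, 5], [2, 4, 6])
qvals = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(p_separated, p_mixed, qvals)
```

With millions of genes tested, the adjustment step matters more than the test itself; a strict threshold such as the q < 10^-4 on the slide keeps only genes with a very clear difference between groups.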

Page 36

83 H, 98 LC
ranksum q < 10^-4: 75 245 genes
66 MGS: 38 Healthy MGS, 28 LC MGS

MGS taxonomy
• 28 LC MGS: oral species
• 38 Healthy MGS

Co-occurrence microbial network

LC species abundance vs. severity scores (Low, High)
• MELD: p<1e-5
• CTP: p<3e-4

Page 37

Analysis, step d: Modelling (model building, model testing) (mopred)

83 H, 98 LC
ranksum q < 10^-4: 75 245 genes
66 MGS

Model building (1-N MGS)
7 MGS model, 7 MGS score: + - - - - - -

AUC = 0.952
discovery (83+98), validation (31+25)
AUC = 0.937

model testing
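How a "7 MGS score" could be turned into an AUC can be sketched as follows. Everything about the score itself is an assumption made for illustration: the score is taken as a signed sum of MGS abundances using the signs shown on the slide (+ - - - - - -), the '+' MGS is treated as enriched in LC, and the abundances are invented. The AUC is computed as the fraction of (LC, healthy) pairs in which the LC sample scores higher, with ties counted as one half. This is not the mopred implementation.

```python
def signed_score(abundances, signs):
    """Signed sum of MGS abundances (assumed form of the score)."""
    return sum(s * a for s, a in zip(signs, abundances))


def auc(scores_pos, scores_neg):
    """Probability that a positive sample outscores a negative one."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


signs = [+1, -1, -1, -1, -1, -1, -1]  # signs as shown on the slide

# Invented abundances of the 7 MGS for two LC and two healthy samples
lc_samples = [[5, 1, 0, 1, 0, 0, 1], [4, 0, 1, 0, 0, 1, 0]]
healthy_samples = [[1, 3, 2, 1, 2, 1, 2], [0, 2, 2, 2, 1, 1, 1]]

lc_scores = [signed_score(s, signs) for s in lc_samples]
healthy_scores = [signed_score(s, signs) for s in healthy_samples]
toy_auc = auc(lc_scores, healthy_scores)
print(lc_scores, healthy_scores, toy_auc)
```

Computing the AUC separately on the discovery and validation cohorts, as the slide does, shows whether the model generalizes beyond the samples used to build it.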

Page 38

Data integration

a. Omics links: phenotypes, clinicals, metagenomics, metabolomics, metalipidomics, ...
   phenoPairwiseRelations() (momr)
b. Catalogs bridging: genes, mgs (bridge)
   to mine multiple projects done on different catalogs

[Figure: specific and shared genes between the 3.3M (124 samples), 3.9M (396 samples) and 9.9M (1267 samples) catalogues; values on the diagram: 0.7, 0.3, 1.3, 2.2, 1.9, 2.0, 1.1, 6.2, 0.2, 0.4, 3.7]

Page 39

MetaOmics mining in R

R-packages suite for metagenomic analysis of human gut

~ 150 functions

Edi Prifti & Emmanuelle Le Chatelier

Modules: Preprocessing, Ecosystems, Phylogenetic annotation, Analysis, Functional annotation, Data integration

+ additional data packages related to the catalogs

momr 1.1 deposited at CRAN
mopred, mecos, mophyl, moclust, bridge

future developments with other MGP analysts

many time-consuming steps and size limits! >> Parstream solution

Page 40

23/09/2015, Magali Berland / Bioinformatics half-day at Jouy-en-Josas / MetaGenoPolis

http://mgps.eu/index.php?id=ibs-tools

Page 41

Data processing bottleneck

• 2.5 TB / week of primary data to process

• Compute cluster running non-stop since June

• Forecast: full load for the next 10 months

• Example of the MetaCardis project: 4000 samples of 5 GB
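The figures above can be put side by side with a line of arithmetic. The sketch assumes decimal units (1 TB = 1000 GB) and simply compares the MetaCardis volume with the stated weekly intake of primary data; it says nothing about actual processing speed. The variable names are illustrative.

```python
samples = 4000            # MetaCardis samples (from the slide)
gb_per_sample = 5         # size of one sample in GB (from the slide)
weekly_intake_tb = 2.5    # primary data to process per week, in TB

total_tb = samples * gb_per_sample / 1000          # decimal units assumed
weeks_at_intake_rate = total_tb / weekly_intake_tb

print(f"MetaCardis primary data: {total_tb} TB")
print(f"Equivalent to {weeks_at_intake_rate} weeks of intake at 2.5 TB/week")
```

Under these assumptions the project alone amounts to 20 TB, or eight weeks of the current weekly intake.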

Page 42

Outlook: big data processing

Distributed database

Massively parallel queries

Page 43

Outlook: big data processing

Brings together primary processing and data analysis

Enables cross-referencing of data between projects

Exploration of the data from various angles: flexible analyses

Page 44

Acknowledgements

Anne-Sophie Alvarez

Jean-Michel Batto

Magali Berland

S. Dusko Ehrlich

Franck Gauthier

Ndeye A. Gaye

Amine Ghozlane

Marie Jeammet

Emmanuelle Le Chatelier

Pierre Léonard

Nicolas Maziers

Florian Plaza-Onate

Nicolas Pons

Etienne Ruppé

Florence Thirion

Kevin Weiszer

Page 45

Acknowledgements

Vincent Ducrot Dany Tello Sébastien Monot Tarik Saidani Victor Arslan

Denis Caromel Vladimir Bodnartchouk Fabien Viale

Eric Mahé

Mihai Pop Mathieu Almeida

Equipe BAC Equipe IFE

Page 46

Thanks for your attention