Additional File 1 MetaMeta: Integrating metagenome ...10.1186/s40168-017-0318... · MetaMeta:...


Additional File 1

MetaMeta: Integrating metagenome analysis tools to improve taxonomic profiling

Vitor C. Piro, Marcel Matschkowski, Bernhard Y. Renard

1 Implementation

Figure 1: Score and bin matrices. Left: matrix with an example of calculated scores for 6 tools. Right: matrix showing the division of the scores into 4 bins.

1.1 File formats

MetaMeta accepts BioBoxes format directly (https://github.com/bioboxes/rfc/tree/master/data-format) or a .tsv file in the following format:

Profiling: rank, taxon name or taxid, abundance

Example:


genus	Methanospirillum	0.0029
genus	Thermus	0.0029
genus	568394	0.0029
species	Arthrobacter sp. FB24	0.0835
species	195	0.0582
species	Mycoplasma gallisepticum	0.0536

Binning: readid, taxon name or taxid, length of sequence assigned

Example:

M2|S1|R140	354	201
M2|S1|R142	195	201
M2|S1|R145	457425	201
M2|S1|R146	562	201
M2|S1|R147	1245471	201
M2|S1|R150	354	201
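As an illustration of the two .tsv layouts described above, the following minimal Python sketch reads profiling and binning rows into typed records. It is not part of MetaMeta: the names `ProfileEntry`, `BinEntry`, `read_profile` and `read_binning` are hypothetical, the fields are assumed to be tab-separated without a header line, and the read identifier in the usage example is made up.

```python
import csv
import io
from typing import NamedTuple


class ProfileEntry(NamedTuple):
    rank: str
    taxon: str        # taxon name or taxid, kept as text
    abundance: float


class BinEntry(NamedTuple):
    readid: str
    taxon: str        # taxon name or taxid, kept as text
    length: int       # length of the assigned sequence


def read_profile(handle):
    """Parse profiling rows: rank, taxon name or taxid, abundance."""
    rows = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [ProfileEntry(r[0], r[1], float(r[2])) for r in rows if r]


def read_binning(handle):
    """Parse binning rows: readid, taxon name or taxid, assigned length."""
    rows = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [BinEntry(r[0], r[1], int(r[2])) for r in rows if r]


# Usage with in-memory data; values follow the examples in Section 1.1,
# except for the read identifier, which is hypothetical.
profile = read_profile(io.StringIO(
    "genus\tMethanospirillum\t0.0029\n"
    "species\tArthrobacter sp. FB24\t0.0835\n"
    "species\t195\t0.0582\n"
))
binning = read_binning(io.StringIO("read_140\t354\t201\n"))
```

Keeping the taxon field as text lets one reader handle both names and numeric taxids, which the format allows to be mixed within the same file.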

1.2 Mode functions

The mode parameter can be set to one of 5 functions, which generate more precise or more sensitive results (Figure 2). Each bin has a cut-off value C defined as:

Very-sensitive: $C_{bin} = \log(bin + 3) / \log(maxbins + 3)$
Sensitive: $C_{bin} = \log(bin + 1) / \log(maxbins + 1)$
Linear: $C_{bin} = bin / maxbins$
Precise: $C_{bin} = 2^{bin} / 2^{maxbins}$
Very-precise: $C_{bin} = 4^{bin} / 4^{maxbins}$

where maxbins is the total number of bins.
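The five mode functions can be evaluated directly. The sketch below is an illustration, not the MetaMeta implementation. It assumes that bins are numbered from 1 to maxbins (as on the x axis of Figure 2) and reads the precise and very-precise modes as powers, i.e. 2^bin / 2^maxbins and 4^bin / 4^maxbins. The names `MODES` and `cutoffs` are hypothetical.

```python
import math

# One function per mode; b is the bin index (1..maxbins), m is maxbins.
# The base of the logarithm cancels out in the ratio.
MODES = {
    "very-sensitive": lambda b, m: math.log(b + 3) / math.log(m + 3),
    "sensitive": lambda b, m: math.log(b + 1) / math.log(m + 1),
    "linear": lambda b, m: b / m,
    "precise": lambda b, m: 2 ** b / 2 ** m,
    "very-precise": lambda b, m: 4 ** b / 4 ** m,
}


def cutoffs(mode, maxbins):
    """Return the cut-off value C for every bin, from bin 1 to maxbins."""
    f = MODES[mode]
    return [f(b, maxbins) for b in range(1, maxbins + 1)]


# Cut-offs for 4 bins in each mode, the setting shown in Figure 2.
for mode in MODES:
    print(f"{mode:>14}: " + "  ".join(f"{c:.3f}" for c in cutoffs(mode, 4)))
```

In every mode the last bin has a cut-off of 1, so the modes differ only in how quickly the cut-off grows over the lower bins: the logarithmic modes start high, the exponential modes start low.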

2 Results

2.1 Databases

Table 1: MetaMeta pre-configured databases

Tool      Archaea + Bacteria (v1)                        Custom
CLARK     Yes (https://doi.org/10.5281/zenodo.819305)    Yes
DUDes     Yes (https://doi.org/10.5281/zenodo.819343)    Yes
GOTTCHA   Yes (https://doi.org/10.5281/zenodo.819341)    No
kaiju     Yes (https://doi.org/10.5281/zenodo.819425)    Yes
kraken    Yes (https://doi.org/10.5281/zenodo.819363)    Yes
mOTUs     Yes (https://doi.org/10.5281/zenodo.819365)    No


Figure 2: Example of cut-off values for 4 bins in each mode.
[Plot not reproduced. x axis: Bins (1 to 4); y axis: Cut-off (% of taxons kept), 0.0 to 1.0; series: very-sensitive, sensitive, linear, precise, very-precise.]

2.2 Computer specifications

The main evaluations were performed with MetaMeta v1.1 on an x86 cluster with a total of 1000 cores and roughly 3.5 TB RAM. The sub-sampling evaluations on CAMI data were performed with MetaMeta v1.0 on: 60 CPUs x Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz, 1056 GB RAM, Debian GNU/Linux 8.4, 2.8 TB SSD.

2.3 Datasets and Parameters

The MetaMeta pipeline was executed with all 6 pre-configured tools using the archaea and bacteria database (Table 1).

All CAMI toy sets (low, medium and high complexity) were obtained from https://data.cami-challenge.org/

148 stool samples from HMP were obtained at: http://hmpdacc.org/


List of analyzed samples: SRS011061, SRS011134, SRS011239, SRS011271, SRS011302, SRS011405, SRS011452, SRS011529, SRS011586, SRS012273, SRS012902, SRS013158, SRS013215, SRS013476, SRS013521, SRS013687, SRS013800, SRS013951, SRS014235, SRS014287, SRS014313, SRS014459, SRS014613, SRS014683, SRS014923, SRS014979, SRS015065, SRS015133, SRS015190, SRS015217, SRS015264, SRS015369, SRS015578, SRS015663, SRS015782, SRS015794, SRS015854, SRS015960, SRS016018, SRS016056, SRS016095, SRS016203, SRS016267, SRS016335, SRS016495, SRS016517, SRS016585, SRS016753, SRS016954, SRS016989, SRS017103, SRS017191, SRS017247, SRS017307, SRS017433, SRS017521, SRS017701, SRS017821, SRS018133, SRS018313, SRS018351, SRS018427, SRS018575, SRS018656, SRS018817, SRS019030, SRS019161, SRS019267, SRS019397, SRS019582, SRS019601, SRS019685, SRS019787, SRS019910, SRS019968, SRS020233, SRS020328, SRS020869, SRS021484, SRS021948, SRS022071, SRS022137, SRS022524, SRS022609, SRS022713, SRS023346, SRS023526, SRS023583, SRS023829, SRS023914, SRS023971, SRS024009, SRS024075, SRS024132, SRS024265, SRS024331, SRS024388, SRS024435, SRS024549, SRS024625, SRS042284, SRS042628, SRS043001, SRS043411, SRS043701, SRS045004, SRS045645, SRS045713, SRS047014, SRS047044, SRS048164, SRS048870, SRS049164, SRS049712, SRS049900, SRS049959, SRS049995, SRS050299, SRS050422, SRS050752, SRS050925, SRS051031, SRS051882, SRS052027, SRS052697, SRS053214, SRS053335, SRS053398, SRS054590, SRS054956, SRS055982, SRS056259, SRS056519, SRS057478, SRS057717, SRS058723, SRS058770, SRS062427, SRS063040, SRS063985, SRS064276, SRS064557, SRS064645, SRS065504, SRS075398, SRS077730, SRS078176

The sample SRS023176 could not be properly analyzed due to inconsistent read pairs.

Table 2: MetaMeta (v1.1) parameters used for the CAMI and HMP data. Default parameters were used when not stated below.

Parameter       Default   CAMI low/med./high   HMP
trimming        0         -                    -
desiredminlen   70        -                    -
subsample       0         -                    -
mode            linear    -                    sensitive
cutoff          0.0001    -                    0.00001
bins            4         -                    -
ranks           species   -                    -

2.4 Results


Table 3: MetaMeta (v1.0) parameters used for the sub-sampled CAMI data. Default parameters were used when not stated below. N/A: not applicable.

Parameter       Default   CAMI 1%   CAMI 5%   CAMI 10%   CAMI 16.6%   CAMI 25%   CAMI 50%   CAMI 100%
trimming        0         1         1         1          1            1          1          1
desiredminlen   70        -         -         -          -            -          -          -
strictness      0.8       -         -         -          -            -          -          -
errorcorr       0         -         -         -          -            -          -          -
subsample       0         1         1         1          1            1          1          -
samplesize      1         0.01      0.05      0.1        -            0.25       0.5        N/A
replacement     0         -         -         -          -            1          1          N/A
mode            linear    precise   precise   precise    precise      precise    precise    precise
cutoff          0.0001    0.00001   0.00001   0.00001    0.00001      0.00001    0.00001    0.00001
bins            4         3         3         3          3            3          3          3
ranks           species   -         -         -          -            -          -          -


Figure 3: True and False Positives - CAMI medium complexity set. In blue (left y axis): True Positives. In red (right y axis): False Positives. Results at species level. Each marker represents one out of four samples from the CAMI medium complexity set.
[Plot not reproduced. x axis: clark, dudes, gottcha, kaiju, kraken, motus, metametamerge; left y axis: True Positives (55 to 75); right y axis: False Positives (0 to 2500).]


Figure 4: Precision and Sensitivity - CAMI medium complexity set. Results at species level. Each marker represents one out of four samples from the CAMI medium complexity set.
[Plot not reproduced. x axis: Sensitivity (0.0 to 1.0); y axis: Precision (0.0 to 1.0); series: clark, dudes, gottcha, kaiju, kraken, motus, metametamerge.]


Figure 5: L1 norm error. Mean of the L1 norm measure at each taxonomic level for four samples from the medium complexity CAMI set.
[Plot not reproduced. x axis: superkingdom, phylum, class, order, family, genus, species; y axis: L1 norm (0.0 to 1.4); series: clark, dudes, gottcha, kaiju, kraken, motus, metametamerge.]
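For reference, the measures reported in Figures 3 to 5 can be computed from a predicted and a reference profile as sketched below. The sketch uses the standard definitions (precision = TP / (TP + FP), sensitivity = TP / (TP + FN), L1 norm = sum of absolute abundance differences). Whether the evaluation applied any additional normalization or filtering is not stated here, so this is an assumption. All function names and the toy profiles are hypothetical.

```python
def detection_counts(predicted, reference):
    """Return (TP, FP, FN) for two collections of taxon identifiers."""
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)
    return tp, len(predicted - reference), len(reference - predicted)


def precision_sensitivity(predicted, reference):
    """Precision and sensitivity of the predicted taxa against the reference."""
    tp, fp, fn = detection_counts(predicted, reference)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return precision, sensitivity


def l1_norm(predicted, reference):
    """Sum of absolute abundance differences over taxa in either profile."""
    taxa = set(predicted) | set(reference)
    return sum(abs(predicted.get(t, 0.0) - reference.get(t, 0.0)) for t in taxa)


# Toy profiles at one taxonomic level: taxon -> relative abundance.
reference = {"A": 0.5, "B": 0.3, "C": 0.2}
predicted = {"A": 0.6, "B": 0.2, "D": 0.2}

counts = detection_counts(predicted, reference)
prec, sens = precision_sensitivity(predicted, reference)
error = l1_norm(predicted, reference)
```

For two profiles that each sum to 1, the L1 norm lies between 0 (identical profiles) and 2 (no shared taxa).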


Figure 6: True and False Positives - CAMI low complexity set. In blue (left y axis): True Positives. In red (right y axis): False Positives. Results at species level.
[Plot not reproduced. x axis: clark, dudes, gottcha, kaiju, kraken, motus, metametamerge; left y axis: True Positives (5 to 10); right y axis: False Positives (0 to 2500).]


Figure 7: Precision and Sensitivity - CAMI low complexity set. Results at species level.
[Plot not reproduced. x axis: Sensitivity (0.0 to 1.0); y axis: Precision (0.0 to 1.0); series: clark, dudes, gottcha, kaiju, kraken, motus, metametamerge.]


Figure 8: L1 norm error. L1 norm measure at each taxonomic level for one sample from the low complexity CAMI set.
[Plot not reproduced. x axis: superkingdom, phylum, class, order, family, genus, species; y axis: L1 norm (0.0 to 1.5); series: clark, dudes, gottcha, kaiju, kraken, motus, metametamerge.]


Figure 9: Sub-sampling. Precision at species level for one randomly selected CAMI high complexity sample. Each sub-sample was executed five times. Lines represent the mean and the area around them the minimum and maximum achieved values. The evaluated sample sizes are: 100%, 50%, 25%, 16.6%, 10%, 5%, 1%. 16.6% is the exact division among 6 tools, using the whole sample. Sub-samples above that value were taken with replacement and below without replacement.
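The relation between sample size, number of tools and replacement can be made concrete with a small sketch. It reflects one possible reading of the caption and is not the MetaMeta code: without replacement each tool receives a disjoint share of the reads, which is only possible while samplesize times the number of tools does not exceed 1 (hence the 16.6% threshold for 6 tools). With replacement each tool draws its share independently, so a read may be given to several tools. The function `subsample_per_tool` and its parameters are hypothetical.

```python
import random


def subsample_per_tool(read_ids, tools, samplesize, replacement, seed=0):
    """Assign each tool a sub-sample holding `samplesize` of all reads.

    replacement=False: sub-samples are disjoint between tools.
    replacement=True: each tool draws independently, so sub-samples of
    different tools may overlap (no read repeats within one tool).
    """
    rng = random.Random(seed)
    n = round(len(read_ids) * samplesize)
    if replacement:
        return {tool: rng.sample(read_ids, n) for tool in tools}
    if n * len(tools) > len(read_ids):
        raise ValueError("not enough reads for disjoint sub-samples")
    shuffled = rng.sample(read_ids, len(read_ids))
    return {tool: shuffled[i * n:(i + 1) * n] for i, tool in enumerate(tools)}


# Toy data: 600 hypothetical read ids shared among the six tools.
reads = [f"read_{i}" for i in range(600)]
tools = ["clark", "dudes", "gottcha", "kaiju", "kraken", "motus"]

disjoint = subsample_per_tool(reads, tools, samplesize=0.1, replacement=False)
overlapping = subsample_per_tool(reads, tools, samplesize=0.5, replacement=True)
```

With six tools, a sample size of 1/6 is the largest value for which the disjoint split succeeds, and at that value every read of the sample is used exactly once.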


Figure 10: Rulegraph. Overview of the rules and their dependencies in the MetaMeta pipeline.
[Graph not reproduced; the edges are not recoverable from the extracted text. Rules shown:
General: all, trim_reads, errorcorr_reads, subsample_reads, clean_reads, clean_files, database_profile, metametamerge_get_taxdump, metametamerge, krona.
Pre-configured database: db_archaea_bacteria_download, db_archaea_bacteria_unpack, db_archaea_bacteria_check.
CLARK: clark_db_custom_1, clark_db_custom_2, clark_db_custom_3, clark_db_custom_check, clark_db_custom_profile, clark_run_1, clark_rpt.
DUDes: dudes_db_custom_1, dudes_db_custom_2, dudes_db_custom_3, dudes_db_custom_4, dudes_db_custom_check, dudes_db_custom_profile, dudes_run_1, dudes_run_2, dudes_rpt.
GOTTCHA: gottcha_run_1, gottcha_rpt.
kaiju: kaiju_db_custom_1, kaiju_db_custom_2, kaiju_db_custom_3, kaiju_db_custom_4, kaiju_db_custom_check, kaiju_db_custom_profile, kaiju_run_1, kaiju_rpt.
kraken: kraken_db_custom_1, kraken_db_custom_2, kraken_db_custom_3, kraken_db_custom_check, kraken_db_custom_profile, kraken_run_1, kraken_rpt.
mOTUs: motus_run_1, motus_rpt.]

Figure 11: DAG - pre-configured database. Directed acyclic graph of the MetaMeta pipeline for one sample, one database (pre-configured) and six tools.
[Graph not reproduced. Nodes are instances of the rules in Figure 10, annotated with tool (clark, dudes, gottcha, kaiju, kraken, motus), database (archaea_bacteria) and sample (sample_data_custom_viral).]


Figure 12: DAG - custom database. Directed acyclic graph of the MetaMeta pipeline for one sample, one database (custom) and 4 tools.
[Graph not reproduced. Nodes are instances of the rules in Figure 10, annotated with tool (clark, dudes, kaiju, kraken), database (custom_viral_db) and sample (sample_data_custom_viral).]

Figure 13: DAG - multiple samples. Directed acyclic graph of the MetaMeta pipeline for two samples, two databases (pre-configured and custom) and six tools.
[Graph not reproduced. Nodes are instances of the rules in Figure 10, annotated with tool, database (archaea_bacteria, custom_viral_db) and sample (sample_data_custom_viral, sample_data_custom_viral_2).]
