day3 Lecture binning - medicalintelligence.org

www.elixir-europe.org

ELIXIRWorkshop in metagenomics

Nador February 2018

www.elixir-europe.org

Genome assembly from metagenomic reads

Overview• Obtaining a genome sequence

• Binning

• Taxonomic classification of bins

Why do we want to obtain a genome sequence?o Genome

o Chromosomal structure and organisation

o Identify all genes and predict their functions

o Starting point for other studies

o Transcriptomics, proteomics, evolutionary studies

What is in the databases - for example RefSeq?oBiased with sequences having pathogenic or industrial implications

oLarge fraction of Proteobacteria

oHost-associated are overrepresented

Ecosystem Total

Host-associated 11,816

Humans 4973

Animal 1804

Plants 1410

Mammals 867

Other 2762

Environmental 6774

Aquatic 4559

Terrestrial 2057

Other 158

Engineered systems 1658

Food production 440

Wastewater 410

Lab synthesis 387

Other 418

Total 20,248GOLD database

RefSeq

•Microbial dark matter• Uncultivable organisms

• Many lineages known from 16 rRNA sequencing lacks a genomic representative

Christian Rinke / Tanja Woyke, DOE JGI

Obtaining a genome sequence

Genome sequencing => Requires culturing

Single cell sequencing => Incomplete genomes

Metagenomic sequencing => The solution?

Recap – Challenges in assembly of genome sequences

homes.cs.washington.edu

• Identify overlapping sequence reads or K-mers and create a graph

• Challenges when there are variations or repeats

• Creates bubbles in the graph

• Split into contigs

TCGCGGACGCGGATGCGGATT

* * *

Obtaining a genome sequence from a metagenomic sample

www.slideshare.net Mads Albertsen, University of Vienna

• Metagenomic Assembled Genomes (MAGs)

• Similar as to genome sequencing

• Trying to reconstruct the individual genomes of a mixture of DNA from an entire population

Challenges obtaining a MAGs• Very fragmented and rarely complete genomes in the sample

• Highly diverse DNA (extremely many K-mers?)

• Diverse level of abundance

• Different relatedness to each other (same specie but different strains?)

• Computational challenging

www.slideshare.net Mads Albertsen, University of Vienna

Challenges obtaining a MAGs• Most metagenomics assemblers use de Bruijn graphs

• Algorithms for single genome assemblies cannot be used directly

• Digital normalization aims to eliminate redundant reads

• Partition the de Bruijun graph prior to assembly – lower memory costs

• ‘Bubble popping' procedure. Parallel paths in the graph that differ by only a small amount, these paths are collapsed into one

• Metagenomic assemblies will still be highly fragmented - Binning

Binning• Method to sort data values into a smaller groups or “bins”

• For example to group animals into more taxon-specific bins

Jungle Animals Real Life Wallpapers Hd

Binning• Method to sort data values into a smaller groups or “bins”

• For example to group animals into more taxon-specific bins

• Various taxonomic levels: All belong to Kindom = Animalia, but Class = Aves AND Mammalia

Animalia

Mammalia Aves

Binning - metagenomics• Group contigs or reads belonging to the same specie

• Group nucleotide sequences based on composition

• Group nucleotide sequences based on abundance

SlideShare / CoMet: Coverage and Composition based binning of Metagenomes

Two types of binning strategies• Taxonomy dependent and taxonomy independent strategies

• Taxonomy dependent methods classifies the DNA fragments by performing a standard homology inference against a reference database

• Taxonomy independent methods performs the reference-free binning by applying clustering techniques

BINNING

Taxonomy independent

Alignment based

Taxonomy dependent

Composition based

Hybrid

Abundance based

Composition based

Hybrid

ReferenceNo reference

Two types of binning strategies• Taxonomy dependent and taxonomy independent strategies

Modified from: Siegwald et al. / PLoS ONE 12(1): e0169563. https://doi.org/10.1371/journal.pone.0169563

How are nucleotide sequences binned?• Abundance based binning

• Also called coverage based binning

• Sequences originating from the same specie will have similar abundance in the sample

SlideShare / A. Murat Eren, Intro to metagenomic binning

How are nucleotide sequences binned?• Composition based binning

• Genomic signatures have been shown to display a species-specific pattern

• GC content is simple and commonly used genomic signature

• More widely used genomic signature is tetranucleotide frequencies

SlideShare / A. Murat Eren, Intro to metagenomic binning

Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG


AATTATTC

TTCCTCCG

CCGG


AATTATTC

TTCCTCCG

CCGG

AATT ATTC TTCC TCCG CCGG

Seq1


AATTATTC

TTCCTCCG

CCGG

AATT ATTC TTCC TCCG CCGG

Seq1 1 1 1 1 1


AATTATTC

TTCCTCCG

CCGG

AATT ATTC TTCC TCCG CCGG ATTA TTAA TAAG AAGG

Seq1 1 1 1 1 1

Seq2 1 1 1 1 1

Seq2: AATTAAGGAATT

ATTATTAA

TAAGAAGG


AATTATTC

TTCCTCCG

CCGG

AATT ATTC TTCC TCCG CCGG ATTA TTAA TAAG AAGG TCCC AGGA GGAA GAAG TAAT

Seq1 1 1 1 1 1

Seq2 1 1 1 1 1

Seq3 2 1 1 1

Seq4 2 1 1 1

Seq5 1 1 2 1

Seq2: AATTAAGGAATT

ATTATTAA

TAAGAAGG

Seq3: AAGGAAGGAAGG

AGGAGGAA

GAAGAAGG

Seq4: AATTAATTAATT

ATTATTAA

TAATAATT

Seq5: GGAAGGAAGGAA

GAAGAAGG

AGGAGGAA


AATTATTC

TTCCTCCG

CCGG

AATT ATTC TTCC TCCG CCGG ATTA TTAA TAAG AAGG TCCC AGGA GGAA GAAG TAAT

Seq1 1 1 1 1 1

Seq2 1 1 1 1 1

Seq4 2 1 1 1

Seq3 2 1 1 1

Seq5 1 1 2 1

Seq2: AATTAAGGAATT

ATTATTAA

TAAGAAGG

Seq3: AAGGAAGGAAGG

AGGAGGAA

GAAGAAGG

Seq4: AATTAATTAATT

ATTATTAA

TAATAATT

Seq5: GGAAGGAAGGAA

GAAGAAGG

AGGAGGAA


AATTATTC

TTCCTCCG

CCGG

Seq1

Seq2

Seq4

Seq3

Seq5

Seq2: AATTAAGGAATT

ATTATTAA

TAAGAAGG

Seq3: AAGGAAGGAAGG

AGGAGGAA

GAAGAAGG

Seq4: AATTAATTAATT

ATTATTAA

TAATAATT

Seq5: GGAAGGAAGGAA

GAAGAAGG

AGGAGGAA


AATTATTC

TTCCTCCG

CCGG

Seq1

Seq2

Seq4

Seq3

Seq5

Seq2: AATTAAGGAATT

ATTATTAA

TAAGAAGG

Seq3: AAGGAAGGAAGG

AGGAGGAA

GAAGAAGG

Seq4: AATTAATTAATT

ATTATTAA

TAATAATT

Seq5: GGAAGGAAGGAA

GAAGAAGG

AGGAGGAA

Seq1

Seq2 Seq4Seq3

Seq5

PCAAnalysis

PCA1

PCA2


AATTATTC

TTCCTCCG

CCGG

Seq1

Seq2

Seq4

Seq3

Seq5

Seq2: AATTAAGGAATT

ATTATTAA

TAAGAAGG

Seq3: AAGGAAGGAAGG

AGGAGGAA

GAAGAAGG

Seq4: AATTAATTAATT

ATTATTAA

TAATAATT

Seq5: GGAAGGAAGGAA

GAAGAAGG

AGGAGGAA

Seq1

Seq2 Seq4Seq3

Seq5Bin 1

Bin 2PCAAnalysis

PCA1

PCA2

K. Sedlar et al. / Computational and Structural Biotechnology Journal 15 (2017)

Binning – Taxonomy independent methods• Hybrid binning

• Combine abundance and composition

• Give more accurate binning results

• Can be performed on either sequence reads or assembled contigs

• Potentially separate subspecies into individual bins

Binning – Taxonomy dependent methods• Referred to as supervised.

• Methods rely on comparison of contigs or reads against reference databases (known phylogenetic origin)

• Reads classified under similar taxonomic categories are finally grouped into bins

• Comparison on sequence level using:

•BLAST

•BLAT

•Bowtie

•BWA

• Comparison on model level using:

•HMMs + databases

Modified from: Mande et al. / Briefings in Bioinformatics, Volume 13, Issue 6, 1 November 2012, Pages 669–681

Binning – Taxonomy dependent methods

Taxonomy dependent Taxonomy independent

Binning – Taxonomy independent methods• Referred to as un-supervised

• Enables the discovery of new microbial of new organisms

• Two types of features used for classification

• Sequence composition based

•Assumption that the genome composition is unique for each taxon

•DNA fragments from the same genome are more similar than those from different genomes

•Cluster formation being defined by k-mer composition

• Abundance based

•Coverage reflecting abundance of given tax

•Cluster formation being defined by k-mer abundance

• Binning of assembled contigs using an expectation-maximization algorithm

• Bins are predicted from initial identification of marker genes in assembled sequences

• Tetranucleotide frequencies and scaffold coverages are combined to organize metagenomicsequences into individual bins

• Estimation of genome completeness – 107 marker genes

Wu et al., Microbiome 2014 2:26

Taxonomy independent binning of contigs - MaxBin

• Sequencing depths highly affect the results

• 10-genome simulated datasets - 20X versus 80X coverage

MaxBin - preformance

Wu et al., Microbiome 2014 2:26

S. Girotto et al. / Bioinformatics, Volume 32, Issue 17, 1 September 2016, Pages i567–i575

Taxonomy independent binning of sequence reads- MetaProb• Assembly-assisted tool for binning of reads

• Phase 1 groups overlapping reads into groups

• Phase 2 builds the probabilistic sequence signatures of independent reads and merges the groups into clusters

K. Sedlar et al. / Computational and Structural Biotechnology Journal 15 (2017)

Binning – Taxonomy independent methods

• Investigated performance when recovering individual genome bins

• Large variation:

• Average genome completeness (34% to 80%)

• Average purity (70% to 97%)

Reference

Binning results for the CAMI data sets

Selecting a binning method• Highly dependent on the sample and the aim of the project and available resources

• The length of the metagenomic sequences – key factor

• Ultra-short sequences (75 bp) - assembly step becomes a necessity

• Short length sequences (200-400 bp)- alignment-based or hybrid binning methods

• Long length sequences - alignment-based as well as composition-based binning methods

• Are you aiming to identify novel un-culturable species?

• Human microbiota

•Most species are known

•Presence or absence of one or several species?

• Environmental samples

•Most species are unknown

Visualization of bins - GroopM• Assume that sequences in bins originate from the same specie

• Sequences are positioned according to the first two principal components of their tetranucleotide frequencies

• The diameter of each circle is proportional to the length its respective contig

Ilmerford et al. / PeerJ. 2014; 2: e603. Published online 2014 Sep 30

Visualization of bins - GroopM• Assume that sequences in bins originate from the same specie - MAG

• Sequences are visualized in coverage space

• Contigs belonging to a single bin are assigned the same color

Ilmerford et al. / PeerJ. 2014; 2: e603. Published online 2014 Sep 30

Refinement of bins - ICoVeR• The final bins will contain errors (sequences that belong to another bin)

• Curation of bin assignments based on multiple binning algorithms

• Higher completeness and lower contamination

Refinement of bins - RefineM• Methods for outlier filtering reduces the total number of contigs being binned

• Deviating GC

• Deviating tetranucleotide composition

• Deviating coverage depth

• Identify contigs with a coding density suggestive of a Eukaryotic origin

Refinement of bins

Estimate completeness and contamination of MAGs• Assembly statistics

• Total size of MAG (sum of contigs in bin)

• Contig size (N50 value)

• Presence and absence of lineage-specific genes

• Presence of 20 standard tRNAs

Estimate completeness and contamination of MAGs• CheckM assess the quality of genomes recovered from metagenomes

• Estimate genome completeness and contamination

• Using collocated single-copy marker genes within a phylogenetic lineage

•Bacteria: 104 markers organized into 58 sets

•Archaea: 150 markers organized into 108 sets

Parks et al. / Nat Microbiol. 2017 Nov;2(11):1533-1542.

Taxonomic classification of MAGs• Identification of 16S rRNA genes in MAGs

• However many MAGs often lack 16s rRNA genes

• Identification of lineage-specific genes

• Kraken – you have seen before

• Kaiju – you have seen before

• BBtools - sketch

Bbtools sketch - Hash functions• Comparing hash tables is much faster than comparing sequences

• Alignment-based = high computational cost

• Composition-based = fast, much less compute intensive




Erik Hassan Nils Espen Kamal

DATABASE OF ALL CRIMINALS IN MOROCCO

?





DATABASE OF ALL CRIMINALS IN MOROCCO

?Hash function Hashes

EH1

EH2

EH3

EH4

EH5





DATABASE OF ALL CRIMINALS IN MOROCCOHash function Hashes

ER1

ER2

ER3

ER4

ER5

EH1

EH2

EH3

EH4

EH5

HG1

HG2

HG3

HG4

HG5

NW1

NW2

NW3

NW4

NW5

ER1

ER2

ER3

ER4

ER5

KM1

KM2

KM3

KM4

KM5

?





DATABASE OF ALL CRIMINALS IN MOROCCOHash function Hashes

ER1

ER2

ER3

ER4

ER5

EH1

EH2

EH3

EH4

EH5

HG1

HG2

HG3

HG4

HG5

NW1

NW2

NW3

NW4

NW5

ER1

ER2

ER3

ER4

ER5

KM1

KM2

KM3

KM4

KM5

?761-167

Bbtools sketch - Hash functions• Sketch creation works like this:

• Gather all length kmers in a sequence – 31 nucleotides is typically used

• Apply a hash function to them

• A genome tend to contain millions of kmers

• Keep the N smallest hash codes and call this set a "sketch” – normally 10 000

Modified from: http://slideplayer.com/slide/10898138/

DATABASE WITH SKETCHES

?

Sketch

Binning pipeline - Desman• De novo Extraction of Strains from MetAgeNomes

Example of important outcome from MAGs• Analysed 1500 metagenomics datasets

• First genomes from 17 bacterial phyla and 3 archaeal phyla

Major outcomes from MAGs• All genomes are estimated to be ≥50% complete and nearly half are ≥90%

complete with ≤5% contamination.

• These genomes increase the phylogenetic diversity of bacterial and archaeal genome trees by >30%

Parks et al. / Nat Microbiol. 2017 Nov;2(11):1533-1542.

day3 Lecture binning - medicalintelligence.org

Documents

Transcript of day3 Lecture binning - medicalintelligence.org