day3 Lecture binning - medicalintelligence.org
Transcript of day3 Lecture binning - medicalintelligence.org
www.elixir-europe.org
ELIXIRWorkshop in metagenomics
Nador February 2018
www.elixir-europe.org
Genome assembly from metagenomic reads
Overview• Obtaining a genome sequence
• Binning
• Taxonomic classification of bins
Why do we want to obtain a genome sequence?o Genome
o Chromosomal structure and organisation
o Identify all genes and predict their functions
o Starting point for other studies
o Transcriptomics, proteomics, evolutionary studies
What is in the databases - for example RefSeq?oBiased with sequences having pathogenic or industrial implications
oLarge fraction of Proteobacteria
oHost-associated are overrepresented
Ecosystem Total
Host-associated 11,816
Humans 4973
Animal 1804
Plants 1410
Mammals 867
Other 2762
Environmental 6774
Aquatic 4559
Terrestrial 2057
Other 158
Engineered systems 1658
Food production 440
Wastewater 410
Lab synthesis 387
Other 418
Total 20,248GOLD database
RefSeq
•Microbial dark matter• Uncultivable organisms
• Many lineages known from 16 rRNA sequencing lacks a genomic representative
Christian Rinke / Tanja Woyke, DOE JGI
Obtaining a genome sequence
Genome sequencing => Requires culturing
Single cell sequencing => Incomplete genomes
Metagenomic sequencing => The solution?
Obtaining a genome sequence
Genome sequencing => Requires culturing
Single cell sequencing => Incomplete genomes
Metagenomic sequencing => The solution?
Recap – Challenges in assembly of genome sequences
homes.cs.washington.edu
• Identify overlapping sequence reads or K-mers and create a graph
• Challenges when there are variations or repeats
• Creates bubbles in the graph
• Split into contigs
TCGCGGACGCGGATGCGGATT
* * *
Obtaining a genome sequence from a metagenomic sample
www.slideshare.net Mads Albertsen, University of Vienna
• Metagenomic Assembled Genomes (MAGs)
• Similar as to genome sequencing
• Trying to reconstruct the individual genomes of a mixture of DNA from an entire population
Challenges obtaining a MAGs• Very fragmented and rarely complete genomes in the sample
• Highly diverse DNA (extremely many K-mers?)
• Diverse level of abundance
• Different relatedness to each other (same specie but different strains?)
• Computational challenging
www.slideshare.net Mads Albertsen, University of Vienna
Challenges obtaining a MAGs• Most metagenomics assemblers use de Bruijn graphs
• Algorithms for single genome assemblies cannot be used directly
• Digital normalization aims to eliminate redundant reads
• Partition the de Bruijun graph prior to assembly – lower memory costs
• ‘Bubble popping' procedure. Parallel paths in the graph that differ by only a small amount, these paths are collapsed into one
• Metagenomic assemblies will still be highly fragmented - Binning
Binning• Method to sort data values into a smaller groups or “bins”
• For example to group animals into more taxon-specific bins
Jungle Animals Real Life Wallpapers Hd
Binning• Method to sort data values into a smaller groups or “bins”
• For example to group animals into more taxon-specific bins
• Various taxonomic levels: All belong to Kindom = Animalia, but Class = Aves AND Mammalia
Animalia
Mammalia Aves
Binning - metagenomics• Group contigs or reads belonging to the same specie
• Group nucleotide sequences based on composition
• Group nucleotide sequences based on abundance
SlideShare / CoMet: Coverage and Composition based binning of Metagenomes
Two types of binning strategies• Taxonomy dependent and taxonomy independent strategies
• Taxonomy dependent methods classifies the DNA fragments by performing a standard homology inference against a reference database
• Taxonomy independent methods performs the reference-free binning by applying clustering techniques
BINNING
Taxonomy independent
Alignment based
Taxonomy dependent
Composition based
Hybrid
Abundance based
Composition based
Hybrid
ReferenceNo reference
Two types of binning strategies• Taxonomy dependent and taxonomy independent strategies
Modified from: Siegwald et al. / PLoS ONE 12(1): e0169563. https://doi.org/10.1371/journal.pone.0169563
How are nucleotide sequences binned?• Abundance based binning
• Also called coverage based binning
• Sequences originating from the same specie will have similar abundance in the sample
SlideShare / A. Murat Eren, Intro to metagenomic binning
How are nucleotide sequences binned?• Composition based binning
• Genomic signatures have been shown to display a species-specific pattern
• GC content is simple and commonly used genomic signature
• More widely used genomic signature is tetranucleotide frequencies
SlideShare / A. Murat Eren, Intro to metagenomic binning
Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG
Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG
AATTATTC
TTCCTCCG
CCGG
Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG
AATTATTC
TTCCTCCG
CCGG
AATT ATTC TTCC TCCG CCGG
Seq1
Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG
AATTATTC
TTCCTCCG
CCGG
AATT ATTC TTCC TCCG CCGG
Seq1 1 1 1 1 1
Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG
AATTATTC
TTCCTCCG
CCGG
AATT ATTC TTCC TCCG CCGG ATTA TTAA TAAG AAGG
Seq1 1 1 1 1 1
Seq2 1 1 1 1 1
Seq2: AATTAAGGAATT
ATTATTAA
TAAGAAGG
Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG
AATTATTC
TTCCTCCG
CCGG
AATT ATTC TTCC TCCG CCGG ATTA TTAA TAAG AAGG TCCC AGGA GGAA GAAG TAAT
Seq1 1 1 1 1 1
Seq2 1 1 1 1 1
Seq3 2 1 1 1
Seq4 2 1 1 1
Seq5 1 1 2 1
Seq2: AATTAAGGAATT
ATTATTAA
TAAGAAGG
Seq3: AAGGAAGGAAGG
AGGAGGAA
GAAGAAGG
Seq4: AATTAATTAATT
ATTATTAA
TAATAATT
Seq5: GGAAGGAAGGAA
GAAGAAGG
AGGAGGAA
Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG
AATTATTC
TTCCTCCG
CCGG
AATT ATTC TTCC TCCG CCGG ATTA TTAA TAAG AAGG TCCC AGGA GGAA GAAG TAAT
Seq1 1 1 1 1 1
Seq2 1 1 1 1 1
Seq3 2 1 1 1
Seq4 2 1 1 1
Seq5 1 1 2 1
Seq2: AATTAAGGAATT
ATTATTAA
TAAGAAGG
Seq3: AAGGAAGGAAGG
AGGAGGAA
GAAGAAGG
Seq4: AATTAATTAATT
ATTATTAA
TAATAATT
Seq5: GGAAGGAAGGAA
GAAGAAGG
AGGAGGAA
Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG
AATTATTC
TTCCTCCG
CCGG
AATT ATTC TTCC TCCG CCGG ATTA TTAA TAAG AAGG TCCC AGGA GGAA GAAG TAAT
Seq1 1 1 1 1 1
Seq2 1 1 1 1 1
Seq4 2 1 1 1
Seq3 2 1 1 1
Seq5 1 1 2 1
Seq2: AATTAAGGAATT
ATTATTAA
TAAGAAGG
Seq3: AAGGAAGGAAGG
AGGAGGAA
GAAGAAGG
Seq4: AATTAATTAATT
ATTATTAA
TAATAATT
Seq5: GGAAGGAAGGAA
GAAGAAGG
AGGAGGAA
Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG
AATTATTC
TTCCTCCG
CCGG
AATT ATTC TTCC TCCG CCGG ATTA TTAA TAAG AAGG TCCC AGGA GGAA GAAG TAAT
Seq1 1 1 1 1 1
Seq2 1 1 1 1 1
Seq4 2 1 1 1
Seq3 2 1 1 1
Seq5 1 1 2 1
Seq2: AATTAAGGAATT
ATTATTAA
TAAGAAGG
Seq3: AAGGAAGGAAGG
AGGAGGAA
GAAGAAGG
Seq4: AATTAATTAATT
ATTATTAA
TAATAATT
Seq5: GGAAGGAAGGAA
GAAGAAGG
AGGAGGAA
Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG
AATTATTC
TTCCTCCG
CCGG
Seq1
Seq2
Seq4
Seq3
Seq5
Seq2: AATTAAGGAATT
ATTATTAA
TAAGAAGG
Seq3: AAGGAAGGAAGG
AGGAGGAA
GAAGAAGG
Seq4: AATTAATTAATT
ATTATTAA
TAATAATT
Seq5: GGAAGGAAGGAA
GAAGAAGG
AGGAGGAA
Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG
AATTATTC
TTCCTCCG
CCGG
Seq1
Seq2
Seq4
Seq3
Seq5
Seq2: AATTAAGGAATT
ATTATTAA
TAAGAAGG
Seq3: AAGGAAGGAAGG
AGGAGGAA
GAAGAAGG
Seq4: AATTAATTAATT
ATTATTAA
TAATAATT
Seq5: GGAAGGAAGGAA
GAAGAAGG
AGGAGGAA
Seq1
Seq2 Seq4Seq3
Seq5
PCAAnalysis
PCA1
PCA2
Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG
AATTATTC
TTCCTCCG
CCGG
Seq1
Seq2
Seq4
Seq3
Seq5
Seq2: AATTAAGGAATT
ATTATTAA
TAAGAAGG
Seq3: AAGGAAGGAAGG
AGGAGGAA
GAAGAAGG
Seq4: AATTAATTAATT
ATTATTAA
TAATAATT
Seq5: GGAAGGAAGGAA
GAAGAAGG
AGGAGGAA
Seq1
Seq2 Seq4Seq3
Seq5Bin 1
Bin 2PCAAnalysis
PCA1
PCA2
K. Sedlar et al. / Computational and Structural Biotechnology Journal 15 (2017)
Binning – Taxonomy independent methods• Hybrid binning
• Combine abundance and composition
• Give more accurate binning results
• Can be performed on either sequence reads or assembled contigs
• Potentially separate subspecies into individual bins
Binning – Taxonomy dependent methods• Referred to as supervised.
• Methods rely on comparison of contigs or reads against reference databases (known phylogenetic origin)
• Reads classified under similar taxonomic categories are finally grouped into bins
• Comparison on sequence level using:
•BLAST
•BLAT
•Bowtie
•BWA
• Comparison on model level using:
•HMMs + databases
Modified from: Mande et al. / Briefings in Bioinformatics, Volume 13, Issue 6, 1 November 2012, Pages 669–681
Binning – Taxonomy dependent methods
Taxonomy dependent Taxonomy independent
Binning – Taxonomy independent methods• Referred to as un-supervised
• Enables the discovery of new microbial of new organisms
• Two types of features used for classification
• Sequence composition based
•Assumption that the genome composition is unique for each taxon
•DNA fragments from the same genome are more similar than those from different genomes
•Cluster formation being defined by k-mer composition
• Abundance based
•Coverage reflecting abundance of given tax
•Cluster formation being defined by k-mer abundance
• Binning of assembled contigs using an expectation-maximization algorithm
• Bins are predicted from initial identification of marker genes in assembled sequences
• Tetranucleotide frequencies and scaffold coverages are combined to organize metagenomicsequences into individual bins
• Estimation of genome completeness – 107 marker genes
Wu et al., Microbiome 2014 2:26
Taxonomy independent binning of contigs - MaxBin
• Sequencing depths highly affect the results
• 10-genome simulated datasets - 20X versus 80X coverage
MaxBin - preformance
Wu et al., Microbiome 2014 2:26
S. Girotto et al. / Bioinformatics, Volume 32, Issue 17, 1 September 2016, Pages i567–i575
Taxonomy independent binning of sequence reads- MetaProb• Assembly-assisted tool for binning of reads
• Phase 1 groups overlapping reads into groups
• Phase 2 builds the probabilistic sequence signatures of independent reads and merges the groups into clusters
K. Sedlar et al. / Computational and Structural Biotechnology Journal 15 (2017)
Binning – Taxonomy independent methods
• Investigated performance when recovering individual genome bins
• Large variation:
• Average genome completeness (34% to 80%)
• Average purity (70% to 97%)
Reference
Binning results for the CAMI data sets
Selecting a binning method• Highly dependent on the sample and the aim of the project and available resources
• The length of the metagenomic sequences – key factor
• Ultra-short sequences (75 bp) - assembly step becomes a necessity
• Short length sequences (200-400 bp)- alignment-based or hybrid binning methods
• Long length sequences - alignment-based as well as composition-based binning methods
• Are you aiming to identify novel un-culturable species?
• Human microbiota
•Most species are known
•Presence or absence of one or several species?
• Environmental samples
•Most species are unknown
Visualization of bins - GroopM• Assume that sequences in bins originate from the same specie
• Sequences are positioned according to the first two principal components of their tetranucleotide frequencies
• The diameter of each circle is proportional to the length its respective contig
Ilmerford et al. / PeerJ. 2014; 2: e603. Published online 2014 Sep 30
Visualization of bins - GroopM• Assume that sequences in bins originate from the same specie - MAG
• Sequences are visualized in coverage space
• Contigs belonging to a single bin are assigned the same color
Ilmerford et al. / PeerJ. 2014; 2: e603. Published online 2014 Sep 30
Refinement of bins - ICoVeR• The final bins will contain errors (sequences that belong to another bin)
• Curation of bin assignments based on multiple binning algorithms
• Higher completeness and lower contamination
Refinement of bins - RefineM• Methods for outlier filtering reduces the total number of contigs being binned
• Deviating GC
• Deviating tetranucleotide composition
• Deviating coverage depth
• Identify contigs with a coding density suggestive of a Eukaryotic origin
Refinement of bins
Estimate completeness and contamination of MAGs• Assembly statistics
• Total size of MAG (sum of contigs in bin)
• Contig size (N50 value)
• Presence and absence of lineage-specific genes
• Presence of 20 standard tRNAs
Estimate completeness and contamination of MAGs• CheckM assess the quality of genomes recovered from metagenomes
• Estimate genome completeness and contamination
• Using collocated single-copy marker genes within a phylogenetic lineage
•Bacteria: 104 markers organized into 58 sets
•Archaea: 150 markers organized into 108 sets
Parks et al. / Nat Microbiol. 2017 Nov;2(11):1533-1542.
Taxonomic classification of MAGs• Identification of 16S rRNA genes in MAGs
• However many MAGs often lack 16s rRNA genes
• Identification of lineage-specific genes
• Kraken – you have seen before
• Kaiju – you have seen before
• BBtools - sketch
Bbtools sketch - Hash functions• Comparing hash tables is much faster than comparing sequences
• Alignment-based = high computational cost
• Composition-based = fast, much less compute intensive
Bbtools sketch - Hash functions• Comparing hash tables is much faster than comparing sequences
• Alignment-based = high computational cost
• Composition-based = fast, much less compute intensive
Erik Hassan Nils Espen Kamal
DATABASE OF ALL CRIMINALS IN MOROCCO
?
Bbtools sketch - Hash functions• Comparing hash tables is much faster than comparing sequences
• Alignment-based = high computational cost
• Composition-based = fast, much less compute intensive
Erik Hassan Nils Espen Kamal
DATABASE OF ALL CRIMINALS IN MOROCCO
?Hash function Hashes
EH1
EH2
EH3
EH4
EH5
Bbtools sketch - Hash functions• Comparing hash tables is much faster than comparing sequences
• Alignment-based = high computational cost
• Composition-based = fast, much less compute intensive
Erik Hassan Nils Espen Kamal
DATABASE OF ALL CRIMINALS IN MOROCCOHash function Hashes
ER1
ER2
ER3
ER4
ER5
EH1
EH2
EH3
EH4
EH5
HG1
HG2
HG3
HG4
HG5
NW1
NW2
NW3
NW4
NW5
ER1
ER2
ER3
ER4
ER5
KM1
KM2
KM3
KM4
KM5
?
Bbtools sketch - Hash functions• Comparing hash tables is much faster than comparing sequences
• Alignment-based = high computational cost
• Composition-based = fast, much less compute intensive
Erik Hassan Nils Espen Kamal
DATABASE OF ALL CRIMINALS IN MOROCCOHash function Hashes
ER1
ER2
ER3
ER4
ER5
EH1
EH2
EH3
EH4
EH5
HG1
HG2
HG3
HG4
HG5
NW1
NW2
NW3
NW4
NW5
ER1
ER2
ER3
ER4
ER5
KM1
KM2
KM3
KM4
KM5
?761-167
Bbtools sketch - Hash functions• Sketch creation works like this:
• Gather all length kmers in a sequence – 31 nucleotides is typically used
• Apply a hash function to them
• A genome tend to contain millions of kmers
• Keep the N smallest hash codes and call this set a "sketch” – normally 10 000
Modified from: http://slideplayer.com/slide/10898138/
DATABASE WITH SKETCHES
?
Sketch
Bbtools sketch - Hash functions• Sketch creation works like this:
• Gather all length kmers in a sequence – 31 nucleotides is typically used
• Apply a hash function to them
• A genome tend to contain millions of kmers
• Keep the N smallest hash codes and call this set a "sketch” – normally 10 000
Modified from: http://slideplayer.com/slide/10898138/
DATABASE WITH SKETCHES
?
Sketch
Binning pipeline - Desman• De novo Extraction of Strains from MetAgeNomes
Example of important outcome from MAGs• Analysed 1500 metagenomics datasets
• First genomes from 17 bacterial phyla and 3 archaeal phyla
Major outcomes from MAGs• All genomes are estimated to be ≥50% complete and nearly half are ≥90%
complete with ≤5% contamination.
• These genomes increase the phylogenetic diversity of bacterial and archaeal genome trees by >30%
Parks et al. / Nat Microbiol. 2017 Nov;2(11):1533-1542.