2013 hmp-assembly-webinar
![Page 1: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/1.jpg)
C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
HMP – Metagenome assembly
![Page 2: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/2.jpg)
Acknowledgements
Lab members involved
• Adina Howe (w/Tiedje)
• Jason Pell
• Arend Hintze
• Rosangela Canino-Koning
• Qingpeng Zhang
• Elijah Lowe
• Likit Preeyanon
• Jiarong Guo
• Tim Brom
• Kanchan Pavangadkar
• Eric McDonald
• Jordan Fish
• Chris Welcher
Collaborators
• Jim Tiedje, MSU
• Billie Swalla, UW
• Janet Jansson, LBNL
• Susannah Tringe, JGI
Funding
USDA NIFA; NSF IOS; BEACON.
![Page 3: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/3.jpg)
Open, online science
All of the software and approaches I’m talking about today are available:
Assembling large, complex metagenomes: arxiv.org/abs/1212.2832
khmer software: github.com/ged-lab/khmer/
Blog: http://ivory.idyll.org/blog/
Twitter: @ctitusbrown
![Page 4: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/4.jpg)
Illumina! De Bruijn graphs!
• Today I’ll be talking about Illumina data sets, and de Bruijn graph assembly (k-mer assembly).
• This is because my research has largely focused on scaling to large data sets (soil metagenomics!) and Illumina is the real scaling challenge.
![Page 5: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/5.jpg)
Assembler heuristics
• In order to build assemblies, each assembler makes choices – uses heuristics – to reach a conclusion.
• These heuristics may not be appropriate for your sample!
– High polymorphism?
– Mixed population vs clonal?
– Genomic vs metagenomic vs mRNA
– Low coverage drives differences in assembly.
![Page 6: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/6.jpg)
Evaluating assembly
Evaluating correctness of metagenomes is still undiscovered country.
![Page 7: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/7.jpg)
Shotgun sequencing
“Coverage” is simply the average number of reads that overlap each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
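As a rough illustration of this definition, the sketch below counts how many reads overlap each base and averages the result. The function name, the toy genome length, and the read positions are all made up for the example; real coverage would normally be computed from read mappings.

```python
def mean_coverage(read_starts, read_length, genome_length):
    """Return (average depth, per-base depth) for reads placed on a genome."""
    depth = [0] * genome_length
    for start in read_starts:
        # Clip reads that would run past the end of the genome.
        for pos in range(start, min(start + read_length, genome_length)):
            depth[pos] += 1
    return sum(depth) / genome_length, depth


# Hypothetical toy data: a 100 bp genome with a 10 bp read starting at
# every position from 0 to 90.
starts = list(range(0, 91))
avg, depth = mean_coverage(starts, read_length=10, genome_length=100)
print(avg, depth[50])
```

In this toy layout a vertical line drawn through the middle of the genome crosses 10 reads, while the average is slightly lower because the ends are covered less deeply.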
![Page 8: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/8.jpg)
Reducing to k-mer overlaps
Note that k-mer abundance is not properly represented here! Each blue k-mer will be present around 10 times.
![Page 9: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/9.jpg)
Errors create new k-mers
Each single base error generates ~k new k-mers.
Generally, erroneous k-mers show up only once – errors are random.
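A small sketch of why one error yields about k new k-mers: every k-mer window that spans the erroneous base changes. The read sequence, the error position, and the helper names below are hypothetical, chosen only for illustration.

```python
def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def new_kmers_from_error(seq, pos, base, k):
    """k-mers present in the mutated read but absent from the original."""
    mutated = seq[:pos] + base + seq[pos + 1:]
    return set(kmers(mutated, k)) - set(kmers(seq, k))


read = "ACGTTGCATGGCCATTAGCAGTCA"  # hypothetical 24 bp read
novel = new_kmers_from_error(read, pos=12, base="T", k=5)
print(len(novel), sorted(novel))
```

An error near the end of a read is spanned by fewer windows, so it creates fewer than k new k-mers; hence "~k" rather than exactly k.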
![Page 10: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/10.jpg)
So, k-mer abundance plots are mixtures of true and false k-mers.
![Page 11: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/11.jpg)
Counting k-mers - histograms
Low-abundance peak (errors)
![Page 12: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/12.jpg)
Counting k-mers - histograms
High-abundance peak (true k-mers)
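The two peaks can be reproduced on toy data. The sketch below is an assumption-laden illustration, not how khmer counts k-mers: it uses an exact in-memory Counter, ignores reverse complements, and the reads are invented (ten identical error-free reads plus one read with a single substitution).

```python
from collections import Counter


def kmer_abundance_histogram(reads, k):
    """Map abundance -> number of distinct k-mers seen that many times."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


# Hypothetical data: 10x coverage of one sequence plus one erroneous read.
reads = ["ACGTACGGA"] * 10 + ["ACGTTCGGA"]
hist = kmer_abundance_histogram(reads, k=4)
for abundance in sorted(hist):
    print(abundance, hist[abundance])
```

The k-mers created by the error land at abundance 1 (the low-abundance peak), while the true k-mers sit at abundance 10 or 11 (the high-abundance peak).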
![Page 13: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/13.jpg)
Approach: Digital normalization
(a computational version of library normalization)
Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!!
This 100x will consume disk space and, because of errors, memory.
We can discard it for you…
![Page 14: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/14.jpg)
Digital normalization
![Page 15: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/15.jpg)
Digital normalization
![Page 16: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/16.jpg)
Digital normalization
![Page 17: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/17.jpg)
Digital normalization
![Page 18: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/18.jpg)
Digital normalization
![Page 19: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/19.jpg)
Digital normalization
![Page 20: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/20.jpg)
Digital normalization approach
Diginorm, a digital analog to cDNA library normalization:
• Is reference free;
• Is single pass: looks at each read only once;
• Does not “collect” the majority of errors;
• Keeps all low-coverage reads;
• Smooths out coverage of regions.
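The properties listed above can be illustrated with a minimal sketch of the idea: estimate each read's coverage from the median abundance of its k-mers seen so far, and keep the read only if that estimate is below a cutoff. This is a simplification under stated assumptions, not the khmer implementation: it uses an exact Counter rather than a memory-efficient counting structure, ignores reverse complements, and the function name, the default `k` and `cutoff` values, and the toy reads are hypothetical.

```python
from collections import Counter
from statistics import median


def digital_normalization(reads, k=20, cutoff=20):
    """Keep a read only if its estimated coverage so far is below cutoff."""
    counts = Counter()
    kept = []
    for read in reads:  # single pass: each read is examined once
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read shorter than k
        # Median k-mer abundance is the reference-free coverage estimate.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept


# Hypothetical 10:1 mixture: 100 copies of read A, 10 copies of read B.
read_a = "ACGTACGGATTC"
read_b = "TTGACCAGTCAA"
kept = digital_normalization([read_a] * 100 + [read_b] * 10, k=4, cutoff=5)
print(kept.count(read_a), kept.count(read_b))
```

In this toy run both species end up at the same retained depth, and the 95 redundant copies of A are discarded without ever being stored in the counts.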
![Page 21: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/21.jpg)
Coverage before digital normalization:
(MD amplified)
![Page 22: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/22.jpg)
Coverage after digital normalization:
Normalizes coverage
Discards redundancy
Eliminates the majority of errors
Scales assembly dramatically.
Assembly is 98% identical.
![Page 23: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/23.jpg)
In our experience…
• Digital normalization produces “good” metagenome assemblies.
• Smooths out abundance variation, strain variation.
• Reduces computational requirements for assembly.
• It also kinda makes sense :)
![Page 24: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/24.jpg)
Additional Approach for Metagenomes: Data partitioning
(a computational version of cell sorting)
Split reads into “bins” belonging to different source species.
Can do this based almost entirely on connectivity of sequences.
“Divide and conquer”
Memory-efficient implementation helps to scale assembly.
Pell et al., 2012, PNAS
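A simplified sketch of connectivity-based partitioning: reads that share any k-mer are placed in the same bin, and bins are merged transitively with a union-find structure. This is an illustration only. It is not the memory-efficient graph representation used in khmer, it treats "shares a k-mer" as a stand-in for graph connectivity, it ignores reverse complements, and the function name and toy reads are hypothetical.

```python
def partition_reads(reads, k):
    """Group reads into bins of transitively k-mer-connected reads."""
    parent = list(range(len(reads)))

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_seen:
                union(first_seen[kmer], idx)
            else:
                first_seen[kmer] = idx

    partitions = {}
    for idx, read in enumerate(reads):
        partitions.setdefault(find(idx), []).append(read)
    return list(partitions.values())


# Hypothetical reads from two unrelated "genomes".
reads = ["ACGTACGG", "TACGGATT", "TTGACCAG", "ACCAGTCA"]
parts = partition_reads(reads, k=4)
print(parts)
```

Each resulting bin can then be assembled independently, which is the "divide and conquer" step.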
![Page 25: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/25.jpg)
Partitioning separates reads by genome.
Strain variants co-partition.
When computationally spiking HMP mock data with one E. coli genome (left) or multiple E. coli strains (right), the majority of partitions contain reads from only a single genome (blue) vs multi-genome partitions (green).
Partitions containing spiked data are indicated with a *.
Adina Howe
![Page 26: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/26.jpg)
Conclusions re strain variation/chimerism (previous slide)
• When spiking in intentionally complex mixtures, only a small fraction of partitions are chimeric.
• This means that only a small fraction of contigs could be chimeric.
• Strain variants will almost certainly assemble together.
• Can separate on abundance. See Sharon et al., 2013, PMID 22936250, for Banfield work on this.
![Page 27: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/27.jpg)
Looking at k-mer histograms…
![Page 28: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/28.jpg)
Diginorm shifts left
![Page 29: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/29.jpg)
Partitioning picks out different genomes
![Page 30: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/30.jpg)
Error correction “fixes” k-mers
Jason Pell
![Page 31: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/31.jpg)
Our experience
• Our metagenome assemblies compare well with others, but we have little in the way of ground truth with which to evaluate.
• Scaffold assembly is tricky; we believe in contig assembly for metagenomes, but not scaffolding.
• See arXiv paper, “Assembling large, complex metagenomes”, for our suggested pipeline and statistics & references.
![Page 32: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/32.jpg)
Metagenomic assemblies are highly variable
Adina Howe et al., arXiv 1212.0159
![Page 33: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/33.jpg)
High coverage is needed.
Low coverage is the dominant problem blocking assembly of your soil metagenome.
![Page 34: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/34.jpg)
Strain variation (soil)
[Figure: top two allele frequencies (y-axis) vs. position within contig (x-axis)]
Of the 5000 most abundant contigs, only 1 has a polymorphism rate > 5%.
Can measure by read mapping.
![Page 35: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/35.jpg)
Overconfident predictions
• We can assemble virtually anything but soil ;).
– Genomes, transcriptomes, MDA, mixtures, etc.
– Repeat resolution will be fundamentally limited by sequencing technology (insert size; sampling depth)
• Strain variation confuses assembly, but does not prevent useful results.
– Diginorm is a systematic strategy to enable assembly.
– Banfield has shown how to deconvolve strains at differential abundance.
– Kostas K. results suggest that there will be a species gap sufficient to prevent contig misassembly.
– Even genes “chimeric” between strains are useful.
![Page 36: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/36.jpg)
Reasons why you shouldn’t believe me
1) Strain variation – when we get deeper in soil, we should see more (?). Not sure what will happen, and we do not (yet) have proven approaches.
2) We, by definition, are not yet seeing anything that doesn’t assemble.
3) We have not tackled scaffolding much. Serious investigation of scaffolding will be necessary for any good genome assembly, and scaffolding is a weak point.
![Page 37: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/37.jpg)
Metagenome assemblers
In addition to khmer prefiltering,
• SPAdes
• IDBA-UD
• MetaVelvet
• Ray Meta
![Page 38: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/38.jpg)
Assembling in the cloud
• Most metagenomes require 50-150 GB of RAM.
• Many people don’t have access to computers of that size.
• Amazon Web Services (aws.amazon.com) will happily rent you such computers for $1-2/hr.
• I will post instructions and sample data sets for using Amazon today at ged.msu.edu/angus/.
![Page 39: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/39.jpg)
Current research
• Optimizing our programs => faster.
• Building an evaluation framework for metagenome assemblers.
• Error correction!
![Page 40: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/40.jpg)
De novo metagenome error correction makes reads more mappable.
Jason Pell, unpub.
![Page 41: 2013 hmp-assembly-webinar](https://reader035.fdocuments.us/reader035/viewer/2022062405/554e75b8b4c90545698b4ce0/html5/thumbnails/41.jpg)
Concluding thoughts
• Achieving one or more assemblies is fairly straightforward.
• Evaluating them is challenging, however, and is where you should be thinking hardest about assembly.
• There are relatively few pipelines available for analyzing assembled metagenomic data. MG-RAST does support this; others?