2015 vancouver-vanbug
![Page 1: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/1.jpg)
Building a platform for bioinformatics: some exciting
new directions for khmer.
C. Titus Brown
March 12, 2015
![Page 2: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/2.jpg)
Hello!
Associate Professor (#tenure!);
School of Veterinary Medicine
University of California, Davis.
More information at:
• ged.msu.edu/ (URL needs to be updated :)
• github.com/ged-lab/
• ivory.idyll.org/blog/
• @ctitusbrown
![Page 3: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/3.jpg)
Warnings
This talk contains information that may constitute
“forward-looking statements.” Generally, the words
“believe,” “expect,” “intend,” “estimate,”
“anticipate,” “project,” “will” and similar expressions
identify forward-looking statements, which generally
are not historical in nature.
I have been advised to put this disclaimer in as well:
Dr. Brown is not currently under treatment for any
disorders related to megalomania.
![Page 4: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/4.jpg)
Introducing k-mers
CCGATTGCACTGGACCGA (<- read)
CCGATTGCAC
CGATTGCACT
GATTGCACTG
ATTGCACTGG
TTGCACTGGA
TGCACTGGAC
GCACTGGACC
CACTGGACCG
ACTGGACCGA
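The decomposition above can be reproduced in a few lines. This is a minimal sketch, not khmer code; the helper name `kmers` is made up for illustration, and k = 10 is taken from the length of the k-mers listed on the slide.

```python
def kmers(seq, k):
    """Return every length-k substring of seq, in left-to-right order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


read = "CCGATTGCACTGGACCGA"

# Print each 10-mer indented by its offset in the read, as on the slide.
for offset, kmer in enumerate(kmers(read, 10)):
    print(" " * offset + kmer)
```

A read of length L yields L - k + 1 k-mers, so this 18-base read gives nine 10-mers.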
![Page 5: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/5.jpg)
De Bruijn graphs – assemble on overlaps
J.R. Miller et al. / Genomics (2010)
![Page 6: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/6.jpg)
K-mers give you an implicit alignment
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTGGACCGATGCACGGTACCG
![Page 7: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/7.jpg)
K-mers give you an implicit alignment
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTGGACCGATGCACGGTACCG
CATGGACCGATTGCACTGGACCGATGCACGGACCG
(with no accounting for mismatches or indels)
![Page 8: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/8.jpg)
The problem with k-mers
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTCGACCGATGCACGGTACCG
Each sequencing error results in k novel k-mers!
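To see where the "k novel k-mers" come from, one can compare the k-mer sets of a sequence with and without a single substitution. This is an illustrative sketch only: the error is introduced at the same G-to-C position shown on the slide, but into a full-length copy of the first sequence rather than the offset read, and k = 10 is an assumed value.

```python
def kmer_set(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


K = 10  # assumed k-mer size for this illustration
true_seq = "CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"

# Introduce one substitution (G -> C) at 0-based position 11,
# turning ...CACTGGACC... into ...CACTCGACC...
pos = 11
error_seq = true_seq[:pos] + "C" + true_seq[pos + 1:]

# Every k-mer window that overlaps the erroneous base is new to the graph.
novel = kmer_set(error_seq, K) - kmer_set(true_seq, K)
print(len(novel), "novel k-mers from a single error")
```

An error closer than k - 1 bases to either end of a read overlaps fewer windows, so k is an upper bound per error rather than an exact count in every case.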
![Page 9: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/9.jpg)
The opportunity:
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTCGACCGATGCACGGTACCG
The graph contains information about errors (can be used for error trimming in reads).
The graph also contains information about variants (can be used for variant calling).
![Page 10: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/10.jpg)
Conway T C, Bromage A J. Bioinformatics 2011;27:479-486
© The Author 2011. Published by Oxford University Press. All rights reserved.
One big challenge: scalability!
De Bruijn graph size scales with # errors.
![Page 11: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/11.jpg)
One big challenge: scalability!
De Bruijn graph size scales with # errors.
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
![Page 12: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/12.jpg)
Conway T C, Bromage A J. Bioinformatics 2011;27:479-486
One big challenge: scalability!
De Bruijn graph size scales with # errors.
![Page 13: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/13.jpg)
Goals
• Initial goal: can we assemble large data sets??
• Longer-term goal: can we find efficient (De Bruijn?)
graph-based approaches to sequence analysis?
![Page 14: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/14.jpg)
First attempt: compressible De Bruijn graphs
(Figure panels: 1%, 5%, 10%, 15%)
Pell et al., 2012
Can use Bloom filters to store De Bruijn graph structures.
=> Overall structure remains as you squish graphs down.
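The idea of storing a De Bruijn graph in a Bloom filter can be sketched as follows: only k-mer membership is stored, and edges are implied by asking whether each of the four possible extensions of a k-mer is present. This is a toy illustration under stated assumptions, not the khmer implementation; the class `BloomFilter`, the helper `neighbors`, the SHA-256-based hashing and the table size are all choices made for this example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: set membership with false positives, no false negatives."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting a standard hash function.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def neighbors(kmer, graph):
    """Right-hand neighbors of kmer: the one-base extensions the filter reports."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in graph]


K = 10
read = "CCGATTGCACTGGACCGA"
graph = BloomFilter(n_bits=4096, n_hashes=3)
for i in range(len(read) - K + 1):
    graph.add(read[i:i + K])

# Walk one step along the implicit graph from the first k-mer of the read.
print(neighbors(read[:K], graph))
```

Shrinking `n_bits` raises the false positive rate, which shows up as spurious extra neighbors; true k-mers are never lost.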
![Page 15: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/15.jpg)
Technical challenges met (and defeated)
• Exhaustive in-memory traversal of graphs containing
5-15 billion nodes.
• Sequencing technology introduces false
connections in graph.
• Implementation lets us scale ~20x over other
approaches.
Pell et al., 2012
![Page 16: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/16.jpg)
Technical challenges met (and defeated)
• Exhaustive in-memory traversal of graphs containing
5-15 billion nodes.
• Sequencing technology introduces false
connections in graph.
• Implementation lets us scale ~20x over other
approaches, but this is not enough.
• Although, see Minia assembler (Chikhi et al.)
Pell et al., 2012
![Page 17: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/17.jpg)
Second attempt: diginorm
Conway T C, Bromage A J. Bioinformatics 2011;27:479-486
![Page 18: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/18.jpg)
Random sampling => deep sampling needed
Typically 10-100x needed for robust recovery (30-300 Gbp for human)
![Page 19: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/19.jpg)
Actual coverage varies widely from the average.
Low coverage introduces unavoidable breaks.
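A rough way to see why deep sampling is needed is the textbook Poisson model of shotgun coverage, in which the chance that a given base is missed at mean coverage c is exp(-c). The sketch below assumes that model and a human genome size of about 3 Gbp (the value implied by "30-300 Gbp" at 10-100x); the function names are hypothetical.

```python
import math


def fraction_uncovered(mean_coverage):
    """Poisson estimate of the fraction of bases hit by zero reads."""
    return math.exp(-mean_coverage)


def bases_needed(genome_size_bp, mean_coverage):
    """Total sequenced bases required to reach a given mean coverage."""
    return genome_size_bp * mean_coverage


HUMAN_GENOME_BP = 3e9  # assumption: roughly 3 Gbp

for c in (1, 5, 10, 100):
    gbp = bases_needed(HUMAN_GENOME_BP, c) / 1e9
    print(f"{c:>3}x: {gbp:6.0f} Gbp sequenced, "
          f"{fraction_uncovered(c):.2e} of bases uncovered")
```

Real coverage is less uniform than this model, which is why the practical range on the slide is well above what the formula alone would suggest.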
![Page 20: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/20.jpg)
But! Shotgun sequencing is very redundant!
Lots of the high coverage simply isn’t needed.
(unnecessary data)
![Page 21: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/21.jpg)
Digital normalization
![Page 22: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/22.jpg)
Digital normalization
![Page 23: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/23.jpg)
Digital normalization
![Page 24: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/24.jpg)
Digital normalization
![Page 25: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/25.jpg)
Digital normalization
![Page 26: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/26.jpg)
Digital normalization
![Page 27: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/27.jpg)
Contig assembly now scales with underlying genome size
• Transcriptomes, microbial genomes incl MDA,
and most metagenomes can be assembled in
under 50 GB of RAM, with identical or improved
results.
• Memory efficiency is improved by use of a Count-Min
Sketch.
Brown et al., 2012, arXiv.
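For readers unfamiliar with the data structure, a Count-Min Sketch keeps several small counter tables and reports the minimum counter for an item, so counts can be overestimated but never underestimated. The following is a minimal sketch for illustration only, not khmer's implementation; the class name, the SHA-256-based hashing and the table dimensions are assumptions.

```python
import hashlib


class CountMinSketch:
    """Toy Count-Min Sketch: approximate counts in fixed memory."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One salted hash per row picks the counter for this item.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row in range(self.depth):
            self.tables[row][self._index(item, row)] += 1

    def count(self, item):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.tables[row][self._index(item, row)]
                   for row in range(self.depth))


cms = CountMinSketch(width=1024, depth=4)
for _ in range(3):
    cms.add("CCGATTGCAC")
cms.add("CGATTGCACT")
print(cms.count("CCGATTGCAC"))  # at least 3; exactly 3 unless every row collides
```

Memory is fixed at width times depth counters regardless of how many distinct k-mers are seen, which is what makes the approach attractive for large data sets.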
![Page 28: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/28.jpg)
Diginorm is simple:
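The core of digital normalization, as described in Brown et al., 2012, is: estimate a read's coverage as the median count of its k-mers seen so far, keep the read and count its k-mers only if that estimate is below a cutoff, and otherwise discard it. Below is a minimal sketch assuming exact counting with a Python `Counter` in place of a Count-Min Sketch; the function names and the parameter values in the usage lines are illustrative, not khmer's API.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every length-k substring of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def diginorm(reads, k=20, cutoff=20):
    """Yield reads whose estimated coverage is still below cutoff.

    Exact counts are used here for clarity; a probabilistic counter
    would replace `counts` to bound memory.
    """
    counts = Counter()
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: nothing to estimate from
        # Median k-mer count so far serves as the read's coverage estimate.
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)
            yield read


# 50 identical reads collapse to `cutoff` copies.
reads = ["CCGATTGCACTGGACCGATGCACGGTACCG"] * 50
kept = list(diginorm(reads, k=10, cutoff=5))
print(len(kept))
```

Because each read is examined once and then either kept or dropped, the procedure is streaming; redundant high-coverage reads stop being stored once the cutoff is reached.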
![Page 29: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/29.jpg)
Diginorm is only a good start:
• Diginorm alters the coverage of the data set.
• Diginorm also discards lots of data!
• Various other infelicities…
  o Repeats go away!
  o Coverage estimation approach ~poor.
![Page 30: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/30.jpg)
Diginorm is a good start:
• Diginorm works on genomes,
metagenomes, and transcriptomes;
• Diginorm is streaming and uses
sublinear space.
![Page 31: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/31.jpg)
Third attempt: a semi-streaming
framework for sequence analysis
https://github.com/ged-lab/2014-streaming/
![Page 32: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/32.jpg)
Diginorm can detect graph saturation
Zhang et al., submitted.
![Page 33: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/33.jpg)
This generically permits semi-streaming approaches.
Zhang et al., submitted.
![Page 34: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/34.jpg)
e.g. E. coli analysis => ~1.2 pass,
sublinear memory
Zhang et al., submitted.
![Page 35: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/35.jpg)
=> Efficient k-mer error trimming.
Zhang et al., submitted.
(This all works on metagenomes & transcriptomes, too.)
![Page 36: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/36.jpg)
Moving some sequence analysis to streaming.
~1.2 pass, sublinear memory
Zhang et al., submitted.
(a) Two-pass; reduced memory.
  First pass: digital normalization - reduced set of k-mers.
  Second pass: spectral analysis of data with reduced k-mer set.
(b) Few-pass; reduced memory.
  First pass: collection of low-abundance reads + analysis of saturated reads.
  Second pass: analysis of collected low-abundance reads.
(c) Online; streaming.
  First pass: collection of low-abundance reads + analysis of saturated reads.
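The few-pass scheme (collect low-abundance reads, analyze saturated reads immediately, then revisit only the collected reads) can be sketched for k-mer error trimming as below. This is an illustration under assumptions, not the code from Zhang et al. or khmer: exact counting replaces the probabilistic counter, and the function names, the coverage cutoff and the `min_count` threshold are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every length-k substring of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def trim_low_abundance(read, counts, k, min_count):
    """Truncate read just before the first k-mer seen fewer than min_count times."""
    for i, km in enumerate(kmers(read, k)):
        if counts[km] < min_count:
            return read[:i + k - 1]
    return read


def semi_streaming_trim(reads, k=20, coverage=20, min_count=2):
    counts = Counter()
    deferred, out = [], []
    # First pass: count unsaturated reads and set them aside;
    # trim reads from regions whose graph is already saturated.
    for read in reads:
        kms = kmers(read, k)
        if not kms:
            continue  # reads shorter than k are dropped in this sketch
        if median(counts[km] for km in kms) < coverage:
            counts.update(kms)
            deferred.append(read)
        else:
            out.append(trim_low_abundance(read, counts, k, min_count))
    # Second (partial) pass: only the deferred reads are revisited.
    for read in deferred:
        if median(counts[km] for km in kmers(read, k)) >= coverage:
            out.append(trim_low_abundance(read, counts, k, min_count))
        else:
            out.append(read)  # still low coverage: leave untouched
    return out


good = "CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"
bad = good[:11] + "C" + good[12:]  # single substitution at position 11
result = semi_streaming_trim([good] * 6 + [bad], k=10, coverage=5, min_count=2)
print(result[1])  # the erroneous read, cut just before the bad base
```

Only reads seen before saturation are stored for the second pass, which is where the "~1.2 pass" figure comes from: most of a deep data set is handled once.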
![Page 37: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/37.jpg)
Sublinear time/space read error analysis
Zhang et al., submitted.
Read error profile from mouse mRNAseq (cf. Grabherr et al., 2011).
![Page 38: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/38.jpg)
Another simple algorithm.
Zhang et al., submitted.
![Page 39: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/39.jpg)
So, that’s pretty cool, right?
• We provide simple time- and memory-efficient approaches for k-mer spectral analysis of large data sets.
• These semi-streaming approaches provide a general framework for applying k-mer spectral approaches to all (deep) sequencing data, including genomes, metagenomes, and RNAseq.
• The khmer software provides a functional and reasonably efficient reference implementation, freely available under the BSD license and actively developed at github.com/ged-lab/.
![Page 40: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/40.jpg)
Stream all the things! (1/2)
![Page 41: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/41.jpg)
Stream all the things! (2/2)
![Page 42: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/42.jpg)
But that’s not all!
Buy now, and you can also get sequence-to-graph
alignment for the low, low price of free!*
graph = khmer.new_counting_hash(…)
aligner = khmer.ReadAligner(graph, trusted=5)
score, graph_align, read_align, is_truncated = \
    aligner.align(seq)
* Terms and conditions may apply. Not all source code fully works :)
![Page 43: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/43.jpg)
Pair-HMM-based graph alignment
Jordan Fish and Michael Crusoe
![Page 44: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/44.jpg)
(Full model)
Jordan Fish and Michael Crusoe
![Page 45: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/45.jpg)
This is a general API…
Many potential uses:
• Error correction;
• Variant calling;
• Counting (to replace mapping) & allelic counts;
• Align to multiple references;
• Tackle strain variation and polyploidy;
• Building consensus graphs from shallow population
sequencing;
• Consensus graph building from multiple read types;
• Protein-guided graph search (BlastGraph & Xander)
![Page 46: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/46.jpg)
![Page 47: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/47.jpg)
Whole-genome variant calling
![Page 48: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/48.jpg)
Graphalign is still alpha.
• We don’t understand parameters well.
• Unoptimized.
• Not yet competitive with existing approaches.
• Broadly applicable!
• Hope to engage w/broader community, soon.
![Page 49: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/49.jpg)
Concluding thoughts #1
• None of our theory is particularly limited to De Bruijn
graphs, although our implementation is deeply tied
to them at the moment.
• We view these ideas (streaming; graphs) as a
potentially substantial improvement over current
mainstream approaches.
• We are not alone – there is a larger community
exploring these approaches! (GA4GH, esp.)
![Page 50: 2015 vancouver-vanbug](https://reader030.fdocuments.us/reader030/viewer/2022032421/55a6a1a31a28ab21578b4755/html5/thumbnails/50.jpg)
Concluding thoughts #2
• Our implementations are usable but not yet terribly optimized.
• We are moving khmer towards a platform for providing reference implementations of these approaches, as well as for research and development.
• We are interested in providing components with decent performance & statistical guarantees, for fun and profit.
• Python and C++ FTW!