Zhang Q - A probabilistic approach to k-mer counting
A probabilistic approach to k-mer counting
Qingpeng Zhang
Department of Computer Science and Engineering, Michigan State University
East Lansing, Michigan, USA
July 13, 2012
Qingpeng Zhang (Michigan State University) A probabilistic approach to k-mer counting July 2012 1 / 12
What is k-mer counting?
What is our k-mer counting approach?
The Bloom counting hash consists of one or more hash tables of different sizes
Each entry in the hash tables is a counter representing the number of k-mers that hash to that location
Bloom filter (0/1) or Count-Min Sketch (counting)
The hash function takes the modulus of a number representing the k-mer with the table size.
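The structure described above can be sketched in a few lines of Python. This is an illustrative sketch, not the khmer implementation: the class name CountingHash, the 2-bit base encoding in kmer_to_int, and the tiny table sizes are assumptions made for the example. Taking the minimum across tables when reading a count is the standard Count-Min Sketch rule.

```python
class CountingHash:
    """Toy counting hash: several tables of different sizes, one counter per slot."""

    def __init__(self, k, table_sizes):
        self.k = k
        self.table_sizes = list(table_sizes)
        self.tables = [[0] * size for size in self.table_sizes]

    @staticmethod
    def kmer_to_int(kmer):
        # Assumed encoding: 2 bits per base (A=0, C=1, G=2, T=3).
        value = 0
        for base in kmer:
            value = value * 4 + "ACGT".index(base)
        return value

    def add(self, kmer):
        # Hash function from the slide: k-mer number modulo the table size.
        value = self.kmer_to_int(kmer)
        for table, size in zip(self.tables, self.table_sizes):
            table[value % size] += 1

    def get(self, kmer):
        # Collisions can only inflate a counter, so the smallest one is the best estimate.
        value = self.kmer_to_int(kmer)
        return min(table[value % size]
                   for table, size in zip(self.tables, self.table_sizes))

    def consume(self, sequence):
        # Count every k-mer in the sequence.
        for i in range(len(sequence) - self.k + 1):
            self.add(sequence[i:i + self.k])


counts = CountingHash(k=3, table_sizes=[11, 13])
counts.consume("ACGTACGT")
for kmer in ("ACG", "CGT", "GTA", "TAC"):
    print(kmer, counts.get(kmer))
```

In this toy run CGT and TAC collide in the size-11 table but not in the size-13 table, so reading the minimum still recovers both counts. A reported count can be too high, but never too low.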
What is our k-mer counting approach?
A certain counting false positive rate [1] is accepted as a tradeoff, because of collisions
Probabilistic properties well suited to next-generation sequencing datasets
Highly scalable: counting accuracy is related to memory usage. However, our approach will never break an imposed memory bound.
[1] Counting false positive rate: the probability that a count will be incorrect (off by 1 or more)
How does our k-mer counting approach perform?
How many k-mers have an incorrect count? The counting error rate
Example: N = 915898, Z = 4, H = 400000
Predicted counting error rate: $f = (1 - e^{-N/H})^Z = 0.6523$
Observed counting error rate: 0.6566
N: number of unique k-mers; Z: number of hash tables; H: size of the hash tables
The probability that no collision happened in a specific entry in one hash table is $(1 - 1/H)^N$, which is approximately $e^{-N/H}$.
The individual collision rate in one hash table is $1 - e^{-N/H}$.
The counting error rate $f$, which is the probability that a collision happened at every location a k-mer hashes to across all Z hash tables, is $(1 - e^{-N/H})^Z$.
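The arithmetic in the example can be checked with a short Python sketch. The function names are hypothetical. The simulation is a simplified model rather than the measurement from the talk: it assumes every k-mer lands in a uniformly random slot in each table, that all tables share one size H, and it uses much smaller N and H than the slide so that it runs quickly.

```python
import math
import random
from collections import Counter


def predicted_error_rate(n_unique, table_size, n_tables):
    # f = (1 - e^(-N/H))^Z, the approximation derived on the slide.
    return (1.0 - math.exp(-n_unique / table_size)) ** n_tables


def simulated_error_rate(n_unique, table_size, n_tables, seed=0):
    # Assumption: each unique k-mer gets an independent, uniformly random slot per table.
    rng = random.Random(seed)
    slots = [[rng.randrange(table_size) for _ in range(n_unique)]
             for _ in range(n_tables)]
    occupancy = [Counter(table_slots) for table_slots in slots]
    miscounted = 0
    for i in range(n_unique):
        # A k-mer is miscounted only if its slot is shared in every table.
        if all(occupancy[t][slots[t][i]] > 1 for t in range(n_tables)):
            miscounted += 1
    return miscounted / n_unique


# Values from the slide: N = 915898, H = 400000, Z = 4.
slide_f = predicted_error_rate(915898, 400000, 4)
print(round(slide_f, 4))

# Smaller hypothetical setting for a quick comparison of model and simulation.
print(predicted_error_rate(20000, 10000, 3))
print(simulated_error_rate(20000, 10000, 3))
```

The first printed value matches the 0.6523 on the slide. The last two values should be close to each other, but need not be identical.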
How does our k-mer counting approach perform?
OK, some counts are incorrect. However, how "incorrect"?
Factors that influence the miscount:
number of total k-mers
hash table size
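The effect of these two factors can be illustrated with a small simulation. This is a hedged sketch and not the analysis from the talk: the function average_miscount, the abundance distribution (true counts drawn uniformly from 1 to 5), the use of equally sized tables, and the random slot assignment are all assumptions chosen only to make the trend visible.

```python
import random


def average_miscount(n_unique, table_size, n_tables=4, seed=0):
    """Average overcount (reported minus true) per unique k-mer in a toy model."""
    rng = random.Random(seed)
    # Hypothetical abundances: each unique k-mer occurs 1 to 5 times.
    true_counts = [rng.randint(1, 5) for _ in range(n_unique)]
    # Assumption: each k-mer lands in an independent random slot in each table.
    slots = [[rng.randrange(table_size) for _ in range(n_unique)]
             for _ in range(n_tables)]
    tables = [[0] * table_size for _ in range(n_tables)]
    for i, count in enumerate(true_counts):
        for t in range(n_tables):
            tables[t][slots[t][i]] += count
    total_excess = 0
    for i, count in enumerate(true_counts):
        # Reported count is the minimum counter across tables.
        reported = min(tables[t][slots[t][i]] for t in range(n_tables))
        total_excess += reported - count
    return total_excess / n_unique


# More k-mers in the same tables: larger miscount.
print(average_miscount(10000, 20000), average_miscount(40000, 20000))
# Same k-mers in larger tables: smaller miscount.
print(average_miscount(20000, 5000), average_miscount(20000, 50000))
```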
How does our k-mer counting approach perform?
Time Usage
Figure: Time usage of the khmer counting approach
How does our k-mer counting approach perform?
Memory Usage
Figure: Memory usage of different k-mer counting tools
How does our k-mer counting approach perform?
Disk Storage Usage
Figure: Disk storage usage of different k-mer counting tools
What is the application of our approach?
Filtering out reads with low-abundance k-mers for de novo assembly
Figure: Percentage of "bad" reads in the remaining reads
Iteratively filtering out low-abundance reads ("bad" reads), i.e. reads that contain even a
single unique k-mer, using hash tables of different sizes (1e8 and 1e9) on a
human gut microbiome metagenomic dataset (MH0001, 42,458,402 reads)
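One filtering pass of this kind might look like the following Python sketch. It is not the khmer script used for the figure: the helper names, the 2-bit encoding, the min_count threshold of 2 (a read is dropped if any of its k-mers appears to occur only once), and the small table sizes are assumptions for illustration. Because collisions can only inflate counts, some "bad" reads may survive a pass, which is presumably why the filtering is repeated.

```python
def kmers(read, k):
    # All overlapping k-mers of a read.
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def kmer_to_int(kmer):
    # Assumed encoding: 2 bits per base (A=0, C=1, G=2, T=3).
    value = 0
    for base in kmer:
        value = value * 4 + "ACGT".index(base)
    return value


def build_tables(reads, k, table_sizes):
    # One counter array per table size; hash is the k-mer number modulo the size.
    tables = [[0] * size for size in table_sizes]
    for read in reads:
        for kmer in kmers(read, k):
            value = kmer_to_int(kmer)
            for table in tables:
                table[value % len(table)] += 1
    return tables


def estimated_count(tables, kmer):
    # Minimum across tables; may overestimate, never underestimates.
    value = kmer_to_int(kmer)
    return min(table[value % len(table)] for table in tables)


def filter_reads(reads, k, table_sizes, min_count=2):
    # Keep a read only if every one of its k-mers reaches min_count.
    tables = build_tables(reads, k, table_sizes)
    return [read for read in reads
            if all(estimated_count(tables, kmer) >= min_count
                   for kmer in kmers(read, k))]


reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG"]
print(filter_reads(reads, k=4, table_sizes=[997, 1009]))
```

In this toy input the second read is dropped because ACGA occurs once, and the third because all of its k-mers occur once.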
Summary
A simple probabilistic approach for fast and memory-efficient counting of k-mers
arbitrary-length k-mers
arbitrary-size sequence data sets
with a tradeoff of counting error
Other possible applications
digital normalization
repeat detection
diversity analysis of metagenomic samples
...
The khmer software package is written in C++ and Python, available at https://github.com/ged-lab/khmer
Acknowledgement
Jason Pell, Rose Canino-Koning, Adina Chuang Howe
Dr. C. Titus Brown
GED lab members at Michigan State University
Funding from USDA, DOE, MSU, BEACON, iCER
Thanks!