A probabilistic approach to k-mer counting

Qingpeng Zhang

Department of Computer Science and Engineering, Michigan State University

East Lansing, Michigan, USA

[email protected]

July 13, 2012

Qingpeng Zhang (Michigan State University) A probabilistic approach to k-mer counting July 2012 1 / 12

What is k-mer counting?


What is our k-mer counting approach?

The Bloom counting hash consists of one or more hash tables of different sizes.

Each entry in the hash tables is a counter representing the number of k-mers that hash to that location.

Bloom filter (0/1) or Count-Min Sketch (counting)

The hash function takes the modulus of a number representing the k-mer with the table size.
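The structure described above can be sketched in a few lines of Python. This is a minimal illustration, not khmer's actual implementation: the class name `BloomCountingHash`, the helper `kmer_to_int`, the two-bits-per-base encoding, and the choice to report the minimum counter across tables (the usual Count-Min Sketch convention) are all assumptions made for this example.

```python
# Hypothetical sketch of a Bloom counting hash; names are illustrative only.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_int(kmer):
    """Encode a DNA k-mer as an integer, two bits per base (assumed encoding)."""
    value = 0
    for base in kmer:
        value = value * 4 + BASE_CODE[base]
    return value


class BloomCountingHash:
    def __init__(self, table_sizes):
        # One table of counters per size; memory is fixed at construction time.
        self.table_sizes = list(table_sizes)
        self.tables = [[0] * size for size in self.table_sizes]

    def add(self, kmer):
        # The hash is the k-mer's number modulo the table size.
        number = kmer_to_int(kmer)
        for table, size in zip(self.tables, self.table_sizes):
            table[number % size] += 1

    def count(self, kmer):
        # Collisions can only inflate a counter, so the minimum across
        # tables is the tightest available estimate (assumed convention).
        number = kmer_to_int(kmer)
        return min(
            table[number % size]
            for table, size in zip(self.tables, self.table_sizes)
        )


counts = BloomCountingHash([11, 13, 17])
for kmer in ["ACGT", "ACGT", "TTTT"]:
    counts.add(kmer)
print(counts.count("ACGT"), counts.count("TTTT"))
```

Using tables of different sizes means two k-mers that collide in one table are less likely to collide in the others, which is why the reported count takes all tables into account.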


What is our k-mer counting approach?

Accepts a certain counting false positive rate [1] as a tradeoff, because of hash collisions

Probabilistic properties are well suited to next-generation sequencing datasets

Highly scalable: counting accuracy depends on memory usage, but our approach will never exceed an imposed memory bound.

[1] Counting false positive rate: the probability that a count will be incorrect (off by 1 or more)


How does our k-mer counting approach perform?

How many k-mers have an incorrect count? (counting error rate)

Example: N = 915898, Z = 4, H = 400000

$f = (1 - e^{-N/H})^Z = 0.6523$

Observed counting error rate f: 0.6566

N: number of unique k-mers; Z: number of hash tables; H: size of the hash tables

The probability that no collision happened in a specific entry in one hash table is $(1 - 1/H)^N$, which is approximately $e^{-N/H}$.

The individual collision rate in one hash table is $1 - e^{-N/H}$.

The counting error rate f, which is the probability that a collision happened in all the locations a k-mer is hashed to across all Z hash tables, will be $(1 - e^{-N/H})^Z$.
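The example figures on this slide can be reproduced directly from the formula. The function name `counting_error_rate` and its parameter names are hypothetical; only the formula and the values of N, Z, and H come from the slide.

```python
import math


def counting_error_rate(n_unique_kmers, n_tables, table_size):
    """Predicted counting error rate f = (1 - e^(-N/H))^Z."""
    # Probability that a given entry in one table has been hit by a collision.
    collision_rate = 1.0 - math.exp(-n_unique_kmers / table_size)
    # A count is wrong only if the k-mer collides in every one of the Z tables.
    return collision_rate ** n_tables


f = counting_error_rate(n_unique_kmers=915898, n_tables=4, table_size=400000)
print(round(f, 4))  # 0.6523
```

The predicted value is close to the observed rate of 0.6566 reported on the slide.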


How does our k-mer counting approach perform?

OK, some counts are incorrect. But how "incorrect"?

Factors that influence the miscount:

number of total k-mers

hash table size


How does our k-mer counting approach perform?

Time Usage

Figure: Time usage of khmer counting approach


How does our k-mer counting approach perform?

Memory Usage

Figure: Memory usage of different k-mer counting tools


How does our k-mer counting approach perform?

Disk Storage Usage

Figure: Disk storage usage of different k-mer counting tools


What is the application of our approach?

Filtering out reads with low-abundance k-mers for de novo assembly

Figure: Percentage of "bad" reads in the remaining reads

Iteratively filtering out low-abundance reads ("bad" reads) that contain even a single unique k-mer, using hash tables of different sizes (1e8 and 1e9), for a human gut microbiome metagenomic dataset (MH0001, 42,458,402 reads)
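The iterative filtering idea can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to produce the figure: it uses an exact `collections.Counter` as a stand-in for the probabilistic hash tables, and the function names `kmers`, `filter_reads`, and `iterate_filtering` and the `min_count` parameter are hypothetical.

```python
from collections import Counter


def kmers(read, k):
    """Return all overlapping k-mers of a read (empty if the read is shorter than k)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def filter_reads(reads, k, min_count=2):
    """Keep only reads in which every k-mer occurs at least min_count times."""
    # Exact counts here; a probabilistic table would be swapped in for large data.
    counts = Counter(kmer for read in reads for kmer in kmers(read, k))
    return [
        read for read in reads
        if all(counts[kmer] >= min_count for kmer in kmers(read, k))
    ]


def iterate_filtering(reads, k, min_count=2):
    """Repeat filtering until a pass removes no further reads."""
    while True:
        kept = filter_reads(reads, k, min_count)
        if len(kept) == len(reads):
            return kept
        reads = kept


reads = ["ACGTAC", "ACGTAC", "ACGTTT"]
print(iterate_filtering(reads, k=4))
```

With a probabilistic counter in place of the exact one, collisions can only raise counts, so some "bad" reads survive each pass. That is consistent with the slide's comparison of different table sizes.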

Summary

A simple probabilistic approach for fast and memory-efficient counting of k-mers:

Arbitrary-length k-mers

Arbitrary-size sequence data sets

With a tradeoff of counting error

Other possible applications:

Digital normalization

Repeat detection

Diversity analysis of metagenomic samples

The khmer software package is written in C++ and Python, available at https://github.com/ged-lab/khmer


Acknowledgement

Jason Pell, Rose Canino-Koning, Adina Chuang Howe

Dr. C. Titus Brown

GED lab members @ Michigan State University

Funding from USDA, DOE, MSU, BEACON, iCER

Thanks!

