Zhang Q - A probabilistic approach to k-mer counting


DESCRIPTION

Presentation at BOSC2012 by Zhang Q - A probabilistic approach to k-mer counting

TRANSCRIPT

Page 1: Zhang Q - A probabilistic approach to k-mer counting

A probabilistic approach to k-mer counting

Qingpeng Zhang

Department of Computer Science and Engineering, Michigan State University

East Lansing, Michigan, USA

[email protected]

July 13, 2012

Qingpeng Zhang (Michigan State University) A probabilistic approach to k-mer counting July 2012 1 / 12

Page 2: Zhang Q - A probabilistic approach to k-mer counting

What is k-mer counting?


Page 3: Zhang Q - A probabilistic approach to k-mer counting

What is our k-mer counting approach?

The Bloom counting hash consists of one or more hash tables of different sizes.

Each entry in the hash tables is a counter representing the number of k-mers that hash to that location.

Bloom filter (0/1) or Count-Min Sketch (counting)

The hash function takes a number representing the k-mer modulo the table size.
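A minimal sketch of the structure described on this slide, not the khmer implementation. The 2-bit base encoding, the class and method names, the table sizes, and the rule of reporting the minimum counter across tables (the usual Count-Min Sketch query) are all assumptions made for illustration.

```python
# Hypothetical 2-bit encoding of DNA bases (an assumption, not taken from the slides).
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_int(kmer):
    """Turn an uppercase A/C/G/T k-mer into the number that represents it."""
    value = 0
    for base in kmer:
        value = value * 4 + ENCODING[base]
    return value


class CountingHash:
    """Several counter tables of different sizes; the hash is 'number mod table size'."""

    def __init__(self, table_sizes):
        self.tables = [[0] * size for size in table_sizes]

    def add(self, kmer):
        number = kmer_to_int(kmer)
        for table in self.tables:
            table[number % len(table)] += 1

    def get(self, kmer):
        # A collision can only inflate a counter, so the smallest counter
        # across the tables is the best available estimate.
        number = kmer_to_int(kmer)
        return min(table[number % len(table)] for table in self.tables)

    def consume(self, sequence, k):
        """Count every k-mer of a sequence."""
        for i in range(len(sequence) - k + 1):
            self.add(sequence[i:i + k])


counts = CountingHash([97, 101, 103])  # illustrative table sizes
counts.consume("ACGTACGT", 4)
print(counts.get("ACGT"), counts.get("CGTA"), counts.get("AAAA"))
```

Memory use is fixed by the table sizes chosen up front, whatever the number of k-mers added; only the accuracy of the reported counts changes.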


Page 4: Zhang Q - A probabilistic approach to k-mer counting

What is our k-mer counting approach?

Hash collisions introduce a certain counting false positive rate [1] as a tradeoff

Probabilistic properties well suited to next-generation sequencing datasets

Highly scalable: counting accuracy is related to memory usage. However, our approach will never break an imposed memory bound.

[1] Counting false positive rate: the probability that a count will be incorrect (off by 1 or more)


Page 5: Zhang Q - A probabilistic approach to k-mer counting

How does our k-mer counting approach perform?

How many k-mers have an incorrect count? - counting error rate

Example: $N = 915898$, $Z = 4$, $H = 400000$

$f = (1 - e^{-N/H})^Z = 0.6523$

Observed counting error rate: 0.6566

$N$: number of unique k-mers; $Z$: number of hash tables; $H$: size of the hash tables

The probability that no collision happened in a specific entry in one hash table is $(1 - 1/H)^N$, which is approximately $e^{-N/H}$.

The individual collision rate in one hash table is $1 - e^{-N/H}$.

The counting error rate $f$, which is the probability that a collision happened in all the locations a k-mer is hashed to across all $Z$ hash tables, is $(1 - e^{-N/H})^Z$.
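The example on this slide can be checked with a few lines of Python. The function names are hypothetical; the formula and the values of N, Z and H come from the slide. The second function compares the exact term $(1 - 1/H)^N$ with its exponential approximation.

```python
import math


def counting_error_rate(n_unique, table_size, n_tables):
    """Predicted counting error rate f = (1 - e^(-N/H))^Z."""
    per_table_collision = 1.0 - math.exp(-n_unique / table_size)
    return per_table_collision ** n_tables


def no_collision_exact(n_unique, table_size):
    """Exact probability (1 - 1/H)^N that an entry sees no collision."""
    return (1.0 - 1.0 / table_size) ** n_unique


N, Z, H = 915898, 4, 400000
f = counting_error_rate(N, H, Z)
print(round(f, 4))  # 0.6523, matching the slide

# The exponential form is a close approximation of the exact term for large H.
print(no_collision_exact(N, H), math.exp(-N / H))
```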


Page 6: Zhang Q - A probabilistic approach to k-mer counting

How does our k-mer counting approach perform?

OK, some counts are incorrect. But how "incorrect"?

Factors that influence the miscount:

number of total k-mers

hash table size
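A small simulation can illustrate how the two factors listed above act on the size of the miscount. This is a sketch under stated assumptions, not the experiment behind the slide: k-mers are stood in for by random 40-bit integers, the estimate is the minimum counter across tables, and the function name and all parameter values are made up for the example.

```python
import random
from collections import Counter


def mean_miscount(n_total, n_distinct, table_sizes, seed=0):
    """Average overcount (estimate minus true count) per distinct item."""
    rng = random.Random(seed)
    # Random integers stand in for the numbers that represent k-mers.
    pool = [rng.getrandbits(40) for _ in range(n_distinct)]
    items = [rng.choice(pool) for _ in range(n_total)]
    true_counts = Counter(items)

    tables = [[0] * size for size in table_sizes]
    for number in items:
        for table in tables:
            table[number % len(table)] += 1

    total_error = 0
    for number, true_count in true_counts.items():
        estimate = min(table[number % len(table)] for table in tables)
        total_error += estimate - true_count
    return total_error / len(true_counts)


small_tables = mean_miscount(2000, 1000, [101, 103])
large_tables = mean_miscount(2000, 1000, [1009, 1013])
more_kmers = mean_miscount(8000, 1000, [101, 103])
print(small_tables, large_tables, more_kmers)
```

In this toy setting the average miscount falls when the tables are enlarged and rises when more k-mers are counted into tables of the same size.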


Page 7: Zhang Q - A probabilistic approach to k-mer counting

How does our k-mer counting approach perform?

Time Usage

Figure: Time usage of khmer counting approach


Page 8: Zhang Q - A probabilistic approach to k-mer counting

How does our k-mer counting approach perform?

Memory Usage

Figure: Memory usage of different k-mer counting tools


Page 9: Zhang Q - A probabilistic approach to k-mer counting

How does our k-mer counting approach perform?

Disk Storage Usage

Figure: Disk storage usage of different k-mer counting tools


Page 10: Zhang Q - A probabilistic approach to k-mer counting

What is the application of our approach?

Filtering out reads with low-abundance k-mers for de novo assembly

Figure: Percentage of "bad" reads in the remaining reads

Iteratively filtering out low-abundance reads ("bad" reads, which contain even a single unique k-mer) using hash tables of different sizes (1e8 and 1e9) for a human gut microbiome metagenomic dataset (MH0001, 42,458,402 reads).
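The filtering idea on this slide can be sketched as follows. This is an illustration only: it uses an exact `Counter` in place of the probabilistic counting hash, the function names and the `min_count` threshold are hypothetical, and reads shorter than k are kept because they contain no k-mers to test.

```python
from collections import Counter


def kmers(read, k):
    """Yield every k-mer of a read."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def filter_low_abundance(reads, k, min_count=2):
    """Drop reads that contain any k-mer seen fewer than min_count times."""
    counts = Counter(kmer for read in reads for kmer in kmers(read, k))
    return [
        read for read in reads
        if all(counts[kmer] >= min_count for kmer in kmers(read, k))
    ]


def iterate_filter(reads, k, min_count=2):
    """Repeat the filter until no further reads are removed."""
    while True:
        kept = filter_low_abundance(reads, k, min_count)
        if len(kept) == len(reads):
            return kept
        reads = kept


reads = ["ACGTAC", "ACGTAC", "ACGTTT"]
print(iterate_filter(reads, 4))  # the third read holds k-mers seen only once
```

With a probabilistic counter, collisions can only raise counts, so some "bad" reads may survive a pass; this is consistent with the slide iterating the filter and comparing two table sizes.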

Page 11: Zhang Q - A probabilistic approach to k-mer counting

Summary

A simple probabilistic approach for fast and memory-efficient counting of k-mers

arbitrary-length k-mers

arbitrary-size sequence data sets

with a tradeoff of counting error

Other possible applications

digital normalization

repeat detection

diversity analysis of metagenomic samples

...

The khmer software package is written in C++ and Python, available at https://github.com/ged-lab/khmer


Page 12: Zhang Q - A probabilistic approach to k-mer counting

Acknowledgement

Jason Pell, Rose Canino-Koning, Adina Chuang Howe

Dr. C. Titus Brown

GED lab members @ Michigan State University

Funding from USDA, DOE, MSU, BEACON, iCER

Thanks!
