comprehensive exam slides 11/13/2013

Investigate the diversity of extremely complex metagenomic samples

Qingpeng Zhang

Department of Computer Science and Engineering

Michigan State University

Supervisor: Dr. Titus Brown

Outline

● Significance and background

– Metagenomics

– Microbial diversity measurement

● Preliminary results

– A novel method to investigate microbial diversity based on an efficient k-mer counting approach

● Proposed research

– Prove effectiveness using test data sets

– Tackle extremely large metagenomic data sets generated from extremely complex microbial samples

The Great Prairie Grand Challenge

● How many different species are in a soil sample? What is their abundance distribution? How different are the soil samples from 100-year cultivated Iowa agricultural soil and native Iowa prairie?

● “Grand Challenge” - extremely large data sets from an extremely complex microbial community

– An estimated 50 Tbp of sequence is needed for an individual gram of soil (Jason Gans, 2005)

– In a gram of soil, there are approximately a billion microbial cells, containing an estimated 4 petabase pairs of DNA (Jack A. Gilbert, 2013)

– Over a terabase of sequence from Iowa cultivated and uncultivated soils

Metagenomics and Next Generation Sequencing

[Figure: diversity measurement based on different unit concepts. Labels: species and individuals; OTUs (97% similarity of 16S sequences) and 16S rRNA sequences; unique k-mers and total k-mers in WGS data (whole genome sequencing reads). Source: Nature Reviews Genetics 6, 805-814, etc.]

Statistics for Diversity Estimation

● Rarefaction curve

– Quite incapable of dealing with the scale of diversity of the microbial world

● Extrapolation from curves

● Parametric estimators (need relative species abundance)

● Non-parametric estimators (Chao1, etc.)

– Lower bound estimator

– Sensitive to underlying distribution
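As an illustration of the non-parametric estimators listed above, here is a minimal sketch of the bias-corrected Chao1 estimator, S_obs + F1(F1-1)/(2(F2+1)), where F1 and F2 are the numbers of units seen exactly once and twice. The function name and the example counts are made up for illustration; they are not from the slides.

```python
def chao1(abundances):
    """Bias-corrected Chao1 richness estimate.

    `abundances` holds one observed count per unit (species, OTU, ...).
    """
    counts = [n for n in abundances if n > 0]
    s_obs = len(counts)                      # units actually observed
    f1 = sum(1 for n in counts if n == 1)    # singletons
    f2 = sum(1 for n in counts if n == 2)    # doubletons
    # Lower-bound estimate of total richness, including unseen units.
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical abundance counts: 9 observed units, 4 singletons, 2 doubletons.
example_counts = [10, 7, 3, 2, 2, 1, 1, 1, 1]
estimate = chao1(example_counts)
print(estimate)
```

The estimate depends only on the rare units, which is why it behaves as a lower bound and is sensitive to the underlying abundance distribution.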

The Goal of this Project

● Use whole genome shotgun metagenomic data sets rather than 16S rRNA

– Measure the microbial diversity of samples (alpha-diversity)

– Compare microbial samples (beta-diversity)

● A novel method that is:

– Binning-free

– Assembly-free

– Annotation-free

– Reference-free

● Efficient (Memory and Time)

– extremely large shotgun metagenomic data sets (Terabytes, etc.)

– extremely diverse microbial communities (Soil, etc.)

Preliminary Results

● A novel method to investigate microbial diversity based on an efficient k-mer counting approach

– Diversity measurement of one sample

– Comparison of multiple samples

● An approach to count k-mers efficiently

An Approach to Count k-mers Efficiently

• Highly scalable: constant memory consumption, independent of k and dataset size

• Probabilistic properties well suited to next generation sequencing datasets

• Tradeoff: a certain false positive rate in the counts because of hash collisions

(Zhang, Pell, Canino-Koning, Howe, & Brown, 2013, submitted)
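The properties above (fixed memory, counts that can be inflated by collisions) can be sketched with a Count-Min-Sketch-style counter. This is an illustrative sketch only, not khmer's implementation: the class name, the table sizes, and the use of MD5 as the hash function are assumptions made for this example.

```python
import hashlib


class CountMinKmerCounter:
    """Toy probabilistic k-mer counter with a fixed memory footprint."""

    def __init__(self, table_sizes=(1009, 1013, 1019, 1021)):
        # Memory is fixed here, whatever k or the data set size turns out to be.
        self.table_sizes = table_sizes
        self.tables = [[0] * size for size in table_sizes]

    @staticmethod
    def _hash(kmer):
        digest = hashlib.md5(kmer.encode("ascii")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, kmer):
        h = self._hash(kmer)
        for table, size in zip(self.tables, self.table_sizes):
            table[h % size] += 1

    def get(self, kmer):
        # Collisions can only inflate a bin, so the minimum over all tables
        # is the tightest available estimate; it never undercounts.
        h = self._hash(kmer)
        return min(t[h % s] for t, s in zip(self.tables, self.table_sizes))

    def consume(self, sequence, k):
        for i in range(len(sequence) - k + 1):
            self.add(sequence[i:i + k])


counter = CountMinKmerCounter()
counter.consume("ACGTACGTAC", 4)   # ACGT occurs twice in this sequence
print(counter.get("ACGT"))
```

Counting is online: `get` can be called at any point while reads are still being consumed.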

What is khmer's advantage?

● Good performance in time/memory usage

● Online counting, updating and retrieving (important for this project!!)

● With Python API – flexible and expandable

Median k-mer frequency represents the sequencing coverage of a read

Using the median k-mer frequency rather than the average k-mer frequency decreases the influence of sequencing errors

Mapping and k-mer coverage measures correlate for simulated genome data and a real E. coli data set (5 million reads).

(Brown, Howe, Zhang, Pyrkosz, & Brom, 2012)
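A minimal sketch of the median k-mer frequency idea, using exact counting with a Python `Counter` instead of khmer's probabilistic counter. The helper names and the toy reads are hypothetical.

```python
from collections import Counter
from statistics import mean, median


def kmers(sequence, k):
    """All overlapping k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_kmers(reads, k):
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def median_kmer_frequency(read, counts, k):
    """Coverage estimate for a read (the read must be at least k long)."""
    return median(counts[kmer] for kmer in kmers(read, k))


# Three toy reads from the same locus; the last one ends in an erroneous base.
reads = ["ACGTAC", "ACGTAC", "ACGTAT"]
counts = count_kmers(reads, 4)

erroneous_read = "ACGTAT"
print(median_kmer_frequency(erroneous_read, counts, 4))
print(mean(counts[kmer] for kmer in kmers(erroneous_read, 4)))
```

The k-mer created by the error is seen only once, which pulls the mean down, while the median still reports that three reads cover this locus.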

IGS

Suppose there are Y reads with a sequencing depth of X. In other words, for each of those Y reads, there are X-1 other reads that cover the same DNA segment of the genome from which that read originates. So we can estimate that there are Y/X distinct DNA segments with a read coverage of X. We term these distinct DNA segments in a species genome IGSs (informative genomic segments).
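The Y/X estimate can be sketched as follows, assuming each read already has an integer coverage value (for example its median k-mer frequency). The function name and the example coverages are hypothetical.

```python
from collections import Counter


def estimate_igs_count(read_coverages):
    """Estimate the number of IGSs from per-read coverage values.

    Y reads that all have coverage X are taken to come from Y/X distinct
    segments, so the total is the sum of Y/X over every coverage level X.
    """
    histogram = Counter(c for c in read_coverages if c > 0)
    return sum(num_reads / coverage for coverage, num_reads in histogram.items())


# Hypothetical sample: 8 reads at coverage 4, 6 at coverage 2, 3 at coverage 1.
coverages = [4] * 8 + [2] * 6 + [1] * 3
print(estimate_igs_count(coverages))
```

The per-coverage terms Y/X also give an abundance distribution of IGSs, which is the kind of input that OTU-style estimators work from.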

IGSs (informative genomic segments) can represent the novel information of a genome

N = G/(L-k+1), e.g. 1000000/(80-22+1)

Borrowing statistical methods from OTU-based diversity analysis (rarefaction curve, estimators, etc.)
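The arithmetic in the formula can be worked through directly. The slide does not define the symbols; reading G as genome size in bases, L as read length, and k as the k-mer size is an assumption here, and the function name is hypothetical.

```python
def expected_igs_count(genome_size, read_length, k):
    """N = G / (L - k + 1), with the symbol meanings assumed as noted above."""
    return genome_size / (read_length - k + 1)


# The slide's numbers: G = 1000000, L = 80, k = 22, so the divisor is 59.
n = expected_igs_count(1_000_000, 80, 22)
print(round(n))
```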

Compare the contents of multiple metagenomics samples

● How different are two samples?

If the sequencing coverage in sample B of a read from sample A is > 0, the segment of sample A from which that read originates exists in sample B
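A minimal sketch of this comparison, assuming that the coverage of a read in another sample is measured as its median k-mer frequency against that sample's k-mer counts. Exact counting is used for simplicity; the helper names and toy reads are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(sequence, k):
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_kmers(reads, k):
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def fraction_present(reads_a, reads_b, k):
    """Fraction of reads in A whose coverage in B is greater than 0."""
    counts_b = count_kmers(reads_b, k)
    present = sum(
        1 for read in reads_a
        if median(counts_b[kmer] for kmer in kmers(read, k)) > 0
    )
    return present / len(reads_a)


# Toy samples: the first read of A is contained in B, the second is not.
sample_a = ["ACGTAC", "TTTTGG"]
sample_b = ["ACGTACGG"]
print(fraction_present(sample_a, sample_b, 4))
```

The measure is not symmetric: the fraction of A found in B can differ from the fraction of B found in A.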

Synthetic dataset A (same abundance):

– Sample A: 100 species with 80 common to B

– Sample B: 100 species with 80 common to A

– Sample C: 100 species with 20 common to A/B, and 60 common to D

– Sample D: 100 species with 20 common to A/B, and 60 common to C

Synthetic dataset B:

– Sample1A:

● species IDs: 1,2,3,4,5,6,7,8,9,10; relative abundance: 20:18:16:4:3:2:2:2:2:2

– Sample1B:


– Sample1C:


– A and B: high overlap on individual level, low overlap on species level

– A and C: high overlap on species level, low overlap on individual level

– B and C: low overlap on species level and low overlap on individual level
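The difference between species-level and individual-level overlap can be sketched with two simple measures: a Jaccard index over species sets and a shared-abundance fraction. Both measures are chosen here for illustration and are not necessarily the ones used in this work. Only the first sample's abundances come from the slide (Sample1A); the second sample is entirely hypothetical, constructed so that it shares just the three most abundant species.

```python
def species_overlap(a, b):
    """Jaccard index over the sets of species present in each sample."""
    species_a, species_b = set(a), set(b)
    return len(species_a & species_b) / len(species_a | species_b)


def individual_overlap(a, b):
    """Share of individuals in shared species at matching abundance."""
    shared = sum(min(a[s], b[s]) for s in set(a) & set(b))
    return 2 * shared / (sum(a.values()) + sum(b.values()))


# Sample1A from the slide: species ID -> relative abundance.
sample_1a = {1: 20, 2: 18, 3: 16, 4: 4, 5: 3, 6: 2, 7: 2, 8: 2, 9: 2, 10: 2}

# HYPOTHETICAL second sample: same three dominant species, different rare ones.
other = {1: 20, 2: 18, 3: 16, 11: 4, 12: 3, 13: 2, 14: 2, 15: 2, 16: 2, 17: 2}

print(species_overlap(sample_1a, other))
print(individual_overlap(sample_1a, other))
```

With these inputs only 3 of 17 species are shared, yet roughly three quarters of the individuals belong to shared species, which is the "high on individual level, low on species level" case.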

What's Next

● Refine the methods

– Sequencing errors still affect the results.

– More statistics on IGSs (informative genomic segments)

● Prove effectiveness using test data sets

– Simulated data sets based on real microbial genomes

– MetaHIT, 124 metagenomic samples from 99 healthy people, and 25 patients with inflammatory bowel disease (IBD) syndrome. Each sample has on average 65 ± 21 million reads.

● Integrate these functions into the khmer package

The Great Prairie Grand Challenge

● How many different species are in a soil sample? What is their abundance distribution? How different are the soil samples from 100-year cultivated Iowa agricultural soil and native Iowa prairie?

● “Grand Challenge” - extremely large data sets from an extremely complex microbial community

– Over a terabase of sequence from Iowa cultivated and uncultivated soils

– Should be prepared to face technical challenges when dealing with such large-scale data sets (storage, computing resources, HPCC, etc.)

– A preliminary result: the majority of the prairie reads (50%) are present in the corn sample with a coverage of > 0

Acknowledgement

● Dr. Titus Brown

● Lab members of GED

● Dr. Jason Pell

● Dr. Adina Howe

● Eric McDonald

● Everybody in this room
