comprehensive exam slides 11/13/2013

Investigate the diversity of extremely complex metagenomic samples Qingpeng Zhang Department of Computer Science and Engineering Michigan State University Supervisor: Dr. Titus Brown


Page 1: Comprehensive Exam Slides 11/13/2013

Investigate the diversity of extremely complex metagenomic samples

Qingpeng Zhang

Department of Computer Science and Engineering

Michigan State University

Supervisor: Dr. Titus Brown

Page 2: Comprehensive Exam Slides 11/13/2013

Outline

● Significance and background

– Metagenomics

– Microbial diversity measurement

● Preliminary results

– A novel method to investigate microbial diversity based on an efficient k-mer counting approach

● Proposed research

– Prove effectiveness using test data sets

– Tackle extremely large metagenomic data sets generated from extremely complex microbial samples

Page 3: Comprehensive Exam Slides 11/13/2013

The Great Prairie Grand Challenge

● How many different species are in a soil sample? What is their abundance distribution? How different are soil samples from 100-year cultivated Iowa agricultural soil and native Iowa prairie?

● “Grand Challenge” - extremely large data sets from an extremely complex microbial community

– An estimated 50 Tbp of sequence is needed for an individual gram of soil (Jason Gans, 2005)

– In a gram of soil, there are approximately a billion microbial cells, containing an estimated 4 petabase pairs of DNA (Jack A. Gilbert, 2013)

– Over a terabase of sequence from cultivated and uncultivated Iowa soils

Page 4: Comprehensive Exam Slides 11/13/2013

Metagenomics and Next Generation Sequencing

Page 5: Comprehensive Exam Slides 11/13/2013

[Figure: Diversity measurement based on different unit concepts. Units compared: species and individuals; OTUs (97% similarity of 16S sequences) and 16S rRNA sequences; unique k-mers and total k-mers in WGS (whole genome sequencing) reads. Source: Nature Reviews Genetics 6, 805-814, etc.]

Page 6: Comprehensive Exam Slides 11/13/2013

Statistics for Diversity Estimation

● Rarefaction curve

– Quite incapable of dealing with the scale of diversity of the microbial world

● Extrapolation from curves

● Parametric estimators (need relative species abundance)

● Non-parametric estimators (Chao1, etc.)

– Lower bound estimator

– Sensitive to underlying distribution
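The two standard tools named above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: it assumes the input is a list of per-unit abundance counts (per species or per OTU), uses the bias-corrected form of Chao1, and uses the classical expected-richness formula for rarefaction. The function names are hypothetical.

```python
from math import comb


def chao1(abundances):
    """Bias-corrected Chao1 lower-bound estimate of richness.

    abundances: iterable of observed counts, one per species/OTU.
    """
    counts = [c for c in abundances if c > 0]
    s_obs = len(counts)                      # observed richness
    f1 = sum(1 for c in counts if c == 1)    # singletons
    f2 = sum(1 for c in counts if c == 2)    # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def rarefied_richness(abundances, n):
    """Expected number of units seen in a random subsample of n individuals."""
    counts = [c for c in abundances if c > 0]
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total number of individuals")
    # Each unit is missed with probability C(total - c, n) / C(total, n).
    return sum(1 - comb(total - c, n) / comb(total, n) for c in counts)


if __name__ == "__main__":
    sample = [1, 1, 1, 2, 2, 5, 10]
    print("Chao1:", chao1(sample))
    for n in (1, 5, 10, 22):
        print(n, round(rarefied_richness(sample, n), 3))
```

Evaluating `rarefied_richness` over a range of `n` traces out a rarefaction curve. Chao1 depends only on the singleton and doubleton counts, which is one reason it is sensitive to the underlying abundance distribution.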

Page 7: Comprehensive Exam Slides 11/13/2013

The Goal of this Project

● Using whole genome shotgun metagenomic data sets rather than 16S rRNA

– Measuring the microbial diversity of samples (alpha-diversity)

– Comparing microbial samples (beta-diversity)

● A novel method that is:

– Binning-free

– Assembly-free

– Annotation-free

– Reference-free

● Efficient (Memory and Time)

– extremely large shotgun metagenomic data sets (Terabytes, etc.)

– extremely diverse microbial communities (Soil, etc.)

Page 8: Comprehensive Exam Slides 11/13/2013

[Figure (repeated from Page 5): Diversity measurement based on different unit concepts. Units compared: species and individuals; OTUs (97% similarity of 16S sequences) and 16S rRNA sequences; unique k-mers and total k-mers in WGS (whole genome sequencing) reads. Source: Nature Reviews Genetics 6, 805-814, etc.]

Page 9: Comprehensive Exam Slides 11/13/2013

Preliminary Results

● A novel method to investigate microbial diversity based on an efficient k-mer counting approach

– Diversity measurement of one sample

– Comparison of multiple samples

Page 10: Comprehensive Exam Slides 11/13/2013

An Approach to Count k-mers Efficiently

• Highly scalable: constant memory consumption, independent of k and dataset size

• Probabilistic properties well suited to next generation sequencing datasets

• A certain counting false positive rate as a tradeoff, because of hash collisions
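The properties listed above (fixed memory, occasional overcounts from collisions) are characteristic of count-min-style sketches. The following is a minimal illustration of that general idea, not khmer's actual implementation. The class name, table sizes, and the use of salted BLAKE2b hashes are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter; may overcount, never undercounts."""

    def __init__(self, width=10007, depth=4):
        # Memory is width * depth counters, regardless of k or data size.
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # One independent hash per table, obtained by salting BLAKE2b.
        for i in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([i])
            ).digest()
            yield i, int.from_bytes(digest, "big") % self.width

    def add(self, kmer):
        for i, j in self._slots(kmer):
            self.tables[i][j] += 1

    def get(self, kmer):
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.tables[i][j] for i, j in self._slots(kmer))


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    sketch = CountMinSketch()
    for km in kmers("ACGTACGTAC", 4):
        sketch.add(km)
    print(sketch.get("ACGT"), sketch.get("TACG"), sketch.get("TTTT"))
```

Counts can be queried at any time while reads are still being added. Shrinking `width` saves memory at the cost of more collisions and therefore more overcounting.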

Page 11: Comprehensive Exam Slides 11/13/2013

(Zhang, Pell, Canino-Koning, Howe, & Brown, 2013, submitted)

What is khmer's advantage?

● Good performance in time/memory usage

● Online counting, updating and retrieving (important for this project!!)

● With Python API – flexible and expandable

Page 12: Comprehensive Exam Slides 11/13/2013

Median k-mer frequency is used to represent the sequencing coverage of a read

Using the median k-mer frequency rather than the average k-mer frequency decreases the influence of sequencing errors
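A small sketch of this measure, using exact counts from `collections.Counter` in place of a probabilistic counter. The helper names and the toy reads are made up for illustration. A single base error creates a few low-frequency k-mers, which pull the mean down but leave the median unchanged.

```python
from collections import Counter
from statistics import mean, median


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def count_kmers(reads, k):
    """Exact k-mer counts over a collection of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def median_kmer_frequency(read, counts, k):
    """Estimate a read's coverage as the median count of its k-mers.

    Assumes len(read) >= k; otherwise there are no k-mers to take a median of.
    """
    return median(counts[km] for km in kmers(read, k))


if __name__ == "__main__":
    k = 4
    # Three error-free copies of a segment plus one copy with a final-base error.
    reads = ["ACGTTGCA"] * 3 + ["ACGTTGCC"]
    counts = count_kmers(reads, k)
    bad = "ACGTTGCC"
    print("median:", median_kmer_frequency(bad, counts, k))
    print("mean:  ", mean(counts[km] for km in kmers(bad, k)))
```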

Page 13: Comprehensive Exam Slides 11/13/2013

Mapping and k-mer coverage measures correlate for simulated genome data and a real E. coli data set (5m reads).

(Brown, Howe, Zhang, Pyrkosz, & Brom, 2012)

Page 14: Comprehensive Exam Slides 11/13/2013

IGS

Suppose there are Y reads with a sequencing depth of X. In other words, for each of those Y reads, there are X-1 other reads that cover the same DNA segment in the genome from which that read originates. So we can estimate that there are Y/X distinct DNA segments with a read coverage of X. We term these distinct DNA segments in a species genome IGSs (informative genomic segments).

An IGS (informative genomic segment) can represent the novel information of a genome
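The Y/X estimate can be sketched directly: group reads by their estimated coverage X, count the Y reads at each depth, and sum Y/X over all depths. This assumes each read already has a positive integer coverage estimate (for example a rounded median k-mer frequency). The function name is hypothetical.

```python
from collections import Counter


def estimate_igs(read_coverages):
    """Estimate the number of informative genomic segments (IGSs).

    read_coverages: one positive coverage estimate X per read.
    Reads with coverage X contribute Y/X segments, where Y is the
    number of reads observed at that coverage.
    """
    reads_at_depth = Counter(read_coverages)  # maps X -> Y
    return sum(y / x for x, y in reads_at_depth.items() if x > 0)


if __name__ == "__main__":
    # 100 reads at depth 10 and 6 reads at depth 2.
    coverages = [10] * 100 + [2] * 6
    print(estimate_igs(coverages))
```

The per-depth table `reads_at_depth` is also an abundance spectrum of IGSs, which is the kind of input that OTU-style estimators and rarefaction curves expect.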

Page 15: Comprehensive Exam Slides 11/13/2013

N = G/(L-k+1)

Example: 1000000/(80-22+1)

Borrowing statistical methods from OTU-based diversity analysis (rarefaction curve, estimators, etc.)
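The slide does not define the symbols. Reading G as genome size in bases, L as read length, and k as k-mer size is an assumption, though it matches the example values (1000000, 80, 22). Under that reading, the arithmetic works out as follows.

```python
def expected_segments(genome_size, read_length, k):
    """N = G / (L - k + 1), with G, L, k interpreted as assumed above."""
    return genome_size / (read_length - k + 1)


if __name__ == "__main__":
    n = expected_segments(1_000_000, 80, 22)
    print(round(n))  # 1000000 / 59, about 16949
```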

Page 16: Comprehensive Exam Slides 11/13/2013

Compare the contents of multiple metagenomics samples

● How different are two samples?

If the sequencing coverage of a read from sample A in sample B is > 0, the segment in sample A from which that read originates also exists in sample B
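A minimal sketch of this comparison counts k-mers in sample B, then measures each read from sample A by its median k-mer frequency in B. Exact counting with `Counter` stands in for a probabilistic counter here, and the function names and toy reads are illustrative only.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def count_kmers(reads, k):
    """Exact k-mer counts over a collection of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def fraction_present(reads_a, reads_b, k):
    """Fraction of reads in A whose coverage in B is greater than zero.

    Assumes reads_a is non-empty and every read is at least k bases long.
    """
    counts_b = count_kmers(reads_b, k)
    present = sum(
        1
        for read in reads_a
        if median(counts_b[km] for km in kmers(read, k)) > 0
    )
    return present / len(reads_a)


if __name__ == "__main__":
    sample_a = ["ACGTTGCA", "GGGGCCCC"]
    sample_b = ["ACGTTGCA", "ACGTTGCA"]
    print(fraction_present(sample_a, sample_b, 4))
```

The measure is not symmetric: `fraction_present(a, b, k)` and `fraction_present(b, a, k)` answer different questions, so both directions are worth reporting.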

Page 17: Comprehensive Exam Slides 11/13/2013

Synthetic dataset A (same abundance):

– SampleA: 100 species with 80 common to B

– SampleB: 100 species with 80 common to A

– SampleC: 100 species with 20 common to A/B, and 60 common to D

– SampleD: 100 species with 20 common to A/B, and 60 common to C

Page 18: Comprehensive Exam Slides 11/13/2013

Synthetic dataset B:

– Sample1A:

● species IDs: 1,2,3,4,5,6,7,8,9,10 relative abundance: 20:18:16:4:3:2:2:2:2:2

– Sample1B:

● species IDs: 1,2,3,14,15,16,17,18,19,20 relative abundance: 20:18:16:4:3:2:2:2:2:2

– Sample1C:

● species IDs: 21,22,3,4,5,6,7,8,9,10 relative abundance: 2:2:2:2:2:3:4:16:18:20

– A and B: high overlap on individual level, low overlap on species level

– A and C: high overlap on species level, low overlap on individual level

– B and C low overlap on species level and low overlap on individual level
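The three overlap statements can be checked numerically from the abundance lists above. The two measures used here are illustrative choices, not necessarily the ones used in the project: the Jaccard index over species IDs for species-level overlap, and the shared fraction of individuals (sum of per-species minimum abundances over the sample total) for individual-level overlap.

```python
def make_sample(species_ids, abundances):
    """Map species ID -> relative abundance."""
    return dict(zip(species_ids, abundances))


sample_1a = make_sample([1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                        [20, 18, 16, 4, 3, 2, 2, 2, 2, 2])
sample_1b = make_sample([1, 2, 3, 14, 15, 16, 17, 18, 19, 20],
                        [20, 18, 16, 4, 3, 2, 2, 2, 2, 2])
sample_1c = make_sample([21, 22, 3, 4, 5, 6, 7, 8, 9, 10],
                        [2, 2, 2, 2, 2, 3, 4, 16, 18, 20])


def species_overlap(a, b):
    """Jaccard index over the sets of species IDs."""
    return len(a.keys() & b.keys()) / len(a.keys() | b.keys())


def individual_overlap(a, b):
    """Shared individuals as a fraction of the smaller sample's total."""
    shared = sum(min(a[s], b[s]) for s in a.keys() & b.keys())
    return shared / min(sum(a.values()), sum(b.values()))


if __name__ == "__main__":
    pairs = {"A-B": (sample_1a, sample_1b),
             "A-C": (sample_1a, sample_1c),
             "B-C": (sample_1b, sample_1c)}
    for name, (x, y) in pairs.items():
        print(name,
              "species:", round(species_overlap(x, y), 3),
              "individuals:", round(individual_overlap(x, y), 3))
```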

Page 19: Comprehensive Exam Slides 11/13/2013

What's Next

● Refine the methods

– Errors are still a problem.

– More statistics of IGSs (informative genomic segments)

● Prove effectiveness using test data sets

– Simulated data sets based on real microbial genomes

– MetaHIT, 124 metagenomic samples from 99 healthy people, and 25 patients with inflammatory bowel disease (IBD) syndrome. Each sample has on average 65 ± 21 million reads.

● Integrate functions into khmer package

Page 20: Comprehensive Exam Slides 11/13/2013

The Great Prairie Grand Challenge

● How many different species are in a soil sample? What is their abundance distribution? How different are soil samples from 100-year cultivated Iowa agricultural soil and native Iowa prairie?

● “Grand Challenge” - extremely large data sets from an extremely complex microbial community

– Over a terabase of sequence from cultivated and uncultivated Iowa soils

– Should be prepared to face technical challenges when dealing with such large-scale data sets (storage, computing resources, HPCC, etc.)

– A preliminary result: the majority of the prairie reads (50%) are present in the corn sample with a coverage of > 0

Page 21: Comprehensive Exam Slides 11/13/2013

Acknowledgement

● Dr. Titus Brown

● Lab members of GED

● Dr. Jason Pell

● Dr. Adina Howe

● Eric McDonald

● Everybody in this room