2014 talk at NYU CUSP: "Biology Caught the Bus: Now What? Sequencing, Big Data, and Biology"


TRANSCRIPT

1. Like the Dog that Caught the Bus: Sequencing, Big Data, and Biology. C. Titus Brown, Assistant Professor, CSE, MMG, BEACON, Michigan State University. Jan 2014. [email protected]

2. 20 years in. Started working in Dr. Koonin's group in 1993; my first publication was submitted almost exactly 20 years ago!

3. Like the Dog that Caught the Bus: Sequencing, Big Data, and Biology. C. Titus Brown, Assistant Professor, CSE, MMG, BEACON, Michigan State University. Jan 2014. [email protected]

4. Analogy: we seek an understanding of humanity via our libraries. http://eofdreams.com/library.html

5. But our only observation tool is shredding a mixture of all of the books and digitizing the shreds. http://eofdreams.com/library.html; http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/; http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/

6. Points: Lots of fragments needed! (Deep sampling.) Having read and understood some books will help quite a bit. (Prior knowledge.) Rare books will be harder to reconstruct than common books. Errors in the OCR process matter quite a bit. The more different, specialized libraries you sample, the more likely you are to discover valid correlations between topics and books. A categorization system would be an invaluable but not infallible guide to book topics. Understanding the language would help you validate and understand the books.

7. Biological analog: shotgun metagenomics. Collect samples; extract DNA; feed into sequencer; computationally analyze. "Sequence it all and let the bioinformaticians sort it out." (Wikipedia: Environmental shotgun sequencing.png)

8. Investigating soil microbial communities. 95% or more of soil microbes cannot be cultured in the lab. Very little transport in soil and sediment => slow mixing rates. Estimates of immense diversity: billions of microbial cells per gram of soil; a million or more microbial species per gram of soil (Gans et al., 2005). One observed lower bound for genomic sequence complexity => 26 Gbp (Amazon Rain Forest Microbial Observatory).

9. By "soil" we understand (Vil'yams, 1931) a loose surface layer of earth capable of yielding plant crops.
In the physical sense the soil represents a complex disperse system consisting of three phases: solid, liquid, and gaseous. Microbes live in and on: surfaces of aggregate particles; pores within microaggregates. (N. A. Krasil'nikov, Soil Microorganisms and Higher Plants, http://www.soilandhealth.org/01aglibrary/010112krasil/010112krasil.ptII.html)

10. Questions to address. Role of soil microbes in nutrient cycling: How does agricultural soil differ from native soil? How do soil microbial communities respond to climate perturbation? Genome-level questions: What kind of strain-level heterogeneity is present in the population? What are the phage and viral populations and dynamics? What species are where, and how much is shared between different geographical locations?

11. Must use culture-independent and metagenomic approaches. Many reasons why you can't or don't want to culture: syntrophic relationships; niche specificity or unknown physiology; dormant microbes; abundance within communities. If you want to get at underlying function, 16S analysis alone is not sufficient. Single-cell sequencing and shotgun metagenomics are two common ways to investigate complex microbial communities.

12. Shotgun metagenomics. Collect samples; extract DNA; feed into sequencer; computationally analyze. "Sequence it all and let the bioinformaticians sort it out." (Wikipedia: Environmental shotgun sequencing.png)

13. Computational reconstruction of (meta)genomic content. http://eofdreams.com/library.html; http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/; http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/

14. Points, revisited. Lots of fragments needed! (Deep sampling.) Having read and understood some books will help quite a bit. (Reference genomes.) Rare books will be harder to reconstruct than common books. Errors in the OCR process matter quite a bit.
(Sequencing error.) The more different, specialized libraries you sample, the more likely you are to discover valid correlations between topics and books. (We don't understand most microbial function.) A categorization system would be an invaluable but not infallible guide to book topics. (Phylogeny can guide interpretation.) Understanding the language would help you validate and understand the books.

15. Great Prairie Grand Challenge: sampling locations, 2008.

16. A Grand Challenge dataset (DOE/JGI). Total: 1,846 Gbp of soil metagenome. For comparison (basepairs of sequencing, in Gbp): MetaHIT (Qin et al., 2011), 578 Gbp; rumen (Hess et al., 2011), 268 Gbp; rumen, k-mer filtered, 111 Gbp; NCBI nr database, 37 Gbp. Samples: Iowa continuous corn; Iowa native prairie; Kansas cultivated corn; Kansas native prairie; Wisconsin continuous corn; Wisconsin native prairie; Wisconsin restored prairie; Wisconsin switchgrass (sequenced on GAII and HiSeq).

17. Why do we need so much data?! 20-40x coverage is necessary; 100x is ~sufficient. Mixed-population sampling => sensitivity is driven by the lowest-abundance members. For example, for E. coli at a 1/1000 dilution, you would need approximately 100x coverage of a 5 Mbp genome at 1/1000, or 500 Gbp of sequence! (For soil, the estimate is 50 Tbp.) Sequencing is straightforward; data analysis is not: the $1000 genome with the $1M analysis.

18. Great Prairie Grand Challenge goals. How much of the source metagenome can we reconstruct from ~300-600 Gbp of shotgun sequencing? (Largest data sets thus far.) What can we learn about soil from looking at the reconstructed metagenome? (See the list of questions.)

19. Same goals, with a caveat: for complex ecological and evolutionary systems, we're just starting to get past the first question. More on that later.
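The slide-17 arithmetic can be written out explicitly. This is a back-of-envelope sketch; the function and its name are illustrative, not from the talk:

```python
# Back-of-envelope estimate of total sequencing needed to reach a target
# coverage on one member of a mixed population (slide 17's E. coli example).

def required_bp(genome_size_bp, target_coverage, abundance):
    """Total basepairs of shotgun sequencing needed so that a genome
    present at fraction `abundance` of the sample reaches `target_coverage`."""
    return genome_size_bp * target_coverage / abundance

# E. coli: ~5 Mbp genome, 100x coverage, present at 1/1000 of the sample.
total = required_bp(5e6, 100, 1 / 1000)
print(f"{total / 1e9:.0f} Gbp")  # 500 Gbp, matching the slide
```

The same formula at soil-like abundances is what pushes the estimate toward tens of terabases.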
20. So, we want to go from raw data (FASTQ: a name line, the sequence, and a quality-score line per read):

@SRR606249.17/1
GAGTATGTTCTCATAGAGGTTGGTANNNNT
+
B@BDDFFFHHHHHJIJJJJGHIJHJ####1
@SRR606249.17/2
CGAANNNNNNNNNNNNNNNNNCCTGGCTCA
+
CCCF#################22@GHIJJJ

21. ...to the assembled original sequence. (UMD assembly primer, cbcb.umd.edu)

22. De Bruijn graphs assemble on overlaps. (J. R. Miller et al., Genomics, 2010)

23. Two problems: (1) variation/error. Single-nucleotide variations cause long branches; they don't rejoin quickly.

24. Two problems: (2) no graph locality. Assembly is inherently an all-by-all process; there is no good way to subdivide the reads without potentially missing a key connection.

25. Assembly graphs scale with data size, not information. (Conway and Bromage, Bioinformatics 2011;27:479-486)

26. Why do k-mer assemblers scale badly? Memory usage ~ real variation + number of errors, and number of errors ~ size of the data set.

27. Practical memory measurements: Velvet measurements (Adina Howe).

28. The problem. We can cheaply gather DNA data in quantities sufficient to swamp straightforward assembly algorithms running on commodity hardware. There is no locality to the data in terms of graph structure. Since ~2008 the field has engaged in lots of engineering optimization, but the data generation rate has consistently outstripped Moore's Law.

29. Our two solutions: 1. Subdivide the data. 2. Discard redundant data.

30. Solution 1: data partitioning (a computational version of cell sorting). Split reads into bins belonging to different source species; this can be done based almost entirely on connectivity of sequences. Divide and conquer; a memory-efficient implementation helps to scale assembly. (Pell et al., 2012, PNAS)

31. Our two solutions, revisited: 1. Subdivide the data (~20x scaling; 2 years to develop; 100x data increase). 2. Discard redundant data.
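To make slides 22-23 concrete, here is a toy sketch of a De Bruijn graph: nodes are (k-1)-mers, edges connect k-mers that overlap by k-1 bases, and a single-nucleotide variant opens a branch. The reads, k, and all names are invented for illustration:

```python
from collections import defaultdict

def kmers(seq, k):
    """All overlapping k-length substrings of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def debruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes
    it connects to; overlapping reads share nodes automatically."""
    graph = defaultdict(set)
    for read in reads:
        for km in kmers(read, k):
            graph[km[:-1]].add(km[1:])
    return graph

reads = ["GAGTATGTT", "ATGTTCTCA"]  # two overlapping, error-free reads
variant = ["GAGTATGAT"]             # one SNV relative to the first read

g = debruijn(reads, 4)
g2 = debruijn(reads + variant, 4)
# The SNV adds a new node and gives node "ATG" two outgoing edges:
# a branch that, with real error rates, bloats the graph's memory use.
print(len(g), len(g2), sorted(g2["ATG"]))
```

This is why memory scales with errors (slide 26): every erroneous k-mer is a new node, even though it carries no new information.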
32. Solution 2: digital normalization (a computational version of library normalization). Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Diversity vs. richness: the high-coverage reads in sample A are unnecessary for assembly.

33. Shotgun sequencing and coverage. Coverage is simply the average number of reads that overlap each true base in the genome. Here, the coverage is ~10: just draw a line straight down from the top through all of the reads.

34. Most shotgun data is redundant. You only need 5-10 reads at a locus to assemble or call (diploid) SNPs, but because sampling is random and you need 5-10 reads at every locus, most loci end up sampled far more deeply than necessary.

35.-40. Digital normalization (step-by-step build slides).

41. Coverage estimation. If you can estimate the coverage of a read in a data set without a reference, this is straightforward:

    for read in dataset:
        if estimated_coverage(read) < CUTOFF:
            save(read)

(Trick: the read-coverage estimator needs to be error-tolerant.)

42. The median k-mer count in a read is a good approximate estimator of its coverage. This gives us a reference-free measure of coverage.

43. Diginorm builds a De Bruijn graph and then downsamples based on observed coverage. This corresponds exactly to the underlying abstraction used for assembly and retains graph structure.

44. The digital normalization approach is streaming and single-pass (looks at each read only once); does not collect the majority of errors; keeps all low-coverage reads; and smooths out coverage of regions. Raw data can be retained for later abundance estimation.

45. Contig assembly now scales with richness (information), not diversity (data size). Most samples can be assembled in < 50 GB of memory.

46. Diginorm is widely useful: 1. Assembly of the H. contortus parasitic nematode genome, a high-polymorphism/variable-coverage problem (Schwarz et al., 2013; pmid 23985341).
2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a big assembly problem (in prep). 3. The Osedax symbiont metagenome, a contaminated-metagenome problem (Goffredi et al., 2013).

47. Diginorm is lossy compression, and nearly perfect from an information-theoretic perspective: it discards 95% or more of the data for genomes while losing < 0.02% of the information.

48. Prospective: sequencing tumor cells. Goal: phylogenetically reconstruct causal driver mutations in the face of passenger mutations. 1000 cells x 3 Gbp x 20x coverage = 60 Tbp of sequence. Most of this data will be redundant and not useful; we are developing diginorm-based algorithms to eliminate data while retaining variant information.

49. Where are we taking this? Streaming, online algorithms look at data only ~once. Diginorm is streaming and online, and conceptually many aspects of sequence analysis can move into streaming mode. => Extraordinary potential for computational efficiency.

50. => Streaming, online variant calling: single-pass, reference-free, tunable. Potentially quite clinically useful.

51. What about the assembly results for Iowa corn and prairie?

             Total assembly | Contigs (> 300 bp) | Reads assembled | Predicted protein coding
Corn:        2.5 billion bp | 4.5 million        | 19%             | 5.3 million
Prairie:     3.5 billion bp | 5.9 million        | 22%             | 6.8 million

Putting it in perspective: the total is equivalent to ~1200 bacterial genomes; the human genome is ~3 billion bp. (Adina Howe)

52. The resulting contigs are low-coverage. (Figure 11: coverage (median basepair) distribution of assembled contigs from soil metagenomes.)

53. So, for soil: we really do need more data, but at least now we can assemble what we already have. The estimated required sequencing depth is 50 Tbp. We now also have 2-8 Tbp from the Amazon Rain Forest Microbial Observatory: still not saturated coverage, but getting closer. And the diginorm approach turns out to be widely useful.

54. Biogeography: Iowa sample overlap? The corn and prairie De Bruijn graphs have 51% overlap, suggesting that at greater depth the samples may have similar genomic content.
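The slide-41 loop and the slide-42 median-k-mer estimator combine into a tiny single-pass normalizer. This is an illustrative sketch only: a plain dict stands in for khmer's memory-efficient counting structure, and K, CUTOFF, and the reads are invented toy values:

```python
from collections import defaultdict
from statistics import median

K = 4        # toy k-mer size (real runs use k around 20)
CUTOFF = 2   # toy coverage cutoff (real runs use e.g. 20)

counts = defaultdict(int)  # stand-in for a compact k-mer counting structure

def median_kmer_count(read):
    """Slide 42: the median k-mer count in a read approximates its coverage,
    and the median makes the estimate tolerant of a few erroneous k-mers."""
    return median(counts[read[i:i + K]] for i in range(len(read) - K + 1))

def normalize(reads):
    """Slides 41/44: one streaming pass; keep a read only while its
    estimated coverage is below CUTOFF, then count its k-mers."""
    kept = []
    for read in reads:
        if median_kmer_count(read) < CUTOFF:
            kept.append(read)
            for i in range(len(read) - K + 1):
                counts[read[i:i + K]] += 1
    return kept

# Redundant data collapses: ten copies of one read keep only ~CUTOFF copies.
kept = normalize(["GAGTATGTTCTCA"] * 10)
print(len(kept))  # 2
```

Because reads are counted only when kept, the majority of error-containing k-mers from discarded high-coverage reads never enter the table, which is what keeps memory use proportional to richness rather than data size.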
55. Concluding thoughts. These are empirically effective tools, in reasonably wide use. Diginorm provides a streaming, online algorithmic basis for: coverage downsampling/lossy compression; error identification (sublinear); error correction; variant calling? It enables analyses that would otherwise be hard or impossible. Most assembly is doable in the cloud or on commodity hardware.

56. The real challenge: understanding. We have gotten distracted by shiny toys: sequencing!! Data!! Data is now plentiful! But: we typically have no knowledge of what > 50% of an environmental metagenome means, functionally. Most data is not openly available, so we cannot mine correlations across data sets. Most computational science is not reproducible, so I can't reuse other people's tools or approaches.

57. Data-intensive biology and hypothesis generation. My interest in biological data is to enable better hypothesis generation.

58. My interests (a platform perspective): an open source ecosystem of analysis tools; loosely coupled APIs for querying databases; publishing reproducible and reusable analyses, openly; education and training.

59. Practical implications of diginorm. Data is (essentially) free. For some problems, analysis is now cheaper than data gathering (i.e., essentially free); plus, we can run most of our approaches in the cloud.

60. khmer-protocols: an effort to provide standard, cheap assembly protocols for the cloud (read cleaning, diginorm, assembly, annotation, RSEM differential expression). Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential-expression analysis; ~$150 on Amazon per data set. Open, versioned, forkable, citable.

61. IPython Notebook: data + code => IPython Notebook.

62. My interests, again (a platform perspective): an open source ecosystem of analysis tools; loosely coupled APIs for querying databases; publishing reproducible and reusable analyses, openly; education and training.

63. We practice open science!
Everything discussed here: code at github.com/ged-lab/ (BSD license); blog at http://ivory.idyll.org/blog; Twitter: @ctitusbrown; grants on the lab web site: http://ged.msu.edu/research.html; preprints on arXiv, q-bio (diginorm arXiv).

64. Acknowledgements. Lab members involved: Adina Howe (w/ Tiedje), Jason Pell, Arend Hintze, Qingpeng Zhang, Elijah Lowe, Likit Preeyanon, Jiarong Guo, Tim Brom, Kanchan Pavangadkar, Eric McDonald, Camille Scott, Jordan Fish, Michael Crusoe, Leigh Sheneman. Collaborators: Jim Tiedje, MSU; Susannah Tringe and Janet Jansson (JGI, LBNL); Erich Schwarz, Caltech/Cornell; Paul Sternberg, Caltech; Robin Gasser, U. Melbourne; Weiming Li, MSU; Shana Goffredi, Occidental. Funding: USDA NIFA; NSF IOS; NIH; BEACON.