Developing tools & methodologies for the next generation of genomics & bioinformatics
Computational challenges to getting NextGen sequence right: implications to diagnostics and therapeutics
Progress in biomedical discovery has been enabled by technological progress
• New sequencing technology – 100s of genomes a day are now produced
• Advances in software
• Standard DNA analysis codes have emerged
• New versions continuously released
• Custom software developed for unconventional analysis
• Development of analysis pipelines for automated analysis and compilation of results
• Advances in computational hardware
• Codes standardized on Intel processor based systems ease porting to new systems
• Continuous advances in Intel product line enable us to easily “keep up”
• The bottom line: with process advances and new Intel MIC processors, throughput has risen from 1 genome per 2 weeks to 50 genomes per day. It is straightforward to expand hardware in response to computational demand.
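As a quick check on the scale of that improvement, the two throughput figures can be put on the same footing. This is a minimal sketch that uses only the numbers quoted on the slide; the variable names are illustrative.

```python
# Throughput figures quoted on the slide
old_days_per_genome = 14      # 1 genome per 2 weeks
new_genomes_per_day = 50

# Genomes processed now in the time one genome used to take
speedup = new_genomes_per_day * old_days_per_genome
print(f"Throughput speedup: {speedup}x")
```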
Computing is primarily done on a machine we developed: SHADOWFAX
A heterogeneous computing environment for data intensive computations
~2,524 CPUs, > 12TB RAM (spectrum of Intel)
8 Intel® Xeon® E5-2600/FPGA hybrid core systems (in partnership with Convey)
~0.8 PB disk arrays (DDN)
100 PB Sun/Oracle tape storage system
With local synchronized copies of major databases: Medline, arXiv, PubMed Central, GenBank, SwissProt, 1000 Genomes Project, The Cancer Genome Atlas, Wikipedia
To meet the needs of applications that demand HPC: deep sequencing assembly and analysis, molecular modeling, simulations, proteomics analysis, text mining, Health IT
NextGen DNA sequence analysis is now the rate limiting step
• The cost of sequencing has dropped from $3B/genome to ~$1K/genome.
• New genomes are sequenced daily.
• An estimated 30,000 human genomes have been completed, 15,000 of them in the public domain.
• Analysis has focused on Single Nucleotide Polymorphisms (“SNPs”), which are single-letter changes in the DNA code.
• For complex diseases like cancer, heart disease and mental disorders, extensive work still explains only 10-20% of the known genetic component.
• Recent research indicates that, due to experimental measurement noise, perhaps most of the measured variations are false positives.
Microsatellites, or repetitive DNA sequences, are particularly challenging
• Microsatellites, also called Simple Sequence Repeats or Short Tandem Repeats, are an understudied portion of the genome because they are considered part of our “Junk DNA” or, more recently, “Dark Matter” DNA; research focus has been on Single Nucleotide Polymorphisms (“SNPs”)
• Microsatellites have known value: long used for paternity and forensic testing and linked to neurological diseases (e.g. Huntington’s and Fragile-X)
• None of the major genomic research projects has focused on microsatellites: not the Human Genome Project, the 1000 Genomes Project, The Cancer Genome Atlas, ENCODE or the iCOGS study.
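To make the term concrete, a microsatellite is a short unit (a few bases) repeated back to back, such as the CAG run associated with Huntington’s. The sketch below scans a sequence for such runs with a regular expression. It is an illustration only, not the method used in this work; the function name and the thresholds (unit length 1 to 6 bases, at least 5 copies) are assumptions chosen for the example.

```python
import re

def find_microsatellites(seq, min_unit=1, max_unit=6, min_repeats=5):
    """Return (start, unit, copies) for each tandem repeat run in seq.

    Thresholds are illustrative assumptions, not values from the talk.
    """
    # A short unit, captured lazily so the smallest repeating unit wins,
    # followed by at least (min_repeats - 1) further copies of itself.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits

# Toy sequence: 8 copies of CAG flanked by non-repetitive bases
toy = "GGTT" + "CAG" * 8 + "TTGA"
print(find_microsatellites(toy))
```

Real reads add sequencing errors and interrupted repeats, which a plain exact-match pattern like this one does not handle.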
Genomeon’s Research Methodology
• Download and rebuild thousands of “healthy” and “affected” genomes
• Create genotype distributions for “healthy” and “affected” populations
• Compute Fisher’s Exact Test p-value for each of ~1 million loci and rank results
• Identify “Patterns of Informative Microsatellites” (PIM) from loci that pass Bonferroni and Benjamini–Hochberg False Discovery Rate tests
• Annotate with ontologies, literature, input from experts
• Business analysis; product definition; IP
• Validate PIM with sequencing of well-characterized samples
• Publish; translate, regulatory approval, reimbursement; team with established clinical services co.
• Manually review, do QC, compute sensitivity and specificity
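The statistical core of this methodology (a per-locus Fisher’s exact test followed by Bonferroni and Benjamini–Hochberg filtering) can be sketched in a few lines of standard-library Python. This is a sketch under stated assumptions, not Genomeon’s pipeline: genotypes are collapsed to "modal repeat length" versus "anything else" so that a 2x2 table applies, alpha = 0.05 is assumed, and all function names and toy data are hypothetical.

```python
from math import comb

def table_from_genotypes(affected, healthy, modal):
    """Collapse per-sample repeat lengths into a 2x2 table (assumed encoding)."""
    a = sum(1 for g in affected if g != modal)  # affected, non-modal
    b = sum(1 for g in affected if g == modal)  # affected, modal
    c = sum(1 for g in healthy if g != modal)   # healthy, non-modal
    d = sum(1 for g in healthy if g == modal)   # healthy, modal
    return a, b, c, d

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table at least as unlikely as the observed one
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def bonferroni(pvals, alpha=0.05):
    """Flag loci whose p-value is at most alpha / number of tests."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Flag loci that pass the Benjamini-Hochberg step-up procedure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            cutoff = rank  # largest rank meeting its threshold
    passed = [False] * m
    for i in order[:cutoff]:
        passed[i] = True
    return passed

# Toy data for one locus: repeat lengths per sample (hypothetical)
affected = [13, 13, 13, 12]
healthy = [12, 12, 12, 13]
p_locus = fisher_exact_2x2(*table_from_genotypes(affected, healthy, modal=12))

# Hypothetical p-values for eight loci, filtered both ways
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(p_locus, bonferroni(pvals), benjamini_hochberg(pvals))
```

With ~1 million loci and an assumed alpha of 0.05, the Bonferroni threshold is 0.05 / 1,000,000 = 5e-8 per locus, which is why ranking and a false discovery rate procedure matter at this scale.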
Genomeon has created a unique library of over 7,700 genomes, with corrected microsatellites, from the 1000 Genomes Project and The Cancer Genome Atlas
• “Healthy Population” representing many ethnicities
• Ovarian cancer
• Breast cancer
• Brain cancer: Glioma; Glioblastoma; Medulloblastoma
• Lung adenocarcinoma
• Prostate cancer
• Melanoma
• Autism
Breast Cancer
Pattern of 55 informative microsatellites differentiates Breast Cancer germlines from healthy germlines
Sensitivity = 84%
Specificity = 87%
BRCA positive samples
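For reference, sensitivity is the fraction of affected samples the pattern flags, and specificity is the fraction of healthy samples it clears. The sketch below shows the arithmetic; the counts are hypothetical, chosen only to reproduce the percentages on the slide, and are not the study’s actual sample sizes.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # affected samples correctly flagged
    specificity = tn / (tn + fp)  # healthy samples correctly cleared
    return sensitivity, specificity

# Hypothetical counts: 100 affected and 100 healthy germlines
sens, spec = sensitivity_specificity(tp=84, fn=16, tn=87, fp=13)
print(f"Sensitivity = {sens:.0%}, Specificity = {spec:.0%}")
```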
Applications of these microsatellite loci variations
• Cancer Risk Diagnostics - Microsatellite profiling for increased risk of cancer, and the tissues at highest risk
• Companion/Treatment Diagnostics - Many informative microsatellites are functional elements implicated in therapeutic response
• Clinical Trial Support - Use of microsatellite profile to differentiate sub-populations in clinical trials
• Drug Targets - Identification of large number of genes previously unassociated with cancer - many with functions associated with cancer processes
• Toxicology - Quantification of stress-induced exposures via microsatellite mutation screen
• Prognosis - Comparison of microsatellite variations between germlines and tumors
• Non-cancer Diseases - PTSD, Autism, MS, cardiac diseases, aging
Thank you. Any Questions?