developing tools & methodologies for the next generation of genomics & bioinformatics

Computational challenges to getting NextGen sequence right: implications for diagnostics and therapeutics

Progress in biomedical discovery has been enabled by technological progress

• New sequencing technology – 100s of genomes a day are now produced

• Advances in software

• Standard DNA analysis codes have emerged

• New versions continuously released

• Custom software developed for unconventional analysis

• Development of analysis pipelines for automated analysis and compilation of results

• Advances in computational hardware

• Codes standardized on Intel processor based systems ease porting to new systems

• Continuous advances in Intel product line enable us to easily “keep up”

• The bottom line – With process advances and new Intel MIC processors we have seen speedups from 1 genome/2 weeks to 50 genomes/day. It is straightforward to expand hardware in response to computational demand
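As a quick check of the throughput figures in the bottom line above, the two rates can be put on the same per-day basis. This is only an arithmetic sketch; the variable names are illustrative.

```python
from fractions import Fraction

# Throughput figures quoted on the slide, expressed in genomes per day.
before = Fraction(1, 14)  # 1 genome per 2 weeks
after = Fraction(50, 1)   # 50 genomes per day

# Ratio of the two rates gives the overall speedup factor.
speedup = after / before
print(f"Throughput increase: {speedup}x")
```

On these figures the quoted improvement corresponds to a roughly 700-fold increase in genomes processed per day.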

Computing is primarily done on a machine we developed: SHADOWFAX

A heterogeneous computing environment for data intensive computations

~2,524 CPUs, > 12TB RAM (spectrum of Intel)

8 Intel® Xeon® E5-2600/FPGA hybrid core systems (in partnership with Convey)

~0.8 PB disk arrays (DDN)

100 PB Sun/Oracle tape storage system

With local synchronized copies of major databases: Medline, arXiv, PubMed Central, GenBank, SwissProt, 1000 Genomes Project, The Cancer Genome Atlas, Wikipedia

To meet the needs of applications that demand HPC: deep sequencing assembly and analysis, molecular modeling, simulations, proteomics analysis, text mining, health IT

NextGen DNA sequence analysis is now the rate limiting step

• The cost of sequencing has dropped from $3B/genome to ~$1K/genome.

• New genomes are sequenced daily.

• An estimated 30,000 human genomes have been completed, with 15,000 of these in the public domain.

• Analysis has focused on Single Nucleotide Polymorphisms (“SNPs”), which are single-letter changes in the DNA code.

• For complex diseases like cancer, heart disease and mental disorders, extensive work still explains only 10-20% of the known genetic component.

• Recent research indicates that, due to experimental measurement noise, perhaps most of the measured variations are false positives.

Microsatellites, or repetitive DNA sequences, are particularly challenging

• Microsatellites, also called Simple Sequence Repeats or Short Tandem Repeats, are an understudied portion of the genome because they are considered part of our “junk DNA” or, more recently, “dark matter” DNA; research has focused on Single Nucleotide Polymorphisms (“SNPs”) instead

• Microsatellites have known value: long used for paternity and forensic testing and linked to neurological diseases (e.g. Huntington’s and Fragile-X)

• None of the major genomic research projects has focused on microsatellites: not the Human Genome Project, the 1000 Genomes Project, The Cancer Genome Atlas, ENCODE, or the iCOGS study.
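To make the term concrete, a microsatellite is a short DNA unit repeated back to back. The sketch below finds perfect tandem repeats with a regular expression. It is an illustration only, not the method used in this work: the function name, the unit-length range of 1 to 6 bases, the minimum of 3 repeats, and the example sequence are all assumptions, and real callers must also handle imperfect repeats and sequencing errors.

```python
import re

def find_microsatellites(seq, min_unit=1, max_unit=6, min_repeats=3):
    """Return (start, unit, repeat_count) for each perfect tandem repeat.

    Hypothetical helper: thresholds are illustrative defaults, not values
    taken from the presentation.
    """
    # Group 1 is the repeat unit (shortest unit tried first); the
    # backreference requires at least min_repeats - 1 further copies.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        repeat_count = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, repeat_count))
    return hits

# Made-up sequence containing five copies of the unit CAG.
example = "AGT" + "CAG" * 5 + "TCGA"
for start, unit, count in find_microsatellites(example):
    print(f"position {start}: ({unit}) x {count}")
```

The repeat count at a locus is the kind of genotype that varies between individuals, which is what makes these loci usable as markers.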

Genomeon’s Research Methodology

• Download and rebuild thousands of “healthy” and “affected” genomes

• Create genotype distributions for “healthy” and “affected” populations

• Compute Fisher’s Exact Test p-value for each of ~1 million loci and rank results

• Identify “Patterns of Informative Microsatellites” (PIM) from loci that pass Bonferroni and Benjamini–Hochberg False Discovery Rate tests

• Annotate with ontologies, literature, input from experts

• Business analysis; product definition; IP

• Validate PIM with sequencing of well-characterized samples

• Publish; translate, regulatory approval, reimbursement; team with established clinical services co.

• Manually review, do QC, compute sensitivity and specificity
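The statistical core of the methodology above (a Fisher's Exact Test per locus, ranking, then Bonferroni and Benjamini–Hochberg filtering) can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not Genomeon's pipeline: it assumes each locus is reduced to a 2x2 table (samples with and without a given genotype, in affected versus healthy populations), and the function names, the 0.05 threshold, the locus names, and the counts are all hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is no more probable than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(total, 1.0)

def bonferroni_pass(pvalues, alpha=0.05):
    """True where p <= alpha / m (family-wise error control)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]

def benjamini_hochberg_pass(pvalues, alpha=0.05):
    """True for hypotheses rejected by the Benjamini-Hochberg step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            max_rank = rank  # keep the largest rank meeting its threshold
    passed = [False] * m
    for i in order[:max_rank]:
        passed[i] = True
    return passed

# Hypothetical counts per locus:
# (affected with genotype, affected without, healthy with, healthy without)
loci = {
    "locus_A": (40, 10, 15, 35),
    "locus_B": (25, 25, 24, 26),
    "locus_C": (30, 20, 18, 32),
}
names = list(loci)
pvals = [fisher_exact_two_sided(*loci[name]) for name in names]
bonf = bonferroni_pass(pvals)
bh = benjamini_hochberg_pass(pvals)

# Rank loci from most to least significant.
for i in sorted(range(len(names)), key=lambda i: pvals[i]):
    print(f"{names[i]}: p={pvals[i]:.3g} bonferroni={bonf[i]} bh={bh[i]}")
```

Bonferroni is the stricter of the two corrections, so with ~1 million loci it keeps only the strongest signals, while Benjamini–Hochberg admits more candidates at a controlled false discovery rate.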

Genomeon has created a unique library of over 7,700 genomes with corrected microsatellites, drawn from the 1000 Genomes Project and The Cancer Genome Atlas

• “Healthy Population” representing many ethnicities

• Ovarian cancer

• Breast cancer

• Brain cancer: Glioma; Glioblastoma; Medulloblastoma

• Lung adenocarcinoma

• Prostate cancer

• Melanoma

• Autism

Breast Cancer

Pattern of 55 informative microsatellites differentiates Breast Cancer germlines from healthy germlines

Sensitivity = 84%

Specificity = 87%

BRCA positive samples
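For readers less familiar with the two metrics quoted for the breast cancer pattern, the sketch below shows how they follow from a confusion matrix. The cohort sizes behind the reported figures are not given here, so the counts in the example are hypothetical, chosen only so the arithmetic reproduces 84% and 87%.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of affected samples that the pattern correctly flags."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of healthy samples that the pattern correctly clears."""
    return true_neg / (true_neg + false_pos)

# Hypothetical cohort: 100 affected and 100 healthy germlines.
# These counts are NOT from the study; they only illustrate the formulas.
print(f"Sensitivity = {sensitivity(84, 16):.0%}")
print(f"Specificity = {specificity(87, 13):.0%}")
```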

Applications of these microsatellite loci variations

Cancer Risk Diagnostics – Microsatellite profiling for increased risk of cancer, and the tissues at highest risk

Companion/Treatment Diagnostics - Many informative microsatellites are functional elements implicated in therapeutic response

Clinical Trial Support - Use of microsatellite profile to differentiate sub-populations in clinical trials

Drug Targets - Identification of a large number of genes previously unassociated with cancer - many with functions associated with cancer processes

Toxicology - Quantification of stress-induced exposures via microsatellite mutation screen

Prognosis - Comparison of microsatellite variations between germlines and tumors

Non-cancer Diseases - PTSD, Autism, MS, cardiac diseases, aging

Thank you. Any Questions?
