be retreat 2015 poster
Data science for pathogen genomic surveillance: predicting quantitative phenotype from genotype
Eric J. Ma, Islam T. M. Hussein, Jonathan A. Runstadler
Department of Biological Engineering, MIT
Analysis: HIV Drug Resistance
Cross-Validated Prediction Performance
Good prediction performance: high correlation, low error.
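The two summary statistics named here (correlation and error) can be computed per cross-validation fold as sketched below. This is a minimal illustration, not the authors' pipeline: the function names, the choice of Pearson correlation and mean squared error, and the interleaved fold assignment are all assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def mean_squared_error(y_true, y_pred):
    """Average squared difference between measured and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx); fold i holds out every k-th sample starting at i."""
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, test
```

A model would be fit on each `train` index set and scored on the matching `test` set with both metrics; high correlation together with low error on held-out folds is what "good prediction performance" refers to.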
Global Drug Resistance Prediction
Model predictions are largely concordant with one another.
Important Amino Acids
Match between expert-identified important positions and model predictions.
Position    10   33   88   84   47   46   54
Rel. Impt   47%  12%  5%   3%   2%   2%   2%
Database    Y    Y    Y    Y    Y    Y    Y
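The comparison in the table above can be expressed as a small overlap check. The sketch below assumes the position row reads as seven two-digit protease positions (HIV protease is 99 residues long, so three-digit positions are not possible) and that "Rel. Impt" is a per-position relative importance in percent; the function names are hypothetical.

```python
# Values transcribed from the poster table. Reading the position row as
# two-digit numbers is an assumption about the original layout.
relative_importance = {10: 47, 33: 12, 88: 5, 84: 3, 47: 2, 46: 2, 54: 2}  # percent
in_database = {10, 33, 88, 84, 47, 46, 54}  # every listed position is marked "Y"

def top_positions(importance, n):
    """Return the n positions with the highest relative importance."""
    return sorted(importance, key=importance.get, reverse=True)[:n]

def overlap_fraction(predicted, known):
    """Fraction of model-ranked positions that also appear in the expert database."""
    predicted = list(predicted)
    return sum(p in known for p in predicted) / len(predicted)

total_importance = sum(relative_importance.values())
```

Under these assumptions the seven listed positions account for 73% of the total relative importance, and all of them appear in the database.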
Temporal Emergence of Drug Resistance
FDA approval dates (arrows): IDV (indinavir), 1996; FPV (fosamprenavir), 2003; DRV (darunavir), 2006.
Drug resistance emerged a few years after approval. FPV and DRV have similar chemical structures.
Goal: Establish an example pipeline for genomic surveillance.
Input: HIV protease sequence & drug resistance profile.
Conclusions & Future Work
Machine learning models can predict drug resistance from protein sequence.
Genomic surveillance is able to capture the temporal rise of drug resistance.
Applicable to other pathogens, given high-quality genotype-phenotype data.
Genomic Surveillance
Zoonotic pathogens circulating in the wild may affect livestock and human health.
Given sequence information, can we compute a pathogen risk factor?
Given a computed risk, can we do preventative surveillance of zoonotic pathogens?
Influenza
Influenza Genome Structure
1 PB2 2.4 kb
2 PB1 2.4 kb
3 PA 2.2 kb
4 HA 1.8 kb
5 NP 1.6 kb
6 NA 1.5 kb
7 M 1.0 kb
8 NS 0.9 kb
Reassortment
Influenza is a zoonotic pathogen that has a broad tropic range. Segmented genome allows reassortment, accelerating viral evolution. A high polymerase mutation rate rapidly generates novel sequence diversity.
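To give a sense of scale for reassortment, the sketch below enumerates the segment combinations available when two strains co-infect one cell. It assumes each of the eight segments is drawn independently from either parent and ignores any incompatibility between segments; the parent labels "A" and "B" are placeholders.

```python
from itertools import product

SEGMENTS = ["PB2", "PB1", "PA", "HA", "NP", "NA", "M", "NS"]

def reassortants(parent_a="A", parent_b="B"):
    """Enumerate every assignment of the eight segments to one of two parental strains."""
    for origins in product((parent_a, parent_b), repeat=len(SEGMENTS)):
        yield dict(zip(SEGMENTS, origins))

all_genotypes = list(reassortants())
# A genotype is novel if it mixes segments from both parents.
novel = [g for g in all_genotypes if len(set(g.values())) > 1]
```

Two parents give 2^8 = 256 combinations, of which 254 differ from both parental genotypes.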
Introduction
Difficulties
Necessity: The presence of a point mutation may enhance a phenotype, but does not necessarily cause dangerous phenotype levels (right).
Epistasis: Mapping from genotype to phenotype.
Experiments: Require assays to measure biochemical phenotype relevant to pathogenesis without infecting humans.
Data: Lack genotype-phenotype data.
Biology: Novel sequence diversity generated through error-prone polymerase.
Gene(s)   Phenotype              HT Assay
HA        receptor affinity      α(2,6) binding
NA        inhibitor resistance   α(2,6) cleavage
HA, NA    antigenic distance     hemagglutinin/neuraminidase inhibition
Pol       replication activity   polymerase assay
Significance: infection potential, treatability, disease burden, immunity
[Figure labels: Genotypes; training data for ML; ML predicts phenotype]
Risk Prediction
[Figure labels: sequences MERI..., MKAK..., MNPN..., MDSN..., and MKAK... with MNPN... together, each paired with "phenotype" and "risk"]
NS1: IFN-production dampening (innate immunity)
Application
Risk Profile
Informed Intervention
Vision
Assay: Biochemical, quantitative measure relevant to pathogenesis.
Characterize: Population diversity.
Machine learning: Learn non-linear mapping from genotype to phenotype.
Model: Quantitative risk profile.
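Before a model can learn a mapping from genotype to phenotype, each sequence has to be turned into numbers. The poster does not state how sequences are featurized; the sketch below assumes a one-hot encoding over the 20 standard amino acids, applied to a short fragment that appears elsewhere on the poster. The function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def one_hot(sequence):
    """Flatten a protein sequence into a 0/1 vector of length 20 * len(sequence)."""
    vector = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[AMINO_ACIDS.index(residue)] = 1  # raises ValueError on non-standard residues
        vector.extend(row)
    return vector

features = one_hot("MERIKEL")
```

Each position contributes 20 binary features, so a non-linear model can weigh a residue at one position differently depending on the residues present at others.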
Experimentation Plan
[Figure: pipeline stages: Rational Library (ATGGTAACCA), PacBio Sequencing, Polymerase Assay, Genotype-Phenotype, Machine Learning, Web Query]
[Figure table: Sequence / PEU: MERIKEL, MERIREL, MDRIKEL, MERIKNL; 2610915]
Rational sampling to cover polymorphic diversity.
High-throughput library construction & verification.
Safe, scalable, standardized assay of RNA replication rate.
Matched phenotype to genotype.
Machine learning models to predict RNA replication rate.
Open data release via web interface & API.
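Covering polymorphic diversity starts with knowing which positions vary. The sketch below finds them in a set of aligned sequences. It assumes the fused sequence run on the poster splits into four 7-residue variants, and the function name is hypothetical.

```python
from collections import Counter

def polymorphic_positions(aligned):
    """Return {1-based position: residue counts} for columns with more than one residue."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to equal length")
    result = {}
    for i, column in enumerate(zip(*aligned), start=1):
        counts = Counter(column)
        if len(counts) > 1:
            result[i] = dict(counts)
    return result

# Assumed split of the poster's example run into four aligned variants.
library = ["MERIKEL", "MERIREL", "MDRIKEL", "MERIKNL"]
variable = polymorphic_positions(library)
```

For this example, positions 2, 5, and 6 vary, each with one minority residue; a rational library would then sample variants so that every observed residue at those positions is represented.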