machine learning for high-throughput biological data
DESCRIPTION
Machine Learning for High-Throughput Biological Data. These notes were originally from KDD2006 tutorial notes, written by David page at Dept. Biostatistics and Medical Informatics Dept. Computer Sciences University of Wisconsin-Madison . http://www.biostat.wisc.edu/~page/PageKDD2006.ppt. - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/1.jpg)
Machine Learning for High-Throughput Biological Data
These notes were originally from KDD2006 tutorial notes, written by David page at Dept. Biostatistics and Medical Informatics Dept. Computer Sciences University of Wisconsin-Madison.http://www.biostat.wisc.edu/~page/PageKDD2006.ppt
![Page 2: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/2.jpg)
Some Data Types We’ll Discuss
Gene expression microarraySingle-nucleotide polymorphisms ( 單一核苷酸基因多形性 )Mass spectrometry proteomics ( 蛋白質組學 ) and metabolomics ( 代謝物組學 )Protein-protein interactions (from co-immunoprecipitation)High-throughput screening of potential drug molecules
![Page 3: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/3.jpg)
image from the DOE Human Genome Programhttp://www.ornl.gov/hgmis
![Page 4: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/4.jpg)
Probes (DNA)
Gene Chip Surface
Hybridization
Labeled Sample (RNA)
How Microarrays Work
![Page 5: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/5.jpg)
Two Views of Microarray Data
Data points are genes Represented by expression levels across
different samples (ie, features=samples) Goal: categorize new genesData points are samples (eg, patients) Represented by expression levels of
different genes (ie, features=genes) Goal: categorize new samples
![Page 6: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/6.jpg)
Two Ways to View The Data
Person Gene A28202_ac AB00014_at AB00015_at . . . Person 1 1142.0 321.0 2567.2 . . . Person 2 586.3 586.1 759.0 . . . Person 3 105.2 559.3 3210.7 . . . Person 4 42.8 692.1 812.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
![Page 7: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/7.jpg)
Data Points are Genes
Person Gene A28202_ac AB00014_at AB00015_at . . . Person 1 1142.0 321.0 2567.2 . . . Person 2 586.3 586.1 759.0 . . . Person 3 105.2 559.3 3210.7 . . . Person 4 42.8 692.1 812.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
![Page 8: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/8.jpg)
Data Points are Samples
Person Gene A28202_ac AB00014_at AB00015_at . . . Person 1 1142.0 321.0 2567.2 . . . Person 2 586.3 586.1 759.0 . . . Person 3 105.2 559.3 3210.7 . . . Person 4 42.8 692.1 812.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
![Page 9: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/9.jpg)
Supervision: Add Class Values
Person Gene A28202_ac AB00014_at AB00015_at . . . Class Person 1 1142.0 321.0 2567.2 . . . normal Person 2 586.3 586.1 759.0 . . . cancer Person 3 105.2 559.3 3210.7 . . . normal Person 4 42.8 692.1 812.0 . . . cancer . . . . . . . . . . . . . . . . . . . . . . . . . . .
![Page 10: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/10.jpg)
Supervised Learning TaskGiven: a set of microarray experiments, each done with mRNA from a different patient (same cell type from every patient) Patient’s expression values for each gene constitute the features, and patient’s disease constitutes the class
Do: Learn a model that accurately predicts class based on features
![Page 11: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/11.jpg)
Data Points are: Genes Samples
Clustering Supervised Data Mining
Predict the class value for a patient
based on the expression levels for his/her genes
Location in Task Space
![Page 12: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/12.jpg)
Leukemia (Golub et al., 1999)Classes Acute Lymphoblastic Leukemia( 淋巴白血病 ) (ALL) and Acute Myeloid Leukemia ( 骨髓白血病 ) (AML)Approach Weighted voting (essentially naïve Bayes)Cross-Validated Accuracy Of 34 samples, declined to predict 5, correct on other 29
![Page 13: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/13.jpg)
Cancer vs. NormalRelatively easy to predict accurately, because so much goes “haywire” in cancer cellsPrimary barrier is noise in the data… impure RNA, cross-hybridization, etcStudies include breast, colon ( 结肠 ), prostate ( 前列腺 ), lymphoma ( 淋巴瘤 ), and multiple myeloma ( 骨髓瘤 )
![Page 14: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/14.jpg)
X-Val Accuracies for Multiple Myeloma (74 MM vs.
31 Normal)
Trees
Boosted Trees
SVMs
Vote
Bayes Nets
98.1
99.0
100.0
97.0
100.0
![Page 15: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/15.jpg)
More MM (300), Benign Condition MGUS (Hardin et al., 2004)
![Page 16: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/16.jpg)
ROC Curves: Cancer vs. Normal
![Page 17: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/17.jpg)
ROC: Cancer vs. Benign (MGUS)
![Page 18: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/18.jpg)
Work by Statisticians Outside of Standard Classification/Clustering
Methods to better convert Affymetrix’s low-level intensity measurements into expression levels: e.g., work by Speed, Wong, IrrizaryMethods to find differentially expressed genes between two samples, e.g. work by Newton and KendziorskiBut the following is most related…
![Page 19: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/19.jpg)
Ranking Genes by Significance
Some biologists don’t want one predictive model, but a rank-ordered list of genes to explore further (with estimated significance)For each gene we have a set of expression levels under our conditions, say cancer vs. normalWe can do a t-test to see if the mean expression levels are different under the two conditions: p-valueMultiple comparisons problem: if we repeat this test for 30,000 genes, some will pop up as significant just by chance aloneCould do a Bonferoni correction (multiply p-values by 30,000), but this is drastic and might eliminate all
![Page 20: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/20.jpg)
False Discovery Rate (FDR) [Storey and Tibshirani, 2001]
Addresses multiple comparisons but is less extreme than BonferoniReplaces p-value by q-value: fraction of genes with this p-value or lower that really don’t have different means in the two classes (false discoveries)Publicly available in R as part of Bioconductor packageRecommendation: Use this in addition to your supervised data mining… your collaborators will want to see it
![Page 21: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/21.jpg)
FDR Highlights Difficulties Getting Insight into Cancer vs. Normal
![Page 22: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/22.jpg)
Using Benign Condition Instead of Normal Helps Somewhat
![Page 23: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/23.jpg)
Question to AnticipateYou’ve run a supervised data mining algorithm on your collaborator’s data, and you present an estimate of accuracy or an ROC curve (from X-val)How did you adjust this for the multiple comparisons problem?Answer: you don’t need to because you commit to a single predictive model before ever looking at the test data for a fold—this is only one comparison
![Page 24: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/24.jpg)
Prognosis and TreatmentFeatures same as for diagnosis
Rather than disease state, class value becomes life expectancy with a given treatment (or positive response vs. no response to given treatment)
![Page 25: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/25.jpg)
Breast Cancer Prognosis (Van’t Veer et al., 2002)
Classes good prognosis (no metastasis within five years of initial diagnosis) vs. poor prognosisAlgorithm Ensemble of votersResults 83% cross-validated accuracy on 78 cases
![Page 26: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/26.jpg)
A LessonPrevious work selected features to use in ensemble by looking at the entire data setShould have repeated feature selection on each cross-val foldAuthors also chose ensemble size by seeing which size gave highest cross-val resultAuthors corrected this in web supplement;accuracy went from 83% to 73%Remember to “tune parameters” separately for each cross-val fold!
![Page 27: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/27.jpg)
Prognosis with Specific Therapy (Rosenwald et al., 2002)
Data set contains gene-expression patterns for 160 patients with diffuse large B-cell lymphoma, receiving anthracycline chemotherapyClass label is five-year survivalOne test-train split 80/80True positive rate: 60% False negative rate: 39%
![Page 28: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/28.jpg)
Some Future DirectionsUsing gene-chip data to select therapy Predict which therapy gives best prognosis for patientCombining Gene Expression Data with Clinical Data such as Lab Results, Medical and Family History Multiple relational tables, may benefit from relational learning
![Page 29: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/29.jpg)
Unsupervised Learning Task
Given: a set of microarray experiments under different conditionsDo: cluster the genes, where a gene described by its expression levels in different experiments
![Page 30: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/30.jpg)
Data Points are: Genes Samples
Clustering Supervised Data Mining
Group genes into clusters, where all
members of a cluster tend to go
up or down together
Location in Task Space
![Page 31: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/31.jpg)
Example(Green = up-regulated, Red = down-regulated)
Gen
es
Experiments (Samples)
![Page 32: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/32.jpg)
Visualizing Gene Clusters (eg, Sharan and Shamir, 2000)
Time (10-minute intervals)
Nor
mal
ized
expr
essi
on
Gene Cluster 1, size=20 Gene Cluster 2, size=43
![Page 33: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/33.jpg)
Unsupervised Learning Task 2
Given: a set of microarray experiments (samples) corresponding to different conditions or patientsDo: cluster the experiments
![Page 34: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/34.jpg)
Data Points are: Genes Samples
Clustering Supervised Data Mining
Group samples by gene expression
profile
Location in Task Space
![Page 35: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/35.jpg)
ExamplesCluster samples from mice subjected to a variety of toxic compounds (Thomas et al., 2001)Cluster samples from cancer patients, potentially to discover different subtypes of a cancerCluster samples taken at different time points
![Page 36: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/36.jpg)
Some Biological PathwaysRegulatory pathways Nodes are labeled by genes Arcs denote influence on transcription G1 codes for P1, P1 inhibits G2’s transcription
Metabolic pathways Nodes are metabolites, large biomolecules (eg,
sugars, lipids, proteins and modified proteins) Arcs from biochemical reaction inputs to
outputs Arcs labeled by enzymes (themselves proteins)
![Page 37: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/37.jpg)
Metabolic Pathway Example
Fumarate
Malate
Oxaloacetate
Citrate cis-Aconitate
Isocitrate
-Ketoglutarate
Succinyl-CoASuccinate
fumarase
succinate thikinase
MDH
citrate synthase aconitase
IDH
-KDGH
FAD
FADH2
H20
NAD+
NADH
Acetyl CoAHSCoA
H20
H20
NAD+
NADH + CO2
NAD+ + HSCoA
NADH + CO2
GDP + Pi GTP+ HSCoA
(Krebs Cycle,TCA Cycle,Citric Acid Cycle)
![Page 38: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/38.jpg)
Regulatory Pathway (KEGG)
![Page 39: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/39.jpg)
Using Microarray Data Only
Regulatory pathways Nodes are labeled by genes Arcs denote influence on transcription G1 codes for P1, P1 inhibits G2’s transcription
Metabolic pathways Nodes are metabolites, large biomolecules (eg,
sugars, lipids, proteins, and modified proteins) Arcs from biochemical reaction inputs to
outputs Arcs labeled by enzymes (themselves proteins)
![Page 40: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/40.jpg)
Supervised Learning Task 2
Given: a set of microarray experiments for same organism under different conditionsDo: Learn graphical model that accurately predicts expression of some genes in terms of others
![Page 41: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/41.jpg)
Some Approaches to Learning Regulatory
NetworksBayes Net Learning (started with Friedman & Halpern, 1999, we’ll see more)
Boolean Networks (Akutsu, Kuhara, Maruyama & Miyano, 1998; Ideker, Thorsson & Karp, 2002)
Related Graphical Approaches (Tanay & Shamir, 2001; Chrisman, Langley, Baay & Pohorille, 2003)
![Page 42: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/42.jpg)
Note: direction of arrowindicates dependencenot causality
P(geneA)
P(geneB)P(
gene
A)
parent node
child node
parent node
child node
geneA
geneB
Bayesian Network (BN)
1.00.0
DatageneBgeneA
Expt1Expt2Expt3Expt4
0.50.5
0.50.5
![Page 43: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/43.jpg)
Problem: Not CausalityA B
A is a good predictor of B. But is A regulating B??Ground truth might be:
B A A C B
B C A
B
C
A Or a more complicated variant
![Page 44: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/44.jpg)
Approaches to Get Causality
Use “knock-outs” (Pe’er, Regev, Elidan and Friedman, 2001). But not available in most organisms.Use time-series data and Dynamic Bayesian Networks (Ong, Glasner and Page, 2002). But even less data typically.Use other data sources, eg sequences upstream of genes, where transcription regulators may bind. (Segal, Barash, Simon, Friedman and Koller, 2002; Noto and Craven, 2005)
![Page 45: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/45.jpg)
A Dynamic Bayes Net
gene 1
gene 2 gene
3gene
N
gene 1
gene 2 gene
3gene
N
![Page 46: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/46.jpg)
Problem: Not Enough Data Points to Construct Large Network
Fortunate to get 100s of chipsBut have 1000s of genes E. coli: ~4000 Yeast: ~6000 Human: ~30,000Want to learn causal graphical model over 1000s of variables with 100s of examples (settings of the variables)
![Page 47: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/47.jpg)
Advance: Module Networks [Segal, Pe’er, Regev, Koller & Friedman, 2005]
Cluster genes by similarity over expression experimentsAll genes in a cluster are “tied together”: same parents and CPDsLearn structure subject to this tying together of genesIteratively re-form clusters and re-learn network, in an EM-like fashion
![Page 48: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/48.jpg)
Problem: Data are Continuous but Models are Discrete
Gene chips provide a real-valued mRNA measurementBoolean networks and most practical Bayes net learning algorithms assume discrete variablesMay lose valuable information by discretizing
![Page 49: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/49.jpg)
Advance: Use of Dynamic Bayes Nets with Continuous Variables [Segal, Pe’er, Regev, Koller & Friedman, 2005]
Expression measurements used instead of discretized (up, down, same)Assume linear influence of parents on children (Michaelis-Menten assumption)Work so far constructed the network from literature and learned parameters
![Page 50: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/50.jpg)
Problem: Much Missing Information
mRNA from gene 1 doesn’t directly alter level of mRNA from gene 2Rather, the protein product from gene 1 may alter level of mRNA from gene 2 (e.g., transcription factor)Activation of transcription factor might not occur by making more of it, but just by phosphorylating it (post-translational modification)
![Page 51: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/51.jpg)
R
Example: Transcription Regulation
DNAgeneA geneB geneCP TO
OperongeneRP O T
Operon OperonOperonR
mRNAmRNA
![Page 52: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/52.jpg)
Approach: Measure More Stuff
Mass spectrometry (later) can measure protein rather than mRNA Doesn’t measure all proteins Not very quantitative (presence/absence)
2D gels can measure post-translational modifications, but still low-throughput because of current analysisCo-immunoprecipitation (later), Yeast 2-Hybrids can measure protein interactions, but noisy
![Page 53: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/53.jpg)
Another Way Around Limitations
Identify smaller part of the task that is a step toward a full regulatory pathway Part of a pathway Classes or groups of genesExamples:Chromatin remodelers Predicting the operons in E. coli
![Page 54: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/54.jpg)
Chromatin Remodelers and Nucleosome [Segal et al. 2006]
Previous DNA picture oversimplifiedDNA double-helix is wrapped in further complex structureDNA is accessible only if part of this structure is unwoundCan we predict what chromatin remodelers act on what parts of DNA, also what activates a remodeler?
![Page 55: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/55.jpg)
The E. Coli Genome
![Page 56: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/56.jpg)
Finding Operons in E. coli(Craven, Page, Shavlik, Bockhorst and Glasner, 2000)
Given: known operons and other E. coli data Do: predict all operons in E. coli
Additional Sources of Information gene-expression data functional annotation
g1g2 g3
g5g4
promoter terminator
![Page 57: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/57.jpg)
Comparing Naive Bayes and Decision Trees (C5.0)
![Page 58: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/58.jpg)
Using Only Individual Features
![Page 59: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/59.jpg)
Single-Nucleotide Polymorphisms
SNPs: Individual positions in DNA where variation is commonRoughly 2 million known SNPs in humansNew Affymetrix whole-genome scan measures 500,000 of theseEasier/faster/cheaper to measure SNPs than to completely sequence everyoneMotivation …
![Page 60: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/60.jpg)
Not Susceptible or Not Responding
Susceptible to Disease D or Responds to Treatment T
If We Sequenced Everyone…
![Page 61: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/61.jpg)
Example of SNP Data
Person SNP 1 2 3 . . . CLASS Person 1 C T A G T T . . . old Person 2 C C A G C T . . . young Person 3 T T A A C C . . . old Person 4 C T G G T T . . . young . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
![Page 62: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/62.jpg)
Phasing (Haplotyping)
![Page 63: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/63.jpg)
Advantages of SNP DataPerson’s SNP pattern does not change with time or disease, so it can give more insight into susceptibility
Easier to collect samples (can simply use blood rather than affected tissue)
![Page 64: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/64.jpg)
Challenges of SNP DataUnphased Algorithms exist for phasing (haplotyping), but they make errors and typically need related individuals, dense coverageMissing values are more common than in microarray data (though improving substantially, down to around 1-2% now)Many more measurements. For example, Affymetrix human SNP chip at a half million SNPs.
![Page 65: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/65.jpg)
Supervised Learning TaskGiven: a set of SNP profiles, each from a different patient.Phased: nucleotides at each SNP position on each copy of each chromosome constitute the features, and patient’s disease constitutes the class
Unphased: unordered pair of nucleotides at each SNP position constitute the features, and patient’s disease constitutes the class
Do: Learn a model that accurately predicts class based on features
![Page 66: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/66.jpg)
Waddell et al., 2005Multiple Myeloma, Young (susceptible) vs. Old (less susceptible), 3000 SNPs, best at 64% acc (training)SVM with feature selection (repeated on every fold of cross-validation): 72% accuracy, also naïve Bayes. Significantly better than chance.
Old
Young
Old Young
Actual31 9
14 26
![Page 67: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/67.jpg)
Listgarten et al., 2005 SVMs from SNP data predict lung cancer susceptibility at 69% accuracy Naïve Bayes gives similar performance Best single SNP at less than 60% accuracy (training)
![Page 68: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/68.jpg)
LessonsSupervised data mining algorithms can predict disease susceptibility at rates better than chance and better than individual SNPsAccuracies much lower than we see with microarray data, because we’re predicting who will get disease, not who already has it
![Page 69: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/69.jpg)
Future DirectionsPharmacogenetics: predicting drug response from SNP profile Drug Efficacy Adverse ReactionCombining SNP data with other data types, such as clinical (history, lab tests) and microarray
![Page 70: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/70.jpg)
ProteomicsMicroarrays are useful primarily because mRNA concentrations serve as surrogate for protein concentrationsLike to measure protein concentrations directly, but at present cannot do so insame high-throughput mannerProteins do not have obvious direct complementsCould build molecules that bind, but binding greatly affected by protein structure
![Page 71: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/71.jpg)
Time-of-Flight (TOF) Mass Spectrometry (thanks Sean McIlwain)
Laser
+VSample
Measures the time for an ionized particle, starting from the sample plate, to hit the detector
Detector
![Page 72: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/72.jpg)
Time-of-Flight (TOF) Mass Spectrometry 2
Laser
+VSample
Matrix-Assisted Laser Desorption-Ionization (MALDI) Crystalloid structures made using proton-rich matrix moleculeHitting crystalloid with laser causes molecules to ionize and “fly” towards detector
Detector
![Page 73: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/73.jpg)
Time-of-Flight Demonstration 0
Sample Plate
![Page 74: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/74.jpg)
Time-of-Flight Demonstration 1
Matrix Molecules
![Page 75: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/75.jpg)
Time-of-Flight Demonstration 2
Protein Molecules
![Page 76: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/76.jpg)
Time-of-Flight Demonstration 3
LaserDetector
+10KV Positive Charge
![Page 77: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/77.jpg)
Time-of-Flight Demonstration 4
+10KV
+
Proton kicked off matrix molecule onto another molecule
Laser pulsed directly onto sample
![Page 78: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/78.jpg)
Time-of-Flight Demonstration 5
+10KV
++
++ +
Lots of protons kicked off matrix ions, giving rise to more positively charged molecules
![Page 79: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/79.jpg)
Time-of-Flight Demonstration 6
+10KV
++
++ +
The high positive potential under sample plate, causes positively charged molecules to accelerate towards detector
![Page 80: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/80.jpg)
Time-of-Flight Demonstration 7
+10Kv
+
+
+
+
+
+
Smaller mass molecules hit detector first, while heavier ones detected later
![Page 81: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/81.jpg)
Time-of-Flight Demonstration 8
+10KV
+
+
+
+
+
+
The incident time measured from when laser is pulsed until molecule hits detector
![Page 82: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/82.jpg)
Time-of-Flight Demonstration 9
+10KV
++ + + ++
Experiment repeated a number of times, counting frequencies of “flight-times”
![Page 83: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/83.jpg)
Example Spectra from a Competition by Lin et al. at
Duke
These are different fractions from the same sample.
M/Z
Inte
nsity
![Page 84: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/84.jpg)
Trypsin-Treated Spectra
M/Z
Freq
uenc
y
![Page 85: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/85.jpg)
Many Challenges Raised by Mass Spectrometry DataNoise: extra peaks from handling of sample, from machine and environment (electrical noise), etc.M/Z values may not align exactly across spectra (resolution ~0.1%)Intensities not calibrated across spectra: quantification is difficultCannot get all proteins… typically only several hundred. To improve odds of getting the ones we want, may fractionate our sample by 2D gel electrophoresis or liquid chromatography.
![Page 86: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/86.jpg)
Challenges (Continued)Better results if partially digest proteins (break into smaller peptides) firstCan be difficult to determine what proteins we have from spectrumIsotopic peaks: C13 and N15 atoms in varying numbers cause multiple peaks for a single peptide
![Page 87: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/87.jpg)
Handling Noise: Peak Picking
Want to pick peaks that are statistically significant from the noise signal
Want to use these as features in our learning algorithms.
![Page 88: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/88.jpg)
Many Supervised Learning Tasks
Learn to predict proteins from spectra, when the organism’s proteome is knownLearn to identify isotopic distributionsLearn to predict disease from either proteins, peaks or isotopic distributions as featuresConstruct pathway models
![Page 89: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/89.jpg)
Using Mass Spectrometry for Early Detection of Ovarian Cancer [Petricoin et al., 2002]Ovarian cancer difficult to detect early, often leading to poor prognosisTrained and tested on mass spectra from blood serum100 training cases, 50 with cancerHeld-out test set of 116 cases, 50 with cancer100% sensitivity, 95% specificity (63/66) on held-out test set
![Page 90: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/90.jpg)
Not So FastData mining methodology seems soundBut Keith Baggerly argues that cancer samples were handled differently than normal samples, and perhaps data were preprocessed differently tooIf we run cancer samples Monday and normals Wednesday, could get differences from machine breakdown or nearby electrical equipment that’s running on Monday but not WedLesson: tell collaborators they must randomize samples for the entire processing phase and of course all our preprocessing must be sameDebate is still raging… results not replicated in trials
![Page 91: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/91.jpg)
Other Proteomics: 3D Structures
![Page 92: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/92.jpg)
Other Proteomics: Interactions
Figure from Ideker et al., Science 292(5518):929-934, 2001
each node represents a gene product (protein)blue edges show direct protein-protein interactionsyellow edges show interactions in which one protein binds to DNA and affects the expression of another
![Page 93: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/93.jpg)
Protein-Protein Interactions
Yeast 2-HybridImmunoprecipitation Antibodies (“immuno”) are made by
combinatorial combinations of certain proteins
Millions of antibodies can be made, to recognize a wide variety of different antigens (“invaders”), often by recognizing specific proteinsantibody protein
![Page 94: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/94.jpg)
Protein-Protein Interactions
![Page 95: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/95.jpg)
Immunoprecipitationantibody
![Page 96: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/96.jpg)
Co-Immunoprecipitationantibody
![Page 97: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/97.jpg)
Many Supervised Learning Tasks
Learn to predict protein-protein interactions: protein 3D structures may be criticalUse protein-protein interactions in construction of pathway modelsLearn to predict protein function from interaction data
![Page 98: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/98.jpg)
ChIP-Chip DataImmunoprecipitation can also be done to identify proteins interacting with DNA rather than other proteinsChromatin immunoprecipitation (ChIP): grab sample of DNA bound to a particular protein (transcription factor)ChIP-Chip: run this sample of DNA on a microarray to see which DNA was boundExample of analysis of such new data: Keles et al., 2006
![Page 99: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/99.jpg)
MetabolomicsMeasures concentration of each low-molecular weight molecule in sampleThese typically are “metabolites,” or small molecules produced or consumed by reactions in biochemical pathwaysThese reactions typically catalyzed by proteins (specifically, enzymes)This data typically also mass spectrometry, though could also be NMR
![Page 100: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/100.jpg)
LipomicsAnalogous to metabolomics, but measuring concentrations of lipids rather than metabolitesPotentially help induce biochemical pathway information or to help disease diagnosis or treatment choice
![Page 101: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/101.jpg)
To Design a Drug:Identify Target
Protein
DetermineTarget SiteStructure
Synthesize aMolecule that
Will Bind
Knowledge of proteome/genomeRelevant biochemical pathways
Crystallography, NMRDifficult if Membrane-Bound
Imperfect modeling of structureStructures may change at bindingAnd even then…
![Page 102: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/102.jpg)
Molecule Binds Target But May:
Bind too tightly or not tightly enough.Be toxic.Have other effects (side-effects) in the body.Break down as soon as it gets into the body, or may not leave the body soon enough.It may not get to where it should in the body (e.g., crossing blood-brain barrier).Not diffuse from gut to bloodstream.
![Page 103: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/103.jpg)
And Every Body is Different:
Even if a molecule works in the test tube and works in animal studies, it may not work in people (will fail in clinical trials).A molecule may work for some people but not others.A molecule may cause harmful side-effects in some people but not others.
![Page 104: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/104.jpg)
Typical Practice when Target Structure is Unknown
High-Throughput Screening (HTS): Test many molecules (1,000,000) to find some that bind to target (ligands).Infer (induce) shape of target site from 3D structural similarities.Shared 3D substructure is called a pharmacophore.Perfect example of a machine learning task with spatial target.
![Page 105: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/105.jpg)
An Example of Structure Learning
Act
ive
Inac
tive
![Page 106: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/106.jpg)
Common Data Mining Approaches
Represent a molecule by thousands to millions of features and use standard techniques (e.g., KDD Cup 2001)Represent each low-energy conformer by feature vector and use multiple-instance learning (e.g., Jain et al., 1998)Relational learning Inductive logic programming (e.g., Finn et
al., 1998) Graph mining
![Page 107: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/107.jpg)
Supervised Learning TaskGiven: a set of molecules, each labeled by activity -- binding affinity for target protein -- and a set of low-energy conformers for each molecule
Do: Learn a model that accurately predicts activity (may be Boolean or real-valued)
![Page 108: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/108.jpg)
ILP as Illustration: The Logical Representation of a Pharmacophore
Active(X) if: has-conformation(X,Conf),has-hydrophobic(X,A),has-hydrophobic(X,B),distance(X,Conf,A,B,3.0,1.0),has-acceptor(X,C),distance(X,Conf,A,C,4.0,1.0),distance(X,Conf,B,C,5.0,1.0).
This logical clause states that a molecule X is active if it has someconformation Conf, hydrophobic groups A and B, and a hydrogen acceptor Csuch that the following holds: in conformation Conf of molecule X, thedistance between A and B is 3 Angstroms (plus or minus 1), the distancebetween A and C is 4, and the distance between B and C is 5.
![Page 109: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/109.jpg)
Background Knowledge IInformation about atoms and bonds in the molecules
atm(m1,a1,o,3,5.915800,-2.441200,1.799700).atm(m1,a2,c,3,0.574700,-2.773300,0.337600).atm(m1,a3,s,3,0.408000,-3.511700,-1.314000).
bond(m1,a1,a2,1). bond(m1,a2,a3,1).
![Page 110: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/110.jpg)
Background knowledge IIDefinition of distance equivalencedist(Drug,Atom1,Atom2,Dist,Error):- number(Error), coord(Drug,Atom1,X1,Y1,Z1), coord(Drug,Atom2,X2,Y2,Z2), euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),Dist1), Diff is Dist1-Dist, absolute_value(Diff,E1), E1 =< Error.
euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),D):- Dsq is (X1-X2)^2+(Y1-Y2)^2+(Z1-Z2)^2, D is sqrt(Dsq).
![Page 111: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/111.jpg)
Central Idea: Generalize by searching a lattice
Lattice of Hypotheses active(X)
active(X) if active(X) if active(X) ifhas-hydrophobic(X,A) has-donor(X,A) has-acceptor(X,A)
active(X) if active(X) if active(X) ifhas-hydrophobic(X,A), has-donor(X,A), has-acceptor(X,A),has-donor(X,B), has-donor(X,B), has-donor(X,B),distance(X,A,B,5.0) distance(X,A,B,4.0) distance(X,A,B,6.0)
etc.
![Page 112: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/112.jpg)
Conformational model Conformational flexibility modelled as multiple conformations:
Sybyl randomsearch Catalyst
![Page 113: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/113.jpg)
Pharmacophore description
Atom and site centred Hydrogen bond donor Hydrogen bond acceptor Hydrophobe Site points (limited at present) User definableDistance based
![Page 114: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/114.jpg)
Example 1: Dopamine agonists
Agonists taken from Martin data set on QSAR society web pagesExamples (5-50 conformations/molecule)
NCH3
CH3
OH
OH
OH
OHNH2 NH
OH
OH
NH2
OH
OHN
CH3 OH
OH
NH2
OH
OH
![Page 115: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/115.jpg)
Pharmacophore identified
Molecule A has the desired activity if: in conformation B molecule A contains a hydrogen acceptor at C, and in conformation B molecule A contains a basic nitrogen group at D, and the distance between C and D is 7.05966 +/- 0.75 Angstroms, and in conformation B molecule A contains a hydrogen acceptor at E, and the distance between C and E is 2.80871 +/- 0.75 Angstroms, and the distance between D and E is 6.36846 +/- 0.75 Angstroms, and in conformation B molecule A contains a hydrophobic group at F, and the distance between C and F is 2.68136 +/- 0.75 Angstroms, and the distance between D and F is 4.80399 +/- 0.75 Angstroms, and the distance between E and F is 2.74602 +/- 0.75 Angstroms.
![Page 116: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/116.jpg)
Example II: ACE inhibitors28 angiotensin converting enzyme inhibitors taken from literature D. Mayer et al., J. Comput.-Aided Mol.
Design, 1, 3-16, (1987)
N
NN
O
O
COOH
SHN
COOHO
CH3
NH
POH
OHO N
COOHO
CH3
NH
COOH
![Page 117: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/117.jpg)
ACE pharmacophoreMolecule A is an ACE inhibitor if: molecule A contains a zinc-site B, molecule A contains a hydrogen acceptor C, the distance between B and C is 7.899 +/- 0.750 A, molecule A contains a hydrogen acceptor D, the distance between B and D is 8.475 +/- 0.750 A, the distance between C and D is 2.133 +/- 0.750 A, molecule A contains a hydrogen acceptor E, the distance between B and E is 4.891 +/- 0.750 A, the distance between C and E is 3.114 +/- 0.750 A, the distance between D and E is 3.753 +/- 0.750 A.
![Page 118: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/118.jpg)
Pharmacophore discovered
Distance Progol Mayer
A 4.9 5.0
B 3.8 3.8
C 8.5 8.6
Zinc site
H-bond acceptor
C
AB
![Page 119: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/119.jpg)
Additional FindingOriginal pharmacophore rediscovered plus one other different zinc ligand position similar to alternative proposed by
Ciba-Geigy
7.3
3.94.0
![Page 120: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/120.jpg)
Example III: Thermolysin inhibitors
10 inhibitors for which crystallographic data is available in PDBConformationally challenging moleculesExperimentally observed superposition
![Page 121: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/121.jpg)
Key binding site interactions
Asn112-NH
S2’ O=C Asn112
Arg203-NH S1’
O=C Ala113Zn
OOH
NH
O
NHPO
O
R
![Page 122: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/122.jpg)
Interactions made by inhibitors
Interaction 1HYT 1THL 1TLP 1TMN 2TMN 4TLN 4TMN 5TLN 5TMN 6TMNAsn112-NH S2’ Asn112 C=O Arg 203 NH S1’ Ala113-C=O Zn
![Page 123: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/123.jpg)
Pharmacophore Identification
Structures considered 1HYT 1THL 1TLP 1TMN 2TMN 4TLN 4TMN 5TLN 5TMN 6TMNConformational analysis using “Best” conformer generation in Catalyst98-251 conformations/molecule
![Page 124: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/124.jpg)
Thermolysin Results10 5-point pharmacophore identified, falling into 2 groups (7/10 molecules) 3 “acceptors”, 1 hydrophobe, 1 donor 4 “acceptors, 1 donorCommon core of Zn ligands, Arg203 and Asn112 interactions identifiedCorrect assignments of functional groupsCorrect geometry to 1 Angstrom tolerance
![Page 125: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/125.jpg)
Thermolysin resultsIncreasing tolerance to 1.5Angstroms finds common 6-point pharmacophore including one extra interaction
![Page 126: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/126.jpg)
Example IV: Antibacterial peptides [Spatola et al., 2000]
Dataset of 11 pentapeptides showing activity against Pseudomonas aeruginosa 6 actives <6g/ml IC50 5 inactives
![Page 127: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/127.jpg)
Pharmacophore Identified
A Molecule M is active against Pseudomonas Aeruginosaif it has a conformation B such that:
M has a hydrophobic group C,M has a hydrogen acceptor D,the distance between C and D in conformation B is 11.7 AngstromsM has a positively-charged atom E,the distance between C and E in conformation B is 4 Angstromsthe distance between D and E in conformation B is 9.4 AngstromsM has a positively-charged atom F,the distance between C and F in conformation B is 11.1 Angstromsthe distance between D and F in conformation B is 12.6 Angstromsthe distance between E and F in conformation B is 8.7 Angstroms
Tolerance 1.5 Angstroms
![Page 128: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/128.jpg)
![Page 129: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/129.jpg)
Clinical Databases of the Future (Dramatically Simplified)
PatientID Gender Birthdate
P1 M 3/22/63
PatientID Date Physician Symptoms Diagnosis
P1 1/1/01 Smith palpitations hypoglycemic P1 2/1/03 Jones fever, aches influenza
PatientID Date Lab Test Result
P1 1/1/01 blood glucose 42 P1 1/9/01 blood glucose 45
PatientID SNP1 SNP2 … SNP500K
P1 AA AB BB P2 AB BB AA
PatientID Date Prescribed Date Filled Physician Medication Dose Duration
P1 5/17/98 5/18/98 Jones prilosec 10mg 3 months
![Page 130: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/130.jpg)
Final Wrap-upMolecular biology collecting lots and lots of data in post-genome eraOpportunity to “connect” molecular-level information to diseases and treatmentNeed analysis tools to interpretData mining opportunities aboundHopefully this tutorial provided solid start toward applying data mining to high-throughput biological data
![Page 131: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/131.jpg)
Thanks ToJude ShavlikJohn ShaughnessyBart BarlogieMark CravenSean McIlwainJan StruyfArno SpatolaPaul FinnBeth Burnside
Michael MollaMichael WaddellIrene OngJesse DavisSoumya RayJo HardinJohn CrowleyFenghuang ZhanEric Lantz
![Page 132: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/132.jpg)
If Time Permits… some of my group’s directions in the area
Clinical Data (with Jesse Davis, Beth Burnside, M.D.)Addressing another problem with current approaches to biological network learning (with Soumya Ray, Eric Lantz)
![Page 133: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/133.jpg)
Using Machine Learning with Clinical Histories: Example
We’ll use example of Mammography to show some issues that ariseThese issues arise here even with just one relational tableThese issues are even more pronounced with data in multiple tables
![Page 134: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/134.jpg)
Supervised Learning TaskGiven: a database of mammogram abnormalities for different patients (same cell type from every patient) Radiologist-entered values describing the abnormality constitute the features, and abnormality’s biopsy result as benign or malignant constitutes the class
Do: Learn a model that accurately predicts class based on features
![Page 135: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/135.jpg)
P1 1 5/02 Spic 0.03 RU4 B P1 2 5/04 Var 0.04 RU4 M P1 3 5/04 Spic 0.04 LL3 B … … … … … … …
Patient Abnormality Date Mass Shape … Mass Size Loc Be/Mal
Mammography Database[Davis et al, 2005; Burnside et al, 2005]
![Page 136: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/136.jpg)
Original Expert Structure
![Page 137: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/137.jpg)
Level 1: Parameters
Be/Mal
Shape Size
Given: Features (node labels, or fields in database), Data, Bayes net structureLearn: Probabilities. Note: probabilities needed are Pr(Be/Mal), Pr(Shape|Be/Mal), Pr (Size|Be/Mal)
![Page 138: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/138.jpg)
Level 2: Structure
Be/Mal
Shape Size
Given: Features, Data Learn: Bayes net structure and probabilities. Note: with this structure, now will need Pr(Size|Shape,Be/Mal) instead of Pr(Size|Be/Mal).
![Page 139: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/139.jpg)
P1 1 5/02 Spic 0.03 RU4 B P1 2 5/04 Var 0.04 RU4 M P1 3 5/04 Spic 0.04 LL3 B … … … … … … …
Patient Abnormality Date Mass Shape … Mass Size Loc Be/Mal
Mammography Database
![Page 140: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/140.jpg)
P1 1 5/02 Spic 0.03 RU4 B P1 2 5/04 Var 0.04 RU4 M P1 3 5/04 Spic 0.04 LL3 B … … … … … … …
Patient Abnormality Date Mass Shape … Mass Size Loc Be/Mal
Mammography Database
![Page 141: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/141.jpg)
P1 1 5/02 Spic 0.03 RU4 B P1 2 5/04 Var 0.04 RU4 M P1 3 5/04 Spic 0.04 LL3 B … … … … … … …
Patient Abnormality Date Mass Shape … Mass Size Loc Be/Mal
Mammography Database
![Page 142: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/142.jpg)
Level 3: Aggregates
Given: Features, Data, Background knowledge – aggregation functions such as average, mode, max, etc. Learn: Useful aggregate features, Bayes net structure that uses these features, and probabilities. New features may use other rows/tables.
Be/Mal
Shape Size
Avg size this date
![Page 143: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/143.jpg)
P1 1 5/02 Spic 0.03 RU4 B P1 2 5/04 Var 0.04 RU4 M P1 3 5/04 Spic 0.04 LL3 B … … … … … … …
Patient Abnormality Date Mass Shape … Mass Size Loc Be/Mal
Mammography Database
![Page 144: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/144.jpg)
P1 1 5/02 Spic 0.03 RU4 B P1 2 5/04 Var 0.04 RU4 M P1 3 5/04 Spic 0.04 LL3 B … … … … … … …
Patient Abnormality Date Mass Shape … Mass Size Loc Be/Mal
Mammography Database
![Page 145: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/145.jpg)
P1 1 5/02 Spic 0.03 RU4 B P1 2 5/04 Var 0.04 RU4 M P1 3 5/04 Spic 0.04 LL3 B … … … … … … …
Patient Abnormality Date Mass Shape … Mass Size Loc Be/Mal
Mammography Database
![Page 146: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/146.jpg)
Level 4: View LearningGiven: Features, Data, Background knowledge – aggregation functions and intensionally-defined relations such as “increase” or “same location”Learn: Useful new features defined by views (equivalent to rules or SQL queries), Bayes net structure, and probabilities.
Be/Mal
Shape Size
Avg size this date
Shape change in abnormality at this location
Increase in average size of abnormalities
![Page 147: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/147.jpg)
Example of Learned Ruleis_malignant(A) IF:
'BIRADS_category'(A,b5), 'MassPAO'(A,present), 'MassesDensity'(A,high), 'HO_BreastCA'(A,hxDCorLC), in_same_mammogram(A,B), 'Calc_Pleomorphic'(B,notPresent), 'Calc_Punctate'(B,notPresent).
![Page 148: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/148.jpg)
ROC: Level 2 (TAN) vs. Level 1
![Page 149: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/149.jpg)
Precision-Recall Curves
![Page 150: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/150.jpg)
All Levels of Learning
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Recall
Prec
isio
nLevel 4 (View)
Level 3 (Aggregate)
Level 2 (Structure)
Level 1 (Parameter)
![Page 151: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/151.jpg)
SAYU-ViewImproved View Learning approachSAYU: Score As You UseFor each candidate rule, add it to the Bayesian network and see if it improves the network’s scoreOnly add a rule (new field for the view) if it improves the Bayes net
![Page 152: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/152.jpg)
Relational Learning Algorithms
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Recall
Prec
isio
n
SAYU-View
Initial Level 4 (View)
Level 3 (Aggregates)
![Page 153: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/153.jpg)
Average Area Under PR Curve For Recall >= 0.5
Level 3
Initial Level 4
SAYU-View
0.00
0.02
0.04
0.06
0.08
0.10
0.12
AURPR
![Page 154: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/154.jpg)
Clinical Databases of the Future (Dramatically Simplified)
PatientID Gender Birthdate
P1 M 3/22/63
PatientID Date Physician Symptoms Diagnosis
P1 1/1/01 Smith palpitations hypoglycemic P1 2/1/03 Jones fever, aches influenza
PatientID Date Lab Test Result
P1 1/1/01 blood glucose 42 P1 1/9/01 blood glucose 45
PatientID SNP1 SNP2 … SNP500K
P1 AA AB BB P2 AB BB AA
PatientID Date Prescribed Date Filled Physician Medication Dose Duration
P1 5/17/98 5/18/98 Jones prilosec 10mg 3 months
![Page 155: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/155.jpg)
Another Problem with Current Learning of Regulatory Models
Current techniques all use greedy heuristicBayes net learning algorithms use “sparse candidate” approach: to be considered as a parent of gene 1, another gene 2 must be correlated with gene 1CPDs often represented as trees: use greedy tree learning algorithmsAll can fall prey to functions such as exclusive-or – do these arise?
![Page 156: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/156.jpg)
Skewing Example [Page & Ray, 2003; Ray & Page, 2004; Rosell et al., 2005; Ray & Page, 2005]
Drosophila survival based on gender and Sxl gene activity
Female SxL Gene Gene3 … Gene100 Survival
0 00…00000000…0000001 …1…1111111
0
0 10…00000000…0000001 …1…1111111
1
1 00…00000000…0000001 …1…1111111
1
1 10…00000000…0000001 …1…1111111
0
![Page 157: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/157.jpg)
Hard FunctionsOur Definition: those functions for which no attribute has “gain” according to standard purity measures (GINI, Entropy) NOTE: “Hard” does not refer to size
of representation Example: n-variable odd parity
Many others
![Page 158: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/158.jpg)
Learning Hard Functions
Standard method of learning hard functions (e.g. with decision trees): depth-k Lookahead O(mn2k+1-1) for m examples in n variables
We devise a technique that allows learning algorithms to efficiently learn hard functions
![Page 159: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/159.jpg)
Key IdeaHard functions are not hard for all data distributionsWe can skew the input distribution to simulate a different one By randomly choosing “preferred
values” for attributesAccumulate evidence over several skews to select a split attribute
![Page 160: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/160.jpg)
Example: Uniform Distribution
0)(
25.0)1;(25.0)0;(
25.0)(
i
i
i
xGAIN
xfGINIxfGINI
fGINI
x1 x2 x3 f
0 0 0 01
0 1 0 11
1 0 0 11
1 1 0 01
![Page 161: Machine Learning for High-Throughput Biological Data](https://reader033.vdocuments.net/reader033/viewer/2022061612/56815d06550346895dcb0785/html5/thumbnails/161.jpg)
Example: Skewed Distribution(“Sequential Skewing”, Ray & Page, ICML 2004)
x1 x2 x3 f Weight
0 0 0 01
0 1 0 11
1 0 0 11
1 1 0 01
41
43
25.0)1;(25.0)1;(19.0)1;(
25.0)(
3
2
1
xfGINIxfGINIxfGINI
fGINI
43
43
43
41
41
41
0)(0)(0)(
3
2
1
xGAINxGAINxGAIN