automated hypothesis testing with large scale scientific workflows
TRANSCRIPT
Automated Hypothesis Testing with Large Scale Scientific Workflows
Yolanda GilDaniel GarijoRajiv Mayani
Varun Ratnakar
Information Sciences Institute & Department of Computer Science
University of Southern Californiahttp://www.isi.edu
Parag MallickRavali AdusumilliHunter Boyce
Stanford School of MedicineCanary Center for Early Cancer Detection
Stanford Universityhttp://mallicklab.stanford.edu
http://www.disk-project.org
Talk Outline๏ Motivation
๏ Research Challenges1. Representing Hypotheses
2. Representing Lines of Inquiry
3. Meta-analysis to review workflow results
๏ DISK Scenario walkthrough
๏ Results in cancer multi-omics
๏ Related work
๏ Contributions and Future Work
Scientific Data Analysis Today: Inefficient, Incomplete, Irreproducible
๏ Data analysis is time consuming
๏ Not systematic
๏ Not updated when new data/methods become available
๏ Hard/impractical to reproduce prior work
๏ Overall process is manually done: inefficient and error-prone
๏ Analytic knowledge is compartmentalised
New hypothesis
Formulate line of inquiry
(data + method)
Retrieve data
Run
workflows (methods)
Meta-analysis of results
Our Focus: Cancer Multi-Omics๏ Data Availability and Complexity:
• The multi-omic domain is filled with multiple levels of heterogeneous data that is regularly expanding in volume and complexity through projects like The Cancer Genome Atlas TCGA and and the associated Clinical Proteomic Tumor Analysis Consortium (CPTAC)
Our Focus: Cancer Multi-Omics๏ Analytic Complexity:
• Multi-omic analysis requires the use of dozens of interconnected tools each of which may require substantial domain knowledge. MAQ
BWABWA-SW(SEonly)PERMSOAPv2MOSAIKNOVOALIGN
SAMTOOLSPICARDGATKPICARDSAMTOOLSIGVtools
Domain Knowledge is isolated
Our Focus: Cancer Multi-Omics๏ Multiple types and complexities
of hypotheses:
• Hypotheses span the range from single-gene/single dataset to multi-gene/multi-ome/multi-dataset
• Is this protein is found in this sample ?• Is this gene is found in this sample ?• Is this protein is associated with a
certain cancer ?• Which proteins are associated with a
certain cancer ?• ..• ..
Talk Outline๏ Motivation
๏ Our Approach & Research Challenges1. Representing Hypotheses
2. Representing Lines of Inquiry
3. Meta-analysis to review workflow results
๏ DISK Scenario walkthrough
๏ Results in cancer multi-omics
๏ Related work
๏ Contributions and Future Work
Our Approach: Hypotheses-Driven Discovery๏ Represent scientist
hypotheses
๏ Formulate lines of inquiry that express how a type of hypothesis can be pursued by data analysis workflows
๏ Design a meta-analysis that examines the results of lines of inquiry and either validates or revises the original hypotheses
๏ Develop an intelligent agent that can report and explain new findings to the scientist
Hypothesis
Lines of Inquiry Specify relevant analytic methods (workflows),
type of data needed, and how to combine results
Query to retrieve Data
Data Analysis Workflows
Workflow Bindings
Meta-Workflows
Confidence Estimation
Benchmarking
Revised hypothesis & interesting findings
Representing Hypotheses
Hypothesis
Lines of Inquiry Specify relevant analytic methods (workflows),
type of data needed, and how to combine results
Query to retrieve Data
Data Analysis Workflows
Workflow Bindings
Meta-Workflows
Confidence Estimation
Benchmarking
Revised hypothesis & interesting findings
Representing Hypotheses
Requirements from Omics
๏ Graph-based hypothesis representation
• Entities are nodes
• Relationships are links
๏ Annotations on graphs
• Represent qualifications of hypotheses: confidence and evidence
๏ Representing hypothesis evolution
• Graph versioning
Graph representation in RDF
๏ Standard semantic web language
๏ Scalable reasoners available
๏ Qualifications and provenance through triple reification
๏ Versioning through multiple named graphs
Representing Hypotheses
Representing Hypotheses
Biology ontology
Hypothesis ontology
hyp:expressedInuser:TCGA-AA-3561-01A-22
User data definitions
hyp:associatedWithbio:ColonCancer
Graph Hy1
Graph Hy2
bio:PRKCDBP
bio:PRKCDBP
Lifecycle of a hypothesis
Biology ontology
Hypothesis ontology
hyp:expressedInuser:TCGA-AA-3561-01A-22
User data definitions
hyp:associatedWithbio:ColonCancer
Graph Hy1
Graph Hy2
bio:PRKCDBP
bio:PRKCDBP
1. Initial Hypothesis, Data & Workflows
Data Available
Workflows Available
Proteomics
Proteogenomics
XX_3561Proteome_VU.zip (MassSpecData)
producedData TCGA-AA-3561(Patient)
collectedFromTCGA-AA-3561-01A-22(Sample)
AA_3561_EX2(Experiment)
experimentedOn
Hypothesis Statement Hy1
PRKCDBPexpressedIn
TCGA-AA-3561-01A-22
2. Running workflows on Data
Data Available
Workflows Available
Proteomics
Proteogenomics
XX_3561Proteome_VU.zip (MassSpecData)
producedData TCGA-AA-3561(Patient)
collectedFromTCGA-AA-3561-01A-22(Sample)
AA_3561_EX2(Experiment)
experimentedOn
Workflow ExecutionW1
hasWorkflowTemplate
used
Hypothesis Statement Hy1
PRKCDBPexpressedIn
TCGA-AA-3561-01A-22
Qualifications of Hy1'Provenance of Hy1'
Hypothesis Statement Hy1
3. Meta reasoning about workflow results
PRKCDBPexpressedIn
TCGA-AA-3561-01A-22
Data Available
Workflows Available
Proteomics
Proteogenomics
XX_3561Proteome_VU.zip (MassSpecData)
producedData TCGA-AA-3561(Patient)
collectedFromTCGA-AA-3561-01A-22(Sample)
AA_3561_EX2(Experiment)
experimentedOn
Workflow ExecutionW1
hasWorkflowTemplate
used
Meta-Workflow ExecutionMW1
used
Revised Hypothesis Statement Hy1'
PRKCDBPexpressedIn
TCGA-AA-3561-01A-22
hasConfidenceValue
0
Statement Hy1'-S1hasProvenanceproducedused
produced
revisionOf
4. New Data becomes available
Workflows Available
Proteomics
Proteogenomics
Hypothesis Statement Ha1
PRKCDBPexpressedIn
TCGA-AA-3561-01A-22
Data Available
XX_3561Proteome_VU.zip (MassSpecData)
producedData
producedData
experimentedOn
experimentedO
nTCGA-AA-3561
(Patient)collectedFromTCGA-AA-3561-01A-22
(Sample)
AA_3561_EX1(Experiment)
AA_3561_EX2(Experiment)
XX_3561_DD.zip (RNASeqData)
5. New Multi-Workflows are also run
Workflows Available
Proteomics
Proteogenomics
used
Data Available
XX_3561Proteome_VU.zip (MassSpecData)
producedData
producedData
experimentedOn
experimentedO
nTCGA-AA-3561
(Patient)collectedFromTCGA-AA-3561-01A-22
(Sample)
AA_3561_EX1(Experiment)
AA_3561_EX2(Experiment)
Workflow ExecutionW2
XX_3561_DD.zip (RNASeqData)
Workflow ExecutionW1
used
Hypothesis Statement Ha1
PRKCDBPexpressedIn
TCGA-AA-3561-01A-22
Qualifications of Ha1'
hasProvenance
Provenance of Ha1'
6. Hypothesis Revision
Workflows Available
Proteomics
Proteogenomics
used
used
Revised Hypothesis Statement Ha1'
PRKCDBPMutated
expressedInTCGA-AA-3561-01A-22
hasConfidenceValue
0.98
Statement Ha1'-S1producedused
Data Available
XX_3561Proteome_VU.zip (MassSpecData)
producedData
producedData
experimentedOn
experimentedO
nTCGA-AA-3561
(Patient)collectedFromTCGA-AA-3561-01A-22
(Sample)
AA_3561_EX1(Experiment)
AA_3561_EX2(Experiment)
Workflow ExecutionW2
XX_3561_DD.zip (RNASeqData)
Workflow ExecutionW1
used usedproduced
Meta-Workflow ExecutionMW2
Hypothesis Statement Ha1
PRKCDBPexpressedIn
TCGA-AA-3561-01A-22
revisionOf
Representing Lines of Inquiry & Data analysis workflows
Hypothesis
Lines of Inquiry Specify relevant analytic methods (workflows),
type of data needed, and how to combine results
Query to retrieve Data
Data Analysis Workflows
Workflow Bindings
Meta-Workflows
Confidence Estimation
Benchmarking
Revised hypothesis & interesting findings
Data Query Pattern
DataFile ?d
Hypothesis Pattern
Lines of Inquiry๏ Capture how to setup potential analyses that can be pursued to test a certain type of
hypothesis
bio:Protein ?phyp:expressedIn
bio:Sample ?s
producedDataPatient ?pcollectedFromSample ?sExperiment ?e
experimentedOn
Data Analytic WorkflowsProteomicsProteogenomics
DataFile ?d
Meta-workflowsComparisonConfidence estimation Benchmarking
Example Multi-omics Workflow (Zhang et. al replication)
Automated Workflow Generation in WINGS by Reasoning about Semantic Constraints
Example: all input data must be from human species, i.e. must have HS in metadata
Workflow system uses this constraint to select datasets that have HS in their metadata so they are valid
Representing Hypotheses
Hypothesis
Lines of Inquiry Specify relevant analytic methods (workflows),
type of data needed, and how to combine results
Query to retrieve Data
Data Analysis Workflows
Workflow Bindings
Meta-Workflows
Confidence Estimation
Benchmarking
Revised hypothesis & interesting findings
Meta-workflows:1) Comparison Meta-Workflows
Variant Detection
Custom Protein DB
Protein Identification
Protein Identification
Custom DB Reference DB
Protein IDs Protein IDs
Similarity Score Data Dependent:
• Peptide Level • Protein Level • Scan Level
Comparison Meta-Workflow
๏ Goals:
• Compare results amongst multiple workflows
• Measure the global similarity amongst multiple workflows
• Provide users with explanation of workflow-dependent differences in results
Meta-workflows:2) Benchmark Meta-Workflows
๏ Goals:
• Evaluation of workflow performance
• Training of confidence estimation models (probabilistic)
Probabilistic Models
Benchmark Meta-Workflow
ROC, True/False Positive Rate
Meta-workflows:3) Confidence estimation Meta-Workflows
๏ Goals:
• Composite results from multiple workflows
• Estimate confidence of the workflow result
• Use estimated confidence to update hypothesis
Protein Identification
Protein Identification
Custom DB Reference DB
Protein IDs Protein IDs Probabilistic Model
Estimate Confidence
Update Hypothesis
Benchmark Meta-Workflow
Talk Outline๏ Motivation
๏ Our Approach & Research Challenges1. Representing Hypotheses
2. Representing Lines of Inquiry
3. Meta-analysis to review workflow results
๏ DISK Scenario walkthrough
๏ Results in cancer multi-omics
๏ Related work
๏ Contributions and Future Work
DISK Walkthrough: Initial Hypothesis๏ Initial hypothesis is provided by the user
• PRKCDBP protein is expressed in a patient sample
DISK Walkthrough: Lines of Inquiry๏ Line of inquiry suggests to find data from different experiments done with the
patient’s sample, then run multi-omic workflows, and then combine evidence into confidence score
General hypothesis pattern
Data query pattern: search for different experiments that produced omics data (eg type RNASeq and MassSpecData)
Data analysis workflows to run on genomics and proteomics data (more omics in the future)
Meta-workflows to assess confidence on the hypothesis based on workflow results
DISK Walkthrough: Data & Workflows
To test a hypothesis that a protein is present in a patient’s sample:
๏ Retrieve mass spec and RNASeq data
๏ Use workflows
• Wf1: Proteome only
• Wf2: ProteoGenomic
DISK Walkthrough: Meta-Workflows๏ After running the workflows, meta-
workflow analyse the results and generate a confidence value
DISK Walkthrough: Revised Hypothesis๏ The hypothesis is revised and given a confidence value:
• A mutation of the protein PRKCDBP has been expressed in the patient’s sample TCGA-AA-3561-01A-22 with a confidence 0.9887
DISK Walkthrough: Provenance Details๏ Hypothesis provenance stores information about workflows run and the data used
• Workflow execution provenance is published by WINGS in the prov standard.
Talk Outline๏ Motivation
๏ Our Approach & Research Challenges1. Representing Hypotheses
2. Representing Lines of Inquiry
3. Meta-analysis to review workflow results
๏ DISK Scenario walkthrough
๏ Results in cancer multi-omics
๏ Related work
๏ Contributions and Future Work
DISK: Automated DIscovery of Scientific Knowledge
Workflow Constraints
Workflow Reasoning
Open Publication of
Results as Linked Data
Workflow Provenance
WINGS Intelligent Workflow System
Lines of Inquiry
Interactive Discovery
Agent
Hypothesis Evaluation Hypotheses
Revised hypotheses
& interesting findings
Analytic Workflows
Data Retrieval
Workflow Binding
Meta-Workflows
Confidence Estimation
Benchmarking
Formulate Lines of Inquiry
Meta-Analysis of Results
Data Repository
Our Initial Focus: Reproduce Seminal Omics Analysis [Zhang et al 2014]
๏ Replicated [Zhang et al 2014] Proteogenomic analysis of Colo-rectal cancer
๏ Successfully reproduced paper findings comparing results at multiple levels (final figure, supplementary tables, etc.)
๏ Took months and direct conversations with authors to replicate paper figures and supplemental figures
๏ Application of analysis approach to new cancer type now takes minutes
• Useful when TCGA is integrated
๏ Expanded analysis to
• compare how sensitive findings were to workflow details0
2
4
6
−1.0 −0.5 0.0 0.5 1.0spearman correlation
den
sity
Correlation between mRNA−protein abundance(within samples)
0
1
2
−4 −3 −2 −1 0spearman correlation
den
sity
Correlation between mRNA−protein variation(across samples)
Impact on Cancer Multi-Omics
Talk Outline๏ Motivation
๏ Our Approach & Research Challenges1. Representing Hypotheses
2. Representing Lines of Inquiry
3. Meta-analysis to review workflow results
๏ DISK Scenario walkthrough
๏ Results in cancer multi-omics
๏ Related work
๏ Contributions and Future Work
Related Work1) Discovery Systems๏ [Lenat 1976]
๏ [Lindsay et al 1980]
๏ [Langley 1981]
๏ [Falkenhainer 1985]
๏ [Kulkarni and Simon 1988]
๏ [Cheeseman et al 1989]
๏ [Zytkow et al 1990]
๏ [Simon 1996]
๏ [Valdes-Perez 1997]
๏ [Todorovski et al 2000]
๏ [Schmidt and Lipson 2009]
Related Work:2) Hypothesis Representation as Graphs๏ Existing vocabularies are related but need to be extended to represent hypotheses in
DISK
• SWAN [Gao et al 2006]
• EXPO [Soldatova and King 2006]
• Nanopublications [Groth et al 2010]
• Ovopublications [Callahan and Dumontier 2013]
• Micropublications [Clark et al 2014]
• LSC
• BEL
Talk Outline๏ Motivation
๏ Our Approach & Research Challenges1. Representing Hypotheses
2. Representing Lines of Inquiry
3. Meta-analysis to review workflow results
๏ DISK Scenario walkthrough
๏ Results in cancer multi-omics
๏ Related work
๏ Contributions and Future Work
Contributions
๏ Represent scientist hypotheses
• Hypothesis ontology includes revisions & provenance
๏ Formulate lines of inquiry that express how a type of hypothesis can be pursued with a data analysis workflow
• Lines of inquiry outline what type of data and workflows to use, and customize them to the hypotheses at hand
๏ Design a meta-analysis to assess the results of lines of inquiry and revise the original hypotheses
• Meta-analysis workflows assess diverse evidence
Ongoing & Future Work๏ Ongoing work:
• Interactive Discovery Agent that explains interesting findings
• Continuous analysis of data (TCGA/CPTAC) as it grows
• Extending and generalizing meta-workflows
• Using DISK in geosciences: Subsurface water resource modeling
๏ Future challenges:
• More complex hypotheses about several entities
• Incorporate evidence over time
• Designing domain-independent meta-workflows
• Resource-bound hypothesis exploration
Thank you