![Page 1: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/1.jpg)
Exploring and Exploiting the Biological Maze
Zoé Lacroix
Arizona State University
![Page 2: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/2.jpg)
Data collection queries
Scientific protocol
– Must be able to reproduce the process
Involve multiple resources
– Data sources
– Applications
![Page 3: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/3.jpg)
Expressing scientific protocols
Scientific protocols mix design and implementation
Design
– What the protocol does (tasks)
– Scientific objects involved
Implementation
– How the protocol is executed
– Data sources and applications
![Page 4: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/4.jpg)
Expressing scientific protocols
Scientific protocols are driven by their implementation
– Scientists use the resources they know
• data (quality)
• access to data
• format, limits, etc.
– Scientists may not exploit better resources because they do not know them
Queries should be driven by the design; the implementation should meet the design needs
![Page 5: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/5.jpg)
Example* - Pipeline for Analysis of Protein Variation Due to Alternative Splicing and SNPs
The alternative splicing pipeline will provide a complete characterization of variations in proteins due to splice variation or SNPs evident in repositories of contiguous genome sequence data and expressed sequence tags (ESTs). The pipeline applies secondary structure, tertiary structure, domain motif detection and sequence comparison tools to proteins encoded by genes with alternatively spliced forms or SNPs.
*Courtesy of Dr. Marta Janer, Institute for Systems Biology
![Page 6: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/6.jpg)
Step 2 - Pipeline for Analysis of Protein Variation Due to Alternative Splicing and SNPs
From GenBank, dbEST and the RIKEN Clone Collection, collect all EST and full-length cDNA sequences from the target organisms of interest (in this case, human and mouse) that match the query proteins (mouse DNA binding proteins) using tblastn. Map the query protein to the target DNA sequences, keeping track of which query amino acids correspond to which nucleotides.
![Page 7: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/7.jpg)
Step 2 - Data sources
Data sources named in Step 2: GenBank, dbEST and the RIKEN Clone Collection.
![Page 8: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/8.jpg)
Step 2 - Tools
Tool named in Step 2: tblastn.
![Page 9: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/9.jpg)
Step 2 - Tasks
Tasks named in Step 2: collect the sequences that match the query proteins, then map each query protein to the target DNA sequences.
![Page 10: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/10.jpg)
Step 2 - Scientific objects
Scientific objects named in Step 2: EST and full-length cDNA sequences, query proteins, amino acids and nucleotides.
![Page 11: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/11.jpg)
Pipeline Selecting Target Proteins*
Resources shown in the pipeline diagram: SMART, Swiss-Prot; BIND, DIP, CEY2H; SigPep; BLAST against D. melanogaster
Step 1 = retrieve all proteins from SMART and Swiss-Prot with textual search with the keyword “apoptosis”
Step 2 = retrieve all proteins from Swiss-Prot with a signal peptide feature and the keyword “apoptosis”
Step 3 = retrieve their binding partners from DIP, BIND and the C. elegans dataset
Step 4 = run through a signal peptide prediction program such as SigPep to check for the presence of signal peptides in each of the sequences
Step 5 = homology search using BLAST of the retrieved sequences with proteins predicted from the Drosophila melanogaster genome might yield additional candidates
Output = final set of signal peptide proteins involved in apoptosis
*Courtesy of Dr. Terry Gaasterland, The Rockefeller University
![Page 12: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/12.jpg)
Design and implementation

| Step | Task | Implementation |
|---|---|---|
| Input | Relevant keyword for which the proteins are required | |
| Step 1 | All proteins with the keyword and a signal peptide feature must be retrieved | SMART, Swiss-Prot |
| Step 2 | Binding partners of all of these proteins are retrieved | DIP, BIND |
| Step 3 | The integrated final set is run through a signal peptide prediction program | SigPep |
| Step 4 | Homology search of the retrieved sequences with proteins predicted from the specific genome yields additional candidates | BLAST |

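
The separation between task (design) and resource (implementation) in the table above can be sketched in Python. This is a minimal illustration, not BioNavigation's data model: the `Step` class, the `design` and `rebind` helpers, and the replacement resource name are all hypothetical. It shows that swapping the resource bound to a step leaves the design unchanged.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One protocol step: the task is design, the resources are implementation."""
    task: str
    resources: list[str] = field(default_factory=list)


# Tasks and resources follow the table above; the structure is illustrative.
protocol = [
    Step("Retrieve all proteins with the keyword and a signal peptide feature",
         ["SMART", "Swiss-Prot"]),
    Step("Retrieve binding partners of these proteins", ["DIP", "BIND"]),
    Step("Run the integrated set through a signal peptide prediction program",
         ["SigPep"]),
    Step("Homology search against proteins predicted from a specific genome",
         ["BLAST"]),
]


def design(steps):
    """Design view: what the protocol does, with no resource names."""
    return [s.task for s in steps]


def rebind(steps, index, resources):
    """Return a copy of the protocol with one step bound to other resources."""
    return [
        Step(s.task, list(resources)) if i == index else s
        for i, s in enumerate(steps)
    ]


# Hypothetical alternative resource for the prediction step.
alternative = rebind(protocol, 2, ["another signal peptide predictor (hypothetical)"])
```

Here `design(alternative)` equals `design(protocol)`: only the implementation column has changed.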
![Page 13: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/13.jpg)
Expressing scientific pipelines with BioNavigation
Queries are expressed at a conceptual level (design)
Conceptual level (scientific classes): DNA Seq., Disease, Gene, Citation, Protein Seq.
![Page 14: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/14.jpg)
Conceptual graph
Labeled edges
– Scientifically meaningful edges
Nodes: Gene, NucleotideSequence, DNA, RNA, mRNA, Protein
Edge labels: isA, transcribesTo, isTranscribedFrom, translatesTo, isTranslatedFrom
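
A conceptual graph with labeled edges can be sketched as a list of (source, label, target) triples. The node and label names below come from the slide, but the endpoints of each edge are assumptions, since the slide text does not say which nodes each edge connects. The `follow` helper is hypothetical and only illustrates navigating by edge label.

```python
from collections import defaultdict

# (source, label, target) triples. Names come from the slide; the endpoints
# of each edge are ASSUMED for illustration.
EDGES = [
    ("DNA", "isA", "NucleotideSequence"),
    ("RNA", "isA", "NucleotideSequence"),
    ("mRNA", "isA", "RNA"),
    ("Gene", "isA", "DNA"),
    ("Gene", "transcribesTo", "mRNA"),
    ("mRNA", "isTranscribedFrom", "Gene"),
    ("mRNA", "translatesTo", "Protein"),
    ("Protein", "isTranslatedFrom", "mRNA"),
]

# Adjacency list: node -> list of (label, target) pairs.
graph = defaultdict(list)
for src, label, dst in EDGES:
    graph[src].append((label, dst))


def follow(start, labels):
    """Return the set of nodes reached from `start` by following `labels` in order."""
    frontier = {start}
    for label in labels:
        frontier = {
            dst
            for node in frontier
            for (lab, dst) in graph.get(node, [])
            if lab == label
        }
    return frontier
```

Under these assumed edges, `follow("Gene", ["transcribesTo", "translatesTo"])` reaches `Protein`, while a label that does not leave the start node yields an empty set.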
![Page 15: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/15.jpg)
Conceptual graph
Nodes: Gene, NucleotideSequence, DNA, RNA, mRNA, Protein
Edge labels: isA, transcribesTo, isTranscribedFrom, translatesTo, isTranslatedFrom, IsRelatedTo
![Page 16: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/16.jpg)
Mapping to physical resources
Physical level (data sources): OMIM, GenBank, PubMed, HUGO, NCBI Protein
Conceptual level (scientific classes): DNA Seq., Disease, Gene, Citation, Protein Seq.
![Page 18: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/18.jpg)
Exploring biological metadata
Query: “Return all citations that are related to some disease or condition”
Diabetes: 11, Aging: 71, Cancer: 391
Sources in the diagram: OMIM, NUCLEOTIDE, PROTEIN, PUBMED
Paths: P1, P2, P3
• Link: Entrez provides an index through the Links display option of each entry
• Parse: each entry is parsed to retrieve its related entries
• All: Entrez provides an index through the Links display option, which allows a set of entries to be examined at a time
![Page 19: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/19.jpg)
Selecting biological resources
3 resources that look the same
– Are they the same?
3 paths that will retrieve PubMed entries related to citations
– Do they have the same semantics?
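
Enumerating the alternative paths between two sources can be sketched as a depth-first search over a graph of source-to-source links. The link structure below is an assumption made for illustration: the slides name four sources and three paths, but not which links make up each path. The `simple_paths` helper is hypothetical.

```python
# Links between sources. ASSUMED for illustration only.
LINKS = {
    "OMIM": ["PUBMED", "NUCLEOTIDE", "PROTEIN"],
    "NUCLEOTIDE": ["PUBMED"],
    "PROTEIN": ["PUBMED"],
    "PUBMED": [],
}


def simple_paths(links, start, goal, prefix=None):
    """Yield every cycle-free path from `start` to `goal` as a list of sources."""
    path = (prefix or []) + [start]
    if start == goal:
        yield path
        return
    for nxt in links.get(start, []):
        if nxt not in path:  # never revisit a source on the same path
            yield from simple_paths(links, nxt, goal, path)


paths = list(simple_paths(LINKS, "OMIM", "PUBMED"))
for p in paths:
    print(" -> ".join(p))
```

Each path returns PubMed entries, but the sets of entries can differ, which is why the paths are compared by result rather than assumed to be equivalent.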
![Page 20: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/20.jpg)
Results for the disease conditions diabetes, aging and cancer
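
| Condition | Method | P1 | P2 | P3 |
|---|---|---|---|---|
| Diabetes | Link | 43,890 | 42,969 | 59,959 |
| Diabetes | Parse | 43,747 | 43,090 | 51,906 |
| Diabetes | All | 44,037 | 43,581 | 49,719 |
| Aging | Link | 48,393 | 51,712 | 60,129 |
| Aging | Parse | 48,398 | 51,855 | 61,260 |
| Aging | All | 48,393 | 51,474 | 60,938 |
| Cancer | Link | 56,315 | 54,487 | 62,686 |
| Cancer | Parse | 56,315 | 54,607 | 63,367 |
| Cancer | All | 56,532 | 52,488 | 60,033 |
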
![Page 21: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/21.jpg)
Overlap results for the disease condition diabetes
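
| Method | Path | P1 | P2 | P3 |
|---|---|---|---|---|
| Link | P1 | 100% | 25.82% | 21.95% |
| Link | P2 | 25.28% | 100% | 70.00% |
| Link | P3 | 29.98% | 97.68% | 100% |
| Parse | P1 | 100% | 23.93% | 22.87% |
| Parse | P2 | 29.18% | 100% | 81.20% |
| Parse | P3 | 33.60% | 97.81% | 100% |
| All | P1 | 100% | 24.75% | 24.29% |
| All | P2 | 24.64% | 100% | 79.49% |
| All | P3 | 27.42% | 90.68% | 100% |
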
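
A pairwise overlap matrix of this kind can be sketched as follows. The slides do not define the overlap measure; the asymmetry of the table suggests a percentage normalized by one of the two result sets, and the choice made here (the share of the row path's results also found in the column path's results) is an assumption. The identifiers are toy values, not data from the study.

```python
def overlap(a, b):
    """Percentage of the entries in `a` that also appear in `b` (assumed measure)."""
    a, b = set(a), set(b)
    return 100.0 * len(a & b) / len(a) if a else 0.0


def overlap_matrix(results):
    """Build a row-by-column matrix of overlaps between named result sets."""
    names = sorted(results)
    return {r: {c: overlap(results[r], results[c]) for c in names} for r in names}


# Toy identifiers standing in for the entries retrieved along each path.
results = {
    "P1": {1, 2, 3, 4},
    "P2": {3, 4, 5, 6, 7, 8, 9, 10},
    "P3": {3, 4, 5, 6, 7, 8, 9, 10, 11, 12},
}

matrix = overlap_matrix(results)
for row, cols in matrix.items():
    print(row, {c: f"{v:.2f}%" for c, v in cols.items()})
```

Because the measure is normalized by the row set, `matrix["P1"]["P2"]` and `matrix["P2"]["P1"]` generally differ, as the off-diagonal cells in the table do.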
![Page 22: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/22.jpg)
Evaluating resources
Similar applications
– Different outputs
Similar data sources
– Different outputs
Number of resources
– Different outputs
Order of resources
– Different outputs
![Page 23: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/23.jpg)
Exploiting semantics of resources
Number of entries
Characterization of entries (number of attributes)
Time
![Page 24: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/24.jpg)
Exploiting the semantics of links
![Page 25: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/25.jpg)
BioNavigation (joint work with Louiqa Raschid and Maria-Esther Vidal)
Conceptual graph
– No labeled links
Queries
– Regular expressions of concepts
ESearch
– Path cardinality: number of instances of paths of the result. For a path of length 1 between two sources S1 and S2, it is the number of pairs (e1, e2) where entry e1 of S1 is linked to entry e2 of S2.
– Target object cardinality: number of distinct objects retrieved from the final data source.
– Evaluation cost: cost of the evaluation plan, which involves both the local processing cost and remote network access delays.
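
The two cardinality metrics can be illustrated on toy link tables. This is a sketch under assumptions: the entry identifiers, the third source S3 and the `compose` helper are hypothetical, and a path instance is modeled as a tuple of linked entries, one per source on the path.

```python
# Toy link tables. Each pair (e1, e2) says entry e1 of one source links to
# entry e2 of the next source. All identifiers are made up.
links_s1_s2 = [("d1", "c1"), ("d1", "c2"), ("d2", "c2"), ("d3", "c3")]
links_s2_s3 = [("c1", "p1"), ("c2", "p1"), ("c2", "p2")]


def compose(first, second):
    """Join two link tables on the shared middle source, keeping full path instances."""
    return {(a, b, c) for (a, b) in first for (b2, c) in second if b == b2}


def path_cardinality(instances):
    """Number of distinct path instances."""
    return len(set(instances))


def target_object_cardinality(instances):
    """Number of distinct entries reached in the final source."""
    return len({inst[-1] for inst in instances})


length_one = set(links_s1_s2)
length_two = compose(links_s1_s2, links_s2_s3)

print(path_cardinality(length_one), target_object_cardinality(length_one))
print(path_cardinality(length_two), target_object_cardinality(length_two))
```

In the length-2 case several path instances end at the same entry of the final source, so the path cardinality exceeds the target object cardinality.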
![Page 26: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/26.jpg)
Work in progress
Conceptual graph
– Labeled links
Queries
– Complex dataflows
Physical graph
– Access to a BioMetaDatabase
– Data sources
– Applications
![Page 27: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/27.jpg)
Representing the conceptual graph in Protégé
![Page 28: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/28.jpg)
Visualization Limitations in Protégé
Using the GraphViz plugin
– Shows only the IsA hierarchy
![Page 29: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/29.jpg)
TgiViz plugin
![Page 30: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University](https://reader036.vdocuments.net/reader036/viewer/2022081603/5697bfe81a28abf838cb65bf/html5/thumbnails/30.jpg)
Conclusion
Scientists need support to select resources to express their protocols
Semantics of resources may be exploited to enhance the data collection process
Need for a repository of biological metadata (BioMetaDatabase)