supporting scientific discovery through linkages of literature and data


Upload: don-pellegrino

Post on 20-Nov-2014


DESCRIPTION

Poster on "Supporting scientific discovery through linkages of literature and data" by Don Pellegrino, Jean-Claude Bradley and Chaomei Chen. To be presented on May 3, 2011 as part of an event for the Center for Visual and Decision Informatics. More on this NSF-funded center can be found online at [http://nsf-cvdi.louisiana.edu/]. More on Don Pellegrino's research activities can be found online at [http://www.donpellegrino.com].

TRANSCRIPT

Page 1: Supporting scientific discovery through linkages of literature and data

CVDI is a collaboration between the University of Louisiana at Lafayette & Drexel University

Supporting scientific discovery through linkages of literature and data

Don Pellegrino, Jean-Claude Bradley, Chaomei Chen

THE SOCIAL-MOLECULE VIEW

“Open Notebook Science is the practice of making the entire primary record of a research project publicly available online as it is recorded.”[1] The Bradley Laboratory at Drexel University follows this practice and publishes its work in the UsefulChem Notebook[2]. The notebook is published using a wiki. The contents include textual information such as the objective of the experiment, discussions, conclusions and a log. Structured information such as chemical reactions and spectra is also included. A “Reaction Attempts” database stores records of reactions in a Google Spreadsheet. Figure 1 shows a typical notebook entry.

An associative network has been instantiated from the open notebook science (ONS) data. Figure 2 shows an overview of the social molecule network[3]. Nodes are instantiated for each author or collaboration of an open notebook record. Nodes are also instantiated for each compound used in a reaction that was recorded in a notebook record. Edges are instantiated for the notebook page, reaction number and reaction type. Edges therefore document a relationship between authors and compounds that is contextualized by all authors and all compounds in the collection.
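The construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the poster: the `ReactionRecord` fields, the function name, and the sample rows are all hypothetical, and each author is treated as an individual node (the poster also allows a node per collaboration).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReactionRecord:
    """One row of a hypothetical 'Reaction Attempts' export (field names are assumed)."""
    page: str
    reaction_number: int
    reaction_type: str
    authors: tuple
    compounds: tuple


def build_social_molecule_network(records):
    """Return (nodes, edges) for an author-compound associative network.

    nodes maps a node id to its kind ("author" or "compound").
    edges maps an (author node, compound node) pair to the list of notebook
    contexts (page, reaction number, reaction type) that link the two.
    """
    nodes = {}
    edges = defaultdict(list)
    for rec in records:
        for author in rec.authors:
            nodes[("author", author)] = "author"
        for compound in rec.compounds:
            nodes[("compound", compound)] = "compound"
        # Every author on the record is linked to every compound in the reaction.
        for author in rec.authors:
            for compound in rec.compounds:
                key = (("author", author), ("compound", compound))
                edges[key].append((rec.page, rec.reaction_number, rec.reaction_type))
    return nodes, dict(edges)


# Made-up records for illustration only; they are not taken from UsefulChem.
records = [
    ReactionRecord("ExpA", 1, "Ugi", ("author_1",), ("compound_x", "compound_y")),
    ReactionRecord("ExpB", 1, "Ugi", ("author_1", "author_2"), ("compound_y",)),
]
nodes, edges = build_social_molecule_network(records)
```

Keeping the notebook context on the edge rather than on the nodes means that the same author-compound pair can accumulate several contexts, one per reaction in which both appear.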

It is hypothesized that structural properties of the graph can be used to identify nodes and edges with high computational interestingness scores. It is further hypothesized that these high-scoring nodes and edges would serve as useful starting points for discovery tasks. The single edge between the Michael Wolfle cluster and the Ugi reactions cluster shown in Figure 3 meets these criteria. It is a bridge between two clusters and has been reported by researchers as a significant bridge between separate research groups.
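One structural property that would flag an edge like the one in Figure 3 is whether it is a bridge, that is, an edge whose removal disconnects the graph. The poster does not state which measure was used, so the following is only a sketch of one candidate: a standard depth-first low-link bridge search over an undirected simple graph. The function name and the toy graph (two triangles joined by one edge) are illustrative assumptions.

```python
def find_bridges(adjacency):
    """Return the set of bridge edges in an undirected simple graph.

    adjacency maps each node to the set of its neighbours. An edge is a bridge
    when removing it disconnects the graph. Uses a recursive depth-first search
    with low-link values, so it is only suitable for small graphs.
    """
    discovery = {}  # order in which each node was first visited
    low = {}        # earliest discovery index reachable from the node's subtree
    bridges = set()
    counter = 0

    def visit(node, parent):
        nonlocal counter
        discovery[node] = low[node] = counter
        counter += 1
        for neighbour in adjacency[node]:
            if neighbour == parent:
                continue
            if neighbour in discovery:
                # Back edge: the subtree can reach an earlier node.
                low[node] = min(low[node], discovery[neighbour])
            else:
                visit(neighbour, node)
                low[node] = min(low[node], low[neighbour])
                if low[neighbour] > discovery[node]:
                    # Nothing below `neighbour` reaches `node` or above.
                    bridges.add(frozenset((node, neighbour)))

    for node in adjacency:
        if node not in discovery:
            visit(node, None)
    return bridges


# Toy graph: triangle a-b-c and triangle d-e-f joined by the single edge c-d.
adjacency = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}
bridges = find_bridges(adjacency)
```

In an author-compound network, every compound used by only one author would also produce a bridge, so a practical score would probably need to weigh a bridge by the size of the clusters on either side of it.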

RESULTS AND CONCLUSIONS

Experiments are ongoing in the domains of ONS, Influenza Protein Sequence records from NCBI [5], and drug discovery project data from industry. Initial results support the hypotheses that graph-theoretic interestingness scores are successful for systematically identifying relations amongst heterogeneous data. These relations support discovery tasks of chemical and biological researchers by providing starting points for information exploration. An “ONS Explorer” system is under development to integrate visualizations of interesting relations with ONS workflow.

INTRODUCTION

We aim to provide systematic support for discovery by identifying novel relationships between information stored in scientific literature and information stored in scientific databases. There are two limitations to current techniques for the search and retrieval of structured data and semi-structured text that must be overcome. First, query-based search is predicated on the user’s ability to formulate a query. Thus, the user must already have some sense of what they are looking for. Query-based search or browse requires the user to know where to start. Second, searches are performed on collections which have homogeneous structures. The literature can be searched for the occurrence of a keyword or phrase within the body of the text or within the text of the descriptive metadata. Databases can be searched by attribute/value pairs expressed using various notations for Boolean algebra. Therefore, current support for discovery tasks requires that (1) the user know where to start and (2) paths from that starting point are limited to traversal of subsets that are homogeneous by a semantically arbitrary underlying storage structure.

METHODS

We propose a new architecture which can overcome both of these limitations. Associative network structures are sufficient for establishing relations between nodes of attribute/value pairs from structured sources and nodes of keywords, tokens, or noun phrases extracted from text sources via natural language processing (NLP) techniques. Graph-theoretic interestingness measures can then be applied to instances of the associative networks. These measures can be used to identify paths and nodes which have the semantics necessary to support discovery tasks, such as novelty, surprisingness, and peculiarity. Associative networks can be added and subtracted to increase or decrease the scope of the information space analyzed. Workflow integration allows for a mixed-initiative approach. The system can then suggest starting points and potentially profitable paths to the user without requiring the user to know where to start. Visualization of suggested nodes and paths within the associative network provides a contextualization of their semantics. Paths can traverse nodes from heterogeneous sources.
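Adding and subtracting associative networks, and following a path across heterogeneous sources, might look like the sketch below. It assumes each network is an undirected adjacency map whose node ids carry a source tag (`"db"` for attribute/value pairs, `"text"` for extracted terms). These tags, the function names, and the sample nodes are hypothetical and do not come from the poster.

```python
from collections import deque


def add_networks(*networks):
    """Union of adjacency maps: widens the information space analyzed."""
    merged = {}
    for network in networks:
        for node, neighbours in network.items():
            merged.setdefault(node, set()).update(neighbours)
    return merged


def subtract_network(base, removed):
    """Drop every edge of `removed` from `base`, then drop isolated nodes."""
    result = {}
    for node, neighbours in base.items():
        kept = neighbours - removed.get(node, set())
        if kept:
            result[node] = kept
    return result


def shortest_path(network, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in network.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Made-up networks: one from a structured source, one from text extraction.
database_net = {
    ("db", "compound=X"): {("db", "reaction=Ugi")},
    ("db", "reaction=Ugi"): {("db", "compound=X")},
}
text_net = {
    ("text", "term_t"): {("db", "compound=X")},
    ("db", "compound=X"): {("text", "term_t")},
}

merged = add_networks(database_net, text_net)
path = shortest_path(merged, ("text", "term_t"), ("db", "reaction=Ugi"))
narrowed = subtract_network(merged, text_net)
```

In this toy case the path starts at a text-derived node and ends at a database-derived node, which a search confined to either source alone could not return. Subtracting `text_net` again restores the database-only network.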

[1] Wikipedia contributors, “Open Notebook Science,” Wikipedia, The Free Encyclopedia, Revision 28 February 2011, [Online]. Available: http://en.wikipedia.org/wiki/Open_Notebook_Science, Retrieved 20 April 2011.

[2] Bradley, J-C., “UsefulChem Project,” [Online]. Available: http://usefulchem.wikispaces.com, Retrieved 20 April 2011.

[3] Pellegrino, D., “Social Molecule View (DC-Exp-001),” 2010 December 15, [Online]. Available: http://onschallenge.wikispaces.com/DC-Exp-001.

[4] Curtin, E., “Exp262,” [Online]. Available: http://usefulchem.wikispaces.com/Exp262, Retrieved 20 April 2011.

[5] Pellegrino, D. and Chen, C., “Data repository mapping for influenza protein sequence analysis,” Conference on Visualization and Data Analysis 2011, Part of IS&T/SPIE Electronic Imaging 2011, IS&T/SPIE, 2011. [Online]. doi:10.1117/12.872266.

Figure 1: Notebook entry for UsefulChem Experiment 262 published on a wiki[4].

Figure 2: An overview of the social molecule network with a large, dense cluster of Ugi reactions on the right[3].

Figure 3: Michael Wolfle’s research and the notebook entry that connects it to the Ugi reactions[3]. This is a zoom of the cluster in the bottom middle of Figure 2.