247th ACS Meeting: The Eureka Research Workbench
DESCRIPTION
Academic scientists need a tool to capture the science they do so that it can be shared as open science, integrated with linked data, and searched. Eureka is an evolving platform to do this.
TRANSCRIPT
Eureka Research Workbench: An Open Source eScience Laboratory Notebook
Stuart J. Chalk, Department of Chemistry, University of North Florida
2014 Spring ACS Meeting – CINF Paper 38
Big Data
Electronic Notebooks
The Eureka Research Workbench
Experiment Markup Language
ExptML Schema and Files
Semantic Data and Ontologies
File Storage
Eureka Interface
Web Interface
Conclusion
Outline
Current buzzword for “bring together lots of data and build tools on top to extract knowledge”
This is great, except… how do we do that for science?
Platform, data structures, and exchange protocols to capture, identify, and disseminate scientific information
Research Data Alliance (https://rd-alliance.org/)
“Research Data Sharing without barriers”
Fran Berman at RPI is the NSF-funded co-chair of RDA
Big Data
Scientists need to move to digital notebooks…
...and record not just the data but the flow and context
How science is done is important for searching, aggregation, and meta-analysis
We need more than an electronic version of a notebook
We need a science version of “Second Life” (SciLife?)
Electronic Notebooks
Started in 2006 after getting involved in the Analytical Information Markup Language (AnIML) project
Store all research notes/data in a digital format
Capture the workflow of scientists
Writing in a lab notebook is equivalent to “multi-type” blogging in the digital world
How to capture information? Many data types! (ExptML)
How to store files “online”? (Fedora-Commons)
How to access files in the browser? (CakePHP)
How to represent laboratory resources? (ExptML)
How to link data together? RDF (in Fedora-Commons)
Eureka Research Workbench
A specification (written in XML) that describes different types of information recorded during the scientific process (http://exptml.sourceforge.net)
Experiment Markup Language (ExptML)
Annotation, Api, Calculation, Chemical, Citation, Customer, Data, Dataset, Definition, Element
Equipment, Event, Experiment, Group, Message, Project, Protocol, Quote, Report, Result
Sample, Solution, Space, Specimen, Substance, Task, Template, Timeline, User, Vendor
ExptML Chemical Schema
ExptML Chemical (Instance)
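The transcript carries only the slide title here, so the instance itself is not reproduced. As a rough illustration only, the sketch below parses a made-up chemical record with Python's standard library. The element names (`chemical`, `name`, `formula`, `cas`) are hypothetical and are not taken from the ExptML schema; only the identifier `exptml:chm1` appears elsewhere in this talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element names are illustrative, NOT the real ExptML schema.
SAMPLE = """<chemical id="exptml:chm1">
  <name>Sodium chloride</name>
  <formula>NaCl</formula>
  <cas>7647-14-5</cas>
</chemical>"""


def parse_chemical(xml_text):
    """Return the record as a flat dict: the id attribute plus child-element text."""
    root = ET.fromstring(xml_text)
    record = {"id": root.get("id")}
    for child in root:
        record[child.tag] = (child.text or "").strip()
    return record


chemical = parse_chemical(SAMPLE)
print(chemical)
```

A real ExptML file would be validated against the published schema before any fields are read; that step is omitted here.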
Data are connected to other data – ‘Linked Data’ (http://www.w3.org/standards/semanticweb/data)
The ‘Semantic Web’ approach to contextualize data
Proposed storage of ‘relationships’ between data is the Resource Description Framework (RDF - http://www.w3.org/RDF/)
Semantic Data
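To make the idea of stored relationships concrete, here is a minimal sketch that holds RDF statements as (subject, predicate, object) tuples and writes them out as N-Triples. The predicate namespace is assumed to be the Fedora relationship ontology used in RELS-EXT; `exptml:chm1` comes from this talk, while `exptml:grp1` and `exptml:sol1` are hypothetical identifiers.

```python
# Assumed namespace of the Fedora relationship ontology (used in RELS-EXT).
FEDORA_REL = "info:fedora/fedora-system:def/relations-external#"


def to_ntriples(triples):
    """Serialize (subject, predicate, object) URI triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


def related_to(triples, subject):
    """Return every object linked from the given subject, whatever the predicate."""
    return [o for s, _, o in triples if s == subject]


# exptml:chm1 appears in the talk; the other identifiers are hypothetical.
triples = [
    ("info:fedora/exptml:chm1", FEDORA_REL + "isMemberOf", "info:fedora/exptml:grp1"),
    ("info:fedora/exptml:sol1", FEDORA_REL + "hasPart", "info:fedora/exptml:chm1"),
]

print(to_ntriples(triples))
print(related_to(triples, "info:fedora/exptml:chm1"))
```

A production system would use an RDF library and a triple store rather than plain tuples; the tuples are only meant to show the shape of the data.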
Digital repository software (http://fedora-commons.org/)
Creation and management of online digital libraries
Fedora ‘Digital Object’ consists of metadata + streams
  Metadata stored as Dublin Core (DC stream)
  ExptML file stored as EXPTML stream
  Other files (PDFs, images, Word etc.) stored as streams
  Relationships stored as RDF (RELS-EXT stream)
Features:
  Version control, checksumming, archiving
  Built-in search of objects and relationships
  Add-on for file content search (Fedora GSearch)
Fedora Commons
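The object-plus-streams layout and the checksumming feature can be pictured with a small in-memory model. This is a toy sketch, not Fedora's API: the class names `Stream` and `DigitalObject`, the choice of SHA-256, and the placeholder stream contents are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Stream:
    """One stream of a digital object: an id, a MIME type, and raw bytes."""
    stream_id: str
    mime_type: str
    content: bytes

    def checksum(self):
        """SHA-256 hex digest, used to detect changes to the stored content."""
        return hashlib.sha256(self.content).hexdigest()


@dataclass
class DigitalObject:
    """A toy stand-in for a repository object: a PID plus named streams."""
    pid: str
    streams: dict = field(default_factory=dict)

    def add(self, stream):
        self.streams[stream.stream_id] = stream

    def verify(self, stream_id, expected):
        """True if the content still matches a previously recorded checksum."""
        return self.streams[stream_id].checksum() == expected


obj = DigitalObject("exptml:chm1")
obj.add(Stream("DC", "text/xml", b"<dc>placeholder metadata</dc>"))
obj.add(Stream("EXPTML", "text/xml", b"<chemical>placeholder</chemical>"))
obj.add(Stream("RELS-EXT", "application/rdf+xml", b"<rdf>placeholder</rdf>"))

recorded = obj.streams["EXPTML"].checksum()
print(sorted(obj.streams), obj.verify("EXPTML", recorded))
```

Recording the checksum at ingest and re-checking it later is what lets a repository notice silent corruption of an archived file.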
Fedora-Commons defines and works on digital objects
In the definition of a Fedora object, an ExptML file is just one stream of many. By default each object also has a “DC” stream of metadata and a “RELS-EXT” stream of relationships
Each Fedora object can have any number of additional streams for paper PDFs, product/sample pictures, binary file formats (if a conversion has been done), video, audio, RDF, anything…
You can export individual streams or the whole Fedora object with streams binary encoded (sharing/archiving)
Fedora for File Storage
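Exporting a whole object with its streams binary encoded amounts to wrapping every stream, text or binary, in a single text-safe package. The sketch below shows that idea with base64 inside JSON. Fedora's own export formats differ; JSON, the function names, and the placeholder bytes are assumptions chosen to keep the example short.

```python
import base64
import json


def export_object(pid, streams):
    """Pack every stream into one JSON document, base64-encoding the bytes."""
    return json.dumps({
        "pid": pid,
        "streams": {
            name: base64.b64encode(data).decode("ascii")
            for name, data in streams.items()
        },
    }, indent=2)


def import_object(text):
    """Reverse export_object: decode each stream back to its original bytes."""
    doc = json.loads(text)
    streams = {name: base64.b64decode(b64) for name, b64 in doc["streams"].items()}
    return doc["pid"], streams


# Placeholder contents; the second stream contains bytes that are not valid text.
original = {
    "EXPTML": b"<chemical>placeholder</chemical>",
    "PDF": b"%PDF-1.4 placeholder bytes \x00\xff",
}
package = export_object("exptml:chm1", original)
pid, restored = import_object(package)
print(pid, restored == original)
```

Because the package is plain text, it can be emailed, archived, or ingested into another repository without worrying about how binary attachments are handled.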
Fedora Object Storage
Web interface written in PHP using the CakePHP Framework
Communicates with Fedora-Commons API to create, retrieve, update and delete (CRUD) ExptML and other files
Representational State Transfer (REST) format for URLs, e.g. http://example.com/chemicals/view/exptml:chm1
Creation of ExptML via interface
Provides search via Fedora and GSearch
Can extract data out of XML files
Can gather data from other websites (via API controller) and integrate into ExptML files
Eureka Web Application
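The example URL follows the controller/action/id pattern. A minimal sketch of how such a URL breaks down is given below; the function name `parse_route` is hypothetical, and the actual application does this routing in PHP inside CakePHP rather than in Python.

```python
from urllib.parse import urlparse


def parse_route(url):
    """Split a /controller/action/id path into its three parts."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) != 3:
        raise ValueError(f"expected /controller/action/id, got {url!r}")
    controller, action, object_id = parts
    return {"controller": controller, "action": action, "id": object_id}


route = parse_route("http://example.com/chemicals/view/exptml:chm1")
print(route)
```

Here `chemicals` selects the data type, `view` the operation, and `exptml:chm1` the repository object to fetch.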
Eureka Website – Group View
Only data types related to the research group show up on left
Eureka Website – Bench View
Clicking on the “Add” menu on the right allows you to add a comment or link to data
Eureka Website – Notebook View
Eureka Website – Laboratory View
The “Rel” menu shows you the information related to this instrument
Eureka Website – Library View
You can add the PDF of the paper to the citation. The contents of the PDF are searchable in the system
Eureka Website – Stockroom View
Web Application
  Server: Fedora 4, JSON-LD, ElasticSearch
  Client: CakePHP 3/HTML5, Recline.js, Annotator, jQuery
Standards
  Linked Data Platform (http://www.w3.org/TR/ldp/)
  Datapackage/Simple Data Format (http://dataprotocols.org/)
  Markup Languages: AnIML, UnitsML, CML
  Other Molecular File Formats: MOL/SDF/CDX/CIF/PDB etc.
  Open Framework for Laboratory Data (Allotrope Foundation)
Datasources
  ChemSpider, CIR, PubChem, Google Scholar, CrossRef, VIVO
  ExchangeNetwork (EPA), NIST, SDBS (no APIs yet)
Tools
  Marvin for JS, JSXGraph, JSpecView, Chemicalize.org
Eureka Technology Stack
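Since JSON-LD is listed on the server side, a small example may help show what it adds over plain JSON: a `@context` that maps short keys to full IRIs. The record below is hypothetical, the use of Dublin Core terms is an assumption, and `expand_term` is a deliberately simplified helper, not the full JSON-LD expansion algorithm.

```python
import json

# Hypothetical record; the vocabulary choice (Dublin Core terms) is an assumption.
doc = {
    "@context": {
        "dcterms": "http://purl.org/dc/terms/",
        "title": "dcterms:title",
        "isPartOf": {"@id": "dcterms:isPartOf", "@type": "@id"},
    },
    "@id": "http://example.com/chemicals/view/exptml:chm1",
    "title": "Sodium chloride",
    "isPartOf": "http://example.com/groups/view/exptml:grp1",
}


def expand_term(document, term):
    """Resolve a short term to its full IRI using the document's @context.

    Simplified: handles only 'prefix:suffix' mappings defined in the context.
    """
    context = document["@context"]
    mapping = context[term]
    compact = mapping["@id"] if isinstance(mapping, dict) else mapping
    prefix, _, suffix = compact.partition(":")
    return context[prefix] + suffix


print(json.dumps(doc, indent=2))
print(expand_term(doc, "title"))
```

The same document stays readable as ordinary JSON for a web client while remaining interpretable as linked data by tools that understand the context.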
Implement ingest of all data types, file (if appropriate) and web based
In-browser processing of data -> dataset -> result, report writing
Extraction of file-based legacy data -> ExptML format data
Open access to data/spectra, ‘available data’ page (browser only)
Access to data/spectra via linked data server (discovery/indexing)
Publishing of packaged datasets with authenticated download option
Automated ingestion of data from instruments/sensors
Collaborative research: authentication and data exchange
Timeframe? Depends on securing funding
Eureka Roadmap
Eureka:
  Web application to create ExptML files
  Built on ExptML to capture data/resources/workflows
  Reliable storage/archiving system for ExptML files (Fedora)
  Storage of relationships between data (RDF)
TODO
  Provide mechanism for sharing of data (different levels)
  Add tools to find, visualize and work on science data
  Integration into the RDA model for sharing research data
  Get the word out and test system with many users
Conclusion
References
Eureka – http://sourceforge.net/projects/eureka
Fedora-Commons – http://fedora-commons.org
XML – http://www.w3.org/standards/xml
AnIML – http://animl.sourceforge.net
ExptML – http://exptml.sourceforge.net/
UnitsML – http://unitsml.nist.gov/
CML – http://www.xml-cml.org/
JSON-LD – http://www.w3.org/TR/json-ld/
RDF – http://www.w3.org/RDF/
CIR – http://cactus.nci.nih.gov/chemical/structure
RDA – http://rd-alliance.org
ChemSpider – http://www.chemspider.com/
Allotrope Foundation – http://allotrope.org