Data Provenance and Scientific Workflow Management

DESCRIPTION

Introductory class on techniques and tools to manage scientific data, focusing on sources of information and data analysis. Lecturer: Prof. Kelly Rosa Braghetto, a NeuroMat associate investigator and a professor at the University of São Paulo's Department of Computer Science.

TRANSCRIPT

Page 1: Data Provenance and Scientific Workflow Management

Data Provenance and Scientific Workflow Management

Data Provenance
Neuroscience Data
Scientific Workflow Management
(and Questionnaires)

Kelly Rosa Braghetto
[email protected]

Departamento de Ciência da Computação
Instituto de Matemática e Estatística
Universidade de São Paulo

June 5, 2013

Page 2: Data Provenance and Scientific Workflow Management

Agenda

1 Data Provenance

2 Neuroscience Data
  CARMEN Project
  NEMO Project

3 Scientific Workflow Management Systems (SWMS)
  Taverna

4 Questionnaires

Page 3: Data Provenance and Scientific Workflow Management

Data Provenance

Frequently asked questions for Scientists

- Where was a document found?
- How was this data set produced?
- Were all facts included in this decision?
- Were all the latest figures included in this diagram?
- Can this scientific experiment be reproduced?

Source: http://openprovenance.org/

Page 4: Data Provenance and Scientific Workflow Management

Data Provenance

What is Provenance?

Provenance refers to the sources of information, such as entities and processes, involved in producing or delivering an artifact.

Why does Provenance matter?

The provenance of information is crucial in deciding whether information is to be trusted, how it should be integrated with other diverse information sources, and how to give credit to its originators when reusing it.

In an open and inclusive environment such as the Web, users find information that is often contradictory or questionable.

People make trust judgments based on provenance that may or may not be explicitly offered to them.

Problem: lack of a standard model.

Source: http://www.w3.org/2011/prov/wiki/Main_Page

Page 5: Data Provenance and Scientific Workflow Management

Data Provenance

Works devoted to Data Provenance

- Provenance Working Group, maintained by W3C
  “Mission: to support the widespread publication and use of provenance information of Web documents, data, and resources.”
  http://www.w3.org/2011/prov/wiki/Main_Page

- Wf4Ever project
  “Wf4Ever addresses some of the challenges associated to the preservation of scientific experiments in data-intensive science.”
  http://www.wf4ever-project.org/

- Open Provenance Model (OPM)
  http://openprovenance.org/

Page 6: Data Provenance and Scientific Workflow Management

Data Provenance

Open Provenance Model (OPM)

The Open Provenance Model is a model of provenance that is designed to meet the following requirements:

1 To allow provenance information to be exchanged between systems, by means of a compatibility layer based on a shared provenance model.

2 To allow developers to build and share tools that operate on such a provenance model.

3 To define provenance in a precise, technology-agnostic manner.

4 To support a digital representation of provenance for any ‘thing’, whether produced by computer systems or not.

5 To allow multiple levels of description to coexist.

6 To define a core set of rules that identify the valid inferences that can be made on provenance representation.
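
To make the idea of a shared provenance model and of "valid inferences" more concrete, here is a minimal Python sketch of a provenance graph. The edge names `used` and `wasGeneratedBy` follow the OPM vocabulary; the class, the `lineage` helper, and the file names are illustrative assumptions and do not come from any OPM tool.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Toy provenance store with OPM-style 'used' and 'wasGeneratedBy' edges."""

    def __init__(self):
        self.used = defaultdict(set)   # process -> artifacts it consumed
        self.generated_by = {}         # artifact -> process that produced it

    def record(self, process, inputs, outputs):
        """Register one process run with its input and output artifacts."""
        self.used[process].update(inputs)
        for artifact in outputs:
            self.generated_by[artifact] = process

    def lineage(self, artifact):
        """Infer every artifact that the given artifact was derived from."""
        ancestors = set()
        pending = [artifact]
        while pending:
            current = pending.pop()
            process = self.generated_by.get(current)
            if process is None:
                continue  # a raw input: nothing recorded upstream
            for source in self.used[process]:
                if source not in ancestors:
                    ancestors.add(source)
                    pending.append(source)
        return ancestors


# Hypothetical two-step analysis: filter a recording, then average it.
graph = ProvenanceGraph()
graph.record("filter", inputs=["raw_eeg.dat"], outputs=["filtered_eeg.dat"])
graph.record("average", inputs=["filtered_eeg.dat", "events.csv"], outputs=["erp.dat"])
print(sorted(graph.lineage("erp.dat")))
```

The `lineage` query is one example of an inference over recorded edges: nobody stated that `erp.dat` depends on `raw_eeg.dat`, but it follows from the two recorded steps.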

Page 7: Data Provenance and Scientific Workflow Management

Neuroscience Data

Projects recording provenance of neuroscience data

- Code Analysis, Repository & Modelling for e-Neuroscience (CARMEN)
  http://www.carmen.org.uk/
  “CARMEN is an e-Science Pilot Project funded by the Engineering and Physical Sciences Research Council (UK). It will deliver a virtual laboratory for neurophysiology, enabling sharing and collaborative exploitation of data, analysis code and expertise. Neural activity recordings (signals and image series) are the primary data types.”

- Neural ElectroMagnetic Ontologies (NEMO)
  http://nemo.nic.uoregon.edu/wiki/NEMO
  [More details in the next slides...]

Page 8: Data Provenance and Scientific Workflow Management

Neuroscience Data

CARMEN Project

The CARMEN consortium

“A core part of our work is the development of minimum reporting guidelines for annotation of data and other computational resources for the purpose of sharing”

Result: a MINI module for Electrophysiology

- MINI (Minimum Information about a Neuroscience investigation) is a family of reporting guideline documents
- A module represents the minimum information that should be reported about a dataset to:
  - facilitate computational access and analysis
  - allow a reader to interpret and critically evaluate the process performed and the conclusions reached
  - support their experimental corroboration

Page 9: Data Provenance and Scientific Workflow Management

Neuroscience Data

CARMEN Project

MINI module for Electrophysiology

- The reporting recommendations cover both extracellular and intracellular electrophysiology
- Covered data:
  - date stamps and responsible persons
  - the subject under study
  - the subject task or stimulus, if appropriate
  - the recording protocol
  - the resulting description of time series data
- The entire module is described in: http://www.carmen.org.uk/standards/mini.pdf
- The module is registered in the MIBBI portal (http://www.biosharing.org/standards/mibbi and http://mibbi.sourceforge.net/legacy.shtml).
- MIBBI (Minimum Information for Biological and Biomedical Investigations) is a pioneering project that aims to coordinate guidelines for reporting of metadata across domains

Page 10: Data Provenance and Scientific Workflow Management

Neuroscience Data

NEMO Project

Neural ElectroMagnetic Ontologies (NEMO)

- An NIH-funded project
- Aims to create EEG and MEG ontologies and ontology-based tools. These resources will be used to support representation, classification, and meta-analysis of brain electromagnetic data.
- Based on three pillars: DATA, ONTOLOGY, and DATABASE
  - Data: raw EEG, averaged EEG (ERPs), and ERP data analysis results
  - Ontologies: include concepts related to ERP data (including spatial and temporal features of ERP patterns), data provenance, and the cognitive and linguistic paradigms that were used to collect the data
  - Database: the NEMO database portal is a large repository that stores NEMO consortium data, data analysis results, and data provenance

Site: http://nemo.nic.uoregon.edu

Page 11: Data Provenance and Scientific Workflow Management

Neuroscience Data

NEMO Project

Ontology (informal definition)

- In both computer science and information science, an ontology represents a set of concepts within a domain and the relationships between those concepts. It is used to reason about the objects within that domain.
- Ontologies are used as a form of knowledge representation about the world or some part of it.
- Ontologies generally describe:
  - Individuals: the basic or “ground level” objects
  - Classes: sets, collections, or types of objects
  - Attributes: properties, features, characteristics, or parameters that objects can have and share
  - Relations: ways that objects can be related to one another
  - Events: the changing of attributes or relations

Source: http://neurolex.org
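
As a rough illustration of the building blocks listed above (classes, individuals, attributes, relations) and of what "reasoning about the objects" can mean, consider the small Python sketch below. Every class name, individual, and attribute in it is made up for the example; none of them is taken from the NEMO ontology.

```python
# Classes and their parent class (None marks a root class).
subclass_of = {
    "Signal": None,
    "EEGSignal": "Signal",
    "ERP": "EEGSignal",
    "Subject": None,
}

# Individuals: concrete objects, each an instance of one class, with attributes.
individuals = {
    "recording_01": {"class": "ERP", "attributes": {"peak_latency_ms": 300}},
    "subject_07": {"class": "Subject", "attributes": {"age": 25}},
}

# Relations: (subject, relation name, object) triples between individuals.
relations = [("recording_01", "recorded_from", "subject_07")]


def is_a(individual, cls):
    """Reason over the class hierarchy: is the individual an instance of cls?"""
    current = individuals[individual]["class"]
    while current is not None:
        if current == cls:
            return True
        current = subclass_of[current]
    return False


print(is_a("recording_01", "Signal"))  # found by walking ERP -> EEGSignal -> Signal
print(is_a("subject_07", "Signal"))
```

Real ontologies are usually written in dedicated languages and queried with reasoners; the dictionaries here only mirror the vocabulary of the slide.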

Page 12: Data Provenance and Scientific Workflow Management

Neuroscience Data

NEMO Project

MINEMO – an extension of the MINI module for Electrophysiology

- MINEMO = Minimal Information for Neural Electromagnetic Ontologies
- “A standards-compliant method for analysis and integration of event-related potentials (ERP) data”; in other words, a checklist for the description of ERP studies
- The checklist comprises no more than ∼60 fields; ∼20 of these fields are considered “mandatory”
- MINEMO promotes the use of controlled vocabularies (or lexicons) for data annotation. Aim: to conduct cross-lab meta-analysis
- Each MINEMO checklist item is linked to a term defined in the NEMO ontology

Page 13: Data Provenance and Scientific Workflow Management

Neuroscience Data

NEMO Project

Subset of “mandatory” MINEMO terms

1 Research lab (General features)

2 Experiment (General features)

3 Publication

4 Study subjects (Group characteristics)

5 Experiment condition

6 Stimulus representation

7 Behavioral data collection

8 EEG data collection

9 EEG/ERP data preprocessing

10 EEG/ERP data file
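
A checklist like the one above lends itself to automatic checking. The Python sketch below treats the ten categories as required sections of a study description and reports which ones are missing. The section names are copied from the list; the dictionary layout, the function name, and the example values are assumptions made for illustration and are not part of MINEMO or the NEMO portal.

```python
# The ten "mandatory" categories from the list above.
MANDATORY_SECTIONS = [
    "Research lab",
    "Experiment",
    "Publication",
    "Study subjects",
    "Experiment condition",
    "Stimulus representation",
    "Behavioral data collection",
    "EEG data collection",
    "EEG/ERP data preprocessing",
    "EEG/ERP data file",
]


def missing_sections(study):
    """Return the mandatory sections that are absent or left empty."""
    return [name for name in MANDATORY_SECTIONS if not study.get(name)]


# Hypothetical, incomplete study description (placeholder values).
study = {
    "Research lab": "Example Lab",
    "Experiment": "Oddball task",
    "EEG data collection": "64 channels",
}
print(missing_sections(study))
```

A submission form could run such a check before accepting a dataset, so that incomplete descriptions are caught when the data are uploaded rather than during a later meta-analysis.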

The entire set of terms is defined in the article: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3235514/

They are also in the MIBBI portal:

Page 14: Data Provenance and Scientific Workflow Management

Neuroscience Data

NEMO Project

More about NEMO...

- Data in the NEMO Portal are aligned with the MINEMO checklist and ontology
  https://portal.nemo.nic.uoregon.edu
- NIF (the Neuroscience Information Framework project, http://www.neuinfo.org/) uses the NEMO ontology. NIF aggregates online sources of neuroscience data, including databases, web sites, and publications, and provides a search interface across these disparate sources
- The NEMO ontology can be seen in: http://bioportal.bioontology.org/ontologies/40522

Page 15: Data Provenance and Scientific Workflow Management

Neuroscience Data

NEMO Project

A “detail” to worry about...

The MINI module for Electrophysiology and MINEMO do not cover the description of image data

To see later:
MIfMRI (Minimum Information about an fMRI Study)
http://www.fmrimethods.org/

Page 16: Data Provenance and Scientific Workflow Management

Scientific Workflow Management Systems (SWMS)

Scientific Workflows

- A data analysis (or processing) can generally be described as a workflow, i.e., a set of computational tasks that “transform” data
- In Bioinformatics, a workflow is frequently called a pipeline
- In a workflow, the output data of a task is generally used as input data for one or more other tasks. So, the flow of data defines an execution order for the workflow's tasks
- Frequently, the same task can appear in more than one workflow
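
The statement that the flow of data defines an execution order can be sketched in a few lines of Python. The example assumes a hypothetical five-task analysis and uses `graphlib.TopologicalSorter` from the standard library (Python 3.9 or later); the task names are invented for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks whose output it consumes.
dependencies = {
    "load": set(),
    "filter": {"load"},
    "average": {"filter"},
    "plot": {"average"},
    "report": {"average", "plot"},
}

# A valid execution order runs every task after the tasks it depends on.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Only the data dependencies are declared; the order itself is derived from them. This is essentially what a workflow system does before it starts running tasks.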

Page 17: Data Provenance and Scientific Workflow Management

Scientific Workflow Management Systems (SWMS)

Scientific Workflow Management System (SWMS)

- A computational tool that controls the execution of workflows
- It provides mechanisms for a scientist to describe his/her workflow using “intuitive” modeling languages
- It can optimize the execution considering the characteristics of the available computational resources
- It helps to generate provenance data of an analysis process. In addition, it improves the reproducibility of analyses
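
How a workflow system can generate provenance data as a side effect of execution may be easier to see in code. The Python sketch below is a toy runner, not the mechanism of any particular SWMS: it runs tasks in sequence and appends one history record per task. The function names, the record fields, and the two example tasks are all assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(value):
    """Short, stable hash of a JSON-serialisable value."""
    encoded = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def run_workflow(tasks, data):
    """Run tasks in order, feeding each output to the next task.

    Returns the final result and a history with one record per task.
    """
    history = []
    for name, func in tasks:
        result = func(data)
        history.append({
            "task": name,
            "input": fingerprint(data),
            "output": fingerprint(result),
            "finished_at": datetime.now(timezone.utc).isoformat(),
        })
        data = result
    return data, history


# Hypothetical two-step analysis on a short list of samples.
tasks = [
    ("remove_baseline", lambda xs: [x - min(xs) for x in xs]),
    ("scale", lambda xs: [x / max(xs) for x in xs]),
]
result, history = run_workflow(tasks, [3, 5, 7])
print(result)
for record in history:
    print(record["task"], record["input"], "->", record["output"])
```

Running the same tasks on the same input yields the same fingerprints, which gives a simple check that an analysis was reproduced.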

Page 18: Data Provenance and Scientific Workflow Management

Scientific Workflow Management Systems (SWMS)

Most successful SWMSs

Taverna – http://www.taverna.org.uk

VisTrails – http://www.vistrails.org

Kepler – https://kepler-project.org

Galaxy – http://galaxyproject.org

Page 19: Data Provenance and Scientific Workflow Management

Scientific Workflow Management Systems (SWMS)

Online workflow repositories – collaborative science

myExperiment project (http://www.myexperiment.org/):

- Users upload their workflow models
- Models are categorized according to their research domain
- Users can search and download models uploaded by other users
- Site stores models from different SWMSs (Taverna, Kepler, etc.)

Page 20: Data Provenance and Scientific Workflow Management

Scientific Workflow Management Systems (SWMS)

Taverna

Taverna features:

- Graphical user interface for the description of the workflows
- Easy installation and use
- Recording of the “execution history” and intermediate results (= provenance data of the entire analysis)
- Provenance export capability to OPM

Page 21: Data Provenance and Scientific Workflow Management

Questionnaires

Automatic Generation of Online Questionnaires

- There are computational tools that automatically generate electronic questionnaires.
- One of the most used is LimeSurvey (https://www.limesurvey.org/).
- Functionalities of LimeSurvey:
  - Generates online questionnaires
  - Has a large set of question types
  - Keeps questionnaire data in a real database
  - Manages users
  - Creates a print version of questionnaires
  - Performs basic statistical analysis
  - ...