Scientific Data Management -
From the Lab to the Web
Semantic Data Management, Dagstuhl Seminar, 22-27 April 2012
José Manuel Gómez Pérez, iSOCO
www.wf4ever-project.org
Some facts
The data deluge
Source: IDC's The 2011 Digital Universe Study – Extracting Value from Chaos
» In 2010 the size of the digital universe exceeded 1 zettabyte (= 1 trillion GB)
» 1.8 ZB in 2011
» 35 ZB expected in 2020
» 90% unstructured data
» 70% user-generated
» 75% resulting from data copying, merging, and transforming
» Metadata is the fastest growing data category
» Much of this data is dynamic, real-time, and volatile
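As a back-of-the-envelope check on the figures above, the following sketch converts zettabytes to gigabytes and computes the compound annual growth rate implied by going from 1.8 ZB in 2011 to 35 ZB in 2020. It assumes decimal (SI) units and smooth year-on-year growth, neither of which is stated on the slide; the function name is illustrative.

```python
ZB = 10**21  # bytes in one zettabyte (SI, decimal)
GB = 10**9   # bytes in one gigabyte (SI, decimal)


def implied_annual_growth(start_size, end_size, years):
    """Compound annual growth rate that turns start_size into end_size."""
    return (end_size / start_size) ** (1 / years) - 1


gb_per_zb = ZB // GB
growth = implied_annual_growth(1.8, 35.0, 2020 - 2011)

print(f"1 ZB = {gb_per_zb:,} GB")
print(f"Implied growth 2011-2020: {growth:.1%} per year")
```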
Two main challenges
Dealing with dynamicity
» Challenge 1: Identifying and structuring the relevant portions of the data for the task at hand
› First-class data citizens
» Challenge 2: Managing the lifecycle of data entities
› Preservation
› Evolution and versioning
› Decay
Both technical and social aspects involved
Workflows in the Scientific Method
The Research Lifecycle
Example: Genome-Wide Association Studies
[Figure: research lifecycle diagram. Labels: Background, Hypothesis, Assumptions, Input data, Method, Experiment, Results (data), Scientific Interpretation, Publication]
Workflow-based Science
What is a Scientific Workflow?
» A mechanism for coordinating the execution of services and linking together resources.
» The combination of data and processes into a configurable, structured set of steps that implement semi-automated computational solutions in scientific problem-solving
Scientific workflows are at the core of scientific data management
› Enable automation
› Encourage best practices
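The definition of a workflow as a configurable, structured set of steps over data can be sketched in a few lines. This is a minimal illustration only, not the model of any particular workflow system: the `Step` and `Workflow` classes, the step functions, and the `threshold` parameter are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Step:
    """One named processing step; receives the data and the shared config."""
    name: str
    run: Callable[[Any, Dict[str, Any]], Any]


@dataclass
class Workflow:
    """An ordered list of steps plus the configuration they share."""
    steps: List[Step]
    config: Dict[str, Any] = field(default_factory=dict)

    def execute(self, data: Any) -> Tuple[Any, List[Tuple[str, Any]]]:
        # Feed the output of each step into the next and record a simple
        # trace of intermediate results.
        trace = []
        for step in self.steps:
            data = step.run(data, self.config)
            trace.append((step.name, data))
        return data, trace


def drop_missing(values, config):
    return [v for v in values if v is not None]


def filter_by_threshold(values, config):
    return [v for v in values if v >= config["threshold"]]


def mean(values, config):
    return sum(values) / len(values)


workflow = Workflow(
    steps=[
        Step("drop_missing", drop_missing),
        Step("filter_by_threshold", filter_by_threshold),
        Step("mean", mean),
    ],
    config={"threshold": 2.0},
)

result, trace = workflow.execute([1.0, None, 2.0, 4.0])
print(result)
for name, intermediate in trace:
    print(name, "->", intermediate)
```

Keeping the configuration separate from the steps is what makes the same workflow rerunnable with different parameters, and the trace is a crude stand-in for the provenance that a real workflow system would record.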
Challenge 1
Identifying and structuring the relevant portions of the data for the task at hand
First-class data citizens
Questions for Scientific Data and Workflows, and the Issues they raise
» Who are you? Where and when were you born? Who were your parents (creators)?
› Identity and Description
› Authenticity
› Uniqueness
» For what purpose were you conceived, and how have you been used?
› Reuse, Repurpose
» What do you have inside?
› Inspection
› Visualization
› Annotations
» How is your content linked?
› Graphical Representation
» May I access all your parts?
› Access Rights
» Which parts can I replace?
› Adaptability
» What have they done to you? Who and when? Why did they do that?
› Provenance
› Versioning
» Why have you been recommended to me? Can I believe what you are saying or trust your results?
› Information Quality
» Do you still produce the same results?
› Reproducibility
» Are you still working? How could I repair you?
› Completeness
› Stability
» How could I thank you? How could I talk about you?
› Credit
Research Objects as Technical Objects
Challenge 1: Identifying and structuring the relevant data
Carriers of Research Context
» Referenceable
» Aggregation, Dispersed
› Heterogeneous
› Local and External
» Annotated metadata
› Provenance
› Structured: Manifests, Recipes, Permissions, Discourse
» Lifecycle
› Publishing, Evolution
› Versioning
» Mixed Stewardship
› Graceful Degradation
» Sharing
» Security & Privacy
» Stereotypical User Profiles
» Services
[Figure labels: Distributed Third Party Tenancy; Alien Store; Technical Objects; Social Objects; OAI-ORE]
Research Objects as Social Objects
Package, Explore, Inspect, Review, Exchange, Share, Reuse, Publish, Credit
Research Object model core (simplified)
http://purl.org/wf4ever/ro#
[Figure: simplified RO core diagram. Classes: ro:ResearchObject, ro:Resource, ro:Manifest, ro:AggregatedAnnotation, wfdesc:Workflow. Properties: ore:aggregates, ore:isDescribedBy, ro:annotatesAggregatedResource]
Note: This figure shows a simplified view of the RO core.
RO specification: http://wf4ever.github.com/ro
› ro (aggregation and annotation)
› wfdesc (workflow description)
› Minim* (minimum info model)
› wfprov (workflow provenance)
› roprov (RO provenance)
› roevo (evolution model)
*Minim based on M. Gamble’s MIM
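To make the aggregation and annotation vocabulary concrete, here is a minimal sketch that builds the description of one research object as plain subject-predicate-object triples. It is one plausible reading of the simplified figure, not the normative RO specification. The `ro` namespace comes from the slide and the `ore` namespace is the OAI-ORE terms vocabulary; the `wfdesc` namespace URI, the example.org base URI, and all resource paths are assumptions made for illustration.

```python
RO = "http://purl.org/wf4ever/ro#"                # namespace given on the slide
ORE = "http://www.openarchives.org/ore/terms/"    # OAI-ORE terms vocabulary
WFDESC = "http://purl.org/wf4ever/wfdesc#"        # assumed namespace URI
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def describe_research_object(base, workflow_path, annotation_path):
    """Return triples for an RO aggregating one workflow and one annotation."""
    ro = base
    manifest = base + "manifest"          # hypothetical manifest location
    workflow = base + workflow_path
    annotation = base + annotation_path
    return [
        (ro, RDF_TYPE, RO + "ResearchObject"),
        (ro, ORE + "isDescribedBy", manifest),
        (manifest, RDF_TYPE, RO + "Manifest"),
        (ro, ORE + "aggregates", workflow),
        (workflow, RDF_TYPE, RO + "Resource"),
        (workflow, RDF_TYPE, WFDESC + "Workflow"),
        (ro, ORE + "aggregates", annotation),
        (annotation, RDF_TYPE, RO + "AggregatedAnnotation"),
        (annotation, RO + "annotatesAggregatedResource", workflow),
    ]


def objects(triples, subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


base = "http://example.org/ro/demo/"      # hypothetical RO identifier
triples = describe_research_object(base, "workflow/gwas", "annotations/title")
aggregated = objects(triples, base, ORE + "aggregates")

for resource in aggregated:
    print(resource)
```

In practice these statements would be written with an RDF library and serialized into the manifest; plain tuples are used here only to keep the sketch dependency-free.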
Challenge 2
Managing the lifecycle of data entities
Evolution and Decay
RO Evolution & Versioning
Challenge 2: Managing the lifecycle of data entities
RO Decay
Challenge 2: Managing the lifecycle of data entities
Workflow Decay
• Component level
• flux/decay/unavailability
• Data level
• Infrastructure level
Experiment Decay
• Methodological changes
• New technologies
• New resources/components
• New data
Preservation, Conservation, Recreating
Preserving: Archived Record; Fixed Snapshots; Review; Rerun & Replay
Conserving: Active Instrument; Live; Rerun & Reuse; Repair & Restore
Recreating: Archived Record; Active Instrument; Live; Rebuild; Recycle; Repurpose
Possible types of decay (an example)
Challenge 2: Managing the lifecycle of data entities
A Taxonomy of RO decay
Decay Analysis
1. Service tool is missing
2. Service file descriptor disappeared
3. Service up but not contactable
4. Service up but functionality changed
5. Local software dependencies
6. Data unavailability
7. Changes in data formats
8. Chained dependency
9. Credentials deprecated
10. Input data superseded by other data
11. RO metadata outdated (upon versioning)
12. Old fashioned RO
13. External references lose credit
14. Execution framework no longer available
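The fourteen categories above lend themselves to a machine-readable encoding, so that decay findings can be recorded and counted per category. The sketch below is illustrative only: the enum member names paraphrase the list, and the `DecayFinding` record, the `summarize` helper, and the example resources are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class DecayType(Enum):
    """The decay categories, numbered as in the list above."""
    SERVICE_TOOL_MISSING = 1
    SERVICE_DESCRIPTOR_DISAPPEARED = 2
    SERVICE_NOT_CONTACTABLE = 3
    SERVICE_FUNCTIONALITY_CHANGED = 4
    LOCAL_SOFTWARE_DEPENDENCIES = 5
    DATA_UNAVAILABLE = 6
    DATA_FORMAT_CHANGED = 7
    CHAINED_DEPENDENCY = 8
    CREDENTIALS_DEPRECATED = 9
    INPUT_DATA_SUPERSEDED = 10
    RO_METADATA_OUTDATED = 11
    OLD_FASHIONED_RO = 12
    EXTERNAL_REFERENCES_LOSE_CREDIT = 13
    EXECUTION_FRAMEWORK_UNAVAILABLE = 14


@dataclass(frozen=True)
class DecayFinding:
    """One observed instance of decay affecting a resource."""
    resource: str
    decay: DecayType
    note: str = ""


def summarize(findings):
    """Count findings per decay category."""
    return Counter(f.decay for f in findings)


findings = [
    DecayFinding("http://example.org/service/align", DecayType.SERVICE_NOT_CONTACTABLE),
    DecayFinding("inputs/genotypes.csv", DecayType.DATA_UNAVAILABLE),
    DecayFinding("http://example.org/service/annotate", DecayType.SERVICE_NOT_CONTACTABLE),
]

summary = summarize(findings)
for decay, count in summary.most_common():
    print(f"{decay.value:>2}. {decay.name}: {count}")
```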
Sample decay type
A taxonomy of workflow decay
1.0 Certificate: Evaluation of Stability and Completeness
Decay Analysis
Stability: Is the RO free from any form of decay preventing workflow execution?
» Focus on reproducibility
» Assisted detection of RO decay
» Active monitoring of decay forms
» RO and workflow provenance
Completeness: Is the minimal aggregation of resources encapsulated by the RO consistent?
» RO checklists
» Produced by scientists
» Automatically checked against minimal model (minim)
» RO evolution
1.0 Certificate notion originally proposed by Yde de Jong
1.0 Certificate of quality
» Notification
» Explanation
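The completeness check described above, a scientist-authored checklist evaluated automatically against what the RO aggregates, can be sketched as follows. This is not the Minim model itself: the `Requirement` class, the path conventions, and the example resources are hypothetical and stand in for whatever a real checklist would express.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Requirement:
    """One checklist item: a description plus a test over aggregated resources."""
    description: str
    satisfied_by: Callable[[Set[str]], bool]


def evaluate_completeness(aggregated: Set[str], checklist: List[Requirement]) -> List[str]:
    """Return the descriptions of all unmet requirements (empty if complete)."""
    return [r.description for r in checklist if not r.satisfied_by(aggregated)]


# Hypothetical checklist written by a scientist for one kind of experiment.
checklist = [
    Requirement(
        "a workflow definition is aggregated",
        lambda res: any(r.startswith("workflow/") for r in res),
    ),
    Requirement(
        "input data is aggregated",
        lambda res: any(r.startswith("inputs/") for r in res),
    ),
    Requirement(
        "a hypothesis description is aggregated",
        lambda res: "docs/hypothesis.txt" in res,
    ),
]

aggregated = {"workflow/gwas.xml", "inputs/genotypes.csv"}
missing = evaluate_completeness(aggregated, checklist)
complete = not missing

print("complete" if complete else "incomplete")
for description in missing:
    print("unmet:", description)
```

Returning the list of unmet items rather than a bare boolean is what lets the result serve as both notification and explanation.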
Lessons learnt
Recap
» Data with a Purpose
» Encapsulate & Conquer
› Goal-driven (purpose)
› Aggregation
› Community-managed
» Nothing is immutable, especially data.
› Foster evolution
› Monitor decay
Scalability
Provenance
Questions
Thanks for your Attention!
Any Questions?
http://www.wf4ever-project.org/