Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotations

Pinar Alper, Khalid Belhajjame, Carole A. Goble (University of Manchester)

Pinar Karagoz (Middle East Technical University)

IEEE 2nd International Congress on Big Data, June 27-July 2, 2013

Upload: khalid-belhajjame

Post on 19-Jun-2015


DESCRIPTION

Scientific workflows have become the workhorse of Big Data analytics for scientists. As well as being repeatable and optimizable pipelines that bring together datasets and analysis tools, workflows make up an important part of the provenance of data generated from their execution. By faithfully capturing all stages in the analysis, workflows play a critical part in building up the audit trail (a.k.a. provenance) metadata for derived datasets and contribute to the veracity of results. Provenance is essential for reporting results, reporting the method followed, and adapting to changes in the datasets or tools. These functions, however, are hampered by the complexity of workflows and consequently the complexity of data trails generated from their instrumented execution. In this paper we propose the generation of workflow description summaries in order to tackle workflow complexity. We elaborate reduction primitives for summarizing workflows, and show how primitives, as building blocks, can be used in conjunction with semantic workflow annotations to encode different summarization strategies. We report on the effectiveness of the method through experimental evaluation using real-world workflows from the Taverna system.

TRANSCRIPT

Page 1: Small Is Beautiful:  Summarizing Scientific Workflows  Using Semantic Annotations. IEEE BigData Conference 2013

Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotations

Pinar Alper, Khalid Belhajjame, Carole A. Goble

University of Manchester

Pinar Karagoz

Middle East Technical University

IEEE 2nd International Congress on Big Data

June 27-July 2, 2013

Page 2:

Pinar and her daughter Nile at the end of year school party.

Page 3:

Scientific Workflows

• Data driven analysis pipelines

• Systematic gathering of data and analysis tools into computational solutions for scientific problem-solving

• Tools for automating frequently performed data intensive activities

• Provenance for the resulting datasets

– The method followed

– The resources used

– The datasets used

Page 4:

Science with workflows

GWAS, Pharmacogenomics: Association study of Nevirapine-induced skin rash in Thai Population

Trypanosomiasis (sleeping sickness parasite) in African Cattle

Astronomy & HelioPhysics

Library Doc Preservation

Systems Biology of Micro-Organisms

Observing Systems Simulation Experiments

JPL, NASA

BioDiversity Invasive Species Modelling

[Credit Carole A. Goble]

Page 5:

Provenance is paramount for science

• Reporting findings

– Derivation - how did we get this result?

• processes/programs used, execution trace, data lineage, source of components (data, services)

– History - who did what when?

• creator, contributors, timestamps

• Adapting to Change

– Explanation - why did this record start to appear in the result?

– Change Impact - which steps will be affected if I change this tool or data input?

Page 6:

PROV Primer, Gil et al.

WF Execution Trace (Retrospective Provenance): actual data used, actual invocations, timestamps and data derivation trace

WF Description (Prospective Provenance): intended method for analysis

Page 7:

Workflows can get complex!

• Overwhelming for users who are not the developers

• Abstractions required for reporting

• Lineage queries result in very long trails

Page 8:

Reason and extent of complexity

• a.k.a. Shims

• Dealing with data and protocol heterogeneities

• Local organization of data

Garijo D., Alper P., Belhajjame K. et al.

D. Hull et al

~ 60%

Page 9:

Static Ways To Tackle Complexity

Process-Wise and Data-Wise abstractions

• Sub-workflows

– Not always a significant unit of function (e.g. aesthetic purposes)

• Bookmarked data links

– Cluster the output signature

– Further complicates workflow

• Components

– Library dependent

Page 10:

Our Solution: Workflow Description Summaries

• A graph model for representing workflows

• Graph re-write rules for summarization

IF <performs certain function> THEN <re-write WF graph>

(IF part: motifs; THEN part: reduction primitives)
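The IF/THEN rule shape on this slide can be sketched as a small data structure: a condition on the motif an operation carries, paired with the name of a reduction primitive to apply. This is only an illustration under assumptions. The `Rule` class, `plan_rewrites`, and the pairing of `DataMoving` with `eliminate` are hypothetical and are not taken from the slides; the operation and motif names reuse the annotation example given later in the deck.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

# Hypothetical type for illustration: operation name -> motif names it carries.
Annotations = Dict[str, Set[str]]


@dataclass(frozen=True)
class Rule:
    """IF an operation carries `motif` THEN apply the primitive `primitive`."""
    motif: str       # condition part, e.g. "DataMoving"
    primitive: str   # action part, e.g. "eliminate"


def matches(rule: Rule, operation: str, annotations: Annotations) -> bool:
    """True when the operation is annotated with the rule's motif."""
    return rule.motif in annotations.get(operation, set())


def plan_rewrites(rules: List[Rule], annotations: Annotations) -> List[Tuple[str, str]]:
    """Return (operation, primitive) pairs; each operation is claimed by the first matching rule."""
    plan: List[Tuple[str, str]] = []
    claimed: Set[str] = set()
    for rule in rules:
        for op in sorted(annotations):
            if op not in claimed and matches(rule, op, annotations):
                plan.append((op, rule.primitive))
                claimed.add(op)
    return plan


if __name__ == "__main__":
    annotations = {
        "color_pathway_by_objects": {"DataRetrieval"},
        "Get_Image_From_URL_2": {"DataMoving"},
    }
    # Assumed pairing, for illustration only.
    rules = [Rule(motif="DataMoving", primitive="eliminate")]
    print(plan_rewrites(rules, annotations))
```

The sketch only plans which re-writes would fire; it deliberately leaves the graph re-writing itself out of scope.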

Page 11:

PART-1: Scientific Workflow Motifs

• Domain Independent categorization

– Data-Oriented Nature

– Resource/Implementation-Oriented Nature

• Captured in a lightweight OWL Ontology

http://purl.org/net/wf-motifs

Page 12:

A graph model of data-driven workflows

Pure Dataflows

$W = \langle N, E \rangle$

Operation and Port Nodes

$N = N_{op} \cup N_{p}$

Dataflow edges

$E = E_{op \to p} \cup E_{p \to p} \cup E_{p \to op}$
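The graph model on this slide (operation and port nodes, three kinds of dataflow edge) can be sketched as a small typed container. The class and field names are hypothetical. The reading of the three edge sets as operation-to-port, port-to-port and port-to-operation is an assumption inferred from the subscripts on the slide.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple

Edge = Tuple[str, str]  # (source node, target node)


@dataclass
class Workflow:
    """W = <N, E> with two node kinds and three dataflow edge kinds."""
    op_nodes: Set[str] = field(default_factory=set)       # operation nodes
    port_nodes: Set[str] = field(default_factory=set)     # port nodes
    op_to_port: Set[Edge] = field(default_factory=set)    # assumed: operation -> its output port
    port_to_port: Set[Edge] = field(default_factory=set)  # assumed: link between two ports
    port_to_op: Set[Edge] = field(default_factory=set)    # assumed: input port -> its operation

    @property
    def nodes(self) -> Set[str]:
        """N: union of operation and port nodes."""
        return self.op_nodes | self.port_nodes

    @property
    def edges(self) -> Set[Edge]:
        """E: union of the three edge sets."""
        return self.op_to_port | self.port_to_port | self.port_to_op

    def validate(self) -> None:
        """Raise ValueError if any edge joins the wrong kinds of node."""
        expected = [
            (self.op_to_port, self.op_nodes, self.port_nodes),
            (self.port_to_port, self.port_nodes, self.port_nodes),
            (self.port_to_op, self.port_nodes, self.op_nodes),
        ]
        for edge_set, sources, targets in expected:
            for src, dst in edge_set:
                if src not in sources or dst not in targets:
                    raise ValueError(f"ill-typed edge: {src} -> {dst}")


if __name__ == "__main__":
    # Two made-up operations A and B joined through one output and one input port.
    wf = Workflow(
        op_nodes={"A", "B"},
        port_nodes={"A.out", "B.in"},
        op_to_port={("A", "A.out")},
        port_to_port={("A.out", "B.in")},
        port_to_op={("B.in", "B")},
    )
    wf.validate()
    print(len(wf.nodes), len(wf.edges))
```

Keeping the three edge sets separate, rather than one untyped edge list, makes it cheap to check that a re-write has not produced an edge between the wrong kinds of node.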

Page 13:

Motif annotations over operations

motifs(color_pathway_by_objects) = {m1:DataRetrieval}

motifs(Get_Image_From_URL_2) = {m2:DataMoving}
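The two annotations above read as a function from operations to sets of motif instances. A minimal sketch of that lookup follows. The dictionary layout and the helper `operations_with` are assumptions made for illustration; only the operation names, instance ids and motif classes come from the slide.

```python
from typing import Dict, List, Set

# operation name -> {motif instance id: motif class}, mirroring the slide notation
ANNOTATIONS: Dict[str, Dict[str, str]] = {
    "color_pathway_by_objects": {"m1": "DataRetrieval"},
    "Get_Image_From_URL_2": {"m2": "DataMoving"},
}


def motifs(operation: str) -> Set[str]:
    """Motif classes attached to an operation; empty if it is unannotated."""
    return set(ANNOTATIONS.get(operation, {}).values())


def operations_with(motif: str) -> List[str]:
    """Hypothetical helper: all annotated operations carrying the given motif class."""
    return sorted(op for op in ANNOTATIONS if motif in motifs(op))


if __name__ == "__main__":
    print(motifs("Get_Image_From_URL_2"))
    print(operations_with("DataRetrieval"))
```

Returning an empty set for unannotated operations lets a summarization rule treat "no motif" as simply "no match" without special cases.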


Page 14:

PART-2: Workflow reduction primitives

• Collapse (Up/Down)

• Compose

• Eliminate
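The transcript names the primitives but the diagrams that define them did not survive. The sketch below therefore assumes plausible semantics on a simplified, operation-only graph with ports omitted: Eliminate drops an operation and reconnects its predecessors to its successors, and Collapse merges an operation into an adjacent upstream (up) or downstream (down) operation. These definitions, the function names and the example operation names are assumptions, not the authors' formal definitions. Compose is omitted.

```python
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # operation -> set of direct successor operations


def predecessors(g: Graph, op: str) -> Set[str]:
    """Operations with a direct link into `op`."""
    return {u for u, succs in g.items() if op in succs}


def eliminate(g: Graph, op: str) -> Graph:
    """Assumed semantics: drop `op` and link each predecessor to each successor."""
    preds, succs = predecessors(g, op), g[op]
    out = {u: set(vs) - {op} for u, vs in g.items() if u != op}
    for u in preds:
        out[u] |= succs
    return out


def collapse(g: Graph, op: str, into: str) -> Graph:
    """Assumed semantics: merge `op` into the adjacent operation `into`.

    `into` is a predecessor for a collapse up and a successor for a collapse down.
    """
    if into not in g[op] and into not in predecessors(g, op):
        raise ValueError(f"{into!r} is not adjacent to {op!r}")
    # Redirect links that pointed at `op` so they point at `into`.
    out = {u: {into if v == op else v for v in vs} for u, vs in g.items() if u != op}
    out[into] |= g[op]        # `into` inherits the outgoing links of `op`
    out[into].discard(into)   # the merged link must not become a self-loop
    return out


if __name__ == "__main__":
    g = {"fetch": {"split"}, "split": {"analyse"}, "analyse": set()}
    print(eliminate(g, "split"))
    print(collapse(g, "split", into="fetch"))    # collapse up
    print(collapse(g, "split", into="analyse"))  # collapse down
```

Both functions return a new graph and leave the input untouched. The sketch does not guard against merges that would introduce a cycle, so a caller would need to check the result separately.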

Page 15:

Collapse Down

Page 16:

Collapse Up

Page 17:

Compose

Page 18:

Eliminate

Page 19:

How will rules be put to use

• Strategies as a set of rules for summarization

• Two sample strategies based on an empirical analysis of workflows

• Reporting:

– Process: Significant activities (Retrieval, Analysis, Visualization)

– Data:

• Reduced cardinality

• Stripped of protocol specific payload/formatting

Page 20:

Two sample strategies

• By-Eliminate

– Minimal annotation effort

– Single rule

• By-Collapse

– More specific annotation

– Multiple rules

Page 21:

Overall Approach

Workflow Designer

Taverna Workbench

Motif Ontology

WF Summary

WF Description

Summarizer

Summarization Rules

Page 22:

Analysis Data Set

• 30 Workflows from the Taverna system

• Entire dataset & queries accessible from http://www.myexperiment.org/packs/467.html

• Manual Annotation using Motif Vocabulary

Page 23:

Summaries at a glance

By-Collapse

By-Elimination

• Causal Ordering of operations

• Reduced depth

Page 24:

By-Collapse

Page 25:

By-Elimination

Page 26:

Mechanistic Effect of Summarization

Page 27:

User Summaries vs. Summary Graphs

Page 28:

Related Work

• User Views over provenance, O. Biton, et al.

– User specified significant operations

– Automatic partitioning of workflow graph

• Provenance Redaction, T. Cadenhead, et al.

– Redaction primitives

– Graph queries with regular expressions

• Provenance Publishing, S. C. Dey et al.

– User policies on publishing (hide, retain)

– Consistency checks

Page 29:

Highlights

• Re-writing workflow graphs with rules

• Exploiting semantic annotations of operations

• Controlled, primitive-based re-writing

– Preserve acyclicity

• Users indirectly control the summarization

– Encoding their preferences as summary rules
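The "preserve acyclicity" property listed among the highlights can be checked mechanically after each re-write. The sketch below uses Kahn's topological-sort algorithm on a simplified operation-level graph. It is a generic check offered as an illustration, not the mechanism the authors describe, and the graph representation and example names are assumptions.

```python
from collections import deque
from typing import Dict, Set


def is_acyclic(g: Dict[str, Set[str]]) -> bool:
    """Return True if the directed graph (node -> successors) has no cycle."""
    indegree = {u: 0 for u in g}
    for succs in g.values():
        for v in succs:
            indegree[v] = indegree.get(v, 0) + 1

    # Start from nodes that nothing points to, then peel the graph layer by layer.
    ready = deque(u for u, d in indegree.items() if d == 0)
    visited = 0
    while ready:
        u = ready.popleft()
        visited += 1
        for v in g.get(u, ()):
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)

    # Any node never reaching in-degree zero sits on or behind a cycle.
    return visited == len(indegree)


if __name__ == "__main__":
    summary = {"fetch": {"analyse"}, "analyse": {"plot"}, "plot": set()}
    broken = {"fetch": {"analyse"}, "analyse": {"fetch"}}
    print(is_acyclic(summary), is_acyclic(broken))
```

Running such a check after every primitive application would let a summarizer reject a re-write that turns the dataflow into a cyclic graph.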

Future Work

• Querying of Workflow Execution Provenance using summaries.

Page 30:

Thank you!

Carole A. GOBLE, University of Manchester

Khalid BELHAJJAME, University of Manchester

Pinar KARAGOZ, Middle East Technical University

Pinar ALPER, University of Manchester

Page 31:

Bibliography

D. Garijo, P. Alper, K. Belhajjame, O. Corcho, C. Goble, and Y. Gil. Common motifs in scientific workflows: An empirical analysis. In Proceedings of the IEEE eScience Conference, 2012.

P. Alper, K. Belhajjame, C. A. Goble, and P. Senkul. Enhancing and abstracting scientific workflow provenance for data publishing. Submitted for publication to BIGProv 2013 International Workshop on Managing and Querying Provenance Data at Scale, co-located with EDBT-2013.

O. Biton, et al. Querying and Managing Provenance through User Views in Scientific Workflows. 2008 IEEE 24th International Conference on Data Engineering, pages 1072–1081, Apr. 2008.

J. Cheney et al. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379–474, 2009.

D. Hull et al. Treating shimantic web syndrome with ontologies. In AKT Workshop on Semantic Web Services, 2004.

S. C. Dey, D. Zinn, and B. Ludäscher. Propub: towards a declarative approach for publishing customized, policy-aware provenance. In Proceedings of the 23rd international conference on Scientific and statistical database management, SSDBM’11, pages 225–243, Berlin, Heidelberg, 2011. Springer-Verlag.

Y. Gil and S. Miles, editors. The PROV Model Primer. http://www.w3.org/TR/prov-primer/

T. Cadenhead, V. Khadilkar, M. Kantarcioglu, and B. Thuraisingham. Transforming provenance using redaction. In Proceedings of the 16th ACM symposium on Access control models and technologies, SACMAT ’11, pages 93–102, New York, NY, USA, 2011. ACM.

Taverna: Open Source and Domain Independent Workflow Management System. http://www.taverna.org.uk/