Towards Semantic Typing Support for Scientific Workflows




Page 1: Towards Semantic Typing Support for  Scientific Workflows

Towards Semantic Typing Support for Scientific Workflows

Bertram Ludäscher

Knowledge-Based Information Systems Lab
San Diego Supercomputer Center
University of California San Diego

http://seek.ecoinformatics.org http://www.geongrid.org

Page 2: Towards Semantic Typing Support for  Scientific Workflows

B. Ludäscher – Scientific Data Management 2

Outline

1. Motivation: Traditional vs Scientific Data Integration

2. Semantic (a.k.a. Model-Based) Mediation

3. Scientific Workflows (a.k.a. Analysis Pipelines)

4. DB Theory Appetizer: Web Service Composition Through Declarative Queries

Page 3: Towards Semantic Typing Support for  Scientific Workflows

Information Integration Challenges

• System aspects: “Grid” Middleware
  • distributed data & computing
  • Web Services, WSDL/SOAP, OGSA, …
  • sources = functions, files, data sets, …
• Syntax & Structure: (XML-Based) Data Mediators
  • wrapping, restructuring
  • (XML) queries and views
  • sources = (XML) databases
• Semantics: Model-Based/Semantic Mediators
  • conceptual models and declarative views
  • Knowledge Representation: ontologies, description logics (RDF(S), OWL, ...)
  • sources = knowledge bases (DB+CMs+ICs)

Reconciling S4 heterogeneities (Syntax, Structure, Semantics, System aspects): “gluing” together resources; bridging information and knowledge gaps computationally

Page 4: Towards Semantic Typing Support for  Scientific Workflows

Information Integration from a DB Perspective

• Information Integration Problem
  – Given: data sources S1, ..., Sk (DBMS, web sites, ...) and user questions Q1, ..., Qn that can be answered using the Si
  – Find: the answers to Q1, ..., Qn
• The Database Perspective:
  – source = “database”: Si has a schema (relational, XML, OO, ...)
  – Si can be queried
  – define a virtual (or materialized) integrated/global view G over S1, ..., Sk using database query languages (SQL, XQuery, ...)
  – questions become queries Qi against G(S1, ..., Sk)
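The database perspective above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sources, their schemas, their contents, and the names `S1`, `S2`, `G`, and `Q` are hypothetical stand-ins, and a real mediator would use SQL or XQuery views rather than Python functions.

```python
# Hypothetical sources: each Si is a list of records with its own schema.
S1 = [{"isbn": "i1", "title": "Title A"}, {"isbn": "i2", "title": "Title B"}]
S2 = [{"ISBN": "i1", "author": "Author X"}, {"ISBN": "i3", "author": "Author Y"}]


def G():
    """Virtual global view over S1 and S2: a join on ISBN, computed on demand."""
    for book in S1:
        for entry in S2:
            if book["isbn"] == entry["ISBN"]:
                yield {
                    "isbn": book["isbn"],
                    "title": book["title"],
                    "author": entry["author"],
                }


def Q(author):
    """A user question posed as a query against G, not against the sources."""
    return [row["title"] for row in G() if row["author"] == author]


print(Q("Author X"))  # ['Title A']
```

Because `G` is a generator, the view is virtual: nothing is stored, and each query re-derives its answers from the sources. Caching the result of `G()` would correspond to a materialized view.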

Page 5: Towards Semantic Typing Support for  Scientific Workflows

Standard (XML-Based) Mediator Architecture

[Figure: USER/Client on top, MEDIATOR in the middle, wrapped sources S1, S2, ..., Sk at the bottom]

• MEDIATOR: Integrated Global (XML) View G, with Integrated View Definition G(..) ← S1(..) … Sk(..)
• Sources S1, S2, ..., Sk: each is exposed through a Wrapper as an (XML) View; web services as wrapper APIs
• Message flow (step numbers as labeled in the figure):
  1. Query Q ( G (S1, ..., Sk) )
  3. Q1 Q2 Q3
  4. {answers(Q1)} {answers(Q2)} {answers(Q3)}
  6. {answers(Q)}

Page 6: Towards Semantic Typing Support for  Scientific Workflows

Query Planning for Mediators

• Given:
  – User query Q: answer(…) ← … G ...
  – … & { G ← … S … } global-as-view (GAV)
  – … & { S ← … G … } local-as-view (LAV)
  – … & { false ← … S … G … } integrity constraints (ICs)
• Find:
  – equivalent (or minimal containing, maximal contained) query plan Q’: answer(…) ← … S …
• Results:
  – A variety of results/algorithms, depending on the classes of queries, views, and ICs: P, NP, …, undecidable
  – many variants still open

Page 7: Towards Semantic Typing Support for  Scientific Workflows

From Scientific Data Integration to Process & Application Integration (and back…)

• Data Integration
  – Database mediation + knowledge-based extension
  – Query rewriting w/ GAV, LAV, ICs, access patterns
• “Process/Application” Integration
  – Scientific models (ocean, atmosphere, ecology, …), assimilation models (e.g., real-time data feeds), …
  – Data sets
  – Legacy tools
  – Components = web services
  – Applications = composite components (“workflows”)
  – Need for semantic type extensions

Page 8: Towards Semantic Typing Support for  Scientific Workflows

Geologic Map Integration

• Given:
  – Geologic maps from different state geological surveys (shapefiles w/ different data schemas)
  – Different ontologies:
    • Geologic age ontology
    • Rock type ontologies:
      – Multiple hierarchies (chemical, fabric, texture, genesis) from Geological Survey of Canada (GSC)
      – Single hierarchy from British Geological Survey (BGS)
• Problem
  – Support uniform queries against the multiple geologic maps using different ontologies
  – Support registration w/ ontology A, querying w/ ontology B

Page 9: Towards Semantic Typing Support for  Scientific Workflows

Ontology Mappings: Motivation

• Establish correspondences between ontologies
  – Integrate data sets which are registered to different ontologies
  – Query data sets through different ontologies

[Figure: Data set 1, Data set 2, Ontology A, Ontology B; edges labeled “register”, “Ontology mappings”, and “queries”]

Page 10: Towards Semantic Typing Support for  Scientific Workflows

A Multi-Hierarchical Rock Classification Ontology (GSC)

[Figure: four hierarchies: Composition, Genesis, Fabric, Texture]

Page 11: Towards Semantic Typing Support for  Scientific Workflows

Some enabling operations on “ontology data”

[Figure: Composition hierarchy]

Concept expansion:
• what else to look for when asking for ‘Mafic’

Page 12: Towards Semantic Typing Support for  Scientific Workflows

Some enabling operations on “ontology data”

[Figure: Composition hierarchy]

Generalization:
• finding data that is “like” X and Y
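The two operations named on these slides, concept expansion and generalization, can be sketched over a toy hierarchy. This is a minimal illustration, assuming a single tree-shaped hierarchy; the concept names other than ‘Mafic’ and ‘Composition’ are hypothetical and are not taken from the GSC ontology, which has multiple hierarchies.

```python
# Hypothetical fragment of a composition hierarchy (child -> parent).
PARENT = {
    "Mafic": "Composition",
    "Felsic": "Composition",
    "Basaltic": "Mafic",
    "Gabbroic": "Mafic",
    "Granitic": "Felsic",
}


def ancestors(concept):
    """Return the chain [concept, parent, ..., root]."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def expand(concept):
    """Concept expansion: the concept together with everything below it."""
    everything = set(PARENT) | set(PARENT.values())
    return sorted(c for c in everything if concept in ancestors(c))


def generalize(x, y):
    """Generalization: the most specific concept that subsumes both x and y."""
    above_y = set(ancestors(y))
    return next(a for a in ancestors(x) if a in above_y)


print(expand("Mafic"))                     # ['Basaltic', 'Gabbroic', 'Mafic']
print(generalize("Basaltic", "Granitic"))  # Composition
```

A query for ‘Mafic’ would then be rewritten to match any concept in `expand("Mafic")`, and data “like” X and Y would be looked up under `generalize(X, Y)`.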

Page 13: Towards Semantic Typing Support for  Scientific Workflows

Implementation in OWL: Not only “for the machine” …

Page 14: Towards Semantic Typing Support for  Scientific Workflows

Geologic Map Integration

[Figure: geologic map of Nevada, annotated “domain knowledge”, “Knowledge representation”, and “Ontologies!?”]

GEON Metamorphism Equation: Geoscientists + Computer Scientists → Igneous Geoinformaticists (+/- Energy, +/- a few hundred million years)

Page 15: Towards Semantic Typing Support for  Scientific Workflows

Geology Workbench: Registering Data to an Ontology
Step 1: Choose Classes

[Screenshot callouts: Click on Submission; Data set name; Select a shapefile; Choose an ontology class]

Page 16: Towards Semantic Typing Support for  Scientific Workflows

Geology Workbench: Data Registration
Step 2: Choose Columns for Selected Classes

Shapefile columns: AREA, PERIMETER, AZ_1000, AZ_1000_ID, GEO, PERIOD, ABBREV, DESCR, D_SYMBOL, P_SYMBOL

[Screenshot callout: It contains information about geologic age]

Page 17: Towards Semantic Typing Support for  Scientific Workflows

Geology Workbench: Data Registration
Step 3: Resolve Mismatches

[Screenshot callouts: Two terms are not matched to any ontology terms; Manually mapping algonkian into the ontology]

Page 18: Towards Semantic Typing Support for  Scientific Workflows

Geology Workbench: Ontology-enabled Map Integrator

[Screenshot callouts: Click on the name; Choose interesting classes; All areas with the age Paleozoic]

Page 19: Towards Semantic Typing Support for  Scientific Workflows

Geology Workbench: Change Ontology

[Screenshot callouts: Submit a mapping; Ontology mapping between British Rock Classification and Canadian Rock Classification; Switch from Canadian Rock Classification to British Rock Classification; Run it; New query interface]

Page 20: Towards Semantic Typing Support for  Scientific Workflows

Ontologies and Data Management

[Figure: diagram relating Data, Metadata, Schemas, Conceptual Models, Design Artifacts, and an Ontology; edge label: “use concepts from (explicitly or implicitly)”]

• How to define and refine an ontology?
• How to register a dataset to an ontology?

Page 21: Towards Semantic Typing Support for  Scientific Workflows

Refining an Ontology – the logic way, enables “Source Contextualization”

[Figure credit: Biomedical Informatics Research Network, http://nbirn.net]

Page 22: Towards Semantic Typing Support for  Scientific Workflows

Connecting Datasets to Ontologies: “Semantic Registration”

Dataset:

Date        Site  Transect  SP_Code  Count
2000-09-08  CARP  1         CRGI     0
2000-09-08  CARP  4         LOCH     0
2000-09-08  CARP  7         MUCA     1
2000-09-22  NAPL  7         LOCH     1
2000-09-18  NAPL  1         PAPA     5
2000-09-28  BULL  1         CYOS     57

Ontology (snippet):

DataCollectionEvent ⊑ contains.Measurement
Measurement ⊑ measureOf.MeasurableItem ⊓ hasContext.MeasurementContext
MeasurementContext ⊑ hasTime.DateTime ⊓ hasLocation.Location
MeasurableItem ⊑ hasUnit.Unit ⊓ hasValue.UnitValue
SpeciesCount ⊑ MeasurableItem ⊓ hasSpecies.Species ⊓ hasUnit.RatioUnit
…
SpeciesAbundance ⊑ Measurement ⊓ measureOf.SpeciesCount
AbundanceCollectionEvent ⊑ DataCollectionEvent ⊓ contains.SpeciesAbundance
Location ⊑ position.Coordinate
LTERSite ⊑ Location
SBLTERSite ⊑ LTERSite ⊓ position.SBLTERCoordinate
{naples,…} ⊑ SBLTERSite

How can we “register” the dataset to concepts in the Ontology?

Page 23: Towards Semantic Typing Support for  Scientific Workflows

Purpose of Semantic Registration

• Expose “hidden” information:
  – What do attributes represent?
  – What do specific values represent?
  – What conceptual “objects” are in the dataset?
• Capture connections between the dataset and ontology to:
  – Find existing datasets (or parts of datasets) via ontological concepts (discovery)
  – Enable integration of datasets (mediation)
  – Generate metadata for new data products (in a pipeline)

Page 24: Towards Semantic Typing Support for  Scientific Workflows

Semantic Registration Framework

Step 1: Data provider selects relevant ontological concepts (for the dataset)

Step 2: The semantic registration system creates a structural representation based on the chosen concepts (the data provider refines it if needed)

Step 3: The data provider maps the dataset information to the generated structural representation
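The three steps can be sketched as follows. This is a rough illustration, not the registration system itself: the concept and property names come from the slides, but the dictionary encoding, the attachment of properties to concepts, and the helper names `structural_representation` and `unmapped` are assumptions made for the example.

```python
# Step 1: concepts chosen by the data provider (names as used on the slides).
selected = ["AbundanceCollectionEvent", "SpeciesAbundance", "SpeciesCount"]

# Hypothetical encoding of an ontology fragment: concept -> {property: range}.
ONTOLOGY = {
    "AbundanceCollectionEvent": {
        "hasTime": "DateTime",
        "hasLocation": "SBLTERSite",
        "contains": "SpeciesAbundance",
    },
    "SpeciesAbundance": {"measureOf": "SpeciesCount"},
    "SpeciesCount": {"hasSpecies": "Species", "hasValue": "RatioValue"},
}


def structural_representation(concepts):
    """Step 2: list the property paths that dataset columns can be mapped to."""
    return [f"{c}.{p}" for c in concepts for p in ONTOLOGY.get(c, {})]


# Step 3: the data provider maps dataset columns to the generated paths.
registration = {
    "Date": "AbundanceCollectionEvent.hasTime",
    "Site": "AbundanceCollectionEvent.hasLocation",
    "SP_Code": "SpeciesCount.hasSpecies",
    "Count": "SpeciesCount.hasValue",
}


def unmapped(columns, mapping, paths):
    """Columns with no mapping, or mapped to a path that was not generated."""
    return [c for c in columns if mapping.get(c) not in paths]


paths = structural_representation(selected)
columns = ["Date", "Site", "Transect", "SP_Code", "Count"]
print(unmapped(columns, registration, paths))  # ['Transect']
```

The `unmapped` check shows where the provider still has work to do: here `Transect` has no counterpart among the selected concepts, so either another concept must be selected in Step 1 or the column stays unregistered.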

Page 25: Towards Semantic Typing Support for  Scientific Workflows

Step 1: Selecting Relevant Concepts

Dataset:

Date        Site  Transect  SP_Code  Count
2000-09-08  CARP  1         CRGI     0
2000-09-08  CARP  4         LOCH     0
2000-09-08  CARP  7         MUCA     1
2000-09-22  NAPL  7         LOCH     1
2000-09-18  NAPL  1         PAPA     5
2000-09-28  BULL  1         CYOS     57

Concepts from an Ontology:

• DataCollectionEvent
  • AbundanceCollectionEvent
• Measurement
  • Abundance
    • SpeciesAbundance
• MeasurableItem
  • SpeciesCount
• Location
  • LTERSite
    • SBLTERSite
      • naples
• Species
  • …
• MeasurementContext
  • …

Page 27: Towards Semantic Typing Support for  Scientific Workflows

Step 2: Generate Object Model

[Figure: object model generated from the selected concepts]

• AbundanceCollectionEvent contains SpeciesAbundance
• SpeciesAbundance measureOf SpeciesCount
• SpeciesCount hasSpecies Species; hasUnit RatioUnit; hasValue RatioValue
• hasTime DateTime; hasLoc SBLTERSite

Concepts from an Ontology: the same concept list as in Step 1

Page 28: Towards Semantic Typing Support for  Scientific Workflows

Page 29: Towards Semantic Typing Support for  Scientific Workflows

Page 30: Towards Semantic Typing Support for  Scientific Workflows

Page 31: Towards Semantic Typing Support for  Scientific Workflows

Scientific Workflows

Page 32: Towards Semantic Typing Support for  Scientific Workflows

Promoter Identification Workflow (PIW)

Source: Matt Coleman (LLNL)

Page 33: Towards Semantic Typing Support for  Scientific Workflows

Source: NIH BIRN (Jeffrey Grethe, UCSD)

Page 34: Towards Semantic Typing Support for  Scientific Workflows

Ecology: GARP Analysis Pipeline for Invasive Species Prediction

[Figure: GARP analysis pipeline.
Steps: EcoGridQuery, LayerIntegration, SampleData, DataCalculation, MapGeneration, Validation, GenerateMetadata, ArchiveToEcogrid; inputs come from Registered Ecogrid Databases, with User interaction at Validation; additional labels +A1, +A2, +A3.
Data products: (a) species presence & absence points (native range; invasion area); (b) environmental layers (native range; invasion area); (c) integrated layers (native range; invasion area); (d) training sample and test sample; (e) GARP rule set; (f) native range prediction map and invasion area prediction map; (g) model quality parameter; (h) selected prediction maps.]

Source: NSF SEEK (Deana Pennington et al., UNM)

Page 35: Towards Semantic Typing Support for  Scientific Workflows

Scientific Workflows: Some Findings

• More dataflow than (business) workflow
• Need for “programming extension”
  – Iterations over lists (foreach); filtering; functional composition; generic & higher-order operations (zip, map(f), …)
• Need for abstraction and nested workflows
• Need for data transformations
• Need for rich user interaction & workflow steering:
  – pause / revise / resume
  – select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF
• Need for high-throughput transfers (“grid-enabling”, “streaming”)
• Need for persistence of intermediate products
  – data provenance (“virtual data” concept)

Page 36: Towards Semantic Typing Support for  Scientific Workflows

Our Starting Point: Dataflow Process Networks and Ptolemy II

see! try! read!

Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/

Page 37: Towards Semantic Typing Support for  Scientific Workflows

Kepler Team, Projects, Sponsors

• Ilkay Altintas: SDM
• Chad Berkley: SEEK
• Shawn Bowers: SEEK
• Jeffrey Grethe: BIRN
• Christopher H. Brooks: Ptolemy II
• Zhengang Cheng: SDM
• Efrat Jaeger: GEON
• Matt Jones: SEEK
• Edward A. Lee: Ptolemy II
• Kai Lin: GEON
• Ashraf Memon: GEON
• Bertram Ludäscher: BIRN, GEON, SDM, SEEK
• Steve Mock: NMI
• Steve Neuendorffer: Ptolemy II
• Mladen Vouk: SDM
• Yang Zhao: Ptolemy II
• …

Page 38: Towards Semantic Typing Support for  Scientific Workflows

Commercial Workflow/Dataflow Systems

Page 39: Towards Semantic Typing Support for  Scientific Workflows

SCIRun: Problem Solving Environments for Large-Scale Scientific Computing

• SCIRun: PSE for interactive construction, debugging, and steering of large-scale scientific computations
• Component model, based on generalized dataflow programming

Source: Steve Parker (cs.utah.edu)

Page 40: Towards Semantic Typing Support for  Scientific Workflows

E-Science and Link-Up Buddies

• … <UPDATE ME> …
  – Taverna, Scufl, Freefluo, ..
  – DiscoveryNet
  – Triana
  – ICENI
  – …

Page 41: Towards Semantic Typing Support for  Scientific Workflows

Dataflow Process Networks: Putting Computation Models first!

• Synchronous Dataflow Network (SDF)
  – Statically schedulable single-threaded dataflow
    • Can execute multi-threaded, but the firing sequence is known in advance
  – Maximally well-behaved, but also limited expressiveness
• Process Network (PN)
  – Multi-threaded dynamically scheduled dataflow
  – More expressive than SDF (dynamic token rate prevents static scheduling)
  – Natural streaming model
• Other Execution Models (“Domains”)
  – Implemented through different “Directors”

[Figure: two actors with typed i/o ports connected by a FIFO channel; advanced push/pull]
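What makes SDF statically schedulable is that every actor produces and consumes a fixed number of tokens per firing, so the number of firings per schedule period follows from balance equations. The sketch below illustrates that idea only; it is not Ptolemy II code, and the function name `repetitions` and the edge encoding are assumptions made for the example.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetitions(actors, edges):
    """Solve the SDF balance equations r[src] * produced == r[dst] * consumed.

    edges is a list of (src, produced, dst, consumed) with fixed token rates.
    Returns the smallest positive integer firing count for each actor.
    """
    rate = {actors[0]: Fraction(1)}
    pending = list(edges)
    while pending:
        progressed = False
        for edge in list(pending):
            src, produced, dst, consumed = edge
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * produced / consumed
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * consumed / produced
            elif src in rate and dst in rate:
                if rate[src] * produced != rate[dst] * consumed:
                    raise ValueError("inconsistent rates: no static schedule")
            else:
                continue  # neither end is known yet; retry in a later pass
            pending.remove(edge)
            progressed = True
        if not progressed:
            raise ValueError("graph is not connected")
    # Scale by the least common multiple of the denominators.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    return {a: int(r * scale) for a, r in rate.items()}


# A produces 2 tokens per firing, B consumes 3 and produces 1, C consumes 2.
print(repetitions(["A", "B", "C"], [("A", 2, "B", 3), ("B", 1, "C", 2)]))
# {'A': 3, 'B': 2, 'C': 1}
```

In a process network the token rates may vary at run time, so no such fixed firing counts exist and scheduling has to be dynamic.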

Page 42: Towards Semantic Typing Support for  Scientific Workflows

Promoter Identification Workflow (PIW)

Source: Matt Coleman (LLNL)

Page 43: Towards Semantic Typing Support for  Scientific Workflows

Promoter Identification Workflow in Ptolemy-II [SSDBM’03]

[Screenshot callout: Execution Semantics]

Page 44: Towards Semantic Typing Support for  Scientific Workflows

[Screenshot callouts:]
• hand-crafted control solution; also: forces sequential execution!
• designed to fit
• hand-crafted Web-service actor
• Complex backward control-flow
• No data transformations available

Page 45: Towards Semantic Typing Support for  Scientific Workflows

Simplified Process Network PIW

• Back to purely functional dataflow process network (= a data streaming model!)
• Re-introducing map(f) to Ptolemy-II (was there in PT Classic)
  – no control-flow spaghetti
  – data-intensive apps
  – free concurrent execution
  – free type checking
  – automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId]

[Screenshot callouts:]
• map(f)-style iterators
• Powerful type checking
• Generic, declarative “programming” constructs
• Generic data transformation actors
• Forward-only, abstractable sub-workflow piw(GeneId)
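Going from piw(GeneId) to PIW := map(piw) over [GeneId] amounts to lifting a per-item function to a stream. A minimal Python sketch of that lifting follows; the body of `piw` is a hypothetical placeholder, since the actual sub-workflow is only shown as a screenshot, and `lift` is a name chosen for the example.

```python
from typing import Callable, Iterable, Iterator, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def lift(f: Callable[[A], B]) -> Callable[[Iterable[A]], Iterator[B]]:
    """map(f): turn a per-item sub-workflow into one over a stream of items."""
    def mapped(items: Iterable[A]) -> Iterator[B]:
        for item in items:  # forward-only: no backward control flow
            yield f(item)
    return mapped


def piw(gene_id: str) -> str:
    """Hypothetical stand-in for the per-gene sub-workflow piw(GeneId)."""
    return f"promoter-model({gene_id})"


PIW = lift(piw)  # PIW := map(piw) over [GeneId]

print(list(PIW(["g1", "g2"])))  # ['promoter-model(g1)', 'promoter-model(g2)']
```

The generator yields each result as soon as it is computed, which matches the streaming reading of the process network. Because `piw` has no side effects, the individual calls could also run concurrently.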

Page 46: Towards Semantic Typing Support for  Scientific Workflows

Optimization by Declarative Rewriting

• PIW as a declarative, referentially transparent functional process
  – optimization via functional rewriting possible, e.g. map(f o g) = map(f) o map(g)
• Details:
  – Technical report & PIW specification in Haskell

[Screenshot callouts: map(f o g) instead of map(f) o map(g); Combination of map and zip]

http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
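The rewriting law map(f o g) = map(f) o map(g) can be checked on a small example. The technical report gives the specification in Haskell; the Python below is only an illustrative sketch, and `f`, `g`, `compose`, and `fmap` are hypothetical names.

```python
def compose(f, g):
    """Function composition: (f o g)(x) = f(g(x))."""
    return lambda x: f(g(x))


def fmap(f):
    """map(f) as a function from lists to lists."""
    return lambda xs: [f(x) for x in xs]


# Hypothetical pure (side-effect free) actors.
def g(x):
    return x + 1


def f(x):
    return x * 2


two_passes = compose(fmap(f), fmap(g))  # map(f) o map(g): intermediate list
one_pass = fmap(compose(f, g))          # map(f o g): single traversal

data = [1, 2, 3]
print(two_passes(data), one_pass(data))  # [4, 6, 8] [4, 6, 8]
```

The equality only holds because `f` and `g` are referentially transparent. If either had side effects, fusing the two traversals would change the order in which those effects occur.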

Page 47: Towards Semantic Typing Support for  Scientific Workflows

Web Services & Scientific Workflows in Kepler

• Web services = individual components (“actors”)
• “Minute-Made” Application Integration:
  – Plugging-in and harvesting web service components is easy and fast
• Rich SWF modeling semantics (“directors” and more):
  – Different and precise dataflow models of computation
  – Clear and composable component interaction semantics
  – Web service composition and application integration tool
• Coming soon:
  – Shrink-wrapped, pre-packaged “Kepler-to-Go” (v0.8)
  – SWFs with structural and semantic data types (better design support)
  – Grid-enabled web services (for big data, big computations, …)
  – Different deployment models (SWF WS, web site, applet, …)

Page 48: Towards Semantic Typing Support for  Scientific Workflows

KEPLER Core Capabilities (1/2)

• Designing scientific workflows
  – Composition of actors (tasks) to perform a scientific WF
• Actor prototyping
• Accessing heterogeneous data
  – Data access wizard to search and retrieve Grid-based resources
  – Relational DB access and query
  – Ability to link to EML data sources

Page 49: Towards Semantic Typing Support for  Scientific Workflows

KEPLER Core Capabilities (2/2)

• Data transformation actors to link heterogeneous data
• Executing scientific workflows
  – Distributed and/or local computation
  – Various models for computational semantics and scheduling
  – SDF and PN: Most common for scientific workflows
• External computing environments:
  – C++, Python, C (… Perl--planned ...)
• Deploying scientific tasks and workflows as web services themselves (… planned …)

Page 50: Towards Semantic Typing Support for  Scientific Workflows

The KEPLER GUI (Vergil)

Drag and drop utilities, director and actor libraries.

Page 51: Towards Semantic Typing Support for  Scientific Workflows

Running the workflow

Page 52: Towards Semantic Typing Support for  Scientific Workflows

Distributed SWFs in KEPLER

• Web and Grid Service plug-ins
  – WSDL, and whatever comes after GWSDL
  – ProxyInit, GlobusGridJob, GridFTP, DataAccessWizard
• WS Harvester
  – Imports all the operations of a specific WS (or of all the WSs in a UDDI repository) as Kepler actors
• WS-deployment interface (…ongoing work…)
• XSLT and XQuery transformers to link non-fitting services together

Page 53: Towards Semantic Typing Support for  Scientific Workflows

A Generic Web Service Actor

Given a WSDL and the name of an operation of a web service, dynamically customizes itself to implement and execute that method.

Configure - select service operation

Page 54: Towards Semantic Typing Support for  Scientific Workflows

Set Parameters and Commit

Page 55: Towards Semantic Typing Support for  Scientific Workflows

WS Actor after Instantiation

Page 56: Towards Semantic Typing Support for  Scientific Workflows

Web Service Harvester

• Imports the web services in a repository into the actor library.
• Has the capability to search for web services based on a keyword.

Page 57: Towards Semantic Typing Support for  Scientific Workflows

Composing 3rd-Party WSs

[Screenshot callouts: Output of previous web service; User interaction & Transformations; Input of next web service]

Page 58: Towards Semantic Typing Support for  Scientific Workflows

Classifying with Kepler

Page 59: Towards Semantic Typing Support for  Scientific Workflows

Classifying with Kepler

Page 60: Towards Semantic Typing Support for  Scientific Workflows

Page 61: Towards Semantic Typing Support for  Scientific Workflows

SWF Designed in Kepler

Page 62: Towards Semantic Typing Support for  Scientific Workflows

Result launched via the BrowserUI actor

Page 63: Towards Semantic Typing Support for  Scientific Workflows

Querying Example

Page 64: Towards Semantic Typing Support for  Scientific Workflows

KEPLER and YOU

• Kepler …
  – is a community-based, cross-project, open source collaboration
  – uses web services as basic building blocks
  – has a joint CVS repository, mailing lists, web site, …
  – is gaining momentum thanks to contributors and contributions
• BSD-style license allows commercial spin-offs
  – a pre-packaged, shrink-wrapped version (“Kepler-to-GO”) coming soon to a place near you…

Page 65: Towards Semantic Typing Support for  Scientific Workflows

Now back to the “Semantics Stuff”

Page 66: Towards Semantic Typing Support for  Scientific Workflows

Semantic Types for Scientific Workflows

Page 67: Towards Semantic Typing Support for  Scientific Workflows

From Semantic to Structural Mappings

Page 68: Towards Semantic Typing Support for  Scientific Workflows

Structural and Semantic Mappings

Page 69: Towards Semantic Typing Support for  Scientific Workflows

Summary I: Putting it all together for the Science Environment for Ecological Knowledge

• Large collaborative NSF/ITR project: UNM, UCSB, UCSD, UKansas, ...
• Goals: global access to ecologically relevant data; rapidly locate and utilize distributed computation; (semi-)automate, streamline analysis process – “Knowledge Discovery Workflows”

[Figure: SEEK architecture]

• SEEK is the combination of EcoGrid data resources and information services, coupled with advanced semantic and modeling capabilities
• AM: Analysis & Modeling System (KEPLER)
  – Analytical Pipeline (AP) built from steps ASx, ASy, ASz, TS1, TS2, ASr
  – Parameters w/ Semantics
  – Library of Analysis Steps, Pipelines & Results
  – Example of “AP0”: Invasive species over time
• Semantic Mediation System (SMS)
  – Semantic Mediation Engine: Data Binding, Query Processing
  – Logic Rules ECO2-CL
  – Parameter Ontologies (ECO2, TaxOn)
• Semantic Mediation System & Analysis and Modeling System use WSDL/UDDI to access services within the EcoGrid, enabling analytically driven data discovery and integration
• EcoGrid provides unified access to Distributed Data Stores, Parameter Ontologies, & Stored Analyses, and runtime capabilities via the Execution Environment
  – Data stores: SRB, KNB, MC, Species, WrpDar, ...
  – Raw data sets wrapped for integration w/ EML, etc.
  – Execution Environment: SAS, MATLAB, FORTRAN, etc.

Page 70: Towards Semantic Typing Support for  Scientific Workflows

Outline

1. Motivation: Traditional vs Scientific Data Integration

2. Semantic (a.k.a. Model-Based) Mediation

3. Scientific Workflows (a.k.a. Analysis Pipelines)

4. DB Theory Appetizer: Web Service Composition Through Declarative Queries

Page 71: Towards Semantic Typing Support for  Scientific Workflows

Planning with Limited Access Patterns (back to GAV mediation …)

• User query Q: answer(ISBN, Author, Title) ← book(ISBN, Author, Title), catalog(ISBN, Author), not library(ISBN).
• Limited (web service) APIs (access patterns):
  – Src1.books: in: ISBN out: Author, Title
  – Src1.books: in: Author out: ISBN, Title
  – Src2.catalog: in: {} out: ISBN, Author
  – Src3.library: in: {} out: ISBN
• Note: Q is not executable, but feasible (equivalent to executable Q’: catalog ; book ; not library)

Page 72: Towards Semantic Typing Support for  Scientific Workflows

Query Feasibility is as hard as Containment

• Theorem [EDBT’04]: For UCQ¬ queries Q: Q is feasible iff ans(Q) ⊑ Q
• The answerable part ans(Q) can be computed in quadratic time. Idea: scan Q for answerable literals, rescan, repeat until ans(Q) is reached
• Checking query containment Q1 ⊑ Q2 is hard:
  – Already NP-complete for CQ (conjunctive queries)
  – Undecidable for FO (first-order logic queries)
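The “scan, rescan, repeat” idea can be sketched for the book/catalog/library query and the access patterns given earlier in the talk. This is a simplified illustration for a single conjunctive query with negation, not the algorithm from the cited paper: the encoding of literals and patterns and the names `PATTERNS`, `callable_now`, `is_executable`, and `answerable_part` are assumptions, and a negated literal is treated as callable only once all of its variables are bound.

```python
# Access patterns: relation -> list of sets of argument positions that
# must be bound (inputs) before the source can be called.
PATTERNS = {
    "book": [{0}, {1}],   # in: ISBN, or in: Author
    "catalog": [set()],   # no input required
    "library": [set()],   # no input required
}

# Q: answer(ISBN, Author, Title) <- book(...), catalog(...), not library(...)
Q = [
    (True, "book", ("ISBN", "Author", "Title")),
    (True, "catalog", ("ISBN", "Author")),
    (False, "library", ("ISBN",)),
]


def callable_now(literal, bound):
    """Can this literal be evaluated given the variables bound so far?"""
    positive, relation, args = literal
    if not positive:  # assumption: negation needs all its variables bound
        return set(args) <= bound
    return any(all(args[i] in bound for i in inputs)
               for inputs in PATTERNS[relation])


def is_executable(query):
    """Executable: every literal is callable in the given left-to-right order."""
    bound = set()
    for literal in query:
        if not callable_now(literal, bound):
            return False
        if literal[0]:
            bound |= set(literal[2])
    return True


def answerable_part(query):
    """Scan for callable literals, rescan, and repeat until nothing changes."""
    bound, plan, rest = set(), [], list(query)
    changed = True
    while changed:
        changed = False
        for literal in list(rest):
            if callable_now(literal, bound):
                plan.append(literal)
                rest.remove(literal)
                if literal[0]:
                    bound |= set(literal[2])
                changed = True
    return plan


plan = answerable_part(Q)
print(is_executable(Q), is_executable(plan))  # False True
print([relation for _, relation, _ in plan])  # ['catalog', 'library', 'book']
```

Each scan is linear in the number of literals and at most that many scans are needed, which is where the quadratic bound comes from. The order produced here (catalog, not library, book) differs from the Q’ shown on the earlier slide, but both are executable orderings of the same literals.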

Page 73: Towards Semantic Typing Support for  Scientific Workflows

Conjunctive Query Containment

• Given: conjunctive queries Q1, Q2 (aka Select-Project-Join queries)
• Problem: Is answers(D, Q1) ⊆ answers(D, Q2) for all databases D?
• If yes, we say that “Q1 is contained in Q2”; short: Q1 ⊑ Q2
• Examples:
  Q1: answer(X) ← student(X, cs)
  Q2: answer(X) ← student(X, Dept), advisor(X, Y), dept(Y, cs)
  Q3: answer(X) ← student(X, Dept)
• Quiz:
  – Q1 ⊑ Q2 ?
  – No: not every student X necessarily has an advisor Y who is in the cs department!
  – Q1 ⊑ Q3 ?
  – Yes: every cs student is a student in some department (crux of the “proof”: Dept = cs)
• Homework: What about Q1 ⊑ Q2 if we know that every student must have an advisor from the same department?
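The quiz answers can be checked mechanically with the classical containment-mapping test: Q1 ⊑ Q2 holds iff the body of Q2 can be mapped into the “frozen” body of Q1 while sending the head of Q2 to the head of Q1. The Python below is a minimal sketch of that test, not the Prolog checker from the talk; the query encoding and the names `is_var` and `contained_in` are assumptions, and the homework case with integrity constraints is not covered.

```python
def is_var(term):
    """Convention used in the examples: variables start with an uppercase letter."""
    return term[:1].isupper()


def contained_in(q1, q2):
    """Return True iff q1 is contained in q2 (a containment mapping q2 -> q1 exists).

    A query is (head_args, body); body is a list of (relation, args) atoms.
    """
    head1, body1 = q1
    head2, body2 = q2
    facts = set(body1)  # freeze q1: its variables act as constants

    def extend(theta, atoms):
        if not atoms:
            # All atoms of q2 are mapped; the heads must coincide as well.
            return tuple(theta.get(t, t) for t in head2) == tuple(head1)
        (relation, args), rest = atoms[0], atoms[1:]
        for fact_relation, fact_args in facts:
            if fact_relation != relation or len(fact_args) != len(args):
                continue
            candidate = dict(theta)
            consistent = True
            for term, value in zip(args, fact_args):
                if is_var(term):
                    if candidate.setdefault(term, value) != value:
                        consistent = False
                        break
                elif term != value:
                    consistent = False
                    break
            if consistent and extend(candidate, rest):
                return True
        return False

    return extend({}, list(body2))


Q1 = (("X",), [("student", ("X", "cs"))])
Q2 = (("X",), [("student", ("X", "Dept")),
               ("advisor", ("X", "Y")),
               ("dept", ("Y", "cs"))])
Q3 = (("X",), [("student", ("X", "Dept"))])

print(contained_in(Q1, Q2))  # False
print(contained_in(Q1, Q3))  # True
```

The backtracking search over candidate mappings is exponential in the worst case, in line with the NP-completeness of the problem.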

Page 74: Towards Semantic Typing Support for  Scientific Workflows

The World’s Shortest Conjunctive Query Containment Checker (an NP-complete problem): 7 lines in Prolog …

Quiz:
1. Find the bug in the 7 lines of code
2. Fix the bug (hint: add one more line of code)

Moral: Short programs can be buggy too

Page 75: Towards Semantic Typing Support for  Scientific Workflows

Summary II: Got milk/eggs/meat/wool?
Or: “Die eierlegende Wollmilchsau …” (“the egg-laying wool-milk sow”)

• Data Integration
  – query rewriting under GAV/LAV
  – w/ binding pattern constraints
  – distributed query processing
• Semantic Mediation
  – semantic integrity constraints, reasoning w/ plans, automated deduction
  – deductive database/logic programming technology, AI “stuff” ...
  – Semantic Web technology
• Scientific Workflow Management
  – more procedural than database mediation (the scientist is the “query planner”)
  – deployment using web services

Page 76: Towards Semantic Typing Support for  Scientific Workflows

Science Environment for Ecological Knowledge

• Large collaborative NSF/ITR project: UNM, UCSB, UCSD, UKansas, ...
• Goals: global access to ecologically relevant data; rapidly locate and utilize distributed computation; (semi-)automate, streamline analysis process – “Knowledge Discovery Workflows”

[Figure: SEEK architecture, the same diagram as on the Summary I slide]

Page 77: Towards Semantic Typing Support for  Scientific Workflows

Building the EcoGrid

[Figure: map of EcoGrid nodes. Node types: Metacat node, SRB node, DiGIR node, VegBank node, Xanthoria node, Legacy system. Labeled sites: AND, LUQ, HBR, NTL, VCR]

• LTER Network (24)
• Natural History Collections (>> 100)
• Organization of Biological Field Stations (180)
• UC Natural Reserve System (36)
• Partnership for Interdisciplinary Studies of Coastal Oceans (4)
• Multi-agency Rocky Intertidal Network (60)

Source: Matthew Jones (UCSB)

Page 78: Towards Semantic Typing Support for  Scientific Workflows

Heterogeneous Data Integration

• Requires advanced metadata and processing
  – Attributes must be semantically typed
  – Collection protocols must be known
  – Units and measurement scale must be known
  – Measurement relationships must be known
    • e.g., that ArealDensity = Count/Area

Page 79: Towards Semantic Typing Support for  Scientific Workflows

Ecological ontologies

• What was measured (e.g., biomass)
• Type of measurement (e.g., Energy)
• Context of measurement (e.g., Psychotria limonensis)
• How it was measured (e.g., dry weight)
• SEEK intends to enable community-created ecological ontologies using OWL
  – Represents a controlled vocabulary for ecological metadata
• More about this in Bertram’s talk

Page 80: Towards Semantic Typing Support for  Scientific Workflows

Semantic Mediation

• Label data with semantic types (e.g. concept expressions in OWL)
• Label inputs and outputs of analytical components with semantic types
• Use reasoning engines to generate transformation steps
  – Observe analytical constraints
• Use reasoning engine to discover relevant components

[Figure labels: Data, Ontology, Workflow Components]
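Once data and component ports carry semantic types, a connection can be checked by subsumption: the concept produced upstream must be the same as, or more specific than, the concept expected downstream. The sketch below illustrates that check with a plain subclass table; a real system would delegate to an OWL reasoner, and the table contents and the names `SUBCLASS_OF`, `subsumed_by`, and `compatible` are assumptions for the example.

```python
# Hypothetical concept hierarchy (child -> parent), using names from the talk.
SUBCLASS_OF = {
    "SpeciesCount": "MeasurableItem",
    "SpeciesAbundance": "Abundance",
    "Abundance": "Measurement",
}


def subsumed_by(concept, other):
    """True if concept equals other or is a (transitive) subclass of it."""
    while concept is not None:
        if concept == other:
            return True
        concept = SUBCLASS_OF.get(concept)
    return False


def compatible(output_type, input_type):
    """A connection is well-typed if the producer's output concept is
    subsumed by the consumer's input concept."""
    return subsumed_by(output_type, input_type)


print(compatible("SpeciesAbundance", "Measurement"))  # True
print(compatible("Measurement", "SpeciesAbundance"))  # False
```

The same test supports component discovery: the relevant components for a data set are those with an input port whose concept subsumes the semantic type of the data.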