Post on 01-Apr-2015

Taverna, myExperiment and BioCatalogue: Workflow Tools for Informatics Integration

Dr Katy Wolstencroft

School of Computer Science

University of Manchester

• Interoperability, Integration and Collaboration

• Access to distributed and local resources

• Iteration over data sets
• Automation of data flow
• Agile software development
• Extensible
• Experimental protocols
• Part of the myGrid toolkit

Taverna Workflows

What is myGrid?

An e-Science Collaboration Since 2001

• Software ● Services ● Content ● Skills ● Community

• Manchester, Southampton, Oxford and the EMBL-EBI, plus an alliance of international contributing projects and partners

• Sustainable production-level quality
– Open Middleware Infrastructure Institute UK
– Software Sustainability Institute
– Mixture of developers, bioinformaticians and researchers

• Open source development and content (LGPL or BSD)

Connecting Things Together

• Data Resources
– Genome databases
– Kinetic/metabolite data

• Analysis tools
– Sequence alignment
– Similarity searching
– Pattern matching

• Knowledge Resources
– Ontologies
– Controlled vocabularies

Create and run workflows

Share, discover and reuse workflows

Manage the metadata needed and generated

RDF, OWL

Discover and reuse services

Feta

A Collection of Components

What is a Workflow

• Set of services (web services, RESTful, local scripts, other workflows)

• Set of data links between services - “put output X from service A as input Y to service B”
– If needed: List handling, control links

• This can be called a data-oriented workflow (dataflow)

– Say where you want the data to flow instead of what you want to do

– Compare with more procedural workflow languages like BPEL

• Beneficial way of thinking for much data-driven scientific research
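The dataflow idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Taverna's engine or workflow format: the `Link` type, the `run_dataflow` function and the two toy services "A" and "B" are hypothetical names invented for this example. A workflow is a set of services plus data links of the form "output X of service A feeds input Y of service B", and the execution order falls out of the links rather than being written down.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical shape: a service maps named inputs to named outputs.
Service = Callable[[Dict[str, object]], Dict[str, object]]


@dataclass(frozen=True)
class Link:
    """Data link: output `src_port` of `src` feeds input `dst_port` of `dst`."""
    src: str
    src_port: str
    dst: str
    dst_port: str


def run_dataflow(
    services: Dict[str, Service],
    links: List[Link],
    workflow_inputs: Dict[str, Dict[str, object]],
) -> Dict[str, Dict[str, object]]:
    """Run each service once all of its linked inputs have arrived."""
    inputs = {name: dict(workflow_inputs.get(name, {})) for name in services}
    needed = {name: {l.dst_port for l in links if l.dst == name} for name in services}
    outputs: Dict[str, Dict[str, object]] = {}
    pending = set(services)
    while pending:
        ready = sorted(n for n in pending if needed[n].issubset(inputs[n]))
        if not ready:
            raise ValueError("cycle or unsatisfied input in data links")
        for name in ready:
            outputs[name] = services[name](inputs[name])
            pending.remove(name)
            # Push the new outputs along every outgoing data link.
            for l in links:
                if l.src == name:
                    inputs[l.dst][l.dst_port] = outputs[name][l.src_port]
    return outputs


# "Put output X from service A as input Y to service B".
services = {
    "A": lambda inp: {"X": inp["text"].upper()},
    "B": lambda inp: {"length": len(inp["Y"])},
}
links = [Link(src="A", src_port="X", dst="B", dst_port="Y")]
result = run_dataflow(services, links, {"A": {"text": "acgt"}})
```

Note that nothing in the example says "run A, then B"; B simply cannot start until its input Y has a value.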

Kepler

Triana

BPEL

Ptolemy II

Taverna

Workflow diagram

Tree view of workflow structure

Available services

Taverna

Open source and extensible

Taverna GUI and Enactor

Taverna Remote Execution service

T-REX

Graphical Workbench
Drag and drop interface

Plug-in architecture
Nested Workflows

Workflow Enactor
Local and remote enactor
Implicit iteration over data collections
Automation of data flow
Logging and data provenance tracking
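Implicit iteration can be pictured as follows: when a service that expects a single value is handed a list, the enactor calls it once per element and collects the results into a list. The sketch below assumes that simplified rule for a single input. `implicit_iterate` and `gc_content` are hypothetical names, and Taverna's own list handling (for example across several inputs) is richer than this.

```python
from typing import Any, Callable


def implicit_iterate(service: Callable[[Any], Any], data: Any) -> Any:
    """Apply a single-item service to a value, or element-wise to (nested) lists."""
    if isinstance(data, list):
        return [implicit_iterate(service, item) for item in data]
    return service(data)


def gc_content(seq: str) -> float:
    """Toy single-item service: fraction of G/C bases in one sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


# The same service works unchanged on one item or on a whole collection.
single = implicit_iterate(gc_content, "GGCC")
many = implicit_iterate(gc_content, ["GGCC", "ATAT", "ATGC"])
```

The workflow author only wires up `gc_content`; the loop over the data set is supplied by the enactor.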

Taverna http://www.taverna.org.uk

Software Release
• Taverna first released 2004.
• Current versions 1.7.2 and Taverna 2.1.2
• Currently 1500+ users per month, 350+ organizations, ~40 countries, 80000+ downloads across versions

Availability
• Freely available, open source LGPL
• Windows, Mac OS, and Linux

Resources
• http://www.taverna.org.uk, http://www.mygrid.org.uk
• User and developer workshops, documentation, email help desk
• Collaborations with numerous groups including NCI’s cancer Biomedical Informatics Grid (caBIG), EMBL-EBI, NCBI, Concept Web Alliance, Bio2RDF

What types of service?

• WSDL Web Services
• BioMart
• R-processor
• BioMoby
• Soaplab
• Grid Services
• Local Java services
• Beanshell
• Workflows
• Coming soon: new REST support

Who Provides the Services?

• Open domain services and resources
• Taverna accesses 3500+ services (11,874 operations)
• Third party – we don’t own them – we didn’t build them
• All the major providers
– NCBI, DDBJ, EBI …
• Enforce NO common data model.
• Quality Web Services considered desirable

What do Scientists use Taverna for?

Taverna Adoption

• Astronomy
• Music
• Meteorology
• Social Science
• Cheminformatics
• Systems Biology
• UK Institutes
• International Institutes
• International Networks
• Universities
• Projects
• Lots of Universities

Hypothesis Construction and Explanation from the Literature
my BioAID, Vl-e

Manipulation of SBML models in workflows

Pharmacogenomics
Association study of Nevirapine-induced skin rash in Thai Population

Data Warehousing
tGRAP Database

Rescue

Genome-wide SNP Analysis

• Analysis over compute clusters
• Automate annotation of results
• Mine annotation data for patterns

[Hoyle]

Shared Genomics

Taverna Grid Use Cases

– KnowArc – The Grid-enabled Know-how Sharing Technology Based on ARC Services and Open Standards
– caGrid – US Cancer Research project
– Moteur – A medical imaging project running on EGEE

Lymphoma Prediction Workflow

Use gene-expression patterns associated with two lymphoma types to predict the type of an unknown sample.

• Stages: microarray from tumor tissue, microarray preprocessing, lymphoma prediction
• Services: caArray, GenePattern

Wei Tan, Univ. Chicago

Ack. Juli Klemm, Xiaopeng Bian, Rashmi Srinivasa (NCI), Jared Nedzel (MIT)

caGrid Plugin for Taverna

• Taverna support for GAARDS-secured caGrid services

• Wrap existing 3rd party services (that are used by existing Taverna users) for caGrid and annotate them to match compatibility guidelines

• Enables discovery of services in caGrid service registry

Lymphoma type prediction workflow

Genotype Phenotype Studies

• Mouse whipworm infection - parasite model of the human parasite Trichuris trichiura

Understanding Phenotype
• Comparing resistant vs susceptible strains – Microarrays

Understanding Genotype
• Mapping quantitative traits – Classical genetics QTL

Joanne Pennock, Richard Grencis
University of Manchester

Workflow Results

• Identified the biological pathways involved in sex dependence in the mouse model, previously believed to be involved in the ability of mice to expel the parasite.

• Manual experimentation: Two year study of candidate genes, processes unidentified

• JO IS A LAB BIOLOGIST

• JO HAS NEVER BUILT A WORKFLOW

Workflow Results

Understanding Phenotype

• Comparing resistant vs susceptible strains – Microarrays

Understanding Genotype

• Mapping quantitative traits – Classical genetics QTL

Integrated Microarray data, genomic sequences, pathway data, literature mining.

Trypanosomiasis Study

Identified a pathway whose correlating gene (Daxx) is believed to play a role in trypanosomiasis resistance

Paul Fisher et al., Nucleic Acids Research, 2007, 35(16)

http://www.youtube.com/watch?v=x83pzMMw7lk
http://www.youtube.com/watch?v=Y6_Kz5L010g

Just Enough Sharing….

• myExperiment can provide a central location for workflows from one community/group

• myExperiment allows you to say
– Who can look at your workflow
– Who can download your workflow
– Who can modify your workflow
– Who can run your workflow
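These four questions map naturally onto a per-workflow sharing policy. The sketch below only illustrates that idea: the `Action` enum, the `SharingPolicy` class, the owner-can-do-everything rule and the user names are assumptions made for this example and do not describe myExperiment's actual data model or API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Set


class Action(Enum):
    VIEW = "view"
    DOWNLOAD = "download"
    MODIFY = "modify"
    RUN = "run"


@dataclass
class SharingPolicy:
    """Who may do what with one workflow (hypothetical model)."""
    owner: str
    allowed: Dict[Action, Set[str]] = field(default_factory=dict)

    def grant(self, action: Action, user: str) -> None:
        self.allowed.setdefault(action, set()).add(user)

    def can(self, user: str, action: Action) -> bool:
        # Assumption: the owner is always allowed every action.
        return user == self.owner or user in self.allowed.get(action, set())


policy = SharingPolicy(owner="alice")
policy.grant(Action.VIEW, "bob")
policy.grant(Action.RUN, "bob")
```

Here "bob" may look at and run the workflow but may not download or modify it, which is the kind of "just enough" sharing the slide describes.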

The most important aspect of myExperiment - Designed by scientists

Ownership and Attribution

• Packs allow you to collect different items together, like you might with a "wish list" or "shopping basket"

• You can collect internal things (such as workflows, files and even other packs) as well as link to things outside myExperiment

• Your packs can then be shared, tagged, discovered and discussed easily on myExperiment

Packs

Bringing myExperiment to the Taverna User

myExperiment Plugin in Taverna

Running Workflows Through myExperiment

Taverna Remote Execution (T-REX)

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX myexp: <http://rdf.myexperiment.org/ontology#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
select ?friend1 ?friend2 ?acceptedat where {
  ?z rdf:type <http://rdf.myexperiment.org/ontology#Friendship> .
  ?z myexp:has-requester ?x .
  ?x sioc:name ?friend1 .
  ?z myexp:has-accepter ?y .
  ?y sioc:name ?friend2 .
  ?z myexp:accepted-at ?acceptedat
}

All accepted Friendships including accepted-at time

Semantically-Interlinked Online Communities
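A query like the friendship query above is normally sent to a SPARQL endpoint as the `query` parameter of an HTTP GET request. The sketch below only builds that request URL with the Python standard library and makes no network call. The endpoint address is a placeholder, since the slides do not name one, and `build_query_url` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoint: an assumption, not taken from the slides.
ENDPOINT = "http://example.org/sparql"

FRIENDSHIP_QUERY = """\
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX myexp: <http://rdf.myexperiment.org/ontology#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
select ?friend1 ?friend2 ?acceptedat where {
  ?z rdf:type <http://rdf.myexperiment.org/ontology#Friendship> .
  ?z myexp:has-requester ?x .
  ?x sioc:name ?friend1 .
  ?z myexp:has-accepter ?y .
  ?y sioc:name ?friend2 .
  ?z myexp:accepted-at ?acceptedat
}
"""


def build_query_url(endpoint: str, query: str) -> str:
    """Percent-encode the query text into a GET URL for a SPARQL endpoint."""
    return endpoint + "?" + urlencode({"query": query})


url = build_query_url(ENDPOINT, FRIENDSHIP_QUERY)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

Fetching `url` with any HTTP client would then return the result bindings for `?friend1`, `?friend2` and `?acceptedat`, in whatever format the endpoint offers.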

Service Discovery

There are thousands of distributed services. How do we find an appropriate one?

• We need to annotate services by their functions (and not their names!)

• The services might be distributed, but a registry of service descriptions can be central and queried
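A central registry of service descriptions, queried by function rather than by name, can be sketched as below. The `ServiceDescription` and `Registry` classes, the service names and the example URLs are hypothetical, chosen only to illustrate the idea; they are not BioCatalogue's or Feta's data model.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class ServiceDescription:
    name: str                  # often uninformative on its own
    endpoint: str              # where the distributed service actually lives
    functions: FrozenSet[str]  # annotations describing what the service does


class Registry:
    """Central index of descriptions; the services themselves stay distributed."""

    def __init__(self) -> None:
        self._by_function: Dict[str, List[ServiceDescription]] = {}

    def register(self, desc: ServiceDescription) -> None:
        for function in desc.functions:
            self._by_function.setdefault(function, []).append(desc)

    def find(self, function: str) -> List[ServiceDescription]:
        return list(self._by_function.get(function, []))


registry = Registry()
registry.register(ServiceDescription(
    "svc42", "http://example.org/a", frozenset({"sequence alignment"})))
registry.register(ServiceDescription(
    "runTool", "http://example.net/b",
    frozenset({"similarity searching", "sequence alignment"})))

# Neither name mentions alignment, yet both are found by function.
hits = [d.name for d in registry.find("sequence alignment")]
```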

BioCatalogue

www.biocatalogue.org

• A “Web 2.0” catalogue for sharing, discovering and monitoring web services for the Life Sciences.

• Community and expert curation
• Community and provider contribution
• Launched mid 2009.
• Currently: 370+ members, 1700+ services, 11,870+ operations
• 110+ providers, 110+ different countries

REST APIs
Linked Open Data
Software: open source, BSD

Data and Provenance

• Workflows can generate vast amounts of data - how can we manage and track it?

• We need to manage data AND metadata AND experimental provenance

• Scientists need to check back over past results, compare workflow runs and share workflow runs with colleagues

• Scientists need to look at intermediate results when designing and debugging
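One way to picture these requirements is a store that records, for every workflow run, the value produced at each step, so that intermediate results can be inspected and two runs compared. The `ProvenanceStore` class, its method names and the example steps are hypothetical. This is a sketch of the idea, not Taverna's provenance implementation or the Open Provenance Model.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str, object]  # (step name, output port, value)


class ProvenanceStore:
    def __init__(self) -> None:
        self._runs: Dict[str, List[Record]] = {}

    def record(self, run_id: str, step: str, port: str, value: object) -> None:
        """Log one value produced by one step during one run."""
        self._runs.setdefault(run_id, []).append((step, port, value))

    def intermediate(self, run_id: str, step: str) -> Dict[str, object]:
        """Look back at what a given step produced in a given run."""
        return {p: v for s, p, v in self._runs.get(run_id, []) if s == step}

    def diff(self, run_a: str, run_b: str) -> List[Tuple[str, str]]:
        """List the (step, port) pairs whose values differ between two runs."""
        a = {(s, p): v for s, p, v in self._runs.get(run_a, [])}
        b = {(s, p): v for s, p, v in self._runs.get(run_b, [])}
        return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


store = ProvenanceStore()
store.record("run1", "align", "score", 0.91)
store.record("run1", "filter", "kept", 12)
store.record("run2", "align", "score", 0.91)
store.record("run2", "filter", "kept", 15)

changed = store.diff("run1", "run2")
align_outputs = store.intermediate("run1", "align")
```

Comparing the two runs shows that only the `filter` step changed, which is the kind of question a scientist asks when debugging or revisiting past results.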

Provenance

• Screenshot of provenance view

myGrid Open Suite of Tools

Client User Interfaces

Workflow GUI Workbench

Workflow Repository

Service Catalogue

Third Party Tools

Programming and APIs

Web Portal

Activity and Service Plug-in Manager

Provenance Store

Workflow Server

Open Provenance Model

Secure Service Access

Toolkits “Taverna Inside”

Workflows under the hood

• e-Laboratories (portals)
– Systems Biology, e-Health

• Web based execution
– Running workflows over the web through myExperiment

• Visualisation clients that call workflows in the background

Open e-Lab Platforms

• Customised myExperiment instances
– Australian Kepler Repository
– eStat, NeuroHub, Nema
– SpaceBook, HPC/NA
– Microsoft Trident

• BioCatalogue installations
– Emory – ed unify project
– Eli Lilly

SysMO-SEEK

• e-Laboratory for interlinking and sharing data, models, SOPs and workflows for Systems Biology in Europe

• ISA-TAB & SBML/MIRIAM compliant

Current Work

Taverna 2.2

Released end June

• Workflow diagnostics and error resolution
• Retry and parallelisation
• Stop/pause/resume workflows
• Intermediate results display

Taverna Roadmap

• Next Generation Workbench
• Access to service, data and workflow repositories
• More data driven
• Component families for vertical markets
• Workflow Patterns
• Taverna from Excel

“myGrid-in-a-Box” – Virtualised Taverna server deployment and distribution, bundle of myExperiment, BioCatalogue and database/tools components.

Taverna Labs

• Semantic Taverna
– Semantic provenance
  • Open Provenance Model
– Linked Open Data
  • Dutch NBIC Aida toolkit
– Automated workflow planning through reasoning
  • e-Lico with U Zurich and Rapid-Miner

• Taverna in the Cloud

• Blogging the lab book
– Blog3 with Southampton U

Training

• Tutorials and Training
– 58+ tutorials to >900 people.
– >20 universities, Life Science institutes, and networks.
– Major Bio conferences
– Summer schools in Biology and Middleware.

• Developer and User Days
– Annotation Jamborees

• Undergraduate and Postgraduate Bioinformatics in >30 universities.

More Information

• myGrid
– http://www.mygrid.org.uk

• Taverna
– http://www.taverna.org.uk

• myExperiment
– http://www.myexperiment.org
– http://wiki.myexperiment.org

• BioCatalogue
– http://www.biocatalogue.org
– http://beta.biocatalogue.org
