Taverna workflows in the Cloud

Robert Haines, University of Manchester, [email protected]

Upload: mygrid-team

Post on 24-Jun-2015


Page 1: Taverna workflows in the cloud

Taverna workflows in the Cloud

Robert Haines
University of Manchester

[email protected]

Page 2: Taverna workflows in the cloud

Taverna* and workflows

*Other workflow systems are available

Page 3: Taverna workflows in the cloud

Taverna workflows

• Sophisticated analysis pipelines
• A set of services to analyze or manage data (local or remote)
• Workflows run through the workbench or via a server
• Automation of data flow through services
• Control of service invocation
• Iteration over data sets
• Provenance collection
• Extensible and open source

Page 4: Taverna workflows in the cloud

Taverna Workbench

• Desktop application
• GUI
• Plug-in Framework
• Intermediate results views
• Search for Web Services in catalogues
• Search and publish to myExperiment

Page 5: Taverna workflows in the cloud

Taverna Server family

• Taverna Server
  – Multiple clients, Multi-user
  – Local and large scale infrastructures
  – Site Replication
• Taverna Server Amazon Image
  – Local R server
  – Multiple instances in Amazon Cloud and as required, for multiple users/uses and different security scenarios
• Taverna Virtual Machine
• Taverna Command Line
• Bundled Servers, Services and Tools

Page 6: Taverna workflows in the cloud

Users are not the same… any one individual can be all of these

• Pro Makers: Technical Experts
  – Rich power tools
  – Control, flexibility, expressivity
• In the Field Users
  – Re-modellers
    • Simplified though limited tools
    • Revise variants, tweaking
    • Inspection and guidance
  – Vanilla Users: Pre-cooked workflows
    • Point and click / form fill / ambient configuration
    • Web based / Bespoke / Embedded launch

Workbench

Lite

Page 7: Taverna workflows in the cloud

Taverna Tool Spectrum

[Figure: spectrum of tools running from the Technical Computational Scientist to the Domain Scientist. Tools shown: Workbench, Workbench Components, Lite, Domain-Specific Website / Tool, Player, Command Line. Scales (High to Low): Workflow Visibility; Concept Knowledge (Taverna, Domain).]

Page 8: Taverna workflows in the cloud

The Taverna Suite of Tools

[Diagram labels: Client User Interfaces; User Interfaces; Workflow Repository; Service Catalogue; Third Party Tools; Web Portals; Activity and Service Plug-in Manager; Workflow Provenance; Workflow Server; Secure Service Access; Credential Manager; Workflow Engine; Virtual Machine; Prog & APIs; Command Line; Taverna Lite; Player; Taverna Workbench]

Page 9: Taverna workflows in the cloud

Freely available, open source

Current Version 2.4

80,000+ downloads across versions

Part of the myGrid Toolkit

Windows/Mac OS X/Linux/unix

Katherine Wolstencroft, Robert Haines, Donal Fellows, Alan Williams, David Withers, Stuart Owen, Stian Soiland-Reyes, Ian Dunlop, Aleksandra Nenadic, Paul Fisher, Jiten Bhagat, Khalid Belhajjame, Finn Bacall, Alex Hardisty, Abraham Nieva de la Hidalga, Maria P. Balcazar Vargas, Shoaib Sufi, and Carole Goble: “The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, Web or in the Cloud”, Nucleic Acids Res., May 2013. doi:10.1093/nar/gkt328

Taverna – www.taverna.org.uk

Page 10: Taverna workflows in the cloud

Workflows in the Cloud

Biodiversity Virtual e-Laboratory

Page 11: Taverna workflows in the cloud

Biodiversity Virtual e-Laboratory

• BioVeL is an international network of experts
  – Connects two scientific communities: IT and biodiversity
• “Pals” system
  – Roughly a three-way split between:
    • Biodiversity scientists, Biodiversity Informaticians, Computer Scientists
  – Shares expertise in workflow studies among BioVeL’s users and friends
  – Fosters an international community of researchers and partners on biodiversity issues

Page 12: Taverna workflows in the cloud

Biodiversity Virtual e-Laboratory

• BioVeL users want to be able to:
  – Import data from their own research and/or from existing libraries
  – Use workflows to process vast amounts of data
  – Build their own workflows
  – Access a library of workflows and re-use existing workflows
  – Cut down research time and overhead expenses
  – Contribute to other such initiatives, such as LifeWatch and GEO BON

Page 13: Taverna workflows in the cloud

Ecological niche modeling of an invasive species

[Figure: species occurrence data alongside environmental layers: salinity, bottom temperature, ice concentration, primary production]

Page 14: Taverna workflows in the cloud

Semi-automatized ecological niche modeling workflow

[Diagram: workflow steps]
• High quality occurrence data set
• Select layers with environmental factors that are likely to influence the distribution of the species
• Select algorithm
• Select parameter values for the chosen algorithm
• Create model
  – Assemble the model on CRIA server
• Model test
  – Test the performance of the parameter in the model
  – Test performance of the distribution prediction on the model
• Model projection
  – Project Model with original layers
  – Select prediction layers (e.g. 2050)
  – Project Model with prediction layers
• Statistical analysis of the raster data
• Side label: Changing algorithm, parameter values, and set of layers

• Scientist’s PowerPoint workflow
  – Used every day
• Came to Manchester
• Two days with a Taverna developer
  – Not Scalable!
• First iteration of workflow produced

Page 15: Taverna workflows in the cloud

Ecological niche modeling workflow

Scary!

Page 16: Taverna workflows in the cloud

Ecological niche modeling workflow

Better?

Page 17: Taverna workflows in the cloud

Population Modelling

• Is the population growing or declining?
• What effect has exploitation or other stimulus had on the population?
• Which stage should be the focus of conservation?

Year 1
• Stage
• # flowers/fruits
• Other variable

[Life-cycle diagram, stages labelled S, J, V, G, D]

Year 2
• Survival
• Stage 2
• # flowers/fruits 2
• # of seedlings recruited
• Other variables 2

SURVIVAL, GROWTH RATE, FECUNDITY, RECRUITMENT
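The first question on this slide ("growing or declining?") is commonly answered with a stage-structured projection matrix built from yearly survival, growth and fecundity estimates. The following is a minimal sketch of that general technique, not the BioVeL workflow itself. The three stage names and all matrix values are hypothetical, chosen only for illustration.

```python
def project(matrix, population):
    """Advance stage counts by one year: next = matrix applied to population."""
    return [sum(a * n for a, n in zip(row, population)) for row in matrix]


def growth_rate(matrix, iterations=500):
    """Estimate the long-run yearly growth rate by power iteration."""
    size = len(matrix)
    population = [1.0 / size] * size  # stage proportions, summing to 1
    rate = 0.0
    for _ in range(iterations):
        projected = project(matrix, population)
        rate = sum(projected)  # growth factor, because population sums to 1
        population = [x / rate for x in projected]
    return rate


# Hypothetical stages: seedling, juvenile, adult.
# Rows are the destination stage, columns the source stage.
growing = [
    [0.0, 0.0, 5.0],  # seedlings recruited per adult (fecundity)
    [0.3, 0.4, 0.0],  # seedling -> juvenile, juvenile stays juvenile
    [0.0, 0.3, 0.9],  # juvenile -> adult, adult survival
]
declining = [
    [0.0, 0.0, 0.5],  # same survival, much lower fecundity
    [0.3, 0.4, 0.0],
    [0.0, 0.3, 0.9],
]

for name, matrix in (("growing", growing), ("declining", declining)):
    rate = growth_rate(matrix)
    trend = "growing" if rate > 1 else "declining"
    print(f"{name} example: rate {rate:.3f}, population is {trend}")
```

A rate above 1 indicates a growing population and a rate below 1 a declining one. Changing a single entry and recomputing the rate is one simple way to explore which stage matters most for conservation.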

Page 18: Taverna workflows in the cloud

Population Modelling Workflow

Page 19: Taverna workflows in the cloud

Simplifications for users

• Pre-cooked workflows
  – In myExperiment
• Run from the Web
  – Taverna Player
• Wire into familiar tools
  – Spreadsheets
  – Community portals, e.g. ViBRANT Scratchpads
• Packaging
  – Taverna VM

Page 20: Taverna workflows in the cloud

Making it “too simple” for users!?

• Portal
  – Can handle many users
  – Makes it very easy to run workflows
• So we see lots of workflow runs!
  – Which is GREAT!
• Taverna has big requirements
  – BioVeL workflows are BIG
  – High CPU/Memory
  – Per running workflow
• Taverna becomes the bottleneck

Page 21: Taverna workflows in the cloud

Scale workflows: More Taverna!

• Scale and load-balance Taverna
  – Now we can run loads more workflows
• Users are happy
• Service providers are NOT!
  – Using services – Good
  – Overloading services – Bad

** Please imagine loads of arrows here!
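Scaling and load-balancing Taverna can be pictured as a dispatcher that sends each new run to the least-loaded server. The sketch below is an illustration under assumptions, not the BioVeL implementation. The class name `TavernaPool`, the server names and the per-server run limit are all hypothetical.

```python
class TavernaPool:
    """Toy least-loaded dispatcher over a fixed set of workflow servers."""

    def __init__(self, server_names, max_runs_per_server):
        self.max_runs = max_runs_per_server
        self.load = {name: 0 for name in server_names}

    def dispatch(self):
        """Return the server chosen for a new run, or None if all are full."""
        name = min(self.load, key=self.load.get)  # least-loaded server
        if self.load[name] >= self.max_runs:
            return None
        self.load[name] += 1
        return name

    def release(self, name):
        """Record that a run on the named server has finished."""
        self.load[name] -= 1


pool = TavernaPool(["taverna-1", "taverna-2"], max_runs_per_server=1)
first = pool.dispatch()
second = pool.dispatch()
third = pool.dispatch()  # both servers are full at this point
```

A dispatcher like this only balances the Taverna side. Every run still calls the same remote services, so their total load grows with the number of servers, which is the problem this slide raises.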

Page 22: Taverna workflows in the cloud

Scale workflows: More services?

• We need to replicate services
  – Bundle local to Taverna?
• But we don’t “own” all services
  – Too big/complex for us to replicate? (Data)
  – Closed source?
• BioVeL has (some) funds to help service providers
  – Scale, redesign, re-engineer?
• Partnerships/MOUs

Page 23: Taverna workflows in the cloud

Data: Local services

• Data can be uploaded once
• It is:
  – Within your firewall/DMZ/VPC
  – Secure
  – Easy to access by services
  – In the right place at the right time
• Data can be read/written by services
  – Quickly
  – Without worrying about security
  – At no cost (£)

Page 24: Taverna workflows in the cloud

Data: Remote services

• Data should be uploaded once
• It is:
  – Within your firewall/DMZ/VPC
  – Secure
• It is not:
  – Easy to access by services
  – In the right place at the right time
• To pass data between services it must be moved
  – Need secure third-party access
  – Bandwidth costs in to and out of the Cloud
  – Need “pass by reference”
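“Pass by reference” means that services hand each other a short reference to data held in a shared store, instead of copying the payload through the workflow engine. The sketch below illustrates the idea with an in-memory store. The `DataStore` class, the `store://` reference scheme and the two toy services are hypothetical, and a real deployment would also need the secure third-party access mentioned on this slide.

```python
import uuid


class DataStore:
    """Toy shared store: data is uploaded once and addressed by reference."""

    def __init__(self):
        self._objects = {}

    def put(self, payload: bytes) -> str:
        ref = f"store://{uuid.uuid4()}"  # hypothetical reference scheme
        self._objects[ref] = payload
        return ref

    def get(self, ref: str) -> bytes:
        return self._objects[ref]


def uppercase_service(store: DataStore, input_ref: str) -> str:
    """Reads its input by reference and returns a reference to its output."""
    data = store.get(input_ref)
    return store.put(data.upper())


def count_service(store: DataStore, input_ref: str) -> int:
    """Second service in the chain, also given only a reference."""
    return len(store.get(input_ref))


store = DataStore()
input_ref = store.put(b"acgt" * 1000)            # upload once
output_ref = uppercase_service(store, input_ref)  # only references move
total = count_service(store, output_ref)
```

The workflow itself only ever handles `input_ref` and `output_ref`, which stay a few dozen characters long however large the data becomes.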

Page 25: Taverna workflows in the cloud

Workflows in the Cloud

Cloud Analytics for Life Sciences

Page 26: Taverna workflows in the cloud

SNP annotation

Annotation task
• Location, Gene, Transcript
• Present in public databases, dbSNP, etc.
• Frequency in e.g. 1000 Genomes data
• Conservation data (cross species)

Page 27: Taverna workflows in the cloud

Infrastructure Requirements

• Execute analysis workflows
• Accessible to clinicians and genetic testers
• Cope with expanding demands on compute
• Provide a secure environment
• Collect provenance

Page 28: Taverna workflows in the cloud

Architecture overview

[Architecture diagram labels: Web interface (Inputs, Results); Storage (S3); Ensembl (MySQL); Cache (S3); Common API; Workflow engine orchestrator (Taverna, e-Hive, Other?); three Taverna Server instances, each with application specific tools and Web Services (WS, Tool); Secure area (OpenAM)]

All user interaction via web interface

User data stored in the Cloud

Data for all tools and Web Services stored in the Cloud

Unified access to different workflow engines with our common REST API

Tools and Web Services for each workflow are installed together for easy replication
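Unified access to different workflow engines can be sketched as one interface with an adapter per engine. The code below is an in-process illustration only. It does not call Taverna Server or e-Hive, and it is not the project's REST API. All class names, method names and run identifiers are hypothetical.

```python
from abc import ABC, abstractmethod


class WorkflowEngine(ABC):
    """What the common API needs from any workflow engine."""

    @abstractmethod
    def submit(self, workflow_id, inputs):
        """Start a run and return an engine-specific run id."""

    @abstractmethod
    def status(self, run_id):
        """Return the state of a run as a string."""


class InMemoryEngine(WorkflowEngine):
    """Stand-in adapter; a real one would talk to the engine's own interface."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.runs = {}

    def submit(self, workflow_id, inputs):
        run_id = f"{self.prefix}-{len(self.runs) + 1}"
        self.runs[run_id] = {"workflow": workflow_id, "inputs": inputs,
                             "state": "queued"}
        return run_id

    def status(self, run_id):
        return self.runs[run_id]["state"]


class CommonAPI:
    """Single entry point that routes each request to the right engine."""

    def __init__(self, engines):
        self.engines = engines

    def create_run(self, engine_name, workflow_id, inputs):
        run_id = self.engines[engine_name].submit(workflow_id, inputs)
        return f"{engine_name}/{run_id}"

    def run_status(self, run_ref):
        engine_name, run_id = run_ref.split("/", 1)
        return self.engines[engine_name].status(run_id)


api = CommonAPI({"taverna": InMemoryEngine("t"), "ehive": InMemoryEngine("h")})
run_ref = api.create_run("taverna", "snp-annotation", {"variants": "store://example"})
state = api.run_status(run_ref)
```

The web interface then needs only `create_run` and `run_status`, so adding another engine means writing one more adapter and leaving the interface unchanged.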

Page 29: Taverna workflows in the cloud

Orchestrating workflows in the Cloud

[Flowchart. Input: Workflow, Datastore. Steps: Find virtual machine for this workflow → Is one running? (Yes/No) → Start one → Wait until ready → Is there space on it? (Yes/No) → Run workflow (Status updates) → Done → Delete run → Is this instance empty? (Yes/No) → Terminate it]
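The orchestration flowchart can be read as the following decision logic. This is a sketch with an in-memory stand-in for the cloud, and the exact wiring of the Yes/No branches is an assumption reconstructed from the slide labels. The class names, the capacity setting and the instant "wait until ready" are hypothetical. A real orchestrator would call a cloud provider and poll the instance.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare instances by identity, not by field values
class Instance:
    workflow: str
    capacity: int
    ready: bool = False
    runs: set = field(default_factory=set)

    def has_space(self):
        return len(self.runs) < self.capacity


class Orchestrator:
    def __init__(self, capacity_per_instance=2):
        self.capacity = capacity_per_instance
        self.instances = []
        self._next_run = 0

    def _find_instance(self, workflow):
        """Is one running for this workflow, and is there space on it?"""
        for instance in self.instances:
            if instance.workflow == workflow and instance.has_space():
                return instance
        return None

    def _start_instance(self, workflow):
        instance = Instance(workflow, self.capacity)
        self.instances.append(instance)
        return instance

    def _wait_until_ready(self, instance):
        instance.ready = True  # placeholder for polling a real machine

    def submit(self, workflow):
        """Find or start a virtual machine, then run the workflow on it."""
        instance = self._find_instance(workflow)
        if instance is None:
            instance = self._start_instance(workflow)
        self._wait_until_ready(instance)
        self._next_run += 1
        instance.runs.add(self._next_run)
        return self._next_run, instance

    def finish(self, run_id, instance):
        """Delete the run; terminate the instance if it is now empty."""
        instance.runs.discard(run_id)
        if not instance.runs:
            self.instances.remove(instance)


orchestrator = Orchestrator(capacity_per_instance=2)
run_a, vm_a = orchestrator.submit("snp-annotation")
run_b, vm_b = orchestrator.submit("snp-annotation")  # shares the first instance
run_c, vm_c = orchestrator.submit("snp-annotation")  # first is full, start another
```

Terminating empty instances in `finish` is what keeps the scheme pay-as-you-go, because machines exist only while they hold at least one run.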

Page 30: Taverna workflows in the cloud

The user’s view

• Curated set of workflows
  – Designed, built and tested by domain experts
  – Quality assurance tested (if appropriate)
• Workflows are presented as applications
  – The workflows themselves are hidden
  – Configured and run via a web interface
• All user data stored securely in the Cloud
  – User separation
• Workflows as a Service

Page 31: Taverna workflows in the cloud

Web interface: Getting started

Page 32: Taverna workflows in the cloud

Web interface: Creating a Run

Page 33: Taverna workflows in the cloud

Web interface: Checking run progress

Page 34: Taverna workflows in the cloud

Conclusions

Page 35: Taverna workflows in the cloud

The user’s view

• “Science”, “Tools”, “Applications”, “Data”
  – Not workflows
  – Not infrastructure
• But they ALL have workflows
  – On paper
  – In PowerPoint
  – In scripts
  – Run “by hand”
  – Too personal/specific – cannot share them
  – “Works on my machine”

Page 36: Taverna workflows in the cloud

Workflow as a Service

• The workflow IS the service
  – Users do not see the Workflows
  – Run restricted sets of Taverna workflows in the cloud
• Connects to other cloud based resources – storage, tools, etc.
• Scale everything behind the scenes
  – Users can tweak parameters, but not design their own
  – Web portal access for scientists
  – Data passed by reference instead of by file
  – Pay as you go – cheap at the point of use

Page 37: Taverna workflows in the cloud

Supporting end-users

• Make it easy
  – Automate workflows they are already using
  – Don’t get in the way of the science
  – Hide the infrastructure where possible
• But it is really hard
  – So much has to be co-ordinated
  – Scale everything
  – Stay secure

Page 38: Taverna workflows in the cloud

Acknowledgements/Partners

• University of Manchester
• Cardiff University
• European Commission 7th Framework Programme
  – 283359 - BioVeL
• Eagle Genomics
• Technology Strategy Board
  – 100932 - Cloud Analytics for Life Sciences
• National Health Service
• Amazon Web Services

Page 39: Taverna workflows in the cloud

Thanks

• myGrid Team
  – Carole Goble (PI)
  – Shoaib Sufi
  – Alan Williams
  – Katy Wolstencroft
• CA4LS
  – Abel Ureta-Vidal (PI)
  – Mike Cornell
  – Madhu Donapudi
  – Helen Hulme
  – Nick James
• BioVeL
  – Alex Hardisty (PI)
  – Renato De Giovanni
  – Jonathan Giddy
  – Norman Morrison
  – Abraham Nieva de la Hidalga
  – Matthias Obst
  – Maria Paula Balcazar Vargas
  – Elisabeth Paymal
  – Hannu Saarenmaa

…and many, many more…