29.11.2007

Taverna: A Workbench for the Design and Execution of Scientific Workflows

Katy Wolstencroft

myGrid

University of Manchester

What is Taverna?

Taverna enables interoperation between databases and tools by providing a toolkit for composing, executing and managing workflow experiments

• Access to local and remote resources and analysis tools
• Automation of data flow
• Iteration over large data sets

Taverna and myGrid

• myGrid: a suite of components designed to support in silico experiments in biology
• Taverna workbench – main user interface
• Semantic service discovery components
• myGrid Ontology for bioinformatics services
• Provenance components
• myGrid provenance ontology

[Figure: myGrid architecture diagram. Recoverable labels: myExperiment Web Portal; Workflow Warehouse; Taverna Workbench GUI; Client Applications; Taverna Workflow Enactor; Service / Component Catalogue; Feta Information Services; LogBook Provenance Management; Provenance Ontology; Service Ontology; Results; Provenance Warehouse; Custom / Default; Custom Datasets; Resources; Service Management; 3rd Party Resources (Web Services, Grid Services)]

What is a Workflow?

• Workflows provide a general technique for describing and enacting a process
• Describes what you want to do, not how you want to do it
• Simple language specifies how bioinformatics processes fit together
• Processes are represented as web services

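[Figure: example workflow. Sequence in → RepeatMasker web service → GenScan web service → Blast web service → predicted genes out]

[Screenshot: Taverna workbench, showing the workflow diagram, the available services, and a tree view of the workflow structure]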
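As a rough illustration of "what, not how", the RepeatMasker, GenScan and Blast chain can be sketched in Python as an ordered list of service names that a small enactor resolves and runs in turn. This is only a sketch under stated assumptions: the functions are hypothetical stand-ins that label their input rather than calling the real web services, and the `SERVICES` registry and `enact` function are invented for this example and are not part of Taverna.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for remote web services. Each one only wraps its
# input in a label so the data flow is visible; no real analysis is done.
def repeat_masker(data: str) -> str:
    return f"repeatmasker({data})"


def gen_scan(data: str) -> str:
    return f"genscan({data})"


def blast(data: str) -> str:
    return f"blast({data})"


# Registry mapping service names to callables (an assumption of this sketch).
SERVICES: Dict[str, Callable[[str], str]] = {
    "RepeatMasker": repeat_masker,
    "GenScan": gen_scan,
    "Blast": blast,
}

# The workflow itself: a declaration of which services run, in which order.
WORKFLOW: List[str] = ["RepeatMasker", "GenScan", "Blast"]


def enact(workflow: List[str], services: Dict[str, Callable[[str], str]], value: str) -> str:
    """Pass the value through each named service, feeding outputs forward."""
    for name in workflow:
        value = services[name](value)
    return value


if __name__ == "__main__":
    print(enact(WORKFLOW, SERVICES, "ACGT"))
```

The workflow description stays a plain list of names, so swapping a service or reordering steps does not touch the code that runs it.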

Who Provides the Services?

• Open domain services and resources
• Taverna accesses 3000+ services
• Third party – we don’t own them – we didn’t build them
• All the major providers
– NCBI, DDBJ, EBI …
• Enforce NO common data model
• Quality Web Services considered desirable

What types of service?

• WSDL Web Services
• BioMart
• R-processor
• BioMoby
• Soaplab
• Local Java services
• Beanshell
• Workflows

Who uses Taverna?

~41288 downloads

Users worldwide
• Systems biology
• Proteomics
• Gene/protein annotation
• Microarray data analysis
• Medical image analysis
• Heart simulations
• High throughput screening
• Genotype/Phenotype studies
• Health Informatics
• Astronomy
• Chemoinformatics
• Data integration

• ISMB07 – 6 posters, 2 demos, 1 BOF, 1 tutorial

What do Scientists use Taverna for?

• Data gathering and annotating
– Distributed data and knowledge
• Data analysis
– Distributed analysis tools and high throughput
• Data mining and knowledge management
– Hypothesis generation and modelling

Data Gathering

• Collecting evidence from lots of places
• Accessing local and remote databases, extracting info and displaying a unified view to the user

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa

Annotation Pipelines

• Genome annotation pipelines
– Bergen Center for Computational Science – Gene Prediction in Algal Viruses, a case study
• Workflow assembles evidence for predicted genes / potential functions
• Human expert can ‘review’ this evidence before submission to the genome database

• Data warehouse pipelines
– e-Fungi – model organism warehouse
– ISPIDER – proteomics warehouse

Case Study – Graves Disease

• Autoimmune disease that causes hyperthyroidism
• Antibodies to the thyrotropin receptor result in constitutive activation of the receptor and increased levels of thyroid hormone

• Original myGrid Case Study

Ref: Li P, Hayward K, Jennings C, Owen K, Oinn T, Stevens R, Pearce S and Wipat A (2004) Association of variations in NFKBIE with Graves' disease using classical and myGrid methodologies. UK e-Science All Hands Meeting 2004

Annotation Pipeline

• Analysing microarray data to determine genes differentially-expressed in Graves Disease patients and healthy controls

• For each gene, evidence is extracted from other data sources to potentially support it as a candidate for disease involvement

Evidence includes:
• SNPs in coding and non-coding regions
• Protein products
• Protein structure and functional features
• Metabolic Pathways
• Gene Ontology terms

Data Analysis

• Access to local and remote analysis tools
• You start with your own data / public data of interest
• You need to analyse it to extract biological knowledge

Prime Minister's Office Thailand Center of Excellence for Life Sciences (TCELS)

Pharmacogenomics project

Wasun Chantratita, Project director

2003->2006

Pharmacogenomics

• Heavy use of R-Statistics for clinical data analysis
• Association study of Nevirapine-induced skin rash in Thai Population
• A systemic (bodywide) allergic reaction with a characteristic rash
– 100 Cases: rash
– 100 Cases: no rash controls
– 10,000 SNP significantly associated with rash
– Pathway analysis and systems biology
– Prioritising SNPs
– Functional studies
– Diagnostic tools

[Peter Li, Doug Kell]

Systems Biology Model Construction

Automatic reconstruction of genome-scale yeast metabolism from distributed data in the life sciences to create and manipulate Systems Biology Markup Language (SBML) models.

Trichuris muris

• Mouse whipworm infection – parasite model of the human parasite Trichuris trichiura

Understanding Phenotype
• Comparing resistant vs susceptible strains – Microarrays

Understanding Genotype
• Mapping quantitative traits – Classical genetics QTL

Joanne Pennock, Richard Grencis
University of Manchester

Trichuris muris

• Identified the biological pathways involved in sex dependence in the mouse model, previously believed to be involved in the ability of mice to expel the parasite.

• Manual experimentation: Two year study of candidate genes, processes unidentified

• JO IS A LAB BIOLOGIST
• JO HAS NEVER BUILT A WORKFLOW

Joanne Pennock, Richard Grencis
University of Manchester

Encapsulating your Experiment

• Workflows are protocols and records.
– Explicit and precise descriptions of a scientific protocol
– Scientific transparency. Easier to explain, share, relocate, reuse and repurpose and remember.
– Provenance of results for credibility.
• Workflows are know-how.
– Specialists create applications; experts design and set parameters; inexperienced punch above their weight with sophisticated protocols
• Workflows are collaborations.
– Multi-disciplinary workflows promote even broader collaborations.

• Sleeping Sickness in African Cattle
• Caused by infection by parasite (Trypanosoma brucei)
• Some cattle breeds more resistant than others
• Differences between resistant and susceptible cattle?
• Can we breed cattle resistant to infection?

Steve Kemp, Andy Brass, Paul Fisher

http://www.genomics.liv.ac.uk/tryps/trypsindex.html

Fisher et al (2007) A systematic strategy for large-scale analysis of genotype phenotype correlations: identification of candidate genes involved in African trypanosomiasis. Nucleic Acids Res. 35(16):5625-33

Why was the Workflow Approach Successful?

• Workflow analysed each piece of data systematically
– Eliminated user bias and premature filtering of datasets and results leading to single sided, expert-driven hypotheses
• The size of the QTL and amount of the microarray data made a manual approach impractical
• Workflows capture exactly where data came from and how it was analysed
• Workflow output produced a manageable amount of data for the biologists to interpret and verify
– “make sense of this data” -> “does this make sense?”
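The point about capturing where data came from and how it was analysed can be sketched as a runner that logs every step's input and output. This is a minimal sketch, assuming a linear chain of steps; `ProvenanceRecord`, `run_with_provenance` and the two toy steps are hypothetical names for illustration and do not reflect the myGrid provenance components or their data model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ProvenanceRecord:
    """One entry in the trail: which step ran, on what input, giving what output."""
    step: str
    input_value: str
    output_value: str


def run_with_provenance(
    steps: List[Tuple[str, Callable[[str], str]]], value: str
) -> Tuple[str, List[ProvenanceRecord]]:
    """Run the steps in order and record the data seen at every stage."""
    log: List[ProvenanceRecord] = []
    for name, func in steps:
        result = func(value)
        log.append(ProvenanceRecord(name, value, result))
        value = result
    return value, log


# Toy steps (assumptions of this sketch, not real analysis services).
STEPS: List[Tuple[str, Callable[[str], str]]] = [
    ("uppercase", str.upper),
    ("reverse", lambda s: s[::-1]),
]

if __name__ == "__main__":
    final, trail = run_with_provenance(STEPS, "acgt")
    for record in trail:
        print(record)
    print(final)
```

Because every intermediate value is kept, a reviewer can trace any final result back through the exact steps that produced it.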

Sharing Experiments

• myGrid supports the in silico experimental process for individual scientists
• How do you share your results/experiments/experiences with your
– Research group
– Collaborators
– Scientific community
• How do you compare your results with others produced by e.g. Kepler / Triana?

Just Enough Sharing….

• myExperiment can provide a central location for workflows from one community/group
• myExperiment allows you to say
– Who can look at your workflow
– Who can download your workflow
– Who can modify your workflow
– Who can run your workflow
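The four "who can" questions map naturally onto a per-workflow permission set. The following is a sketch only: the `Permission` flags, the `SharingPolicy` class and the group names are assumptions made for illustration and are not the myExperiment API or its actual access model.

```python
from enum import Flag
from typing import Dict


class Permission(Flag):
    """The four sharing rights named on the slide, as combinable flags."""
    VIEW = 1
    DOWNLOAD = 2
    MODIFY = 4
    RUN = 8


class SharingPolicy:
    """Hypothetical per-workflow policy: maps a user or group to its rights."""

    def __init__(self) -> None:
        self._grants: Dict[str, Permission] = {}

    def grant(self, who: str, permission: Permission) -> None:
        # Add the new rights to whatever this user or group already holds.
        self._grants[who] = self._grants.get(who, Permission(0)) | permission

    def allows(self, who: str, permission: Permission) -> bool:
        # Unknown users hold no rights, so every check fails for them.
        granted = self._grants.get(who, Permission(0))
        return (granted & permission) == permission


if __name__ == "__main__":
    policy = SharingPolicy()
    policy.grant("research-group", Permission.VIEW | Permission.DOWNLOAD | Permission.RUN)
    policy.grant("collaborators", Permission.VIEW)
    print(policy.allows("research-group", Permission.RUN))
    print(policy.allows("collaborators", Permission.MODIFY))
```

Keeping each right separate is what allows "just enough" sharing: a group can be allowed to run a workflow without being able to modify it.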

Summary

Taverna
• allows interoperation between local and remote resources
• allows automated access to, and analysis of, sets of data
• helps with data integration
• is extensible and open source – for application embedding

myExperiment
• allows sharing across particular communities
• provides a central location for publishing/finding useful workflows

Taverna and CASIMIR

• CASIMIR – lots of distributed and heterogeneous mouse data
– Data Gathering workflows
• CASIMIR – lots of omics data
– High throughput analysis
• CASIMIR – lots of distributed scientists
• CASIMIR Users
– Some bioinformaticians and some biologists ALL want to use resources
– Workflows built by bioinformaticians
– Stored in myExperiment
– Run by biologists using their own data

myGrid acknowledgements

Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer

• OMII-UK Tom Oinn, Katy Wolstencroft, Daniele Turi, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble, Alan Williams, Ian Dunlop

• Research Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan.

• Current contributors Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people.

• User Advocates and their bosses Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell, Marco Roos, Matthew Pocock, Mark Wilkinson

• Past Contributors Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizauskas, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson, and Chris Wroe.

• Industrial Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica.

• Funding EPSRC, Wellcome Trust.

http://www.mygrid.org.uk

http://www.myexperiment.org