29.11.2007
Taverna: A Workbench for the Design and Execution of Scientific Workflows
Katy Wolstencroft
myGrid
University of Manchester
What is Taverna?
Taverna enables interoperation between databases and tools by providing a toolkit for composing, executing and managing workflow experiments
• Access to local and remote resources and analysis tools
• Automation of data flow
• Iteration over large data sets
Taverna and myGrid
• myGrid: a suite of components designed to support in silico experiments in biology
• Taverna workbench – main user interface
• Semantic service discovery components
• myGrid Ontology for bioinformatics services
• Provenance components
• myGrid provenance ontology
[Figure: myGrid architecture diagram. Labels include: myExperiment Web Portal; Workflow Warehouse; Taverna Workbench (GUI); Client Applications; Taverna Workflow Enactor; Service / Component Catalogue; Feta; Information Services; LogBook; Provenance Management; Provenance Ontology; Service Ontology; Results and Provenance Warehouse; Custom Datasets; 3rd Party Resources (Web Services, Grid Services)]
What is a Workflow?
Workflows provide a general technique for describing and enacting a process
• Describes what you want to do, not how you want to do it
• Simple language specifies how bioinformatics processes fit together
• Processes are represented as web services
[Figure: Taverna workbench screenshot showing the workflow diagram, the available services and a tree view of the workflow structure. Example workflow: Sequence -> RepeatMasker Web Service -> GenScan Web Service -> Blast Web Service -> Predicted Genes out]
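As a rough illustration of a workflow stating what to do rather than how, the sketch below chains three stand-in steps in the order shown in the diagram (RepeatMasker, GenScan, Blast). The function bodies are placeholders invented for this example; they do not call the real web services and do not reproduce Taverna's own workflow language.

```python
from typing import Any, Callable, List


# Each step is a placeholder for a remote web service call (hypothetical bodies).
def repeat_masker(sequence: str) -> str:
    """Stand-in: mask lower-case (assumed repetitive) bases with 'N'."""
    return "".join("N" if base.islower() else base for base in sequence)


def gen_scan(masked: str) -> List[str]:
    """Stand-in: treat each unmasked stretch as one predicted gene."""
    return [segment for segment in masked.split("N") if segment]


def blast(genes: List[str]) -> List[dict]:
    """Stand-in: attach a dummy 'hit' label to each predicted gene."""
    return [{"gene": gene, "hit": f"hit_for_{gene}"} for gene in genes]


def run_workflow(steps: List[Callable[[Any], Any]], data: Any) -> Any:
    """Pass the output of each step to the next, like links in a workflow diagram."""
    for step in steps:
        data = step(data)
    return data


# The workflow itself is only the ordered list of services.
WORKFLOW = [repeat_masker, gen_scan, blast]

if __name__ == "__main__":
    for result in run_workflow(WORKFLOW, "ATGCacgtGGCCttAAT"):
        print(result)
```

Because the workflow is just an ordered list, swapping one service for another changes a single entry and leaves the rest of the pipeline untouched.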
Who Provides the Services?
• Open domain services and resources.
• Taverna accesses 3000+ services
• Third party – we don’t own them – we didn’t build them
• All the major providers
– NCBI, DDBJ, EBI …
• Enforce NO common data model.
• Quality Web Services considered desirable
What types of service?
• WSDL Web Services
• BioMart
• R-processor
• BioMoby
• Soaplab
• Local Java services
• Beanshell
• Workflows
Who uses Taverna?
~41288 downloads
Users worldwide
• Systems biology
• Proteomics
• Gene/protein annotation
• Microarray data analysis
• Medical image analysis
• Heart simulations
• High throughput screening
• Genotype/Phenotype studies
• Health Informatics
• Astronomy
• Chemoinformatics
• Data integration
• ISMB07 – 6 posters, 2 demos, 1 BOF, 1 tutorial
What do Scientists use Taverna for?
• Data gathering and annotating
– Distributed data and knowledge
• Data analysis
– Distributed analysis tools and high throughput
• Data mining and knowledge management
– Hypothesis generation and modelling
Data Gathering
• Collecting evidence from lots of places
• Accessing local and remote databases, extracting info and displaying a unified view to the user
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
Annotation Pipelines
• Genome annotation pipelines
– Bergen Center for Computational Science: Gene Prediction in Algal Viruses, a case study.
• Workflow assembles evidence for predicted genes / potential functions
• Human expert can ‘review’ this evidence before submission to the genome database
• Data warehouse pipelines
– e-Fungi – model organism warehouse
– ISPIDER – proteomics warehouse
Case Study – Graves Disease
• Autoimmune disease that causes hyperthyroidism
• Antibodies to the thyrotropin receptor result in constitutive activation of the receptor and increased levels of thyroid hormone
• Original myGrid Case Study
Ref: Li P, Hayward K, Jennings C, Owen K, Oinn T, Stevens R, Pearce S and Wipat A (2004) Association of variations in NFKBIE with Graves' disease using classical and myGrid methodologies. UK e-Science All Hands Meeting 2004
Annotation Pipeline
• Analysing microarray data to determine genes differentially expressed in Graves Disease patients and healthy controls
• For each gene, evidence is extracted from other data sources to potentially support it as a candidate for disease involvement
Evidence includes:
• SNPs in coding and non-coding regions
• Protein products
• Protein structure and functional features
• Metabolic Pathways
• Gene Ontology terms
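The per-gene evidence gathering described on this slide can be pictured roughly as follows. This is a minimal sketch under stated assumptions: the three dictionaries stand in for remote data sources, and every identifier in them other than the gene name NFKBIE (taken from the reference above) is a made-up placeholder, not real annotation.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory stand-ins for remote data sources.
# A real workflow would query services; these values are placeholders, not real data.
SNP_SOURCE: Dict[str, List[str]] = {"NFKBIE": ["snp_a", "snp_b"]}
GO_SOURCE: Dict[str, List[str]] = {"NFKBIE": ["go_term_1"]}
PATHWAY_SOURCE: Dict[str, List[str]] = {}

# One lookup function per kind of evidence.
EVIDENCE_SOURCES: Dict[str, Callable[[str], List[str]]] = {
    "snps": lambda gene: SNP_SOURCE.get(gene, []),
    "go_terms": lambda gene: GO_SOURCE.get(gene, []),
    "pathways": lambda gene: PATHWAY_SOURCE.get(gene, []),
}


def gather_evidence(genes: List[str]) -> Dict[str, Dict[str, List[str]]]:
    """For each candidate gene, collect whatever each source holds about it."""
    return {
        gene: {name: lookup(gene) for name, lookup in EVIDENCE_SOURCES.items()}
        for gene in genes
    }


if __name__ == "__main__":
    for gene, evidence in gather_evidence(["NFKBIE", "GENE_X"]).items():
        print(gene, evidence)
```

Every gene is passed through every source, including genes with no evidence at all, so the result records gaps explicitly instead of silently dropping candidates.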
Data Analysis
• Access to local and remote analysis tools
• You start with your own data / public data of interest
• You need to analyse it to extract biological knowledge
Prime Minister's Office Thailand Center of Excellence for Life Sciences (TCELS)
Pharmacogenomics project
Wasun Chantratita, Project director
2003->2006
Pharmacogenomics
• Heavy use of R-Statistics for clinical data analysis
• Association study of Nevirapine-induced skin rash in Thai Population
• A systemic (bodywide) allergic reaction with a characteristic rash
– 100 Cases: rash
– 100 Cases: no rash controls
– 10,000 SNPs significantly associated with rash
– Pathway analysis and systems biology
– Prioritising SNPs
– Functional studies
– Diagnostic tools
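To make the case/control association step concrete, here is a minimal sketch of a two-sided Fisher exact test for a single SNP on a 2x2 table. The slide says the project used R; this version is written in Python purely for illustration, and the counts in the usage example are hypothetical, not study results.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows could be cases / controls and columns carrier / non-carrier.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def table_probability(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = table_probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    p_value = sum(
        p
        for p in map(table_probability, range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    # Hypothetical counts: 30 of 100 cases and 10 of 100 controls carry the allele.
    print(fisher_exact_two_sided(30, 70, 10, 90))
```

When many SNPs are tested at once, as in a study of this kind, the per-SNP p-values would also need a multiple-testing correction before any SNP is called significant.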
[Peter Li, Doug Kell]
Systems Biology Model Construction
Automatic reconstruction of genome-scale yeast metabolism from distributed data in the life sciences to create and manipulate Systems Biology Markup Language (SBML) models.
Trichuris muris
• Mouse whipworm infection: parasite model of the human parasite Trichuris trichiura
Understanding Phenotype
• Comparing resistant vs susceptible strains – Microarrays
Understanding Genotype
• Mapping quantitative traits – Classical genetics QTL
Joanne Pennock, Richard Grencis, University of Manchester
Trichuris muris
• Identified the biological pathways involved in sex dependence in the mouse model, previously believed to be involved in the ability of mice to expel the parasite.
• Manual experimentation: Two year study of candidate genes, processes unidentified
• JO IS A LAB BIOLOGIST
• JO HAS NEVER BUILT A WORKFLOW
Joanne Pennock, Richard Grencis, University of Manchester
Encapsulating your Experiment
• Workflows are protocols and records.
– Explicit and precise descriptions of a scientific protocol
– Scientific transparency. Easier to explain, share, relocate, reuse, repurpose and remember.
– Provenance of results for credibility.
• Workflows are know-how.
– Specialists create applications; experts design and set parameters; inexperienced punch above their weight with sophisticated protocols
• Workflows are collaborations.
– Multi-disciplinary workflows promote even broader collaborations.
• Sleeping Sickness in African Cattle
• Caused by infection by parasite (Trypanosoma brucei)
• Some cattle breeds more resistant than others
• Differences between resistant and susceptible cattle?
• Can we breed cattle resistant to infection?
Andy Brass, Steve Kemp, Paul Fisher
http://www.genomics.liv.ac.uk/tryps/trypsindex.html
Fisher et al (2007) A systematic strategy for large-scale analysis of genotype phenotype correlations: identification of candidate genes involved in African trypanosomiasis. Nucleic Acids Res. 35(16):5625-33
Why was the Workflow Approach Successful?
• Workflow analysed each piece of data systematically
– Eliminated user bias and premature filtering of datasets and results leading to single sided, expert-driven hypotheses
• The size of the QTL and amount of the microarray data made a manual approach impractical
• Workflows capture exactly where data came from and how it was analysed
• Workflow output produced a manageable amount of data for the biologists to interpret and verify
– “make sense of this data” -> “does this make sense?”
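One possible shape for such a systematic step is sketched below: keep every gene that lies in the QTL region and is also differentially expressed, then group the survivors by pathway so the biologist reviews a short list. The slides do not spell out the actual workflow steps, so this is an assumption for illustration, and all gene and pathway identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def candidate_pathways(
    qtl_genes: Iterable[str],
    expressed_genes: Iterable[str],
    gene_to_pathways: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Group genes found in both inputs by the pathways they belong to."""
    # No manual pre-filtering: every gene present in both sets is kept.
    candidates = sorted(set(qtl_genes) & set(expressed_genes))
    pathways: Dict[str, List[str]] = defaultdict(list)
    for gene in candidates:
        for pathway in gene_to_pathways.get(gene, []):
            pathways[pathway].append(gene)
    return dict(pathways)


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    qtl = ["gene_1", "gene_2", "gene_3", "gene_4"]
    expressed = ["gene_2", "gene_4", "gene_9"]
    annotation = {
        "gene_2": ["pathway_A"],
        "gene_4": ["pathway_A", "pathway_B"],
        "gene_9": ["pathway_C"],
    }
    print(candidate_pathways(qtl, expressed, annotation))
```

The output is small enough to inspect by hand, which matches the shift the slide describes from making sense of raw data to checking whether a result makes sense.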
Sharing Experiments
• myGrid supports the in silico experimental process for individual scientists
• How do you share your results/experiments/experiences with your
– Research group
– Collaborators
– Scientific community
• How do you compare your results with others produced by e.g. Kepler / Triana?
Just Enough Sharing….
• myExperiment can provide a central location for workflows from one community/group
• myExperiment allows you to say
– Who can look at your workflow
– Who can download your workflow
– Who can modify your workflow
– Who can run your workflow
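The four sharing controls listed above could be modelled along the following lines. This is a hypothetical data structure written for this transcript, not the myExperiment API; the class name, action names and user names are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# The four actions named on the slide (look at, download, modify, run).
ACTIONS = ("view", "download", "modify", "run")


@dataclass
class SharedWorkflow:
    """Hypothetical record of who may do what with one workflow."""

    title: str
    owner: str
    permissions: Dict[str, Set[str]] = field(
        default_factory=lambda: {action: set() for action in ACTIONS}
    )

    def grant(self, action: str, user: str) -> None:
        """Allow one user to perform one action."""
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        self.permissions[action].add(user)

    def can(self, user: str, action: str) -> bool:
        """The owner may do everything; others need an explicit grant."""
        return user == self.owner or user in self.permissions.get(action, set())


if __name__ == "__main__":
    workflow = SharedWorkflow(title="example workflow", owner="alice")
    workflow.grant("view", "bob")
    workflow.grant("run", "bob")
    print(workflow.can("bob", "run"), workflow.can("bob", "modify"))
```

Keeping each action separate is what allows "just enough" sharing: a collaborator can be allowed to run a workflow on their own data without being able to change it.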
Summary
Taverna
• Allows interoperation between local and remote resources
• Allows automated access to, and analysis of, sets of data
• Helps with data integration
• Is extensible and open source – for application embedding
myExperiment
• Allows sharing across particular communities
• Provides a central location for publishing/finding useful workflows
Taverna and CASIMIR
• CASIMIR – lots of distributed and heterogeneous mouse data
– Data Gathering workflows
• CASIMIR – lots of omics data
– High throughput analysis
• CASIMIR – lots of distributed scientists
• CASIMIR Users
– Some bioinformaticians and some biologists: ALL want to use resources
– Workflows built by bioinformaticians
– Stored in myExperiment
– Run by biologists using their own data
myGrid acknowledgements
Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer
• OMII-UK Tom Oinn, Katy Wolstencroft, Daniele Turi, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble, Alan Williams, Ian Dunlop
• Research Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan.
• Current contributors Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people.
• User Advocates and their bosses Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell, Marco Roos, Matthew Pocock, Mark Wilkinson
• Past Contributors Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson, and Chris Wroe.
• Industrial Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica.
• Funding EPSRC, Wellcome Trust.
http://www.mygrid.org.uk
http://www.myexperiment.org