Post on 11-Jan-2016


  • An Introduction to Taverna

    Dr. Georgina Moulton and Stian Soiland, The University of Manchester ([…]@manchester.ac.uk; ssoiland@cs.man.ac.uk), on behalf of the myGRID team

  • Outline of the day
    • Introduction to workflows
    • Introduction to Taverna
    • Case studies
    • Hands-on Taverna workshop
    • Build your own workflows
    • Explore features of Taverna
    • Taverna in a little more detail

  • What you will learn
    No prior knowledge of workflow technology is required. By the end of the tutorial, participants will know how to:
    • install the workbench software, import and run existing workflows, and build their own from components available on the public internet
    • use the semantic search technologies in myGrid to assist this process by enabling service discovery
    • do basic troubleshooting of workflows using Taverna's fault tolerance and debug mechanisms
    • manage the import and export of data to and from the workflow system

  • What is Taverna?
    Taverna enables interoperation between databases and tools by providing a toolkit for composing, executing and managing workflow experiments.

    • Access to local and remote resources and analysis tools
    • Automation of data flow
    • Iteration over large data sets

  • Workflows
    • Workflow language specifies how processes (web services) fit together
    • Describes what you want to do, not how you want to do it
    • High-level workflow diagram is separated from any lower-level coding: you don't have to be a coder to build workflows
    • A workflow is a kind of script or protocol that you configure when you run it
    • Easier to explain, share, relocate, reuse and repurpose
    • The workflow is the integrator of knowledge

    [Diagram: workflow model. Labels: Sequence (in); RepeatMasker web service; GenScan web service; BLAST web service; predicted genes (out)]
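
    The idea that a workflow only says how services fit together can be sketched in a few lines of Python. This is a minimal illustration, not how Taverna is implemented: the three functions are hypothetical local stand-ins for the RepeatMasker, GenScan and BLAST web services, their behaviour is invented for the example, and the order in which they are chained is an assumption based on the diagram labels.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-ins for remote web services. The real services take
# and return much richer data; these only mimic the shape of the chain.
def repeat_masker(sequence: str) -> str:
    """Pretend to mask repeats by lower-casing every 'AAAA' run."""
    return sequence.replace("AAAA", "aaaa")

def gen_scan(masked: str) -> List[str]:
    """Pretend to predict genes: the unmasked stretches between repeats."""
    return [part for part in masked.split("aaaa") if part]

def blast(genes: List[str]) -> List[Dict[str, str]]:
    """Pretend to annotate each predicted gene with a placeholder hit."""
    return [{"gene": g, "hit": "placeholder_hit_for_" + g} for g in genes]

def run_workflow(steps: List[Callable[[Any], Any]], data: Any) -> Any:
    """Pass the output of each step to the next one, in order."""
    for step in steps:
        data = step(data)
    return data

# The workflow itself is just the wiring: which step feeds which.
WORKFLOW = [repeat_masker, gen_scan, blast]
result = run_workflow(WORKFLOW, "ATGCAAAAGGCT")
```

    Swapping one service for another means changing one entry in `WORKFLOW`; the steps themselves stay untouched, which is the "what, not how" point made on the slide.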

  • Two types of workflows

    Data workflows: a task is invoked once its expected data has been received and, when complete, passes any resulting data downstream.

    Control workflows: a task is invoked once the tasks it depends on have completed.

    [Diagram labels: A, B, C, D, E, F]
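
    The difference between the two styles can be made concrete with two toy schedulers. This is a sketch under stated assumptions, not Taverna's enactor: the function names, the task names and the dictionary-based task descriptions are all invented for illustration.

```python
def run_data_workflow(tasks, inputs):
    """Data-driven: a task fires once all of its input values exist.

    tasks maps a task name to (list of input names, function); each result
    is stored under the task's own name so downstream tasks can consume it.
    """
    data = dict(inputs)
    pending = dict(tasks)
    order = []
    while pending:
        ready = [name for name, (needs, _) in pending.items()
                 if all(key in data for key in needs)]
        if not ready:
            raise ValueError("some tasks never receive their inputs")
        for name in ready:
            needs, func = pending.pop(name)
            data[name] = func(*(data[key] for key in needs))
            order.append(name)
    return order, data

def run_control_workflow(tasks, after):
    """Control-driven: a task fires once the tasks it depends on are done.

    tasks maps a task name to a zero-argument function; after maps a task
    name to the tasks that must complete first. No data is passed along.
    """
    pending = dict(tasks)
    done = []
    while pending:
        ready = [name for name in pending
                 if all(dep in done for dep in after.get(name, []))]
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        for name in ready:
            pending.pop(name)()
            done.append(name)
    return done

# Data workflow: "total" is listed first but must wait for "double".
data_order, data_values = run_data_workflow(
    {"total": (["double", "y"], lambda d, y: d + y),
     "double": (["x"], lambda x: x * 2)},
    {"x": 3, "y": 4},
)

# Control workflow: only completion order matters.
log = []
control_order = run_control_workflow(
    {"C": lambda: log.append("C"),
     "A": lambda: log.append("A"),
     "B": lambda: log.append("B")},
    {"B": ["A"], "C": ["B"]},
)
```

    In both cases the order of execution follows from the declared links rather than from the order in which the tasks were written down.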

  • Williams-Beuren Syndrome (WBS)
    • Contiguous sporadic gene deletion disorder
    • 1/20,000 live births, caused by unequal crossover (homologous recombination) during meiosis
    • Haploinsufficiency of the region results in the phenotype
    • Multisystem phenotype: muscular, nervous, circulatory systems
    • Characteristic facial features
    • Unique cognitive profile
    • Mental retardation (IQ 40-100, mean ~60, normal mean ~100)
    • Outgoing personality, friendly nature, charming

  • Williams-Beuren Syndrome Microdeletion

    [Figure: Williams-Beuren Syndrome microdeletion region. Gene labels: GTF2I, RFC2, CYLN2, GTF2IRD1, NCF1, WBSCR1/E1f4H, LIMK1, ELN, CLDN4, CLDN3, STX1A, WBSCR18, WBSCR21, TBL2, BCL7B, BAZ1B, FZD9, WBSCR5/LAB, WBSCR22, FKBP6, POM121, NOLR1, GTF2IRD2, WBSCR14. Other labels: C-cen, C-mid, A-cen, B-mid, B-cen, A-mid, B-tel, A-tel, C-tel]

    Eichler E, Clark R & She X. An Assessment of the Sequence Gaps: Unfinished Business in a Finished Human Genome. Nature Reviews Genetics (2004) 5:345-354

    Hillier L et al. The DNA Sequence of Human Chromosome 7. Nature (2003) 424:157-164

  • Filling a genomic gap in silico
    Two steps to filling the genomic gap:
    • Identify new, overlapping sequence of interest
    • Characterise the new sequence at nucleotide and amino acid level

    A number of issues arise if we do it the traditional way:

    • Frequently repeated: information is rapidly added to public databases
    • Time-consuming and mundane
    • Don't always get results
    • Huge amount of interrelated data is produced

  • Traditional Bioinformatics

  • Requirements
    • Automation
    • Reliability
    • Repeatability
    • Few programming skills required
    • Works on distributed resources

  • The Williams Workflows
    • A: Identification of overlapping sequence
    • B: Characterisation of nucleotide sequence
    • C: Characterisation of protein sequence

  • The Biological Results
    • Four workflow cycles totalling ~10 hours
    • The gap was correctly closed and all known features identified

    [Figure labels: CTA-315H11, CTB-51J22, ELN, WBSCR14]

  • Workflow Advantages
    • Automation
    • Capturing processes in an explicit manner
    • Tedium! Computers don't get bored/distracted/hungry/impatient!
    • Saves repeated time and effort
    • Modification, maintenance, substitution and personalisation
    • Easy to share, explain, relocate, reuse and build
    • Releases scientists/bioinformaticians to do other work
    • Record
    • Provenance: what the data is like, where it came from, its quality
    • Management of data (LSID - Life Science Identifiers)

  • Benefit to the Scientist?
    • Automated plumbing. Systematic. Making boring stuff easier so you can do more funky stuff. Data chaining replaces manual hand-offs. Accelerated creation of results. Repetitive and unbiased analysis. Potentially reproducible, but not always.
    • Easier to use (but maybe not to design). Gives non-developers access to sophisticated codes and applications. Avoids the need to download, install and learn how to use someone else's code.
    • A framework to leverage a community's applications, services, datasets and codes. Honours original codes and applications. Heterogeneous coding styles and tool sets. The best applications. Promoting community metadata and common formats and standards.
    • A framework for extensibility, adaptability and innovation. Add my code, reuse and repurpose.

  • It's more than plumbing.
    • Workflows are protocols and records. Explicit and precise descriptions of a scientific protocol. Scientific transparency. Easier to explain, share, relocate, reuse, repurpose and remember. Provenance of results for credibility.
    • Workflows are know-how. Specialists create applications; experts design and set parameters; the inexperienced punch above their weight with sophisticated protocols.
    • Workflows are collaborations. Multi-disciplinary workflows promote even broader collaborations.

  • In silico experiment lifecycle

  • Part of a bigger picture (which we will talk about in more detail later)

  • Taverna Workflow Workbench

  • Taverna
    Taverna is:
    • A workflow language based on a dataflow model
    • A graphical editing environment for that language
    • An invocation system to run instances of that language on data supplied by a user of the system

    • When you download it you get all this rolled into a single piece of desktop software
    • The enactor can be run independently of the GUI
    • Java based, runs on Windows, Mac OS, Linux, Solaris
    • It doesn't necessarily run "on a grid". It can be used to access resources, either on a grid or anywhere else

  • OMII-UK
    • Funded through the Open Middleware Infrastructure Institute (OMII-UK) as part of the myGrid project run by Carole Goble
    • Four years old, funding secured through 2008 and beyond
    • Development team at Manchester and Hinxton, UK
    • Wide group of friends and allies across the world, particularly within UK eScience
    • Implemented in Java, released under the LGPL licence

  • Version 1.5.1
    Shown running on a Mac but written in Java. Runs and is developed on Windows, OS X and Linux.

    [Screenshot callouts: Biomart query; Soaplab operation wrapping an EMBOSS tool; workflow diagram; tree view of workflow structure; available services]

  • An Open World
    • Open domain services and resources
    • Taverna accesses 3500+ operations
    • Third party
    • All the major providers: NCBI, DDBJ, EBI
    • Enforces NO common data model
    • Quality Web Services considered desirable

  • Services
    Taverna can interoperate the following by default:
    • SOAP-based web services
    • Biomart data warehouses
    • Soaplab-wrapped command line tools
    • BioMoby services and object constructors (talk tomorrow)
    • Inline interpreted scripting (Java based)

    Other service classes can be added through an extension point (but you probably don't need to).

  • Multi-disciplinary
    • ~37000 downloads
    • Ranked 210 on SourceForge
    • Users in US, Singapore, UK, Europe, Australia

    • Systems biology
    • Proteomics
    • Gene/protein annotation
    • Microarray data analysis
    • Medical image analysis
    • Heart simulations
    • High throughput screening
    • Phenotypical studies
    • Plants, Mouse, Human
    • Astronomy
    • Aerospace
    • Dilbert Cartoons

  • What do Scientists use Taverna for?

    • Data gathering and annotating
    • Distributed data and knowledge
    • Data analysis
    • Distributed analysis tools and …
    • Data mining and knowledge management
    • Hypothesis generation and modelling

  • Case Study: Graves' Disease
    • Autoimmune disease that causes hyperthyroidism
    • Antibodies to the thyrotropin receptor result in constitutive activation of the receptor and increased levels of thyroid hormone
    • Original myGrid case study

    Ref: Li P, Hayward K, Jennings C, Owen K, Oinn T, Stevens R, Pearce S and Wipat A (2004) Association of variations in NFKBIE with Graves' disease using classical and myGrid methodologies. UK e-Science All Hands Meeting 2004

  • Graves' Disease

    The experiment:
    • Analysing microarray data to determine genes differentially expressed between Graves' disease patients and healthy controls
    • Characterising these genes (and any proteins encoded by them) in an annotation pipeline
    • From the Affymetrix probeset identifier, extract information about genes encoded in this region
    • For each gene, evidence is extracted from other data sources to potentially support it as a candidate for disease involvement

  • Annotation Pipeline

    Evidence includes:
    • SNPs in coding and non-coding regions
    • Protein products
    • Protein structure and functional features
    • Metabolic pathways
    • Gene Ontology terms
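
    The shape of such an annotation pipeline (probeset identifier, then genes, then evidence per gene) can be sketched as nested iteration. This is a minimal sketch, not the myGrid workflow: the lookup tables stand in for remote data sources, every identifier in them is a made-up placeholder rather than real data, and the function names are invented for the example.

```python
from typing import Dict, List

# All identifiers and records below are made-up placeholders, NOT real data.
# In a real pipeline each table would be a call to a remote data source.
PROBESET_TO_GENES: Dict[str, List[str]] = {
    "probe_1": ["gene_a", "gene_b"],
    "probe_2": ["gene_c"],
}
EVIDENCE_SOURCES: Dict[str, Dict[str, List[str]]] = {
    "snps": {"gene_a": ["snp_1"]},
    "pathways": {"gene_a": ["pathway_1"], "gene_c": ["pathway_2"]},
    "go_terms": {"gene_b": ["go_term_1"]},
}

def genes_for(probeset: str) -> List[str]:
    """Step 1: map a probeset identifier to the genes it covers."""
    return PROBESET_TO_GENES.get(probeset, [])

def evidence_for(gene: str) -> Dict[str, List[str]]:
    """Step 2: collect whatever each evidence source holds for one gene."""
    return {source: table[gene]
            for source, table in EVIDENCE_SOURCES.items()
            if gene in table}

def annotate(probesets: List[str]) -> Dict[str, Dict[str, List[str]]]:
    """Iterate over every probeset, then over every gene it maps to."""
    report = {}
    for probeset in probesets:
        for gene in genes_for(probeset):
            report[gene] = evidence_for(gene)
    return report

report = annotate(["probe_1", "probe_2"])
```

    The explicit loops in `annotate` are what a workflow system can take over: the same chain of steps is applied to each item of an input list.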

  • Data Analysis

    • Access to local and remote analysis tools
    • You start with your own data or public data of interest
    • You need to analyse it to extract biological knowledge

  • Case study: Investigating Genotype-Phenotype Correlations in Trypanotolerance

    Fisher P, Hedeler C, Wolstencroft K, Hulme H, Noyes H, Kemp S, Stevens R, Brass A. (2007) A systematic strategy for large-scale analysis of genotype-phenotype correlations: identification of candidate genes involved in African trypanosomiasis. Nucleic Acids Res. 35(16):5625-33

  • Bioinformatics Challenges

    • Linking from genotype to phenotype
    • Integrated omics (GIMS)
    • Microarray analysis
    • Working with the literature
    • Presentation of results to non-bioinformaticians
    • Sepa