Taverna, myExperiment and BioCatalogue: Workflow Tools for Informatics Integration
Dr Katy Wolstencroft
School of Computer Science
University of Manchester
• Interoperability, Integration and Collaboration
• Access to distributed and local resources
• Iteration over data sets
• Automation of data flow
• Agile software development
• Extensible
• Experimental protocols
• Part of the myGrid toolkit
Taverna Workflows
What is myGrid?
An e-Science Collaboration Since 2001
• Software ● Services ● Content ● Skills ● Community
• Manchester, Southampton, Oxford and the EMBL-EBI
+ an alliance of international contributing projects and partners
• Sustainable production level quality
– Open Middleware Infrastructure Institute UK
– Software Sustainability Institute
– Mixture of developers, bioinformaticians and researchers
• Open source development and content (LGPL or BSD)
Connecting Things Together
• Data Resources
– Genome databases
– Kinetic/metabolite data
• Analysis tools
– Sequence alignment
– Similarity searching
– Pattern matching
• Knowledge Resources
– Ontologies
– Controlled vocabularies
Create and run workflows
Share, discover and reuse workflows
Manage the metadata needed and generated
RDF, OWL
Discover and reuse services
Feta
A Collection of Components
What is a Workflow
• Set of services (web services, RESTful, local scripts, other workflows)
• Set of data links between services - “put output X from service A as input Y to service B”
– If needed: list handling, control links
• This can be called a data-oriented workflow (dataflow)
– Say where you want the data to flow instead of what you want to do
– Compare with more procedural workflow languages like BPEL
• Beneficial way of thinking for much data-driven scientific research
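The dataflow idea above can be sketched in a few lines of Python. This is a minimal illustration, assuming services are plain functions and data links are (source service, output port, sink service, input port) tuples. The names `Workflow`, `add_service`, `link` and `run`, and the two example services, are hypothetical and are not Taverna's API.

```python
from graphlib import TopologicalSorter


class Workflow:
    """Toy dataflow: services are functions, links say where data flows."""

    def __init__(self):
        self.services = {}   # name -> function(**inputs) returning a dict of outputs
        self.links = []      # (source service, output port, sink service, input port)

    def add_service(self, name, func):
        self.services[name] = func

    def link(self, source, out_port, sink, in_port):
        # "put output X from service A as input Y to service B"
        self.links.append((source, out_port, sink, in_port))

    def run(self, inputs):
        # inputs: {service name: {input port: value}} for workflow-level inputs
        deps = {name: set() for name in self.services}
        for source, _, sink, _ in self.links:
            deps[sink].add(source)
        results = {}
        # Execution order is derived from the links, not written by the user.
        for name in TopologicalSorter(deps).static_order():
            kwargs = dict(inputs.get(name, {}))
            for source, out_port, sink, in_port in self.links:
                if sink == name:
                    kwargs[in_port] = results[source][out_port]
            results[name] = self.services[name](**kwargs)
        return results


# Hypothetical stand-ins for remote services.
def fetch_sequence(accession):
    return {"sequence": "ATGC" if accession == "demo" else ""}


def reverse_complement(dna):
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return {"result": "".join(pairs[base] for base in reversed(dna))}


wf = Workflow()
wf.add_service("fetch", fetch_sequence)
wf.add_service("revcomp", reverse_complement)
wf.link("fetch", "sequence", "revcomp", "dna")
outputs = wf.run({"fetch": {"accession": "demo"}})
print(outputs["revcomp"]["result"])
```

Note that the caller never states the order of execution; it follows from the links, which is the contrast with a procedural language that the slide draws.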
Kepler
Triana
BPEL
Ptolemy II
Taverna
Workflow diagram
Tree view of workflow structure
Available services
Taverna
Open source and extensible
Taverna GUI and Enactor
Taverna Remote Execution service
T-REX
Graphical Workbench
• Drag and drop interface
• Plug-in architecture
• Nested workflows
Workflow Enactor
• Local and remote enactor
• Implicit iteration over data collections
• Automation of data flow
• Logging and data provenance tracking
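Implicit iteration can be pictured as follows: a service written for a single item is applied element-wise when it is handed a collection. The Python sketch below is a rough illustration only. `implicit_iterate` and `gc_content` are hypothetical names, and the sketch covers just the simplest case of one input; a real enactor also needs a rule for combining several list inputs, which is omitted here.

```python
def implicit_iterate(service, value):
    """Apply a single-item service to a value of any list depth.

    If the value is a list, the service is mapped over it (recursively),
    so the output has the same nesting as the input.
    """
    if isinstance(value, list):
        return [implicit_iterate(service, item) for item in value]
    return service(value)


def gc_content(sequence):
    # Hypothetical single-item service: fraction of G/C bases in one sequence.
    return sum(base in "GC" for base in sequence) / len(sequence)


single = implicit_iterate(gc_content, "GGCC")
many = implicit_iterate(gc_content, ["GGCC", "ATAT", "ATGC"])
nested = implicit_iterate(gc_content, [["GG"], ["AT", "GC"]])
print(single, many, nested)
```

The workflow author wires the service once; whether one sequence or a whole data set arrives is handled without an explicit loop.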
Taverna http://www.taverna.org.uk
Software Release
• Taverna first released 2004.
• Current versions: Taverna 1.7.2 and Taverna 2.1.2
• Currently 1500+ users per month, 350+ organizations, ~40 countries, 80000+ downloads across versions
Availability
• Freely available, open source LGPL
• Windows, Mac OS, and Linux
Resources
• http://www.taverna.org.uk, http://www.mygrid.org.uk
• User and developer workshops, documentation, email help desk
• Collaborations with numerous groups including NCI’s cancer Biomedical Informatics Grid (caBIG), EMBL-EBI, NCBI, Concept Web Alliance, Bio2RDF
What types of service?
• WSDL Web Services
• BioMart
• R-processor
• BioMoby
• Soaplab
• Grid Services
• Local Java services
• Beanshell
• Workflows
• Coming soon: new REST support
Who Provides the Services?
• Open domain services and resources
• Taverna accesses 3500+ services (11,874 operations)
• Third party – we don’t own them – we didn’t build them
• All the major providers
– NCBI, DDBJ, EBI …
• Enforce NO common data model.
• Quality Web Services considered desirable
What do Scientists use Taverna for?
Taverna Adoption
• Astronomy, Music, Meteorology, Social Science, Cheminformatics, Systems Biology
• UK institutes, international institutes, international networks, projects, lots of universities
Hypothesis Construction and Explanation from the Literature
my BioAID, VL-e
Manipulation of SBML models in workflows
Pharmacogenomics
Association study of Nevirapine-induced skin rash in Thai Population
Data Warehousing
tGRAP Database Rescue
Genome-wide SNP Analysis
• Analysis over compute clusters
• Automate annotation of results
• Mine annotation data for patterns
[Hoyle]
Shared Genomics
Taverna Grid Use Cases
– KnowArc – The Grid-enabled Know-how Sharing Technology Based on ARC Services and Open Standards
– caGrid – US Cancer Research project
– Moteur – A medical imaging project running on EGEE
Lymphoma Prediction Workflow
Wei Tan, Univ. Chicago
Ack. Juli Klemm, Xiaopeng Bian, Rashmi Srinivasa (NCI), Jared Nedzel (MIT)
Use gene-expression patterns associated with two lymphoma types to predict the type of an unknown sample.
• Steps: microarray from tumor tissue, microarray preprocessing, lymphoma prediction
• Services: caArray, GenePattern
caGrid Plugin for Taverna
• Taverna support for GAARDS-secured caGrid services
• Wrap existing 3rd party services (that are used by existing Taverna users) for caGrid and annotate them to match compatibility guidelines
• Enables discovery of services in caGrid service registry
Lymphoma type prediction workflow
Genotype Phenotype Studies
• Mouse whipworm infection - a model of the human parasite Trichuris trichiura
Understanding Phenotype
• Comparing resistant vs susceptible strains – Microarrays
Understanding Genotype
• Mapping quantitative traits – Classical genetics QTL
Joanne Pennock, Richard Grencis
University of Manchester
Workflow Results
• Identified the biological pathways involved in sex dependence in the mouse model, previously believed to be involved in the ability of mice to expel the parasite.
• Manual experimentation: Two year study of candidate genes, processes unidentified
Joanne Pennock, Richard Grencis
University of Manchester
• JO IS A LAB BIOLOGIST
• JO HAS NEVER BUILT A WORKFLOW
Understanding Phenotype
• Comparing resistant vs susceptible strains – Microarrays
Understanding Genotype
• Mapping quantitative traits – Classical genetics QTL
Integrated Microarray data, genomic sequences, pathway data, literature mining.
Trypanosomiasis Study
Identified a pathway whose correlated gene (Daxx) is believed to play a role in trypanosomiasis resistance
Paul Fisher et al., Nucleic Acids Research, 2007, 35(16)
http://www.youtube.com/watch?v=x83pzMMw7lk
http://www.youtube.com/watch?v=Y6_Kz5L010g
Just Enough Sharing….
• myExperiment can provide a central location for workflows from one community/group
• myExperiment allows you to say
– Who can look at your workflow
– Who can download your workflow
– Who can modify your workflow
– Who can run your workflow
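The four questions above amount to a per-workflow access policy. A minimal Python sketch of such a policy follows, assuming four independent permissions and an owner who can always do everything. `SharingPolicy`, `grant` and `can` are hypothetical names chosen for illustration and do not describe how myExperiment is implemented.

```python
from dataclasses import dataclass, field

ACTIONS = ("view", "download", "modify", "run")


@dataclass
class SharingPolicy:
    """Per-workflow policy: which users may perform which action."""
    owner: str
    allowed: dict = field(default_factory=lambda: {a: set() for a in ACTIONS})

    def grant(self, action, user):
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        self.allowed[action].add(user)

    def can(self, user, action):
        # The owner can always do everything; others need an explicit grant.
        return user == self.owner or user in self.allowed.get(action, set())


policy = SharingPolicy(owner="alice")
policy.grant("view", "bob")
policy.grant("download", "bob")
print(policy.can("bob", "view"), policy.can("bob", "modify"))
```

Keeping the permissions separate is what allows "just enough" sharing: a collaborator can be allowed to look at a workflow without being able to change it.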
The most important aspect of myExperiment - Designed by scientists
Ownership and Attribution
• Packs allow you to collect different items together, like you might with a "wish list" or "shopping basket"
• You can collect internal things (such as workflows, files and even other packs) as well as link to things outside myExperiment
• Your packs can then be shared, tagged, discovered and discussed easily on myExperiment
Packs
Bringing myExperiment to the Taverna User
myExperiment Plugin in Taverna
Running Workflows Through myExperiment
Taverna Remote Execution (T-REX)
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX myexp: <http://rdf.myexperiment.org/ontology#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
SELECT ?friend1 ?friend2 ?acceptedat
WHERE {
  ?z rdf:type <http://rdf.myexperiment.org/ontology#Friendship> .
  ?z myexp:has-requester ?x .
  ?x sioc:name ?friend1 .
  ?z myexp:has-accepter ?y .
  ?y sioc:name ?friend2 .
  ?z myexp:accepted-at ?acceptedat
}
All accepted Friendships including accepted-at time
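To illustrate what the graph pattern in this query matches, the Python sketch below evaluates the same joins over a handful of triples held in memory. The data, the abbreviated identifiers and the `objects` and `friendships` helpers are all hypothetical; in practice the query would be sent to a SPARQL endpoint rather than evaluated by hand.

```python
# Hypothetical triples (subject, predicate, object), prefixes abbreviated.
TRIPLES = [
    ("f1", "rdf:type", "myexp:Friendship"),
    ("f1", "myexp:has-requester", "u1"),
    ("f1", "myexp:has-accepter", "u2"),
    ("f1", "myexp:accepted-at", "2009-05-01T10:00:00Z"),
    ("f2", "rdf:type", "myexp:Friendship"),
    ("f2", "myexp:has-requester", "u2"),
    ("f2", "myexp:has-accepter", "u3"),   # no accepted-at: still pending
    ("u1", "sioc:name", "Ann"),
    ("u2", "sioc:name", "Ben"),
    ("u3", "sioc:name", "Cat"),
]


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the data."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def friendships():
    rows = []
    for z, p, o in TRIPLES:
        if p != "rdf:type" or o != "myexp:Friendship":
            continue
        # Every pattern in the WHERE clause must match, so a friendship
        # without an accepted-at triple produces no row.
        for x in objects(z, "myexp:has-requester"):
            for y in objects(z, "myexp:has-accepter"):
                for friend1 in objects(x, "sioc:name"):
                    for friend2 in objects(y, "sioc:name"):
                        for accepted in objects(z, "myexp:accepted-at"):
                            rows.append((friend1, friend2, accepted))
    return rows


rows = friendships()
print(rows)
```

The pending friendship `f2` is absent from the result because it has no `accepted-at` triple, which is why the query returns only accepted friendships.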
Semantically-Interlinked Online Communities
Service Discovery
There are thousands of distributed services. How do we find an appropriate one?
• We need to annotate services by their functions (and not their names!)
• The services might be distributed, but a registry of service descriptions can be central and queried
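A small Python sketch of the registry idea follows: descriptions are held centrally and searched by annotated function, while the services themselves stay wherever they are hosted. `ServiceDescription`, `Registry`, the service names and the example.org endpoints are all hypothetical and do not reflect BioCatalogue's actual data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceDescription:
    name: str
    endpoint: str                                  # where the distributed service lives
    functions: set = field(default_factory=set)    # what it does


class Registry:
    """Central catalogue of descriptions; the services stay distributed."""

    def __init__(self):
        self._entries = []

    def register(self, description):
        self._entries.append(description)

    def find_by_function(self, function):
        wanted = function.lower()
        return [d for d in self._entries
                if wanted in {f.lower() for f in d.functions}]


registry = Registry()
registry.register(ServiceDescription(
    "runBlast42", "http://example.org/a", {"similarity searching"}))
registry.register(ServiceDescription(
    "alignTool", "http://example.org/b", {"sequence alignment"}))
registry.register(ServiceDescription(
    "svc_0017", "http://example.org/c", {"Similarity Searching", "pattern matching"}))

matches = [d.name for d in registry.find_by_function("similarity searching")]
print(matches)
```

The service called `svc_0017` is found even though its name says nothing about what it does, which is the reason for annotating by function rather than by name.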
BioCatalogue
www.biocatalogue.org
• A “Web 2.0” catalogue for sharing, discovering and monitoring web services for the Life Sciences.
• Community and expert curation
• Community and provider contribution
• Launched mid 2009.
• Currently: 370+ members, 1700+ services, 11,870+ operations
• 110+ providers, 110+ different countries
REST APIs
Linked Open Data
Software Open source BSD
Data and Provenance
• Workflows can generate vast amounts of data - how can we manage and track it?
• We need to manage data AND metadata AND experimental provenance
• Scientists need to check back over past results, compare workflow runs and share workflow runs with colleagues
• Scientists need to look at intermediate results when designing and debugging
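One way to picture these requirements is a store that keeps the inputs and outputs of every step for every run, so that intermediate results can be inspected and two runs compared. The Python sketch below is illustrative only. `ProvenanceStore`, `record` and `diff` are hypothetical names, and the record layout is an assumption rather than Taverna's provenance format.

```python
import hashlib
import json
from datetime import datetime, timezone


class ProvenanceStore:
    """Records inputs and outputs of every step, keyed by workflow run."""

    def __init__(self):
        self.runs = {}   # run id -> list of step records

    def record(self, run_id, step, inputs, outputs):
        self.runs.setdefault(run_id, []).append({
            "step": step,
            "inputs": inputs,
            "outputs": outputs,           # intermediate results are kept
            "digest": self._digest(outputs),
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        })

    @staticmethod
    def _digest(value):
        # Stable fingerprint of a JSON-serialisable value.
        text = json.dumps(value, sort_keys=True)
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def diff(self, run_a, run_b):
        """Names of steps whose outputs differ between two runs."""
        a = {r["step"]: r["digest"] for r in self.runs[run_a]}
        b = {r["step"]: r["digest"] for r in self.runs[run_b]}
        return sorted(step for step in a.keys() & b.keys() if a[step] != b[step])


store = ProvenanceStore()
store.record("run1", "fetch", {"id": "demo"}, {"sequence": "ATGC"})
store.record("run1", "revcomp", {"dna": "ATGC"}, {"result": "GCAT"})
store.record("run2", "fetch", {"id": "demo"}, {"sequence": "ATGC"})
store.record("run2", "revcomp", {"dna": "ATGC"}, {"result": "GCAA"})
changed = store.diff("run1", "run2")
print(changed)
```

Here the comparison points straight at the step whose output changed between runs, which is the kind of question a scientist asks when checking back over past results or debugging a workflow.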
Provenance
• Another slide here
• Screenshot of provenance view
myGrid Open Suite of Tools
Client User Interfaces
Workflow GUI Workbench
Workflow Repository
Service Catalogue
Third Party Tools
Programming and APIs
Web Portal
Activity and Service Plug-in Manager
Provenance Store
Workflow Server
Open Provenance Model
Secure Service Access
Toolkits “Taverna Inside”
Workflows under the hood
• e-Laboratories (portals)
– Systems Biology, e-Health
• Web based execution
– Running workflows over the web through myExperiment
• Visualisation clients that call workflows in the background
Open e-Lab Platforms
• Customised myExperiment instances
– Australian Kepler Repository
– eStat, NeuroHub, Nema
– SpaceBook, HPC/NA
– Microsoft Trident
• BioCatalogue installations
– Emory – ed unify project
– Eli Lilly
SysMO-SEEK
• e-Laboratory for interlinking and sharing data, models, SOPs and workflows for Systems Biology in Europe
• ISA-TAB & SBML/MIRIAM compliant
Current Work
Taverna 2.2
Released end June
• Workflow diagnostics and error resolution
• Retry and parallelisation
• Stop/pause/resume workflows
• Intermediate results display
Taverna Roadmap
• Next Generation Workbench
• Access to service, data and workflow repositories
• More data driven
• Component families for vertical markets
• Workflow Patterns
• Taverna from Excel
“myGrid-in-a-Box” – Virtualised Taverna server deployment and distribution, bundle of myExperiment, BioCatalogue and database/tools components.
Taverna Labs
• Semantic Taverna
– Semantic provenance
• Open Provenance Model
– Linked Open Data
• Dutch NBIC Aida toolkit
– Automated workflow planning through reasoning
• e-Lico with U Zurich and Rapid-Miner
• Taverna in the Cloud
• Blogging the lab book
– Blog3 with Southampton U
Training
• Tutorials and Training
– 58+ tutorials to >900 people.
– >20 universities, Life Science institutes, and networks.
– Major Bio conferences
– Summer schools in Biology and Middleware.
• Developer and User Days
– Annotation Jamborees
• Undergraduate and Postgraduate Bioinformatics in >30 universities.
More Information
• myGrid
– http://www.mygrid.org.uk
• Taverna
– http://www.taverna.org.uk
• myExperiment
– http://www.myexperiment.org
– http://wiki.myexperiment.org
• BioCatalogue
– http://www.biocatalogue.org
– http://beta.biocatalogue.org