1
A myGrid Project Tutorial
Dr Mark Greenwood
University of Manchester
With considerable help from Justin Ferris, Peter Li, Phil Lord, Chris Wroe, Carole Goble and the rest of the myGrid team.
2
• Open Source Upper Middleware for Bioinformatics
• (Web) Service-based architecture
• Targeted at Tool Developers, Bioinformaticians and Service Providers
Newcastle
Nottingham
Manchester
Southampton
Hinxton
Sheffield
3
myGrid People
Core
• Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizauskas, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe.
Users
• Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle, UK
• Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
Postgraduates
• Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire
Industrial
• Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM)
• Robin McEntire (GSK)
Collaborators
• Keith Decker
4
Roadmap - start
services
data
6
Tenet I
• High level Middleware services for data intensive resource interoperation for Bioinformatics
– Information Grid not computational Grid
• Exploratory, ad hoc
• For individuals
• In silico experiment as workflow
• Distributed query processing
• Information Management
7
Tenet II
• High level services for e-Science experimental management
– Provenance
– Event notification
– Personalisation
• Sharing knowledge and sharing components
– Scientific discovery is personal & global.
– Federated third party registries for workflows and services
– Workflow and service discovery for reuse and repurposing
[Figure: Registry, with Register, Find and Annotate operations]
8
Tenet III
• Open Source and Open Services
– No control or influence over service providers
• Open to third party metadata and services
• Open extensible architecture
– Assemble your own components
– Designed to work together
– Toolkit
[Figure: toolkit components: Freefluo WfEE, Taverna, View, UDDI registry, Event Notification, mIR, Pedro, Semantic Discovery, Info. Model, Soaplab, Gateway & Portal, LSID, Haystack Provenance Browser]
9
Tenet IV
• (Web) Service architecture
– Publication, discovery, interoperation, composition, decommissioning of myGrid services
– WS-I -> OGSA / WSRF
• Metadata driven
– Ontologies
– Common information model
– Semantic Web technologies: RDF, OWL
10
Tenet V
Middleware for
• Tool Developers
• Bioinformaticians
• Service Providers
Biologists are indirectly supported by the portals and apps that these groups develop.
11
Roadmap
[Figure: services, workflows and data, linked by the steps: discover services, run workflows, data management]
12
Data-intensive bioinformatics
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE    116    116       BINDS PEP (BY SIMILARITY).
FT   CONFLICT    374    374       S -> A (IN REF. 3).
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
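Records like the entry above are line-oriented: each annotation line starts with a two-letter code (ID, DE, OS, ...) and the sequence follows the SQ line. As a rough illustration of why such data invites automation, the sketch below groups lines by code in plain Python. The function name `parse_flat_record` is an illustrative assumption, the sample is shortened, and nothing here is part of myGrid.

```python
from collections import defaultdict

# Shortened sample: a few annotation lines and the first 50 residues only.
SAMPLE = """\
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
OS   BACILLUS SUBTILIS.
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP
     ATILANSEVT IEGLPEISDI
"""

def parse_flat_record(text):
    """Group a line-oriented record by its two-letter line code.

    Lines that start with whitespace are treated as sequence data.
    """
    fields = defaultdict(list)
    residues = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():
            residues.append(line.replace(" ", ""))
        else:
            code, _, rest = line.partition(" ")
            fields[code].append(rest.strip())
    return dict(fields), "".join(residues)

fields, sequence = parse_flat_record(SAMPLE)
print(fields["ID"][0].split()[0])   # entry name
print(" ".join(fields["DE"]))       # description joined across DE lines
print(len(sequence))                # residues in this shortened sample
```

Even this small amount of structure is what lets a service pass the sequence on to the next tool without a person copying it between screens.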
13
Use Scenarios
Graves’ Disease
• Autoimmune disease of the thyroid
• Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle
• Discover all you can about a gene
• Annotation pipelines and Gene expression analysis
• Services from Japan, Hong Kong, various sites in UK
Williams-Beuren Syndrome
• Microdeletion of 1.55 Mbases on Chromosome 7
• Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
• Characterise an unknown gene
• Annotation pipelines and Gene expression analysis
• Services from USA, Japan, various sites in UK
14
Manually filling a genomic gap
Two major steps:
• Extend into the gap: Similarity searches; RepeatMasker, BLAST
• Characterise the new sequence: NIX, Interpro, etc…
Problems with the manual approach:
• Numerous web-based services (e.g. BLAST, RepeatMasker)
• Cutting and pasting between screens
• Large number of steps
• Frequently repeated – info now rapidly added to public databases
• Don’t always get results
• Time consuming
• Huge amount of interrelated data is produced – handled in lab book and files saved to local hard drive
• Mundane
• Much knowledge remains undocumented
• Bioinformatician does the analysis
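The cut-and-paste routine described above is what a workflow replaces. A minimal sketch, assuming hypothetical stand-in functions for the web services (no real BLAST or RepeatMasker call is made): each step is a function, and a small runner passes each output to the next step while recording every intermediate result, instead of leaving them in a lab book.

```python
def mask_repeats(seq):
    # Hypothetical stand-in for RepeatMasker: lower-case one fixed "repeat".
    return seq.replace("ATATAT", "atatat")

def similarity_search(seq):
    # Hypothetical stand-in for a BLAST search: only counts unmasked bases.
    return {"query": seq, "unmasked": sum(1 for c in seq if c.isupper())}

def run_pipeline(steps, data):
    """Run steps in order, keeping every intermediate result."""
    log = [("input", data)]
    for step in steps:
        data = step(data)
        log.append((step.__name__, data))
    return data, log

result, log = run_pipeline([mask_repeats, similarity_search], "GGCATATATCC")
for name, value in log:
    print(name, "->", value)
```

Because the step list is data, repeating the analysis when the public databases change means calling `run_pipeline` again rather than redoing the manual steps.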
15
WBS Workflows:
[Figure: workflow diagram; recoverable labels below]
Main path: GenBank Accession No -> GenBank Entry -> Seqret -> Nucleotide seq (Fasta) -> GenScan -> Coding sequence
Services on the nucleotide sequence (service: output)
• sixpack: 6 ORFs
• prettyseq: Translation/sequence file. Good for records and publications
• restrict: Restriction enzyme map
• cpgreport: CpG Island locations and %
• RepeatMasker: Repetitive elements
• ncbiBlastWrapper: Blastn Vs nr, est databases
• transeq: Amino Acid translation
Services on the amino acid translation (service: output)
• epestfind: Identifies PEST seq
• pepcoil: Predicts Coiled-coil regions
• pepstats: MW, length, charge, pI, etc
• pscan: Identifies FingerPRINTS
• SignalP, TargetP, PSORTII: Predicts cellular location
• InterPro, PFAM, Prosite, Smart: Identifies functional and structural domains/motifs
• Pepwindow? Octanol?: Hydrophobic regions
• ncbiBlastWrapper: tblastn Vs nr, est, est_mouse, est_human databases. Blastp Vs nr
Other labels: URL inc GB identifier; ORFs; Query nucleotide sequence; RepeatMasker; ncbiBlastWrapper; Sort for appropriate Sequences only
Colour key
• Pink: Outputs/inputs of a service
• Purple: Tailor-made services
• Green: Emboss soaplab services
• Yellow: Manchester soaplab services
• Grey: Unknowns
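One way to read the WBS workflow diagram is as a directed graph in which each service can run once its inputs exist. The sketch below encodes a small part of it with Python's standard `graphlib` module and derives a valid execution order. The edges are an assumption read off the slide's labels (Seqret feeding the nucleotide-level tools, transeq feeding the protein-level tools), not the exact wiring of the WBS workflows.

```python
from graphlib import TopologicalSorter

# Map each step to the steps it depends on (assumed edges, see text above).
depends_on = {
    "Seqret": {"GenBank entry"},
    "RepeatMasker": {"Seqret"},
    "GenScan": {"Seqret"},
    "transeq": {"GenScan"},
    "pepstats": {"transeq"},
    "pepcoil": {"transeq"},
}

# Any order in which every step comes after its dependencies.
order = list(TopologicalSorter(depends_on).static_order())
print(order)

def ready_after(done):
    """Steps whose dependencies are all satisfied but which have not run."""
    return sorted(step for step, deps in depends_on.items()
                  if step not in done and deps <= done)

print(ready_after({"GenBank entry", "Seqret"}))
```

Steps returned together by `ready_after` have no dependency on each other, so an enactment engine would be free to invoke them in parallel.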
16
Graves’ Disease Bioinformatics
[Figure: three linked analyses starting from a candidate gene pool; recoverable labels below]
• Annotation Pipeline: What is known about my candidate gene?
– Sources: Medline, OMIM, GO, BLAST, EMBL
– DQP Query
• Genotype Assay Design System: Is this SNP present in my samples?
– Primer Design: Emboss Eprimer application in SoapLab
– Selection of restriction enzyme: Emboss Restrict in SoapLab
– Restriction Fragment Length Polymorphism experiment: use primers designed by myGrid to amplify region flanking SNP on the gene
• 3D Protein Structure: What is the structure of the protein product encoded by my candidate gene?
– Query PDB & display protein structure
– Obtain information about protein & extract information about active site: Swiss-Prot, AMBIT, Interpro
– Determine whether coding SNP affects the active site of the protein: AMBIT
Other labels: Gene ID, SNP, Talisman
Peter Li¹, Claire Jennings², Simon Pearce² and Anil Wipat¹ (2003). ¹School of Computing Science and ²Institute of Human Genetics, University of Newcastle-upon-Tyne.
17
Experiment life cycle
• Forming experiments
• Executing and monitoring experiments
• Managing lifecycle, provenance and results of experiments
• Sharing services & experiments
• Discovering and reusing experiments and resources
• Personalisation
18
(e-)Scientists…
• …Experiment
– Can workflow be used as an experimental method?
– How many times has this experiment been run?
• …Analyze
– How do we manage the results to draw conclusions from them?
– How reliable are these results?
• …Collaborate
– Can we share workflows, results, metadata etc?
• …Publish
– Can we link to these workflows and results from our papers?
• …Review
– Can I find, comprehend and review your work?
– How was that result derived?
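Questions such as "How many times has this experiment been run?" and "How was that result derived?" become answerable once each run is recorded with its inputs and outputs. A minimal sketch, assuming an in-memory log and illustrative names (`record`, `times_run`, `derivation`) and made-up data item identifiers; it does not reflect how myGrid stores provenance.

```python
log = []  # one entry per executed workflow step

def record(run_id, workflow, step, inputs, output):
    log.append({"run": run_id, "workflow": workflow, "step": step,
                "inputs": tuple(inputs), "output": output})

def times_run(workflow):
    """Number of distinct runs recorded for a workflow."""
    return len({e["run"] for e in log if e["workflow"] == workflow})

def derivation(item):
    """Trace an item back through the steps that produced it."""
    producers = {e["output"]: e for e in log}
    chain = []
    while item in producers:
        entry = producers[item]
        chain.append(entry["step"])
        # Follow the first input only; enough for a linear pipeline.
        item = entry["inputs"][0]
    return list(reversed(chain))

record(1, "annotate", "Seqret", ["accession-1"], "fasta-1")
record(1, "annotate", "GenScan", ["fasta-1"], "cds-1")
record(2, "annotate", "Seqret", ["accession-2"], "fasta-2")

print(times_run("annotate"))
print(derivation("cds-1"))
```

The same records support the review and publish questions: a paper can cite a run identifier, and a reviewer can walk the derivation chain back to the original input.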
19
Collections of Tasks
[Figure: collections of tasks and the roles that perform them]
• Tasks: Finding, Description, Service Discovery, Enactment, Workflow Building, Provenance, Storage, Data Management, Querying, Domain Tasks
• Roles: Service Providers, Bioinformaticians, Scientists, Annotation providers
20
[Figure: myGrid components and the roles that use them]
• Components: Registry, mIR, Discovery View, Haystack Provenance Browser, FreeFluo Enactor, Taverna WF Builder, Pedro Annotation tool, Ontology Store, Others
• Service interfaces: WSDL, Soaplab
• Interactions: Interface Description; Annotation/description; Query & Retrieve; Workflow Execution; Store data/knowledge; invoking; Querying/sharing/federating/registering; Data descriptions; Vocabulary
• Roles: Annotation providers, Scientists, Bioinformaticians, Service Providers
21
myGrid Service Stack
[Figure: layered service stack; recoverable labels below]
• Roles: Bioinformaticians, Tool Providers, Service Providers
• Applications: Work bench, Taverna, Talisman, Web Portal
• Gateway
• Core services: Service and Workflow Discovery; myGrid Information Repository; Ontology Mgt; Metadata Mgt; Provenance; Personalisation; Event Notification; FreeFluo Workflow Enactment Engine; OGSA-DQP Distributed Query Processor; Views
• Web Service (Grid Service) communication fabric
• External services: Native Web Services; SoapLab; GowLab; Legacy apps; AMBIT Text Extraction Service; Registries; Ontologies
22
Two+ Paths
Core functionality
• Services – Soaplab and Gowlab
• Workflow enactment engine – Freefluo
• Workflow workbench – Taverna
• Data integration – OGSA-DQP
• Information model & management
Innovative work
• Service and workflow registration
• Semantic discovery
• Provenance management
• Text mining
In between
• Event notification
• Gateway
23
24
25
Run the Workflow
Viewing intermediate results
26
Run the Workflow
27
Drilling Down: myGrid and Semantics
• Workflow and service discovery
– Prior to and during enactment
– Semantic registration
• Workflow assembly
– Semantic service typing of inputs and outputs
• Provenance of workflows and other entities
• Experimental metadata glue
• Use of RDF, RDFS, DAML+OIL/OWL
– Instance store, ontology server, reasoner
– Materialised vs at point of delivery reasoning.
• myGrid Information Model
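Semantic typing of inputs and outputs lets a registry answer "which services can consume this data?" by reasoning over a concept hierarchy rather than matching names. A minimal sketch, assuming a hand-written `is_a` hierarchy and illustrative service annotations; a real deployment would describe these in an ontology language such as OWL and use a reasoner, which this does not attempt.

```python
# Child concept -> parent concept (assumed, illustrative hierarchy).
is_a = {
    "nucleotide_sequence": "sequence",
    "protein_sequence": "sequence",
    "sequence": "data",
}

# Service -> (input concept, output concept); annotations are illustrative.
services = {
    "transeq": ("nucleotide_sequence", "protein_sequence"),
    "pepstats": ("protein_sequence", "protein_statistics"),
    "seqret": ("sequence", "sequence"),
}

def subsumed_by(concept, ancestor):
    """True if concept equals ancestor or lies below it in the hierarchy."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = is_a.get(concept)
    return False

def accepts(data_concept):
    """Services whose declared input type covers the given data concept."""
    return sorted(name for name, (inp, _) in services.items()
                  if subsumed_by(data_concept, inp))

print(accepts("protein_sequence"))
print(accepts("nucleotide_sequence"))
```

Here `seqret` is found for both queries because its input is annotated with the more general concept, which is the kind of match a plain keyword search over service names would miss.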
28
Semantic Discovery
View annotations on workflow
Pedro data capture tool
Drag a workflow entry into the explorer pane and the workflow loads.
Drag a service/workflow to the scavenger window for inclusion into the workflow.
29
Tutorial focus
Core functionality
• Services – Soaplab and Gowlab
• Workflow enactment engine – Freefluo
• Workflow workbench – Taverna
• Data integration – OGSA-DQP
• Information model & management
Innovative work
• Service and workflow registration
• Semantic discovery
• Provenance management
• Text mining
In between
• Event notification
• Gateway
30
Roadmap
1. Describe services
2. Discover services
3. Write & run workflows
4. Provenance & data management
[Figure: Registry, Taverna workbench and LSID authorities, linking services, workflows and data]
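The roadmap mentions LSID authorities. Life Science Identifiers are URNs of the form `urn:lsid:authority:namespace:object`, with an optional trailing revision. A minimal parsing sketch; the example identifiers and their `example.org` authority are made up for illustration, and the helper name `parse_lsid` is an assumption rather than any myGrid or Taverna API.

```python
from typing import NamedTuple, Optional

class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]

def parse_lsid(text):
    """Split an LSID URN into its parts; raise ValueError if malformed."""
    parts = text.split(":")
    prefix = [p.lower() for p in parts[:2]]  # the prefix is case-insensitive
    if len(parts) not in (5, 6) or prefix != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    if not all(parts[2:5]):
        raise ValueError(f"empty LSID component: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)

print(parse_lsid("urn:lsid:example.org:workflow:12345"))
print(parse_lsid("URN:LSID:example.org:data:abc:2").revision)
```

The authority part is what a client would use to locate the resolving service for the data or metadata behind the identifier.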
31
Sessions on Details
• Workflows - hands on with Taverna
• Semantics
• Timetable – split sessions
– Session 1
• Group 1 – hands on (Swanson)
• Group 2 – semantics (Newhaven)
– Teabreak (short)
– Session 2
• Group 1 – semantics (Newhaven)
• Group 2 – hands on (Swanson)
– Discussions and Conclusions
32
Questions?
http://www.mygrid.org.uk
http://taverna.sf.net
http://freefluo.sf.net/