myGrid and the Semantic Web
Phillip Lord
School of Computer Science
University of Manchester
myGrid: eScience and Bioinformatics
• Oct 2001 – April 2005.
• £3.4 million.
• UK e-Science Pilot Project.
• £0.4 million studentships.
Sites: Newcastle, Nottingham, Manchester, Southampton, Hinxton, Sheffield
Data (Type) Intensive Bioinformatics
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE    116    116       BINDS PEP (BY SIMILARITY).
FT   CONFLICT    374    374       S -> A (IN REF. 3).
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
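The record above uses the SWISS-PROT flat-file layout, in which each line starts with a two-letter line code (ID, DE, OS, KW and so on) and the type of each field is implicit in the text. As a rough illustration of what a tool must do before it can use such data, the sketch below groups lines by their code. It is a minimal sketch only: `parse_flat_file` and the `RECORD` sample are names and data introduced here for illustration, and a real parser would need to handle many more cases.

```python
from collections import defaultdict

# Hand-typed sample: a few lines of the MURA_BACSU record shown on the slide.
RECORD = """\
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
"""


def parse_flat_file(text):
    """Group line bodies by their two-letter line code (hypothetical helper)."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, body = line[:2], line[2:].strip()
        if code.strip():  # lines starting with blanks (sequence data) are skipped
            fields[code].append(body)
    # Continuation lines repeat the code, so join them back into one string.
    return {code: " ".join(parts) for code, parts in fields.items()}


entry = parse_flat_file(RECORD)
# Keywords are a ';'-separated list terminated by '.'.
keywords = [k.strip() for k in entry["KW"].rstrip(".").split(";")]
```

Even after this step, values such as the organism name remain plain strings; nothing in the file says what kind of thing they denote.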
[Figure: myGrid architecture, built on a Web Service (Grid Service) communication fabric]
Roles: Bioinformaticians, Tool Providers, Service Providers
Layers: Applications, Core services, External services
Components: AMBIT Text Extraction Service, Provenance, Personalisation, Event Notification, Gateway, Service and Workflow Discovery, myGrid Information Repository, Ontology Mgt, Metadata Mgt, Work bench, Taverna, Talisman, Web Portal, Views, Native Web Services, SoapLab, GowLab, Legacy apps, Registries, Ontologies, FreeFluo Workflow Enactment Engine, OGSA-DQP Distributed Query Processor
Support not Automation
Thin Semantics
PRETTYSEQ of CDS1|>CDS2|strand_1 from 1 to 129
    ---------|---------|---------|---------|---------|---------|
  1 atgacggacactgctggtcgctgtggcttcctcctacgcgttcggtcactcctgcacatg 60
  1 M T D T A G R C G F L L R V R S L L H M 20
    ---------|---------|---------|---------|---------|---------|
 61 tccgcagtagtggtgctctcggggaccccctcgccaccccacaataccgctcaccacatg 120
 21 S A V V V L S G T P S P P H N T A H H M 40
    ---------
121 gccaaacag 129
 41 A K Q 43
CPGREPORT of CDS1|>CDS2|strand_1 from 1 to 129
Sequence Begin End Score CpG %CG CG/GC
CDS1|>CDS2|strand_1 5 109 58 9 64.8 1.12
########################################
# Program: restrict
# Rundate: Thu Jul 15 16:32:30 2004
# Report_format: table
# Report_file: /scratch/emboss_interfaces/a/unknown/Projects/default/Data/out1089905549241
########################################
Start End Enzyme_name Restriction_site 5prime 3prime 5primerev 3primerev
4 8 TspGWI ACGGA 19 17 . .
9 15 TspRI CASTGNN 15 6 . .
14 19 BtsI GCAGTG 8 6 . .
25 28 CviJI RGCY 26 26 . .
30 33 MnlI CCTC 40 39 . .
36 41 MluI ACGCGT 36 40 . .
#---------------------------------------
#---------------------------------------
Semantic Discovery with Feta
Query-ontology – discovering workflows and services described in the registry by building a query in Taverna.
A common ontology is used to annotate and query. (Planning for OBO release.)
Knowledge in Feta
Ontology (OWL-DL)
Service Descriptions (XML)
Jena Querying (RDF)
Service Discovery
GOOD:
RDF provides a convenient search capability, with a well-defined link to an ontology.
BAD:
Unsure about scalability. Issues of security and concurrency will probably also affect us.
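To make the "search with a link to an ontology" point concrete, here is a minimal sketch in plain Python. It is not Feta's implementation and does not use the Jena API; the class hierarchy, the property names (`hasInput`, `performsTask`) and the service annotations are all assumptions made up for illustration. It shows how a query for a general class can find services annotated with more specific classes.

```python
# Hypothetical ontology fragment: child class -> parent class.
SUBCLASS_OF = {
    "ProteinSequence": "Sequence",
    "NucleotideSequence": "Sequence",
    "Sequence": "BioData",
}

# Hypothetical service annotations as (subject, predicate, object) triples.
TRIPLES = {
    ("blastp", "hasInput", "ProteinSequence"),
    ("blastp", "performsTask", "SimilaritySearch"),
    ("prettyseq", "hasInput", "NucleotideSequence"),
    ("restrict", "hasInput", "NucleotideSequence"),
    ("cpgreport", "hasInput", "NucleotideSequence"),
}


def ancestors(cls):
    """Return cls followed by every superclass reachable via SUBCLASS_OF."""
    seen = [cls]
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        seen.append(cls)
    return seen


def match(pattern):
    """Yield triples matching a 3-item pattern; None acts as a wildcard."""
    for triple in TRIPLES:
        if all(p is None or p == t for p, t in zip(pattern, triple)):
            yield triple


def services_accepting(cls):
    """Find services whose annotated input is cls or one of its subclasses."""
    return sorted(
        subj
        for subj, _, obj in match((None, "hasInput", None))
        if cls in ancestors(obj)
    )
```

Here `services_accepting("Sequence")` also returns services annotated only with `ProteinSequence` or `NucleotideSequence`, which a plain string match on the annotation would miss. The linear scan over all triples also hints at why scalability is listed as a concern.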
Provenance
• Bioinformatics has a data circularity problem.
• Computational data is hard to trace, reproduce or repeat.
• We need to store provenance.
• Service Orientated Architecture and Service Descriptions start to enable us to do this.
Provenance: The Semantic Web
Generating Provenance
Components shown: Web Services, Taverna, FreeFluo, Metadata Repository (reified), Data Repository, LaunchPad, Haystack
Entities shown: Workflow run, Workflow design, Experiment design, Project, Person, Organisation, Process, Service, Event, Data item
Relationships shown:
• data derivation, e.g. output data derived from input data
• instanceOf
• partOf
• componentProcess, e.g. web service invocation of BLAST @ NCBI
• componentEvent, e.g. completion of a web service invocation at 12.04pm
• runBy, e.g. BLAST @ NCBI
• run for
Levels: Organisation level provenance, Process level provenance
User can add templates to each workflow process to determine links between data items.
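Once links between data items are recorded, tracing where a result came from becomes a graph walk. The sketch below is an illustration under stated assumptions, not the myGrid provenance code: the relation name `derivedFrom` and all data item names are hypothetical, with `runBy` borrowed from the slide's relationship labels.

```python
# Hypothetical provenance statements as (subject, predicate, object) triples.
PROVENANCE = [
    ("blast_report", "derivedFrom", "query_sequence"),
    ("alignment", "derivedFrom", "blast_report"),
    ("alignment", "derivedFrom", "reference_set"),
    ("blast_invocation", "runBy", "BLAST @ NCBI"),
]


def lineage(item, statements):
    """Return every data item that `item` was directly or indirectly derived from."""
    found = []
    stack = [item]
    while stack:
        current = stack.pop()
        for subj, pred, obj in statements:
            # Follow only derivation links; skip items already recorded.
            if subj == current and pred == "derivedFrom" and obj not in found:
                found.append(obj)
                stack.append(obj)
    return found
```

Calling `lineage("alignment", PROVENANCE)` collects `blast_report`, `reference_set` and `query_sequence`, which is the kind of trace needed to reproduce or repeat a computational result.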
Provenance
GOOD:
RDF provides a convenient data model, which is flexible and adaptable.
BAD:
Visualisation tools are lacking. Scalability is even more of an issue with reification.
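One way to read the reification concern: in the standard RDF reification vocabulary, a statement about a statement requires a node typed `rdf:Statement` with `rdf:subject`, `rdf:predicate` and `rdf:object` properties, so each reified triple costs four triples before any annotation is attached. The sketch below illustrates that expansion in plain Python; the helper `reify`, the property names `derivedFrom` and `assertedBy`, and the item names are hypothetical.

```python
import itertools

# Counter used to mint fresh blank-node labels for reified statements.
_ids = itertools.count(1)


def reify(subject, predicate, obj):
    """Expand one triple into the four triples of RDF reification."""
    stmt = f"_:stmt{next(_ids)}"
    return stmt, [
        (stmt, "rdf:type", "rdf:Statement"),
        (stmt, "rdf:subject", subject),
        (stmt, "rdf:predicate", predicate),
        (stmt, "rdf:object", obj),
    ]


stmt, triples = reify("blast_report", "derivedFrom", "query_sequence")
# Metadata about the statement itself hangs off the statement node.
triples.append((stmt, "assertedBy", "workflow_run_42"))
```

A repository that reifies every provenance statement therefore stores at least four times as many triples as one that does not.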
LSIDs
• Standard identifier mechanism, aimed at the life sciences
• Has standard resolution mechanism by which the data can be obtained.
• Has semantics for versioning
• Has standard association with metadata
• Abbreviation distressingly similar to LSD
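An LSID is a URN of the form `urn:lsid:authority:namespace:object`, with an optional trailing revision that carries the versioning semantics. The sketch below splits an identifier into those parts. It is a minimal illustration rather than a full validator, and the example identifiers under `example.org` are invented for this purpose.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split an LSID URN into its components; raise ValueError if malformed."""
    parts = text.split(":")
    # Expect 'urn', 'lsid', authority, namespace, object and an optional revision.
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    if not all(parts[2:5]):
        raise ValueError(f"empty component in LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Hypothetical identifier: authority, namespace and object are made up.
example = parse_lsid("urn:lsid:example.org:proteins:MURA_BACSU:2")
```

Resolution itself, meaning fetching the data or metadata behind the identifier, is a separate step that this sketch does not attempt.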
Provenance
• Used LSID within provenance; all of our data is stored and resolved with LSID
• Notion of a single identifier system within myGrid is attractive.
Worries
• We are unclear as to how the metadata/data split should happen with LSID: use the former for mutability, the latter for immutability.
• We have also been tending toward using “metadata” for RDF-based data, and “data” for relational.
LSID
GOOD:
Defined resolution mechanism, data and metadata.
BAD:
Unclear how to use data/metadata split.
Acknowledgements
Core
• Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizauskas, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe.
Users
• Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle, UK
• Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
Postgraduates
• Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire
Industrial
• Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM)
• Robin McEntire (GSK)
Collaborators
• Keith Decker
Summary
GOOD:
RDF provides a convenient search capability, with a well-defined link to an ontology.
RDF provides a convenient data model, which is flexible and adaptable.
LSID: Defined resolution mechanism, data and metadata.
BAD:
Unsure about scalability. Issues of security and concurrency will probably also affect us.
Visualisation tools are lacking. Scalability is even more of an issue with reification.
LSID: Unclear how to use data/metadata split.