adding meaning to your data
Biosapiens talk 2007-12-04
Making your data more meaningful
Using BioMOBY and myGrid Taverna services as examples
Duncan Hull
University of Manchester
BioSapiens Network of Excellence 2007-12-04
Outline
• Semantic Web redux
– Semantic web stack,
– There are lots of TLAs (three letter acronyms) in this talk, unavoidable
– XML, URI, Namespaces, Unicode,
– RDF, RDFS, OWL
• In this talk when I say “data” I usually mean Web Services…
– Example: InterProScan
– BioMOBY http://www.biomoby.org
– myGrid Taverna http://www.mygrid.org.uk
– myExperiment http://www.myexperiment.org
• Conclusions
– What did we learn?
The semantic web “knows what you mean” thanks to added meaning
• According to Tim Berners-Lee, Ora Lassila and Jim Hendler http://scholar.google.com/scholar?q=semantic+web
• …and Mark Butler from HP Labs http://www.flickr.com/photos/dullhunk/303503677/
• Vague, audacious, “visionary”, controversial and/or doomed (depending on who you ask)
• in practice this means…
Semantic Web “stack”
• A suite of technology and standards* for adding meaning to data
Taken from “Semantic Web Architecture: Stack or Two Towers?” by Ian Horrocks et al., see http://dx.doi.org/10.1007/11552222_4 and http://www.flickr.com/photos/dullhunk/415645490/
…But first, InterProScan…
• InterProScan: Protein domains identifier @ EBI http://view.ncbi.nlm.nih.gov/pubmed/15980438
• http://www.ebi.ac.uk/Tools/webservices/rest/submit?tool=iprscan&sequence=uniprot:slpi_human&seqtype=P&[email protected]
• That horrendously long URI submits a job to InterProScan with 4 parameters
– ?tool=iprscan “use the InterProScan tool”
– &sequence=uniprot:slpi_human “…with the sequence secretory leukocyte proteinase inhibitor (SLPI) in UniProt format”
– &seqtype=P “sequence type is protein (e.g. not DNA)”
– &email=[email protected] “email results to Homer Simpson”
– Returns a job identifier e.g. iprscan-20071203-18053660
http://www.ebi.ac.uk/cgi-bin/iprscan/iprscan?tool=iprscan&jobid=iprscan-20071203-18053660
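The four-parameter URI above can be put together and taken apart again with a few lines of Python. This is only a sketch using the standard library: the endpoint is copied from the slide and may no longer be live, the e-mail address is a made-up placeholder, and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint as shown on the slide; it may no longer exist.
BASE = "http://www.ebi.ac.uk/Tools/webservices/rest/submit"

# The four parameters. The e-mail value is a placeholder (assumption).
params = {
    "tool": "iprscan",                 # use the InterProScan tool
    "sequence": "uniprot:slpi_human",  # SLPI, referenced by UniProt name
    "seqtype": "P",                    # protein, not DNA
    "email": "user@example.org",       # where results would be sent
}

# safe=":@" leaves ':' and '@' readable instead of percent-encoding them.
uri = BASE + "?" + urlencode(params, safe=":@")
print(uri)

# Going the other way: recover the parameters from the URI.
parsed = parse_qs(urlsplit(uri).query)
for name, values in parsed.items():
    print(name, "=", values[0])
```

Nothing here depends on InterProScan itself; any query-string style service could be described the same way.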
Back to the Stack
URI, Unicode, XML and Namespaces
• Bottom of semantic web stack:
• Namespaces http://www.w3.org/TR/xml-names/
• eXtensible Markup Language (XML) http://www.w3.org/TR/xml
• Uniform Resource Identifiers (URI) http://www.ietf.org/rfc/rfc3986
• Unicode http://www.unicode.org/charts
URI: Uniform Resource IDENTIFIER
• URIs include the Uniform Resource Locators (URLs) most people are familiar with for locating things, usually just called “links”…
– E.g. http://www.biosapiens.info locator for the biosapiens website
– E.g. http://view.ncbi.nlm.nih.gov/pubmed/16015280 locates a biosapiens publication
– Not persistent: sometimes unstable, and they can break, e.g. “404 not found”
– Not guaranteed to be unique
• URIs include Uniform Resource Names (URNs) for naming things that are less familiar like ISBN, Digital Object Identifiers (DOI) and Life Science Identifiers (LSID) etc
– E.g. urn:doi:10.1038/sj.ejhg.5201470 names a publication using DOI
– E.g. urn:isbn:0387484361 names a book using ISBN
– E.g. urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:genbank:bx247883 names a biological sequence using LSID
– Unlike URLs, URNs are UNIQUE and PERSISTENT
• URIs can be names, identifiers or locators (sometimes all three)
• See URI Generic syntax http://www.ietf.org/rfc/rfc3986 and URN syntax http://www.ietf.org/rfc/rfc2141 from the Internet Engineering Task Force (IETF)
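To make the locator versus name distinction concrete, here is a small sketch that splits a URI into its parts with Python's standard library. The function name describe and the keys of the dictionary it returns are invented for this illustration; the example URIs are the ones from the slide.

```python
from urllib.parse import urlsplit


def describe(uri: str) -> dict:
    """Split a URI into its scheme and, for URNs, namespace and name."""
    parts = urlsplit(uri)
    info = {"scheme": parts.scheme}
    if parts.scheme == "urn":
        # A URN has the shape urn:<namespace id>:<namespace specific string>
        namespace, _, name = parts.path.partition(":")
        info.update(kind="name", namespace=namespace, name=name)
    else:
        info.update(kind="locator", host=parts.netloc, path=parts.path)
    return info


for example in (
    "http://www.biosapiens.info",
    "urn:isbn:0387484361",
    "urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:genbank:bx247883",
):
    print(example, "->", describe(example))
```

The split is purely syntactic: it says nothing about whether a name is actually unique or persistent, which depends on whoever manages the namespace.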
Unicode: Boring but important
• Unicode provides a unique number for every character
– no matter what the platform (Windows, Unix, iPhone, toaster etc)
– no matter what the program (protein database, email client etc)
– no matter what the language (English, Chinese, Swahili… you name it)
• E.g. U+0041 is the number for “LATIN CAPITAL LETTER A”
• E.g. U+0F03 is the number for “TIBETAN MARK GTER YIG MGO -UM GTER TSHEG MA”
• E.g. U+221E is the number for “INFINITY”
• http://www.unicode.org/standard/WhatIsUnicode.html
• http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
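The three code points above can be looked up directly. A minimal sketch using Python's built-in unicodedata module (the variable name names is just for this example):

```python
import unicodedata

# Map each code point from the slide to its official Unicode character name.
names = {}
for code_point in (0x0041, 0x0F03, 0x221E):
    names[code_point] = unicodedata.name(chr(code_point))
    print(f"U+{code_point:04X}", names[code_point])

# The number identifies the character; how it is stored as bytes
# depends on the encoding chosen (UTF-8 here).
infinity_bytes = chr(0x221E).encode("utf-8")
print(infinity_bytes)
```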
XML: eXtensible Markup Language, boring but incredibly useful
• Data marked using tags as “trees”
<operation name="InterProScan" method="get">
  <request>
    <parameter name="sequence" type="xsd:string" required="true"/>
  </request>
  <response>
    <representation mediaType="text/xml" element="yn:ResultSet">
      <parameter name="totalResults" type="xsd:nonNegativeInteger"/>
    </representation>
  </response>
</operation>
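To see the “tree” in that markup, the sketch below parses a well-formed version of the snippet and prints each element indented by its depth. It uses Python's standard xml.etree.ElementTree; the helper name walk is invented for this example, and the closing tags that the slide leaves out are filled in here as an assumption.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the slide's snippet (closing tags are assumed).
DOC = """
<operation name="InterProScan" method="get">
  <request>
    <parameter name="sequence" type="xsd:string" required="true"/>
  </request>
  <response>
    <representation mediaType="text/xml" element="yn:ResultSet">
      <parameter name="totalResults" type="xsd:nonNegativeInteger"/>
    </representation>
  </response>
</operation>
"""


def walk(element, depth=0):
    """Yield (depth, tag, attributes) for every element, parents first."""
    yield depth, element.tag, dict(element.attrib)
    for child in element:
        yield from walk(child, depth + 1)


root = ET.fromstring(DOC.strip())
for depth, tag, attrs in walk(root):
    print("  " * depth + tag, attrs)
```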
Namespaces
• “XML namespaces provide a simple method for qualifying element and attribute names used in XML by associating them with namespaces identified by URI references.”
<?xml version="1.0" standalone="yes"?>
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema" …
What this means is that
<xsi:fred> and <xsd:fred>
are different because they belong to different namespaces.
“xsi:fred” is shorthand for the name “fred” in the namespace http://www.w3.org/2001/XMLSchema-instance
This allows us to have lots of different things called “fred”
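A quick way to check that the two “fred”s really are different is to parse them and look at the names the parser reports. In this sketch the wrapping root element is made up for the example; Python's ElementTree writes a qualified name as {namespace URI}local-name.

```python
import xml.etree.ElementTree as ET

# The <root> wrapper is hypothetical; the two namespace URIs are from the slide.
DOC = """<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsi:fred/>
  <xsd:fred/>
</root>"""

root = ET.fromstring(DOC)

# Each prefix is expanded to its namespace URI, so the tags differ.
tags = [child.tag for child in root]
for tag in tags:
    print(tag)

print("same name?", tags[0] == tags[1])
```

The prefixes themselves (xsi, xsd) are only local abbreviations; what matters for comparison is the URI each one is bound to.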
Describing Web Services
• This (XML, URI, namespaces + Unicode) gives us enough to describe Web Services
• There are two styles of services on the Web: “RESTful” and “RESTless”
– “RESTful” (usually no SOAP): described with Web Application Description Language (WADL) https://wadl.dev.java.net/
– “RESTless” (uses SOAP and WSDL): described with Web Services Description Language (WSDL) http://www.w3.org/TR/wsdl
– Most services you’ll come across in bioinformatics are the latter… but that might change
WSDL, WuzzDuLL, MiserabuL…
• WSDL http://www.ebi.ac.uk/Tools/webservices/wsdl/WSInterProScan.wsdl
• Describes Inputs and Outputs, tells us how to interact with service
• Registries of Web Services, like myGrid and BioMOBY use these WSDLs to build their index
• But it’s really difficult to find services based on the information in their WSDL
– Poor metadata e.g. input “string” name “in0”, “in1”, “out1” etc
– Often auto-generated by tools, not humans
– No constraints on what can be said
– Machine readable but not very human readable
RDF and OWL: Adding metadata and semantics
• Web Ontology Language (OWL) http://www.w3.org/TR/owl-features/
• Resource Description Framework (RDF) http://www.w3.org/TR/rdf-primer/ (M&S stands for model and syntax)
• RDF Schema (RDFS) http://www.w3.org/TR/rdf-schema/
RDF and RDF schema
• RDF is just triples of (subject, verb, object), where the verb is formally called the predicate. We can say things about services like
– InterProScan isA service
– InterProScan isA protein_domains_identifier
– InterProScan hasInput protein_sequence
– InterProScan hasOutput InterProScan_report
• The idea is simple …
• …but unfortunately the specifications and syntax are horrible to read and write
– But see http://notabug.com/2002/rdfprimer/ by Aaron Swartz
– RDF Schema gives us “templates” for RDF
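The triple idea needs no special syntax to try out. The sketch below stores the four InterProScan statements as plain Python tuples and answers simple pattern queries; it is not RDF syntax and uses no RDF library, and the function name match is invented for this example.

```python
# Each statement is a (subject, verb, object) triple, as on the slide.
triples = {
    ("InterProScan", "isA", "service"),
    ("InterProScan", "isA", "protein_domains_identifier"),
    ("InterProScan", "hasInput", "protein_sequence"),
    ("InterProScan", "hasOutput", "InterProScan_report"),
}


def match(subject=None, verb=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (verb is None or t[1] == verb)
        and (obj is None or t[2] == obj)
    )


# "What does InterProScan take as input?"
print(match(subject="InterProScan", verb="hasInput"))

# "Which things are services?"
print(match(verb="isA", obj="service"))
```

In real RDF each of these names would be a URI, which is what lets statements from different sources be merged safely.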
BioMOBY.org
• A registry of annotated Web Services:
• BioMOBY has three ontologies
– Namespace e.g. genbank
– Object e.g. protein_sequence (inputs and outputs)
– Services (tasks e.g. alignment)
– And an API too, which lets users add terms to ontology when they register services
– Everything in BioMOBY is annotated (unlike myGrid and myExperiment)
– Ontologies and Services are available from:
• http://biomoby.org/cgi-bin/serviceList
• http://biomoby.org/RESOURCES/MOBY-S/Namespaces
• http://biomoby.org/RESOURCES/MOBY-S/Objects
• http://biomoby.org/RESOURCES/MOBY-S/Services
myGrid Taverna
• myGrid has a registry of services
– Many aren’t annotated
– …but arbitrary services can be added, not just BioMOBY
– Lovingly curated by Franck Tanoh and Katy Wolstencroft
– Using a single ontology http://www.mygrid.org.uk/ontology
• Accessible from Taverna workflow engine
• myGrid makes a bit more use of OWL but not much
Web Ontology Language (OWL)
• RDF and RDFS provide limited capabilities for reasoning
– All men are mortal
– Socrates is a man
--------------------------------------
– Therefore Socrates is mortal
• Do this using deductive reasoners like FaCT++, Pellet, KAON2 etc
• Ulrike Sattler’s list of reasoners
– http://www.cs.man.ac.uk/~sattler/reasoners.html
• http://flickr.com/photos/dullhunk/337473755/ socrates picture
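The Socrates deduction can be imitated with a toy rule applied until nothing new appears. This is only a sketch of the idea: real reasoners such as FaCT++ or Pellet work on OWL and use far more sophisticated algorithms, and the function name infer and the verb names used here are assumptions for the example.

```python
def infer(facts):
    """Repeatedly apply: X isA C (or X subClassOf C) and C subClassOf D
    gives X isA D (or X subClassOf D), until no new facts are found."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        for subject, verb, obj in list(inferred):
            for cls, verb2, parent in list(inferred):
                if verb2 == "subClassOf" and cls == obj:
                    new_fact = (subject, verb, parent)
                    if new_fact not in inferred:
                        inferred.add(new_fact)
                        changed = True
    return inferred


facts = {
    ("man", "subClassOf", "mortal"),  # all men are mortal
    ("Socrates", "isA", "man"),       # Socrates is a man
}

conclusions = infer(facts) - facts
print(conclusions)  # the newly derived statements
```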
What can a reasoner do?
• Subsumption: check knowledge is correct, e.g. all protein_sequences are biological_sequences
• Equivalence: check knowledge is minimally redundant, e.g. SLPI and WAP4 are synonyms for “Secretory leukocyte protease inhibitor”
• Consistency: check that knowledge is meaningful and no contradictions are made, e.g. SLPI can’t be both a DNA_sequence and a protein_sequence because these are disjoint classes
• Instantiation: check if an individual is an instance of a class, e.g. is myProtein an instance of SLPI?
• If you have used Protégé, you have used a reasoner
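As an illustration of the consistency check only, the sketch below flags any individual asserted to belong to two classes that have been declared disjoint. It is a toy, not how an OWL reasoner is implemented; the names DISJOINT and inconsistencies are invented for this example.

```python
from itertools import combinations

# Pairs of classes declared disjoint (nothing may belong to both).
DISJOINT = {frozenset({"DNA_sequence", "protein_sequence"})}


def inconsistencies(assertions):
    """Return (individual, class_a, class_b) for every individual that
    is asserted to be in two disjoint classes."""
    classes_of = {}
    for individual, cls in assertions:
        classes_of.setdefault(individual, set()).add(cls)

    problems = []
    for individual, classes in sorted(classes_of.items()):
        for a, b in combinations(sorted(classes), 2):
            if frozenset({a, b}) in DISJOINT:
                problems.append((individual, a, b))
    return problems


# SLPI asserted as both DNA and protein: a contradiction.
bad = [("SLPI", "DNA_sequence"), ("SLPI", "protein_sequence")]
print(inconsistencies(bad))

# A single class membership is fine.
good = [("SLPI", "protein_sequence")]
print(inconsistencies(good))
```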
Semantic Web Services in a nutshell
 | BioMOBY | myGrid | Taverna 2 / myExperiment
API? | yes | no | Maybe?
Reasoning / semantics | No no no! | A bit | possibly?
Community participation | Yes Yes Yes! | Kind of | yes
Metadata | Lots of user generated metadata | Bit of an afterthought | User driven
www.myExperiment.org
• Getting large quantities of high-quality metadata about services is time-consuming and expensive…
• Many new web applications rely on users to provide metadata for them
– E.g. flickr, myspace, facebook, delicious etc
• People annotate services by uploading collections of services, workflows
• Can “tag” them
Conclusions
• We really need standard metadata to describe and find services
• Standards are boring but important
• You’re unlikely to win a Nobel prize for creating or using one…
– But science can’t work without them
– Especially “data-driven” rather than “hypothesis-driven” Science
• We’ve looked at semantic web standards for describing Web Services, using InterProScan as an example
– And myGrid, BioMOBY and myExperiment too
– But didn’t talk about DAS / BioDAS
– Thanks for listening
Acknowledgements and References
• Thanks to everyone I robbed stuff off:
– Carole Goble, Homer Simpson, David De Roure, Tim Bray, Mark Butler, Stian Soiland, Katy Wolstencroft, Franck Tanoh, Rod Page, Mark Wilkinson, myGrid team, myExperiment team, Ian Horrocks, Ulrike Sattler, Tim Berners-Lee, Ora Lassila, Jim Hendler, Steve Pettifer, Douglas Kell, IETF, W3C etc
• These slides are also available at http://www.slideshare.net/dullhunk/slideshows
• See Also:
– This talk is mostly about semantics rather than web services: see also “Web of Science - REST or SOAP?” at http://www.slideshare.net/dullhunk/web-of-science-rest-or-soap/
– BioMOBY http://view.ncbi.nlm.nih.gov/pubmed/12511062
– myGrid Ontology http://view.ncbi.nlm.nih.gov/pubmed/18048194
– Taverna workflow http://view.ncbi.nlm.nih.gov/pubmed/16845108
– myExperiment: social networking for workflow-using e-scientists (Goble and DeRoure) http://portal.acm.org/citation.cfm?doid=1273360.1273361
• Questions?