Matúš Kalaš, University of Bergen, Norway. BioHackathon, Kyōto, August 21, 2011


Page 1: Matúš Kalaš, University of Bergen, Norway, BioHackathon, Kyōto, August 21, 2011

EDAM ontology of bioinformatics data and methods

and

BioXSD

Matúš Kalaš
University of Bergen, Norway

BioHackathon, Kyōto
August 21, 2011

(Extended version for discussions)

Page 2:

EDAM ontology

EMBRACE Data And Methods ontology

An ontology for annotation of bioinformatics tools, resources, and data

Jon Ison, Peter Rice (PI), Hamish McWilliam, James Malone
EBI, EMBL, Hinxton

Matúš Kalaš, Inge Jonassen (PI)
CBU, Uni Bergen

Steve Pettifer
University of Manchester

Page 3:

Design principles of EDAM:

• Bioinformatics-specific
  with as few exceptions as necessary

• Well-defined scope
  operations, types of data (including identifiers), topics, formats

• Relevant and usable
  for users and annotators

• Ontologically sane
  well-defined concepts and relations, reflecting the reality

• Maintainable

Page 4:

Scope of EDAM, and example concepts:

Format: “FASTQ”, “SBML”

Data: “Sequence trace”, “Position frequency matrix”

Topic: “Phylogenetics”, “Protein classification”

Operation: “Multiple sequence alignment”, “Molecular dynamics simulation”

Page 5:

EDAM sub-ontologies and types of relations:

Page 6:

Example annotation with EDAM:

Examples: SAWSDL, EMBOSS (similarly also within data-resource annotation in DRCAT)

Page 7:

2nd example annotation with EDAM:

Example: DRCAT

Page 8:

BioXSD

An XML exchange format for basic bioinformatics data

Matúš Kalaš, Jan Christian Bryne*, Armin Töpfer**, Pål Puntervoll (PI), Inge Jonassen (PI)
CBU, BCCS, Bergen

* now Oslo University Hospital
** now Uni Bielefeld, moving to Basel

Edita Bartaševičiūtė, Kristoffer Rapacki (PI)
CBS, DTU, Greater Copenhagen

Jon Ison
EBI, EMBL, Hinxton

Alexandre Joseph, Christophe Blanchet (PI)
IBCP, CNRS, Lyon

Steve Pettifer
University of Manchester

Page 9:

Goals of BioXSD:

• Being an XSD-based XML format to complement RDF and plain-text formats

• Filling the gap between specialised XSD-based exchange formats (such as SBML, MAGE-ML, PDBML, phyloXML, PSI-MI MIF, GCDML, GLYDE-II, …)

• Compatible with XML libraries for all main programming languages

• As lightweight as possible, but fitting everyone

• Developed and maintained in an open but organised collaboration
  welcoming requests from the community

• Detailed structure
  in-depth validation, semantic annotation (EDAM), efficient compression (EXI)

Page 10:

BioXSD is an exchange format for basic bioinformatics data:

• references to data, accessions, …

• sequence and genome features (annotation)

• sequence alignments

• biomolecular sequences

Page 11:

http://EDAMontology.sourceforge.net

EDAM at NCBO BioPortal: http://bioportal.bioontology.org/ontologies/1498

http://BioXSD.org

http://drcat.sourceforge.net (a catalogue of databases annotated with EDAM)

Page 12:

Additional stuff

Page 13:

EDAM- & BioXSD-related topics for BH11:

• EDAM for new applications & for semantic data
  plus what is the best RDF representation for annotation of tools & resources?

• BioXSD to RDF, RDF to BioXSD, SPARQLing of BioXSD data
  firstly, what is the best RDF representation for sequence/alignment/feature data?

• BioXSD support in Open Bio*
  import & export of BioXSD into/from BioPython, BioRuby, BioPerl, BioJava

• Compatibility of BioXSD with other bioinformatics XSDs
  on the conceptual & design level; and on the level of data integration

Page 14:

Acknowledgement

Big thanks to the BioHackathon organisers!!!

and the sources of funding BH!

Projects contributing to EDAM & BioXSD, and their sources of funding:

Page 15:

EXAMPLE OF AN EDAM CONCEPT:

Page 16:

EXAMPLE OF AN EDAM CONCEPT:

id: EDAM:0001099
name: UniProt accession
subset: identifier
subset: data
namespace: identifier
def: "Accession number of a UniProt database entry."
regex: "[A-NR-Z][0-9][A-Z][A-Z0-9][A-Z0-9][0-9]" "[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9]"
example: P43353 Q9C199 A5A6J6
synonym: "UniProtKB accession number" EXACT []
synonym: "UniProtKB entry accession" EXACT []
synonym: "UniProt accession number" EXACT []
synonym: "Swiss-Prot accession" EXACT []
is_a: EDAM:0002091 ! Accession
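The regex field of the UniProt accession concept can be tried out directly. The following is a minimal Python sketch, not part of EDAM itself: it assumes the two quoted patterns are alternatives and that an identifier must match one of them in full. The function name `is_uniprot_accession` is hypothetical.

```python
import re

# The two patterns quoted in the "regex:" field of the concept,
# treated here as alternatives (an assumption of this sketch).
UNIPROT_ACCESSION_PATTERNS = [
    r"[A-NR-Z][0-9][A-Z][A-Z0-9][A-Z0-9][0-9]",
    r"[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9]",
]


def is_uniprot_accession(text: str) -> bool:
    """Return True if the whole string matches either pattern."""
    return any(re.fullmatch(p, text) for p in UNIPROT_ACCESSION_PATTERNS)


if __name__ == "__main__":
    # The first three candidates are the values from the "example:" field.
    for candidate in ["P43353", "Q9C199", "A5A6J6", "p43353", "P4335"]:
        print(candidate, is_uniprot_accession(candidate))
```

The three example values from the concept match, while a lower-case or truncated string does not.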

Page 17:

Use cases driving EDAM are:

• Searching for tools and resources
  and categorising them

• Tool & data integration
  automation of data handling or even workflow composition
  vocabulary for semantically rich data (incl. RDF)

• Data provenance
  how data was created and processed

Page 22:

How to annotate the tools and resources?

• Annotate a formal description of the tool

- WSDL (SOAP Web services) (SAWSDL standard)

- WADL (Web applications, URL & REST services)

• Annotate in a dedicated catalogue

- for example DrCAT (online bio databases)

http://drcat.sourceforge.net

- RDF (needs additional vocabulary/ies)

Page 23:

Who should annotate the tools and resources with EDAM?

Providers and users

preferably not catalogue curators

Page 24:

How to annotate SOAP Web services?

wsdl:definitions
  wsdl:service
    wsdl:port
  wsdl:binding
  wsdl:portType
    * wsdl:operation
      wsdl:input
      wsdl:fault
      wsdl:output
  wsdl:message
    wsdl:part
  xs:element
  xs:complexType
    xs:sequence
      * xs:element
  more types, elements, attributes, enumerations ..

SAWSDL

Using URIs of concepts

Page 28:

Advantages of XML Schema

Advantages for users of tools | Feature | Advantages for providers of tools

usability | Automatic input validation * | security

usability | Easy conversion of formats; parsing “for free” | maintainability

usability | Auto-generation of objects and GUIs (*) | maintainability

scalability | Efficient compression (with EXI) (*) | scalability

semantics | Annotation of details possible | semantics

resources | Workflow programming easier & faster; ready-made I/O building blocks: development easier & faster (*) | resources

Page 29:

BioXSD 1.1 beta1 types:

SimpleTypes:

NucleotideSequence, AminoacidSequence, GeneralNucleotideSequence, GeneralAminoacidSequence, Biosequence

Accession(s)

helper types: Name, Text, Uri, Integer(s), Decimal(s), … and a few more

ComplexTypes:

NucleotideSequenceRecord, AminoacidSequenceRecord, GeneralNucleotideSequenceRecord, GeneralAminoacidSequenceRecord, BiosequenceRecord

..SequenceAlignment(s)

AnnotatedSequence

DatabaseReference, EntryReference, OntologyReference, OntologyConcept, Species, SequenceReference, Method

helper types: Score, SequencePosition(s), … a few more

Page 30:

BioXSD : BiosequenceRecord

BioXSD : BiosequenceAlignment

Page 31:

BioXSD : AnnotatedSequence

BioXSD : AnnotatedSequence

Page 32:

BioXSD can be used:

• Directly as an input/output format of tools

• BioXSD can be extended, restricted, or included within other formats

• BioXSD can serve as the intermediate canonical format

Page 33:

>sp|P43353|AL3B1_HUMAN Aldehyde dehydrogenase family 3 member B1 OS=Homo sapiens GN=ALDH3B1 PE=1 SV=1
MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDALAQDLHKSAFESEVSEVAISQGEVTLALRNLRAWMKDERVPKNLATQLDSAFIRKEPFGLVLIIAPWNYPLNLTLVPLVGALAAGNCVVLKPSEISKNVEKILAEVLPQYVDQSCFAVVLGGPQETGQLLEHRFDYIFFTGSPRVGKIVMTAAAKHLTPVTLELGGKNPCYVDDNCDPQTVANRVAWFRYFNAGQTCVAPDYVLCSPEMQERLLPALQSTITRFYGDDPQSSPNLGRIINQKQFQRLRALLGCGRVAIGGQSDESDRYIAPTVLVDVQEMEPVMQEEIFGPILPIVNVQSLDEAIEFINRREKPLALYAFSNSSQVVKRVLTQTSSGGFCGNDGFMHMTLASLPFGGVGASGMGRYHGKFSFDTFSHHRACLLRSPGMEKLNALRYPPQSPRRLRMLLVAMEAQGCSCTLL

>AL3B1_HUMAN P43353 ALDEHYDE DEHYDROGENASE 3B1 (EC 1.2.1.5). - Homo sapiens (Human).
MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDALAQDLHKSAFESEVSEVAISQGEVTLALRNLRAWMKDERVPKNLATQLDSAFIRKEPFGLVLIIAPWNYPLNLTLVPLVGALAAGNCVVLKPSEISKNVEKILAEVLPQYVDQSCFAVVLGGPQETGQLLEHRFDYIFFTGSPRVGKIVMTAAAKHLTPVTLELGGKNPCYVDDNCDPQTVANRVAWFRYFNAGQTCVAPDYVLCSPEMQERLLPALQSTITRFYGDDPQSSPNLGRIINQKQFQRLRALLGCGRVAIGGQSDESDRYIAPTVLVDVQEMEPVMQEEIFGPILPIVNVQSLDEAIEFINRREKPLALYAFSNSSQVVKRVLTQTSSGGFCGNDGFMHMTLASLPFGGVGASGMGRYHGKFSFDTFSHHRACLLRSPGMEKLNALRYPPQSPRRLRMLLVAMEAQGCSCTLL

>gi|4502043|ref|NP_000685.1| aldehyde dehydrogenase family 3 member B1 isoform a [Homo sapiens]
MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDALAQDLHKSAFESEVSEVAISQGEVTLALRNLRAWMKDERVPKNLATQLDSAFIRKEPFGLVLIIAPWNYPLNLTLVPLVGALAAGNCVVLKPSEISKNVEKILAEVLPQYVDQSCFAVVLGGPQETGQLLEHRFDYIFFTGSPRVGKIVMTAAAKHLTPVTLELGGKNPCYVDDNCDPQTVANRVAWFRYFNAGQTCVAPDYVLCSPEMQERLLPALQSTITRFYGDDPQSSPNLGRIINQKQFQRLRALLGCGRVAIGGQSDESDRYIAPTVLVDVQEMEPVMQEEIFGPILPIVNVQSLDEAIEFINRREKPLALYAFSNSSQVVKRVLTQTSSGGFCGNDGFMHMTLASLPFGGVGASGMGRYHGKFSFDTFSHHRACLLRSPGMEKLNALRYPPQSPRRLRMLLVAMEAQGCSCTLL

>sp_ac|P43353 \ID= AL3B1_HUMAN \DE="Aldehyde dehydrogenase family 3 member B1 (Aldehyde dehydrogenase 7)" \NCBITAXID=9606
MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDALAQDLHKSAFESEVSEVAISQGEVTLALRNLRAWMKDERVPKNLATQLDSAFIRKEPFGLVLIIAPWNYPLNLTLVPLVGALAAGNCVVLKPSEI

Page 34:

Sequence record in BioXSD 1.0:

<mySequence xsi:type="AminoacidSequenceRecord">
  <sequence>MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDALAQDLHKSAFESEVSEVAISQGEVTLALRNLRAWMKDERVPKNLATQLDSAFIRKEPFGLVLIIAPWNYPLNLTLVPLVGALAAGNCVVLKPSEISKNVEKILAEVLPQYVDQSCFAVVLGGPQETGQLLEHRFDYIFFTGSPRVGKIVMTAAAKHLTPVTLELGGKNPCYVDDNCDPQTVANRVAWFRYFNAGQTCVAPDYVLCSPEMQERLLPALQSTITRFYGDDPQSSPNLGRIINQKQFQRLRALLGCGRVAIGGQSDESDRYIAPTVLVDVQEMEPVMQEEIFGPILPIVNVQSLDEAIEFINRREKPLALYAFSNSSQVVKRVLTQTSSGGFCGNDGFMHMTLASLPFGGVGASGMGRYHGKFSFDTFSHHRACLLRSPGMEKLNALRYPPQSPRRLRMLLVAMEAQGCSCTLL</sequence>
  <species>
    <databaseName>NCBI Taxonomy</databaseName>
    <accession>9606</accession>
    <entryUri>http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606</entryUri>
    <name>Human</name>
  </species>
  <customName>Aldehyde dehydrogenase family 3 member B1 (ALDH3B1)</customName>
  <formalReference>
    <databaseName>UniProt</databaseName>
    <accession xsi:type="UniprotAccession">P43353</accession>
    <entryUri>http://www.uniprot.org/uniprot/P43353</entryUri>
    <sequenceVersion>1</sequenceVersion>
    <isoformAccession xsi:type="ExtendedUniprotAccession">P43353-1</isoformAccession>
  </formalReference>
</mySequence>

Page 35:

Example data in BioXSD 1.1 beta1 format: a sequence record

<exampleSequenceRecord xsi:type="bx:NucleotideSequenceRecord">
  <bx:sequence>gtgcgagaggcccgtgccgccgtgcgcgctgcctacgaggctttctgccgctggagggaggtc</bx:sequence>
  <bx:species
    dbName="NCBI Taxonomy"
    dbUri="http://www.ncbi.nlm.nih.gov/taxonomy"
    accession="9598"
    entryUri="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9598"
    speciesName="Chimp"
  />
  <bx:reference
    dbName="GenBank/Nucleotide"
    dbUri="http://www.ncbi.nlm.nih.gov/nuccore"
    accession="NM_001008991"
    entryUri="http://www.ncbi.nlm.nih.gov/nuccore/NM_001008991"
    sequenceVersion="1"
  >
    <bx:subsequencePosition>
      <bx:segment min="282" max="345"/>
    </bx:subsequencePosition>
  </bx:reference>
  <bx:name>snippet of aldehyde dehydrogenase 5 family, member A1 (ALDH5A1)</bx:name>
  <bx:note>nuclear gene encoding mitochondrial protein, mRNA (GI:57113868)</bx:note>
</exampleSequenceRecord>
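Because the record is plain XML, a generic XML library is enough to read it. The following is a minimal Python sketch using only the standard library. The slide does not show the namespace declarations, so the URI bound to `bx` below is a placeholder (an assumption, not the real BioXSD namespace), the record is abbreviated to a few of its attributes, and the function name `summarise_record` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI: the slide omits the xmlns declarations,
# so this value is an assumption made only to get a parseable document.
BX_NS = "http://example.org/bioxsd-placeholder"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Abbreviated copy of the example record (some attributes left out).
RECORD_XML = f"""
<exampleSequenceRecord xmlns:bx="{BX_NS}" xmlns:xsi="{XSI_NS}"
    xsi:type="bx:NucleotideSequenceRecord">
  <bx:sequence>gtgcgagaggcccgtgccgccgtgcgcgctgcctacgaggctttctgccgctggagggaggtc</bx:sequence>
  <bx:species dbName="NCBI Taxonomy" accession="9598" speciesName="Chimp"/>
  <bx:reference dbName="GenBank/Nucleotide" accession="NM_001008991"
      sequenceVersion="1">
    <bx:subsequencePosition>
      <bx:segment min="282" max="345"/>
    </bx:subsequencePosition>
  </bx:reference>
</exampleSequenceRecord>
"""


def summarise_record(xml_text: str) -> dict:
    """Pull a few fields out of a BioXSD-style sequence record."""
    ns = {"bx": BX_NS}
    root = ET.fromstring(xml_text.strip())
    species = root.find("bx:species", ns)
    reference = root.find("bx:reference", ns)
    segment = reference.find("bx:subsequencePosition/bx:segment", ns)
    return {
        "sequence": root.findtext("bx:sequence", namespaces=ns),
        "species": species.get("speciesName"),
        "taxon": species.get("accession"),
        "accession": reference.get("accession"),
        "segment": (int(segment.get("min")), int(segment.get("max"))),
    }


if __name__ == "__main__":
    print(summarise_record(RECORD_XML))
```

With a schema-aware toolkit the same fields could instead be mapped to generated objects. The sketch only shows that no format-specific parser is needed.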

Page 36:

Sequence record in BioXSD 1.1 beta1:

Page 37:

Sequence-string restriction in BioXSD:

<xs:simpleType name="NucleotideSequence" sawsdl:modelReference="http://purl.org/edam/data/0001211">
  <xs:annotation>
    <xs:documentation>
      Nucleotide sequence without ambiguous ("degenerate") bases
    </xs:documentation>
  </xs:annotation>
  <xs:restriction base="GenericNucleotideSequence">
    <xs:pattern value="[acgt]+"/>
    <xs:pattern value="[acgu]+"/>
  </xs:restriction>
</xs:simpleType>
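The effect of this restriction can be illustrated outside a schema validator. In XML Schema, several pattern facets in the same restriction are alternatives, so a value is valid if it matches either one in full. The Python sketch below reads the SAWSDL annotation and the two patterns from a copy of the fragment and applies them with `re.fullmatch`. It assumes that these simple patterns mean the same in Python's regex dialect as in XSD, and it ignores the base type `GenericNucleotideSequence`. The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
SAWSDL = "http://www.w3.org/ns/sawsdl"

# Copy of the schema fragment, with namespace declarations added so that
# it parses on its own (documentation element omitted for brevity).
SCHEMA_FRAGMENT = f"""
<xs:simpleType xmlns:xs="{XS}" xmlns:sawsdl="{SAWSDL}"
    name="NucleotideSequence"
    sawsdl:modelReference="http://purl.org/edam/data/0001211">
  <xs:restriction base="GenericNucleotideSequence">
    <xs:pattern value="[acgt]+"/>
    <xs:pattern value="[acgu]+"/>
  </xs:restriction>
</xs:simpleType>
"""


def load_simple_type(xml_text: str):
    """Return the EDAM concept URI and the pattern facets of the type."""
    root = ET.fromstring(xml_text.strip())
    model_reference = root.get(f"{{{SAWSDL}}}modelReference")
    patterns = [p.get("value") for p in root.iter(f"{{{XS}}}pattern")]
    return model_reference, patterns


def is_valid(value: str, patterns) -> bool:
    """A value is valid if it fully matches at least one pattern facet."""
    return any(re.fullmatch(p, value) for p in patterns)


if __name__ == "__main__":
    concept, patterns = load_simple_type(SCHEMA_FRAGMENT)
    print("annotated with:", concept)
    for seq in ["acgt", "acgu", "acgtu", "acgn"]:
        print(seq, is_valid(seq, patterns))
```

A DNA-only or an RNA-only string passes, while a string mixing t and u, or containing an ambiguity code such as n, is rejected.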

Page 38:

Strategy

recommended by the EMBRACE network of excellence (2005 - 2010)

Page 39:

The EMBRACE project partners:

EMBL-EBI, Hinxton, UK; EMBL, Heidelberg, Germany; ITB, CNR, Bari, Italy; University of Manchester, UK; SIB, Geneva, Switzerland; SLU, Sweden; CNRS, Clermont-Ferrand and Lyon, France; CBS, DTU, Lyngby, Denmark; CSIC, Madrid, Spain; University of Stockholm, Sweden; INRIA-UCBL, Lyon, France; MPIMG, Berlin, Germany; CSC, Espoo, Finland; UCL, London, UK; The Weizmann Institute, Rehovot, Israel; University of Nijmegen, Netherlands; INTA, Madrid, Spain; CBU, BCCS, Bergen, Norway