Provenance in a Collaborative Bio-database

RAASWiki

Donald Dunbar & Jon Manning

Queen’s Medical Research Institute

University of Edinburgh

Use Cases for Provenance

April 20th 2009

Provenance in Bio-databases

including RAASWiki

Plan

bio-databases

provenance

RAASWiki

collaborative knowledgebases

Biological databases

• Sequences – Ensembl, Entrez

• Structure – PDB

• Expression – GEO, ArrayExpress

• Function – Gene Ontology

• Interaction – MINT, BIND, KEGG

• ‘Warehouses’ – GeneCards, IUPHAR

• Literature – PubMed

How do they handle provenance?

Ensembl produces genome databases for vertebrates and other eukaryotic species, and makes this information freely available online.

‘Gene’ ID histories (with stable ID)

Evidence for gene predictions

Links to other databases (e.g. UniProt)

How do they handle provenance?

The PDB archive contains information about experimentally-determined structures of proteins, nucleic acids, and complex assemblies.

Primary citation

History: deposition and last update

Raw data and protocols

How do they handle provenance?

Gene Expression Omnibus: a gene expression/molecular abundance repository supporting MIAME compliant data submissions, and a curated, online resource for gene expression data browsing, query and retrieval.

Standards compliance (protocols, data…)

Links within database (microarrays, protocols)

Raw data and protocols

http://www.ncbi.nlm.nih.gov/geo/info/MIAME.html

How do they handle provenance?

The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism.

Evidence for gene annotation (experimental, computational)

Links to original publications

No versioning, just updates

How do they handle provenance?

PubMed is a free search engine for accessing the MEDLINE database of citations, abstracts and some full text articles on life sciences and biomedical topics.

Original source material, authors, abstracts

Unique PubMed ID (used by other databases)

Continual updates (new papers), occasional retractions

http://en.wikipedia.org/wiki/MEDLINE

http://en.wikipedia.org/wiki/Bibliographic_database

How do they handle provenance?

GeneCards® is a searchable, integrated database of human genes that provides concise genomic, proteomic, transcriptomic, genetic and functional information on all known and predicted human genes.

Lots of data from other databases

IDs/keys from sources

Lots of data integration based on IDs
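
ID-based integration of this kind can be sketched as follows. This is only an illustration, not how GeneCards is built: the source names, keys, field names and values are all hypothetical. The point it shows is that each merged value keeps the name and key of the source it came from, so it can be traced back.

```python
from collections import defaultdict

# Hypothetical extracts from two source databases (names, keys and values are invented).
source_a = {"name": "SourceA",
            "rows": [{"key": "A:001", "symbol": "REN", "location": "chr1"}]}
source_b = {"name": "SourceB",
            "rows": [{"key": "B:77", "symbol": "REN", "function": "protease"}]}


def integrate(sources, join_field):
    """Merge rows that share join_field, keeping source name and source key per value."""
    merged = defaultdict(dict)
    for src in sources:
        for row in src["rows"]:
            entry = merged[row[join_field]]
            for fld, value in row.items():
                if fld in (join_field, "key"):
                    continue
                # Store the value together with where it came from.
                entry.setdefault(fld, []).append(
                    {"value": value, "source": src["name"], "source_key": row["key"]}
                )
    return dict(merged)


genes = integrate([source_a, source_b], join_field="symbol")
```

Keeping a list per field means that two sources supplying different values for the same field are both retained rather than one silently overwriting the other.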

How do they handle provenance?

The IUPHAR database (IUPHAR-DB) integrates peer-reviewed pharmacological, chemical, genetic, functional and anatomical information on GPCRs, ligand-gated ion channels and voltage-gated-like ion channel subunits encoded by the human, rat and mouse genomes.

Curated by experts

Original sources plus curation provenance

Suggested citations

Newer developments

WikiGenes is the first wiki system to combine the collaborative and largely altruistic possibilities of wikis with explicit authorship. In view of the extraordinary success of Wikipedia there remains no doubt about the potential of collaborative publishing, yet its adoption in science has been limited. Here I discuss a dynamic collaborative knowledge base for the life sciences that provides authors with due credit and that can evolve via continual revision and traditional peer review into a rigorous scientific tool.

but….

RAASWiki

RAASWiki is a knowledgebase of information on the renin-angiotensin-aldosterone system. While much of the seed data were derived from pre-existing databases such as KEGG and OMIM, supplementary data are included not easily available through such resources. This includes short textual reports on the genes involved, and more experimentally-oriented information such as animal models.

Important biology - hypertension

Automatic seeding of database (BioKB)

Collaborative editing (Wiki based, useful functionality)

Genes, publications, animal models, datasets…

RAASWiki – provenance

Seeded data tagged with source database and date

Edits are tagged with editor and date

Comments are tagged: name and date

Wiki functionality allows versioning and roll back

Identifiers for source databases preserve provenance

‘Crowd wisdom’ will hopefully ensure good quality
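
The tagging scheme listed above can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the RAASWiki or BioKB implementation: the class names, field names, editor names, dates and the placeholder source identifier are all hypothetical. Seeded revisions carry a source database and date, edits carry an editor and date, and roll back is modelled as a new revision so the history itself is never lost.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Revision:
    """One version of a record's text plus who or what produced it."""
    text: str
    when: date
    editor: Optional[str] = None      # set for wiki edits
    source_db: Optional[str] = None   # set for automatically seeded data
    source_id: Optional[str] = None   # identifier in the source database


@dataclass
class Record:
    """A wiki record that keeps its full revision history."""
    title: str
    history: List[Revision] = field(default_factory=list)

    def seed(self, text, source_db, source_id, when):
        self.history.append(
            Revision(text, when, source_db=source_db, source_id=source_id))

    def edit(self, text, editor, when):
        self.history.append(Revision(text, when, editor=editor))

    def roll_back(self, editor, when):
        """Restore the previous version by appending it as a new revision."""
        if len(self.history) < 2:
            raise ValueError("nothing to roll back to")
        previous = self.history[-2]
        self.history.append(Revision(previous.text, when, editor=editor))

    @property
    def current(self):
        return self.history[-1]


# Hypothetical usage: seed a record, edit it, then roll the edit back.
rec = Record("REN")
rec.seed("Seed text.", source_db="KEGG", source_id="placeholder-id",
         when=date(2009, 1, 5))
rec.edit("Edited text.", editor="curator_a", when=date(2009, 2, 1))
rec.roll_back(editor="curator_b", when=date(2009, 2, 2))
```

After these three steps the record holds three revisions: the current text is the seed text again, the roll back is attributed to the person who performed it, and the original source tag is still available on the first revision.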

RAASWiki – provenance issues

How much detail (each edit, granularity, versions)?

Who will use provenance data?

How much should we rely on sources for provenance?

Annotation & comments v changing data

Public v private data

Different focus depending on data (who, when, confidence)

Likely to become a big issue

What provenance do we need?

Example: Gene expression in a transgenic animal

gene annotation gene expression measurements

public databases output from machine

processing, integration

where, when

which identifiers, how

when, what, how

data mining: what and how did we select genes

……

What provenance do we need?

Example: Curated gene database

curation database links

curator input

archive

contributor, date

verify, add, delete, modify

source, identifiers, dates

Curated database: versions, dates

development: schema & interface changes

Collaborative knowledgebases

databases

experiments knowledge knowledgebase

papers

Collaborative knowledgebase – provenance issues

Confidence in data

Tracking data to its (real) source

When is something (knowledge) finished?

Citing of knowledgebase records

Linking between knowledgebase records

Published papers do not contain all information

Some sort of dynamic publication

Conclusions

• In biology provenance is a mixed bag

• We use mainly static databases

• Usually source is clear but not much else

• RAASWiki contains static and curated data

• We have implemented a very rudimentary provenance scheme

• Collaborative knowledgebases will need to address provenance in new ways
