
Post on 18-Dec-2015


Page 1: Overview of Genome Databases Peter D. Karp, Ph.D. SRI International pkarp@ai.sri.com

Overview of Genome Databases

Peter D. Karp, Ph.D.

SRI International

pkarp@ai.sri.com

www-db.stanford.edu/dbseminar/seminar.html

Talk Overview

Definition of bioinformatics

Motivations for genome databases

Issues in building genome databases

Definition of Bioinformatics

Computational techniques for management and analysis of biological data and knowledge

Methods for disseminating, archiving, interpreting, and mining scientific information

Computational theories of biology

Genome Databases is a subfield of bioinformatics

Motivations for Bioinformatics

Growth in molecular-biology knowledge (literature)

Genomics

1. Study of genomes through DNA sequencing
2. Industrial Biology

Example Genomics Datatypes

Genome sequences
  DOE Joint Genome Institute
    511M bases in Dec 2001
    11.97G bases since Mar 1999

Gene and protein expression data

Protein-protein interaction data

Protein 3-D structures

Genome Databases

Experimental data
  Archive experimental datasets
  Retrieving past experimental results should be faster than repeating the experiment
  Capture alternative analyses
  Lots of data, simpler semantics

Computational symbolic theories
  Complex theories become too large to be grasped by a single mind
  The database is the theory
  Biology is very much concerned with qualitative relationships
  Less data, more complex semantics

Bioinformatics

Distinct intellectual field at the intersection of CS and molecular biology

Distinct field because researchers in the field must know CS, biology, and bioinformatics

Spectrum from CS research to biology service

Rich source of challenging CS problems

Large, noisy, complex data-sets and knowledge-sets

Biologists and funding agencies demand working solutions

Bioinformatics Research

algorithms + data structures = programs

algorithms + databases = discoveries

Combine sophisticated algorithms with the right content:

  Properly structured
  Carefully curated
  Relevant data fields
  Proper amount of data

Reference on Major Genome Databases

Nucleic Acids Research Database Issue

  http://nar.oupjournals.org/content/vol30/issue1/
  112 databases

Questions to Ask of a New Genome Database

What are Database Goals and Requirements?

What problems will the database be used to solve?

Who are the users and what is their expertise?

What is its Organizing Principle?

Different DBs partition the space of genome information in different dimensions

Experimental methods (Genbank, PDB)

Organism (EcoCyc, Flybase)

What is its Level of Interpretation?

Laboratory data

Primary literature (Genbank)

Review (SwissProt, MetaCyc)

Does DB model disagreement?

What are its Semantics and Content?

What entities and relationships does it model?

How does its content overlap with similar DBs?

How many entities of each type are present?

Sparseness of attributes and statistics on attribute values

What are Sources of its Data?

Potential information sources
  Laboratory instruments
  Scientific literature
    Manual entry
    Natural-language text mining
  Direct submission from the scientific community
    Genbank

Modification policy
  DB staff only
  Submission of new entries by scientific community
  Update access by scientific community

What DBMS is Employed?

None

Relational

Object oriented

Frame knowledge representation system

Distribution / User Access

Multiple distribution forms enhance access

Browsing access with visualization tools

API

Portability

What Validation Approaches are Employed?

None

Declarative consistency constraints

Programmatic consistency checking

Internal vs external consistency checking

What types of systematic errors might DB contain?
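
The difference between declarative constraints and programmatic checking can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in DBMS; the table layout, the sample rows, and the duplicate-name rule are hypothetical and are not taken from any particular genome database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Declarative constraint: the DBMS itself rejects rows that violate it.
conn.execute("""
    CREATE TABLE Protein (
        WID      INTEGER PRIMARY KEY,
        Name     TEXT NOT NULL,
        Fragment TEXT CHECK (Fragment IN ('T', 'F'))
    )
""")
conn.execute("INSERT INTO Protein VALUES (1, 'ACC synthase', 'F')")

def try_insert(row):
    """Return True if the DBMS accepts the row, False if a constraint fires."""
    try:
        conn.execute("INSERT INTO Protein VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

rejected = not try_insert((2, "bad flag", "maybe"))

# Programmatic check: a query run after loading that reports suspicious rows,
# here protein names that occur more than once (a hypothetical rule).
def duplicate_names():
    rows = conn.execute(
        "SELECT Name FROM Protein GROUP BY Name HAVING COUNT(*) > 1"
    ).fetchall()
    return [name for (name,) in rows]
```

The constraint stops bad data at insertion time, while the programmatic check can express rules that span many rows and is typically run as a report after a load.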

Database Documentation

Schema and its semantics

Format

API

Data acquisition techniques

Validation techniques

Size of different classes

Coverage of subject matter

Sparseness of attributes

Error rates

Update frequency

Relationship of Database Field to Bioinformatics

Scientists generally unaware of basic DB principles
  Complex queries vs click-at-a-time access
  Data model
  Defined semantics for DB fields
  Controlled vocabularies
  Regular syntax for flatfiles
  Automated consistency checking

Most biologists take one programming class

Evolution of typical genome database

Finer points of DB research off their radar screen

Handful of DB researchers work in bioinformatics

Database Field

For many years, the majority of bioinformatics DBs did not employ a DBMS

  Flatfiles were the rule
  Scientists want to see the data directly
  Commercial DBMSs too expensive, too complex
  DBAs too expensive

Most scientists do not understand
  Differences between BA, MS, PhD in CS
  CS research vs applications
  Implications for project planning, funding, bioinformatics research

Recommendation

Teaching scientists programming is not enough

Teaching scientists how to build a DBMS is irrelevant

Teach scientists basic aspects of databases and symbolic computing
  Database requirements analysis
  Data models, schema design
  Knowledge representation, ontologies
  Formal grammars
  Complex queries
  Database interoperability

BioSPICE Bioinformatics Database Warehouse

Peter Karp, Dave Stringer-Calvert, Tom Lee, Kemal Sonmez

SRI International

http://www.BioSPICE.org/

Project Goal

Create a toolkit for constructing bioinformatics database warehouses that collect together a set of bioinformatics databases into one physical DBMS

Motivations

Important bioinformatics problems require access to multiple bioinformatics databases

Hundreds of bioinformatics databases exist
  Nucleic Acids Research 30(1) 2002 – DB issue
  Nucleic Acids Research DB list: 350 DBs at http://www3.oup.co.uk/nar/database/a/

Different problems require different sets of databases

Motivations

Combining multiple databases allows for data verification and complementation

Simulation problems require access to data on pathways, enzymes, reactions, genetic regulation

Why is the Multidatabase Approach Not Sufficient?

Multidatabase query approaches assume databases are in a DBMS

Internet bandwidth limits query throughput

Most sites that do operate DBMSs do not allow remote SQL access because of security and loading concerns

Control data stability

Need to capture, integrate and publish locally produced data of different types

Multidatabase and Warehouse approaches complementary

Scenario 1

BioSPICE scientist wants to model multiple metabolic pathways in a given organism

  Enumerate pathways and reactions
  What enzymes catalyze each reaction?
  What genes code for each enzyme?
  What control regions regulate each gene?
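
Once the relevant databases sit in one DBMS, the chain of questions in this scenario becomes a single join. The following is a minimal sketch using Python's sqlite3 module; the four tables, their columns, and all sample rows are hypothetical simplifications (real pathway, reaction, enzyme, and gene relationships are many-to-many) and do not reproduce the actual warehouse schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Pathway  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Reaction (id INTEGER PRIMARY KEY, name TEXT, pathway_id INTEGER);
    CREATE TABLE Enzyme   (id INTEGER PRIMARY KEY, name TEXT, reaction_id INTEGER);
    CREATE TABLE Gene     (id INTEGER PRIMARY KEY, name TEXT, enzyme_id INTEGER);
    INSERT INTO Pathway  VALUES (1, 'pathway-A');
    INSERT INTO Reaction VALUES (10, 'reaction-1', 1), (11, 'reaction-2', 1);
    INSERT INTO Enzyme   VALUES (100, 'enzyme-X', 10), (101, 'enzyme-Y', 11);
    INSERT INTO Gene     VALUES (1000, 'geneX', 100), (1001, 'geneY', 101);
""")

def genes_for_pathway(pathway_name):
    """Walk pathway -> reaction -> enzyme -> gene in one query."""
    return conn.execute("""
        SELECT r.name, e.name, g.name
        FROM Pathway p
        JOIN Reaction r ON r.pathway_id = p.id
        JOIN Enzyme   e ON e.reaction_id = r.id
        JOIN Gene     g ON g.enzyme_id = e.id
        WHERE p.name = ?
        ORDER BY r.id
    """, (pathway_name,)).fetchall()
```

With click-at-a-time Web access, the same answer would require visiting one page per reaction, enzyme, and gene.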

Approach

Oracle and MySQL implementations

Warehouse schema defines many bioinformatics datatypes

Create loaders for public bioinformatics DBs
  Parse file format for the DB
  Semantic transformations
  Insert database into warehouse tables

Warehouse query access mechanisms
  SQL queries via Perl, ODBC, OAA

Example: Swiss-Prot DB

Version 40.0 describes 101K proteins in a 320MB file

Each protein described as one block of records (an entry) in a large text file

Loader tool parses file one entry at a time
  Creates new entries in a set of warehouse tables
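
Reading the file one entry at a time keeps memory use flat even for a 320MB file. The sketch below relies on the Swiss-Prot flat-file convention that each entry ends with a line containing only `//`. The function name and the two sample entries (including their identifiers) are hypothetical, and this is not the loader described in the talk, which is written in Java.

```python
def iter_entries(lines):
    """Yield one Swiss-Prot entry at a time as a list of lines.

    Entries in the flat file are terminated by a line holding "//".
    """
    entry = []
    for line in lines:
        line = line.rstrip("\n")
        if line == "//":
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)
    if entry:  # tolerate a final entry with no terminator
        yield entry

# Made-up two-entry sample; a real run would pass an open file object instead.
sample = """ID   PROT_A
AC   X00001;
//
ID   PROT_B
AC   X00002;
//
""".splitlines()
entries = list(iter_entries(sample))
```

Because `iter_entries` is a generator, a loader can insert each entry into the warehouse tables before the next one is read.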

Warehouse Schema

Manages many bioinformatics datatypes simultaneously
  Pathways, Reactions, Chemicals
  Proteins, Genes, Replicons
  Citations, Organisms
  Links to external databases

Each type of warehouse object implemented through one or more relational tables (currently 43)

Warehouse Schema

Databases on our wish list:
  Genbank (nucleotide sequences)
  Protein expression database
  Protein-protein interactions database
  Gene expression database
  NCBI Taxonomy database
  Gene Ontology
  CMR

Warehouse Schema

Manages multiple datasets simultaneously
  Dataset = Single version of a database

Support alternative measurements and viewpoints

Version comparison

Multiple software tools or experiments that require access to different versions

Each dataset is a warehouse entity

Every warehouse object is registered in a dataset
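
Registering every object in a dataset makes version comparison an ordinary query. The following minimal sketch uses Python's sqlite3 module; the `DataSet` layout, the version labels, and the protein names are hypothetical toy values chosen only to show two versions of one database living side by side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DataSet (WID INTEGER PRIMARY KEY, Name TEXT, Version TEXT);
    CREATE TABLE Protein (WID INTEGER PRIMARY KEY, Name TEXT,
                          DataSetWID INTEGER REFERENCES DataSet(WID));
    INSERT INTO DataSet VALUES (1, 'Swiss-Prot', '39.0'), (2, 'Swiss-Prot', '40.0');
    INSERT INTO Protein VALUES (10, 'protein-a', 1), (11, 'protein-b', 1),
                               (20, 'protein-a', 2), (21, 'protein-c', 2);
""")

def names_in(dataset_wid):
    """Return the set of protein names registered in one dataset."""
    rows = conn.execute(
        "SELECT Name FROM Protein WHERE DataSetWID = ?", (dataset_wid,))
    return {name for (name,) in rows}

added = names_in(2) - names_in(1)     # present only in the newer version
removed = names_in(1) - names_in(2)   # present only in the older version
```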

Warehouse Schema

Different databases storing the same biological types are coerced into same warehouse tables

Design of most datatypes inspired by multiple databases

Representational tricks to decrease schema bloat
  Single space of primary keys
  Single set of satellite tables such as for synonyms, citations, comments, etc.
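
The two tricks work together: if primary keys are unique across all tables, one satellite table can serve every object type without a type column. Here is a minimal sketch in Python with sqlite3. The table names, the `OtherWID` column, the `add_object` helper, and the synonym strings are assumptions made for illustration and may differ from the actual warehouse schema.

```python
import itertools
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Protein      (WID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Gene         (WID INTEGER PRIMARY KEY, Name TEXT);
    -- One satellite table serves every object type, keyed by OtherWID.
    CREATE TABLE SynonymTable (OtherWID INTEGER, Syn TEXT);
""")

next_wid = itertools.count(1)  # one counter shared by all tables

def add_object(table, name, synonyms=()):
    """Insert an object with a globally unique WID plus its synonyms."""
    if table not in ("Protein", "Gene"):  # table names cannot be bound as parameters
        raise ValueError(table)
    wid = next(next_wid)
    conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (wid, name))
    conn.executemany("INSERT INTO SynonymTable VALUES (?, ?)",
                     [(wid, s) for s in synonyms])
    return wid

p = add_object("Protein", "ACC synthase", ["example synonym"])
g = add_object("Gene", "ACS1", ["ACCW"])
```

Without the shared key space, each object type would need its own synonym, citation, and comment tables, which is the schema bloat the slide refers to.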

Warehouse Schema

Examples
  Protein data from Swiss-Prot, TrEMBL, KEGG, and EcoCyc all loaded into same relational tables
  Pathway data from MetaCyc and KEGG are loaded into the same relational tables

Example: Swiss-Prot DB

ID 1A11_CUCMA STANDARD; PRT; 493 AA.
AC P23599;
DT 01-NOV-1991 (Rel. 20, Created)
DT 01-NOV-1991 (Rel. 20, Last sequence update)
DT 15-DEC-1998 (Rel. 37, Last annotation update)
DE 1-AMINOCYCLOPROPANE-1-CARBOXYLATE SYNTHASE CMW33 (EC 4.4.1.14) (ACC
DE SYNTHASE) (S-ADENOSYL-L-METHIONINE METHYLTHIOADENOSINE-LYASE).
GN ACS1 OR ACCW.
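
Each line of the entry above starts with a two-letter line code (ID, AC, DT, DE, GN), and a long description continues over consecutive DE lines. A minimal sketch of grouping lines by code follows. The `parse_entry` function is hypothetical, the whitespace is as it appears in this transcript rather than in the original file, and the talk's actual parser is generated from a grammar rather than written by hand like this.

```python
from collections import defaultdict

ENTRY = """\
ID 1A11_CUCMA STANDARD; PRT; 493 AA.
AC P23599;
DT 01-NOV-1991 (Rel. 20, Created)
DE 1-AMINOCYCLOPROPANE-1-CARBOXYLATE SYNTHASE CMW33 (EC 4.4.1.14) (ACC
DE SYNTHASE) (S-ADENOSYL-L-METHIONINE METHYLTHIOADENOSINE-LYASE).
GN ACS1 OR ACCW.
"""

def parse_entry(text):
    """Group an entry's lines by their leading two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, _, value = line.partition(" ")
        fields[code].append(value.strip())
    return fields

fields = parse_entry(ENTRY)
accession = fields["AC"][0].rstrip(";")
description = " ".join(fields["DE"])          # rejoin the wrapped DE lines
gene_names = fields["GN"][0].rstrip(".").split(" OR ")
```

The accession, description, and gene names extracted this way correspond to the kinds of values a loader would write into the Protein table and its satellite tables.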

How Swiss-Prot is Loaded into The Warehouse

Register Swiss-Prot in Datasets table

Create entry in Entry and Protein tables for each Swiss-Prot protein

Satellite tables store
  Protein synonyms, citations, comments, accession numbers, organism, sequence features, subunits/complexes, DB links

Protein Table

CREATE TABLE Protein (
  WID                 NUMBER,         -- The warehouse ID of this protein
  Name                VARCHAR2(500),  -- Common name of the protein
  AASequence          VARCHAR2(4000), -- Amino-acid sequence for this protein
  Charge              NUMBER,         -- Charge of the chemical
  Fragment            CHAR(1),        -- Is this protein a fragment or not, T or F
  MolecularWeightCalc NUMBER,         -- Molecular weight calculated from sequence. Units: Daltons.
  MolecularWeightExp  NUMBER,         -- Molecular weight determined through experimentation. Units: Daltons.
  PICalc              VARCHAR2(50),   -- pI calculated from its sequence.
  PIExp               VARCHAR2(50),   -- pI value determined through experimentation.
  DataSetWID          NUMBER          -- Reference to the data set from which the entity came
);
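
To experiment with this table without an Oracle installation, a subset of its columns can be created in SQLite, which accepts the Oracle-style type names. The sketch below is only for trying out queries; the rows, the sequences, and the dataset ID 7 are toy values, and SQLite's type handling differs from Oracle's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A subset of the slide's columns, with the type names left unchanged.
conn.execute("""
    CREATE TABLE Protein (
        WID        NUMBER,
        Name       VARCHAR2(500),
        AASequence VARCHAR2(4000),
        Fragment   CHAR(1),
        DataSetWID NUMBER
    )
""")
rows = [
    (1, "protein-a", "MKT", "F", 7),  # toy values, not real sequences
    (2, "protein-b", "MK", "T", 7),
]
conn.executemany("INSERT INTO Protein VALUES (?, ?, ?, ?, ?)", rows)

# Non-fragment proteins in dataset 7, with their sequence lengths.
complete = conn.execute(
    "SELECT Name, LENGTH(AASequence) FROM Protein "
    "WHERE Fragment = 'F' AND DataSetWID = ?", (7,)
).fetchall()
```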

Database Loaders

Loader tool defined for each DB to be loaded into Warehouse

Example loaders available in several languages

Loaders
  KEGG (C)
  BioCyc collection of 15 pathway DBs (C)
  Swiss-Prot (Java)
  ENZYME (Java)

Terminology

Model Organism Database (MOD) – DB describing genome and other information about an organism

Pathway/Genome Database (PGDB) – MOD that combines information about
  Pathways, reactions, substrates
  Enzymes, transporters
  Genes, replicons
  Transcription factors, promoters, operons, DNA binding sites

BioCyc – Collection of 15 PGDBs at BioCyc.org
  EcoCyc, AgroCyc, YeastCyc

Loader Architecture

[Diagram boxes: Grammar for Swiss-Prot; ANTLR Parser Generator; Parser for Swiss-Prot; Swiss-Prot Datafile; SQL Insert Commands; Oracle Loadable File]

Current Warehouse Contents

                       KEGG   ENZYME   SwissProt   BsubCyc   Warehouse Total
Chemicals             7,284    2,952           0       576            10,812
Genes                 5,714        0      88,605     4,221            98,540
Organisms                60        0     103,807         1           103,868
Proteins              3,829    3,870     101,602     4,150           113,451
Enzymatic Reactions   3,509        0           0       717             4,226
Pathways              4,517        0           0       138             4,655
Pathway Reactions    36,271        0           0       530            36,801

Example Warehouse Uses

Check completeness of data sources

Count reactions in ENZYME database with (and without) associated protein sequences in SWISS-PROT database:
  3870 reactions in ENZYME
  1662 reactions (43%) with a sequence in SWISS-PROT
  2208 reactions (57%) without a sequence in SWISS-PROT

Count number of distinct non-partial EC numbers in SWISS-PROT:
  1554 distinct EC numbers in SWISS-PROT (non-partial)
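
The completeness check amounts to a set comparison between the EC numbers listed in ENZYME and those cited by Swiss-Prot entries. A minimal sketch follows. The `coverage` function and the toy EC-number sets are hypothetical illustrations, while the percentage arithmetic at the end uses the counts reported on the slide.

```python
def coverage(enzyme_ec_numbers, swissprot_ec_numbers):
    """Split ENZYME's EC numbers by whether any Swiss-Prot entry cites them."""
    with_seq = enzyme_ec_numbers & swissprot_ec_numbers
    without_seq = enzyme_ec_numbers - swissprot_ec_numbers
    return with_seq, without_seq

# Toy data; which EC numbers have sequences here is illustrative only.
enzyme = {"1.1.1.1", "2.7.1.1", "4.4.1.14", "9.9.9.9"}
swissprot = {"1.1.1.1", "4.4.1.14"}
with_seq, without_seq = coverage(enzyme, swissprot)

# The slide's percentages follow from its counts.
pct_with = round(100 * 1662 / 3870)
pct_without = round(100 * 2208 / 3870)
```

In the warehouse itself the same comparison would be a SQL query across the ENZYME and Swiss-Prot datasets, which is possible only because both are loaded into one DBMS.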