Post on 21-Dec-2015

Pathway/Genome Databases and Software Tools

Peter D. Karp, Ph.D.

Bioinformatics Research Group

SRI International

[email protected]

http://ecocyc.DoubleTwist.com/ecocyc/

SRI International Bioinformatics

Overview

Overview of bioinformatics

Motivations for the EcoCyc project

EcoCyc demo

Description of EcoCyc database and Pathway Tools software

Underlying technologies
  Ocelot object database
  GKB Editor
  X-windows to WWW translator

Definition of Bioinformatics

Computational techniques for management and analysis of biological data and knowledge

Methods for disseminating, archiving, interpreting, and mining scientific information

Motivations for Bioinformatics

Growth in molecular-biology knowledge

Industrialization of biological experimentation

High-throughput biology
  Genome sequences
  Gene and protein expression data
  Protein-protein interaction data
  Protein 3-D structures
  …

Motivations for EcoCyc -- E. coli Encyclopedia

Integrate E. coli information dispersed in the literature

New paradigm of scientific publishing

Model the full metabolic network of an organism

Integrate genomic data with functional data

Develop algorithms for computing with function

Provide a challenging domain for computer-science research

Definitions

A chemical reaction interconverts chemical compounds

An enzyme is a protein that accelerates chemical reactions

A pathway is a linked set of reactions

A conceptual unit of the cell’s biochemical machinery

Reaction: A + B = C + D

Pathway: A -> C -> E
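
The three definitions above can be sketched as a small data structure. This is an illustrative Python sketch only (Pathway Tools itself is written in Common Lisp); the class names, field names, and the `is_linked` check are assumptions made for this example, not the EcoCyc schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reaction:
    """A chemical reaction that interconverts compounds, e.g. A + B = C + D."""
    name: str
    substrates: tuple
    products: tuple
    enzymes: tuple = ()  # names of proteins that accelerate this reaction


@dataclass
class Pathway:
    """A linked set of reactions (hypothetical representation)."""
    name: str
    reactions: list = field(default_factory=list)

    def is_linked(self) -> bool:
        """True if every reaction consumes at least one product of the previous one."""
        return all(
            set(prev.products) & set(nxt.substrates)
            for prev, nxt in zip(self.reactions, self.reactions[1:])
        )


# Placeholder compounds A..E, mirroring the slide's abstract example.
r1 = Reaction("r1", substrates=("A", "B"), products=("C", "D"), enzymes=("enzyme-1",))
r2 = Reaction("r2", substrates=("C",), products=("E",))
pathway = Pathway("example-pathway", [r1, r2])
```

Here `pathway.is_linked()` holds because `r2` consumes compound C, which `r1` produces.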

Organism-Specific Pathway/Genome Databases

Layer functional information above the genome

Rich ontology to encode biological information with high fidelity

Chromosomes, genes, operons, gene products, reactions, pathways

Curated by experts for that organism

Integrate literature and computational predictions

Pathway Tools Software

Pathway/Genome Navigator
  WWW publishing of PGDBs
  Graphic depictions of pathways, chromosomes, operons
  Pathway visualization of gene-expression data

Pathway/Genome Editors
  Distributed curation of genome annotations
  Distributed object database system
  Interactive editing tools

PathoLogic
  Prediction of metabolic network from genome

EcoCyc = E. coli Dataset + Pathway/Genome Navigator

Genes: 4,393

Gene Products: 4,393

Reactions: 1,117

Pathways: 158

Metabolic Network

Compounds: 1,887

http://ecocyc.DoubleTwist.com/ecocyc/

Operons: 375

EcoCyc

Collaborative development via internet
  Karp -- Bioinformatics architect
  Riley -- Metabolic pathways, signal transduction
  Saier and Paulsen -- Transport
  Collado -- Regulation of gene expression

Ontology of 1,000 biological classes

14,000 instances

Over 2,600 registered users

Pathway Tools Software

[Diagram components: Pathway/Genome Databases; Pathway/Genome Navigator; PathoLogic Pathway Predictor; Pathway/Genome Editors]

Creation of the Overview Graph

Run layout algorithms on individual pathway graphs
  Automatically determine topology of pathway graph
  Apply associated layout algorithm (linear, circular, tidy tree)

Use superpathways to create hierarchical layouts
  Treat each individual pathway as a single node
  Pathway connections are edges
  Run appropriate layout algorithm

Manually position the resulting pathway clusters
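
The first step above (determine a pathway graph's topology, then pick the matching layout) can be sketched as follows. The slides do not describe how topology is detected, so the degree-and-reachability test below is an assumption, and `classify_topology` and `LAYOUTS` are hypothetical names. The example is in Python for illustration; the actual software is implemented in Common Lisp.

```python
from collections import defaultdict


def classify_topology(edges):
    """Classify a directed pathway graph as 'linear', 'circular', 'tree', or 'other'."""
    succ = defaultdict(list)
    in_deg = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        succ[src].append(dst)
        in_deg[dst] += 1
        nodes.update((src, dst))
    if not nodes:
        return "other"

    def reaches_all(start):
        # Depth-first walk along successor edges; True if every node is visited.
        seen, stack = {start}, [start]
        while stack:
            for nxt in succ[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen == nodes

    roots = [n for n in nodes if in_deg[n] == 0]
    max_in = max(in_deg[n] for n in nodes)
    max_out = max(len(succ[n]) for n in nodes)

    # One root, no merging edges, everything reachable: a chain or a branching tree.
    if max_in <= 1 and len(roots) == 1 and reaches_all(roots[0]):
        return "linear" if max_out <= 1 else "tree"
    # No root, every node has exactly one predecessor and successor: a single cycle.
    if not roots and max_in == 1 and max_out == 1 and reaches_all(next(iter(nodes))):
        return "circular"
    return "other"


# Hypothetical mapping from detected topology to the layout families named on the slide.
LAYOUTS = {"linear": "linear layout", "circular": "circular layout", "tree": "tidy-tree layout"}
```

The same classifier could in principle be reused at the superpathway level, where each individual pathway is collapsed to a single node and pathway connections become the edges.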

Inference of Metabolic Pathways

[Diagram: PathoLogic takes an annotated genome (list of genes/ORFs, list of gene products, DNA sequence; structured ASCII text file) together with MetaCyc, and produces a Pathway/Genome Database (genomic map, genes, gene products, reactions, pathways, compounds, metabolic network) plus reports]

Summary of H. pylori Analysis

For 121 E. coli pathways, what is the evidence that each pathway occurs in H. pylori?

  Strong evidence: 41
  Medium evidence: 29
  Little or no evidence: 51

31 reactions catalyzed by H. pylori but not by E. coli

H. pylori has partial abilities to synthesize cofactors and amino acids, extremely limited carbohydrate catabolism, some amino acid utilization, and a reductive citric-acid pathway
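
The slides do not say how the strong / medium / little-or-no evidence levels were assigned. One plausible sketch, under the assumption that evidence is the fraction of a pathway's reactions for which the target genome has a predicted enzyme, is shown below. The function name and both thresholds are hypothetical and are not PathoLogic's actual criteria.

```python
def pathway_evidence(pathway_reactions, catalyzed_reactions, strong=0.75, medium=0.4):
    """Grade the evidence for a pathway from the fraction of its reactions that are catalyzed.

    `strong` and `medium` are illustrative cut-offs, not values from the source.
    """
    reactions = set(pathway_reactions)
    if not reactions:
        raise ValueError("pathway has no reactions")
    fraction = len(reactions & set(catalyzed_reactions)) / len(reactions)
    if fraction >= strong:
        return "strong"
    if fraction >= medium:
        return "medium"
    return "little or none"


# Placeholder reaction identifiers for a four-step reference pathway.
reference_pathway = ["rxn-1", "rxn-2", "rxn-3", "rxn-4"]
```

Applied to every reference pathway in turn, a grading function of this kind would yield a tally like the 41 / 29 / 51 split reported above for the 121 E. coli pathways.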

Microbial Pathway/Genome DBs

Literature-based Datasets:

MetaCyc

Escherichia coli

PathoLogic-based Datasets:

Bacillus subtilis

Mycobacterium tuberculosis

Helicobacter pylori

Haemophilus influenzae

Mycoplasma pneumoniae

Treponema pallidum

Chlamydia trachomatis

Saccharomyces cerevisiae

Pathway Tools Software Architecture

Implemented in Common Lisp

WWW server runs as a single Unix process with a separate thread to service each query

Grasper-CL graph manager

Ocelot object database

GKB Editor schema-driven editor

EcoCyc WWW Server

Pathway Tools Architecture -- Development Configuration

[Diagram components: Ocelot DBMS; GFP API; Pathway/Genome Navigator; WWW Server; X-Windows Graphics; Object Editor; Pathway Editor; Reaction Editor; Oracle]

Ocelot Database System

Object Database Manager

Persistence via filesystem or relational DBMS

Demand and background faulting of objects from RDBMS

Two-level object caching

Extensive bioinformatics schema

Stored transaction history
  Inspect object history

Ocelot Knowledge Server Architecture

Frame data model

Persistent storage via
  Disk files
  Oracle DBMS

Optimistic concurrency-control protocol

Schema evolution

Logging facility
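
The slide names an optimistic concurrency-control protocol without describing it. As a generic illustration of the idea (not Ocelot's actual protocol), the Python sketch below lets transactions proceed without locks and validates at commit time that no frame they touched has changed in the meantime. All class and method names are hypothetical.

```python
class ConflictError(Exception):
    """Raised when a frame changed after the transaction first touched it."""


class KB:
    def __init__(self):
        self.frames = {}    # frame name -> current value
        self.versions = {}  # frame name -> commit counter

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, kb):
        self.kb = kb
        self.seen = {}    # frame name -> version observed on first access
        self.writes = {}  # frame name -> pending new value

    def _observe(self, frame):
        self.seen.setdefault(frame, self.kb.versions.get(frame, 0))

    def read(self, frame):
        self._observe(frame)
        return self.writes.get(frame, self.kb.frames.get(frame))

    def write(self, frame, value):
        self._observe(frame)
        self.writes[frame] = value

    def commit(self):
        # Validation phase: abort if anyone committed a touched frame since we saw it.
        for frame, version in self.seen.items():
            if self.kb.versions.get(frame, 0) != version:
                raise ConflictError(frame)
        # Write phase: apply pending updates and bump version counters.
        for frame, value in self.writes.items():
            self.kb.frames[frame] = value
            self.kb.versions[frame] = self.kb.versions.get(frame, 0) + 1
```

The design trade-off is that editors never block one another while working; a conflict is only detected, and must be resolved, when the later of two overlapping edits is committed.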

The Frame Data Model

Frames are of two types: classes, instances

Frames have slots that define their properties, attributes, relationships

A slot has one or more values

Each value can be any Lisp datatype

Slotunits define metadata about slots
  Domain, range, inverse
  Collection type, number of values, value constraints
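
A minimal sketch of the frame model described above, written in Python for illustration (Ocelot itself is a Common Lisp system, where slot values can be any Lisp datatype). The `Frame` and `SlotUnit` classes, their fields, and `put_slot_values` are assumptions made for this example rather than Ocelot's API; only the `map-position` slot name comes from the slides.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SlotUnit:
    """Metadata about a slot: which frames it applies to and what values it accepts."""
    name: str
    domain: str                        # class whose instances may carry this slot
    value_type: type                   # the slot's range
    max_values: Optional[int] = None   # cardinality limit; None means unlimited


@dataclass
class Frame:
    name: str
    kind: str                          # "class" or "instance"
    slots: dict = field(default_factory=dict)


def put_slot_values(frame, slotunit, values):
    """Store values in a slot after checking them against the slotunit's constraints."""
    values = list(values)
    if slotunit.max_values is not None and len(values) > slotunit.max_values:
        raise ValueError(f"{slotunit.name} allows at most {slotunit.max_values} value(s)")
    if not all(isinstance(v, slotunit.value_type) for v in values):
        raise TypeError(f"{slotunit.name} values must be {slotunit.value_type.__name__}")
    frame.slots[slotunit.name] = values


# Hypothetical frames and values, for illustration only.
genes = Frame("Genes", kind="class")
gene = Frame("gene-1", kind="instance")
map_position = SlotUnit("map-position", domain="Genes", value_type=float, max_values=1)
put_slot_values(gene, map_position, [12.5])
```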

Inference Capabilities

Inheritance of defaults

Slot values computed via attached procedures

Maintenance of inverse relationships

Constraint system
  Deferred evaluation
  Tolerant of nonconformant data
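
Two of the capabilities listed above, inheritance of defaults and maintenance of inverse relationships, can be illustrated with a short Python sketch. The single-parent lookup, the `INVERSES` table, and the slot names `catalyzes`, `catalyzed-by`, and `cellular-location` are hypothetical simplifications, not the EcoCyc schema.

```python
class Frame:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent  # class frame from which defaults are inherited
        self.slots = {}

    def get(self, slot):
        """Return local values if present, otherwise inherit from the nearest ancestor."""
        frame = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.parent
        return []


# Hypothetical inverse declarations (in the frame model these live on slotunits).
INVERSES = {"catalyzes": "catalyzed-by", "catalyzed-by": "catalyzes"}


def add_value(frame, slot, value):
    """Add a value and keep the inverse slot on the referenced frame in sync."""
    frame.slots.setdefault(slot, []).append(value)
    inverse = INVERSES.get(slot)
    if inverse is not None and isinstance(value, Frame):
        back = value.slots.setdefault(inverse, [])
        if frame not in back:
            back.append(frame)


proteins = Frame("Proteins")
proteins.slots["cellular-location"] = ["cytoplasm"]  # class-level default
enzyme = Frame("enzyme-1", parent=proteins)
reaction = Frame("reaction-1")
add_value(enzyme, "catalyzes", reaction)
```

After the last call, `reaction.get("catalyzed-by")` contains `enzyme` even though only the forward link was asserted, and `enzyme.get("cellular-location")` falls back to the class default.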

Storage System Architecture

Oracle KBs

DBMS is submerged within FRS

Relational schema is domain independent, supports multiple KBs simultaneously

Frames transferred from DBMS to Ocelot
  On demand
  By background prefetcher

Memory cache

Persistent disk cache to speed performance via Internet

Frame Faulting

(get-slot-value gene 'map-position)

Gene present in in-memory object cache?

Gene present in cache on local disk?

Query Oracle DBMS
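
The lookup order above (memory cache, then local disk cache, then the DBMS) can be sketched as a small tiered store. In this Python illustration plain dictionaries stand in for the disk cache and for Oracle; the `FaultingStore` class and its counter are hypothetical, and only `get_slot_value` and `map-position` echo names from the slide.

```python
class FaultingStore:
    """Fetch frames on demand from the fastest tier that holds them."""

    def __init__(self, dbms, disk_cache=None):
        self.memory = {}                                          # in-memory object cache
        self.disk = disk_cache if disk_cache is not None else {}  # stand-in for local disk cache
        self.dbms = dbms                                          # stand-in for the Oracle DBMS
        self.dbms_queries = 0

    def fetch(self, frame_id):
        if frame_id in self.memory:        # 1. in-memory object cache
            return self.memory[frame_id]
        if frame_id in self.disk:          # 2. cache on local disk
            frame = self.disk[frame_id]
        else:                              # 3. fault the frame in from the DBMS
            self.dbms_queries += 1
            frame = self.dbms[frame_id]
            self.disk[frame_id] = frame
        self.memory[frame_id] = frame
        return frame

    def get_slot_value(self, frame_id, slot):
        return self.fetch(frame_id)[slot]


# Hypothetical frame contents.
store = FaultingStore(dbms={"gene-1": {"map-position": 12.5}})
```

Only the first access to a frame reaches the DBMS; later accesses are served from memory, or from the disk cache after a restart, which is what makes remote access over the Internet tolerable.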

Logging

Oracle DBMS stores:
  The latest version of each frame
  A history of all OKBC operations applied to KB

Reconstruct earlier versions of KB

View history of changes to an object

Update replicates

Concurrency control
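
Storing the full operation history is what makes the uses listed above possible: an earlier version of the KB can be rebuilt by replaying a prefix of the log, and one object's history is a filter over it. The Python sketch below assumes a simplified log of `(operation, frame, slot, value)` tuples; that tuple format, the function names, and the sample entries are assumptions for illustration, not Ocelot's actual log layout.

```python
def replay(log, upto=None):
    """Rebuild KB state by applying the first `upto` logged operations (all if None)."""
    kb = {}
    for op, frame, slot, value in log[:upto]:
        if op == "put-slot-value":
            kb.setdefault(frame, {})[slot] = value
        elif op == "delete-frame":
            kb.pop(frame, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return kb


def object_history(log, frame):
    """Return every logged operation that touched one frame, in order."""
    return [entry for entry in log if entry[1] == frame]


# Hypothetical log entries.
LOG = [
    ("put-slot-value", "gene-1", "map-position", 10.0),
    ("put-slot-value", "gene-2", "map-position", 55.0),
    ("put-slot-value", "gene-1", "map-position", 12.5),
    ("delete-frame", "gene-2", None, None),
]
```

`replay(LOG, 2)` recovers the KB as it stood after the second operation, while `replay(LOG)` gives the latest version.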

Schema Management

FRSs store and process class and instance information similarly

Applications can query schema information as easily as they can query instances

GKB Editor

Browser and editor for KBs and ontologies

Four editing tools

GKB Editor reusable with multiple FRSs
  All database queries via OKBC/GFP API
  Interoperability achieved with Ocelot, LOOM, Ontolingua

All operations are schema driven

http://www.ai.sri.com/~gkb/overview.html

Editors

Taxonomy editor

Frame editor

Relationships editor

Spreadsheet editor

Results

Ocelot in use in the EcoCyc project for 5 years

Supports collaborative development of EcoCyc by four groups in North America

  Distributed architecture
  GKB Editor in active use

Supports development of 8 Pathway/Genome Databases

Summary

Pathway/Genome Databases

Pathway Tools software
  Extract pathways from genomes
  Distributed curation tools
  Query, visualization, WWW publishing
  Analysis algorithms

Computer Science Results

Extend scalability and multiuser access for knowledge representation systems

Reusable, schema-driven KB editor

Hierarchical graph layout algorithms

Dynamic translation from X-windows to HTML+GIF

Importance of ontologies and of content: Discovery = Algorithm + Database

Problem Solving Depends on Algorithms and Content

[Diagram labels: Database Size and Quality; Algorithm Quality; Solution Quality; Compute Time]

Bioinformatics Results: Content

The EcoCyc database describes the full metabolic map of an organism

The MetaCyc database describes over 300 metabolic pathways

Ontology spans genome to pathway information

Bioinformatics Results: Algorithms

Software environment for genome and pathway information

  Query and visualization
  Distributed database development

PathoLogic algorithm predicts the metabolic network of an organism from its genome

Algorithms under development for qualitative modeling of the cell

Acknowledgements

Funding sources: NIH National Center for Research Resources

Collaborators:
  Monica Riley, Marine Biological Laboratory
  Milton Saier, UC San Diego
  Julio Collado, UNAM
  Christos Ouzounis, European Bioinformatics Institute

Peter D. Karp, Ph.D.

http://www.ai.sri.com/pkarp/

[email protected]