biological databases

Biological Databases By : Lim Yun Ping E mail : [email protected] National University of Singapore


Post on 14-Jan-2016

Page 1: Biological Databases

Biological Databases

By: Lim Yun Ping

E-mail: [email protected]

National University of Singapore

Page 2: Biological Databases

Overview

• Introduction

• What is a database

• What type of databases can we access

• What roles do they play

• What type of information can we get from them

• How do we access this information

Page 3: Biological Databases

What is a database ?

• A convenient way to manage vast amounts of information

• Allows for proper storing, searching & retrieving of data.

• Before data can be analyzed, they must be assembled into central, shareable resources

Page 4: Biological Databases

Why databases ?

• Means to handle and share large volumes of biological data

• Support large-scale analysis efforts

• Make data easy to access and keep them up to date

• Link knowledge obtained from various fields of biology and medicine

Page 5: Biological Databases

Different Database Types

• Database types depend on the nature of the information stored (sequences, 2D gel images or 3D structure images)

• They also depend on the manner of storage (flat files, tables in a relational database, etc.)

• In this course we are concerned more with the different types of databases than with the particular storage method
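
The difference between flat-file and relational storage can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two records, the tab-separated layout, and the table name `sequences` with its columns are hypothetical and do not come from any real database.

```python
import sqlite3

# Hypothetical records: (accession, description, organism).
RECORDS = [
    ("X00001", "Example kinase mRNA", "Homo sapiens"),
    ("X00002", "Example kinase gene", "Mus musculus"),
]

# Flat-file style: one tab-separated text line per record.
flat_file = "\n".join("\t".join(record) for record in RECORDS)


def search_flat(text, organism):
    """Scan every line and keep accessions whose organism field matches."""
    hits = []
    for line in text.splitlines():
        accession, _description, line_organism = line.split("\t")
        if line_organism == organism:
            hits.append(accession)
    return hits


def search_table(records, organism):
    """Load the same records into a relational table and query it with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequences ("
        "accession TEXT PRIMARY KEY, description TEXT, organism TEXT)"
    )
    conn.executemany("INSERT INTO sequences VALUES (?, ?, ?)", records)
    rows = conn.execute(
        "SELECT accession FROM sequences WHERE organism = ?", (organism,)
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


flat_hits = search_flat(flat_file, "Homo sapiens")
table_hits = search_table(RECORDS, "Homo sapiens")
print(flat_hits, table_hits)
```

Both searches return the same accession; only the storage and the way the question is asked differ, which is why the course can focus on database types rather than storage.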

Page 6: Biological Databases

Features

• Most of the databases have a web-interface to search for data

• The most common way to search is by keyword

• Users can choose to view the data or save them to their own computer

• Cross-references help to navigate from one database to another easily

Page 7: Biological Databases

Biological Databases

Type of database                                  Information it contains
Bibliographic databases                           Literature
Taxonomic databases                               Classification
Nucleic acid databases                            DNA information
Genomic databases                                 Gene level information
Protein databases                                 Protein information
Protein families, domains and functional sites    Classification of proteins and identifying domains
Enzymes / metabolic pathways                      Metabolic pathways

Page 8: Biological Databases

Types Of Biological Databases Accessible

There are many different types of database, but for routine sequence analysis the following are initially the most important:

• Primary databases

• Secondary databases

• Composite databases

Page 9: Biological Databases

Primary databases

• Contain sequence data, such as nucleic acid or protein sequences

• Examples of primary databases include:

Protein Databases
• SWISS-PROT
• TrEMBL
• PIR

Nucleic Acid Databases
• EMBL
• GenBank
• DDBJ

Page 10: Biological Databases

Secondary databases

• Sometimes known as pattern databases

• Contain results from the analysis of the sequences in the primary databases

• Examples of secondary databases include: PROSITE, Pfam, BLOCKS, PRINTS
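
To make the idea of a pattern database concrete, here is a minimal sketch that converts a PROSITE-style pattern into a regular expression and scans a sequence with it. The helper name `prosite_to_regex` and the test sequence are hypothetical; the pattern `N-{P}-[ST]-{P}` is the commonly cited N-glycosylation motif, and only the basic syntax elements (`x`, `[...]`, `{...}`, repeat counts, and the `<` and `>` anchors) are handled.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":  # any residue
            core = "."
        elif core.startswith("{"):  # residues that are NOT allowed
            core = "[^" + core[1:-1] + "]"
        if repeat:
            core += "{" + repeat + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


N_GLYCOSYLATION = "N-{P}-[ST]-{P}"
regex = prosite_to_regex(N_GLYCOSYLATION)

# Hypothetical protein fragment, used only to exercise the pattern.
sequence = "MKNGSAPNPTL"
hits = [m.start() for m in re.finditer(regex, sequence)]
print(regex, hits)
```

The site at position 2 (zero-based) matches, while the `N` at position 7 is rejected because it is followed by a proline.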

Page 11: Biological Databases

Composite databases

• Combine different sources of primary databases.

• Make querying and searching efficient, without the need to go to each of the primary databases.

• Examples of composite databases include: NRDB (Non-Redundant DataBase), OWL
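
A minimal sketch of how a composite, non-redundant collection might be assembled from several primary sources. It assumes that "redundant" means an exactly identical sequence, which is a simplification; real composite databases may apply other criteria. The function name, identifiers and sequences are hypothetical.

```python
def merge_non_redundant(*sources):
    """Merge (identifier, sequence) pairs, keeping one entry per distinct sequence.

    Returns a dict mapping each distinct sequence to the list of
    identifiers under which it was found.
    """
    merged = {}
    for source in sources:
        for identifier, sequence in source:
            key = sequence.upper()  # treat upper and lower case as the same sequence
            merged.setdefault(key, []).append(identifier)
    return merged


# Hypothetical entries from two primary databases.
source_a = [("A001", "MKTAYIAK"), ("A002", "MSLLTEVE")]
source_b = [("B107", "mktayiak"), ("B108", "MGDVEKGK")]

composite = merge_non_redundant(source_a, source_b)
for sequence, identifiers in composite.items():
    print(sequence, identifiers)
```

Four source entries collapse to three distinct sequences, and the cross-references to both source identifiers are kept, so a single search covers both sources.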

Page 12: Biological Databases

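Nucleic acid Databases

DDBJ : http://www.ddbj.nig.ac.jp
DNA Data Bank of Japan

NCBI : http://www.ncbi.nlm.nih.gov/
NCBI, at the NIH campus, USA

EMBL : http://www.embl-heidelberg.de/
European Molecular Biology Laboratory, headquartered in Heidelberg, Germany (its nucleotide sequence database is maintained at the EBI outstation in Hinxton, UK)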

Page 13: Biological Databases

The International Sequence Database Collaboration

GenBank

DDBJ

EMBL

Page 14: Biological Databases

The International Sequence Database Collaboration

• These three databases have collaborated since 1982. Each database collects and processes new sequence data and relevant biological information from scientists in its region; for example, EMBL collects from Europe and GenBank from the USA.

• These databases automatically update each other with the new sequences collected from each region, every 24 hours. The result is that they contain exactly the same information, except for any sequences that have been added in the last 24 hours.

• This is an important consideration in your choice of database. If you need accurate and up to date information, you must search an up to date database.

Page 15: Biological Databases

Amount Of Data Grows Rapidly

As of June 2003, there were 32,528,249,295 bases in 25,592,865 sequences.
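
As a quick sanity check on these figures, the average sequence length can be worked out directly. The variable names are illustrative; the two totals are the ones quoted on the slide.

```python
# Totals quoted on the slide (June 2003).
TOTAL_BASES = 32_528_249_295
TOTAL_SEQUENCES = 25_592_865

average_length = TOTAL_BASES / TOTAL_SEQUENCES
print(f"{average_length:.0f} bases per sequence on average")
```

The result is roughly 1,271 bases per entry, which suggests that most records at the time were gene-sized or shorter rather than whole genomes.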

Page 16: Biological Databases

How to access them

Main Sites

NCBI : http://www.ncbi.nlm.nih.gov/

EMBL : http://www.embl-heidelberg.de/

DDBJ : http://www.ddbj.nig.ac.jp

• Full release every two months

• Incremental and cumulative updates daily

• Available only through the Internet: ftp://ftp.ncbi.nih.gov/genbank/

• 66.3 gigabytes of data

Page 17: Biological Databases

The Internet and WWW

Page 18: Biological Databases

KEGG : http://www.genome.ad.jp/kegg/kegg2.html
Kyoto Encyclopedia of Genes and Genomes

NCBI : http://www.ncbi.nlm.nih.gov/
NCBI, a division of NLM at the NIH campus, USA

EXPASY : http://www.expasy.org
Swiss Institute of Bioinformatics

Page 19: Biological Databases

National Center for Biotechnology Information

Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information, all for the better understanding of molecular processes affecting human health and disease.

http://www.ncbi.nlm.nih.gov/

Page 20: Biological Databases
Page 21: Biological Databases

Entrez

Entrez is a search and retrieval system that integrates information from databases at NCBI.
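
The slides use the Entrez web interface. As a programmatic counterpart, NCBI also exposes Entrez through its E-utilities web service; the sketch below only builds an `esearch` query URL and does not contact the server. The helper name `build_esearch_url`, the choice of the `nucleotide` database and the search term `BNIP` are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Base address of the E-utilities "esearch" service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, max_results=20):
    """Build an Entrez keyword-search URL (nothing is sent over the network)."""
    params = {"db": database, "term": term, "retmax": max_results}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Hypothetical query: a keyword search of the nucleotide database.
url = build_esearch_url("nucleotide", "BNIP")
print(url)
```

Opening the resulting URL returns a list of matching record identifiers, which is the same keyword search the web form performs.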

Page 22: Biological Databases
Page 23: Biological Databases

BNIP

Page 24: Biological Databases
Page 25: Biological Databases

• Accession number: unique identifier

• Brief description of the sequence

• Source: the organism’s common name and formal scientific name

• Contains information on the publications, such as the authors and the titles of the journal articles that discuss the data reported in the record

• Contains the contact information of the submitter

• Contains information about the genes, gene products and regions of biological significance reported in the sequence, as well as the length of the sequence, the scientific name of the source organism, the Taxon ID number and the map location

Page 26: Biological Databases

Coding sequence: the region of nucleotides that corresponds to the sequence of amino acids. This region also contains the start and stop codons.

Region of biological interest

The amino acid translation corresponding to the nucleotide coding sequence
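
The relationship between a coding sequence and its amino acid translation can be sketched as follows. This is a simplified illustration that assumes the standard genetic code and a coding sequence given on the forward strand; the function name and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a coding sequence from its start codon to the first stop codon."""
    cds = cds.upper()
    if not cds.startswith("ATG"):
        raise ValueError("coding sequence should begin with the start codon ATG")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":  # stop codon: translation ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical coding sequence: start codon, two codons, stop codon.
protein = translate_cds("ATGGCCAAGTAA")
print(protein)
```

The start codon contributes the leading methionine (M), and the stop codon ends the translation without adding a residue.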

Page 27: Biological Databases

How to understand the output

Unique identifiers: each entry in a database must have a unique identifier.

• EMBL: Identifier (ID)

• GenBank: Accession Number (AC)

Other information is stored along with the sequence. Each piece of information is written on its own line, with a code defining the line. For example: DE, description; OS, organism species; AC, accession number. Relevant biological information is usually described in the feature table (FT).
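
The line-code layout described above can be read with a few lines of code. The sketch below is a simplification: the sample entry is invented for illustration and is not a real record, and the parser only groups lines by their two-letter code without interpreting the feature table.

```python
from collections import defaultdict

# Hypothetical entry in an EMBL-like layout: a two-letter code, then the content.
SAMPLE_ENTRY = """\
ID   EXAMPLE1; SV 1; linear; mRNA; STD; HUM; 12 BP.
AC   X00001;
DE   Hypothetical example entry
DE   used only for illustration.
OS   Homo sapiens (human)
FT   CDS             1..12
"""


def parse_line_codes(text):
    """Group the lines of an entry by line code, joining repeated codes."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 2:
            continue  # skip blank lines
        code, content = line[:2], line[2:].strip()
        fields[code].append(content)
    return {code: " ".join(values) for code, values in fields.items()}


entry = parse_line_codes(SAMPLE_ENTRY)
print(entry["AC"])
print(entry["DE"])
print(entry["OS"])
```

Because a code such as DE may span several lines, the parser joins repeated codes into one value instead of overwriting them.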

Page 28: Biological Databases

GenBank Flat File Format

Refer to the Summary Description of the GenBank Flat File Format

Or

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

Page 29: Biological Databases

ExPASy

• Expert Protein Analysis System: the proteomics server of the Swiss Institute of Bioinformatics (SIB)

• Dedicated to the analysis of protein sequences and structures

http://www.expasy.org/

Page 30: Biological Databases

Databases on the ExPASy server

• SWISS-PROT and TrEMBL - Protein knowledgebase

• PROSITE - Protein families and domains

• SWISS-2DPAGE - Two-dimensional polyacrylamide gel electrophoresis

• ENZYME - Enzyme nomenclature

• SWISS-3DIMAGE - 3D images of proteins and other biological macromolecules

• SWISS-MODEL Repository - Automatically generated protein models

Page 31: Biological Databases

SWISS-PROT

A curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases.

http://tw.expasy.org/sprot/

Page 32: Biological Databases

TrEMBL

• Computer-annotated supplement to SWISS-PROT

Page 33: Biological Databases

ENZYME

Enzyme nomenclature database

http://tw.expasy.org/enzyme/

Page 34: Biological Databases

ENZYME Database

• A repository of information relating to the nomenclature of enzymes

• Describes each type of characterized enzyme for which an EC (Enzyme Commission) number has been provided

Page 35: Biological Databases

Access to ENZYME

• by EC number

• by enzyme class

• by description (official name) or alternative name(s)

• by chemical compound

• by cofactor
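
Access by EC number and by enzyme class are closely related, because the first field of an EC number names the class. The sketch below splits an EC number into its hierarchical levels; the function name is hypothetical, and the class list includes class 7 (translocases), which was added to the EC system after these slides were written.

```python
# Top-level EC classes (class 7 was added to the EC system in 2018).
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


def parse_ec_number(ec_number):
    """Split an EC number such as 'EC 1.1.1.1' into its hierarchical levels."""
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    fields = text.split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec_number!r}")
    top_level = int(fields[0])
    return {
        "class": top_level,
        "class_name": EC_CLASSES[top_level],
        "subclass": ".".join(fields[:2]),
        "sub_subclass": ".".join(fields[:3]),
        "serial": fields[3],
    }


# EC 1.1.1.1 is alcohol dehydrogenase.
info = parse_ec_number("EC 1.1.1.1")
print(info["class_name"], info["sub_subclass"], info["serial"])
```

A search by enzyme class therefore amounts to matching a prefix of the EC number, for example every entry whose number begins with 1.1.1.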

Page 36: Biological Databases
Page 37: Biological Databases
Page 38: Biological Databases

KEGG

Kyoto Encyclopedia of Genes and Genomes

http://www.genome.ad.jp/kegg/kegg2.html

Page 39: Biological Databases

A structured database containing information about metabolic pathways in many organisms.

Page 40: Biological Databases

KEGG

• Part of the GenomeNet database system

• Linked to all accessible databases by search engines; LIGAND & BRITE

Page 41: Biological Databases
Page 42: Biological Databases
Page 43: Biological Databases

[Pathway map callouts: link to other pathways; enzyme; compound]

Page 44: Biological Databases
Page 45: Biological Databases

Summary

• Biological databases represent an invaluable resource in support of biological research.

• We can learn much about a particular molecule by searching databases and using available analysis tools.

• A large number of databases are available for that task. Some databases are very general while some are very specialised. For best results we often need to access multiple databases.

Page 46: Biological Databases

• Common database search methods include keyword matching, sequence similarity, motif searching, and class searching.

• The problems with using biological databases include incomplete information, data spread over multiple databases, redundant information, various errors, sometimes incorrect links, and constant change.

Page 47: Biological Databases

• Database standards, nomenclature, and naming conventions are not clearly defined for many aspects of biological information. This makes information extraction more difficult.

• Retrieval systems help extract rich information from multiple databases. Examples include Entrez and SRS.

• Formulating queries is a serious issue in biological databases. Often the quality of results depends on the quality of the queries.

• Access to biological databases is so important that today virtually every molecular biological project starts and ends with querying biological databases.

Page 48: Biological Databases

The End