biological databases

38
INTRODUCTION TO BIOLOGICAL DATABASES SARFARAZ HUSSAIN Department of Bioinformatics & Biotechnology GCU- Faisalabad NOTE: Most slides derived from NCBI’s field guide / [email protected] m

Upload: sarfaraz-nasri

Post on 08-Feb-2017

548 views

Category:

Education


0 download

TRANSCRIPT

Page 1: Biological databases

INTRODUCTION TO BIOLOGICAL DATABASES

SARFARAZ HUSSAIN

Department of Bioinformatics &

Biotechnology GCU-Faisalabad

NOTE: Most slides derived from NCBI’s field guide

/ [email protected]

Page 2: Biological databases

WHAT YOU NEED TO LEARN:

What is a database and what are the features of an ideal db?

What are the relationships/differences between primary and derived sequence databases?

What are the benefits of RefSeq?

Why is data integration useful?

Page 3: Biological databases

THE NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION

Created in 1988 as a part of theNational Library of Medicine at NIH

– Establish public databases– Research in computational biology– Develop software tools for sequence analysis– Disseminate biomedical information

Bethesda,MD

Page 4: Biological databases

WEB ACCESS: WWW.NCBI.NLM.NIH.GOV

New HomepageCommon footer

New pages!

Page 5: Biological databases

WHAT ARE DATABASES? Structured collection of information.

Consists of basic units called records or entries.

Each record consists of fields, which hold pre-defined data related to the record.

For example, a protein database would have protein entries as records and protein properties as fields (e.g., name of protein, length, amino-acid sequence)

Page 6: Biological databases

THE ‘PERFECT’ DATABASE Comprehensive, but easy to search.

Annotated, but not “too annotated”.

A simple, easy to understand structure.

Cross-referenced.

Minimum redundancy.

Easy retrieval of data.

Page 7: Biological databases

THE CENTRAL DOGMA & BIOLOGICAL DATA

Protein structures-Experiments-Models (homologues)

Literature information

Original DNA Sequences(Genomes)

Protein Sequences-Inferred -Direct sequencing

Expressed DNA sequences( = mRNA Sequences= cDNA sequences)Expressed Sequence Tags (ESTs)

Page 8: Biological databases

NCBI DATABASES AND SERVICES GenBank primary sequence database

Free public access to biomedical literature PubMed free Medline (3 million searches per day) PubMed Central full text online access

Entrez integrated molecular and literature databases

Page 9: Biological databases

TYPES OF MOLECULAR DATABASES

Primary Databases Original submissions by experimentalists Content controlled by the submitter

Examples: GenBank, Trace, SRA, SNP, GEO

Derivative Databases Derived from primary data Content controlled by third party (NCBI)

Examples: NCBI Protein, Refseq, TPA, RefSNP, GEO datasets, UniGene, Homologene, Structure, Conserved Domain

Page 10: Biological databases

PubMed is a free search engine accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics. The United States National Library of Medicine (NLM) at the National Institutes of Health maintains the database as part of the Entrez system of information retrieval.

Page 11: Biological databases

Pubmed: click on the drop down menu select the pubmed option. Type any topic which you want to find in the search box.

Page 12: Biological databases

After typing the topic of our interest lots of research papers will appear on window from where we select the specific papers for our study.

Page 13: Biological databases

PRIMARY VS. DERIVATIVE SEQUENCE DATABASES

GenBankGenBank

SequencingSequencingCentersCenters

GA

GAGA

ATTAT

TCCGAGA

ATTAT

TCC

AT

GAGA

ATTCC GAGA

ATTCC

TTGACAATT

GACTA

ACGTGC

TTGACA

CGTGAATTGAC

TATATAGCCG

ACGTGC

ACGTGCACGTGCTTGACA

TTGACA

CGTGA CGTGA

CGTGA

ATTGACTAATTGACTA AT

TGACTA

ATTGACTA

TATAGC

CG

TATAGCCGTATAGCCGTATAGCCGTATAGCCG TATAGCCGTATAGCCG TATAGCCG CAT

T

GAGA

ATTCC GAGA

ATTCC LabsLabs

AlgorithmsAlgorithms

UniGene

CuratorsCurators

RefSeq

GenomeAssembly

TATAGCCGAGCTCCGATACCGATGACAA

Updated continually by NCBI

Updated ONLY by submitters

Page 14: Biological databases

SEQUENCE DATABASES AT NCBI Primary

GenBank: NCBI’s primary sequence database Trace Archive: reads from capillary sequencers Sequence Read Archive: next generation data

Derivative GenPept (GenBank translations) Outside Protein (UniProt—Swiss-Prot, PDB) NCBI Reference Sequences (RefSeq)

Page 15: Biological databases

GENBANK - PRIMARY SEQUENCE DB Nucleotide only sequence database

Archival in nature Historical Reflective of submitter point of view (subjective) Redundant

Data Direct submissions (traditional records) Batch submissions FTP accounts (genome data)

Page 16: Biological databases

GENBANK - PRIMARY SEQUENCE DB (2) Three collaborating databases

1. GenBank

2. DNA Database of Japan (DDBJ)

3. European Molecular Biology Laboratory (EMBL) Database

Page 17: Biological databases

TRADITIONAL GENBANK RECORD

ACCESSION U07418

VERSION U07418.1 GI:466461

Accession•Stable•Reportable•Universal

VersionTracks changes in sequence

GI numberNCBI internal use

well annotated

the sequence is the data

Page 18: Biological databases

DERIVATIVE SEQUENCE DATABASES

Page 19: Biological databases

FEATURES Location/Qualifiers source 1..2484 /organism="Homo sapiens" /mol_type="mRNA" /db_xref="taxon:9606" /chromosome="3" /map="3p22-p23" gene 1..2484 /gene="MLH1" CDS 22..2292 /gene="MLH1" /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession Number P14242), S. cerevisiae MLH1 (GenBank Accession Number U07187), E. coli MUTL (Swiss-Prot Accession Number P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession Number P14161) and Streptococcus pneumoniae (Swiss-Prot Accession Number P14160)" /codon_start=1 /product="DNA mismatch repair protein homolog" /protein_id="AAC50285.1" /db_xref="GI:463989" /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS

GENPEPT: GENBANK CDS TRANSLATIONS

>gi|463989|gb|AAC50285.1| DNA mismatch repair prote... MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

Page 20: Biological databases

REFSEQ: DERIVATIVE SEQUENCE DATABASE Curated transcripts and proteins

Model transcripts and proteins

Assembled Genomic Regions

Chromosome records Human genome microbial organelle

ftp://ftp.ncbi.nih.gov/refseq/release/

Page 21: Biological databases

SELECTED REFSEQ ACCESSION NUMBERS

mRNAs and ProteinsNM_123456 Curated mRNANP_123456 Curated ProteinNR_123456 Curated non-coding RNAXM_123456 Predicted mRNAXP_123456 Predicted Protein XR_123456 Predicted non-coding RNAGene RecordsNG_123456 Reference Genomic SequenceChromosomeNC_123455 Microbial replicons, organelle genomes, human chromosomesAC_123455 Alternate assembliesAssembliesNT_123456 Contig NW_123456 WGS Supercontig

Page 22: Biological databases

GENBANK TO REFSEQ

Page 23: Biological databases

REFSEQS: ANNOTATION REAGENTS

Genomic DNAGenomic DNA((NCNC, , NT, NWNT, NW))

Model mRNAModel mRNA (XM)(XM)(XR)(XR)

Curated mRNACurated mRNA (NM)(NM)(NR)(NR)

Model protein Model protein (XP)(XP)

Curated ProteinCurated Protein (NP)(NP)

Scanning....

= ?

GenBankSequences

RefSeq

Page 24: Biological databases

REFSEQ BENEFITS Non-redundancy  

Updates to reflect current sequence data and biology

Data validation

Format consistency

Distinct accession series

Stewardship by NCBI staff and collaborators

Page 25: Biological databases

OTHER DERIVATIVE DATABASES Expressed Sequences

dbSNP

Structure

Gene

and more…

Page 26: Biological databases

ENTREZ

FINDING RELEVANT INFORMATION IN NCBI

DATABASES

Page 27: Biological databases

ENTREZ: A DISCOVERY SYSTEM

Gene

Taxonomy

PubMed abstracts

Nucleotide sequences

Protein sequences

3-D Structure

3 -D Structure

Word weight

VAST

BLASTBLAST

Phylogeny

Hard LinkNeighborsRelated Sequences

NeighborsRelated SequencesBLinkDomains

NeighborsRelated Structures

Pre-computed and pre-compiled data. •A potential “gold mine” of undiscovered relationships.•Used less than expected.

Page 28: Biological databases

GLOBAL QUERY: ALL NCBI DATABASES

The Entrez system: 38 (and counting) integrated databases

Page 29: Biological databases

TRADITIONAL METHOD: THE LINKS MENU

DNA Sequence

Nucleotide – Protein Link

Related Proteins

Protein – Structure Link

3-D Structure

Page 30: Biological databases

THE PROBLEM

Rapidly growing databases with complex and changing relationships

Rapidly changing interfaces to match the above

Result Many people don’t know:

Where to begin Where to click on a Web page Why it might be useful to click there

Page 31: Biological databases

GLOBAL NCBI (ENTREZ) SEARCH

colon cancer

Page 32: Biological databases

GLOBAL ENTREZ SEARCH RESULTS

Page 33: Biological databases

ENTREZ TIP: START SEARCHES IN GENE

Other Entrez DBs

HomoloGene

Entrez Protein

Gene

UniGene

BLink

Homologene:Gene Neighbors

Page 34: Biological databases

PRECISE RESULTS

MLH1[Gene Name] AND Human[Organism]

Page 35: Biological databases

MLH1 GENE RECORD

Page 36: Biological databases

MLH1:LINKS TO SEQUENCE

Page 37: Biological databases

GENEVIEW: HUMAN MLH1 VARIATIONS

ATPase domain

Page 38: Biological databases

‘TAKE HOME MESSAGE’ ADVANTAGES OF DATA INTEGRATION More relevant inter-related information in one

place

Makes it easier to find additional relevant information related to your initial query

Potentially find information indirectly linked, but relevant to your subject of interest uncover non-obvious genetic features that

explain phenotype or disease

Easier to build a ‘story’ based on multiple pieces of biological evidence