BTN323: Introduction to Biological Databases

Post on 12-Jan-2016

Page 1: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

Day1: NCBI Databases and Entrez

Lecturer: Junaid Gamieldien, PhD

[email protected]

NOTE: Most slides derived from NCBI’s field guide

http://www.sanbi.ac.za/training-2/undergraduate-training/

Page 2: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

WHAT YOU NEED TO LEARN:

What is a database and what are the features of an ideal db?

What are the relationships/differences between primary and derived sequence databases?

What are the benefits of RefSeq?

Why is data integration useful?

Page 3: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

THE NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION

Created in 1988 as a part of the National Library of Medicine at NIH

– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information

Bethesda, MD

Page 4: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

WEB ACCESS: WWW.NCBI.NLM.NIH.GOV

[Screenshot: NCBI homepage, with callouts "New Homepage", "Common footer", and "New pages!"]

Page 5: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

WHAT ARE DATABASES?

Structured collection of information.

Consists of basic units called records or entries.

Each record consists of fields, which hold pre-defined data related to the record.

For example, a protein database would have protein entries as records and protein properties as fields (e.g., name of protein, length, amino-acid sequence)
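The record/field idea above can be sketched in a few lines of Python. This is only an illustration: the class name `ProteinRecord`, the two toy entries, and their sequences are made up for the example and do not come from any real database.

```python
from dataclasses import dataclass, fields


@dataclass
class ProteinRecord:
    """One record (entry) in a toy protein database."""
    name: str        # field: name of the protein
    sequence: str    # field: amino-acid sequence

    @property
    def length(self) -> int:
        # Length is derived from the sequence field rather than stored twice.
        return len(self.sequence)


# The "database" is a structured collection of records (hypothetical entries).
database = [
    ProteinRecord(name="example peptide A", sequence="MSFVAG"),
    ProteinRecord(name="example peptide B", sequence="MKT"),
]

# Every record has the same pre-defined fields.
field_names = [f.name for f in fields(ProteinRecord)]

# Because the structure is fixed, searching by a field is straightforward.
long_records = [r.name for r in database if r.length > 4]
```

Storing the sequence once and computing the length from it is one small way to keep redundancy low.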

Page 6: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

THE ‘PERFECT’ DATABASE

Comprehensive, but easy to search.

Annotated, but not “too annotated”.

A simple, easy to understand structure.

Cross-referenced.

Minimum redundancy.

Easy retrieval of data.

Page 7: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

THE CENTRAL DOGMA & BIOLOGICAL DATA

Original DNA Sequences (Genomes)

Expressed DNA sequences (= mRNA Sequences = cDNA sequences)
– Expressed Sequence Tags (ESTs)

Protein Sequences
– Inferred
– Direct sequencing

Protein structures
– Experiments
– Models (homologues)

Literature information

Page 8: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

NCBI DATABASES AND SERVICES

GenBank: primary sequence database

Free public access to biomedical literature
– PubMed: free Medline (3 million searches per day)
– PubMed Central: full text online access

Entrez: integrated molecular and literature databases

Page 9: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

TYPES OF MOLECULAR DATABASES

Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
– Examples: GenBank, Trace, SRA, SNP, GEO

Derivative Databases
– Derived from primary data
– Content controlled by third party (NCBI)
– Examples: NCBI Protein, RefSeq, TPA, RefSNP, GEO datasets, UniGene, HomoloGene, Structure, Conserved Domain

Page 10: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

PRIMARY VS. DERIVATIVE SEQUENCE DATABASES

[Figure: Sequencing centers and labs submit sequence fragments to GenBank. Algorithms (UniGene), curators (RefSeq), and genome assembly build derivative data from those submissions.]

GenBank: updated ONLY by submitters

UniGene, RefSeq, Genome Assembly: updated continually by NCBI

Page 11: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

SEQUENCE DATABASES AT NCBI

Primary
– GenBank: NCBI’s primary sequence database
– Trace Archive: reads from capillary sequencers
– Sequence Read Archive: next generation data

Derivative
– GenPept (GenBank translations)
– Outside Protein (UniProt/Swiss-Prot, PDB)
– NCBI Reference Sequences (RefSeq)

Page 12: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

GENBANK - PRIMARY SEQUENCE DB

Nucleotide-only sequence database

Archival in nature
– Historical
– Reflective of submitter point of view (subjective)
– Redundant

Data
– Direct submissions (traditional records)
– Batch submissions
– FTP accounts (genome data)

Page 13: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

GENBANK - PRIMARY SEQUENCE DB (2)

Three collaborating databases

1. GenBank

2. DNA Database of Japan (DDBJ)

3. European Molecular Biology Laboratory (EMBL) Database

Page 14: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

TRADITIONAL GENBANK RECORD

ACCESSION U07418

VERSION U07418.1 GI:466461

Accession
• Stable
• Reportable
• Universal

Version: tracks changes in sequence

GI number: NCBI internal use

Callouts on the record: "well annotated"; "the sequence is the data"
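The relationship between the ACCESSION and VERSION lines can be illustrated with a small helper. This is a sketch only; the function name `split_version` is hypothetical, and it assumes the identifier has the `accession.version` shape shown on the slide.

```python
from typing import Tuple


def split_version(identifier: str) -> Tuple[str, int]:
    """Split an 'accession.version' string into its two parts.

    The accession stays stable; the version number after the dot is the
    part that changes when the sequence itself is updated.
    """
    accession, _, version = identifier.partition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"not in accession.version form: {identifier!r}")
    return accession, int(version)


# The record from the slide: accession U07418, first version of the sequence.
accession, version = split_version("U07418.1")
```

Citing the full `U07418.1` form pins down one exact sequence, while the bare accession refers to the record as a whole.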

Page 15: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

DERIVATIVE SEQUENCE DATABASES

Page 16: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

FEATURES             Location/Qualifiers
     source          1..2484
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="3"
                     /map="3p22-p23"
     gene            1..2484
                     /gene="MLH1"
     CDS             22..2292
                     /gene="MLH1"
                     /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession
                     Number P14242), S. cerevisiae MLH1 (GenBank Accession
                     Number U07187), E. coli MUTL (Swiss-Prot Accession Number
                     P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession
                     Number P14161) and Streptococcus pneumoniae (Swiss-Prot
                     Accession Number P14160)"
                     /codon_start=1
                     /product="DNA mismatch repair protein homolog"
                     /protein_id="AAC50285.1"
                     /db_xref="GI:463989"
                     /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
                     TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
                     ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
                     TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS
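As a worked example of how the CDS location relates to the translation, the arithmetic below estimates the protein length from the coordinates 22..2292. It is a sketch under stated assumptions: the function name `cds_protein_length` is hypothetical, and it assumes a complete, unspliced CDS with `/codon_start=1` whose location includes the stop codon.

```python
def cds_protein_length(start: int, end: int) -> int:
    """Estimate protein length (in amino acids) from a simple CDS location.

    Assumes 1-based inclusive coordinates, a single unspliced interval,
    codon_start=1, and that the final codon is a stop codon.
    """
    nucleotides = end - start + 1          # inclusive coordinates
    if nucleotides % 3 != 0:
        raise ValueError("CDS length is not a whole number of codons")
    codons = nucleotides // 3
    return codons - 1                      # the stop codon is not translated


# CDS 22..2292 from the MLH1 record: 2271 nt = 757 codons = 756 residues + stop
mlh1_length = cds_protein_length(22, 2292)
```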

GENPEPT: GENBANK CDS TRANSLATIONS

>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
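The definition line of the GenPept entry above packs several identifiers into one pipe-delimited string. A minimal sketch of pulling them apart follows; the function name `parse_defline` is hypothetical, and it only handles the `>gi|number|db|accession| title` layout shown on the slide.

```python
def parse_defline(defline: str) -> dict:
    """Parse a '>gi|GI|db|accession.version| title' FASTA definition line."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA definition line starts with '>'")
    # Split on the first four pipes only, so pipes in the title survive.
    parts = defline[1:].split("|", 4)
    if len(parts) < 5 or parts[0] != "gi":
        raise ValueError("unexpected definition line layout")
    return {
        "gi": int(parts[1]),        # GI number (NCBI internal use)
        "db": parts[2],             # source database tag, e.g. 'gb' for GenBank
        "accession": parts[3],      # protein accession.version
        "title": parts[4].strip(),  # free-text description (truncated on the slide)
    }


entry = parse_defline(">gi|463989|gb|AAC50285.1| DNA mismatch repair prote...")
```

The accession recovered here matches the `/protein_id` qualifier of the CDS feature, which is how the GenPept entry ties back to its GenBank nucleotide record.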

Page 17: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

REFSEQ: DERIVATIVE SEQUENCE DATABASE

Curated transcripts and proteins

Model transcripts and proteins

Assembled Genomic Regions

Chromosome records
– Human genome
– microbial
– organelle

ftp://ftp.ncbi.nih.gov/refseq/release/

Page 18: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

SELECTED REFSEQ ACCESSION NUMBERS

mRNAs and Proteins

NM_123456   Curated mRNA
NP_123456   Curated Protein
NR_123456   Curated non-coding RNA
XM_123456   Predicted mRNA
XP_123456   Predicted Protein
XR_123456   Predicted non-coding RNA

Gene Records
NG_123456   Reference Genomic Sequence

Chromosome
NC_123455   Microbial replicons, organelle genomes, human chromosomes
AC_123455   Alternate assemblies

Assemblies
NT_123456   Contig
NW_123456   WGS Supercontig
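Because each RefSeq accession series has a distinct prefix, the record type can be read directly from the accession. The lookup below is a sketch covering only the prefixes listed on this slide; the names `REFSEQ_PREFIXES` and `describe_refseq` are hypothetical.

```python
from typing import Optional

# Prefix -> record type, taken from the slide's table.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated Protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference Genomic Sequence",
    "NC": "Microbial replicons, organelle genomes, human chromosomes",
    "AC": "Alternate assemblies",
    "NT": "Contig",
    "NW": "WGS Supercontig",
}


def describe_refseq(accession: str) -> Optional[str]:
    """Return the record type for a RefSeq accession, or None if not listed."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return None  # no underscore: not in the RefSeq accession style
    return REFSEQ_PREFIXES.get(prefix.upper())


kind = describe_refseq("NM_123456")
```

A GenBank accession such as `U07418` has no underscore, so the function returns `None` for it, which mirrors the "distinct accession series" point.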

Page 19: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

GENBANK TO REFSEQ

Page 20: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

REFSEQS: ANNOTATION REAGENTS

Genomic DNA (NC, NT, NW)

Model mRNA (XM), (XR)

Curated mRNA (NM), (NR)

Model protein (XP)

Curated Protein (NP)

[Figure: GenBank sequences are scanned and compared ("= ?") against RefSeq records]

Page 21: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

REFSEQ BENEFITS

Non-redundancy  

Updates to reflect current sequence data and biology

Data validation

Format consistency

Distinct accession series

Stewardship by NCBI staff and collaborators

Page 22: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

OTHER DERIVATIVE DATABASES

Expressed Sequences

dbSNP

Structure

Gene

and more…

Page 23: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

ENTREZ

FINDING RELEVANT INFORMATION IN NCBI DATABASES

Page 24: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

ENTREZ: A DISCOVERY SYSTEM

[Figure: Entrez as a discovery system. Linked databases: Gene, Taxonomy, PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure. Link labels: Hard Link; Neighbors (Related Sequences); Neighbors (Related Sequences, BLink, Domains); Neighbors (Related Structures); Phylogeny. Neighbor methods: word weight, BLAST, VAST.]

Pre-computed and pre-compiled data.

• A potential “gold mine” of undiscovered relationships.

• Used less than expected.

Page 25: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

GLOBAL QUERY: ALL NCBI DATABASES

The Entrez system: 38 (and counting) integrated databases

Page 26: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

TRADITIONAL METHOD: THE LINKS MENU

DNA Sequence

Nucleotide – Protein Link

Related Proteins

Protein – Structure Link

3-D Structure

Page 27: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

THE PROBLEM

Rapidly growing databases with complex and changing relationships

Rapidly changing interfaces to match the above

Result: many people don’t know

– Where to begin
– Where to click on a Web page
– Why it might be useful to click there

Page 28: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

GLOBAL NCBI (ENTREZ) SEARCH

Search term: colon cancer

Page 29: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

GLOBAL ENTREZ SEARCH RESULTS

Page 30: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

ENTREZ TIP: START SEARCHES IN GENE

[Figure: Gene as the starting point, with links to HomoloGene (gene neighbors), UniGene, Entrez Protein, BLink, and other Entrez DBs]

Page 31: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

PRECISE RESULTS

MLH1[Gene Name] AND Human[Organism]
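The same fielded query can also be sent programmatically. The slides only show the web interface, so the use of NCBI's E-utilities ESearch endpoint here is an assumption added for illustration, and the helper name `build_esearch_url` is hypothetical. The sketch only builds the request URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (not mentioned on the slides).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for a fielded Entrez query."""
    params = {
        "db": db,            # which Entrez database to search, e.g. 'gene'
        "term": term,        # the query, including [Field] qualifiers
        "retmax": retmax,    # maximum number of IDs to return
        "retmode": "json",   # ask for a JSON response
    }
    # urlencode percent-escapes the brackets and spaces in the query.
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_esearch_url("gene", "MLH1[Gene Name] AND Human[Organism]")
```

Restricting each term to a field is what makes the result precise: an unqualified search for MLH1 would also match records that merely mention the gene.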

Page 32: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

MLH1 GENE RECORD

Page 33: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

MLH1: LINKS TO SEQUENCE

Page 34: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

GENEVIEW: HUMAN MLH1 VARIATIONS

ATPase domain

Page 35: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

‘TAKE HOME MESSAGE’

ADVANTAGES OF DATA INTEGRATION

More relevant inter-related information in one place

Makes it easier to find additional relevant information related to your initial query

Potentially find information indirectly linked, but relevant to your subject of interest
– uncover non-obvious genetic features that explain phenotype or disease

Easier to build a ‘story’ based on multiple pieces of biological evidence