ncbi fieldguide ncbi molecular biology resources january 12, 2007 a field guide part 1

NCBI FieldGuide

NCBI Molecular Biology Resources

January 12, 2007

A Field Guide, Part 1

• The NCBI Entrez System

• NCBI Sequence Databases
– Primary data: GenBank
– Derivative data: RefSeq, Gene

• Protein Structure and Function

• Sequence polymorphisms and phenotypes

** Intermission **

• NCBI Genomic Resources

• BLAST

NCBI Resources

The National Center for Biotechnology Information

Created in 1988 as a part of the National Library of Medicine at NIH

– national resource for molecular biology information (biological information direct from organisms)

– gather data both nationally and internationally
– develop new information technologies to aid in the understanding of fundamental molecular and genetic processes that control health and disease

Bethesda, MD

Data sources: traditional literature and data obtained from the direct study of organisms

The information landscape in biological and medical research has grown far beyond literature to include a wide variety of databases generated by research fields such as molecular biology and genomics.

Figure 1 from Geer RC. Broad issues to consider for library involvement in bioinformatics. J Med Libr Assoc. 2006 Jul;94(3):286–98, E152–5. PMID: 16888662

NCBI:
– accepts submissions of bibliographic records (example) and primary research data (example nucleotide sequence for colon cancer gene, MLH1)

– organizes the information into databases, maintains them, makes them available to the world

– develops software to retrieve and analyze the data

– conducts basic research to make new biological discoveries using the databases and software tools

What does NCBI do?

• NCBI accepts submissions of primary data

• NCBI develops tools to analyze these data

• NCBI uses these tools to create derivative databases based on the primary data

• NCBI provides free search, link, and retrieval of these data, primarily through the Entrez system

www.ncbi.nlm.nih.gov

[Figure: Web access to NCBI by query type. Text: Entrez. Sequence: BLAST. Protein Structure: VAST. Small Mol. Structure: PubChem.]

The NCBI ftp site

30,000 files per day
620 Gigabytes per day

NCBI Toolbox: In-house source code useful for incorporating NCBI-like functionality into your own programs. Three main parts: Data Model, Data Encoding and Programming Libraries.

• Examples: BLAST, Cn3D, Sequin, Data format conversion scripts

http://www.ncbi.nlm.nih.gov/IEB/ToolBox/index.cgi

Help for Programmers

http://www.ncbi.nih.gov/entrez/query/static/eutils_help.html

E-Utilities: Guidelines for Entrez “URL calls” used to access data. Designed for use in scripts.

• Examples: ESearch, EPost, ESummary, EFetch and ELink

Caution: Overuse may result in blocked IPs!
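As a rough illustration of the Entrez "URL calls" described here, the Python sketch below builds an ESearch URL without sending it. The base URL is the current E-utilities address rather than the 2007 help page linked on the slide, and the helper name `eutils_url` and the example search term are assumptions made for illustration only.

```python
from urllib.parse import urlencode

# Assumption: the current E-utilities base URL, not the 2007 help-page link.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def eutils_url(tool: str, **params: str) -> str:
    """Build the URL for one E-utility, e.g. 'esearch' or 'efetch'."""
    return f"{EUTILS_BASE}/{tool}.fcgi?{urlencode(params)}"


# Hypothetical example: look up one accession in the nucleotide database.
esearch_url = eutils_url(
    "esearch", db="nucleotide", term="M17755[accn]", retmax="5"
)
print(esearch_url)
```

A script that actually sends such requests should pause between calls. NCBI's published guidance is at most three requests per second without an API key, which is the practical meaning of the caution about blocked IPs.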

Global Entrez Search Page

All[Filter]

What is Entrez?

• A system of 31 linked databases

• A text search engine

• A tool for finding biologically linked data

• A retrieval engine

• A virtual workspace for manipulating large datasets

Entrez Databases

• Each record is assigned a UID
– unique integer identifier for internal tracking
– GI number for Nucleotide
• Each record is given a Document Summary
– a summary of the record’s content (DocSum)
• Each record is assigned links to biologically related UIDs
• Each record is indexed by data fields
– [author], [title], [organism], and many others

Linking in Entrez

Follow links to related data in the same database or in others!

Links

Hard Links: Curated links based on biology
• nucleotide → taxonomy (based on organism identifier)
• protein → domain relatives (based on domain assignment)
• domains → pubmed (based on supporting literature)
• pcsubstance → structures/mmdb (based on source information)

Soft Links: Pre-computed analyses
• nucleotide → related sequences (BLAST neighbors)
• protein → conserved domains (CDD/RPS-BLAST search)
• pccompound → pccompound (structure-based neighboring)

Entrez: Database Integration

[Figure: PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure, Genomes, and Taxonomy joined by hard links and by neighbors. Neighbor labels: word weight; BLAST (Related Sequences; Related Seqs., BLink, Domains); VAST (Related Structures); Phylogeny.]

Links: Database Integration at NCBI

[Figure: link matrix. Each row lists the links available from that database to Gene, Nucleotide, Protein, Structure, CDD, SNP, Taxonomy, and PubMed.]

Gene: Homologene; mRNAs, genome; all CDS products; protein function; SNPs, indels; source organism; literature
Nucleotide: gene locus; BLASTn; CDS product; 3D DNA, 3D RNA; SNPs, indels; source organism; literature
Protein: gene locus; cDNA transcript; BLASTp; 3D proteins; function; SNPs, indels; source organism; literature
Structure: DNA sequence; protein sequence; VAST; protein function; SNP; BLASTp; source organism; literature
CDD: gene loci; proteins with CD; 3D templates; CDART; broadest taxon; literature
SNP: gene locus; DNA sequence; protein sequence; 3D template; source organism; literature
Taxonomy: genes for taxon; seqs for taxon; seqs for taxon; structs for taxon; CD spans taxon; SNPs for taxon; Common Tree
PubMed: gene loci in article; sequence in article; sequence in article; structure in article; CDs in article; SNPs in article; related articles

Types of Databases

• Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
– Examples: GenBank, dbSNP, GEO, PubChem Substance and PubChem BioAssay

• Derivative Databases
– Built from primary data
– Content controlled by third party (NCBI)
– Examples: RefSeq, RefSNP, GEO Datasets, PubChem Compound

An Entrez Database - Nucleotide

• GenBank: Primary Data (98.2%)
– original submissions by experimentalists
– submitters retain editorial control of records
– archival in nature

• RefSeq: Derivative Data (1.8%)
– curated by NCBI staff
– NCBI retains editorial control of records
– record content is updated continually

Literature Databases

NM_000249: PubMed

Books

Books Link

A part of the NCBI Bookshelf

Part 1. The Databases

Part 2. Data Flow and Processing

Part 3. Querying and Linking the Data

Part 4. User Support

PubMed Central

PubMed Central is a digital archive of life sciences journal literature. Integrated into the Entrez retrieval system, PMC provides free and unrestricted access to the full text of over 160 life sciences journals, with more to come.

NCBI Journal Database

Detailed journal information

OMIM
• A catalogue of genes involved with human disease processes
• Detailed clinical and reference information
• Curated and maintained by Johns Hopkins
• Links to PubMed and sequence databases

Primary vs. Derivative Databases

[Figure: Labs and sequencing centers submit sequences to GenBank (divisions PRI, ROD, PLN, MAM, BCT, INV, VRT, PHG, VRL, EST, STS, GSS, HTG), which is updated ONLY by submitters. Curators and algorithms build UniGene, UniSTS, and RefSeq (Annotation Pipeline; Gene and Genomes Pipelines) from those records; these are updated continually by NCBI.]

What is GenBank? NCBI’s Primary Sequence Database

• Nucleotide only sequence database
• Archival in nature
• Each record is assigned a stable accession number
• GenBank Data
– Direct submissions (traditional records)
– Batch submissions (EST, GSS, STS)
– ftp accounts (genome data)
• Three collaborating databases
– GenBank
– DNA Database of Japan (DDBJ)
– European Molecular Biology Laboratory (EMBL) Database

The International Sequence Database Collaboration

[Figure: GenBank (NCBI, NIH), DDBJ (CIB, NIG), and EMBL (EBI) each receive submissions and updates and exchange them with one another. Labels: Sequin, BankIt, and ftp for GenBank submission; Entrez, getentry, and SRS for retrieval.]

GenBank Releases

• full release every two months
• incremental and cumulative updates daily
• available only through internet

ftp://ftp.ncbi.nih.gov/genbank/

Release 156 (non-WGS), October 2006
62,765,195 Records
66,925,938,907 Nucleotides
>150,000 Species
245 Gigabytes
1032 files

The Growth of GenBank

Non-WGS: 59.8 billion bases

WGS: 63.2 billion bases

Release 152

GenBank Divisions

Traditional
PRI Primate
ROD Rodent
PLN Plant and Fungal
BCT Bacterial/Archaeal
VRT Other Vertebrate
INV Invertebrate
VRL Viral
MAM Mammalian
PHG Phage
SYN Synthetic
UNA Unannotated

• Direct Submissions (Sequin/BankIt)
• Accurate (~1 error per 10,000 bp)
• Well characterized
• Organized by taxonomy

Bulk
EST Expressed Sequence Tag
GSS Genome Survey Sequence
HTG High Throughput Genomic
PAT Patent sequences
STS Sequence Tagged Site
HTC High Throughput cDNA
CON Constructed entries

• From sequencing projects
• Batch submissions (ftp/email)
• Inaccurate
• Poorly Characterized
• Organized by sequence type
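The split between traditional and bulk divisions can be captured in a small lookup table. The Python sketch below is illustrative only: the division codes and names come from the slide, while the function name `describe_division` is an assumption.

```python
# Division codes and names as listed on the slide.
TRADITIONAL = {
    "PRI": "Primate", "ROD": "Rodent", "PLN": "Plant and Fungal",
    "BCT": "Bacterial/Archaeal", "VRT": "Other Vertebrate",
    "INV": "Invertebrate", "VRL": "Viral", "MAM": "Mammalian",
    "PHG": "Phage", "SYN": "Synthetic", "UNA": "Unannotated",
}
BULK = {
    "EST": "Expressed Sequence Tag", "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic", "PAT": "Patent sequences",
    "STS": "Sequence Tagged Site", "HTC": "High Throughput cDNA",
    "CON": "Constructed entries",
}


def describe_division(code: str) -> tuple:
    """Return (group, name) for a three-letter GenBank division code."""
    code = code.upper()
    if code in TRADITIONAL:
        return ("traditional", TRADITIONAL[code])
    if code in BULK:
        return ("bulk", BULK[code])
    raise KeyError(f"unknown division code: {code}")


print(describe_division("pln"))
print(describe_division("EST"))
```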

Entrez Nucleotide Subsets

CoreNucleotide 29225247
EST 39288168
GSS 15655087
TOTAL 84168502

A Traditional GenBank Record

LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
                     LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
                     ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
                     EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
                     KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt

     1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
     1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
     1921 aaaaaaaaaa a
//

Header

Feature Table

Sequence

The Flatfile Format
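Because the flat file header is keyword-driven, its top-level fields can be pulled out with very little code. The Python sketch below is a minimal illustration, not a full GenBank parser: it embeds the first lines of the AY182241 record, treats every indented line as a continuation of the previous keyword, and does not handle sub-keywords such as ORGANISM or the feature table. The names `parse_header` and `record` are assumptions.

```python
# First header lines of the AY182241 record shown on the slide.
SAMPLE_HEADER = """\
LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
"""


def parse_header(text: str) -> dict:
    """Map each top-level keyword to its value, joining continuation lines."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # A keyword starts in column 1; the value follows the padding.
            current, _, value = line.partition(" ")
            fields[current] = value.strip()
        elif current is not None:
            # Indented line: continuation of the previous keyword's value.
            fields[current] += " " + line.strip()
    return fields


header = parse_header(SAMPLE_HEADER)
locus = header["LOCUS"].split()
record = {
    "accession": header["ACCESSION"],
    "version": header["VERSION"].split()[0],
    "length_bp": int(locus[1]),
    "division": locus[-2],
    "date": locus[-1],
}
print(record)
```

For real work, an established parser is the safer choice; this sketch only shows how the Header section is laid out.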

An Example Record – M17755

Indexing for Nucleotide UID 4680720

Field                 Indexed Terms
[primary accession]   M17755
[title]               Homo sapiens thyroid peroxidase (TPO) mRNA…
[organism]            Homo sapiens
[sequence length]     3060
[modification date]   1999/04/26
[properties]          biomol mrna
                      gbdiv pri
                      srcdb genbank

M17755: Feature Table

CDS position in bp

TPO [gene name]

thyroiditis [text word]

thyroid peroxidase [protein name]

protein accession

Sequence: 99.99% Accurate

The sequence itself is not indexed…

Use BLAST for that!

Entrez Protein

• GenPept (DDBJ, EMBL, GenBank) 6259705
• RefSeq 2997502
• Swiss Prot 236666
• PDB 86934
• PIR 30413
• PRF 12079
• Third Party Annotation 4969

Total 9628271

Protein Sources and Links

PIR

RefSeq

SWISS-PROT

GenPept

NM_000547

M17755

no mRNA!

no mRNA!

Sequence Revisions

Version and GI change only if the sequence changes

The accession number always retrieves the most recent version

First seen at NCBI, not first seen at GenBank!
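The accession and version rules can be made concrete with a short parser. This Python sketch is illustrative only: it splits an identifier such as AY182241.2 into accession and version, and treats a bare accession as "most recent version", as the slide describes. The names `SeqId`, `parse_seq_id`, and `is_sequence_revision` are assumptions.

```python
from typing import NamedTuple, Optional


class SeqId(NamedTuple):
    accession: str
    version: Optional[int]  # None means "most recent version"


def parse_seq_id(text: str) -> SeqId:
    """Split 'ACCESSION.VERSION' into its parts; the version is optional."""
    accession, dot, version = text.strip().partition(".")
    return SeqId(accession, int(version) if dot else None)


def is_sequence_revision(old: str, new: str) -> bool:
    """True when both IDs share an accession but the version has increased,
    which per the slide happens only when the sequence itself changed."""
    a, b = parse_seq_id(old), parse_seq_id(new)
    if a.version is None or b.version is None:
        return False
    return a.accession == b.accession and b.version > a.version


print(parse_seq_id("AY182241.2"))
print(is_sequence_revision("AY182241.1", "AY182241.2"))
```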

Update without a Sequence Change

June 15, 1989!

GenBank came to NCBI in 1992!

Update with a Sequence Change

GenBank File Formats

ASN.1 – The Raw Data

XML

FASTA

flat file

NCBI Toolbox

Toolbox Sources

ftp> open ftp.ncbi.nih.gov
.
.
ftp> cd toolbox
ftp> cd ncbi_tools

ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools

/**************************************************************************
* asn2ff.c
* convert an ASN.1 entry to flat file format, using the FFPrintArray.
***************************************************************************/
#include <accentr.h>
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include <subutil.h>
#include <objall.h>
#include <objcode.h>
#include <lsqfetch.h>
#include <explore.h>

#ifdef ENABLE_ID1
#include <accid1.h>
#endif

FILE *fpl;

Args myargs[] = {
{"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
{"Input is a Seq-entry","F", NULL ,NULL ,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
{"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
{"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
{"Show Sequence?","T", NULL ,NULL ,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},

Text Queries in Entrez

term1[limit] OP term2[limit] OP …

where
limit = Entrez indexing field (organism, author, …)
OP = Boolean operator = AND, OR, NOT

Complex queries: ((A[limit1] OR B[limit2]) AND C[limit3]) NOT D[limit4]

Ranges: 1:200[MW]

Wildcards: cancer[title] vs. cancer*[title]
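The query grammar above composes mechanically, so it can be assembled from small helpers. The Python sketch below only builds query strings; it does not contact Entrez. All function names are assumptions made for illustration.

```python
def field(term: str, limit: str) -> str:
    """term[limit], e.g. cancer[title]."""
    return f"{term}[{limit}]"


def all_of(*clauses: str) -> str:
    """Join clauses with AND inside one pair of parentheses."""
    return "(" + " AND ".join(clauses) + ")"


def any_of(*clauses: str) -> str:
    """Join clauses with OR inside one pair of parentheses."""
    return "(" + " OR ".join(clauses) + ")"


def exclude(clause: str, unwanted: str) -> str:
    """clause NOT unwanted."""
    return f"{clause} NOT {unwanted}"


def value_range(low: int, high: int, limit: str) -> str:
    """low:high[limit], e.g. 1:200[MW]."""
    return f"{low}:{high}[{limit}]"


# Rebuild the complex query from the slide.
complex_query = exclude(
    all_of(any_of(field("A", "limit1"), field("B", "limit2")),
           field("C", "limit3")),
    field("D", "limit4"),
)
print(complex_query)
print(value_range(1, 200, "MW"))
print(field("cancer*", "title"))
```

Parenthesizing every AND/OR group makes the evaluation order explicit instead of relying on how Entrez orders Boolean operators.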

Entrez Tabs

Limits: Provides a simple form for applying commonly used Entrez limits

Preview/Index: Allows access to the full indexing of each Entrez database and aids in constructing complex queries

History: Provides access to previous searches in the current Entrez database

Clipboard: A temporary storage area for selected records

Details: Displays the detailed parsing of the current Entrez query, and lists errors and terms without matches

Programming Entrez: E-Utilities

Utility   Input                  Output
ESearch   Entrez query           UID list or History
EPost     UID list               History
ESummary  UID list or History    Document summaries
EFetch    UID list or History    Formatted data
ELink     UID list or History    UID list or History

http://www.ncbi.nih.gov/entrez/query/static/eutils_help.html
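The utilities chain together through the History: ESearch can store its result set on the server, and EFetch can then retrieve it by reference. The Python sketch below illustrates that hand-off without making network calls. It uses the current E-utilities base URL (an assumption relative to the 2007 link), and the sample XML response is hand-written for illustration rather than real server output.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumption: the current E-utilities base URL.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str) -> str:
    """ESearch: Entrez query in, result set stored on the History server."""
    query = urlencode({"db": db, "term": term, "usehistory": "y"})
    return f"{BASE}/esearch.fcgi?{query}"


def read_history(esearch_xml: str) -> tuple:
    """Extract the WebEnv and QueryKey values from an ESearch XML reply."""
    root = ET.fromstring(esearch_xml)
    return root.findtext("WebEnv"), root.findtext("QueryKey")


def efetch_url(db: str, web_env: str, query_key: str,
               rettype: str = "fasta") -> str:
    """EFetch: History reference in, formatted data out."""
    query = urlencode({"db": db, "WebEnv": web_env, "query_key": query_key,
                       "rettype": rettype, "retmode": "text"})
    return f"{BASE}/efetch.fcgi?{query}"


# Hand-written stand-in for an ESearch reply (illustrative values only).
SAMPLE_RESPONSE = (
    "<eSearchResult><Count>24</Count><QueryKey>1</QueryKey>"
    "<WebEnv>EXAMPLE_WEBENV</WebEnv></eSearchResult>"
)

search = esearch_url("nucleotide", "thyroid peroxidase[title] AND human[orgn]")
web_env, query_key = read_history(SAMPLE_RESPONSE)
fetch = efetch_url("nucleotide", web_env, query_key)
print(search)
print(fetch)
```

Passing a History reference instead of a long UID list keeps URLs short and avoids re-running the search for each follow-up call.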

Finding Primary Sequences

• Search Entrez CoreNucleotide
– 94.8% GenBank (primary data)
– 5.2% RefSeq (curated data)

Possible queries we’ve seen so far…

M17755 [primary accession]
thyroid peroxidase [title]
Homo sapiens [organism]
3060 [sequence length]
biomol mrna [properties]
srcdb genbank [properties]
TPO [gene name]
thyroiditis [text word]
thyroid peroxidase [protein name]
1999/04/26 [modification date]
gbdiv pri [properties]

A Starting Query

Find nucleotide records for human thyroid peroxidase

human thyroid peroxidase
(("Homo sapiens"[Organism] OR human[All Fields]) AND thyroid peroxidase[All Fields])
276 records

Field Limit!

human[organism] AND thyroid peroxidase
("Homo sapiens"[Organism] AND thyroid peroxidase[All Fields])
262 records

14 records aren’t human sequences!!

Limit by Title and Database

#1: thyroid peroxidase AND human[orgn] 262
#2: thyroid peroxidase[title] AND human[orgn] 55
#3: #2 AND srcdb refseq[properties] 5
#4: #2 AND srcdb ddbj/embl/genbank[properties] 50

Entrez Nucleotide
GenBank (primary data): srcdb ddbj/embl/genbank[properties]
RefSeq: srcdb refseq[properties]

Limit by Biomolecule Type

Genomic DNA: biomol genomic[prop]
cDNA: biomol mrna[prop]

#1: thyroid peroxidase AND human[orgn] 262
#2: thyroid peroxidase[title] AND human[orgn] 55
#3: #2 AND srcdb refseq[properties] 5
#4: #2 AND srcdb ddbj/embl/genbank[properties] 50
#5: #4 AND biomol genomic[prop] 26 (genomic DNA)
#6: #4 AND biomol mrna[prop] 24 (mRNA / cDNA)
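Each numbered History entry is shorthand for the full query it refers to. The Python sketch below expands those "#N" references recursively so the final search can be read as a single expression. It is an illustration of the idea only, not how Entrez resolves History internally; the function name `expand_history` is an assumption, and the query texts are copied from the slide.

```python
import re

# History entries from the slide, keyed by search number.
history = {
    1: "thyroid peroxidase AND human[orgn]",
    2: "thyroid peroxidase[title] AND human[orgn]",
    3: "#2 AND srcdb refseq[properties]",
    4: "#2 AND srcdb ddbj/embl/genbank[properties]",
    5: "#4 AND biomol genomic[prop]",
    6: "#4 AND biomol mrna[prop]",
}


def expand_history(entries: dict, query: str) -> str:
    """Replace every #N with the parenthesized, fully expanded search N."""
    def substitute(match: re.Match) -> str:
        inner = entries[int(match.group(1))]
        return "(" + expand_history(entries, inner) + ")"
    return re.sub(r"#(\d+)", substitute, query)


full_query = expand_history(history, history[6])
print(full_query)
```

The expanded form of search #6 makes visible that the 24 mRNA records are title-limited, human, and drawn from DDBJ/EMBL/GenBank rather than RefSeq.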

Limit by Protein Name

thyroid peroxidase[protein name] AND human[orgn] AND gbdiv pri[prop] AND biomol mrna[prop]

24 records [title]
5 records [protein name]

Entrez Document Summaries

Click the accession to view the record

Links menu

Links to other Entrez databases computed for M17755

Viewing M17755

GenBank Sequences for Human TPO

Which one is the best sequence???

RefSeq: NCBI’s Derivative Sequence Database

RefSeq Benefits

• Non-redundant
• Explicitly linked nucleotide and protein sequences
• Updated to reflect current sequence data and biology
• Validated by hand
• Format consistency
• Distinct accession series
• Stewardship by NCBI staff and collaborators

ftp://ftp.ncbi.nih.gov/refseq/release

RefSeq: NCBI’s Derivative Sequence Database

Nucleotide    Protein

• Curated transcripts and proteins
– NM_123456 NP_123456
– NR_123456 (non-coding RNA)
• Model transcripts and proteins
– XM_123456 XP_123456
– XR_123456 (non-coding RNA)
• Assembled Genomic Regions (contigs)
– NT_123456 (BAC clones)
– NW_123456 (WGS)
• Other Genomic Sequence
– NG_123456 (complex regions, pseudogenes)
– NZ_ABCD12345678 (WGS) ZP_123456
• Chromosome records in Entrez Genome
– NC_123456 (chromosome; microbial or organelle genome)
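Because RefSeq uses distinct accession prefixes, the kind of record can be read from the accession alone. The Python sketch below encodes the prefixes from the slide in a lookup table. The category wording is paraphrased from the slide, the ZP_ description is an interpretation of its position in the Protein column, and the function name `classify_refseq` is an assumption.

```python
# Prefix -> (status or group, molecule), paraphrased from the slide.
REFSEQ_PREFIXES = {
    "NM": ("curated", "mRNA"),
    "NP": ("curated", "protein"),
    "NR": ("curated", "non-coding RNA"),
    "XM": ("model", "mRNA"),
    "XP": ("model", "protein"),
    "XR": ("model", "non-coding RNA"),
    "NT": ("assembled genomic region", "contig (BAC clones)"),
    "NW": ("assembled genomic region", "contig (WGS)"),
    "NG": ("other genomic", "complex region or pseudogene"),
    "NZ": ("other genomic", "WGS"),
    "ZP": ("other genomic", "protein on a WGS record"),  # interpretation
    "NC": ("chromosome", "chromosome; microbial or organelle genome"),
}


def classify_refseq(accession: str):
    """Return (group, molecule) for a RefSeq accession, or None otherwise."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return None  # GenBank accessions such as M17755 have no underscore
    return REFSEQ_PREFIXES.get(prefix.upper())


for acc in ("NM_000547", "XM_124429.1", "NC_000002", "M17755"):
    print(acc, classify_refseq(acc))
```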

NM/NP Records in Entrez

COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence was derived from M17755.2 and AW874082.1. On Feb 25, 2003 this sequence version replaced gi:21361188.

NM_000547: variant 1

COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence was derived from J02970.1, AW874082.1 and M17755.2.

NM_175719: variant 2

EST that completes 3’ end

Nucleotide

Protein

Annotating the Gene

Genomic DNA (NC, NT, NW)
Model mRNA (XM) (XR)
Model protein (XP)
Curated mRNA (NM) (NR)
Curated Protein (NP)

GenBank Sequences

RefSeq

The Perils of the XM

XM records are models based only on genomic sequence, and are subject to revision or removal with each new build of that genome.

BLAST the XM against the RefSeq database to look for a replacement:

Query= gi|20850420|ref|XM_124429.1| Mus musculus expressed sequence AA553001 (AA553001), mRNA

gi|19527087|ref|NM_133873.1| Mus musculus DNA segment, Chr 4, Wayne State University 114, expressed (D4Wsu114e), mRNA
Length=1898
Score = 3701.55 bits (1867), Expect = 0
Identities = 1870/1871 (99%), Gaps = 0/1871 (0%)
Strand=Plus/Plus

Entrez Gene and RefSeq

• Entrez Gene is the central depository for information about a gene available at NCBI, and often provides links to sites beyond NCBI

• Entrez Gene includes records for organisms that have NCBI Reference Sequences (RefSeqs)

• Entrez Gene records contain RefSeq mRNAs, proteins, and genomic DNA (if known) for a gene locus, plus links to other Entrez databases

• NCBI RefSeqs are based on primary sequence data in GenBank

GenBank RefSeq Gene

Nucleotide

Entrez Gene: RefSeq Annotations

NM/NP Records in Entrez Gene

Entrez Gene RefSeq Graphics

NM NP

Getting the Annotation Details

Genomic sequence

ACCESSION NC_000002 REGION: 1396242..1525502

Genome Annotation in Entrez Nucleotide

GenBank Components (clones, WGS) NT/NW Contigs NC

Assembly

Components

Genome

Components

NM/XM

Master

mRNA

Genome Annotation Links

curated mRNA

genomic contig on chromosome 2 transcribing NM_000547

human chromosome 2

the 18 contigs of the chromosome 2 assembly

Searching Entrez Gene

RefSeq status and variants: Reviewed RefSeqs with transcript variants

srcdb refseq reviewed[prop] AND has transcript variants[prop]

Gene symbol: human thyroid peroxidase (TPO)

tpo [sym] AND human [organism]

Disease and Gene Ontology: Membrane proteins linked to cancer

integral to plasma membrane[gene ontology] AND cancer [dis]

Chromosome and Links: genes on human chromosome 2 with OMIM links

2 [chromosome] AND gene omim [filter] AND human [organism]

Protein name: topoisomerase genes from Archaea

topoisomerase[gene/protein name] AND archaea [organism]

Third Party Annotation (TPA) Database

NCBI now accepts the submission of new annotations of existing GenBank sequences.

• Submissions must be published in a peer-reviewed journal.
• Facilitates the annotation of sequences by experts.

Examples of sequences appropriate for TPA are:
• Annotation of features on gene and/or mRNA sequences
• Assembled “full length” genes and/or mRNAs

What should not be submitted to TPA?
• Synthetic constructs (such as cloning vectors) that use well-characterized, publicly available genes, promoters, or terminators
• Updates or changes to existing sequence data
• Sequence annotations without experimental evidence

Linking Protein Sequence, Structure, and Function

Conserved Domains (CDD): sequence ↔ function (pfam, smart); sequence ↔ structure + function (cd)
Structure (MMDB): sequence ↔ structure
VAST: structure ↔ structure
Protein: sequence ↔ sequence

Entrez Structure

MMDB: Molecular Modeling DataBase

• Derived from experimentally determined PDB records
• Add value to PDB records by:
– Adding explicit chemical bonding information
– Validating and indexing the sequences
– Annotating 3D domains and secondary structure
– Adding links to CDD, Taxonomy, PubMed
– Converting PDB data to ASN.1
• Structure neighbors determined by Vector Alignment Search Tool (VAST)

Structure Summary Page

Conserved Domains

VAST Neighbors for chain C (domain 0)

Cn3D

VAST Neighbors for domain 2

Related Structures

VAST: Structure Neighbors

Vector Alignment Search Tool

For each 3D domain, locate SSEs (secondary structure elements), and represent them as individual vectors.

[Figure: Human IL-4, with SSE vectors numbered 1 to 6]

VAST uses 3D Domains only! Whole polypeptides are assigned 3D domain 0 (zero).

VAST Neighbors

1D2V

1D2V

1Q4G

3D domains!

Cn3D

Submitting a PDB File to VAST

• Redesigned interface!
• This is the best way to convert PDB into MMDB format!

New!

Structure + Function

VAST finds proteins that have similar 3D folds

CD-Search finds proteins that have similar sequences and similar functions

Curated CDs = VAST + CD-Search

Proteins that have similar 3D folds, similar sequences and similar functions

Protein Links: Domains

Click on a colored bar to align your sequence to the CD

CDD Record – heme peroxidases

aligned query

red = high conservation

blue = low conservation

Curated CD Record - EGF

Annotated features

Launch Cn3D

Launch CDTree (New): phylogenetic tree of aligned sequences

Curated CD Record - EGF

Annotated features

Launch Cn3D


CDTree

New

Cn3D

Entrez PubChem

PC Substance: Primary database of chemical samples

PC Compound: Derived database of known chemicals from PC Substance records

PC BioAssay: Primary database of bioactivity screens of samples in PC Substance

Links from Structure

N-acetylglucosamine

heme

mannose

fucose

Sequence Polymorphisms

SNP: General Polymorphisms
• Primary database of submitted SNPs
• Curated database of reference SNPs
• Contains more than just SNPs:
– True SNPs
– MNP (multiple nucleotide)
– Insertions
– Deletions
– Microsatellites
– Mixed
– No variation (constant)

OMIM: Human Phenotypes
• Clinical literature database
• Curated at Johns Hopkins Univ
• Links human genes and genetic disorders to human disease
• Lists allelic variants that have clinical consequences

Variations in SNP are not necessarily in OMIM, and vice versa!

Linking to SNP

Links to SNP are also available from Nucleotide and Protein

Entrez Gene - TPO

Entrez SNP

primary data: ss#

SNP UID: rs#

Find Non-synonymous SNPs

#7 AND coding nonsynon[Function Class]

Function Class

Non-synonymous TPO SNPs

Link to Map Viewer

View all SNPs in locus

Link to related 3D structures

GeneView in dbSNP

Links to OMIM

Entrez Gene - TPO

OMIM Record

Explore a Disease SNP

799

Curated CD Record

Launch Cn3D


CDTree

Cn3D

E799
