NCBI FieldGuide: NCBI Molecular Biology Resources, January 12, 2007. A Field Guide, Part 1
NCBI Molecular Biology Resources
January 12, 2007
A Field Guide, Part 1
• The NCBI Entrez System
• NCBI Sequence Databases
– Primary data: GenBank
– Derivative data: RefSeq, Gene
• Protein Structure and Function
• Sequence polymorphisms and phenotypes
** Intermission **
• NCBI Genomic Resources
• BLAST
NCBI Resources
The National Center for Biotechnology Information
Created in 1988 as a part of the National Library of Medicine at NIH
– national resource for molecular biology information (biological information direct from organisms)
– gather data both nationally and internationally
– develop new information technologies to aid in the understanding of fundamental molecular and genetic processes that control health and disease
Bethesda, MD
Data sources: traditional literature and data obtained from the direct study of organisms
The information landscape in biological and medical research has grown far beyond literature to include a wide variety of databases generated by research fields such as molecular biology and genomics.
Figure 1 from Geer RC. Broad issues to consider for library involvement in bioinformatics. J Med Libr Assoc. 2006 Jul; 94(3):286–98, E-152–5. PMID: 16888662
NCBI:
– accepts submissions of bibliographic records (example) and primary research data (example: nucleotide sequence for colon cancer gene, MLH1)
– organizes the information into databases, maintains them, and makes them available to the world
– develops software to retrieve and analyze the data
– conducts basic research to make new biological discoveries using the databases and software tools
What does NCBI do?
• NCBI accepts submissions of primary data
• NCBI develops tools to analyze these data
• NCBI uses these tools to create derivative databases based on the primary data
• NCBI provides free search, link, and retrieval of these data, primarily through the Entrez system
Web Access: www.ncbi.nlm.nih.gov
[Diagram: a query by text, sequence, protein structure, or small molecule structure is served by Entrez, BLAST, VAST, and PubChem, respectively.]
The NCBI ftp site
30,000 files per day
620 Gigabytes per day
NCBI Toolbox: In-house source code that programmers can use to incorporate NCBI-like functionality into their own programs. Three main parts: Data Model, Data Encoding and Programming Libraries.
• Examples: BLAST, Cn3D, Sequin, Data format conversion scripts
http://www.ncbi.nlm.nih.gov/IEB/ToolBox/index.cgi
Help for Programmers
http://www.ncbi.nih.gov/entrez/query/static/eutils_help.html
E-Utilities: Guidelines for Entrez “URL calls” used to access data. Designed for use in scripts.
• Examples: ESearch, EPost, ESummary, EFetch and ELink
Caution: Overuse may result in blocked IPs!
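As a rough illustration of what an Entrez "URL call" looks like, the sketch below only builds an ESearch URL and does not send it. The base URL is an assumption taken from current NCBI practice rather than from these slides, and the helper name `esearch_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL for the E-Utilities; check the current NCBI documentation.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch URL for an Entrez text query."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Query text borrowed from the search examples later in this deck.
url = esearch_url("nucleotide", "human[orgn] AND thyroid peroxidase[title]")
print(url)
```

Building the URL separately from sending it makes it easy to add a delay between requests in a script, which matters given the caution about blocked IPs.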
Global Entrez Search Page
All[Filter]
What is Entrez?
• A system of 31 linked databases
• A text search engine
• A tool for finding biologically linked data
• A retrieval engine
• A virtual workspace for manipulating large datasets
Entrez Databases
• Each record is assigned a UID
– unique integer identifier for internal tracking
– GI number for Nucleotide
• Each record is given a Document Summary
– a summary of the record’s content (DocSum)
• Each record is assigned links to biologically related UIDs
• Each record is indexed by data fields
– [author], [title], [organism], and many others
Linking in Entrez
Follow links to related data in the same database or in others!
Links
Hard Links: Curated links based on biology
• nucleotide → taxonomy (based on organism identifier)
• protein → domain relatives (based on domain assignment)
• domains → pubmed (based on supporting literature)
• pcsubstance → structures/mmdb (based on source information)
Soft Links: Pre-computed analyses
• nucleotide → related sequences (BLAST neighbors)
• protein → conserved domains (CDD/RPS-BLAST search)
• pccompound → pccompound (structure-based neighboring)
Entrez: Database Integration
[Diagram: PubMed abstracts, nucleotide sequences, protein sequences, 3-D structures, genomes, and taxonomy are connected by hard links and by computed neighbors: word weight for PubMed abstracts, BLAST for related sequences (plus BLink and Domains for proteins), VAST for related structures, and phylogeny for taxonomy.]
Links: Database Integration at NCBI
[Table: link matrix among Gene, Nucleotide, Protein, Structure, CDD, SNP, Taxonomy, PubMed, and HomoloGene. Each cell names what a link from one database retrieves in another, for example gene loci, mRNAs and cDNA transcripts, CDS products, 3D structures and templates, protein function (conserved domains), SNPs and indels, source organism, and literature. Links within a single database are computed neighbors: BLASTn (Nucleotide), BLASTp (Protein), VAST (Structure), CDART (CDD), Common Tree (Taxonomy), and Related articles (PubMed).]
Types of Databases
• Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
– Examples: GenBank, dbSNP, GEO, PubChem Substance and PubChem BioAssay
• Derivative Databases
– Built from primary data
– Content controlled by third party (NCBI)
– Examples: RefSeq, RefSNP, GEO Datasets, PubChem Compound
An Entrez Database - Nucleotide
• GenBank: Primary Data (98.2%)
– original submissions by experimentalists
– submitters retain editorial control of records
– archival in nature
• RefSeq: Derivative Data (1.8%)
– curated by NCBI staff
– NCBI retains editorial control of records
– record content is updated continually
Literature Databases
NM_000249: PubMed
Books
Books Link
A part of the NCBI Bookshelf
Part 1. The Databases
Part 2. Data Flow and Processing
Part 3. Querying and Linking the Data
Part 4. User Support
PubMed Central
PubMed Central is a digital archive of life sciences journal literature. Integrated into the Entrez retrieval system, PMC provides free and unrestricted access to the full text of over 160 life sciences journals, with more to come.
NCBI Journal Database
Detailed journal information
OMIM
- A catalogue of genes involved with human disease processes
- Detailed clinical and reference information
- Curated and maintained by Johns Hopkins
- Links to PubMed and sequence databases
Primary vs. Derivative Databases
[Diagram: sequencing centers and labs submit sequences to GenBank, which is updated ONLY by submitters. GenBank divisions shown: EST, STS, GSS, HTG, PRI, ROD, PLN, MAM, BCT, INV, VRT, PHG, VRL. Curators and algorithms (the RefSeq Gene and Genomes pipelines and the RefSeq annotation pipeline) build derivative resources such as UniGene, UniSTS, and RefSeq, which are updated continually by NCBI.]
What is GenBank? NCBI’s Primary Sequence Database
• Nucleotide only sequence database
• Archival in nature
• Each record is assigned a stable accession number
• GenBank Data
– Direct submissions (traditional records)
– Batch submissions (EST, GSS, STS)
– ftp accounts (genome data)
• Three collaborating databases
– GenBank
– DNA Database of Japan (DDBJ)
– European Molecular Biology Laboratory (EMBL) Database
The International Sequence Database Collaboration
[Diagram: GenBank (NCBI, NIH), DDBJ (CIB, NIG), and EMBL (EBI) each accept submissions and updates and exchange data with one another. GenBank submissions arrive through Sequin, BankIt, and ftp. Records are retrieved through Entrez (NCBI), getentry (DDBJ), and SRS (EBI).]
GenBank Releases
• full release every two months
• incremental and cumulative updates daily
• available only through internet
ftp://ftp.ncbi.nih.gov/genbank/
Release 156 (non-WGS), October 2006
62765195 Records
66925938907 Nucleotides
>150,000 Species
245 Gigabytes
1032 files
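The Release 156 figures give a feel for the typical record size. This is a back-of-the-envelope sketch using only the two counts quoted on the slide; the variable names are illustrative.

```python
# Counts quoted for GenBank Release 156 (non-WGS), October 2006.
records = 62765195
nucleotides = 66925938907

# Mean sequence length per record, in bases.
average_length = nucleotides / records
print(f"about {average_length:.0f} bases per record on average")
```

The mean is pulled down by the many short bulk records (EST, GSS, STS), so it says little about the length of a typical traditional record.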
The Growth of GenBank
Non-WGS: 59.8 billion bases
WGS: 63.2 billion bases
Release 152
GenBank Divisions
Traditional
• Direct Submissions (Sequin/BankIt)
• Accurate (~1 error per 10,000 bp)
• Well characterized
• Organized by taxonomy
PRI Primate
ROD Rodent
PLN Plant and Fungal
BCT Bacterial/Archaeal
VRT Other Vertebrate
INV Invertebrate
VRL Viral
MAM Mammalian
PHG Phage
SYN Synthetic
UNA Unannotated
Bulk
• From sequencing projects
• Batch submissions (ftp/email)
• Inaccurate
• Poorly characterized
• Organized by sequence type
EST Expressed Sequence Tag
GSS Genome Survey Sequence
HTG High Throughput Genomic
PAT Patent sequences
STS Sequence Tagged Site
HTC High Throughput cDNA
CON Constructed entries
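The traditional/bulk split can be captured in a small lookup table. This is a minimal sketch built only from the division codes listed on this slide; the function name `division_class` is hypothetical.

```python
# Division codes as listed on the slide, grouped by how the data arrive.
TRADITIONAL = {
    "PRI": "Primate", "ROD": "Rodent", "PLN": "Plant and Fungal",
    "BCT": "Bacterial/Archaeal", "VRT": "Other Vertebrate",
    "INV": "Invertebrate", "VRL": "Viral", "MAM": "Mammalian",
    "PHG": "Phage", "SYN": "Synthetic", "UNA": "Unannotated",
}
BULK = {
    "EST": "Expressed Sequence Tag", "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic", "PAT": "Patent sequences",
    "STS": "Sequence Tagged Site", "HTC": "High Throughput cDNA",
    "CON": "Constructed entries",
}


def division_class(code):
    """Return ('traditional' | 'bulk', full division name) for a division code."""
    code = code.upper()
    if code in TRADITIONAL:
        return "traditional", TRADITIONAL[code]
    if code in BULK:
        return "bulk", BULK[code]
    raise KeyError(f"unknown GenBank division: {code}")


print(division_class("PLN"))
print(division_class("est"))
```

A filter like this is one way to decide how much to trust a hit: traditional divisions are organized by taxonomy and well characterized, while bulk divisions are organized by sequence type.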
Entrez Nucleotide Subsets
CoreNucleotide 29225247
EST 39288168
GSS 15655087
TOTAL 84168502
A Traditional GenBank Record
The Flatfile Format

Header
LOCUS       AY182241     1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA, complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids;
            eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab, USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab, USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.

Feature Table
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
                     LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
                     ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
                     EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
                     KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"

Sequence
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
     1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
     1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
     1921 aaaaaaaaaa a
//
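The header of a flat file is line-oriented, so its first line can be picked apart with plain string handling. The sketch below assumes the eight-token LOCUS layout of the AY182241 example; other records may use a different layout, and the function name `parse_locus` is hypothetical.

```python
def parse_locus(line):
    """Split a LOCUS line shaped like the AY182241 example into named fields."""
    fields = line.split()
    # Assumed layout: LOCUS name length "bp" molecule topology division date
    if len(fields) != 8 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError("LOCUS line does not match the assumed layout")
    return {
        "name": fields[1],
        "length": int(fields[2]),
        "molecule": fields[4],
        "topology": fields[5],
        "division": fields[6],
        "date": fields[7],
    }


locus = parse_locus("LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004")
print(locus["name"], locus["length"], locus["division"])
```

For real work an established parser is a safer choice than hand-written splitting, since the flat file format has more variants than this one example shows.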
An Example Record – M17755
Indexing for Nucleotide UID 4680720
Field Indexed Terms
[primary accession] M17755
[title] Homo sapiens thyroid peroxidase (TPO) mRNA…
[organism] Homo sapiens
[sequence length] 3060
[modification date] 1999/04/26
[properties] biomol mrna
[properties] gbdiv pri
[properties] srcdb genbank
M17755: Feature Table
CDS position in bp
TPO [gene name]
thyroiditis[text word]
thyroid peroxidase[protein name]
protein accession
Sequence: 99.99% Accurate
The sequence itself is not indexed…
Use BLAST for that!
Entrez Protein
• GenPept (DDBJ, EMBL, GenBank) 6259705
• RefSeq 2997502
• Swiss-Prot 236666
• PDB 86934
• PIR 30413
• PRF 12079
• Third Party Annotation 4969
Total 9628271
Protein Sources and Links
PIR
RefSeq
SWISS-PROT
GenPept
NM_000547
M17755
no mRNA!
no mRNA!
Sequence Revisions
Version and GI change only if the sequence changes
The accession number always retrieves the most recent version
First seen at NCBI, not first seen at GenBank!
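The accession, version, and GI all sit on the VERSION line of a flat file. The sketch below splits that line using the AY182241 example shown earlier; it assumes the three-token layout of that example, and the function name `parse_version` is hypothetical.

```python
def parse_version(line):
    """Return (accession, version, gi) from a line like 'VERSION AY182241.2 GI:32265057'."""
    keyword, accession_version, gi_field = line.split()
    if keyword != "VERSION" or not gi_field.startswith("GI:"):
        raise ValueError("VERSION line does not match the assumed layout")
    accession, version = accession_version.rsplit(".", 1)
    return accession, int(version), int(gi_field.split(":", 1)[1])


accession, version, gi = parse_version("VERSION AY182241.2 GI:32265057")
print(accession, version, gi)
```

Storing the full accession.version (or the GI) rather than the bare accession pins an analysis to one exact sequence, because the bare accession always retrieves the most recent version.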
Update without a Sequence Change
June 15, 1989!
GenBank came to NCBI in 1992!
Update with a Sequence Change
GenBank File Formats
ASN.1 – The Raw Data
XML
FASTA
flat file
/*************************************************************************
*   asn2ff.c
*   convert an ASN.1 entry to flat file format, using the FFPrintArray.
**************************************************************************/
#include <accentr.h>
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include <subutil.h>
#include <objall.h>
#include <objcode.h>
#include <lsqfetch.h>
#include <explore.h>

#ifdef ENABLE_ID1
#include <accid1.h>
#endif

FILE *fpl;

Args myargs[] = {
    {"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
    {"Input is a Seq-entry","F", NULL ,NULL ,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
    {"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
    {"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
    {"Show Sequence?","T", NULL ,NULL ,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},
NCBI Toolbox
Toolbox Sources
ftp> open ftp.ncbi.nih.gov
..
ftp> cd toolbox
ftp> cd ncbi_tools
ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools
Text Queries in Entrez
term1[limit] OP term2[limit] OP …
where
limit = Entrez indexing field (organism, author, …)
OP = Boolean operator = AND, OR, NOT
Complex queries: ((A[limit1] OR B[limit2]) AND C[limit3]) NOT D[limit4]
Ranges: 1:200[MW]
Wildcards: cancer[title] vs. cancer*[title]
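The query grammar above composes mechanically, which makes it easy to generate from a script. This is a minimal sketch; the helper names `term`, `or_`, `and_`, and `not_` are hypothetical and only assemble strings in the syntax shown on the slide.

```python
def term(text, limit=None):
    """Attach an Entrez field limit to a search term, e.g. human[organism]."""
    return f"{text}[{limit}]" if limit else text


def or_(*parts):
    return "(" + " OR ".join(parts) + ")"


def and_(*parts):
    return "(" + " AND ".join(parts) + ")"


def not_(keep, drop):
    return f"{keep} NOT {drop}"


# Rebuild the complex-query pattern from the slide.
query = not_(
    and_(or_(term("A", "limit1"), term("B", "limit2")), term("C", "limit3")),
    term("D", "limit4"),
)
print(query)
```

Grouping every OR and AND in parentheses avoids depending on how Entrez orders Boolean operators when none are given.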
Entrez Tabs
Limits: Provides a simple form for applying commonly used Entrez limits
Preview/Index: Allows access to the full indexing of each Entrez database and aids in constructing complex queries
History: Provides access to previous searches in the current Entrez database
Clipboard: A temporary storage area for selected records
Details: Displays the detailed parsing of the current Entrez query, and lists errors and terms without matches
Programming Entrez: E-Utilities
http://www.ncbi.nih.gov/entrez/query/static/eutils_help.html
ESearch: Entrez query → UID list or History
EPost: UID list → History
ESummary: UID list or History → Document summaries
EFetch: UID list or History → Formatted data
ELink: UID list or History → UID list or History
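The History route between ESearch and EFetch can be sketched offline. The XML below is a hand-written, hypothetical stand-in for an ESearch reply (the `WebEnv` value is a placeholder), and the base URL and helper names are assumptions rather than anything stated on the slides.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed base URL for the E-Utilities; check the current NCBI documentation.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Hypothetical ESearch reply, trimmed to the elements used here.
SAMPLE_ESEARCH_XML = """<eSearchResult>
  <Count>262</Count>
  <QueryKey>1</QueryKey>
  <WebEnv>HYPOTHETICAL_WEBENV</WebEnv>
</eSearchResult>"""


def history_from_esearch(xml_text):
    """Pull the History handle (query key, web environment) out of an ESearch reply."""
    root = ET.fromstring(xml_text)
    return root.findtext("QueryKey"), root.findtext("WebEnv")


def efetch_url(db, query_key, webenv, rettype="gb", retmode="text"):
    """Build (but do not send) an EFetch URL that reads its UIDs from History."""
    params = {"db": db, "query_key": query_key, "WebEnv": webenv,
              "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


query_key, webenv = history_from_esearch(SAMPLE_ESEARCH_XML)
fetch_url = efetch_url("nucleotide", query_key, webenv)
print(fetch_url)
```

Passing a History handle instead of a UID list keeps the URLs short, which helps when a search returns many records.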
Finding Primary Sequences
• Search Entrez CoreNucleotide
– 94.8% GenBank (primary data)
– 5.2% RefSeq (curated data)
Possible queries we’ve seen so far…
M17755 [primary accession]
TPO [gene name]
thyroid peroxidase [title]
thyroiditis [text word]
Homo sapiens [organism]
thyroid peroxidase [protein name]
3060 [sequence length]
1999/04/26 [modification date]
biomol mrna [properties]
gbdiv pri [properties]
srcdb genbank [properties]
A Starting Query
Find nucleotide records for human thyroid peroxidase
Query: human thyroid peroxidase
Parsed as: (("Homo sapiens"[Organism] OR human[All Fields]) AND thyroid peroxidase[All Fields])
276 records
Query with a field limit: human[organism] AND thyroid peroxidase
Parsed as: ("Homo sapiens"[Organism] AND thyroid peroxidase[All Fields])
262 records
14 records aren’t human sequences!!
Limit by Title and Database
#1: thyroid peroxidase AND human[orgn] 262
#2: thyroid peroxidase[title] AND human[orgn] 55
#3: #2 AND srcdb refseq[properties] 5
#4: #2 AND srcdb ddbj/embl/genbank[properties] 50
Entrez Nucleotide
GenBank srcdb ddbj/embl/genbank[properties]
RefSeq srcdb refseq[properties]
primary data
Limit by Biomolecule Type
Genomic DNA biomol genomic[prop]
cDNA biomol mrna[prop]
#1: thyroid peroxidase AND human[orgn] 262
#2: thyroid peroxidase[title] AND human[orgn] 55
#3: #2 AND srcdb refseq[properties] 5
#4: #2 AND srcdb ddbj/embl/genbank[properties] 50
#5: #4 AND biomol genomic[prop] 26
#6: #4 AND biomol mrna[prop] 24
mRNA / cDNA
genomic DNA
Limit by Protein Name
thyroid peroxidase[protein name] AND human[orgn] AND gbdiv pri[prop] AND biomol mrna[prop]
24 records [title]
5 records [protein name]
Entrez Document Summaries
Click the accession to view the record
Links menu
Links to other Entrez databases computed for M17755
Viewing M17755
GenBank Sequences for Human TPO
Which one is the best sequence???
RefSeq: NCBI’s Derivative Sequence Database
RefSeq Benefits
• Non-redundant
• Explicitly linked nucleotide and protein sequences
• Updated to reflect current sequence data and biology
• Validated by hand
• Format consistency
• Distinct accession series
• Stewardship by NCBI staff and collaborators
ftp://ftp.ncbi.nih.gov/refseq/release
RefSeq: NCBI’s Derivative Sequence Database
Accession formats (nucleotide first, then the linked protein where one exists)
• Curated transcripts and proteins
– NM_123456, NP_123456
– NR_123456 (non-coding RNA)
• Model transcripts and proteins
– XM_123456, XP_123456
– XR_123456 (non-coding RNA)
• Assembled Genomic Regions (contigs)
– NT_123456 (BAC clones)
– NW_123456 (WGS)
• Other Genomic Sequence
– NG_123456 (complex regions, pseudogenes)
– NZ_ABCD12345678 (WGS), ZP_123456
• Chromosome records in Entrez Genome
– NC_123456 (chromosome; microbial or organelle genome)
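Because RefSeq uses a distinct accession series, the prefix alone tells a script what kind of record it has. The sketch below encodes only the prefixes listed on this slide; the table and function names are hypothetical.

```python
# Prefix -> (description, molecule type), limited to the prefixes on the slide.
REFSEQ_PREFIXES = {
    "NM": ("curated transcript", "nucleotide"),
    "NP": ("curated protein", "protein"),
    "NR": ("curated non-coding RNA", "nucleotide"),
    "XM": ("model transcript", "nucleotide"),
    "XP": ("model protein", "protein"),
    "XR": ("model non-coding RNA", "nucleotide"),
    "NT": ("assembled genomic contig (BAC clones)", "nucleotide"),
    "NW": ("assembled genomic contig (WGS)", "nucleotide"),
    "NG": ("other genomic sequence", "nucleotide"),
    "NZ": ("other genomic sequence (WGS)", "nucleotide"),
    "ZP": ("protein on an NZ_ genomic record", "protein"),
    "NC": ("chromosome; microbial or organelle genome", "nucleotide"),
}


def classify_refseq(accession):
    """Return (description, molecule type) for a RefSeq accession, else None."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())


print(classify_refseq("NM_000547"))
print(classify_refseq("M17755"))
```

The underscore is the quick visual test as well: GenBank accessions such as M17755 have none, so the function returns `None` for them.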
NM/NP Records in Entrez
COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence was derived from M17755.2 and AW874082.1. On Feb 25, 2003 this sequence version replaced gi:21361188.
NM_000547: variant 1
COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence was derived from J02970.1, AW874082.1 and M17755.2.
NM_175719: variant 2
EST that completes 3’ end
Nucleotide
Protein
Annotating the Gene
Genomic DNA (NC, NT, NW)
Model mRNA (XM) (XR)
Curated mRNA (NM) (NR)
Model protein (XP)
Curated Protein (NP)
Scanning....
= ? = !
GenBank Sequences
RefSeq
The Perils of the XM
XM records are models based only on genomic sequence, and are subject to revision or removal with each new build of that genome.
BLAST the XM against the RefSeq database to look for a replacement:
Query= gi|20850420|ref|XM_124429.1| Mus musculus expressed sequence AA553001 (AA553001), mRNA
gi|19527087|ref|NM_133873.1| Mus musculus DNA segment, Chr 4, Wayne State University 114, expressed (D4Wsu114e), mRNA
Length=1898
Score = 3701.55 bits (1867), Expect = 0
Identities = 1870/1871 (99%), Gaps = 0/1871 (0%)
Strand=Plus/Plus
Entrez Gene and RefSeq
• Entrez Gene is the central depository for information about a gene available at NCBI, and often provides links to sites beyond NCBI
• Entrez Gene includes records for organisms that have NCBI Reference Sequences (RefSeqs)
• Entrez Gene records contain RefSeq mRNAs, proteins, and genomic DNA (if known) for a gene locus, plus links to other Entrez databases
• NCBI RefSeqs are based on primary sequence data in GenBank
GenBank RefSeq Gene
Nucleotide
Entrez Gene: RefSeq Annotations
NM/NP Records in Entrez Gene
Entrez Gene RefSeq Graphics
NM NP
Getting the Annotation Details
Genomic sequence
ACCESSION NC_000002 REGION: 1396242..1525502
Genome Annotation in Entrez Nucleotide
[Diagram labels: GenBank components (clones, WGS); NT/NW contigs; NC; assembly; components; genome; master; NM/XM mRNA.]
Genome Annotation Links
curated mRNA
genomic contig on chromosome 2 transcribing NM_000547
human chromosome 2
the 18 contigs of the chromosome 2 assembly
Searching Entrez Gene
RefSeq status and variants: Reviewed RefSeqs with transcript variants
srcdb refseq reviewed[prop] AND has transcript variants[prop]
Gene symbol: human thyroid peroxidase (TPO)
tpo [sym] AND human [organism]
Disease and Gene Ontology: Membrane proteins linked to cancer
integral to plasma membrane[gene ontology] AND cancer [dis]
Chromosome and Links: genes on human chromosome 2 with OMIM links
2 [chromosome] AND gene omim [filter] AND human [organism]
Protein name: topoisomerase genes from Archaea
topoisomerase[gene/protein name] AND archaea [organism]
Third Party Annotation (TPA) Database
NCBI now accepts the submission of new annotations of existing GenBank sequences.
• Submissions must be published in a peer-reviewed journal.
• Facilitates the annotation of sequences by experts.
Examples of sequences appropriate for TPA are:
• Annotation of features on gene and/or mRNA sequences
• Assembled “full length” genes and/or mRNAs
What should not be submitted to TPA?
• Synthetic constructs (such as cloning vectors) that use well-characterized, publicly available genes, promoters, or terminators
• Updates or changes to existing sequence data
• Sequence annotations without experimental evidence
Linking Protein Sequence, Structure, and Function
Conserved Domains (CDD): sequence → function (pfam, smart); sequence → structure + function (cd)
Structure (MMDB): sequence → structure
VAST: structure → structure
Protein: sequence → sequence
Entrez Structure
MMDB: Molecular Modeling Data Base
• Derived from experimentally determined PDB records
• Add value to PDB records by:
– Adding explicit chemical bonding information
– Validating and indexing the sequences
– Annotating 3D domains and secondary structure
– Adding links to CDD, Taxonomy, PubMed
– Converting PDB data to ASN.1
• Structure neighbors determined by Vector Alignment Search Tool (VAST)
Structure Summary Page
Conserved Domains
VAST Neighbors for chain C (domain 0)
Cn3D
VAST Neighbors for domain 2
Related Structures
VAST: Structure Neighbors
Vector Alignment Search Tool
For each 3D domain, locate SSEs (secondary structure elements), and represent them as individual vectors.
[Figure: Human IL-4 with its secondary structure elements numbered 1 to 6.]
VAST uses 3D Domains only! Whole polypeptides are assigned 3D domain 0 (zero).
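To make "represent them as individual vectors" concrete, the sketch below turns the two ends of a secondary structure element into a vector and compares two such vectors by length and angle. This is an illustration only, not VAST's actual algorithm; the coordinates are invented and the function names are hypothetical.

```python
import math


def sse_vector(start, end):
    """Vector from the first to the last point of a secondary structure element."""
    return tuple(e - s for s, e in zip(start, end))


def length(v):
    return math.sqrt(sum(c * c for c in v))


def angle_between(u, v):
    """Angle between two SSE vectors, in degrees."""
    cosine = sum(a * b for a, b in zip(u, v)) / (length(u) * length(v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


# Invented end-point coordinates (in angstroms) for two elements.
helix = sse_vector((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
strand = sse_vector((1.0, 1.0, 1.0), (1.0, 1.0, 6.0))
print(length(helix), angle_between(helix, strand))
```

Reducing each element to a vector discards atomic detail, which is why comparisons of this kind can be fast enough to run across a whole structure database.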
VAST Neighbors
1D2V
1D2V
1Q4G
3D domains!
Cn3D
Submitting a PDB File to VAST
• Redesigned interface!
• This is the best way to convert PDB into MMDB format!
New!
Structure + Function
VAST finds proteins that have similar 3D folds
CD-Search finds proteins that have similar sequences and similar functions
Curated CDs = VAST + CD-Search
Proteins that have similar 3D folds,
similar sequences and similar functions
Protein Links: Domains
Click on a colored bar to align your sequence to the CD
CDD Record – heme peroxidases
aligned query
red = high conservation
blue = low conservation
Curated CD Record - EGF
Annotated features
Launch Cn3D
Launch CDTree: phylogenetic tree of aligned sequences
New
Cn3D
Entrez PubChem
PC Substance: Primary database of chemical samples
PC Compound: Derived database of known chemicals from PC Substance records
PC BioAssay: Primary database of bioactivity screens of samples in PC Substance
Links from Structure
N-acetylglucosamine
heme
mannose
fucose
Sequence Polymorphisms
SNP: General Polymorphisms
• Primary database of submitted SNPs
• Curated database of reference SNPs
• Contains more than just SNPs:
– True SNPs
– MNP (multiple nucleotide)
– Insertions
– Deletions
– Microsatellites
– Mixed
– No variation (constant)
OMIM: Human Phenotypes
• Clinical literature database
• Curated at Johns Hopkins Univ
• Links human genes and genetic disorders to human disease
• Lists allelic variants that have clinical consequences
Variations in SNP are not necessarily in OMIM, and vice versa!
Linking to SNP
Links to SNP are also available from Nucleotide and Protein
Entrez Gene - TPO
Entrez SNP
primary data: ss#
SNP UID: rs#
Find Non-synonymous SNPs
#7 AND coding nonsynon[Function Class]
Function Class
Non-synonymous TPO SNPs
Link to Map Viewer
View all SNPs in locus
Link to related 3D structures
GeneView in dbSNP
Links to OMIM
Entrez Gene - TPO
OMIM Record
Explore a Disease SNP
799
Curated CD Record
Launch Cn3D
Launch CDTree: phylogenetic tree of aligned sequences
Cn3D
E799