national center for biotechnology information

110101

National Center for Biotechnology Information (NCBI)

• Created by Public Law 100-607 in 1988 as part of the National Library of Medicine at NIH to:

• Create automated systems for knowledge about molecular biology, biochemistry, and genetics.

• Perform research into advanced methods of analyzing and interpreting molecular biology data.

• Enable biotechnology researchers and medical care personnel to use the systems and methods developed.

• Builders and providers of GenBank, Entrez, BLAST, and PubMed. Online systems serve about 1.8 million users per day, with peak rates of 3,200 web hits a second.

• Center for basic research and training in computational biology.

NCBI is the most heavily used site in biomedicine. Why?

[Chart: NCBI Web Traffic, 1997-2006; vertical axis from 100,000 to 700,000, horizontal axis from January 1998 to January 2006]

722,000 Unique IPs a Day

91 Million Web Hits a Day

3200 Peak Web Hits a Second

1.5 Terabytes FTP a Day

1.8 Million Unique Users a Day

Data, the Next Intel Inside

[Chart: Growth of Searches and GenBank, 1991-2003; series: GenBank (Megabases) and Searches/Day (BLAST & Text)]

Comparative Analysis of Genes Enables “Innovation in Assembly”

Human  638 RHACVEVQDEIAFIPNDVYFEKDKQMFHIITGPNMGGKSTYIRQTGVIVLMAQIGCFVPC 697
Yeast  657 RHPVLEMQDDISFISNDVTLESGKGDFLIITGPNMGGKSTYIRQVGVISLMAQIGCFVPC 716
E.coli 584 RHPVVEQVLNEPFIANPLNLSPQRR-MLIITGPNMGGKSTYMRQTALIALMAYIGSYVPA 642

Colon cancer gene sequence
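
The comparison on this slide can be made concrete with a small sketch. The three aligned segments are copied from the slide; the helper names `percent_identity` and `conserved_runs` are illustrative assumptions, and the scoring rule (identical residues over gap-free columns) is one simple choice among many rather than the method NCBI tools use.

```python
# Aligned segments copied from the slide; "-" marks an alignment gap.
ALIGNED = {
    "Human": "RHACVEVQDEIAFIPNDVYFEKDKQMFHIITGPNMGGKSTYIRQTGVIVLMAQIGCFVPC",
    "Yeast": "RHPVLEMQDDISFISNDVTLESGKGDFLIITGPNMGGKSTYIRQVGVISLMAQIGCFVPC",
    "E.coli": "RHPVVEQVLNEPFIANPLNLSPQRR-MLIITGPNMGGKSTYMRQTALIALMAYIGSYVPA",
}


def percent_identity(a: str, b: str) -> float:
    """Percentage of gap-free aligned columns that hold the same residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def conserved_runs(seqs, min_len=5):
    """Return (start, text) for runs of columns identical in every sequence."""
    runs, start = [], None
    first = seqs[0]
    # Iterate one past the end so a run that reaches the last column is closed.
    for i in range(len(first) + 1):
        same = (
            i < len(first)
            and first[i] != "-"
            and all(s[i] == first[i] for s in seqs)
        )
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start >= min_len:
                runs.append((start, first[start:i]))
            start = None
    return runs


if __name__ == "__main__":
    names = list(ALIGNED)
    for i, n1 in enumerate(names):
        for n2 in names[i + 1:]:
            pid = percent_identity(ALIGNED[n1], ALIGNED[n2])
            print(f"{n1} vs {n2}: {pid:.1f}% identity")
    print(conserved_runs(list(ALIGNED.values())))
```

Run on these segments, the sketch reports pairwise identity and the block of residues shared by all three organisms, which is the kind of conservation the slide points to.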

[Figure: evolutionary tree with time markers at 3000 Myr, 1000 Myr, and 500 Myr; taxa: Human, Mouse, Fly, Worm, Yeast, Bacteria]

Ignoring the Central Dogma in Bioinformatics is Evidence of “Stupid Design”

[Diagram labels: Gene, Structure, Mature Peptide, ProPeptide, mRNA, Transcript, Chromosome, Genetics, Genomes, Organisms, Function, Disease]

It Guides “Innovative Assembly” of Separate Resources

GenBank

RefSeq

Human Genome

Bacterial Genome

Virus Genome

MMDB

PubMed

UniGene(s)

LocusLink

OMIM

Taxonomy

GEO

PopSet

BLAST

Entrez

ePCR

Sequin

[Diagram labels: Gene, Structure, Mature Peptide, ProPeptide, mRNA, Transcript, Chromosome, Genetics, Genomes, Organisms, Function, Disease]

Entrez: Pathway to Discovery

[Diagram: databases and the links between them]

Databases: MEDLINE abstracts; Nucleotide sequences; Protein sequences

Links: Amino acid sequence similarity; Coding region features; Nucleotide sequence similarity; Term frequency statistics; Literature citations in sequence databases

Entrez Increases Discovery Space

[Diagram labels: Nucleotide sequences; Protein sequences; Taxon; Phylogeny; 3-D Structure; MMDB; PubMed abstracts; Complete Genomes; PubMed; Entrez; Genomes; Publishers; Genome Centers]

Entrez is Intrinsically Components

NCBI C++ Toolkit enforces common modules in internal pipelines, external applications, and web components.

Entrez has a common model for Booleans and Summaries, and unique models for deep data.

New projects can be easily added or extended.

Long-standing use of the “productotype” keeps NCBI agile, but (fairly) robust.

Web Services Provide Access to Entrez

Eutils supports about 5 million service requests a day

SOAP versions support about 38,000 service requests a day (0.8%), similar to Amazon's experience with REST and SOAP

Eutils allows outside sites to recreate Entrez, and NCBI does not know who is doing so or why

Current NCBI Sequence Viewer uses Eutils itself
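
As an illustration of the REST-style access described above, here is a minimal sketch that only builds request URLs and makes no network call. The base URL and the `esearch.fcgi` / `efetch.fcgi` endpoints with `db`, `term`, `retmax`, `id`, `rettype`, and `retmode` parameters reflect the public E-utilities interface as currently documented, not anything stated on the slide; the helper names `esearch_url` and `efetch_url` are assumptions made for this example.

```python
from urllib.parse import urlencode

# Assumed current public base URL for the Entrez E-utilities (not from the slide).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL that runs an Entrez Boolean query against one database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def efetch_url(db: str, ids, rettype: str = "abstract", retmode: str = "xml") -> str:
    """Build an EFetch URL that retrieves records for a list of identifiers."""
    query = urlencode(
        {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


if __name__ == "__main__":
    # The same Boolean syntax used in the Entrez web interface goes in `term`.
    print(esearch_url("pubmed", "influenza AND neuraminidase"))
    print(efetch_url("pubmed", ["1", "2"]))
```

An outside site could pass such URLs to any HTTP client and render the returned XML itself, which is how a third party can rebuild an Entrez-like front end.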

Harnessing Collective Intelligence in BioMedicine

Bibliographic Resources

PubMed – Citations and Abstracts from publishers; MEDLINE indexing

PMC – PubMed Central, full text journal articles from publishers (and NIHMS).

pPMC – portable mirror of PMC content

NIHMS – NIH Manuscript Submission System for Public Access policy

NLM DTD – Modular DTD for bibliographic material

pNIHMS – portable NIHMS

XML Authoring System – MS Word/XML authoring

Bookshelf – Books and monographs in XML from publishers and authors.

PubMed Central XML

Why XML?

• Preserves structure of an article

• Lends itself to intelligent processing

• Human readable – not dependent on technology

• Is based on SGML, a publishing industry standard

• Portable and migratable
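
To show what "preserves structure" and "intelligent processing" can mean in practice, the sketch below parses a tiny hand-written article and pulls out its title and section outline. The sample document is hypothetical and heavily simplified; its element names follow the style of the NLM journal article tag set, but it has not been validated against any DTD, and the helper name `outline` is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified article; not validated against the NLM DTD.
SAMPLE = """<article>
  <front>
    <journal-meta><journal-title>Example Journal</journal-title></journal-meta>
    <article-meta>
      <title-group><article-title>A Hypothetical Article</article-title></title-group>
      <abstract><p>Short abstract text.</p></abstract>
    </article-meta>
  </front>
  <body>
    <sec><title>Introduction</title><p>First paragraph.</p></sec>
    <sec><title>Methods</title><p>Second paragraph.</p><p>Third paragraph.</p></sec>
  </body>
</article>"""


def outline(xml_text: str):
    """Return the article title and a (section title, paragraph count) list."""
    root = ET.fromstring(xml_text)
    title = root.findtext("front/article-meta/title-group/article-title")
    sections = [
        (sec.findtext("title"), len(sec.findall("p")))
        for sec in root.findall("body/sec")
    ]
    return title, sections


if __name__ == "__main__":
    print(outline(SAMPLE))
```

Because the markup records what each piece of text is rather than how it looks, the same file can be searched, re-rendered, or migrated without parsing layout.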

PMC2

Content is converted to a standard XML format on ingest and then stored and rendered from the one format.

But what format?

Harvard E-journal Archiving Project

The Mellon Foundation funded the Harvard Library to study the feasibility of using one DTD for archiving journal articles.

Harvard commissioned Inera, Inc. for the E-Journal Archive DTD Feasibility Study.

• Conclusion – yes, it is feasible, but the right DTD does not exist.

Recommendations from the study were used in a modified PMC DTD. NCBI collaborated with Harvard to broaden the scope of the new PMC DTD to accommodate journals from all disciplines (not just life sciences).

NLM Journal Article DTDs: Establishing Standards from Practice

Archiving and Interchange DTD

Purpose is to preserve the journal’s intellectual content

Written for

• ease of conversion (from other DTDs)

• completeness (union of current journal DTDs)

Journal Publishing DTD

A subset of the Archiving DTD

Written for

• authoring article content

• initial tagging of non-XML content

• creating consistent structures

Adoption

HighWire Press

JSTOR’s Electronic Archiving Initiative

Australia’s Commonwealth Scientific and Industrial Research Organization

PLoS and other PMC contributors

Atypon Systems (over 150 titles) and other conversion vendors and journal service providers

Wiley, Nature, Blackwell common format (PXI)

Support

Complete documentation for both DTDs available online.

Established public discussion lists for user questions

Generic transformations to HTML and PDF forms of articles

Public XML validation tool

Working group of leaders in printing and markup industries provides advice on changes to Tagset

Portable PubMed Central (pPMC)

Provides a local mirror of PMC content

Updated daily from NCBI

Multiple site archiving

Provides rendering of PMC XML into HTML

Provides searching through NCBI EUtils

Provides for controlled local content in presentation

Provides first step toward collaborative archiving

Collaboration with Microsoft on support

What’s on the Bookshelf?

Previously published books

New collections

New content

Diabetes

• Health information with links to molecular data

• NIDDK advisors on content

• ~10,000 users per month

• “…a truly valuable resource…” Gene Barrett, President, American Diabetes Association

Obesity

Books

• Authoring in MS Word

• Simple mark-up based on Word styles

• WordML to XML conversion
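
The idea of style-based mark-up can be sketched as follows. This is not NCBI's conversion pipeline: the input is assumed to be a list of (style name, text) pairs already extracted from a Word document, and the style names, the `STYLE_TO_TAG` mapping, the output element names, and the function `paragraphs_to_xml` are all hypothetical choices for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from Word paragraph style names to XML element names.
STYLE_TO_TAG = {"Heading 1": "title", "Normal": "p"}


def paragraphs_to_xml(paragraphs) -> str:
    """Convert (style, text) pairs into nested XML; each heading opens a new section."""
    chapter = ET.Element("chapter")
    sec = None
    for style, text in paragraphs:
        tag = STYLE_TO_TAG.get(style)
        if tag is None:
            # Fail loudly so authors fix unknown styles instead of losing text.
            raise ValueError(f"unmapped Word style: {style!r}")
        if tag == "title" or sec is None:
            sec = ET.SubElement(chapter, "sec")
        ET.SubElement(sec, tag).text = text
    return ET.tostring(chapter, encoding="unicode")


if __name__ == "__main__":
    doc = [("Heading 1", "Intro"), ("Normal", "Text.")]
    print(paragraphs_to_xml(doc))
```

Keeping the mapping in one small table means authors only need to apply a few named styles consistently, and the structure follows from that.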

BioMedicine Moves to the Web

Electronic Authoring and Distribution of Articles

• Linking and annotating factual data as a side effect

• Ability to mine data and text together

• Richer data “between” supported databases

High Throughput Biology generates large datasets stored in public repositories

• Common factual data roadmap

• Greater transparency

• Greater incidental collaboration for discovery

New “private” sites for discussion on this armature

New products arise from a public infrastructure

Influenza Anti-viral Compounds

Influenza Anti-viral/Protein Binding

Influenza Neuraminidase Gene

Influenza Genome Project

Influenza Assembly Archive

national center for biotechnology information

Technology