National Center for Biotechnology Information
TRANSCRIPT
110101
NCBI: National Center for Biotechnology Information
• Created by Public Law 100-607 in 1988 as part of the National Library of Medicine at NIH to:
• Create automated systems for knowledge about molecular biology, biochemistry, and genetics.
• Perform research into advanced methods of analyzing and interpreting molecular biology data.
• Enable biotechnology researchers and medical care personnel to use the systems and methods developed.
• Builders and providers of GenBank, Entrez, BLAST, and PubMed. Online systems host about 1.8 million users per day at peak rates of 3,200 web hits a second.
• Center for basic research and training in computational biology.
NCBI is the most heavily used site in biomedicine. Why?
[Chart: NCBI Web Traffic – 1997-2006; vertical axis from 100,000 to 700,000, horizontal axis from January 1998 to January 2006]
722,000 Unique IPs a Day
91 Million Web Hits a Day
3200 Peak Web Hits a Second
1.5 Terabytes FTP a Day
1.8 Million Unique Users a Day
Data, the Next Intel Inside
Growth of Searches and GenBank
[Chart, 1991-2003; series: GenBank (Megabases) and Searches/Day (BLAST & Text)]
Comparative Analysis of Genes Enables “Innovation in Assembly”
Human  638 RHACVEVQDEIAFIPNDVYFEKDKQMFHIITGPNMGGKSTYIRQTGVIVLMAQIGCFVPC 697
Yeast  657 RHPVLEMQDDISFISNDVTLESGKGDFLIITGPNMGGKSTYIRQVGVISLMAQIGCFVPC 716
E.coli 584 RHPVVEQVLNEPFIANPLNLSPQRR-MLIITGPNMGGKSTYMRQTALIALMAYIGSYVPA 642
Colon cancer gene sequence
[Figure: evolutionary tree with divergence times of 3000 Myr, 1000 Myr, and 500 Myr; organisms shown: Human, Mouse, Fly, Worm, Yeast, Bacteria]
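The comparison on this slide can be made concrete with a small calculation. The sketch below is an illustration only, not NCBI's method: it takes the three aligned rows exactly as shown on the slide and counts the columns where two rows carry the same residue. The function name `percent_identity` and the choice to count a gap column as a mismatch are assumptions made for this example.

```python
# Aligned rows copied from the slide; "-" marks a gap in the E. coli row.
HUMAN = "RHACVEVQDEIAFIPNDVYFEKDKQMFHIITGPNMGGKSTYIRQTGVIVLMAQIGCFVPC"
YEAST = "RHPVLEMQDDISFISNDVTLESGKGDFLIITGPNMGGKSTYIRQVGVISLMAQIGCFVPC"
ECOLI = "RHPVVEQVLNEPFIANPLNLSPQRR-MLIITGPNMGGKSTYMRQTALIALMAYIGSYVPA"


def percent_identity(row_a: str, row_b: str) -> float:
    """Percent of alignment columns where both rows hold the same residue.

    Assumption for this sketch: a column containing a gap never counts
    as a match, and the denominator is the full alignment length.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * matches / len(row_a)


for name, row in (("Yeast", YEAST), ("E.coli", ECOLI)):
    print(f"Human vs {name}: {percent_identity(HUMAN, row):.1f}% identical columns")
```

Run on these rows, the human sequence shares more identical columns with yeast than with E. coli, and all three agree across the conserved stretch beginning LIITGPNMGGKSTY.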
Ignoring the Central Dogma in Bioinformatics is Evidence of “Stupid Design”
[Diagram labels: Gene, Structure, Mature Peptide, ProPeptide, mRNA, Transcript, Chromosome, Genetics, Genomes, Organisms, Function, Disease]
It Guides “Innovative Assembly” of Separate Resources
GenBank
RefSeq
Human Genome
Bacterial Genome
Virus Genome
MMDB
PubMed
UniGene(s)
LocusLink
OMIM
Taxonomy
GEO
PopSet
BLAST
Entrez
ePCR
Sequin
[Diagram labels: Gene, Structure, Mature Peptide, ProPeptide, mRNA, Transcript, Chromosome, Genetics, Genomes, Organisms, Function, Disease]
Entrez: Pathway to Discovery
Amino acid sequence similarity
Coding region features
Nucleotide sequence similarity
Term frequency statistics
Literature citations in sequence databases
Literature citations in sequence databases
MEDLINE abstracts
Nucleotide sequences
Protein sequences
Entrez Increases Discovery Space
Nucleotide sequences
Protein sequences
Taxon
Phylogeny 3-D Structure
MMDB
3-D Structure
PubMed abstracts
Complete Genomes
PubMed Entrez Genomes
Publishers Genome Centers
Entrez is Intrinsically Components
NCBI C++ Toolkit enforces common modules in internal pipelines, external applications, and web components.
Entrez has a common model for Booleans and Summaries, and unique models for deep data.
New projects can be easily added or extended.
Long-standing use of the “productotype” keeps NCBI agile, but (fairly) robust.
Web Services Provide Access to Entrez
Eutils supports about 5 million service requests a day.
SOAP versions support about 38,000 service requests a day (0.8%), similar to Amazon's experience with REST and SOAP.
Eutils allows outside sites to recreate Entrez, and NCBI does not know who or why.
The current NCBI Sequence Viewer uses Eutils itself.
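As an illustration of how an outside site might call Eutils over plain HTTP, the sketch below only builds request URLs and sends nothing over the network. Several things here are assumptions not taken from the slides: the base URL is the one currently documented for E-utilities, the helper names `esearch_url` and `esummary_url` are hypothetical, and the record IDs in the usage lines are placeholders. The `db`, `term`, `retmax`, and `id` parameters are standard ESearch and ESummary parameters.

```python
from urllib.parse import urlencode

# Assumed base URL (the currently documented E-utilities endpoint).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for a Boolean Entrez query against one database."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def esummary_url(db: str, ids: list) -> str:
    """Build an ESummary URL for the record IDs returned by a search."""
    params = {"db": db, "id": ",".join(str(i) for i in ids)}
    return f"{EUTILS_BASE}/esummary.fcgi?{urlencode(params)}"


# The same Boolean query syntax works across Entrez databases.
print(esearch_url("pubmed", "influenza AND neuraminidase", retmax=5))
print(esummary_url("pubmed", [1, 2]))  # placeholder IDs, for illustration only
```

The two calls mirror the slide's point about a common model: one Boolean search form and one summary form serve every database, so a client needs only the database name to switch resources.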
Harnessing Collective Intelligence in BioMedicine
Bibliographic Resources
PubMed – Citations and Abstracts from publishers; MEDLINE indexing
PMC – PubMed Central, full text journal articles from publishers (and NIHMS).
pPMC – portable mirror of PMC content
NIHMS – NIH Manuscript Submission System for Public Access policy
NLM DTD – Modular DTD for bibliographic material
pNIHMS – portable NIHMS
XML Authoring System – MS Word/XML authoring
Bookshelf – Books and monographs in XML from publishers and authors.
PubMed Central XML
Why XML?
• Preserves structure of an article
• Lends itself to intelligent processing
• Human readable – not dependent on technology
• Is based on SGML, a publishing industry standard
• Portable and migratable
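To show what "preserves structure" and "lends itself to intelligent processing" can mean in practice, here is a minimal sketch that reads an article's title and section outline straight from its markup. The sample document is invented for this example. Its element names (`article`, `front`, `article-meta`, `title-group`, `article-title`, `body`, `sec`, `title`, `p`) follow the NLM journal article tag set, but the fragment is far from a complete or valid article, and the function name `outline` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample; a real PMC article carries much more metadata.
ARTICLE_XML = """<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>A Sample Article</article-title>
      </title-group>
    </article-meta>
  </front>
  <body>
    <sec><title>Background</title><p>One paragraph.</p></sec>
    <sec><title>Results</title><p>First.</p><p>Second.</p></sec>
  </body>
</article>"""


def outline(xml_text: str):
    """Return the article title and a (section title, paragraph count) list."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//article-title")
    sections = [
        (sec.findtext("title"), len(sec.findall("p")))
        for sec in root.iter("sec")
    ]
    return title, sections


print(outline(ARTICLE_XML))
```

Because the sections and paragraphs are explicit elements rather than visual formatting, the outline comes from a few tree lookups instead of layout heuristics.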
PMC2
Content is converted to a standard XML format on ingest and then stored and rendered from the one format.
But what format?
Harvard E-journal Archiving Project
The Mellon Foundation funded the Harvard Library to study the feasibility of using one DTD for archiving journal articles.
Harvard commissioned Inera, Inc. for the E-Journal Archive DTD Feasibility Study.
• Conclusion – yes, it is feasible, but the right DTD does not exist.
Recommendations from the study were used in a modified PMC DTD. NCBI collaborated with Harvard to broaden the scope of the new PMC DTD to accommodate journals from all disciplines (not just life sciences).
NLM Journal Article DTDs: Establishing Standards from Practice
Archiving and Interchange DTD
Purpose is to preserve the journal’s intellectual content
Written for
• ease of conversion (from other DTDs)
• completeness (union of current journal DTDs)
Journal Publishing DTD
A subset of the Archiving DTD
Written for
• authoring article content
• initial tagging of non-XML content
• creating consistent structures
Adoption
HighWire Press
JSTOR’s Electronic Archiving Initiative
Australia’s Commonwealth Scientific and Industrial Research Organisation
PLoS and other PMC contributors
Atypon Systems (over 150 titles) and other conversion vendors and journal service providers
Wiley, Nature, Blackwell common format (PXI)
Support
Complete documentation for both DTDs available online.
Established public discussion lists for user questions
Generic transformations to HTML and PDF forms of articles
Public XML validation tool
Working group of leaders in printing and markup industries provides advice on changes to Tagset
Portable PubMed Central (pPMC)
Provides a local mirror of PMC content
Updated daily from NCBI
Multiple site archiving
Provides rendering of PMC XML into HTML
Provides searching through NCBI EUtils
Provides for controlled local content in presentation
Provides first step toward collaborative archiving
Collaboration with Microsoft on support
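The rendering step listed above might look roughly like the following sketch, which turns a small article fragment into HTML headings and paragraphs. This is not the pPMC renderer: the sample XML is invented, the function name `render_html` is hypothetical, and only a handful of elements from the NLM tag set (`article-title`, `sec`, `title`, `p`) are handled.

```python
import xml.etree.ElementTree as ET
from html import escape

# Invented fragment using a few NLM tag set element names.
SAMPLE = """<article>
  <front><article-meta><title-group>
    <article-title>Example Title</article-title>
  </title-group></article-meta></front>
  <body>
    <sec><title>Introduction</title><p>First paragraph.</p></sec>
    <sec><title>Methods</title><p>Second paragraph.</p><p>Third &amp; last.</p></sec>
  </body>
</article>"""


def render_html(xml_text: str) -> str:
    """Render the article title, section titles, and paragraphs as HTML."""
    root = ET.fromstring(xml_text)
    title = root.findtext(
        "front/article-meta/title-group/article-title", default=""
    )
    parts = [f"<h1>{escape(title)}</h1>"]
    for sec in root.findall("body/sec"):
        parts.append(f"<h2>{escape(sec.findtext('title', default=''))}</h2>")
        for para in sec.findall("p"):
            # itertext() flattens any inline markup inside the paragraph.
            parts.append(f"<p>{escape(''.join(para.itertext()))}</p>")
    return "\n".join(parts)


print(render_html(SAMPLE))
```

Keeping one stored XML form and generating HTML on demand means a local site can change its presentation without touching the archived content.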
What’s on the Bookshelf?
Previously published books
New collections
New content
Diabetes
• Health information with links to molecular data
• NIDDK advisors on content
• ~10,000 users per month
• “…a truly valuable resource…” Gene Barrett, President, American Diabetes Association
Obesity
Books
• Authoring in MS Word
• Simple mark-up based on Word styles
• WordML to XML conversion
BioMedicine Moves to the Web
Electronic Authoring and Distribution of Articles
• Linking and annotating factual data as a side effect
• Ability to mine data and text together
• Richer data “between” supported databases
High Throughput Biology generates large datasets stored in public repositories
• Common factual data roadmap
• Greater transparency
• Greater incidental collaboration for discovery
New “private” sites for discussion on this armature
New products arise from a public infrastructure
Influenza Anti-viral Compounds
Influenza Anti-viral Compounds
Influenza Anti-viral/Protein Binding
Influenza Neuraminidase Gene
Influenza Genome Project
Influenza Assembly Archive