
Protein Bioinformatics – Advances and Challenges

By Sona Vasudevan and Peter McGarvey

2

Outline
• What is Bioinformatics? Past & Present
• About PIR
• PIR resources
• UniProt resources
• PIR's leading role in caBIG, Biodefense and Ontology

3

What is Bioinformatics?
NIH Biomedical Information Science and Technology Initiative (BISTI) Working Definition (2000)

Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

Computer (Information) + Mouse (Biology) = Bioinformatics

4

“A science which hesitates to forget its founders is lost.”

---- A. N. Whitehead

5

Dr. Margaret Oakley Dayhoff (1925 – 1983)

The origin of the single-letter code for the amino acids

Evolution of Protein databases

(Georgetown University)

6

Challenges we are facing today!

Total number of sequences in NR: ~4,919,302
Total number of environmental sequences: ~6,028,191 (NCBI)
Number of domain families (Pfam): ~8957
Number of domain families (SMART): ~665
Number of structures (PDB): ~43339
Number of COGs: ~4873 (Unicellular), ~4852 (Eukaryote)

7

Molecular Biology Databases

719 Databases in 14 categories

The DNA sequence database has exceeded 100 gigabases.

8

The birth of “omes” and the “omic” era in biology

9

Genomics

Proteomics

Unknomics

Functionomics

Metagenomics

10

11

Protein Information Resource
• UniProt Universal Protein Resource: Central Resource of Protein Sequence and Function
• PIRSF Protein Family Classification System: Protein Classification and Functional Annotation
• iProClass Integrated Protein Knowledgebase: Data Integration and Functional Associative Analysis

http://pir.georgetown.edu

Integrated Protein Informatics Resource for Proteomics Research

12

UniProt Databases
• UniParc: Comprehensive Sequence Archive with Sequence History
• UniProt: Knowledgebase with Full Classification and Functional Annotation
• UniRef: Non-redundant Reference Databases for Sequence Search

[Diagram labels: source data (Swiss-Prot, PIR-PSD, TrEMBL, RefSeq, GenBank/EMBL/DDBJ, Ensembl, PDB, Patent Data, Other Data); Merging; UniParc (Archive); Classification, Literature-Based & Automated Annotation; UniProt (Knowledgebase); Clustering at 100, 90, 50% Identity; UniRef100 (NREF), UniRef90, UniRef50]
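As a rough illustration of the first clustering level only, the sketch below groups records whose sequences are exactly identical, loosely in the spirit of a 100% identity set. The record identifiers and sequences are made up, and this is not the UniRef pipeline: fragments and the 90% and 50% levels are not handled.

```python
from collections import defaultdict


def group_identical(records):
    """Group record IDs whose sequences are exactly identical."""
    clusters = defaultdict(list)
    for record_id, sequence in records:
        # Normalize case and surrounding whitespace before comparing.
        clusters[sequence.strip().upper()].append(record_id)
    # Return clusters as sorted ID lists, ordered by their first member.
    return sorted((sorted(ids) for ids in clusters.values()), key=lambda c: c[0])


# Hypothetical records: (identifier, amino acid sequence).
records = [
    ("A1", "MKTAYIAK"),
    ("B7", "mktayiak"),
    ("C3", "MKTAYIAR"),
]
clusters = group_identical(records)
print(clusters)
```

Lower identity levels need an alignment-based similarity measure, which is left out here on purpose.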

13

UniProt Knowledgebase
• Objective: Stable, Comprehensive, Fully Classified, Richly and Accurately Annotated
• Information Content
  • Isoform Presentation
  • Nomenclature
  • Family Classification and Domain Identification
  • Functional Annotation
• Approaches
  • Full Classification
  • Automated Annotation
  • Literature-Based Curation
  • Database Cross-References
  • Controlled Vocabularies & Ontologies
  • Evidence Attribution

14

PIRSF Classification System
• PIRSF:
  • Reflects evolutionary relationships of full-length proteins
  • A network structure from superfamilies to subfamilies
• Definitions:
  • Homeomorphic Family (HF): Basic Unit
  • Homologous: Common ancestry, inferred by sequence similarity
  • Homeomorphic: Full-length similarity & common domain architecture
  • Hierarchy: Flexible number of levels with varying degrees of sequence conservation
  • Network Structure: Allows multiple parents
• Advantages:
  • Annotate both general biochemical and specific biological functions
  • Accurate propagation of annotation and development of standardized protein nomenclature and ontology

Credit AN Nikolskaya
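The "network structure" point means a node may have more than one parent, so the classification is a directed acyclic graph rather than a strict tree. A minimal sketch of walking such a structure follows; the node names are hypothetical and do not correspond to real PIRSF entries.

```python
def ancestors(node, parents):
    """Return every node reachable by following parent links from node."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical node names; a child may list more than one parent.
parents = {
    "subfamily_X": ["family_A"],
    "family_A": ["superfamily_1", "superfamily_2"],
}
result = ancestors("subfamily_X", parents)
print(sorted(result))
```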

15

PIRSF Classification System
Protein Classification and Functional Annotation

(http://pir.georgetown.edu/pirsf/)

• Comprehensive Classification of All UniProt Proteins
• Curated Families with Protein Name and Site Rules
• Classification and Visualization Tools

Taxonomy Distribution and Phylogenetic Pattern

Iterative BlastClust Tree with Annotation Table, MSA & Phylogenetic tree

16

Classification Tool: BlastClust
• Curator-guided clustering
• Single-linkage clustering using BLAST
• Retrieve all proteins sharing a common domain
• Iterative BlastClust (fixed length coverage)
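Single-linkage clustering can be sketched with a union-find structure: any two proteins connected by a chain of qualifying pairwise hits end up in the same cluster. The protein identifiers and the hit list below are hypothetical, and the hits are assumed to have already passed whatever identity and length-coverage cutoffs the curator chose. This is an illustration of the technique, not the BlastClust program itself.

```python
def single_linkage(ids, hits):
    """Cluster ids so that any two linked by a chain of hits share a cluster."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent pointers to the representative, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        # Single linkage: one qualifying hit is enough to merge two clusters.
        parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical proteins and pairwise hits that passed the chosen cutoffs.
proteins = ["P1", "P2", "P3", "P4", "P5"]
hits = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
clusters = single_linkage(proteins, hits)
print(clusters)
```

Note that P1 and P3 share a cluster even though they have no direct hit, which is the characteristic behavior of single linkage.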

17

PIRSF-Based Protein Annotation

Classification-Driven Rule-Based Annotation
• Provides Consistent Annotation and Database Integrity Check
• Includes:
  • Site Rule (PIRSR): Position-Specific Site Feature (FT)
  • Name Rule (PIRNR): transfer name from PIRSF to individual proteins
    • Protein Name (DE) with Synonym, EC, Misnomer
    • GO Term

Rule ID | Rule Condition | Rule Description (Name Rule Interface)
PIRNR000881-1 | PIRSF000881 member and vertebrates | Name: S-acyl fatty acid synthase thioesterase; EC: oleoyl-[acyl-carrier-protein] hydrolase (EC 3.1.2.14)
PIRNR000881-2 | PIRSF000881 member and not vertebrates | Name: Type II thioesterase; EC: thiolester hydrolases (EC 3.1.2.-)
PIRNR025624-1 | PIRSF025624 member | Name: ACT domain protein; Misnomer: chorismate mutase
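The three name rules on this slide can be read as a small lookup: family membership plus an optional taxonomic condition selects the name to transfer. The sketch below encodes them that way. The dictionary layout and the `apply_name_rule` helper are assumptions made for illustration, not the PIR implementation.

```python
# Rule contents are taken from the slide; the data layout is hypothetical.
NAME_RULES = [
    {"id": "PIRNR000881-1", "family": "PIRSF000881", "vertebrate": True,
     "name": "S-acyl fatty acid synthase thioesterase", "ec": "3.1.2.14"},
    {"id": "PIRNR000881-2", "family": "PIRSF000881", "vertebrate": False,
     "name": "Type II thioesterase", "ec": "3.1.2.-"},
    {"id": "PIRNR025624-1", "family": "PIRSF025624", "vertebrate": None,
     "name": "ACT domain protein", "misnomer": "chorismate mutase"},
]


def apply_name_rule(family, is_vertebrate):
    """Return the first rule matching the family and taxonomic condition, else None."""
    for rule in NAME_RULES:
        if rule["family"] != family:
            continue
        # A rule with vertebrate=None carries no taxonomic condition.
        if rule["vertebrate"] is None or rule["vertebrate"] == is_vertebrate:
            return rule
    return None


vertebrate_hit = apply_name_rule("PIRSF000881", is_vertebrate=True)
other_hit = apply_name_rule("PIRSF000881", is_vertebrate=False)
print(vertebrate_hit["name"])
print(other_hit["name"])
```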

18

Rule-based Annotation of Protein Entries Using PIRSF

Structure Binding/active sites Identification of residues

19

Methodology
• Defining a Rule
  • Select template structure
  • Align curated PIRSF seed members and structural template
  • Structure-based sequence alignment of seeds
  • Edit MSA retaining conserved regions covering all site residues
  • Build Site HMM from concatenated conserved regions
• Rule Condition
  • Membership Check (PIRSF HMM threshold)
  • Conserved Region Check (site HMM threshold)
  • Site Residue Check (position-specific residue in HMMAlign)
• Rule Propagation
  • Propagate conserved feature annotation to all members that fit the rule
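The rule condition is a sequence of three gates, and a feature is propagated only when all of them pass. The sketch below shows that control flow. The scores, cutoffs, positions and the `site_rule_applies` helper are all hypothetical. In practice the scores and aligned residues would come from HMM search and alignment tools, which are not called here.

```python
def site_rule_applies(family_score, site_score, aligned_residues,
                      family_cutoff, site_cutoff, required_residues):
    """Apply the three checks in order; all must pass for propagation."""
    # 1. Membership check against the family HMM threshold.
    if family_score < family_cutoff:
        return False
    # 2. Conserved region check against the site HMM threshold.
    if site_score < site_cutoff:
        return False
    # 3. Site residue check: each required position must carry an allowed residue.
    return all(
        aligned_residues.get(position) in allowed
        for position, allowed in required_residues.items()
    )


# Hypothetical rule: template position 57 must be H, position 102 must be D or E.
required = {57: {"H"}, 102: {"D", "E"}}
passes = site_rule_applies(310.0, 45.0, {57: "H", 102: "E"}, 250.0, 30.0, required)
fails = site_rule_applies(310.0, 45.0, {57: "H", 102: "A"}, 250.0, 30.0, required)
print(passes, fails)
```

Ordering the checks from cheapest and broadest to most specific means most non-members are rejected before any residue-level comparison is made.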

20

An example of a PIR rule integrated into an SP record

PIR Rule

21

PIRSF Protein Classification provides a platform for protein annotation
• Improves Annotation Quality
  • Annotation of biological function of whole proteins
  • Annotation of uncharacterized hypothetical proteins (functional predictions helped by newly detected family relationships)
  • Correction of annotation errors
  • Improvement of under- or over-annotated proteins
• Standardization of Protein Names

22

Data Integration
• Data Warehouse
  • Local Copy of Databases in a Unified Database Schema
  • Allows Local Control of Data; Update Problem
• Hypertext Navigation
  • Browsing Model with Hypertext Links
  • Allows Direct Interaction; Easily Lost in Cyberspace
• iProClass Approach
  • Data Warehouse + Hypertext Navigation
  • Rich Links (Links + Executive Summaries)
  • Modular and Open Framework for Adding New Components in Distributed Networking Environment

23

iProClass Database
• ~5,000,000 Protein Sequences
• Rich Links to >80 Databases
• Value-Added Views for UniProt
• Integrated Protein Family, Function, Structure Information

[Diagram: databases linked from iProClass, by category]
Protein Sequence: PIR-NREF, PIR-PSD, Swiss-Prot, TrEMBL, RefSeq, GenPept
Gene/Genome: GenBank/EMBL/DDBJ, LocusLink, UniGene, GDB, OMIM, SGD, MGI, FlyBase, MIPS, TIGR
Family (Superfamily/Domain/Motif): PIR Superfamily, PIR-ASDB, InterPro, Pfam, PROSITE, COG, BLOCKS, ProClass, MetaFam
Structure: PDB, SCOP, CATH, PDBSum, MMDB, FSSP
Function/Pathway: EC-IUBMB, KEGG, BRENDA, WIT, MetaCyc, EcoCyc, Gene Ontology
Interaction: DIP, BIND
Modification: RESID, PhosphoBase, PhosphorylationSite
Expression: PMG
Taxonomy: NCBI Taxon
Literature: PubMed

25

PIR iProClass Searches

Text Search

Peptide Search

BLAST Search

ID Mapping

26

1. Albert Einstein College of Medicine: T. gondii, C. parvum
2. Caprion Pharmaceuticals: B. abortus
3. Harvard Institute of Proteomics: V. cholerae, B. anthracis
4. Myriad Genetics: B. anthracis, Y. pestis, F. tularensis, Vaccinia, Variola
5. Pacific Northwest National Laboratory: S. typhimurium, S. typhi, Vaccinia, Monkeypox
6. Scripps: SARS CoV, Influenza
7. University of Michigan: B. anthracis

[Diagram labels: Scripps, Caprion, Myriad, Harvard, U of Michigan, Albert Einstein, PNNL; DATA; Resource Center (SSS, PIR, VBI)]

27

Organism

Research Center

Data Type

28

Currently contains 3,733 ORF Clones out of 3,784 Proteins

Master Protein Directory

29 Colonization Pathway Proteins

29

• Protein Summary Report
• Clone Sequences
• Order Clones from Repositories
• Protein and Reagent Information
• Search for Related Proteins in Catalog by Family Classification or Similarity Searches

Mouse proteins detected in B. anthracis and S. typhimurium infected macrophages

NCI caBIG Initiative

cancer Biomedical Informatics Grid:
• Informatics platform to enable sharing of research, data and tools
• Designed and built by an open federation of organizations
• Facilitate connectivity via common standards and unifying architecture
• Open source and open access principles
• Domain Workspaces
  • Clinical Trial Management Systems
  • Integrative Cancer Research
  • Imaging
  • Tissue Banks and Pathology Tools
• Cross Cutting Workspaces
  • Architecture
  • Vocabularies and Common Data Elements

PIR Activities in caBIG™

• Integrative Cancer Research Workspace
  • Developer
    • Grid-enablement of PIR
  • Adopter
    • SEED Genome Annotation Tool (completed)
    • GeneConnect Genomic Identifier Mapping Service
• Vocabularies and Common Data Elements
  • Participant

33