

Web Services for PIR/UniProt Databases

Baris E. Suzek, Hongzhan Huang, Sehee Chung, Hsing-Kuo Hua, Peter McGarvey, Zhangzhi Hu, Cathy H. Wu

Protein Information Resource, Georgetown University Medical Center, Washington, DC, USA 20057-1455

Abstract

Protein Information Resource (PIR) is an integrated bioinformatics resource that provides protein databases and analysis tools to support genomic and proteomic research. PIR recently joined with the European Bioinformatics Institute (EBI) and the Swiss Institute of Bioinformatics (SIB) to establish UniProt, the Universal Protein Resource, to produce a single worldwide resource of protein sequence and function by unifying the PIR, Swiss-Prot, and TrEMBL database activities (http://www.uniprot.org). The UniProt Knowledgebase (UniProtKB) provides the central database of protein sequences with accurate, consistent, rich sequence and functional annotation. UniProtKB consists of two sections: Swiss-Prot, containing manually-annotated records with information extracted from literature and curator-evaluated computational analysis, and TrEMBL, containing computationally analyzed records that await full manual annotation. One of the biggest challenges in life sciences research is the discovery, integration and exchange of data coming from multiple research groups. To make the PIR resource widely accessible to the research community and application programs, we are adopting an open-source, common-standard distribution practice and employing industry-standard J2EE technology to develop protein object models and web services. To make the PIR resource interoperable with other bioinformatics databases, we are developing controlled vocabularies and common data elements.

The web services are being developed within the framework of the cancer Biomedical Informatics Grid (caBIG™), an infrastructure that connects individuals and institutions to enable the sharing of data and tools for cancer research, developed under the leadership of the National Cancer Institute’s Center for Bioinformatics (NCICB). PIR, as a participant in caBIG™, is developing the “Grid-enablement of PIR/UniProt Data Source” project. The goal of this project is to demonstrate how the PIR/UniProt data source can be discovered and consumed in a grid environment by creating an object layer and a web service layer for accessing the data source. The project has an n-tier architecture. The data layer, supported by Oracle 9i, stores the UniProtKB data. The data access layer uses Hibernate to provide the mapping between the relational database and the object model. The object layer is developed using a Model Driven Architecture (MDA) approach. The use cases are developed with input from the user community. The objects and their relations are designed using Unified Modeling Language (UML) in combination with existing UniProtKB XML schemas. An object-XML mapping tool (Castor) is used to serialize objects to XML and deserialize XML to objects. The web service layer, supported by Apache Axis, provides language-independent programmatic access to the objects using the SOAP protocol. The web services will support several query mechanisms for accessing PIR/UniProt data:

• Identifier searches such as UniProtKB ID or RefSeq number
• String-based searches for fields such as protein name, gene name or keywords
• Boolean searches

The results are returned in XML and FASTA format for ease of data exchange. To address the issues of data interoperability, PIR is participating in the development of common data elements (CDE) as a part of the caBIG™ Vocabulary and Common Data Elements (VCDE) activities.

As members of the NIAID Administrative Resource for Proteomic Research Centers, the PIR team and the Virginia Bioinformatics Institute are developing a cyber infrastructure with a central proteomic database for the NIAID Proteomic Research Program. We have established an Interoperability Working Group (IWG) to discuss and address database interoperability issues. In connection with the IWG and caBIG VCDE activities, we also participate in the HUPO PSI, focusing on mass spectrometry (PSI-MS) and general proteomics standards for formats (PSI-ML, XML format for data exchange), minimum reporting requirements (MIAPE), and ontologies (PSI-Ont).

Response Formats

UniProtKB FASTA for caBIG

>UniProtKB ID Accession|GO ID(s)|Organism Name|Protein Name

>1433B_HUMAN P31946|GO:0005515|Homo Sapiens|14-3-3 protein beta/alpha
MAQPAELSREENVYMAKLAEQAERYEEMVEFMEKVAKTVDSEELTVEERNLLSVAYKNVI
GARRASWRIISSIEQKEESRGNEDRVTLIKDYRGKIEVELTKICDGILKLLDSHLVPSST
APESKVFYLKMKGDYYRYLAEFKSGTERKDAAENTMVAYKAAQEIALAELPPTHPIRLGL
ALNFSVFYYEILNSPDRACDLAKQAFDEAISELDSLSEESYKDSTLIMQLLRDNLTLWTS
DISEDAAEEMKDAPKGESGDGQ
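A client consuming the caBIG FASTA response has to split the identifier line back into its parts. The following is a minimal Python sketch of that step, not code from the PIR project. The names `CabigFastaRecord` and `parse_cabig_fasta` are hypothetical, and because the poster does not say how multiple GO IDs are separated, the sketch simply extracts every `GO:` identifier by pattern.

```python
import re
from dataclasses import dataclass


@dataclass
class CabigFastaRecord:
    """One record of the caBIG FASTA layout (hypothetical container)."""
    uniprotkb_id: str
    accession: str
    go_ids: list
    organism: str
    protein_name: str
    sequence: str


def parse_cabig_fasta(text):
    """Parse a single record laid out as
    >ID Accession|GO ID(s)|Organism Name|Protein Name
    followed by one or more sequence lines."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("record must start with a '>' header line")
    # Split on the first three pipes only, so a '|' inside the protein
    # name would be kept as part of the name.
    id_part, go_part, organism, protein_name = lines[0][1:].split("|", 3)
    uniprotkb_id, accession = id_part.split()
    # Assumption: the separator between multiple GO IDs is not documented,
    # so collect every GO identifier by pattern instead of splitting.
    go_ids = re.findall(r"GO:\d{7}", go_part)
    sequence = "".join(lines[1:])
    return CabigFastaRecord(
        uniprotkb_id, accession, go_ids, organism, protein_name, sequence
    )


# First two sequence lines of the sample record shown on the poster.
EXAMPLE = """>1433B_HUMAN P31946|GO:0005515|Homo Sapiens|14-3-3 protein beta/alpha
MAQPAELSREENVYMAKLAEQAERYEEMVEFMEKVAKTVDSEELTVEERNLLSVAYKNVI
GARRASWRIISSIEQKEESRGNEDRVTLIKDYRGKIEVELTKICDGILKLLDSHLVPSST
"""

record = parse_cabig_fasta(EXAMPLE)
print(record.uniprotkb_id, record.accession, record.go_ids)
```

Splitting with a maximum of three pipes is a defensive choice, since protein names are free text.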

UniProtKB Report

http://www.pir.uniprot.org/entry/P00439

Setting Response Criteria

• Default response: UniProtKB XML with UniProtKB ID/AC, protein/gene name(s), keywords, taxonomy, primary citation, cross-references and sequence information
• Extended response: Default response plus gene location, features, comments and all citations
• FASTA response: Sequence file with an identifier line containing UniProtKB ID, UniProtKB Primary_Accession, GO ID(s), species name and protein name
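The three response types can be modelled as a small lookup from response type to the fields it carries. This is an illustrative Python sketch only: `ResponseType`, `fields_for` and the field labels are hypothetical names taken from the wording of the list above, not identifiers from the PIR web service.

```python
from enum import Enum


class ResponseType(Enum):
    DEFAULT = "default"
    EXTENDED = "extended"
    FASTA = "fasta"


# Field labels copied from the poster's description of each response.
DEFAULT_FIELDS = (
    "UniProtKB ID/AC",
    "protein/gene name(s)",
    "keywords",
    "taxonomy",
    "primary citation",
    "cross-references",
    "sequence",
)
EXTENDED_ONLY_FIELDS = ("gene location", "features", "comments", "all citations")
FASTA_HEADER_FIELDS = (
    "UniProtKB ID",
    "UniProtKB Primary_Accession",
    "GO ID(s)",
    "species name",
    "protein name",
)


def fields_for(response_type):
    """Return the tuple of fields carried by the given response type."""
    if response_type is ResponseType.DEFAULT:
        return DEFAULT_FIELDS
    if response_type is ResponseType.EXTENDED:
        # The extended response is the default response plus extra fields.
        return DEFAULT_FIELDS + EXTENDED_ONLY_FIELDS
    if response_type is ResponseType.FASTA:
        return FASTA_HEADER_FIELDS
    raise ValueError(f"unknown response type: {response_type!r}")


print(fields_for(ResponseType.EXTENDED))
```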

Use Cases

Setting search criteria

• Simple Search is based on an individual field: UniProtKB ID or accession number, NCBI Taxonomy ID, PIR ID or accession number, NCBI GI, GenPept accession number, Locus ID/Entrez Gene ID, RefSeq accession number, PDB ID with/without chain ID, OMIM ID, TIGR ID, EMBL ID, UniRef100/90/50 ID, UniParc ID, PubMed ID (PMID), PIRSF ID, PFAM ID, EC number, PROSITE ID, PRINTS ID, GO ID, InterPro ID, TIGRFAMS ID, protein name, gene name or symbol, keywords, scientific or common organism name, sequence length, molecular weight
• Advanced Search is based on two fields combined with the boolean operators “AND”, “OR” and “AND_NOT”
• All-ID Search is a Google-like search across the identifier fields, for use when the source of an identifier is not known
• Batch Retrieval uses multiple UniProtKB IDs or accessions
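One way to read the Advanced Search use case is as two Simple Searches whose result sets are combined with a set operation. The Python sketch below illustrates that reading under stated assumptions: the in-memory `TOY_RECORDS` index, its invented accessions, and the function names are all hypothetical, and the real service queries a database rather than a dictionary.

```python
# Toy index: accession -> field -> set of values. All entries are invented.
TOY_RECORDS = {
    "X00001": {"keyword": {"kinase"}, "organism": {"human"}},
    "X00002": {"keyword": {"kinase"}, "organism": {"mouse"}},
    "X00003": {"keyword": {"transport"}, "organism": {"human"}},
}

# Each boolean operator maps onto one set operation.
OPERATORS = {
    "AND": lambda left, right: left & right,
    "OR": lambda left, right: left | right,
    "AND_NOT": lambda left, right: left - right,
}


def simple_search(records, field, value):
    """Return the accessions whose given field contains the value."""
    return {
        accession
        for accession, fields in records.items()
        if value in fields.get(field, set())
    }


def advanced_search(records, first, operator, second):
    """Combine two (field, value) criteria with AND, OR or AND_NOT."""
    if operator not in OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    left = simple_search(records, *first)
    right = simple_search(records, *second)
    return OPERATORS[operator](left, right)


hits = advanced_search(
    TOY_RECORDS, ("keyword", "kinase"), "AND_NOT", ("organism", "human")
)
print(sorted(hits))
```

Keeping the operators in a table makes the supported set explicit and rejects anything outside the three operators the poster lists.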

Class Diagram

[Architecture diagram. Component labels: Client, SOAP Client, SOAP Messages, HTTPD, Web Services Layer, SOAP Engine, Message Processor, WSDL, Business Layer, JSP/Servlets, Struts, Query Processor, Domain Objects, DAO, Data Layer, ORM, JDBC, Database.]

Architectural Design

• The data layer is supported by Oracle 9i
• UniProtKB is loaded into the database using:
– Castor for UniProtKB XML to object mapping (http://castor.exolab.org)
– Hibernate for object to database mapping (http://www.hibernate.org)
• Domain objects are designed using Enterprise Architect (EA) (http://www.sparxsystems.com/ea.htm)
• Code for domain objects is generated using EA
• Data access objects (DAO) are used to abstract and encapsulate access to the database
• Apache Axis is used as the SOAP engine (http://ws.apache.org/axis/)
• Object serialization to UniProtKB XML is done at runtime using Castor mapping files instead of compiled mapping descriptors
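The role of the data access objects can be illustrated with a small sketch: callers depend on an abstract DAO, and only the concrete class knows how the data is stored. The project itself uses Java with Hibernate over Oracle 9i. The Python code below is a stand-in that uses the standard-library `sqlite3` module, and the `Protein` class, the `protein` table and all method names are hypothetical.

```python
import sqlite3
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Protein:
    """Minimal domain object (hypothetical; the real model is far richer)."""
    accession: str
    entry_id: str
    name: str


class ProteinDAO(ABC):
    """Interface the upper layers depend on; it hides the storage details."""

    @abstractmethod
    def find_by_accession(self, accession):
        """Return the matching Protein, or None if there is no such entry."""


class SqliteProteinDAO(ProteinDAO):
    """Stand-in implementation backed by SQLite instead of Oracle/Hibernate."""

    def __init__(self, connection):
        self._conn = connection

    def find_by_accession(self, accession):
        row = self._conn.execute(
            "SELECT accession, entry_id, name FROM protein WHERE accession = ?",
            (accession,),
        ).fetchone()
        return Protein(*row) if row else None


def demo_connection():
    """Create an in-memory database holding the poster's sample entry."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein (accession TEXT PRIMARY KEY, entry_id TEXT, name TEXT)"
    )
    conn.execute(
        "INSERT INTO protein VALUES (?, ?, ?)",
        ("P31946", "1433B_HUMAN", "14-3-3 protein beta/alpha"),
    )
    return conn


dao = SqliteProteinDAO(demo_connection())
print(dao.find_by_accession("P31946"))
```

Because the query processor would only see `ProteinDAO`, the storage backend could be swapped without touching the layers above it, which is the encapsulation the bullet on DAOs describes.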

National Cancer Institute caBIG™ Initiative

From the caBIG™ site (http://cabig.nci.nih.gov/): “Voluntary network or grid connecting individuals and institutions to enable the sharing of data and tools, creating a World Wide Web of cancer research. The goal is to speed the delivery of innovative approaches for the prevention and treatment of cancer”

• Domain Workspaces
  • Clinical Trial Management Systems
  • Integrative Cancer Research Workspace
    – PIR Developer Project: Grid Enablement of PIR/UniProt Data
    – PIR Adopter Project: SEED Genome Annotation Tool
  • Tissue Banks and Pathology Tools Workspace
• Cross Cutting Workspaces
  • Architecture
  • Vocabularies and Common Data Elements

Acknowledgements

• Research Projects
– NIH: NHGRI/NIGMS/NLM/NIMH/NCRR/NIDCR (UniProt)
– NIH: NIAID (Proteomic Administrative Resource)
– NIH: NCI caBIG (Grid, SEED)
– NSF: BDI (iProClass)
– NSF: SEIII (Entity Tagging)
– NSF: ITR (Ontology)
– US Air Force: EOS (Epidemic Outbreak Surveillance)

• Computing Resources
– Sun Microsystems AEG grant (V880)
– IBM SUR grant (P690)

Model Driven Architecture

• Object Management Group’s Model Driven Architecture (MDA) provides an open, vendor-independent approach
• MDA separates business and application logic from underlying technologies
• PIR’s approach:
– Analyze and develop the use cases (developed in collaboration with the adopter from the University of Pennsylvania BioMedical Informatics Facility (BMIF))
– Design the system using a class diagram in UML
– Generate the code

[Figure: PIR J2EE Bioinformatics Framework. Clients: Java Web Start applications, Web browser. Middle Tier: Servlet [Controller]; JSP, HTML, XML (XSLT) [Presentation]; Domain Objects [Model]; DAO Manager with SQL DAO, FLAT DAO and XML DAO. Adapters: JDBC, Flat File Adapter, XML Adapter. Data Source: MySQL, DB2, Oracle, Legacy Databases, XML Repositories.]

UniProt Standards and Interoperability

• Annotation Standards
– Annotation Guides
– Controlled Vocabularies and Ontologies
– Evidence Attribution Mechanism

• Data Submission and Exchange Standards
– Sequence, Annotation, Bibliography Submission
– Reciprocal Links, Database Cross-References

• Dissemination
– Databases: XML/DTD, Flat File, FASTA, Relational
– Software: Object Models; Web Services

• Towards Protein Name Standards and Ontology
– UniProt Guidelines for Protein Naming
– Protein Name Dictionary and Thesaurus
– PIRSF Classification-Based Protein Ontology

PIR and caBIG™ Common Data Elements (CDE)

• CDEs are required for semantic interoperability in caBIG
• CDEs are stored in caDSR, which maintains metadata that permit a user to locate the correct defining characteristics of a datum, an instance of a specific concept
• UMLs for object model registered to
• PIR’s CDE-related activities:
– Participate in creation of Gene CDE:
  • Genomic Identifiers
  • Taxonomy
– Creation of CDEs for UniProtKB based on the object model

NIAID Biodefense Proteomic Centers

• Seven National Proteomic Research Centers
• Administrative Resource Centers: SSS, GU-PIR, VT-VBI
• Administrative Resource Activities
– Administrative Support
– Scientific Coordination:
  • Scientific Working Group
  • Interoperability Working Group
– Cyber Infrastructure
  • Central Web Site: Single Point of Access
  • Proteomic Database: Data Storage and Retrieval
  • Integrated Protein Knowledge System: Functional Interpretation
– Interoperability Working Group (IWG)
  • Discuss and address database interoperability issues
  • Participate in the HUPO PSI, focusing on mass spectrometry (PSI-MS) and general proteomics standards for formats (PSI-ML, XML format for data exchange), minimum reporting requirements (MIAPE), and ontologies (PSI-Ont).

[Data flow diagram. Labels: Multiple Data Types from Proteomics Research Centers; Integrated Data at VBI (Data Exchange Format, Controlled Vocabulary, Ontology); Master Catalog & Complete Proteomes at GU-PIR (Protein ID, Peptide/Protein Sequence Mapping); iProClass, UniProt, PIRSF.]

UniProt (Universal Protein Resource): the world's most comprehensive catalog of information on proteins (http://www.uniprot.org)

• UniProt Knowledgebase (UniProtKB): integration of Swiss-Prot, TrEMBL and PIR-PSD. Fully classified, richly and accurately annotated protein sequences with minimal redundancy and extensive cross-references.
– Swiss-Prot section: manually-annotated protein sequences
– TrEMBL section: computer-annotated protein sequences
• UniProt Reference Clusters (UniRef): non-redundant reference sequences clustered from UniProtKB and UniParc for comprehensive or fast sequence searches at 100%, 90%, or 50% identity (UniRef100, UniRef90, UniRef50)
• UniProt Archive (UniParc): a stable, comprehensive archive of all publicly available protein sequences for sequence tracking from Swiss-Prot, TrEMBL, PIR-PSD, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices, etc.

UniProtKB XML