CAPER 2.0: An Interactive, Configurable, and Extensible Workflow-Based Platform to Analyze Data Sets from the Chromosome-centric Human Proteome Project

Dan Wang,†,‡,§,# Zhongyang Liu,†,‡,§,# Feifei Guo,†,‡,§,∥,# Lihong Diao,†,‡,§ Yang Li,†,‡,§ Xinlei Zhang,⊥ Zechi Huang,⊥ Dong Li,*,†,‡,§ and Fuchu He*,†,‡,§,∥

†State Key Laboratory of Proteomics, Beijing Proteome Research Center, Beijing Institute of Radiation Medicine, 33 Life Science Park Road, Beijing 100850, China

‡National Center for Protein Sciences Beijing, 33 Life Science Park Road, Beijing 102206, China

§National Engineering Research Center for Protein Drugs, 33 Life Science Park Road, Beijing 100850, China

∥Institute of Basic Medical Sciences Chinese Academy of Medical Sciences, School of Basic Medicine Peking Union Medical College, 5 Dong Dan San Tiao, Beijing 100005, China

⊥Beijing Genestone Technology, Ltd., F21-103, FengLinLvZhou, Kexueyuan Nanli, Datun Road, Beijing 100085, China

ABSTRACT: The Chromosome-centric Human Proteome Project (C-HPP) aims to map and annotate the entire human proteome by the “chromosome-by-chromosome” strategy. As the C-HPP proceeds, the increasing volume of proteomic data sets presents a challenge for customized and reproducible bioinformatics data analyses for mining biological knowledge. To address this challenge, we updated the previous static proteome browser CAPER to a new version, CAPER 2.0, an interactive, configurable, and extensible workflow-based platform for C-HPP data analyses. In addition to the previous visualization functions of track-view and heatmap-view, CAPER 2.0 presents a powerful toolbox for C-HPP data analyses and also integrates a configurable workflow system that supports viewing, constructing, editing, running, and sharing workflows. These features allow users to easily conduct their own C-HPP proteomic data analyses and visualization with CAPER 2.0. We illustrate the usage of CAPER 2.0 with four specific workflows for finding missing proteins, mapping peptides to chromosomes for genome annotation, integrating peptides with transcription factor binding sites from ENCODE data sets, and functionally annotating proteins. The updated CAPER is available at http://www.bprc.ac.cn/CAPE.

KEYWORDS: proteomic data analysis platform, user-customized workflow, proteomic data visualization, bioinformatics, Chromosome-centric Human Proteome Project

■ INTRODUCTION

As an important component of the Human Proteome Project (HPP) established by the Human Proteome Organization (HUPO), the Chromosome-centric Human Proteome Project (C-HPP) was officially launched in Geneva in 2011.¹ The C-HPP aims to identify the entire human protein set encoded in each chromosome and to characterize the proteins by abundance, tissue/subcellular localization, post-translational modification (PTM), single amino acid variant (SAAV) generated by nonsynonymous single nucleotide polymorphism (nsSNP), interactome, and so on.²,³ To achieve these scientific objectives, the C-HPP consortium takes a “chromosome-by-chromosome” international cooperation strategy. All 24 chromosomes and the mitochondria have now been “adopted” by 25 teams from around the world,¹ and the research achievements of the first phase have been published in the 2013 C-HPP special issue of the Journal of Proteome Research.⁴ In particular, the C-HPP consortium is strengthening its cooperation with the Encyclopedia of DNA Elements (ENCODE) Consortium, whose goal is to build a comprehensive parts list of functional elements in the human genome.⁵ This cooperation between the two projects promises to help elucidate how interacting genomic elements such as polygenes, SNPs, and transcription factors control the families of isoforms generated at the protein level.⁶

Special Issue: Chromosome-centric Human Proteome Project

Received: August 1, 2013
Published: November 22, 2013

pubs.acs.org/jpr. © 2013 American Chemical Society. dx.doi.org/10.1021/pr400795c | J. Proteome Res. 2014, 13, 99–106

Figure 1. Main pages of CAPER 2.0. (A) Analysis workspace. This page is divided into four areas: the navigation bar (top), tool panel (left), detail panel (middle), and history panel (right). The navigation bar lists the main functions of CAPER 2.0, including analysis workspace, workflow, shared data, and so on. The tool panel lists the developed tools for data import/export, data preprocessing, C-HPP data analyses, and statistical analyses. The detail panel presents the interface of the tool selected by the user, through which the user can easily set parameters and input files. The history panel records every user action (a history item) with data, analysis results, automatically tracked metadata, and annotations written by the user. The history panel helps make analysis procedures reproducible. (B) Page of workflow editor. On this page, users can select tools from the tool panel (left) to construct and edit the workflow in the editor panel (middle). Each tool in the workflow can be configured in the detail panel (right). The constructed workflow can be run in the analysis workspace. (C) Page of published workflows. CAPER 2.0 supports the publication and sharing of a workflow with metadata (the inset figure), facilitating the reuse and editing of the workflow on other data sets or by other users.

As the C-HPP proceeds, large amounts of proteomic data sets have been produced.⁴ It is challenging to extract biologically important information from these large-scale, high-throughput data sets with different sources, types, and confidence levels. To address this challenge, a bioinformatics platform integrating various analysis tools is indispensable. First, to meet the continuously increasing requirements of C-HPP data analyses, this platform should be extensible, that is, it should allow easy addition of new bioinformatics tools and of workflows integrating multiple tools. Second, this platform should be configurable to meet the different requirements of different users. Because the experimental motivations and proteomics data sets that users want to analyze are very diverse, there is no universal workflow that fits all users. Therefore, an ideal platform should allow users to configure their own analysis pipelines by setting different parameters and selecting different tools and reference data sets based on their requirements. Third, this platform should have excellent interactivity. Graphical interfaces for the tools and the workflow editor, together with visualization of the analysis results, should be provided so that users can easily implement their analyses, even without programming or informatics expertise.

The C-HPP community has developed several bioinformatic tools such as The Proteome Browser (TPB) developed by Australia,⁷ GenomewidePDB by Korea,⁸ and Gene-centric Knowledgebase for Chr 18 by Russia.⁹ All three of these tools use a gene-centric heatmap to integrate and visualize the proteomic data sets and related annotations. In particular, we developed a chromosome-assembled human proteome browser (CAPER), which not only uses heatmap-view to visualize the qualitative and quantitative data but also uses track-view to exhibit sequence/site information of proteomic data sets, facilitating the complete annotation and functional interpretation of the human genome by proteomic approaches.¹⁰

Although these tools perform well on data visualization, they still have some limitations. First, these tools only provide static visualization, whereas user-defined visualization is much more important. Second, they have relatively limited functions. Besides data visualization, C-HPP has a more pressing requirement for data analyses such as mapping peptides onto chromosomes, finding missing proteins, classifying proteins with different levels of existence evidence, and combining ENCODE and C-HPP data sets. Third, these tools do not support online, interactive data analyses, let alone the extension of analysis tools, the configuration of tool parameters, user-customized workflows, and the reproducibility of data analyses.

To address these limitations, we have updated the previous CAPER to a completely revised version: CAPER 2.0, an interactive, configurable, and extensible workflow-based bioinformatics platform for C-HPP data analyses. This platform integrates multiple analysis tools specific to C-HPP, including tools for finding missing proteins, mapping identified peptides to human chromosomes, bridging the C-HPP and ENCODE data sets, and protein functional annotation. This platform also integrates a configurable workflow system that supports workflow viewing, editing, running, and sharing. These features make it easier for users to conduct their own C-HPP data analyses with CAPER 2.0.

■ MATERIALS AND METHODS

Design and Implementation of CAPER 2.0

CAPER 2.0 was implemented based on Galaxy, which is an open-source web-based workflow system.¹¹ In addition to the standard Galaxy tools, multiple tools were developed specifically for C-HPP data analyses, which were coded in Python or Perl. Statistical tools were implemented using the RGalaxy package. Network visualization was implemented with Cytoscape Web,¹² which is a Flash-based and JavaScript-programmed software package for online network visualization. The underlying databases of CAPER 2.0 were managed using MySQL. CAPER 2.0 is currently running on an Ubuntu 12.04 Server operating system, deployed with the Nginx web server software.

Data Sets Collection

In CAPER 2.0, we used the C-HPP standard baseline metrics recently presented at HUPO 2013 (Yokohama, Japan), that is, the protein coding genes from Ensembl (Release 72)¹³ and protein-level existence evidence of protein coding genes from PeptideAtlas (Aug 2013, FDR at protein level <1%),¹⁴ GPMDB (Aug 2013, with evidence code of “green”),¹⁵ neXtProt (Sep 2013, PE1),¹⁶ and the Human Protein Atlas (HPA) (Dec 2012, with the expression value of “medium” or “high”).¹⁷ In addition, transcript-level existence evidence was also updated, which came from neXtProt (Sep 2013, PE2) and the transcriptomic expression profile from the Human Liver Proteome Project (HLPP).¹⁸ The standard baseline metrics for the 2013 C-HPP special issue of the Journal of Proteome Research¹⁰ remain available as an option in CAPER 2.0.

The ENCODE data sets integrated in CAPER 2.0 were from the ENCODE Analysis Working Group (http://genome.ucsc.edu/ENCODE/downloads.html),⁵ Gene Ontology (GO) annotations of human proteins were from GOfact,¹⁹ biological pathway data were from KEGG (version: 20100301),²⁰ and protein interaction network data were from HPRD (release 9).²¹

■ RESULTS AND DISCUSSION

Overview of CAPER 2.0

CAPER 2.0 is designed to be a workflow-based platform to analyze and visualize proteomic data sets produced from C-HPP, satisfying the flexible requirements of C-HPP data analyses.

Bioinformatics tools are indispensable for achieving the aims of C-HPP. Until now, there has been no integrated and extensible tool warehouse dedicated to C-HPP data analyses from which users can select tools for different jobs. In CAPER 2.0, besides the standard computational tools in Galaxy, we developed several tools designed for C-HPP proteomic data analyses, including “Map peptides to chromosomes”, “Find missing proteins”, “Protein classification based on existence evidence at different levels”, “KEGG pathway analyzer”, “GOfact”, “Protein network construction”, “Network analyzer”, “Send gff to CAPER track-view”, and so on. Statistical analyses often play important roles in pattern recognition for proteomic data sets, and thus in CAPER 2.0 we also integrated well-defined statistical methods as the “STATISTICS TOOLBOX”. All of these tools are listed in the tool panel in the left area of the workspace page of CAPER 2.0 (Figure 1A). Users can select a tool from the tool panel (in the left area of Figure 1A), and the corresponding interface for this tool will be presented in the detail panel (in the middle area of Figure 1A), through which users can easily set parameters and input/output.

Although individual tools can be easily used through their interfaces in CAPER 2.0, biological researchers usually need a combination of different bioinformatics tools to achieve their goals. Through the Galaxy workflow system, CAPER 2.0 allows users to select multiple tools to construct, configure, save, run, and share their own workflows. Clicking “Workflow” on the navigation bar leads users to the graphical workflow editor (Figure 1B), in which users can select tools from the tool panel (in the left area of Figure 1B) and configure them using the detail panel (including parameters and I/O settings) (in the right area of Figure 1B) to construct their own workflows. In the following parts, four typical workflows we constructed for C-HPP data analyses are introduced in detail. These four workflows have been published on the “Published Workflows” page, which can be accessed by clicking “Shared data → Published workflows” in the navigation bar (Figure 1C). These published workflows can be reused and edited by users.

To facilitate the usage of CAPER 2.0 by both bioinformatics and nonbioinformatics researchers in the C-HPP community, a detailed tutorial describing how to edit, use, save, run, and publish these workflows is provided on the website (“Help” → “Tutorials”).

Figure 2. Workflow of finding missing proteins to provide clues for C-HPP experimental design. (A) Workflow of finding missing proteins, which is mainly composed of the “Find missing proteins” and “Send gff to CAPER track-view” tools. (B) Output of this workflow. (1) The webpage gives one of the output results of the “Find missing proteins” tool, which is produced by the “Create webpages” tool, presenting the identified missing proteins together with their locations on the chromosome and their classification. “Non-characterized proteins” are those missing proteins with transcription-level existence evidence, while “dubious proteins” are those without transcription-level evidence. (See the Materials and Methods for details.) (2) Example of exhibiting missing proteins in track-view, which is achieved by the “Send gff to CAPER track-view” tool. (See ref 10 for a detailed introduction to the usage of the track-view.) In this example, two missing proteins are exhibited. One is a non-characterized protein (TAS1R2), and the other is a dubious protein (NBL1). In this example, protein- and transcription-level evidence (“User-defined mRNA-level evidence” track) both follow the standards for the 2014 special issue.

Figure 3. Workflow used to map the identified peptides to chromosomes. (A) Workflow used to map the identified peptides to chromosomes. This workflow is mainly composed of the “Map peptides to chromosomes”, “Protein classification based on existence evidence at different levels”, and “Send gff to CAPER track-view” tools. Similar to the “Find missing proteins” tool, the “Protein classification” tool also lets users select existence evidence versions. (See the Materials and Methods.) (B) Output of this workflow. Presented in the figure are the analysis results of this workflow on identified peptides from the Chinese Adult Liver Proteome (CALP).¹⁸ (1) The webpage is one of the output results of the “Protein classification” tool, which is produced by the “Create webpages” tool, giving the classification information of the identified peptides’ corresponding proteins (three classes: characterized/non-characterized/dubious protein). (2) An example of visualizing peptides in track-view. The peptide in the figure is mapped to gene HA02, which corresponds to a “non-characterized protein” without translation-level but with transcription-level existence evidence. Here protein- and transcription-level evidence (“User-defined mRNA-level evidence” track) both follow the standards for the 2014 special issue. The peptide in the figure is located within known exon regions of the gene HA02, consistent with its existing gene model.

Finding Missing Proteins: Providing Clues for C-HPP Experimental Design

Missing proteins are those that should exist based on genomic evidence but remain undiscovered at the protein level. Identifying and characterizing the missing proteins lacking MS evidence or antibody detection is the primary goal of the C-HPP.²² Because of the high cost of proteomic experiments for the validation of missing proteins, an appropriate experimental design is crucial.

To provide clues for the experimental design of proteomic detection of missing proteins, we constructed the workflow “Finding missing proteins to provide clues for C-HPP experimental design” by assembling the tools “Find missing proteins” and “Send gff to CAPER track-view” (Figure 2A). In the “Find missing proteins” tool, two versions of the C-HPP standard baseline metrics¹ are available for identifying missing proteins: the standard for the 2013 C-HPP special issue of the Journal of Proteome Research¹⁰ and that for the 2014 special issue. (See the Materials and Methods for details.) In this tool, users can select a chromosome of interest and an evidence version by parameter setting. The output of this tool includes a table that gives the identified missing proteins based on user-defined parameters together with their locations on chromosomes (Figure 2B(1)) and GFF3-formatted files, which describe gene models and mRNA-level evidence of the missing proteins. Furthermore, with the “Send gff to CAPER track-view” tool, users can use the CAPER track-view browser to visualize the GFF3-formatted results from the “Find missing proteins” tool. Figure 2B(2) shows an example of such visualization, in which missing proteins together with their protein coding genes and mRNA-level evidence are presented. These identified missing proteins together with their mRNA-level evidence will provide clues for subsequent targeted experiments to ultimately achieve the C-HPP goal of full proteome coverage.
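The selection logic described above can be illustrated with a minimal Python sketch. It is not the code of the “Find missing proteins” tool: the function name, the input layout, and the gene identifiers are hypothetical, and the evidence sets stand in for the protein-level and transcript-level sources listed in the Materials and Methods. The sketch only assumes what the text states, namely that a missing protein is a protein coding gene without protein-level evidence, labeled “non-characterized” when transcript-level evidence exists and “dubious” otherwise.

```python
from typing import Dict, Set, Tuple

# Hypothetical layout: gene ID -> (chromosome, start, end)
GeneTable = Dict[str, Tuple[str, int, int]]


def find_missing_proteins(
    genes: GeneTable,
    protein_evidence: Set[str],
    transcript_evidence: Set[str],
    chromosome: str,
) -> Dict[str, str]:
    """Return {gene ID: class} for genes on one chromosome that lack
    protein-level existence evidence."""
    missing: Dict[str, str] = {}
    for gene_id, (chrom, _start, _end) in sorted(genes.items()):
        if chrom != chromosome:
            continue  # user selected a different chromosome
        if gene_id in protein_evidence:
            continue  # already observed at the protein level
        if gene_id in transcript_evidence:
            missing[gene_id] = "non-characterized"
        else:
            missing[gene_id] = "dubious"
    return missing


# Toy data with made-up identifiers and coordinates.
genes = {
    "GENE_A": ("1", 1_000, 5_000),
    "GENE_B": ("1", 8_000, 9_500),
    "GENE_C": ("1", 12_000, 15_000),
    "GENE_D": ("2", 3_000, 4_000),
}
protein_evidence = {"GENE_A"}
transcript_evidence = {"GENE_A", "GENE_B"}

missing = find_missing_proteins(genes, protein_evidence, transcript_evidence, "1")
print(missing)  # → {'GENE_B': 'non-characterized', 'GENE_C': 'dubious'}
```

Switching between the 2013 and 2014 baseline metrics would, under this sketch, amount to passing different evidence sets to the same function.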

Mapping the Identified Peptides onto Human Genome: Annotating the Human Genome by Proteomic Data Sets

In the postgenomic era, proteomic approaches have begun to play important roles in genome annotation.²³ Mapping peptides from MS to the genome not only provides translation-level existence evidence for protein-coding regions in the genome but also helps confirm or correct the current gene models.

The massive proteomic data sets produced by C-HPP provide an unprecedented opportunity for complete annotation of the human genome. To contribute to genome annotation using these proteomics data sets, we used the tools “Map peptides to chromosomes”, “Protein classification based on existence evidence at different levels”, and “Send gff to CAPER track-view” in CAPER 2.0 to establish a workflow (Figure 3A). First, for the peptides submitted by users, the “Map peptides to chromosomes” tool maps these peptides to their coding genes on chromosomes and outputs a GFF3-formatted file describing the peptides’ detailed locations on chromosomes. Then, on the basis of this information, the peptides’ corresponding proteins are divided into three classes by the “Protein classification based on existence evidence at different levels” tool: “characterized proteins” with both transcription- and translation-level existence evidence, “non-characterized proteins” with only transcription-level evidence, and “dubious proteins” having only the predicted gene models.¹⁰ Both “non-characterized proteins” and “dubious proteins” are “missing proteins”. Peptides corresponding to “characterized proteins” can further validate the existence of these proteins, while those corresponding to missing proteins can provide novel protein-level evidence for protein-coding regions in the genome (Figure 3B(1)). Finally, the results of “Protein classification” are visualized in track-view by the “Send gff to CAPER track-view” tool, presenting the correspondence between the sequences of the peptides, the translation- and transcription-level evidence, and the gene models, which contributes to identifying protein-coding regions and to confirming or correcting current gene models in the genome using peptides (Figure 3B(2)).

Figure 4. Workflow used to integrate C-HPP peptides with ENCODE transcription factor binding sites. (A) Workflow used to integrate C-HPP peptides with ENCODE transcription factor binding sites. In the “Send gff to CAPER track-view” tool, users can select the ENCODE data sets they want to integrate with the identified peptides from C-HPP. Currently, several transcription factors’ binding signal profiles by ChIP-seq from ENCODE/HAIB are provided as examples,⁵ and more ENCODE data sets will be added in the future. (B) Example of the output of this workflow. The “Tfbs_CEBPB_wgEncode” track in the figure exhibits the binding signal profile of transcription factor CEBPB. The transcription factor (CEBPB) is supposed to target the coding gene (CRYZ) of the identified peptides.
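The peptide-to-chromosome mapping step described in this section can be sketched as a coordinate conversion. The Python below is a simplified illustration, not the “Map peptides to chromosomes” tool itself: it assumes a forward-strand gene whose coding exons are given as sorted, 1-based inclusive genomic intervals, and a peptide given by its residue positions in the protein. The function names, the `example` source label, and the `match_part` feature type are choices made for this sketch; only the nine tab-separated GFF3 columns follow the GFF3 format.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # 1-based inclusive genomic interval


def peptide_to_genomic(cds_exons: List[Segment], aa_start: int, aa_end: int) -> List[Segment]:
    """Map residues aa_start..aa_end (1-based, inclusive) of a protein to
    genomic segments, given forward-strand coding exons in genomic order."""
    nt_first = (aa_start - 1) * 3  # 0-based CDS offset of the first base
    nt_last = aa_end * 3 - 1       # 0-based CDS offset of the last base
    segments: List[Segment] = []
    consumed = 0                   # CDS bases contained in earlier exons
    for ex_start, ex_end in cds_exons:
        length = ex_end - ex_start + 1
        lo = max(nt_first, consumed)
        hi = min(nt_last, consumed + length - 1)
        if lo <= hi:  # the peptide covers part of this exon
            segments.append((ex_start + lo - consumed, ex_start + hi - consumed))
        consumed += length
    return segments


def to_gff3(chrom: str, peptide_id: str, segments: List[Segment]) -> List[str]:
    """Format the segments as GFF3 lines (nine tab-separated columns)."""
    lines = ["##gff-version 3"]
    for i, (start, end) in enumerate(segments, 1):
        attributes = f"ID={peptide_id}.{i};Name={peptide_id}"
        lines.append("\t".join(
            [chrom, "example", "match_part", str(start), str(end), ".", "+", ".", attributes]
        ))
    return lines


# Toy gene: two coding exons (9 + 12 bases = 7 codons); peptide covers residues 3-5.
exons = [(101, 109), (201, 212)]
segments = peptide_to_genomic(exons, 3, 5)
print(segments)  # → [(107, 109), (201, 206)]
print("\n".join(to_gff3("chr1", "pep1", segments)))
```

A peptide that spans an exon junction yields more than one segment, which is the case a track viewer needs in order to draw the peptide against the gene model. Reverse-strand genes would require mirrored arithmetic, which this sketch omits.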

Integrating Genomic Annotation and Identified Peptides: Bridging ENCODE and C-HPP

The ENCODE project aims to decipher the functional elements in the human genome,⁵ and the combined effort between the ENCODE and C-HPP initiatives promises to illustrate how the genomic elements control the expression of proteins under certain circumstances.⁶ The chromosome provides an ideal medium to integrate the ENCODE genomic and C-HPP proteomic data sets. First, chromosomes are the carrier of human genetic information, and almost all genomic and proteomic data can be mapped to chromosomes based on the “genetic central dogma”. Second, the genome sequence on chromosomes is one-dimensional, and thus it is easy to analyze and show all related information on chromosomes in parallel tracks. In CAPER 2.0, the published workflow “Integrating C-HPP peptides with ENCODE transcription factor binding sites” integrates ENCODE transcription factor binding site data sets with MS peptide data sets submitted by users, using chromosomes as the reference coordinate, and further visualizes them in track-view (Figure 4A). With such visualization in the track-view, users can easily explore the relations of the transcription factor binding sites to protein expression (Figure 4B). As more ENCODE data sets are generated and integrated into CAPER, users will be able to select other types of annotations in ENCODE as a reference to study their proteomic data sets.

Figure 5. Workflow for the functional annotation of the protein list. (A) Workflow for the functional annotation of the protein list. The input of this workflow is a list of proteins, which can be obtained from “Find genes in a chromosome region”, uploaded by users with “Upload file from your computer”, or obtained from the “Get data from CAPER track-view” tool. The following tools can only analyze genes represented by Entrez Gene ID, and thus the “ID mapping” tool can be used to convert other IDs into Entrez Gene IDs. (B) Examples of the analysis results of this workflow. (1−2) Output of the “GOfact” and “KEGG pathway analyzer” tools. The table on the webpage produced by the “Create webpages” tool gives the enriched/depleted GO terms/KEGG pathways and the corresponding P value, which is computed based on the hypergeometric cumulative distribution test. Color code: red, significantly enriched (upper-tail P value <0.05); light red, enriched; light green, depleted; green, significantly depleted (lower-tail P value <0.05). (3) The protein interaction network graph presents the interaction relationships between the submitted proteins. This page can be opened with the “view in Cytoscape Viewer” function in the history panel of the workspace page. On this page, users can further edit the network graph.
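Using chromosomes as the shared coordinate, as the ENCODE integration workflow in this section does, reduces to comparing intervals on the same chromosome. The following Python sketch shows one possible way to list transcription factor binding peaks that fall in or near a gene to which peptides were mapped. It is an assumption-laden illustration rather than CAPER code: the tuple layouts, the `flank` parameter and its default, and all coordinates and names are invented for the example.

```python
from typing import List, Tuple

Gene = Tuple[str, int, int]       # (chromosome, start, end)
Peak = Tuple[str, int, int, str]  # (chromosome, start, end, factor name)


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True when two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def peaks_near_gene(gene: Gene, peaks: List[Peak], flank: int = 2000) -> List[Peak]:
    """Return peaks on the gene's chromosome that overlap the gene body
    extended by `flank` bases on both sides (an arbitrary window)."""
    chrom, start, end = gene
    return [
        peak for peak in peaks
        if peak[0] == chrom and overlaps(peak[1], peak[2], start - flank, end + flank)
    ]


# Toy coordinates; "TF_X" and "TF_Y" are placeholder factor names.
gene = ("chr1", 10_000, 12_000)
peaks = [
    ("chr1", 8_500, 8_900, "TF_X"),    # inside the upstream flank
    ("chr1", 20_000, 20_400, "TF_X"),  # too far downstream
    ("chr2", 9_000, 9_500, "TF_Y"),    # different chromosome
]
nearby = peaks_near_gene(gene, peaks)
print(nearby)  # → [('chr1', 8500, 8900, 'TF_X')]
```

Proximity alone does not establish regulation. Such a list only narrows down which binding profiles are worth inspecting alongside the peptide track.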

Associating Identified Proteins with Functional Annotations: Extracting Biological Knowledge from the C-HPP Data Sets

For C-HPP, identifying the full set of proteins encoded in each chromosome is only the first step. The next step should be extracting useful biological knowledge from these identified proteins. In CAPER 2.0, the workflow “Associating identified proteins with functional annotations to extract biological knowledge from the C-HPP data sets” is constructed for this purpose (Figure 5A).

The input of this workflow is a list of proteins, which may be uploaded data identified from a proteomic experiment (e.g., differentially expressed proteins of a case-control study), obtained from track-view/heatmap-view by the “Get data from CAPER track-view/heatmap-view” tool, or obtained by the “Find genes in a chromosome region” tool. We integrated the two most common types of functional analysis approaches to analyze this group of proteins. The first type is the functional classification tools, including “GOfact”¹⁹ and “KEGG pathway analyzer”. These tools can group the proteins into GO categories (biological process/cellular component/molecular function) or biological pathways and can also identify the overrepresented GO categories or biological pathways among these proteins using the hypergeometric cumulative distribution test with the whole human proteome as the background (Figure 5B(1, 2)). These analysis results can help researchers understand the overall function of this group of proteins, and the overrepresented GO categories/biological pathways among these proteins can provide important clues for further experimental research. The second type is the network analysis tools, including “Protein network construction” and “Network analyzer”, which can be used to construct a protein interaction network centered on these proteins and to further analyze the proteins in the context of that network (Figure 5B(3)). The network-based analyses can also contribute to the functional interpretation of these proteins. For example, interacting proteins are often used to provide clues for predicting the function of a protein of interest,²⁴ and a protein with higher degree/betweenness centrality in the protein interaction network is generally more important for cellular functions and may be preferentially considered in subsequent experimental design.
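The hypergeometric cumulative distribution test mentioned above can be written out directly. The sketch below is a generic implementation of the two tail probabilities using only the Python standard library. It is not the GOfact or “KEGG pathway analyzer” code, and the function and parameter names are chosen for this example. Here `N` is the number of proteins in the background, `K` the number of background proteins annotated with a term, `n` the size of the submitted list, and `k` the number of submitted proteins annotated with the term.

```python
from math import comb
from typing import Tuple


def hypergeom_tails(N: int, K: int, n: int, k: int) -> Tuple[float, float]:
    """Return (upper-tail P(X >= k), lower-tail P(X <= k)) for a
    hypergeometric variable X: annotated proteins in a sample of n
    drawn without replacement from N proteins, K of them annotated."""
    total = comb(N, n)
    lowest = max(0, n - (N - K))  # smallest possible overlap
    highest = min(K, n)           # largest possible overlap

    def pmf(i: int) -> float:
        return comb(K, i) * comb(N - K, n - i) / total

    upper = sum(pmf(i) for i in range(max(k, lowest), highest + 1))
    lower = sum(pmf(i) for i in range(lowest, min(k, highest) + 1))
    return upper, lower


# Toy numbers: 20 background proteins, 5 annotated, 4 submitted, 2 of them annotated.
upper_p, lower_p = hypergeom_tails(N=20, K=5, n=4, k=2)
print(round(upper_p, 4), round(lower_p, 4))  # → 0.2487 0.968

# Thresholds from the Figure 5 color code.
significantly_enriched = upper_p < 0.05
significantly_depleted = lower_p < 0.05
```

Both tails include the observed count `k`, so the two values sum to more than 1. A small upper-tail value indicates overrepresentation of the term in the submitted list, and a small lower-tail value indicates depletion.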

■ CONCLUSIONS

We updated the previous static proteome browser (CAPER) to a new version, CAPER 2.0, which is an interactive, configurable, and extensible workflow-based platform to analyze the data sets from C-HPP. Four workflows and related tools for the analysis and visualization of C-HPP data sets are presented in CAPER 2.0. Users can also construct and edit their own workflows by assembling the tools in the toolbox of CAPER 2.0. In the future, we will provide more tools and workflows as C-HPP develops. This workflow-based platform will facilitate the mapping and functional interpretation of the entire human protein set, contributing to the goals of C-HPP and to human physiology/pathology research.

■ AUTHOR INFORMATION

Corresponding Authors

*Dong Li: Tel: 86-10-80705999. Fax: 86-10-80705225. E-mail: [email protected].

*Fuchu He: Tel: 86-10-68177417. Fax: 86-10-68177417. E-mail: [email protected]

Author Contributions

#D.W., Z.L., and F.G. contributed equally to this work.

Notes

The authors declare no competing financial interest.

■ ACKNOWLEDGMENTS

We thank Jun Qin, Weimin Zhu, Bei Zhen, Xiaohong Qian, Yunping Zhu, Ping Xu, and Hongxing Zhang for their fruitful discussion. This work is funded by the Program of International S&T Cooperation (0S2014ZR0003), National Natural Science Foundation of China (31271407), the Chinese National Key Program of Basic Research (2012CB910300 and 2011CB910202), Chinese High-technology Research and Development (2012AA020201), and the National Key Technology R&D Program (2012BAI29B07).

■ REFERENCES

(1) Marko-Varga, G.; Omenn, G. S.; Paik, Y.-K.; Hancock, W. S. A first step toward completion of a genome-wide characterization of the human proteome. J. Proteome Res. 2013, 12, 1−5.
(2) Paik, Y.-K.; Jeong, S.-K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; et al. The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nat. Biotechnol. 2012, 30, 221−223.
(3) Huhmer, A. F. R.; Paulus, A.; Martin, L. B.; Millis, K.; Agreste, T.; Saba, J.; Lill, J. R.; Fischer, S. M.; Dracup, W.; Lavery, P. The chromosome-centric human proteome project: a call to action. J. Proteome Res. 2013, 12, 28−32.
(4) Chromosome-centric Human Proteome Project Special Issue. J. Proteome Res. 2013, Vol. 12, issue 1.
(5) ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 2012, 489, 57−74.
(6) Paik, Y.-K.; Hancock, W. S. Uniting ENCODE with genome-wide proteomics. Nat. Biotechnol. 2012, 30, 1065−1067.
(7) Goode, R. J. A.; Yu, S.; Kannan, A.; Christiansen, J. H.; Beitz, A.; Hancock, W. S.; Nice, E.; Smith, A. I. The proteome browser web portal. J. Proteome Res. 2013, 12, 172−178.
(8) Jeong, S.-K.; Lee, H.-J.; Na, K.; Cho, J.-Y.; Lee, M. J.; Kwon, J.-Y.; Kim, H.; Park, Y.-M.; Yoo, J. S.; Hancock, W. S.; Paik, Y.-K. GenomewidePDB, a proteomic database exploring the comprehensive protein parts list and transcriptome landscape in human chromosomes. J. Proteome Res. 2013, 12, 106−111.
(9) Zgoda, V. G.; Kopylov, A. T.; Tikhonova, O. V.; Moisa, A. A.; Pyndyk, N. V.; et al. Chromosome 18 transcriptome profiling and targeted proteome mapping in depleted plasma, liver tissue and HepG2 cells. J. Proteome Res. 2013, 12, 123−134.
(10) Guo, F.; Wang, D.; Liu, Z.; Lu, L.; Zhang, W.; Sun, H.; Zhang, H.; Ma, J.; Wu, S.; Li, N.; Jiang, Y.; Zhu, W.; Qin, J.; Xu, P.; Li, D.; He, F. CAPER: a chromosome-assembled human proteome browsER. J. Proteome Res. 2013, 12, 179−186.
(11) Goecks, J.; Nekrutenko, A.; Taylor, J.; Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010, 11, R86.
(12) Lopes, C. T.; Franz, M.; Kazi, F.; Donaldson, S. L.; Morris, Q.; Bader, G. D. Cytoscape Web: an interactive web-based network browser. Bioinformatics 2010, 26, 2347−2348.
(13) Flicek, P.; Amode, M. R.; Barrell, D.; Beal, K.; Brent, S.; et al. Ensembl 2012. Nucleic Acids Res. 2012, 40, D84−90.

(14) Deutsch, E. W.; Lam, H.; Aebersold, R. PeptideAtlas: a resource for target selection for emerging targeted proteomics workflows. EMBO Rep. 2008, 9, 429−434.
(15) Craig, R.; Cortens, J. P.; Beavis, R. C. Open source system for analyzing, validating, and storing protein identification data. J. Proteome Res. 2004, 3, 1234−1242.
(16) Lane, L.; Argoud-Puy, G.; Britan, A.; Cusin, I.; Duek, P. D.; Evalet, O.; Gateau, A.; Gaudet, P.; Gleizes, A.; Masselot, A.; Zwahlen, C.; Bairoch, A. neXtProt: a knowledge platform for human proteins. Nucleic Acids Res. 2012, 40, D76−83.
(17) Uhlen, M.; Oksvold, P.; Fagerberg, L.; Lundberg, E.; Jonasson, K.; Forsberg, M.; Zwahlen, M.; Kampf, C.; Wester, K.; Hober, S.; Wernerus, H.; Bjorling, L.; Ponten, F. Towards a knowledge-based Human Protein Atlas. Nat. Biotechnol. 2010, 28, 1248−1250.
(18) Chinese Human Liver Proteome Profiling Consortium. First insight into the human liver proteome from PROTEOME(SKY)-LIVER(Hu) 1.0, a publicly available database. J. Proteome Res. 2010, 9, 79−94.
(19) Li, D.; Li, J. Q.; Ouyang, S. G.; Wu, S. F.; Wang, J.; Xu, X. J.; Zhu, Y. P.; He, F. C. An integrated strategy for functional analysis in large-scale proteomic research by gene ontology. Prog. Biochem. Biophys. 2005, 32, 1026−1029.
(20) Kanehisa, M.; Goto, S.; Kawashima, S.; Okuno, Y.; Hattori, M. The KEGG resource for deciphering the genome. Nucleic Acids Res. 2004, 32, D277−280.
(21) Keshava Prasad, T. S.; Goel, R.; Kandasamy, K.; Keerthikumar, S.; Kumar, S.; et al. Human Protein Reference Database–2009 update. Nucleic Acids Res. 2009, 37, D767−772.
(22) Paik, Y.-K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Marko-Varga, G.; et al. Standard Guidelines for the Chromosome-Centric Human Proteome Project. J. Proteome Res. 2012, 11, 2005−2013.
(23) Desiere, F.; Deutsch, E. W.; Nesvizhskii, A. I.; Mallick, P.; King, N. L.; et al. Integration with the human genome of peptide sequences obtained by high-throughput mass spectrometry. Genome Biol. 2005, 6, R9.
(24) Schwikowski, B.; Uetz, P.; Fields, S. A network of protein-protein interactions in yeast. Nat. Biotechnol. 2000, 18, 1257−1261.
