
Isoform Analysis of LC-MS/MS Data from Multidimensional Fractionation of the Serum Proteome

Alexei L. Krasnoselsky,* Vitor M. Faca, Sharon J. Pitteri, Qing Zhang, and Samir M. Hanash

Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, Washington 98109

Received November 8, 2007

Abstract: We developed a visualization approach for the identification of protein isoforms, precursor/mature protein combinations, and fragments from LC-MS/MS analysis of multidimensional fractionation of serum and plasma proteins. We also describe a pattern recognition algorithm to automatically detect and flag potentially heterogeneous species of proteins in proteomic experiments that involve extensive fractionation and result in a large number of identified serum or plasma proteins in an experiment. We give examples of proteins with known isoforms that validate our approach, and we present a subset of precursor/mature protein pairs that were detected with it. Potential applications include identification of differentially expressed isoforms in disease states.

Keywords: Protein fractionation, visualization, LC-MS/MS, isoforms

Introduction

With the rapid proliferation of proteomic data, there is a need for tools that allow computational data mining and visualization of complex data sets. There are many software packages available for processing proteomics data and displaying results (for a recent review, see Palagi et al., ref 1). However, there is a paucity of visualization tools that are simple and easily adaptable to evolving proteomic data formats. Visualization tools combine several sources of information for intelligent data mining.
The human eye is particularly suited to identify complex patterns and features, provided that the information is presented in a structured visual way and limited to a few patterns at a time. The gene expression red-green heat maps serve as an example of a simple and yet effective method of representation of complex data (2).

Proteins exist in plasma and tissue sources in multiple forms that result from alternative splicing (isoforms), precursor/mature protein combinations, or different patterns of glycosylation. Most proteins are secreted as precursor proteins from which biologically active forms are generated upon proteolytic cleavage (e.g., see Khatib and Geraldine, ref 3). For biomarker discovery, it is important to assess the presence of isoforms that may differ in their levels in a disease-related manner, as in the case of phosphorylation and glycosylation, among numerous post-translational modifications. We present here a visualization approach for multidimensional proteomic data to assist in the search for protein isoforms, precursor/mature protein combinations, and fragments. Along with the visualization tool, we also describe a simple pattern recognition algorithm that we developed to automatically detect and flag potentially heterogeneous species of proteins in proteomic experiments that involve extensive fractionation and result in a large number of identified proteins in one experiment.

Methods

Protein Separation and Mass Spectrometry Analysis. Serum and plasma protein samples were subjected to fractionation followed by LC-MS/MS analysis of tryptic digests from individual fractions. The full procedure, designated Intact Protein Analysis System (IPAS), has been previously described by Faca et al. (4) Briefly, after immunodepletion, acrylamide-labeled samples (5) were fractionated by anion-exchange into 12 fractions and subsequently by reversed-phase into 12 fractions, representing a total of 144 fractions that were analyzed individually by shotgun LC-MS/MS.
In-solution tryptic digestion was performed overnight with lyophilized aliquots from the reversed-phase (second dimension) fractionation step. The resulting peptide mixtures were analyzed by an LTQ-FTICR mass spectrometer (Thermo-Finnigan) coupled with a NanoAcquity nanoflow chromatography system (Waters). Spectra were acquired in a data-dependent mode in the m/z range of 400-1800, including selection of the 5 most abundant +2 or +3 ions of each MS spectrum for MS/MS analysis. Acquired data were automatically processed by the Computational Proteomics Analysis System (CPAS) (6) pipeline. This pipeline includes the X!Tandem search algorithm (7) with the comet score module plug-in (8), PeptideProphet (9) peptide validation, and the ProteinProphet (10) protein inference tool. The tandem mass spectra were searched against version 3.12 of the human IPI database (11). All identifications with a PeptideProphet probability greater than 0.75 were selected, and the subsequent protein identifications were filtered at a 5% error rate.

* To whom correspondence should be addressed. Tel: (206) 667-1250, fax: (206) 667-2537, E-mail: [email protected].

"[...] reveals data." Edward R. Tufte in The Visual Display of Quantitative Information.

Journal of Proteome Research 2008, 7, 2546-2552 (Vol. 7, No. 6; technical notes). 10.1021/pr7007219 CCC: $40.75 © 2008 American Chemical Society. Published on Web 04/18/2008.

Figure 1. Visualization of proteomic data in 2-D fractionation experiments with differential sample labeling. The data shown are for the protein HGFA (hepatocyte growth factor activator). (A) The peptide and ratio map of the 2-D chromatography fractionation. The grid represents the 2-D chromatography fractionation (12 × 12 fractions). The X-axis represents 12 fractions of ion-exchange chromatography and the Y-axis, 11 fractions of RP-HPLC. Each node of the grid shows the fraction location. The peptides are shown as concentric circles of different colors (the full list of identified peptides is shown in the inset), whereby the size of the circle indicates the relative distance of the peptide from the N-terminus of the full protein sequence. The size of the circle corresponds to the sequential order of the peptides starting from the N-terminus. The range for each peptide represents the starting and ending position in the protein sequence, scaled to 0-1. Values are provided for ratios between two samples being compared based on differential acrylamide labeling (5), where case samples are labeled with C13 acrylamide and control samples with C12 acrylamide. (B) Histogram of the ratios obtained for this protein in an experiment in which a comparison is made between two samples (in all 132 fractions). (C) Total MS events map for the 2-D separation. Each node of the grid shows the number of MS events summed across all peptides, while the size of the circle visually reflects that number.

Heterogeneity Detection Algorithm. The concept behind cluster detection is as follows. For each protein (a single IPI or a protein group of multiple IPI numbers considered to represent the same protein), the data were assembled into an n × m grid of fractions, where n corresponds to the number of fractions derived in ion-exchange chromatography (represented on the X-axis) and m corresponds to the number of fractions derived in RP-HPLC (represented on the Y-axis). The dimensions for the two data sets used in this article are 12 × 12 for one data set and 12 × 11 for the other. The pattern detection is performed at the peptide level. For each peptide, a binary separation map is derived by assigning 1 to each fraction where the peptide was identified and 0 where it was not. The map serves as input to the protein heterogeneity detection algorithm, which consists of two steps. First, the fractionation pattern is smoothed by a 2 × 2 kernel, whereby each fraction $x_{i,j}$ is assigned the sum of the values in the kernel: $S_{i,j} = x_{i,j} + x_{i+1,j} + x_{i,j+1} + x_{i+1,j+1}$. The rationale for smoothing is to reduce the MS sampling effect that might result in overestimation of the number of clusters.
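The binary map and the 2 × 2 smoothing step described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' Matlab code: the function names `binary_map` and `smooth_2x2` are hypothetical, coordinates are 0-based, and the treatment of the grid edge (neighbours beyond the last row or column count as 0) is an assumption, since the text does not say how boundaries are handled.

```python
def binary_map(identified, n, m):
    """Build an n x m grid with 1 where the peptide was identified, else 0.

    `identified` is a collection of 0-based (i, j) fraction coordinates.
    """
    hits = set(identified)
    return [[1 if (i, j) in hits else 0 for j in range(m)] for i in range(n)]


def smooth_2x2(x):
    """Return S with S[i][j] = x[i][j] + x[i+1][j] + x[i][j+1] + x[i+1][j+1]."""
    n = len(x)
    m = len(x[0]) if n else 0

    def value(i, j):
        # Assumption: neighbours beyond the last row or column count as 0.
        return x[i][j] if i < n and j < m else 0

    return [
        [value(i, j) + value(i + 1, j) + value(i, j + 1) + value(i + 1, j + 1)
         for j in range(m)]
        for i in range(n)
    ]


# Toy 3 x 3 example: a 2 x 2 block of identifications plus one isolated hit.
peptide_map = binary_map({(0, 0), (0, 1), (1, 0), (1, 1), (2, 2)}, n=3, m=3)
smoothed = smooth_2x2(peptide_map)
```

In the toy example only the node at the corner of the fully populated 2 × 2 block reaches the maximum value of 4, while the isolated identification contributes at most 1 to any node, which is the sense in which smoothing damps sporadic single-fraction hits.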
Second, the clusters are defined by selecting the nodes with values equal to or exceeding $k$ ($k_{\max} = 4$) and separated by a gap of at least $g$ fractions ($g = 2$ for this fractionation experiment, based on the chromatographic resolution of the system). The number of identified peptide clusters is then averaged across all peptides for a given protein to give a cluster score assigned to this protein. The output consists of all proteins ranked by the cluster score, with the cluster statistics described at the peptide level.

Data Visualization. The visualization application requires several input data matrices. For each protein (a single IPI or a protein group of multiple IPI numbers considered to represent the same protein), the n × m data matrix of fractions (as described above), along with the ratio vector and a vector of the number of spectral events for each peptide in each fraction, is passed to the software. A vector of sequence position information scaled to 0-1 is passed to the application as well. All preprocessing of the data is accomplished prior to passing the data to the visualization tool. The outputs of the application include three figures, saved as picture files (JPEG format): a figure that combines fractionation, ratio, and peptide sequence information (such as Figure 1A), a histogram of the ratios (if available; see Figure 1B), and a figure of total spectral events for each fraction (such as Figure 1C). The Matlab code for the application is available upon request from the author.

Results

Visualization of the IPAS Proteomics Data. The data generated in comparative proteomics experiments that utilize extensive protein fractionation contain information related to isoforms that could be mined but is generally not systematically analyzed. Such information is intrinsic to the locations (fractions) in which proteins were identified. Thus, chromatographic properties contain information that could be used to make inferences about subspecies/isoforms of proteins that elute differently but may be the products of the same gene.
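To make the cluster score defined in Methods concrete, the following Python sketch counts clusters on already-smoothed peptide maps and averages the counts per protein. It is a reading of the description under stated assumptions, not the authors' implementation: the names `count_clusters` and `cluster_score` are hypothetical, two above-threshold nodes are assumed to belong to the same cluster when fewer than g fractions separate them along both axes (Chebyshev distance at most g), and k is left as a required argument because the text gives only its maximum value.

```python
def count_clusters(smoothed, k, g=2):
    """Count groups of nodes with value >= k that are separated by >= g fractions."""
    nodes = {
        (i, j)
        for i, row in enumerate(smoothed)
        for j, s in enumerate(row)
        if s >= k
    }
    clusters = 0
    while nodes:
        clusters += 1
        stack = [nodes.pop()]
        while stack:
            i, j = stack.pop()
            # Assumption: nodes within Chebyshev distance g have a gap of
            # fewer than g fractions between them, so they share a cluster.
            close = [p for p in nodes if max(abs(p[0] - i), abs(p[1] - j)) <= g]
            for p in close:
                nodes.remove(p)
                stack.append(p)
    return clusters


def cluster_score(smoothed_maps, k, g=2):
    """Average the per-peptide cluster counts for one protein."""
    counts = [count_clusters(s, k, g) for s in smoothed_maps]
    return sum(counts) / len(counts) if counts else 0.0


# Hypothetical smoothed maps for two peptides of one protein (1 x 8 grids).
peptide_a = [[2, 2, 0, 0, 0, 2, 0, 0]]  # two groups, three empty fractions apart
peptide_b = [[2, 0, 2, 0, 0, 0, 0, 0]]  # one empty fraction apart: one cluster
score = cluster_score([peptide_a, peptide_b], k=2)
```

With these toy inputs the first peptide yields two clusters and the second yields one, so the protein-level score is their mean. Other distance definitions or a different handling of peptides with no above-threshold nodes would change the score, so those choices should be checked against the original code before reuse.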
In this study, we analyzed data from 132 serum fractions that resulted from 2-D fractionation of intact (undigested) proteins. Figure 1A shows a representation of the 2-D fractionation as a grid with the nodes denoting the fractions. The particular peptides identified in a protein can be used to infer cleavages, as in the case of surface proteins that shed their extracellular domains. We have devised a way of capturing this information on the fractionation grid, whereby a set of concentric circles represents the sequentially organized peptides. The circles are scaled in such a way that the size of the circle indicates the relative distance from the N-terminus of the protein, with the peptide represented by the smallest circle being closest to the N-terminus and the largest circle denoting the peptide closest to the C-terminus. Such visualization aids in immediately discerning a fragment: if a set of peptides appears doughnut-shaped in one or more fractions (such as the fraction with coordinates [x = 2, y = 5] in Figure 1A), that set of peptides would be derived from the C-terminal portion of the protein. If the peptides in a given fraction are represented by a set of small circles (relative to all the peptides identified in the fractions, as shown in the figure inset), such as in the fraction with coordinates [x = 7, y = 3], then the fragment is derived from the N-terminal portion of the protein. Thus, visualization allows an immediate grasp of four characteristics for each protein: the two chromatographic properties, the distribution of peptides along the sequence, and, in comparative quantitative studies, the differential ratio. Furthermore, the same visualization approach can be used for representing the number of MS events for a given protein in a given fraction (Figure 1C). Additional information is provided in an accompanying histogram of all ratios for a given protein in the experiment (Figure 1B).

Automated Detection of Chromatographic Clusters.
We developed a simple pattern recognition algorithm (see Methods) to identify and flag proteins that show distinct chromatographic clusters, such as shown in Figure 2A. The cluster identification occurs at the peptide level, and the number of clusters is then averaged across all the peptides for a single protein to derive a protein score. Figure 2A shows that there is no correlation between the number of unique peptides and the average number of identified clusters. The increase in the number of clusters for proteins identified with a single peptide in multiple fractions is most likely due to incorrect identifications. The single-peptide hits were not included in subsequent analysis. The analysis shows that, out of 1224 proteins covered by more than one unique peptide, 295 proteins showed chromatographic heterogeneity at the peptide level. Such heterogeneity could be due to multiple factors, including MS sampling, precursor/mature protein forms, multichain proteins connected by S-S bridges, splice isoforms, post-translational modifications, and proteolytic fragments. The algorithm flags all these instances as long as they are manifested in a discontinuous elution profile for a given protein. The histogram in Figure 2B shows that the majority of heterogeneous proteins show fewer than two clusters per protein (averaged number of identified peptide clusters).

Figure 2. Relationship between the number of peptides and the average number of clusters per protein. (A) The average cluster score (number of identified chromatographic clusters averaged across all the peptides per protein) is plotted against the total number of unique peptides for the corresponding protein. (B) The histogram of average cluster scores across all proteins with two or more unique peptides.
Such low average cluster counts are reasonable given the limited resolution of the system (11 × 12 fraction grid).

Identification of Proteins and Their Cleavage Products. Most proteins are synthesized in vivo in the form of an inactive precursor that is cleaved upon a physiological event, either locally or upon extracellular release. We analyzed human plasma for the presence of such precursor/mature protein pairs, using our pattern detection algorithm to flag potential isoforms. Out of 295 proteins that were flagged as heterogeneous, 176 (or 60%) were consistent with precursors. Figure 1A shows an example of the detection of the full-length precursor and the mature form of hepatocyte growth factor activator (HGFA), identified in the IPAS experiment with 14 peptides. As can be observed in Figure 1A, the protein species elute as separate clusters that correspond to the mature protein as well as the corresponding precursor part removed upon cleavage (see Figure 3 for explanation). The detection algorithm flags this protein as heterogeneous, and fragments may be discerned upon inspection of the plot. The precursor for HGFA does not convert single-chain HGF to its biologically active form (12). However, cleavage of pre-HGFA at R407-I408 and R372-V373 converts it to its active two-chain form. Figure 1A shows that we detect several forms. R407-I408 corresponds to position 0.62 on the 0-1 scale from the N- to the C-terminus of the 655-amino-acid HGFA, and R372-V373 to 0.57. Indeed, we identified two sets of fractions that correspond to the precursor part that is removed in the mature form (sequence 36:372, or 0.06:0.57) as well as the two chains of the mature protein itself (0.57-0.6 and 0.72-0.98, peptides 9-14). Interestingly, the short chain of the mature protein (peptide 8) yielded only a single identified peptide, which elutes separately from the long chain of HGFA.

Analysis of Protein Isoforms. Alternative splicing can produce protein isoforms that are distinguishable in IPAS experiments. Here, we show one such example, fibulin-1 (FBLN1).
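As a brief aside on the HGFA coordinates quoted in the preceding subsection, the 0-1 sequence scale is consistent with dividing a residue number by the protein length. The Python sketch below checks this arithmetic for the two cleavage sites; the function names are hypothetical, and the choice of dividing by the full length (rather than, for example, length minus one) is an assumption that happens to reproduce the quoted values.

```python
def scaled_position(residue, protein_length):
    """Map a 1-based residue number onto the 0-1 scale (N- to C-terminus)."""
    return residue / protein_length


def scaled_range(start, end, protein_length):
    """Scale a peptide's start and end residues to the 0-1 scale."""
    return (
        scaled_position(start, protein_length),
        scaled_position(end, protein_length),
    )


HGFA_LENGTH = 655  # amino acids, as stated in the text

# Cleavage sites R407-I408 and R372-V373, rounded to two decimals.
site_407 = round(scaled_position(407, HGFA_LENGTH), 2)
site_372 = round(scaled_position(372, HGFA_LENGTH), 2)
```

The same scaling applied to a peptide's start and end residues (via `scaled_range`) gives the 0-1 ranges that determine circle sizes in the peptide maps.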
Fibulin-1 is an extracellular matrix protein that is known to have four different isoforms (for a recent review, see Gallagher et al., ref 13). In the IPAS experiment described here, we identified peptides that map to FBLN1 and identify at least two groups of isoforms: isoform C and isoforms B and D. The latter are indistinguishable by the identified peptides and are referred to here as isoform B/D. Figure 4 shows the fractionation pattern of FBLN1. The differences between the isoforms lie in the C-terminal portion of FBLN1. Figure 4A shows the fractions in which isoform B/D was identified by unique peptides (peptides 14 and 15), whereas isoform C was identified by its corresponding C-terminal peptides (peptides 11 and 12 in Figure 4B). Isoform B/D elutes in the earlier ion-exchange and reversed-phase HPLC fractions. There is also some, albeit incomplete, separation of isoforms by reversed-phase HPLC for the late-eluting ion-exchange fractions. Analysis of the peptide composition shows no evidence that the earlier-eluting fractions result from fragmentation of the later-eluting full-length protein. Such differences might be due to variation in the glycosylation pattern of FBLN1. The contribution of each isoform to the overall FBLN1 ratio could not be assessed in this study, because the Cys-containing peptides originate from the region of the FBLN1 sequence common to all known isoforms. However, the presence of several isoforms that are partially resolved chromatographically is demonstrated.

The utility of the visualization algorithm can be illustrated by Figure 5A, where two subspecies of coagulation factor F11 are shown. The detection algorithm flags F11 (IPI00008556) as a chromatographically heterogeneous protein with two distinct species (Figure 5A). The Swiss-Prot annotation (P03951) indicates that two splice isoforms have been identified for this protein. However, the visualization plot suggests that the two chromatographic species are unlikely to be splice isoforms, because the sequence that is missing in isoform 2, and that distinguishes it from isoform 1, is present in both clusters (the sequence maps to the range 0.17-0.30). An alternative explanation is a difference in glycosylation pattern (F11 is heavily glycosylated).

Figure 3. Hepatocyte growth factor activator protein. The sequence of the precursor is shown with the signal peptide in black letters, the prepropeptide removed in the mature protein in red letters, the short chain in blue letters, and the long chain in green letters. The underlined peptides denote those identified by mass spectrometry in 132 fractions.

Figure 4. Fibulin-1 isoforms. (A) Total MS events map for the 2-D separation. The grid represents the 2-D chromatography fractionation (12 × 12 fractions). The X-axis represents 12 fractions of ion-exchange chromatography and the Y-axis, 12 fractions of RP-HPLC. Each node of the grid shows the number of MS events corresponding to FBLN1, while the size of the circle visually reflects that number. (B) The peptide and ratio map of the 2-D chromatography fractionation. Each node of the grid shows the fraction location as in (A). Information is provided regarding fractions in which FBLN1 was found and the related peptides that were identified (the full list is displayed in the figure inset). Peptides are shown as concentric circles of different colors, whereby the size of the circle indicates the relative distance of the peptide from the N-terminus of the full protein sequence.

Discussion

Fractionation based on chromatographic properties yields a fingerprint of a protein that is determined by structural variations in the protein. High-resolution HPLC systems, such as modern reversed-phase and ion-exchange HPLC, yield 2-D fractionation patterns that allow inferences to be made regarding single-protein heterogeneity. We have utilized this chromatographic pattern information, along with sequence mapping of identified peptides, to gain insight into potential fragmentation patterns, splice isoforms, or other sources of protein heterogeneity that might be found in a sample. To reduce data complexity and allow an easier grasp of multidimensional proteomic data, we developed a visualization method that combines three sources of information (four dimensions of data) in one two-dimensional plot. Along with the visualization tool, we also developed a simple pattern recognition algorithm to automatically detect and flag potentially heterogeneous species of proteins in experiments such as IPAS, which involve extensive fractionation and identify more than a thousand serum or plasma proteins in one experiment (4).

Given that proteins are identified by matching their corresponding peptide mass spectra to sequence databases, the isoform identification process is dependent on accurate peptide identifications. The goal of the automated detection algorithm we have developed is to reduce data complexity by eliminating proteins that do not show heterogeneity, leaving it to the researcher, aided by the visualization tool, to make final decisions about the flagged proteins. It is desirable to estimate a false-discovery rate (FDR) for the list of proteins deemed heterogeneous by the algorithm. To address this problem, a benchmark set of known heterogeneous proteins that are resolved by chromatography would be useful for developing an algorithm for FDR estimation. In this publication, we provide two examples in which the observed heterogeneity of a protein (HGFA and FBLN1) could indicate that a true precursor/mature protein pair (in the case of HGFA) or different splice isoforms (in the case of FBLN1) are present in the samples. However, a definitive assessment requires biochemical evidence to validate the finding of distinct species for the same protein.
Nevertheless, as shown in this paper with the example of coagulation factor F11, our visualization software tool enables the researcher to rule out a hypothesis, such as the presence of alternatively spliced isoforms in the case of F11.

Our approach allows us to start compiling a list of proteins that could serve as a benchmark set for performance evaluation of future isoform detection algorithms. Figure 5 shows an example of two such proteins, F11 and LCAT. Compiling a comprehensive data set for benchmarking of isoform detection algorithms is beyond the scope of this paper and will be addressed in future publications. Such a protein set should satisfy at least the following criteria: the species of a protein should (a) be well-defined and characterized biochemically; (b) be detectable in normal plasma in quantities that allow good peptide coverage in MS; and (c) have large enough differences to be separable by common methods of protein fractionation.

In conclusion, we have developed a visualization tool to aid in making inferences about the heterogeneity of proteins identified in proteomics experiments that utilize extensive fractionation. We also provide a simple algorithm to detect and flag potential splice isoforms, mature/precursor protein combinations, and other types of protein structural variation.

Figure 5. Protein heterogeneity for LCAT and F11. The peptide map of the 2-D chromatography fractionation. Each node of the grid shows the fraction location as in Figure 4A. (A) F11; peptides 6-9 are present in both chromatographically distinct clusters. Region 0.1-0.30 of the sequence of the F11 protein is missing in alternatively spliced isoform 2 (see text for details). (B) LCAT protein; N-glycosylation of LCAT has been shown by mass spectrometry (14).

References

(1) Palagi, P. M.; Hernandez, P.; Walther, D.; Appel, R. D. Proteome informatics I: bioinformatics tools for processing experimental data. Proteomics 2006, 6 (20), 5435-5444.
(2) Spellman, P. T.; Sherlock, G.; Zhang, M. Q.; Iyer, V. R.; Anders, K.; Eisen, M. B.; Brown, P. O.; Botstein, D.; Futcher, B. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 1998, 9, 3273-3297.
(3) Khatib, A.-M.; Geraldine, S. Growth Factors: To Cleave or not To Cleave. In Regulation of Carcinogenesis, Angiogenesis and Metastasis by the Proprotein Convertases (PCs), A New Potential in Cancer Therapy; Khatib, A.-M., Ed.; Springer: The Netherlands, 2006; pp 121-135.
(4) Faca, V.; Pitteri, S.; Newcomb, L.; Glukhova, V.; Phanstiel, D.; Krasnoselsky, A.; Zhang, Q.; Struthers, J.; Wang, H.; Eng, J.; Fitzgibbon, M.; M, M.; Hanash, S. Contribution of protein fractionation to depth of analysis of the serum and plasma proteomes. J. Proteome Res. 2007, 6 (9), 3558-3565.
(5) Faca, V.; Coram, M.; Phanstiel, D.; Glukhova, V.; Zhang, Q.; Fitzgibbon, M.; McIntosh, M.; Hanash, S. Quantitative analysis of acrylamide labeled serum proteins by LC-MS/MS. J. Proteome Res. 2006, 5 (8), 2009-2018.
(6) Rauch, A.; Bellew, M.; Eng, J.; Fitzgibbon, M.; Holzman, T.; Hussey, P.; Igra, M.; Maclean, B.; Lin, C. W.; Detter, A.; Fang, R.; Faca, V.; Gafken, P.; Zhang, H.; Whitaker, J.; States, D.; Hanash, S.; Paulovich, A.; McIntosh, M. W. Computational Proteomics Analysis System (CPAS): an extensible, open-source analytic system for evaluating and publishing proteomic data and high throughput biological experiments. J. Proteome Res. 2006, 5 (1), 112-121.
(7) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20 (9), 1466-1467.
(8) Maclean, B.; Eng, J. K.; Beavis, R. C.; McIntosh, M. General framework for developing and evaluating database scoring algorithms using the TANDEM search engine. Bioinformatics 2006, 22 (July 28), 2830-2832.
(9) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002, 74, 5383-5392.
(10) Nesvizhskii, A. I.; Keller, A.; Kolker, E.; Aebersold, R. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 2003, 75 (17), 4646-4658.
(11) Kersey, P.; Duarte, J.; Williams, A.; Karavidopoulou, Y.; Birney, E.; Apweiler, R. The International Protein Index: an integrated database for proteomics experiments. Proteomics 2004, 7, 1985-1988.
(12) Miyazawa, K.; Shimomura, T.; Naka, D.; Kitamura, N. Proteolytic activation of hepatocyte growth factor in response to tissue injury. J. Biol. Chem. 1994, 269 (12), 8966-8970.
(13) Gallagher, W. M.; Currid, C. A.; Whelan, L. C. Fibulins and cancer: friend or foe. Trends Mol. Med. 2005, 11 (7), 336-340.
(14) Liu, T.; Qian, W. J.; Gritsenko, M. A.; Camp, D. G., 2nd.; Monroe, M. E.; Moore, R. J.; Smith, R. D. Human plasma N-glycoproteome analysis by immunoaffinity subtraction, hydrazide chemistry, and mass spectrometry. J. Proteome Res. 2005, 4 (6), 2070-2080.

PR7007219
