Isotopic Peak Intensity Ratio Based Algorithm for Determination of Isotopic Clusters and Monoisotopic Masses of Polypeptides from High-Resolution Mass Spectrometric Data

Kunsoo Park,* Joo Young Yoon, Sunho Lee, Eunok Paek,* Heejin Park, Hee-Jung Jung, and Sang-Won Lee

School of Computer Science and Engineering, Seoul National University, Seoul, Korea, Department of Mechanical and Information Engineering, University of Seoul, Seoul, Korea, College of Information and Communications, Hanyang University, Seoul, Korea, and Department of Chemistry and Center for Electro- and Photo-Responsive Molecules, Korea University, Seoul, Korea

Determining isotopic clusters and their monoisotopic masses is a first step in interpreting complex mass spectra generated by high-resolution mass spectrometers. We propose a mathematical model for isotopic distributions of polypeptides and an effective interpretation algorithm. Our model uses two types of ratios: the intensity ratio of two adjacent peaks and the intensity ratio product of three adjacent peaks in an isotopic distribution. These ratios can be approximated as simple functions of a polypeptide mass, the values of which fall within certain ranges, depending on the polypeptide mass. Given a spectrum as a peak list, our algorithm first finds all isotopic clusters consisting of two or more peaks. Then, it scores clusters using the ranges of ratio functions and computes the monoisotopic masses of the identified clusters. Our method was applied to high-resolution mass spectra obtained from a Fourier transform ion cyclotron resonance (FTICR) mass spectrometer coupled to reverse-phase liquid chromatography (RPLC). For polypeptides whose amino acid sequences were identified by tandem mass spectrometry (MS/MS), we applied both THRASH-based software implementations and our method. Our method was observed to find more masses of known peptides when the numbers of the total clusters identified by both methods were fixed. Experimental results show that our method performed better for isotopic mass clusters of weak intensity where the isotopic distributions deviate significantly from their theoretical distributions. Also, it correctly identified some isotopic clusters that were not found by THRASH-based implementations, especially those for which THRASH gave 1 Da mismatches. Another advantage of our method is that it is very fast, much faster than THRASH, which calculates a least-squares fit.

With the introduction of soft ionization methods such as electrospray ionization (ESI) (1) and matrix-assisted laser desorption/ionization (MALDI) (2), mass spectrometry (MS) has been one of the most robust and powerful analytical tools to characterize large biological molecules. MS-based proteomic experiments have provided valuable biological information, including qualitative and quantitative identification of proteome and the types and degrees of post-translational modifications. In particular, high-resolution mass spectrometers, such as Fourier transform ion cyclotron resonance (FTICR) or Orbitrap mass spectrometers, greatly improved the accuracy of proteomic information.

In a common experimental practice of shotgun proteomics, precursor peptides are dynamically selected for fragmentation with exclusion to prevent repetitive acquisition of MS/MS spectra for the same peptide. While this experimental scheme greatly increased the throughput of proteomic experiments, it often incurs fragmentation of peptide ions having weak intensities. MS data of such weak ions exhibit nonstatistical isotopic distributions with missing peaks, which lead to inaccurate determination of monoisotopic masses. A recent study showed that the portion of wrong interpretations of precursor ion mass is up to 40% (3). Overlapping isotopic clusters are often observed with complex proteome samples and result in wrong interpretation of their masses as well. It is also well-known that MS/MS spectra from ECD on intact proteins often suffer from inaccurate extraction of fragment mass information due to nonideal and overlapping isotopic clusters (4).

Determining isotopic clusters and their monoisotopic masses is the first step in interpreting complex mass spectra generated by high-resolution mass spectrometers such as FTICR or Orbitrap.

* To whom correspondence should be addressed. Eunok Paek, Department of Mechanical and Information Engineering, University of Seoul, Seoul, 130-743, Korea. Phone: +82-2-2210-2680. Fax: +82-2-2210-5575. E-mail: [email protected]. Kunsoo Park, School of Computer Science and Engineering, Seoul National University, Seoul, 151-742, Korea. Phone: +82-2-880-8381. Fax: +82-2-885-3141. E-mail: [email protected].

(1) Fenn, J. B.; Mann, M.; Meng, C. K.; Wong, S. F.; Whitehouse, C. M. Mass Spectrom. Rev. 1990, 9, 37-70.
(2) Karas, M.; Hillenkamp, F. Anal. Chem. 1988, 60, 2299-2301.
(3) Shin, B.; Jung, H.-J.; Hyung, S.-W.; Kim, H.; Lee, D.; Lee, C.; Yu, M.-H.; Lee, S.-W. Mol. Cell. Proteomics 2008, 7, 1124-1134.
(4) Zubarev, R. A.; Kelleher, N. L.; McLafferty, F. W. J. Am. Chem. Soc. 1998, 120, 3265-3266.

Anal. Chem. 2008, 80, 7294-7303

10.1021/ac800913b CCC: $40.75. 2008 American Chemical Society. Analytical Chemistry, Vol. 80, No. 19, October 1, 2008. Published on Web 08/28/2008.


An LC/MS/MS experiment using these mass spectrometers routinely generates high-resolution MS data (usually on the order of $10^4$ spectra) along with MS/MS spectra in large quantity. Fast, automated, and accurate interpretation of this very large amount of MS data is a fundamental and critical step in MS-based proteomic experiments and remains the subject of much research activity. Mann et al. (5) suggested a deconvolution algorithm to find charge states. Senko et al. (6) introduced the notion of an average amino acid called averagine and suggested a computational method for determination of monoisotopic masses using it. Zscore (7) is a fast and automated isotopic cluster identification algorithm based on a charge scoring scheme. Many other algorithms such as ESI-ISOCONV (8), MATCHING (9), PepList (10), LASSO (11), AID-MS (12), and THRASH (13) were reported.

Among these algorithms, THRASH has been one of the most widely used. It employs the Fourier transform/Patterson method for charge determination and least-squares fitting to compare a peak cluster with an averagine isotopic distribution. However, the use of least-squares fitting and/or averagine isotopic distribution often leads to an inaccurate monoisotopic mass that is 1-2 Da different from the correct value (12). In addition, since least-squares fitting is not a computationally efficient operation, THRASH is known to be computationally demanding.

In this paper, we present a new probabilistic model of an isotopic distribution, which regards peak intensities in an isotopic distribution as the existential probabilities of isotope compositions. Our distribution model has two feature functions: intensity ratios of two adjacent peaks and intensity ratio products of three adjacent peaks in an isotopic cluster. We show that the intensity ratios can be approximated as linear functions of polypeptide mass values and that the intensity ratio products can be approximated as constants. These approximations can be computed from theoretical distributions of tryptic peptides generated from a protein database. On the basis of our model, we propose an algorithm that determines isotopic clusters and their monoisotopic masses accurately. Our algorithm outperforms two THRASH implementations, ICR2LS and Decon2LS (http://ncrr.pnl.gov/software/), in both accuracy and speed, as demonstrated with an LC/MS/MS data set of known standard peptide samples. Our program is available via an e-mail to: [email protected].

    EXPERIMENTAL DATASETS

LC/MS/MS Experiments. We tested our algorithm on a data set from tryptic digests of an 18 protein mixture, the ISB standard protein mix (14). (This mixture was generously provided by the Aebersold group.) The tryptic peptides of the 18 protein mixture were separated using a modified version of the nanoACQUITY UPLC (NanoA, Waters, Milford) system, having a maximum operating pressure of 10 000 psi. Briefly, the NanoA system was modified to equip an RPLC capillary column (75 µm i.d. × 360 µm o.d. × 80 cm length, C18-bonded particles, 3 µm, 300 Å pore size, Jupiter, Phenomenex) and an SPE column. The SPE column was prepared by packing a 1-cm-long liner (250 µm i.d.) inside an internal reducer (1/16 in. to 1/32 in.; VICI) with the same C18-bonded particles. The peptides were eluted by a mixture of solvents A (0.1% formic acid in water) and B (99.9% acetonitrile, 0.1% formic acid in water), where the percentage of solvent B was increased linearly from 0 to 15% over 5 min, then increased to 50% over 120 min, and finally increased to 100% over 10 min, where it was maintained for 10 min prior to re-equilibration with solvent A.

A 7-T FTICR mass spectrometer (LTQ-FT, Thermo Electron, San Jose, CA) was used to collect the mass spectra. MS precursor ion scans (m/z 400-2000) were acquired in full-profile mode (i.e., with no baseline truncation) with an AGC target value of $1 \times 10^6$, a mass resolution of $1 \times 10^5$, and a maximum ion accumulation time of 1000 ms. Acquisition of an MS scan in full-profile mode significantly increases the data size: one full LC/MS experiment would result in an MS result file (.raw file) exceeding 2 GB, which cannot be handled in the current Xcalibur software and other MS data analysis tools that utilize Xcalibur's API to handle the raw file. We divided one full LC/MS experiment of the ISB standard peptide mix into five 30-min experiments (i.e., five segments) by placing five MS acquisition sequences consecutively during an LC gradient. The mass spectrometer was operated in data-dependent tandem MS mode; the seven most abundant ions detected in a precursor MS scan were dynamically selected for MS/MS experiments simultaneously incorporating a dynamic exclusion option (exclusion mass width low, 1.10 Th; exclusion mass width high, 2.10 Th; exclusion list size, 120; exclusion duration, 30 s). Collision-induced dissociations of the precursor ions were performed in an ion trap (LTQ) with the collisional energy and isolation width set to 35% and 3 Th, respectively. The Xcalibur software package (v. 2.0 SR1, Thermo Electron) was used to construct the experimental methods.

Database Search. All MS/MS data (i.e., DTA files) were subjected to the postexperiment monoisotopic mass filtering and refinement (PE-MMR) process (3) before they were searched against a protein database containing sequences of the 18 proteins and common contaminant sequences. The tolerance was set to 10 ppm for precursor ions and 1 Da for fragment ions. Variable modification options were used for the carbamidomethylation of cysteine and arginine (57.021460 Da) and the oxidation of methionine (15.994920 Da). The search results were subsequently subjected to statistical validation by PeptideProphet, and the peptide IDs with a probability score of 0.5 or higher (839 nonredundant peptides) were further analyzed by manual inspection to produce the final 494 nonredundant peptide sequences from the 18 protein analysis.

(5) Mann, M.; Meng, C. K.; Fenn, J. B. Anal. Chem. 1989, 61, 1702-1708.
(6) Senko, M. W.; Beu, S. C.; McLafferty, F. W. J. Am. Soc. Mass Spectrom. 1995, 6, 229-233.
(7) Zhang, Z. Q.; Marshall, A. G. J. Am. Soc. Mass Spectrom. 1998, 9, 225-233.
(8) Wehofsky, M.; Hoffman, R. J. Mass Spectrom. 2002, 37, 223-229.
(9) Fernandez-de-Cossio, J.; Gonzalez, L. J.; Satomi, Y.; Betancout, L.; Ramos, Y.; Huerta, V.; Besada, V.; Padron, G.; Minamino, N.; Takao, T. Rapid Commun. Mass Spectrom. 2004, 19, 2465-2472.
(10) Li, X.; Yi, E. C.; Kemp, C. J.; Zhang, H.; Aebersold, R. Mol. Cell. Proteomics 2005, 4, 1328-1340.
(11) Du, P.; Angeletti, R. H. Anal. Chem. 2006, 78, 3385-3392.
(12) Chen, L.; Sze, S. K.; Yang, H. Anal. Chem. 2006, 78, 5006-5018.
(13) Horn, D. M.; Zubarev, R. A.; McLafferty, F. W. J. Am. Soc. Mass Spectrom. 2000, 11, 320-332.
(14) Klimek, J.; Eddes, J. S.; Hohmann, L.; Jackson, J.; Peterson, A.; Letarte, S.; Gafken, P. R.; Katz, J. E.; Mallick, P.; Lee, H.; Schmidt, A.; Ossola, R.; Eng, J. K.; Aebersold, R.; Martin, D. B. J. Proteome Res. 2008, 7, 96-103.


    METHODS

We first present a probabilistic model of an isotopic distribution of a polypeptide. Then, we describe our approximations of intensity ratio functions, which are the intensity ratios of two adjacent peaks in an isotopic distribution, and of intensity ratio product functions, the intensity ratio products of three adjacent peaks. Finally, our algorithm is shown to determine isotopic clusters and their monoisotopic masses in a fast and accurate manner.

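Isotopic Distribution Model. We first introduce some notation. Let $A = \{\mathrm{C}, \mathrm{H}, \mathrm{N}, \mathrm{O}, \mathrm{S}\}$ be the set of atoms that compose a polypeptide. For each atom $X \in A$, let $X^a$ denote the $+a$ isotope of atom $X$, and let $P_{X^a}$ denote its existential probability. For example, $P_{\mathrm{C}^1} = 0.01107$ because 1.107% of carbon atoms in nature are +1 isotopes (15). $\mathrm{C}_{n_\mathrm{C}}\mathrm{H}_{n_\mathrm{H}}\mathrm{N}_{n_\mathrm{N}}\mathrm{O}_{n_\mathrm{O}}\mathrm{S}_{n_\mathrm{S}}$ denotes the elemental composition of a polypeptide, where $n_X$ is the number of atoms $X$ in the polypeptide.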

Because of the isotopes, the mass of a polypeptide $\mathrm{C}_{n_\mathrm{C}}\mathrm{H}_{n_\mathrm{H}}\mathrm{N}_{n_\mathrm{N}}\mathrm{O}_{n_\mathrm{O}}\mathrm{S}_{n_\mathrm{S}}$ is not unique. If an instance of the polypeptide has four +1 isotopes, its mass is bigger by 4 Da than an instance of the polypeptide with no isotopes. The set of peaks generated by various instances of a polypeptide is called the isotopic cluster of the polypeptide. We define an isotopic distribution of a polypeptide as the theoretical masses and intensities of the peaks generated by all instances of the polypeptide. In an isotopic distribution, adjacent peaks are separated by 1 Da (average value 1.00235 Da) (12, 13). Let $I_k$ denote the intensity of the $k$th, $k \geq 0$, peak in an isotopic distribution. Specifically, intensity $I_0$ is the intensity of the monoisotopic peak and $I_k$, $k \geq 1$, is the intensity of the peak whose mass difference from the monoisotopic peak is $k$. We model $I_k$ as in Lemma 1 using the existential probability of the polypeptide instance whose mass is bigger by $k$ Da than the polypeptide instance with no isotopes. A detailed derivation of Lemma 1 is given in Supporting Information section S-1.

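Lemma 1. The intensity $I_k$ in an isotopic distribution approximates to

$$I_k = I_0 \sum_{k_1 + 2k_2 + 4k_4 = k} \frac{T_1^{k_1}\, T_2^{k_2}\, T_4^{k_4}}{k_1!\, k_2!\, k_4!}$$

where

$$T_1 = \sum_{X} n_X \frac{P_{X^1}}{P_{X^0}}, \quad T_2 = \sum_{X} n_X \frac{P_{X^2}}{P_{X^0}}, \quad \text{and} \quad T_4 = \sum_{X} n_X \frac{P_{X^4}}{P_{X^0}}.$$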

For example, when $k_1 + 2k_2 + 4k_4 = 4$, there are four cases: four +1 isotopes ($k_1 = 4$, $k_2 = 0$, $k_4 = 0$); two +1 isotopes and one +2 isotope ($k_1 = 2$, $k_2 = 1$, $k_4 = 0$); two +2 isotopes ($k_1 = 0$, $k_2 = 2$, $k_4 = 0$); and one +4 isotope ($k_1 = 0$, $k_2 = 0$, $k_4 = 1$). Hence $I_4$ approximates to $I_0 (T_1^4/4! + T_1^2 T_2/2! + T_2^2/2! + T_4)$.
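The sum in Lemma 1 can be evaluated by enumerating every nonnegative triple $(k_1, k_2, k_4)$ with $k_1 + 2k_2 + 4k_4 = k$. The following is a minimal sketch of that enumeration; the function name `relative_intensity` and the numeric values of `t1`, `t2`, and `t4` in the usage line are hypothetical and chosen only for illustration.

```python
from math import factorial


def relative_intensity(k, t1, t2, t4):
    """Return I_k / I_0 according to Lemma 1.

    Sums T1^k1 * T2^k2 * T4^k4 / (k1! k2! k4!) over all nonnegative
    integers with k1 + 2*k2 + 4*k4 == k.
    """
    total = 0.0
    for k4 in range(k // 4 + 1):
        for k2 in range((k - 4 * k4) // 2 + 1):
            k1 = k - 2 * k2 - 4 * k4  # remaining +1 isotopes
            total += (t1 ** k1) * (t2 ** k2) * (t4 ** k4) / (
                factorial(k1) * factorial(k2) * factorial(k4)
            )
    return total


if __name__ == "__main__":
    # Hypothetical T values, for illustration only.
    t1, t2, t4 = 1.1, 0.05, 0.001
    print([relative_intensity(k, t1, t2, t4) for k in range(5)])
```

For $k = 4$ the loop visits exactly the four cases listed in the text.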

Now we want to simplify further the mathematical form of the intensity $I_k$ in Lemma 1. We assume linearity between mass $m$ and the numbers of atoms, i.e., $n_X \approx a_X m$, where $a_X$ is a constant for each atom $X$, which may have a range of values according to the elemental compositions of polypeptides. If each $n_X$ is linear in $m$, then $T_1$, $T_2$, and $T_4$ are also linear in mass $m$ and $I_k$ becomes a polynomial of mass $m$. In the representation of $I_k$ by $T_1$, $T_2$, and $T_4$ in Lemma 1, the degree of $T_1$ determines that of $I_k$, which is $k$, because the term with the highest degree is $T_1^k/k!$ from the case of $k$ isotopes of +1 Da.

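Lemma 2. In an isotopic distribution of a polypeptide $\mathrm{C}_{n_\mathrm{C}}\mathrm{H}_{n_\mathrm{H}}\mathrm{N}_{n_\mathrm{N}}\mathrm{O}_{n_\mathrm{O}}\mathrm{S}_{n_\mathrm{S}}$, intensity $I_k$ approximates to a polynomial of mass $m$ with degree $k$, i.e., $I_k = c_k m^k + c_{k-1} m^{k-1} + \cdots + c_1 m + c_0$.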

Because of variations in elemental compositions, each of $T_1$, $T_2$, and $T_4$ has a range of constants in its linear form. For example, consider the extreme case that a polypeptide consists of one kind of amino acid: polypeptides of phenylalanine (F, $\mathrm{C_9H_9NO}$) give the maximum $T_1 = 6.97 \times 10^{-4} m$ and polypeptides of aspartic acid (D, $\mathrm{C_4H_5NO_3}$) the minimum $T_1 = 4.23 \times 10^{-4} m$. The average $T_1 = 5.43 \times 10^{-4} m$ is computed from the averagine $\mathrm{C_{4.9384}H_{7.7583}N_{1.3577}O_{1.4773}S_{0.0417}}$. Note that the averagine model fixes $T_1$, $T_2$, and $T_4$ as the average values for all values of $m$. However, we obtain both the minimum and maximum of $T_1$, $T_2$, and $T_4$ as linear forms in addition to their averages. From the ranges of values $T_1$, $T_2$, and $T_4$ can take, we can estimate the range of $I_k$.

Ratio Functions and Ratio Product Functions. On the basis of the approximation of $I_k$ given above, we first show that an intensity ratio, $I_{k+1}/I_k$, can be approximated by a linear function of polypeptide mass and that an intensity ratio product, $I_k I_{k+2}/I_{k+1}^2$, can be approximated by a constant function. Recently, a similar model using the intensity ratio was proposed independently, in which $I_{k+1}/I_k$ is modeled by a polynomial of mass (16). We show here that a simple linear approximation of $I_{k+1}/I_k$ suffices.

Second, we compute their average, minimum, and maximum functions using simulation spectra of tryptic polypeptides generated from a protein database. The algebraic estimation of min/max functions from $T_1$, $T_2$, and $T_4$ becomes harder for higher degree $k$, so we compute them using stochastic simulation. These intensity ratio and ratio product functions are simpler than the intensity itself and reveal more features of isotopic distributions.

From Lemma 2, $I_{k+1}/I_k$ is a ratio of two polynomials of degree $k + 1$ and $k$. For a sufficiently large mass $m$, the highest degree terms ($c_{k+1} m^{k+1}$ in $I_{k+1}$ and $c_k m^k$ in $I_k$) dominate and thus $I_{k+1}/I_k$ approximates to some linear function, $cm + b$.
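As a hedged illustration of where the slope and intercept come from, suppose $T_1 \approx t_1 m$ and $T_2 \approx t_2 m$; the slopes $t_1$ and $t_2$ are notation introduced here for the sketch and are not symbols from the original. Keeping only the two highest-degree terms of Lemma 1 in the numerator and the denominator (for $k \geq 2$; for smaller $k$ the terms with negative exponents are simply absent) gives:

```latex
\frac{I_{k+1}}{I_k}
  \approx \frac{\dfrac{T_1^{k+1}}{(k+1)!} + \dfrac{T_1^{k-1} T_2}{(k-1)!}}
               {\dfrac{T_1^{k}}{k!} + \dfrac{T_1^{k-2} T_2}{(k-2)!}}
  \approx \frac{T_1}{k+1} + \frac{2k}{k+1}\,\frac{T_2}{T_1}
  = \frac{t_1}{k+1}\, m + \frac{2k}{k+1}\,\frac{t_2}{t_1}
```

Under these assumptions the slope is $t_1/(k+1)$ and the intercept is a mass-independent constant, with the neglected terms shrinking as $1/m$. For $k = 0$ the intercept vanishes, consistent with $I_1/I_0 = T_1$ from Lemma 1. The constants actually used by the method come from fitting simulated spectra, not from this expansion.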

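Theorem 1. In an isotopic distribution of a polypeptide $\mathrm{C}_{n_\mathrm{C}}\mathrm{H}_{n_\mathrm{H}}\mathrm{N}_{n_\mathrm{N}}\mathrm{O}_{n_\mathrm{O}}\mathrm{S}_{n_\mathrm{S}}$, the ratio of two adjacent peaks, $I_{k+1}/I_k$, can be approximated by a linear function of the polypeptide mass.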

To determine the constants of the ratio function, $I_{k+1}/I_k = cm + b$, we sampled about 100 000 tryptic peptides of 400 Da to 5 200 Da generated from UniProt database 8.0 (17) and computed the ratio $I_{k+1}/I_k$ for each peptide. Figure 1 shows our ratio functions $I_{k+1}/I_k$ for $0 \leq k \leq 3$. For a sufficiently large mass $m \geq 1800$, it can be clearly seen that the intensity ratios can be approximated by linear functions of mass, represented as the solid lines in Figure 1, which is in accordance with our theoretical analysis. The solid line, named $R_{\mathrm{avg}}(k, m)$, is computed by linear regression using least-squares fitting in the gnuplot program (http://www.gnuplot.info). The dotted line, $R_{\max}(k, m)$, is the upper bound and the dashed line, $R_{\min}(k, m)$, is the lower bound of the ratios in the graph, also computed by linear regression using least-squares fitting. Note that the min/max functions, $R_{\min}(k, m)$ and $R_{\max}(k, m)$, represent the variation of $I_{k+1}/I_k$ due to the elemental composition of polypeptides of mass $m$. In Supporting Information Table S-1, we show that the average function $R_{\mathrm{avg}}(k, m)$ is very close to the line estimated by averagine.

(15) Beavis, R. B. Anal. Chem. 1993, 65, 496-497.
(16) Valkenborg, D.; Jansen, I.; Burzykowski, T. J. Am. Soc. Mass Spectrom. 2008, 19, 703-712.
(17) Wu, C. H.; Apweiler, R.; Bairoch, A.; Natale, D. A.; Barker, W. C.; Boeckmann, B.; Ferro, S.; Gasteiger, E.; Huang, H.; Lopez, R.; Magrane, M.; Martin, M. J.; Mazumder, R.; O'Donovan, C.; Redaschi, N.; Suzek, B. Nucleic Acids Res. 2006, 34, D187-191.

For a small mass $m < 1800$, we use the linear-like quotient of two polynomials with degrees $k + 1$ and $k$ in Lemma 2. In particular, $I_1/I_0$ is strongly linear for all $m$, because the quotient $I_1/I_0$ is $cm$. The reason for choosing the threshold 1800 is that a polypeptide within 1800 Da has the first and most abundant peak as its monoisotopic peak. In other words, $I_0$ is the most abundant and $I_{k+1}/I_k$, $k \geq 1$, becomes insignificant in the range of $m < 1800$. Note that the model by Valkenborg et al. (16) proposes a refined model of isotopic distributions for low-mass peptides by considering the number of sulfurs in the peptides, which explains the tails of ratios in the low mass range. However, our simple model performed well on the experimental data, and we expect that the experimental error in peaks dominates the theoretical error in our model.

In a similar way to Theorem 1, we obtain a constant approximation of the ratio product of three adjacent peaks (i.e., $(I_k/I_{k+1})(I_{k+2}/I_{k+1})$). From Lemma 2, the degrees of $I_k I_{k+2}$ and $I_{k+1}^2$ are both $2k + 2$. Hence, $I_k I_{k+2}/I_{k+1}^2$ can be approximated as a constant for polypeptides of sufficiently large masses.
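To make the degree argument concrete, the following worked equation keeps only the leading term $T_1^k/k!$ of each intensity from Lemma 1. This is a sketch of the large-mass limit only, not the fitted constants reported by the authors. The second expression is the $k = 0$ case written out in full from Lemma 1, which shows how a mass-dependent remainder of order $1/m$ sits on top of the constant.

```latex
\frac{I_k\, I_{k+2}}{I_{k+1}^{2}}
  \approx \frac{\dfrac{T_1^{k}}{k!}\cdot\dfrac{T_1^{k+2}}{(k+2)!}}
               {\left(\dfrac{T_1^{k+1}}{(k+1)!}\right)^{2}}
  = \frac{\big((k+1)!\big)^{2}}{k!\,(k+2)!}
  = \frac{k+1}{k+2},
\qquad
\frac{I_0\, I_2}{I_1^{2}} = \frac{1}{2} + \frac{T_2}{T_1^{2}}
```

Because $T_1$ and $T_2$ are both taken to be linear in $m$, the remainder $T_2/T_1^2$ decays like $1/m$, so the ratio product flattens toward a constant as the mass grows.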

Theorem 2. In an isotopic distribution of a polypeptide $\mathrm{C}_{n_\mathrm{C}}\mathrm{H}_{n_\mathrm{H}}\mathrm{N}_{n_\mathrm{N}}\mathrm{O}_{n_\mathrm{O}}\mathrm{S}_{n_\mathrm{S}}$, the ratio product of three adjacent peaks, $I_k I_{k+2}/I_{k+1}^2$, can be approximated by a constant.

Similarly to the ratio functions, we define ratio product functions $RP_{\max}(k, m)$, $RP_{\min}(k, m)$, and $RP_{\mathrm{avg}}(k, m)$, corresponding, respectively, to the maximum, the minimum, and the average values of $I_k I_{k+2}/I_{k+1}^2$. These functions are also computed from the peptide database (Figure 2 and Supporting Information Table S-2). We also divide the mass range at 1800 Da and compute the ratio products for the two intervals.

Algorithm Overview. We present an algorithm for determining isotopic clusters and their monoisotopic masses from a raw spectrum. Before describing our algorithm, we introduce several cluster names. A peak cluster indicates a list of peaks selected from a raw spectrum and sorted in increasing order of m/z. A pseudo (isotopic) cluster with charge state $C$ is a peak cluster such that the m/z difference of every adjacent peak pair in the peak cluster is $1/C$. An isotopic cluster with charge state $C$ is a pseudocluster with charge state $C$ such that the intensity pattern of the pseudocluster corresponds to that of an isotopic distribution. Our determination algorithm consists of the following four steps: (1) peak picking, (2) pseudocluster identification, (3) isotopic cluster identification and monoisotopic mass determination, and (4) duplicate cluster removal. We describe the steps one by one.

Peak Picking. We remove noise and select relatively high intensity peaks from the raw spectrum. This step is not closely related to the essence of our algorithm; rather, it depends on the noise pattern of the mass spectrometer. Thus, any peak picking algorithm that removes the noise from the raw spectrum well can be used. In our experiment, we used the peak picking algorithm of Decon2LS.

Figure 1. Ratio functions ($I_{k+1}/I_k$) obtained from stochastic simulation using 100 000 tryptic peptides sampled from the UniProt database. These four figures show the $k$th intensity ratios for $0 \leq k \leq 3$ of sampled peptides. For a sufficiently large mass of $m \geq 1800$, we represent the $k$th intensity ratio, $I_{k+1}/I_k$, by a linear function of polypeptide mass $m$ (i.e., $cm + b$) and compute its average (solid line), its upper bound (dotted line), and its lower bound (dashed line) by least-squares fitting. For a small mass of $m < 1800$, we employ the quotient of two polynomials with degrees $k + 1$ and $k$. Supporting Information Table S-1 compares the average ratio functions by the averagine model and by our fitting result.

Pseudocluster Identification. We identify pseudoclusters by scanning the selected peaks from low m/z to high m/z. Every time we examine a peak, we find all the pseudoclusters starting at the peak: we first find pseudoclusters with charge state 1+ and then find the other pseudoclusters with higher charge states by incrementing the charge state. We describe how to enumerate all pseudoclusters starting at a peak $P$ with a charge state $C$. We first enumerate pseudoclusters with two peaks and then pseudoclusters with more peaks. Let $X$ denote the m/z of $P$; we first find the next peaks of $P$, i.e., peaks in the mass range $[X + (D - E)/C,\ X + (D + E)/C]$, where $D$ is the estimated mass difference between two adjacent peaks in an isotopic cluster and $E$ is the error bound. In our experiment, $D$ is 1.00235, which is the mass difference of two adjacent averagine peaks, and $E = 10^{-5} X$, which corresponds to 10 ppm mass accuracy. By pairing $P$ and each next peak of $P$, we generate all pseudoclusters with two peaks. Once pseudoclusters with two peaks are enumerated, we enumerate pseudoclusters with three peaks by extending the pseudoclusters with two peaks to the second next peaks of $P$. In this way, we can enumerate all pseudoclusters starting at a peak $P$ with a charge state $C$.
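The enumeration described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the peak list is assumed to be a sorted list of m/z values, each extension window is assumed to be centered on the last peak already in the cluster (the text does not spell out which peak anchors later windows), and the names `next_peaks` and `pseudoclusters` are hypothetical.

```python
from bisect import bisect_left, bisect_right

D = 1.00235  # mass difference between adjacent isotopic peaks (Da), from the text


def next_peaks(mzs, i, charge, ppm=10.0):
    """Indices of peaks inside [X + (D - E)/C, X + (D + E)/C] for peak i."""
    x = mzs[i]
    e = ppm * 1e-6 * x  # E = 10^-5 * X when ppm = 10
    lo = x + (D - e) / charge
    hi = x + (D + e) / charge
    return range(bisect_left(mzs, lo), bisect_right(mzs, hi))


def pseudoclusters(mzs, start, charge, ppm=10.0):
    """All pseudoclusters (index lists with >= 2 peaks) that start at `start`.

    Assumption: each cluster is extended from its last peak, one peak at a time.
    """
    found = []
    frontier = [[start]]
    while frontier:
        extended = []
        for cluster in frontier:
            for j in next_peaks(mzs, cluster[-1], charge, ppm):
                extended.append(cluster + [j])
        found.extend(extended)
        frontier = extended
    return found


if __name__ == "__main__":
    # Hypothetical sorted m/z list: a 2+ cluster plus one unrelated peak.
    mzs = [500.0, 500.50118, 501.00235, 501.50353, 600.0]
    for charge in (1, 2):
        print(charge, pseudoclusters(mzs, 0, charge))
```

Because every window lies strictly above the current peak, each extension moves to a higher index and the loop terminates.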

Isotopic Cluster Identification and Monoisotopic Mass Determination. From the pseudoclusters, we identify isotopic clusters whose intensity patterns are similar to those of isotopic distributions. For each pseudocluster, we determine whether it is an isotopic cluster or not by checking the intensity ratio of every adjacent peak pair and the intensity ratio product of every three adjacent peaks in the pseudocluster. In determining isotopic clusters, we also consider the case that some peaks are missing in pseudoclusters, because sometimes the monoisotopic peak and its neighboring peaks are as small in intensity as the noise level and may be missing from a pseudocluster. Our algorithm allows up to three leftmost peaks to be missing in a pseudocluster. More specifically, we calculate scores for four cases (in which we assume that zero to three leftmost peaks are missing) and select the case with the highest score. If the score of the selected pseudocluster is above zero, it means that most of the ratios and ratio products lie between $R_{\min}(k, m)$ and $R_{\max}(k, m)$ and between $RP_{\min}(k, m)$ and $RP_{\max}(k, m)$, respectively. Therefore, the pseudocluster is selected and becomes an isotopic cluster. Otherwise, the pseudocluster is discarded.

Score calculation for a pseudocluster starts with monoisotopic mass calculation. The monoisotopic mass, denoted by $m$, is computed from the most abundant peak in the pseudocluster. If the most abundant peak is the $q$th peak in the pseudocluster and $p$ peaks are assumed to be missing, $m$ is computed as follows.

$$m = (\text{mass of the } q\text{th peak}) - 1.00235\,(q + p - 1)$$
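A minimal sketch of this mass calculation follows. The text does not say how a peak's m/z is converted to a mass, so the sketch assumes protonated ions and uses the proton mass 1.007276 Da; that conversion, and the function names, are assumptions made for illustration.

```python
PROTON = 1.007276  # Da; assumes ions of the form [M + zH]^z+ (not stated in the text)
D = 1.00235        # Da; spacing between adjacent isotopic peaks, from the text


def neutral_mass(mz, charge):
    """Convert an m/z value to a neutral mass under the protonation assumption."""
    return (mz - PROTON) * charge


def monoisotopic_mass(peaks, charge, p):
    """Monoisotopic mass of a pseudocluster.

    peaks: list of (mz, intensity) sorted by m/z.
    p: number of leftmost peaks assumed to be missing (0 to 3).
    """
    # q is the 1-based position of the most abundant peak in the pseudocluster.
    q = max(range(len(peaks)), key=lambda i: peaks[i][1]) + 1
    return neutral_mass(peaks[q - 1][0], charge) - D * (q + p - 1)


if __name__ == "__main__":
    # Hypothetical singly charged cluster whose second peak is the tallest.
    peaks = [(1001.007276, 10.0), (1002.009626, 30.0), (1003.011976, 20.0)]
    print(monoisotopic_mass(peaks, charge=1, p=0))
    print(monoisotopic_mass(peaks, charge=1, p=1))
```

Increasing `p` by one shifts the reported mass down by one isotope spacing, which is how the four missing-peak hypotheses differ.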

Figure 2. Ratio product functions ($I_k I_{k+2}/I_{k+1}^2$) obtained from stochastic simulation using 100 000 tryptic peptides sampled from the UniProt database. These four figures show the $k$th intensity ratio product for $0 \leq k \leq 3$ of sampled peptides. We represent the $k$th intensity ratio product, $I_k I_{k+2}/I_{k+1}^2$, by an approximate constant function of polypeptide mass $m$ (i.e., $t + c/(m + b)$) and compute its average (solid line), its upper bound (dashed line), and its lower bound (dotted line) by least-squares fitting. We use the same approximate constant functions but obtain divided fitting results by the mass range. Supporting Information Table S-2 compares the average ratio product functions by the averagine model and by our fitting result.

Figure 3. Numbers of identified clusters of 494 known peptides by each program. Isotopic clusters with the monoisotopic mass within a mass tolerance of 10 ppm are considered as the correct isotopic clusters of known peptides.

The score of a pseudocluster with $p$ peaks assumed missing is as follows.

$$\mathrm{Score} = \sum_{k=0}^{n-2} \mathrm{score}_R(k, p, m) + \sum_{k=0}^{n-3} \mathrm{score}_{RP}(k, p, m), \quad 0 \leq p \leq 3$$

where $n$ is the number of peaks in the pseudocluster.

The score is the sum of the ratio score, $\mathrm{score}_R(k, p, m)$, defined on every adjacent peak pair, and the ratio product score, $\mathrm{score}_{RP}(k, p, m)$, defined on every three adjacent peaks in a pseudocluster. Let intensity $I'_k$ be the intensity of the $(k + 1)$st peak in a pseudocluster. (Note that $I'_k$ corresponds to $I_{k+p}$ in the isotopic distribution.) The ratio score $\mathrm{score}_R(k, p, m)$ measures the similarity of the intensity ratio $I'_{k+1}/I'_k$ to the intensity ratio $I_{k+p+1}/I_{k+p}$ in the isotopic distribution whose monoisotopic mass is $m$:

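$$\mathrm{score}_R(k, p, m) = \begin{cases} 1 - \dfrac{I'_{k+1}/I'_k - R_{\mathrm{avg}}(k+p, m)}{R_{\max}(k+p, m) - R_{\mathrm{avg}}(k+p, m)} & \text{if } I'_{k+1}/I'_k > R_{\mathrm{avg}}(k+p, m) \\[2ex] 1 - \dfrac{R_{\mathrm{avg}}(k+p, m) - I'_{k+1}/I'_k}{R_{\mathrm{avg}}(k+p, m) - R_{\min}(k+p, m)} & \text{otherwise} \end{cases}$$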

The ratio score function consists of two linear function fragments of the ratio $I'_{k+1}/I'_k$ and is designed to have the maximum value 1 when the ratio is $R_{\mathrm{avg}}(k + p, m)$ and to have 0 when the ratio is $R_{\max}(k + p, m)$ or $R_{\min}(k + p, m)$. In addition, the score has negative values when the ratio is higher than $R_{\max}(k + p, m)$ or lower than $R_{\min}(k + p, m)$.
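The two linear fragments form a tent-shaped function, which can be sketched in a few lines. The function name `tent_score` and the bound values in the usage lines are hypothetical; in the method itself the bounds would come from the fitted functions $R_{\min}$, $R_{\mathrm{avg}}$, and $R_{\max}$ evaluated at $(k + p, m)$.

```python
def tent_score(value, lo, avg, hi):
    """Piecewise-linear score: 1 at `avg`, 0 at `lo` and `hi`, negative outside.

    `value` is an observed intensity ratio (or ratio product); `lo`, `avg`,
    and `hi` are the minimum, average, and maximum expected values.
    """
    if value > avg:
        return 1.0 - (value - avg) / (hi - avg)
    return 1.0 - (avg - value) / (avg - lo)


if __name__ == "__main__":
    # Hypothetical bounds for one ratio at some mass.
    lo, avg, hi = 0.4, 0.5, 0.7
    for ratio in (0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
        print(ratio, round(tent_score(ratio, lo, avg, hi), 3))
```

The same shape serves for the ratio product score once the three bounds are replaced by the ratio product bounds.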

The ratio product score $\mathrm{score}_{RP}(k, p, m)$ measures the similarity of the intensity ratio product $I'_k I'_{k+2}/I'^{\,2}_{k+1}$ to the intensity ratio product $I_{k+p} I_{k+p+2}/I_{k+p+1}^2$ in an isotopic distribution whose monoisotopic mass is $m$:

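$$\mathrm{score}_{RP}(k, p, m) = \begin{cases} 1 - \dfrac{I'_k I'_{k+2}/I'^{\,2}_{k+1} - RP_{\mathrm{avg}}(k+p, m)}{RP_{\max}(k+p, m) - RP_{\mathrm{avg}}(k+p, m)} & \text{if } I'_k I'_{k+2}/I'^{\,2}_{k+1} > RP_{\mathrm{avg}}(k+p, m) \\[2ex] 1 - \dfrac{RP_{\mathrm{avg}}(k+p, m) - I'_k I'_{k+2}/I'^{\,2}_{k+1}}{RP_{\mathrm{avg}}(k+p, m) - RP_{\min}(k+p, m)} & \text{otherwise} \end{cases}$$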

Refer to Supporting Information section S-2 for more techniques to improve the accuracy of our method.
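Putting the pieces together, the total score and the choice among the four missing-peak hypotheses can be sketched as below. The scoring callables `score_r` and `score_rp` and the callable `mass_for_p` are placeholders supplied by the caller; their signatures are assumptions made for this sketch, and the toy functions in the usage section are not the fitted functions from the paper.

```python
def cluster_score(intensities, p, m, score_r, score_rp):
    """Sum of ratio scores over adjacent pairs and ratio product scores over triples.

    intensities: observed intensities I'_0 .. I'_{n-1} of the pseudocluster.
    score_r(ratio, k, m) and score_rp(ratio_product, k, m) are called with
    the shifted index k + p, as in the score definition.
    """
    n = len(intensities)
    total = 0.0
    for k in range(n - 1):  # k = 0 .. n-2
        total += score_r(intensities[k + 1] / intensities[k], k + p, m)
    for k in range(n - 2):  # k = 0 .. n-3
        rp = intensities[k] * intensities[k + 2] / intensities[k + 1] ** 2
        total += score_rp(rp, k + p, m)
    return total


def best_assignment(intensities, mass_for_p, score_r, score_rp, max_missing=3):
    """Try p = 0..max_missing; return (score, p, m) of the best case.

    Returns None when the best score is not above zero, i.e., the
    pseudocluster is discarded.
    """
    best = None
    for p in range(max_missing + 1):
        m = mass_for_p(p)
        s = cluster_score(intensities, p, m, score_r, score_rp)
        if best is None or s > best[0]:
            best = (s, p, m)
    return best if best[0] > 0 else None


if __name__ == "__main__":
    # Toy scoring functions, for illustration only.
    toy_r = lambda ratio, k, m: 1.0 - abs(ratio - (k + 1))
    toy_rp = lambda rp, k, m: 0.0
    print(best_assignment([1.0, 2.0], lambda p: 1000.0 - p, toy_r, toy_rp))
```

Passing the scoring functions in as arguments keeps the summation independent of how the bounds were fitted.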

Duplicate Cluster Removal. Because we consider all possible pseudoclusters, many pseudoclusters can be generated from a single isotopic cluster. Suppose that there are five peaks and adjacent peaks are separated by 0.5 Th. In this case, a pseudocluster consisting of five peaks (with charge state 2+), a pseudocluster consisting of four peaks (missing the first peak), and a pseudocluster consisting of three peaks (with charge state 1+) can be generated. We call these clusters duplicate clusters and select one of them. (They are not overlapping clusters.) Generally, if two clusters share one or more peaks and the charge state of one is a multiple of the other, they are duplicate clusters. We then remove one of them as follows. First, we remove the isotopic cluster whose most abundant peak is smaller than the other's. If the most abundant peaks are the same, the isotopic cluster with the lower charge state is removed. If their charge states are also the same, the cluster with the lower score is removed.
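The pairwise rule above can be sketched as follows. The text only states how to choose between two duplicates; applying that rule greedily over a sorted list is an assumption of this sketch, as are the `Cluster` record and the function names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    peaks: frozenset      # indices of the peaks in the peak list
    charge: int
    top_intensity: float  # intensity of the most abundant peak
    score: float


def are_duplicates(a, b):
    """True if the clusters share a peak and one charge is a multiple of the other."""
    hi, lo = max(a.charge, b.charge), min(a.charge, b.charge)
    return bool(a.peaks & b.peaks) and hi % lo == 0


def remove_duplicates(clusters):
    """Keep, among duplicates, the cluster with the larger most abundant peak,
    then the higher charge state, then the higher score.

    Assumption: the pairwise rule is applied greedily in preference order.
    """
    preference = lambda c: (c.top_intensity, c.charge, c.score)
    kept = []
    for c in sorted(clusters, key=preference, reverse=True):
        if not any(are_duplicates(c, k) for k in kept):
            kept.append(c)
    return kept


if __name__ == "__main__":
    # The five-peak example from the text, with hypothetical intensities and scores.
    five = Cluster(frozenset({0, 1, 2, 3, 4}), 2, 100.0, 3.0)
    four = Cluster(frozenset({1, 2, 3, 4}), 2, 100.0, 2.0)
    three = Cluster(frozenset({0, 2, 4}), 1, 60.0, 1.0)
    print(remove_duplicates([three, four, five]))
```

In the five-peak example only the five-peak 2+ cluster survives: the four-peak cluster loses on score and the 1+ cluster loses on its most abundant peak.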

    RESULTS AND DISCUSSION

To evaluate the performance of our method, we compared it with ICR2LS and Decon2LS, both developed by the Smith group at Pacific Northwest National Laboratory (http://ncrr.pnl.gov/software/). ICR2LS is a powerful FTICR mass analysis software package. For deisotoping, it basically adapts THRASH. Decon2LS also adapts THRASH, but its algorithm has been modified to increase deisotoping speed, although the details of the improvements were not disclosed. All three programs were executed on the same PC (Pentium M processor 1.70 GHz, 1 GB RAM, Windows XP OS). To be as fair as possible to each program, parameters were set so that each method works on a similar number of total clusters. Our method and Decon2LS use the same peak picking method. The result of each peak picking program contained about 25 000 isotopic clusters.

Identification of Known Peptides. In comparing the three programs, we counted the number of identified isotopic clusters of known peptides whose amino acid sequences were identified by MS/MS. It is difficult, however, to pick out the isotopic clusters of known peptides because the MS data from an LC/MS/MS experiment can contain many peptides whose monoisotopic masses are very similar. Therefore we use the following method to classify peptides. For each known peptide (confidently identified by its MS/MS spectrum), we find isotopic clusters of this peptide at the MS scan where this peptide was identified by MS/MS. If an isotopic cluster has the monoisotopic mass within a mass tolerance of 10 ppm, we consider it a potentially correct isotopic cluster. We also look for this peptide in adjacent scans. If no isotopic cluster is found within any of 10 consecutive scans, the cluster is discarded. We regard these isotopic clusters as true positives.
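The 10 ppm criterion used in this classification amounts to a relative mass comparison. A minimal sketch is given below; the function names are hypothetical, and the sketch covers only the mass tolerance test, not the search across adjacent scans.

```python
def within_ppm(observed, theoretical, tol_ppm=10.0):
    """True if `observed` lies within `tol_ppm` parts per million of `theoretical`."""
    return abs(observed - theoretical) <= tol_ppm * 1e-6 * theoretical


def count_matches(cluster_masses, peptide_mass, tol_ppm=10.0):
    """Number of reported monoisotopic masses that match a known peptide mass."""
    return sum(within_ppm(m, peptide_mass, tol_ppm) for m in cluster_masses)


if __name__ == "__main__":
    # Hypothetical reported masses for a peptide of mass 2296.22 Da.
    reported = [2296.22, 2296.23, 2297.22, 765.40]
    print(count_matches(reported, 2296.22))
```

At 2296.22 Da the tolerance is about 0.023 Da, so a mass that is off by one isotope spacing is never counted as correct.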

    We counted the isotopic clusters of 494 known peptides. Figure 3 shows the number of isotopic clusters identified by each program. Our method shows a 10.6% improvement over ICR2LS and a 4.8% improvement over Decon2LS. To observe the performance according to the mass, we divided the 494 peptides into 500 Da intervals and counted the number of identified clusters of peptides that belong to each interval (Table 1). Our method works well regardless of peptide masses.

    Table 1. Numbers of Clusters of 494 Known Peptides (a)

                                    number of clusters
    mass    number of peptides    our method    Decon2LS    ICR2LS
    1000            47                 790          767        777
    1500           158                2630         2559       2575
    2000           109                2136         2024       1961
    2500            72                1555         1447       1393
    3000            52                1162         1151       1060
    3500            26                 963          880        802
    4000            19                 969          856        687
    4500             2                  42           41         37
    5000             2                  30           31         30
    5000             7                 311          348        255
    sum            494              10 588       10 104       9577

    (a) We divided the 494 peptides into 500 Da intervals and counted the number of identified clusters of peptides that belong to each interval.

    Table 2. Result of Monoisotopic Mass Determination for the Peptide Whose Mass Is 2296.22 Da

                             our method    Decon2LS    ICR2LS
    2296.22 Da (correct)         35           27          21
    2295.22 Da (-1 Da)            2            1           0
    2297.22 Da (+1 Da)            6           10           9
    2298.22 Da (+2 Da)            0            1           2
    765.40 Da (wrong CS)          0            2           6
    not found                     0            2           5

    7299 Analytical Chemistry, Vol. 80, No. 19, October 1, 2008

    Figure 4. Examples where our method determines the correct monoisotopic mass. The chemical formula is C101H165N29O32 and the monoisotopic mass is 2296.22 Da. An arrow represents the monoisotopic peak of this peptide and O, ], and / represent the theoretical isotopic distributions of this peptide calculated by our method, Decon2LS, and ICR2LS, respectively. (a) Decon2LS assigned 2295.22 Da as the monoisotopic mass and ICR2LS found no cluster. (b) Decon2LS and ICR2LS assigned 2297.22 Da. (c) ICR2LS assigned 2298.22 Da. (d) ICR2LS assigned an incorrect charge state and assigned 765.40 Da as the monoisotopic mass.
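The improvement percentages quoted above follow directly from the column totals of Table 1. The snippet below is only an arithmetic check; the row data are copied from Table 1, and the variable and function names are illustrative.

```python
# Rows of Table 1: (number of peptides, our method, Decon2LS, ICR2LS)
rows = [
    (47, 790, 767, 777),
    (158, 2630, 2559, 2575),
    (109, 2136, 2024, 1961),
    (72, 1555, 1447, 1393),
    (52, 1162, 1151, 1060),
    (26, 963, 880, 802),
    (19, 969, 856, 687),
    (2, 42, 41, 37),
    (2, 30, 31, 30),
    (7, 311, 348, 255),
]

# Column sums reproduce the "sum" row of the table.
peptides, ours, decon2ls, icr2ls = (sum(col) for col in zip(*rows))


def improvement_percent(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100


gain_over_icr2ls = round(improvement_percent(ours, icr2ls), 1)
gain_over_decon2ls = round(improvement_percent(ours, decon2ls), 1)
```

The percentages are relative to the cluster count of the program being compared against, not to the total number of clusters in the data set.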

    There can be various reasons why the programs give different search results. Some clusters are inherently ambiguous, and each program can make a different judgment. Sometimes the charge states of clusters are determined incorrectly. For all three programs, the primary errors are 1-2 Da errors. In THRASH-based algorithms, 1-2 Da errors often happen when the position of the most abundant peak of an identified cluster is different from that of averagine. In contrast, our method depends only weakly on the most abundant peak. Sometimes THRASH-based algorithms determine the monoisotopic mass of an identified isotopic cluster to be 1 Da larger than the correct mass, even though the correct monoisotopic peak exists in the spectrum. Such an error is uncommon in our method because the existence of the monoisotopic peak in a pseudocluster usually increases our score. However, our method also cannot correctly identify several ambiguous cases because it is still based on the cluster shape.

    Figure 5. Examples of overlapping clusters. (a) Two clusters share no peak. Two isotopic clusters were identified by all three programs. (b) Two clusters share the peak of 716.62 Th. The isotopic cluster of 6433.46 Da (O) was identified by all three programs, but the isotopic cluster of 3576.03 Da (]) was identified only by our method.

    Detection of false positives in search results can only be performed by manual inspection because many unidentified peptides are crowded in the spectrum and it is possible that there exists a peptide whose monoisotopic mass is 1 Da different from a known peptide. Here we present several examples in which monoisotopic masses determined by our method are different from the masses determined by the other programs. A peptide whose chemical formula is C101H165N29O32 and whose monoisotopic mass is 2296.22 Da was observed over a relatively long elution window (from scan no. 3464 to 3565) during the LC/MS/MS experiment of the ISB standard peptide mix. The results of mass determination are summarized in Table 2. We show four examples in Figure 4 where our method determines the correct monoisotopic mass. O, ], and / represent the theoretical isotopic distributions of this peptide calculated by our method, Decon2LS, and ICR2LS, respectively. In Figure 4a, Decon2LS determined the mass of the cluster as 1 Da smaller than the correct theoretical mass because the first peak of the cluster is much larger than in the averagine isotopic distribution. ICR2LS found no cluster in this region. On the other hand, in Figure 4b, Decon2LS and ICR2LS assigned 2297.22 Da, which is 1 Da larger than the theoretical mass. Figure 4c is a case where the intensities are close to the noise level. Because the fourth peak appears abnormally large, ICR2LS assigned 2298.22 Da, which is 2 Da larger than the theoretical mass. These examples (Figure 4a-c) show that the THRASH algorithm often assigns an incorrect mass when the most abundant peak of the identified cluster shows a discrepancy from the averagine isotopic distribution. Figure 4d is a case where ICR2LS assigned an incorrect charge state and assigned 765.40 Da as the monoisotopic mass.

    Some clusters that were not found by a program may be found if the parameters are set differently (lowering minimum S/N ratios, for example). However, a different parameter set may well cause false positive determination of other clusters, and there is always a compromise between the accuracy and computational costs. The highly accurate determination of monoisotopic masses by our method should increase the accuracy in peptide identification and decrease false positive peptide identification by MS-based proteomics. More scans of this peptide are shown in Supporting Information Figure S-1.
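As a worked check of the numbers in Table 2 and Figure 4d, the sketch below computes the monoisotopic mass of C101H165N29O32 from standard monoisotopic atomic masses and then shows how a charge-state error produces a mass near 765.40 Da. The paper does not state the charge states involved; the example assumes, for illustration only, a true charge of 3 read as charge 1. Any assigned charge equal to one third of the true charge gives the same result. All names are hypothetical.

```python
# Monoisotopic atomic masses (Da) of the lightest stable isotopes.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}
PROTON = 1.00727647  # mass of a proton (Da)


def monoisotopic_mass(formula):
    """Neutral monoisotopic mass for a composition such as {"C": 101, ...}."""
    return sum(MONOISOTOPIC[element] * count for element, count in formula.items())


def mz_from_mass(mass, charge):
    """m/z (Th) of the monoisotopic peak of an ion carrying `charge` protons."""
    return mass / charge + PROTON


def mass_from_mz(mz, charge):
    """Neutral mass implied by a monoisotopic m/z and an assigned charge state."""
    return (mz - PROTON) * charge


peptide = {"C": 101, "H": 165, "N": 29, "O": 32}
mass = monoisotopic_mass(peptide)            # close to 2296.22 Da

observed_mz = mz_from_mass(mass, 3)          # assumed true charge state: 3
correct_mass = mass_from_mz(observed_mz, 3)  # recovers the neutral mass
wrong_mass = mass_from_mz(observed_mz, 1)    # assumed wrong charge state: 1
```

With these assumptions `wrong_mass` is one third of the neutral mass, which lands within about 0.01 Da of the 765.40 Da entry in Table 2.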

    Identification of Overlapping Clusters. Although FTICR MS has a high resolving power, there are many overlapping clusters because hundreds of isotopic clusters are crowded into a narrow range. Even in these cases it is easy to identify all overlapping isotopic clusters if there is no shared peak. All programs correctly found two isotopic clusters in Figure 5a. However, it is very hard to identify all clusters if isotopic clusters share one or more peaks. THRASH fails to identify all clusters that share one or more peaks in many cases, because the subtraction of an identified cluster might eliminate the shared peaks. Our method can identify overlapping clusters that share one or more peaks in many cases because we consider all possible pseudoclusters and do not subtract the peaks of identified clusters. In Figure 5b, the cluster whose monoisotopic mass is 6433.46 Da (O) was identified by all three programs, but the cluster whose monoisotopic mass is 3576.03 Da (]) was identified only by our method. Both clusters belong to the clusters of 494 known peptides. However, Decon2LS and ICR2LS failed to identify both because the peak of 716.62 Th is shared by both clusters. Elimination of the 716.62 Th peak results in a low match (i.e., low fit number) between the theoretical averagine distribution and the experimental distribution, leading to loss of the mass information.

    Execution Time. Another noticeable advantage of our method is its speed. Since our method uses simple ratio functions and ratio product functions that are precomputed, it can calculate the scores of isotopic clusters much faster than THRASH, which calculates the least-squares fit on the fly. Execution time for our data set is shown in Supporting Information Table S-3 and Figure 6. ICR2LS is much slower than the other programs. Execution time of our method was similar to that of Decon2LS in deisotoping the first segment data due to the dominant effect of I/O time. We can see a remarkable difference in execution time in analyzing segment 4 data (almost 5 times faster than Decon2LS), for which it took the longest time. It must also be noted that the number of peaks obtained by the peak picking step is a major factor in execution time.

    Figure 6. Execution time of three programs. The 18 protein data set consists of five files. Our method is almost 5 times faster than Decon2LS in analyzing segment 4 data. ICR2LS is much slower than other programs.
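Returning to the shared peak in Figure 5b, a short calculation illustrates how one peak at 716.62 Th can belong to both clusters. The paper does not give the charge states or isotope indices; the sketch below searches for them, assuming protonated ions and an isotope spacing of about 1.00335 Da (the 13C-12C mass difference). Function names, the tolerance, and the search limits are illustrative assumptions.

```python
PROTON = 1.00727647        # mass of a proton (Da)
ISOTOPE_SPACING = 1.00335  # approximate 13C - 12C mass difference (Da)


def isotope_mz(mono_mass, charge, k):
    """m/z (Th) of the k-th isotopic peak (k = 0 is monoisotopic) at a given charge."""
    return (mono_mass + k * ISOTOPE_SPACING) / charge + PROTON


def assignments(mono_mass, peak_mz, tol=0.01, max_charge=10, max_k=10):
    """All (charge, isotope index) pairs that place a peak within tol Th of peak_mz."""
    return [
        (z, k)
        for z in range(1, max_charge + 1)
        for k in range(max_k + 1)
        if abs(isotope_mz(mono_mass, z, k) - peak_mz) <= tol
    ]


shared_peak = 716.62  # Th, from Figure 5b
heavy = assignments(6433.46, shared_peak)
light = assignments(3576.03, shared_peak)
```

Under these assumptions the search returns a single assignment for each cluster, and the two predicted peak positions differ by only a few thousandths of a Th, so the two clusters contribute to one observed peak. Subtracting that peak together with one cluster would therefore remove part of the evidence for the other.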

    CONCLUSION

    We have presented a new probabilistic model for isotopic distributions and a novel algorithm for determining isotopic distributions and monoisotopic masses based on the model. Our method was applied to protein mixture data from a high-resolution mass spectrometer, and it performed better than THRASH-based implementations. Our method found more isotopic clusters of identified peptides even though the total number of clusters was similar. Our method does not use the averagine fitting method, so we successfully resolve the 1-2 Da mismatch problem in THRASH, which occurs especially for isotopic clusters that deviate from the averagine distribution due to their weak intensity. Overlapping clusters are also identified successfully by our method. Because our method uses simple ratio functions to evaluate the score of isotopic clusters, its execution is very fast. This speed is expected to allow on-the-fly determination of monoisotopic masses during an LC/MS/MS experiment, which provides advantages such as accurate assignment of precursor monoisotopic masses to the corresponding MS/MS data.

    ACKNOWLEDGMENT

    This study was supported by Grants FPR08-A1-020, FPR08-A1-021, and FPR08-A1-010 of the 21C Frontier Functional Proteomics Project from the Korean Ministry of Education, Science & Technology.

    SUPPORTING INFORMATION AVAILABLE

    Additional information as noted in text. This material is available free of charge via the Internet at http://pubs.acs.org.

    Received for review May 2, 2008. Accepted July 9, 2008.

    AC800913B
