University of Groningen
Mastering data pre-processing for accurate quantitative molecular profiling with liquid chromatography coupled to mass spectrometry
Mitra, Vikram
IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below.
Document Version: Publisher's PDF, also known as Version of record
Publication date: 2017
Link to publication in University of Groningen/UMCG research database
Citation for published version (APA): Mitra, V. (2017). Mastering data pre-processing for accurate quantitative molecular profiling with liquid chromatography coupled to mass spectrometry [Groningen]: University of Groningen
Copyright: Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons).
Take-down policy: If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.
Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum.
Download date: 13-06-2018
Supporting information for chapter 3
Rat CSF sampling and LC-MS acquisition
1.1 Experimental design
In total, 7 male Lewis rats (Harlan Laboratories B.V.) with an average starting weight of
200 g were used to model the onset and development of experimental autoimmune
encephalomyelitis (EAE) induced with guinea pig myelin basic protein. At the start of the
study (day 0), EAE was induced in four male Lewis rats by subcutaneous injection in the left
hind paw (under isoflurane anesthesia) of 100 μL of a saline-based emulsion containing 20
μg guinea pig myelin basic protein (MBP, Vrije Universiteit Amsterdam), 500 μg
Mycobacterium tuberculosis type H37Ra (Difco) and 50 μL Complete Freund's Adjuvant
(CFA) (EAE groups B, F and H). Three male Lewis rats served as inflammatory controls
and received the same emulsion without MBP (EAE inflammatory control groups
A, C and G). Three animals were treated with minocycline at the onset of the study (day 0)
at 50 mg/kg body weight, injected intraperitoneally (groups C, G and H). Four rats
were treated with vehicle (groups A, B and F). The animals were kept in type III cages in
groups of three, in random order; food and water were available ad libitum. The animal groups
and samples included in the present study, with the corresponding file names and the
laboratories performing the LC-MS analysis, are summarized in Table S1.
Group  Treatment          Sample  File names in laboratory 1 (orbitrap)  File names in laboratory 2 (qTOF)
A      CFA + vehicle      15      100419O2c1_MS_28_EAE_Mino_15_002       20100612_15_Exclusion_400
B      EAE + vehicle      50      100505O2c1_MS_28_EAE_Mino_50_053       20100612_50_Exclusion_400
C      CFA + minocycline  53      100628O2c1_MS_28_EAE_Mino_53_004       20100609_EAE_53_MSMS_400_2
F      EAE + vehicle      98      100614O2c1_MS_28_EAE_Mino_98_063       20100609_EAE_98_MSMS_400_2
F      EAE + vehicle      6       100419O2c1_MS_28_EAE_Mino_6_017        20100612_6_Exclusion_400
G      CFA + minocycline  12      100419O2c1_MS_28_EAE_Mino_12_021       20100612_12_Exclusion_400
H      EAE + minocycline  111     00625O2c1_MS_28_EAE_Mino_111_006       20100609_EAE_111_MSMS_400_2
Table S1. Treatment summary of the samples analysed in the rat EAE CSF dataset. Treatments are
the following: CFA: animals treated with complete Freund's adjuvant (inflammatory control
animals); EAE: animals treated with CFA and Myelin Basic Protein (MBP) (diseased
animals); Minocycline: animals treated with minocycline; vehicle: animals treated only with
saline solution, not with minocycline.
1.2. CSF Sampling
Three of the animals were euthanized at day 10 (groups A-C) and the rest of the animals
(groups F-H) were euthanized at day 14 using CO2/O2. For the CSF sampling procedure the
head of each rat was held in a fixed position using a holder. The arachnoid membrane was
revealed by a skin incision followed by an incision in the musculus trapezius pars
descendens. The CSF was collected from the cisterna magna using an insulin syringe
needle (Myjector, 29 G × 1/2" - 0.33 × 12 mm, 0.3 mL = 30 units); a maximum of 60 μL was
collected from each animal. Each sample was centrifuged at 2000 g for 10 minutes at 4 °C
within 20 minutes after collection. The supernatant was divided into five aliquots
of ~10 μL each and stored at -80 °C until analysis. Samples showing blood
contamination on visual inspection were discarded from the study.
1.3. Sample preparation
The protein digestion was performed in a random order according to the following procedure:
10 μL CSF was added to a tube containing 10 μL 0.1% RapiGest™ (Waters) dissolved in
50 mM ammonium bicarbonate. The proteins in the CSF were reduced by the addition of
0.5 μL 1,4-dithiothreitol (DTT) (0.5 M) and incubated at 60 °C for 30 min. Samples were cooled
down to room temperature and were subsequently alkylated using 1 μL iodoacetamide
(IAM) (0.3 M), incubated in the dark for 30 min at room temperature. One microliter of
sequencing grade modified trypsin (Promega, Madison, WI, USA, part # V5111) at a
concentration of 1 μg/μL was added to the samples, which were further incubated for ~16 h
at 37 °C under agitation (450 rpm). At the end of the digestion 3 μL hydrochloric acid (0.5 M)
was added to the sample solution, and the samples were further incubated for 30 min at 37 °C.
The samples were centrifuged at 13 250 g for 10 min at 4 °C to remove RapiGest™ particles.
The samples were transferred to glass sample vials and kept at -80 °C. Each sample was
exposed to two freeze-thaw cycles prior to the LC-MS analysis.
1.4. LC-MS proteomic analysis in two laboratories
The digested CSF samples were analyzed in a random order. Before and after the analysis
of the 7 samples, a blank (water with 0.1% formic acid) and a quality control (horse heart
cytochrome C spiked into a pooled CSF sample after digestion at 200 fmol/μL) were injected
to check the technical quality of the analysis. Samples from groups A, C and G were injected
at a volume of 1 μL and samples from groups B, F and H were injected at a volume of 0.2 μL.
The injected sample amount differed in order to normalize the total ion
chromatograms (TIC) of the samples (approximately five times higher in groups B, F and H
compared to the rest of the groups, as determined by a pre-analysis of samples from all groups
at the same volume) and therefore to avoid overloading the trap column.
1.4.1. Orbitrap nanoLC-MS/MS analysis (laboratory 1, Rotterdam)
Digested rat CSF samples were analyzed by LC-MS/MS using an Ultimate 3000 nano LC
system (Dionex, Germering, Germany) coupled online to a hybrid linear ion trap/Orbitrap
mass spectrometer (LTQ Orbitrap XL; Thermo Fisher Scientific, Bremen, Germany). Five
microliters of digest were loaded onto a C18 trap column (C18 PepMap, 300 µm ID × 5 mm, 5
µm particle size, 100 Å pore size; Dionex, The Netherlands) and desalted for 10 minutes
using a flow rate of 20 µL/min. The trap column was switched online with the analytical
column (PepMap C18, 75 μm ID × 150 mm, 3 μm particle and 100 Å pore size; Dionex, The
Netherlands) and peptides were eluted with the following binary gradient: 0% - 25% eluent
B for 120 min and 25% - 50% eluent B for a further 60 minutes, where eluent A consisted of
2% acetonitrile and 0.1% formic acid in ultra pure water and eluent B consisted of 80%
acetonitrile and 0.08% formic acid in water. The column flow rate was set to 300 nL/min. For
MS/MS analysis a data dependent acquisition method was used, consisting of a high
resolution survey scan between 400 – 1800 m/z performed with an Orbitrap (automatic gain
control (AGC) 10⁶, resolution 30 000 at 400 m/z; lock mass set to 445.120025 m/z
(protonated (SiO(CH3)2)6)). Based on this survey scan the 5 most intense ions were
consecutively isolated (AGC target set to 10⁴ ions) and fragmented by collision-activated
dissociation (CAD) applying 35% normalized collision energy in the linear ion trap. Once a
precursor had been selected, it was excluded for 3 minutes.
1.4.2. qTOF chipLC-MS/MS proteomic analysis (laboratory 2, Groningen)
The peptide separation was performed on a reverse phase LC-chip (Protein ID chip #3;
G4240-63001 SPQ110: Agilent Technologies (Santa Clara, USA); analytical column: 150
mm × 75 μm Zorbax 300SB-C18, particle size of 5 μm and pore size of 300 Å; trap column:
160 nL Zorbax 300SB-C18, 5 μm) coupled to a nanoLC system (Agilent 1200) with a 40 μL
injection loop. The chip was interfaced via electrospray ionization to a quadrupole
time-of-flight (QTOF) mass spectrometer (Agilent 6510). The MassHunter software
(version B.02.00; Agilent Technologies) was used for data acquisition. The LC separation was
performed using the following eluents: A: ultra-pure water (conductivity 18.2 MΩ obtained
with Sartorius Stedim purification system, Nieuwegein, The Netherlands) containing 0.1%
formic acid (98-100%, pro analysis, Merck, Darmstadt, Germany); B: acetonitrile (HPLC-S
gradient grade, Biosolve, Valkenswaard, The Netherlands) containing 0.1% formic acid.
All samples were desalted and enriched on the trap column for 10 minutes at a flow rate of
3 μL/min (3% B). The samples were then transferred to the separation column at a flow rate
of 250 nL/min. For the elution of the peptides, the following gradient was used: 100 min linear
gradient from 3 to 50% B; 5 min linear gradient from 50 to 70% B and finally 4 min linear
gradient from 70 to 3% B.
MS analysis was performed using 2 GHz extended dynamic range mode under the following
conditions: mass range: 275-2000 m/z, acquisition rate: 1 spectrum/sec, data storage:
profile and centroid mode, fragmentor: 175 V, skimmer: 65 V, OCT 1 RF Vpp: 750 V, spray
voltage: ~1900 V, drying gas temp: 325 ºC, drying gas flow (N2): 6 L/min. Mass correction
was performed during analysis using internal standards of 371.31559 m/z (originating from
a ubiquitous background ion, the plasticizer dioctyl adipate, DOA) and 1221.990637 m/z
(HP-1221 calibration standard, evaporating from a wetted wick inside the spray chamber).
To assess the repeatability of the LC-MS analysis, the relative standard deviation (RSD) was
calculated for the mass accuracy, retention time and peak area based on selected
cytochrome C peaks in the QC samples. The peaks were first smoothed (Gaussian function,
width 15 points (15 sec)) and subsequently integrated; the peak area RSD was within +/-
25%, the retention time deviation was less than +/- 0.3% (or 5 sec) and the mass accuracy
(calculated as the mean of five measurements from each selected cytochrome C peak) was
within +/- 9 ppm of the theoretical value.
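The two repeatability measures named above can be sketched as follows. This is a minimal illustration, assuming the RSD is the sample standard deviation divided by the mean and that mass accuracy is the signed relative deviation from the theoretical m/z; the function names and the numeric values are hypothetical and are not taken from the study.

```python
import statistics

def rsd_percent(values):
    """Relative standard deviation (sample SD / mean), in percent."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)

def ppm_error(measured_mz, theoretical_mz):
    """Signed mass error in parts per million of the theoretical m/z."""
    return 1e6 * (measured_mz - theoretical_mz) / theoretical_mz

# Hypothetical QC peak areas and m/z values, for illustration only.
qc_areas = [100.0, 110.0, 90.0]
print(f"peak area RSD: {rsd_percent(qc_areas):.1f} %")
print(f"mass error: {ppm_error(500.005, 500.0):.1f} ppm")
```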
Mouse experimental design dataset
2.1 Experimental design
The mouse experimental design dataset was obtained from the National Cancer Institute's
Mouse Proteomic Technology Initiative (http://proteomics.cancer.gov/programs/mouse), which
was launched in 2004 to assess proteomic strategies for discovering candidate biomarkers for
early detection of cancer from genetically modified mouse models of human cancer. The
dataset that we obtained was specially designed to identify the sources of variation and the
factors that have the largest influence on the compound variability of mouse serum modeling
human cancer. The following four factors were studied: laboratory (2 levels), depletion
method (4 levels), disease or healthy stage (2 levels) and type of cancer (3 levels). This
modified version of the Latin square experimental design(1) thus provides 48 analyses in
total, with 24 identical samples analysed in two laboratories. In a standard Latin square design,
the number of levels for each factor must be the same. In this particular experiment, the
factor mouse model has three levels, laboratory and disease status have two levels each,
whereas the depletion method has four levels, hence a modified Latin square. In a Latin
square design, the overall variation is partitioned into four sources, which allows the
experimenter to isolate the effects of mouse model, lab, disease status and depletion
method. The data set is available on-line at
http://www.proteomecommons.org/dev/dfs/examples/nci-mouse-models/index.html.
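The run counts quoted above follow directly from crossing the four factors. The sketch below only reproduces that arithmetic; the level labels are hypothetical placeholders, since the actual level names are not listed in this section.

```python
from itertools import product

# Placeholder level labels (assumptions); only the number of levels per
# factor is taken from the text.
laboratories = ["lab 1", "lab 2"]
depletion_methods = ["depletion 1", "depletion 2", "depletion 3", "depletion 4"]
disease_status = ["healthy", "diseased"]
cancer_models = ["model 1", "model 2", "model 3"]

runs = list(product(laboratories, depletion_methods, disease_status, cancer_models))
# A sample is defined by everything except the laboratory that measured it.
samples = {run[1:] for run in runs}

print(len(runs), "analyses in total")
print(len(samples), "samples, each analysed in", len(laboratories), "laboratories")
```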
2.2 Depletion methods and protein digestion
MARS Immunoaffinity Depletion. 40 μL of each mouse model plasma sample was used
for the immunoaffinity depletion. Three high-abundance proteins (albumin, IgG, and
transferrin) that compose 75-80% of the total protein mass in mouse plasma were removed
simultaneously using a 4.6 × 50 mm murine MARS column (Agilent, Palo Alto, CA, USA)
per the manufacturer's instructions. The flow-through fractions were concentrated in iCON
concentrators with 9 kDa molecular weight cutoffs (Pierce, Rockford, IL) followed by buffer
exchange into 50 mM NH4HCO3 in the same unit per the manufacturer's instructions.
Cysteinyl Peptide Enrichment. Cysteinyl peptides were captured from the tryptic digest as
previously described(2, 3). All solutions used in this method were degassed to prevent
oxidation of the thiol content. The peptides resulting from the above protein digestion step
were reduced with 5 mM DTT in 80 μL of 50 mM Tris buffer (pH 7.5), 1 mM EDTA (coupling
buffer) for 30 min at 37ºC, after which the samples were diluted 5-fold by adding coupling
buffer. Thiopropyl Sepharose 6B thiol-affinity resin (100 μL; Amersham Biosciences,
Uppsala, Sweden) was prepared from dried powder per the manufacturer's instructions.
Briefly, the dried powder was rehydrated in water for 15 min and washed by 50 bed volumes
of water, followed by 50 bed volumes of coupling buffer in a Handee Mini-Spin column
(Pierce, Rockford, IL, USA). The reduced peptide sample was then incubated with the resin
for 1 h at room temperature with gentle mixing, and the unbound portion (non-cysteinyl
peptides) was removed by spinning the column at low speed. The resin was washed in the
spin column sequentially with 0.5 mL of each of the following solutions: 1) 50 mM Tris buffer
(pH 8.0), 1 mM EDTA (washing buffer); 2) 2 M NaCl; 3) 80% ACN/0.1% TFA solution; and
4) washing buffer. To release the captured cysteinyl peptides, 100 μL of 20 mM DTT freshly
prepared in washing buffer was added to the resin and incubated for 30 min at room
temperature. The resin was further washed with 100 μL of 80% ACN which was pooled with
the previous DTT eluate. The sample was alkylated with 80 mM iodoacetamide for 30 min
at room temperature in the dark. The eluted cysteinyl peptides were desalted using an SPE
C18 column and lyophilized. Cysteinyl peptide samples were reconstituted in 25 mM
NH4HCO3 and stored at -80ºC until LC-MS analysis (MARS+Cys sample).
Plasma Protein Digestion. The MARS flow-through proteins were denatured and reduced
in 50 mM NH4HCO3 (pH 8.2), 8 M urea, 10 mM dithiothreitol (DTT) for 1 h at 37ºC. The
resulting protein mixture was diluted 10-fold with 50 mM NH4HCO3, and then sequencing
grade modified porcine trypsin (Promega, Madison, WI, USA) was added at a trypsin:protein
ratio of 1:50. The sample was incubated overnight at 37ºC. The following day, the
trypsin-digested sample was loaded on a 1 mL SPE C18 column (Supelco, Bellefonte, PA, USA)
and washed with 4 mL of 0.1% trifluoroacetic acid (TFA)/5% acetonitrile (ACN). Peptides
were eluted from the SPE column with 1 mL of 0.1% TFA/80% ACN and lyophilized
afterwards. Peptide samples were reconstituted in 25 mM NH4HCO3 and stored at -80ºC
until LC-MS analysis.
Reversed-Phase Capillary LC-MS Analyses. A custom-built high-pressure capillary LC
system(4) coupled on-line to an Agilent LC/MSD TOF (G1969A, laboratory 2) via an
in-house-manufactured electrospray ionization interface was used to analyze the peptide
samples. In the other laboratory an LC-MS system with time-of-flight detector was used
(Waters LCT Premier, laboratory 1). The reversed-phase capillary column was prepared by
slurry packing 3-μm Jupiter C18 bonded particles (Phenomenex, Torrance, CA) into a
65-cm long, 75 μm i.d. fused silica capillary (Polymicro Technologies, Phoenix, AZ) that
incorporated a retaining stainless steel screen in an HPLC union (Valco Instruments Co.,
Houston, TX). The mobile phases consisted of 0.2% acetic acid and 0.05% TFA in water (A)
and 0.1% TFA in 90% ACN/10% water (B) and were degassed on-line using a vacuum
degasser (Jones Chromatography Inc., Lakewood, CO). After loading 5 μL of sample
solution onto the column, an exponential gradient elution was achieved by increasing the
mobile-phase composition in a stainless steel mixing chamber from 0 to 70% B over 120
min. The TOF mass spectrometer was scanned in the m/z range of 400-2000 at 1
scan/second.
Monte-Carlo simulated dataset
The Monte-Carlo simulation to imitate the outcome of the peak-matching procedure was
performed with the following criteria:
a) There are two types of peak pairs: Accurately matched peak pairs where the retention
time coordinates of the matched peaks follow a non-linear, monotonic trend. In case of
peak order inversion, the retention times of the accurately matched peak pairs fluctuate
around the non-linear, monotonic trend, up to the maximal retention time difference
of peaks changing elution order. The second type of peak pairs is obtained by randomly
matching peaks between the two chromatograms and simulates the error in the peak
matching procedure. These peak pairs are distributed randomly throughout the retention
time space while taking the initial peak density distribution in the two chromatograms
into account.
b) The non-linear monotonic trend is simulated using a cubic spline function and peak
elution order inversion is represented as random fluctuation (orthogonal residuals) along
this trend. Distribution of peak pairs along the non-linear monotonic retention time trend
is sampled directly from the peak distribution of a real LC-MS chromatogram.
c) The parameters of the simulation that can be set by the user are the following: (1)
number of accurately matched peak pairs, (2) number of randomly matched peak pairs,
(3) fluctuation in minutes of the accurately matched peak pairs simulating the amount of
maximal retention time differences related to changes in peak elution order, (4-6) three
LC-MS peak distributions expressed as a histogram along the retention time (one is
used to sample the peak distribution of the accurately matched peak pairs along the
main monotonic retention time correspondence trend and the other two are used to
sample randomly matched peak pairs from two LC-MS/MS chromatograms).
Parameters for the Monte Carlo simulations were the following:
Total number of matched peak pairs (MPPs): 100, 250, 500, 750 and 1000
Fluctuation of accurately matched peak pairs (AMPPs) around the monotonic retention time
trend: 0.05, 0.1, 1, 5 and 15 minutes
Ratio of AMPPs relative to the MPPs: 0.00, 0.10, 0.20, 0.30, 0.40, 0.50, 0.75, 0.90 and 1.00
Number of repetitions: 3
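The simulation criteria above can be illustrated with a deliberately simplified sketch. Several simplifications are assumptions made here and are not part of the described procedure: the monotonic trend is a hypothetical quadratic rather than a cubic spline, peak positions are drawn uniformly rather than from the histogram of a real chromatogram, and the fluctuation is applied along the second retention time axis rather than orthogonally to the trend. All names are hypothetical.

```python
import random

def trend(t):
    """Hypothetical monotonic, non-linear retention time trend (minutes)."""
    return t + 0.002 * t * t  # strictly increasing for t >= 0

def simulate_peak_pairs(n_total, ampp_ratio, fluctuation, t_max=120.0, seed=0):
    """Return (rt_1, rt_2, is_accurate) tuples for one simulated peak matching."""
    rng = random.Random(seed)
    n_accurate = round(n_total * ampp_ratio)
    pairs = []
    # Accurately matched pairs scatter around the trend within +/- fluctuation.
    for _ in range(n_accurate):
        t = rng.uniform(0.0, t_max)
        pairs.append((t, trend(t) + rng.uniform(-fluctuation, fluctuation), True))
    # Randomly matched pairs fill the whole retention time space.
    for _ in range(n_total - n_accurate):
        pairs.append((rng.uniform(0.0, t_max), rng.uniform(0.0, trend(t_max)), False))
    return pairs

pairs = simulate_peak_pairs(n_total=500, ampp_ratio=0.40, fluctuation=1.0)
# 5 totals x 5 fluctuations x 9 ratios, each repeated 3 times.
n_simulations = 5 * 5 * 9 * 3
print(len(pairs), sum(p[2] for p in pairs), n_simulations)
```

With the parameter lists given above, the full grid amounts to 675 simulated peak matchings.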
Detailed description of the time alignment algorithm
Pre-processing of single stage LC-MS data
Figure 1 shows the main parts of the quality assessment procedure and indicates in red the
modules where the procedure can stop due to improper conditions for time alignment, such
as a low number of matched peak pairs with respect to random peak pairing, a low number of
accurately matched peak pairs or a high probability of peak order inversion. The quality control
procedure is a pairwise method and expects that the subsequent time alignment method
changes only the retention time of one chromatogram (referred to here as the sample
chromatogram, and shown as peak list 2 in Figure 1). To process the raw LC-MS/MS data,
data in vendor specific format were converted to mzXML format using the msconvert tool of the
ProteoWizard library(5). The single stage part of the LC-MS/MS datasets in mzXML format was
submitted to data pre-processing, which included peak detection and quantification,
deconvolution of isotopic peak clusters, charge state determination of isotopologue peak
clusters and summing of the most abundant isotopologues of each charge state per
peptide. The initial noise filtering and peak quantification were carried out using the
PeakPicker module of the OpenMS pipeline(6). The signal-to-noise ratio parameter of the
PeakPicker algorithm was set to 10. Detected isotopologues (chemical species of the same
compound with the same atomic, but different isotope constitution) of one particular charge
state are then clustered, and clusters which are not in accordance with the isotope wavelet
model following the “averagine” peptide constitution(6) are filtered out during the feature
finding step. The charge state of each detected isotope cluster is then determined and the
decharged mass of the most abundant isotopologue is calculated and attributed as the mass
of a single peptide. This is followed by summing of the most abundant decharged
isotopologues with the same mass (mass tolerance within ±0.01 Da) within ±30 seconds of
retention time. The final quantitative value for each compound is characterized by the mass
value of the decharged most abundant isotopologue and the average retention time of all
charge states. This information, along with the ion counts, is exported to a tab-delimited text
file, which is referred to as the “peak list” in the article.
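The decharging and summing step described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the study: the decharged mass is taken here as the neutral monoisotopic mass M = z (m/z − m_proton), grouping is greedy starting from the most intense feature, and the function names and example values are hypothetical.

```python
PROTON_MASS = 1.007276  # Da

def decharged_mass(mz, charge):
    """Neutral mass from m/z and charge (assumed decharging convention)."""
    return charge * (mz - PROTON_MASS)

def sum_charge_states(features, mass_tol=0.01, rt_tol=30.0):
    """features: (mz, charge, rt_seconds, ion_count) of the most abundant
    isotopologue of each charge state. Returns (mass, mean_rt, summed_count)."""
    compounds = []
    for mz, charge, rt, count in sorted(features, key=lambda f: -f[3]):
        mass = decharged_mass(mz, charge)
        for comp in compounds:
            # Same compound: within +/-0.01 Da and +/-30 s of the seed feature.
            if abs(comp["mass"] - mass) <= mass_tol and abs(comp["rts"][0] - rt) <= rt_tol:
                comp["rts"].append(rt)
                comp["count"] += count
                break
        else:
            compounds.append({"mass": mass, "rts": [rt], "count": count})
    return [(c["mass"], sum(c["rts"]) / len(c["rts"]), c["count"]) for c in compounds]

# Hypothetical features: two charge states of one peptide plus an unrelated peak.
features = [(501.007276, 2, 600.0, 1000.0),
            (334.340609, 3, 610.0, 500.0),
            (700.000000, 1, 600.0, 50.0)]
peak_list = sum_charge_states(features)
for mass, rt, count in peak_list:
    print(f"{mass:.4f}\t{rt:.1f}\t{count:.0f}")
```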
Intensity-rank-based peak matching of LC-MS data (left part of step 1. in Figure 1)
Prior to peak matching, all peak lists were sorted and ranked according to decreasing intensity.
Correspondences between a pair of peak lists are determined by finding peak pairs that are
close in mass and intensity rank. The peak correspondences between a pair of intensity
sorted peak lists are identified using a sliding window technique with the following
parameters: (1) peak pairs should be close in m/z, therefore a threshold for the maximal m/z
difference between peak pairs is applied. This threshold should be set according to the
maximal mass calibration differences between the two LC-MS chromatograms. To
improve mass calibration it is advised to recalibrate the mass axis using either known
masses of background contaminants(7) or the accurate mass of identified peptides from
MS/MS data, if available(8); (2) the number of the most abundant peaks used to identify peak
pairs. The end of the intensity sorted peak lists contains noise and other data processing
artifacts, therefore this parameter should be set to include only peaks from the intensity
sorted peak lists and exclude non-informative items such as noise; (3) the length of the sliding
window used in the peak matching procedure. This window defines the largest difference
between the intensity ranks of paired peaks that is considered by the matching algorithm.
In case of multiple hits for the same mass within the sliding window, the algorithm only
selects the peak with the lowest difference of intensity ranks. Figure S2 in the supporting
information provides a visual summary of the peak matching procedure used to define peak
pairs between two LC-MS intensity sorted peak lists using the sliding window approach.
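The three parameters of the matching procedure can be made concrete with a short sketch. It assumes that the sliding window is centred on the rank of the peak in the first list and that a peak of the second list may be reused by several peaks of the first list; neither detail is specified above, and the function name and example peaks are hypothetical.

```python
def match_by_intensity_rank(peaks_a, peaks_b, mz_tol, n_top, window):
    """peaks_*: lists of (mass, retention_time, intensity).
    Returns pairs (peak_a, peak_b) matched by mass within a rank window."""
    ranked_a = sorted(peaks_a, key=lambda p: -p[2])[:n_top]
    ranked_b = sorted(peaks_b, key=lambda p: -p[2])[:n_top]
    pairs = []
    for rank_a, peak_a in enumerate(ranked_a):
        first = max(0, rank_a - window)
        last = min(len(ranked_b), rank_a + window + 1)
        best = None
        for rank_b in range(first, last):
            if abs(ranked_b[rank_b][0] - peak_a[0]) <= mz_tol:
                # Several hits for the same mass: keep the closest intensity rank.
                if best is None or abs(rank_b - rank_a) < abs(best - rank_a):
                    best = rank_b
        if best is not None:
            pairs.append((peak_a, ranked_b[best]))
    return pairs

# Hypothetical peak lists, for illustration only.
list_a = [(500.000, 10.0, 900.0), (600.000, 20.0, 800.0), (700.000, 30.0, 700.0)]
list_b = [(600.005, 21.0, 950.0), (500.004, 11.0, 850.0), (800.000, 40.0, 750.0)]
matched = match_by_intensity_rank(list_a, list_b, mz_tol=0.01, n_top=3, window=1)
rt_pairs = [(a[1], b[1]) for a, b in matched]
print(rt_pairs)
```

The retention time coordinates of the matched pairs (`rt_pairs` here) are the input of the density estimation described in the next section.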
Optimizing peak matching parameters of single stage LC-MS peak lists (parameter optimisation in step 1. in Figure 1)
All peak matching algorithms provide a certain ratio of accurately and inaccurately matched
peak pairs. The accurately matched peak pairs are common peaks between the two
chromatograms and contain the information for the correction of the retention time
differences between the two LC-MS chromatograms. When the ratio of accurately matched
peaks is high within the dataset, the retention time coordinates of the accurately matched
peaks accumulate along the retention time correspondence trend. Bivariate kernel density
estimation (2D-KDE) is applied over the retention time vectors of the matched peak pairs to
identify the regions where peaks accumulate in higher density compared to what is expected
from random pairing of peaks from two LC-MS/MS chromatograms. In 2D-KDE, for n peak
pairs with retention time coordinates $(x_i, y_i)$, the estimated probability density function
$\hat{f}_H$ is given by:

$$\hat{f}_H(x, y) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - x_i,\; y - y_i) \qquad (1)$$

where $K_H$ is a bivariate, ellipsoid, symmetric Gaussian kernel that integrates to 1 from
$-\infty$ to $+\infty$ for x and y values. H is the bandwidth, described by the sigma of the
two-dimensional Gaussian kernel in the x ($\sigma_x$) and y ($\sigma_y$) directions, and is
greater than zero. $K_H$ determines the smoothing extent of the 2-dimensional density
histogram, and is expressed by the following equation:

$$K_H(x - x_i,\; y - y_i) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x-x_i)^2}{\sigma_x^2} + \frac{(y-y_i)^2}{\sigma_y^2} - \frac{2\rho(x-x_i)(y-y_i)}{\sigma_x\sigma_y}\right]\right) \qquad (2)$$

ρ is the correlation between the two 1-dimensional Gaussian kernel functions and defines
the rotation of the Gaussian kernel. The bandwidth parameter was optimally set using a
plug-in bandwidth matrix approach developed by Botev et al.(9). An important feature of the
full bandwidth matrix is that it does not use any normal reference rules and is data centric.
For n peak pairs, the algorithm estimates a square density matrix of size $2^i$, where
$i = \arg\min_i (2^i \geq n)$, and the data matrix covers the entire retention time domains of
the matched peak pairs. The value of $2^i$ is capped at i = 7 to avoid long calculation times
for the 2-dimensional Kolmogorov-Smirnov test (see two paragraphs below).
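The bivariate Gaussian kernel density estimate described above can be sketched on a regular grid as follows. The sketch uses fixed, user-supplied values for σx, σy and ρ, which is an assumption made for brevity; it does not implement the plug-in bandwidth selection of Botev et al., and the function names are hypothetical.

```python
import math

def gaussian_kernel(dx, dy, sx, sy, rho):
    """Bivariate Gaussian kernel with standard deviations sx, sy and correlation rho."""
    norm = 1.0 / (2.0 * math.pi * sx * sy * math.sqrt(1.0 - rho ** 2))
    q = (dx / sx) ** 2 + (dy / sy) ** 2 - 2.0 * rho * dx * dy / (sx * sy)
    return norm * math.exp(-q / (2.0 * (1.0 - rho ** 2)))

def kde2d(pairs, grid_x, grid_y, sx, sy, rho=0.0):
    """Density matrix (rows follow grid_y, columns follow grid_x):
    the average of the kernels centred on each peak pair."""
    n = len(pairs)
    return [[sum(gaussian_kernel(gx - x, gy - y, sx, sy, rho) for x, y in pairs) / n
             for gx in grid_x] for gy in grid_y]

# Sanity check on a hypothetical single peak pair: the density should
# integrate to approximately 1 over a grid that covers the kernel.
step = 0.25
grid = [-6.0 + step * k for k in range(49)]
density = kde2d([(0.0, 0.0)], grid, grid, sx=1.0, sy=1.0, rho=0.5)
integral = sum(sum(row) for row in density) * step * step
print(f"integral over the grid: {integral:.4f}")
```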
Peaks paired from the two peak lists contain randomly matched peak pairs and accurately
matched peak pairs. The ratio of correctly and incorrectly matched pairs using decharged
and isotope deconvoluted peak lists depends on the molecular composition of the two
samples and on the parameters of the peak matching procedure. In order to assess
statistically the ratio of correctly and incorrectly matched peak pairs, a p-value is calculated
using the 2-dimensional Kolmogorov-Smirnov (2D-KS) test between the 2D-KDE matrix
obtained from the matched peak pair distribution and the density matrix calculated for
random peak pairing. The density matrix for random peak pairing is obtained with the cross
product of the 1-dimensional KDE of the peak distribution for each LC-MS chromatogram using
x and y for the corresponding chromatograms. Therefore 2D-KS measures the statistical
probability that the peak pair distribution originates from the distribution of random pairing of
peaks from the two chromatograms.
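The density expected from random peak pairing can be sketched as the product of two 1-dimensional kernel density estimates evaluated on a grid. The fixed Gaussian bandwidths and the function names below are assumptions for illustration; the retention times are hypothetical.

```python
import math

def kde1d(values, grid, sigma):
    """1-dimensional Gaussian kernel density estimate evaluated on a grid."""
    c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    n = len(values)
    return [sum(c * math.exp(-0.5 * ((g - v) / sigma) ** 2) for v in values) / n
            for g in grid]

def random_pairing_density(rt_x, rt_y, grid_x, grid_y, sigma_x, sigma_y):
    """Density matrix expected if peaks of the two chromatograms were paired
    at random: element [j][i] is f_y(grid_y[j]) * f_x(grid_x[i])."""
    fx = kde1d(rt_x, grid_x, sigma_x)
    fy = kde1d(rt_y, grid_y, sigma_y)
    return [[fy_j * fx_i for fx_i in fx] for fy_j in fy]

# Hypothetical retention times (minutes) of the peaks in two chromatograms.
rt_x = [10.0, 12.0, 30.0, 45.0]
rt_y = [11.0, 14.0, 33.0]
grid_x = [float(k) for k in range(0, 61)]
grid_y = [float(k) for k in range(0, 51)]
null_density = random_pairing_density(rt_x, rt_y, grid_x, grid_y, 2.0, 2.0)
print(len(null_density), "rows x", len(null_density[0]), "columns")
```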
The 1-dimensional Kolmogorov-Smirnov (1D-KS) test provides the non-parametric
probability that two distributions are equivalent and that the observed differences between the
two distributions are due to random sampling. The 1D-KS uses the maximum absolute
difference between the two cumulative distributions to calculate the probability of the equality
of two empirical distributions. Extending the KS statistic to multi-dimensional space is
challenging, because there are $2^d - 1$ independent cumulative distributions in d
dimensions. We have slightly modified the algorithm developed by Peacock et al.(10), which
estimates the largest difference between the two cumulative distributions for any possible
ordering for two dimensions. Given n points in a two-dimensional space defined by the
retention time domains of the two chromatograms, this amounts to calculating the
cumulative distribution functions in $4n^2 - 1$ quadrants. Our modification is that the
cumulative functions are not calculated for each peak pair, but are obtained directly from the
two 2D-KDE matrices, one obtained with the cross product of two 1D-KDEs calculated from
the peak distribution in the two LC-MS chromatograms and the other obtained with the peak
pairing algorithm described in previous section 3.2. The $D_{KS}$ test statistic is then obtained by
calculating the largest difference between cumulative distributions considering all possible
$4n^2 - 1$ quadrant divisions, where n in this case corresponds to the dimension of the 2D-KDE
square matrices. The null hypothesis, that the distribution of peak pairs obtained
with random peak pairing and the distribution obtained with matching of intensity sorted peak
lists are the same, is rejected at a significance level of α if

$$D_{KS}\sqrt{\frac{n}{2}} > Z_\alpha \qquad (3)$$

where Zα is the cumulative standard normal deviate for the corresponding α probability. The
exact p-value for the 2D-KS estimate is obtained from the left part of the inequality (3). The
2D-KS test is then used to optimise the three parameters of the intensity-rank-based peak
pairing algorithm (the length of the intensity rank window, the threshold for the m/z
differences and the number of the most abundant peaks taken into consideration by the
peak matching procedure) with a predefined set of parameters (the exact values of the
parameters are presented in section 7 in supporting information). The 2D-KDE and the
2D-KS calculations are performed only for peak lists matching a minimum of 100 peak pairs.
If the minimum number of matched peak pairs is not reached during the whole optimisation
procedure, the two-step time alignment procedure stops. The alignment procedure also stops
if the 2D-KS p-value, i.e. the probability that the intensity-rank-based peak pairing
distribution is the same as the one that would be obtained with random peak pairing, is higher
than 0.001. Peak matching parameters have a large effect on the peak matching accuracy
(ratio of accurately and inaccurately matched peak pairs and number of obtained peak
pairs), and for that reason this step is crucial to find the optimal peak matching parameters,
which provide the peak pair distribution in the retention time space of the two
chromatograms that differs most from the distribution obtained with random peak pairing. A few
examples of the effect of parameters on peak matching results are shown in Figure S3 of this
supporting information. Figure S4 in this supporting information shows plots presenting the
main steps of the selection of accurately matched peaks.
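The idea of obtaining the 2D-KS statistic directly from two density matrices can be sketched as follows. This is a simplified reading of the procedure and an assumption about its details: each matrix is normalised to sum to 1, cumulative sums are taken from each of the four corners of the grid, and the statistic is the largest absolute difference found. The conversion of the statistic into a p-value is not shown, and the function names are hypothetical.

```python
def normalised_cumulative(matrix, flip_rows, flip_cols):
    """2D cumulative sum of a density matrix, normalised to a total of 1,
    accumulated from the corner selected by the two flip flags."""
    rows = matrix[::-1] if flip_rows else matrix
    rows = [row[::-1] if flip_cols else row for row in rows]
    total = sum(sum(row) for row in rows)
    cumulative, above = [], [0.0] * len(rows[0])
    for row in rows:
        running, current = 0.0, []
        for value, up in zip(row, above):
            running += value / total
            current.append(running + up)
        cumulative.append(current)
        above = current
    return cumulative

def ks2d_statistic(density_a, density_b):
    """Largest absolute difference between the cumulative distributions of
    two equally sized density matrices over the four quadrant orientations."""
    d_max = 0.0
    for flip_rows in (False, True):
        for flip_cols in (False, True):
            cum_a = normalised_cumulative(density_a, flip_rows, flip_cols)
            cum_b = normalised_cumulative(density_b, flip_rows, flip_cols)
            for row_a, row_b in zip(cum_a, cum_b):
                for a, b in zip(row_a, row_b):
                    d_max = max(d_max, abs(a - b))
    return d_max

# Toy 2 x 2 matrices: density concentrated on the diagonal versus uniform.
matched = [[3.0, 1.0], [1.0, 3.0]]
uniform = [[1.0, 1.0], [1.0, 1.0]]
print(ks2d_statistic(matched, uniform))
```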
Selection of accurately matched peak pairs (step 2. in Figure 1)
An optimal threshold selection method is required to select the dense region of the 2D-KDE
obtained with any type of peak pairing method that yields a mixture of accurately and
inaccurately matched peak pairs. This threshold is calculated by constructing a
1-dimensional histogram from all the density values of the 2D-KDE matrix (the histogram is made
with the number of bins equal to the size of the square 2D-KDE matrix). The threshold (d) is set
to the density value where the positive part of the first derivative of the histogram frequency is
closest to its median
($d = \arg\min_d \left| \left(\frac{\partial h}{\partial d}\right)^{+} - \widetilde{\left(\frac{\partial h}{\partial d}\right)^{+}} \right|$,
where h corresponds to the abundance of the histogram, d is the density estimate, the + sign
refers to the positive values of $\partial h / \partial d$ and the ~ sign to the median value of
$\partial h / \partial d$). Peak pairs that are located in density areas higher than this
density threshold are selected as accurately matched peak pairs, while other peaks are
considered as randomly grouped, mismatched peak pairs. It should be noted that this
threshold selection is sensitive to the peak distribution, and slight manual readjustment of the
threshold value may improve the accuracy of the selection of accurately matched peak pairs.
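The threshold rule above can be sketched as follows. The details are assumptions made for illustration: the derivative is approximated by forward differences between neighbouring histogram bins, and the threshold is reported as the centre of the selected bin. The function name and the toy matrix are hypothetical.

```python
import statistics

def density_threshold(density_matrix):
    """Density value at which the positive part of the histogram slope is
    closest to the median of all positive slopes."""
    values = [v for row in density_matrix for v in row]
    n_bins = len(density_matrix)  # one bin per row of the square matrix
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        raise ValueError("density matrix is constant")
    counts = [0] * n_bins
    for v in values:
        counts[min(int((v - low) / width), n_bins - 1)] += 1
    # Forward differences approximate the derivative of frequency h over density d.
    slopes = [(counts[k + 1] - counts[k]) / width for k in range(n_bins - 1)]
    rising = [(s, k) for k, s in enumerate(slopes) if s > 0]
    if not rising:
        raise ValueError("histogram has no rising part")
    median_slope = statistics.median([s for s, _ in rising])
    _, k_best = min(rising, key=lambda item: abs(item[0] - median_slope))
    return low + (k_best + 0.5) * width

# Toy 4 x 4 density matrix: mostly empty background with a few dense cells.
toy = [[0.0, 0.0, 0.0, 0.0],
       [0.0, 0.0, 0.0, 0.0],
       [0.0, 0.0, 0.3, 0.6],
       [0.6, 0.6, 0.9, 1.0]]
threshold = density_threshold(toy)
dense_cells = sum(1 for row in toy for v in row if v > threshold)
print(threshold, dense_cells)
```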
Monotonic non-linear alignment function (step 3 in Figure 1)
The retention time coordinates of the selected accurately matched peak pairs are used to
calculate a monotonic non-linear global alignment function using Locally Weighted
Scatterplot Smoothing (LOWESS)(11) regression in combination with the bagging
resampling(12) technique. A robust version of the LOWESS regression, which assigns a lower
weight to outliers, was used to calculate the alignment function. The method assigns
zero weight to peak pairs lying more than six median absolute deviations of the residuals from the
tested position. The span is set to four times the root mean square of the 2D-KDE bandwidth,
and a third-order polynomial function is used for the LOWESS regression. The
final smoothed regression points are calculated as the average of 100 bootstrap resamplings.
The bootstrap resampling is performed uniformly with replacement using all extracted
peak pairs. This procedure reduces the variance of the LOWESS predictor and helps to
avoid overfitting. When the elution order of common peaks is the same in two
chromatograms, the one-to-one peak correspondence is expressed by a monotonic function
between the retention times of accurately matched peaks. For that reason the main retention
time correspondence trend, i.e. the alignment function, should be monotonic. To make the main
time alignment function monotonic, least-squares linear optimisation with a monotonic
constraint is applied to the average LOWESS regression points of the 100 bootstraps. A
piecewise cubic Hermite interpolating polynomial (PCHIP) function with cubic spline(13) is
used to perform monotonic interpolation for transforming the retention times of peaks
between the retention time spaces of the two chromatograms. Partitioning of the data for PCHIP
was performed on the basis of the span used in LOWESS (root mean square of the 2D-KDE
bandwidth). Before performing PCHIP, linear interpolation was performed between
experimental data points for partitions containing no data points, to avoid large jumps in the
main monotonic alignment function.
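The monotonic constraint step can be sketched as follows. This is a stand-in, not the authors' implementation: it uses the pool-adjacent-violators algorithm, one standard way to obtain the non-decreasing sequence closest to a set of points in the least-squares sense. The input is assumed to be the bootstrap-averaged LOWESS points ordered by retention time; the function name and the example values are hypothetical.

```python
def monotone_least_squares(y, weights=None):
    """Pool-adjacent-violators: non-decreasing sequence closest to y
    in the (weighted) least-squares sense."""
    if weights is None:
        weights = [1.0] * len(y)
    # Each block stores [weighted mean, total weight, number of points].
    blocks = []
    for value, w in zip(y, weights):
        blocks.append([float(value), float(w), 1])
        # Merge backwards while the monotonic constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, n1 + n2])
    fitted = []
    for mean, _, n in blocks:
        fitted.extend([mean] * n)
    return fitted

# Hypothetical averaged regression points with one local inversion.
smoothed = [10.0, 12.5, 12.1, 15.0]
monotone = monotone_least_squares(smoothed)
```

The result contains flat segments wherever points were pooled, so a subsequent monotone interpolation such as PCHIP is still needed to obtain a smooth function.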
Probability of peak elution order similarity between two chromatograms (step 3 in Figure 1)
When the elution order of common peaks is the same in two chromatograms, the accurately
matched peak pairs follow a non-linear monotonic trend between the chromatograms
without any fluctuation of their retention time coordinates
along this trend. However, small scattering may be observed due to imprecise determination
of the peak maxima. In this case it is possible to determine the one-to-one correspondence
of peaks in the two chromatograms unambiguously with the monotonic alignment function.
This means that it is possible to unambiguously find the same peaks or to determine that a
peak has no corresponding peak in the other chromatogram. When the elution order of
common peaks differs between the two chromatograms, the fluctuation of the correctly matched
peak pairs around the non-linear monotonic retention time trend becomes larger. In this case
it is not possible to match peaks between the two chromatograms unambiguously. The
corresponding peak could be anywhere within the fluctuation domain of the accurately
matched peak pairs, and the non-linear monotonic function just represents the average
retention time correspondence function.
The probability of peak order inversion can be calculated by comparing the orthogonal
residual variance of the accurately matched peak pairs between two chromatograms that
have the same elution order of common peaks (e.g. two chromatograms of two samples
with similar molecular composition acquired in the same batch) with the orthogonal residual
variance obtained for the two chromatograms of interest. It is advantageous to include
one chromatogram in both chromatogram pairs to minimise differences caused by
different samples and/or different LC-MS acquisitions. By
conducting an F-test on the orthogonal residual variances obtained for the two conditions,
the probability of peak elution order similarity, which is the null hypothesis
of the F-test, can be estimated. When comparing two chromatograms obtained under different conditions (e.g.
acquired in two different laboratories), it is possible to perform two separate F-tests, in which
the orthogonal residual variance with no peak order inversion is determined for both
chromatograms separately. For the final decision on peak elution order similarity, the F-test
providing the smaller p-value should be taken into consideration. If the probability of
similar peak elution order is lower than 0.01, the algorithm stops, because the
chance of similar peak elution order is low and it is therefore not possible to establish an
unambiguous one-to-one correspondence between peaks or chromatographic locations of
the two chromatograms.
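A minimal sketch of the variance comparison is given below. Several points are assumptions: the test is taken as one-sided (the variance of the pair of interest in the numerator), the degrees of freedom are taken as the number of peak pairs minus one, and the p-value is obtained by Simpson integration of the beta density rather than by a statistics library. The function name is hypothetical.

```python
import math

def f_test_p_value(var_interest, n_interest, var_reference, n_reference, steps=4000):
    """One-sided F-test: P(F >= var_interest / var_reference) under the null
    hypothesis of equal variances (i.e. similar peak elution order)."""
    d1 = n_interest - 1
    d2 = n_reference - 1
    if d1 < 2 or d2 < 2:
        raise ValueError("this sketch needs at least 3 peak pairs per chromatogram pair")
    f_stat = var_interest / var_reference
    a, b = d1 / 2.0, d2 / 2.0
    # The F cumulative distribution equals the regularised incomplete beta I_z(a, b).
    z = d1 * f_stat / (d1 * f_stat + d2)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    def beta_density(t):
        if t <= 0.0:
            return math.exp(-log_beta) if a == 1.0 else 0.0
        if t >= 1.0:
            return math.exp(-log_beta) if b == 1.0 else 0.0
        return math.exp((a - 1.0) * math.log(t) + (b - 1.0) * math.log(1.0 - t) - log_beta)

    # Composite Simpson rule on [0, z]; steps must be even.
    h = z / steps
    total = beta_density(0.0) + beta_density(z)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * beta_density(i * h)
    cdf = total * h / 3.0
    return min(1.0, max(0.0, 1.0 - cdf))

# Hypothetical variances: the pair of interest scatters far more than the reference pair.
p = f_test_p_value(var_interest=9.0, n_interest=101, var_reference=1.0, n_reference=101)
stop_alignment = p < 0.01
```

When two F-tests are run (one per reference pair), the smaller of the two p-values would be compared with the 0.01 cut-off, as described in the text.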
The orthogonal residuals are calculated differently from the residuals of a usual regression
analysis. In regression analysis the dependent and independent variable axes are fixed,
whereas in time alignment the two axes should be interchangeable (i.e. the same result
should be obtained by aligning chromatogram A to B and B to A). For this reason we
calculated the orthogonal residual distance from the main monotonic function after
transforming the retention times of the peaks in one chromatogram with the main retention time
correspondence function. After this transformation the main monotonic retention time correspondence
function becomes a line at 45° to the two retention time axes of the scatter plot.
The procedure for calculating orthogonal residuals and performing the F-test to assess the
probability of peak elution order similarity is demonstrated in Figure S5.
maxD is calculated for the orthogonal variance. The components of maxD along the axes of the
two chromatograms provide the retention time error for determining retention time locations in the
other chromatogram after alignment.
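The orthogonal residual calculation can be sketched as below, assuming that the retention times of one chromatogram have already been transformed so that the correspondence function is the 45° line y = x. The function names are hypothetical, and the use of the sample variance about the mean is an assumption.

```python
import math
import statistics

def orthogonal_residuals(rt_transformed, rt_other):
    """Signed perpendicular distance of each peak pair from the line y = x."""
    return [(y - x) / math.sqrt(2.0) for x, y in zip(rt_transformed, rt_other)]

def orthogonal_residual_variance(rt_transformed, rt_other):
    """Sample variance of the orthogonal residuals."""
    return statistics.variance(orthogonal_residuals(rt_transformed, rt_other))

# Hypothetical retention times (minutes) after transformation.
a = [10.0, 20.0, 30.0]
b = [11.0, 19.0, 30.0]
var_ab = orthogonal_residual_variance(a, b)
var_ba = orthogonal_residual_variance(b, a)  # same value: the axes are interchangeable
```

Swapping the two axes only changes the sign of each residual, so the variance is unchanged, which is the symmetry the text asks for.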
Retention time correction (step 4 in Figure 1)
If the probability of peak elution order similarity of common peaks in the two
chromatograms is high (the null hypothesis of the F-test is not rejected), the alignment function is used
to correct the retention times of the peaks in the sample chromatogram with respect to a
reference chromatogram. The method does not depend on which chromatogram is selected
as reference or sample, as the monotonic nature of the retention time trend between the two
chromatograms allows the same one-to-one correspondence of common
peaks to be determined. Figure S12 in the supporting information shows that the non-linear main retention time
correspondence trends obtained with the two different orders of the LC-MS chromatograms are highly
similar. The retention times of peaks in the sample chromatogram are calculated by
interpolation using the monotonic alignment function. The algorithm finally produces a sample
peak list aligned to the reference peak list. It should be noted that any other time
alignment method that derives a monotonic non-linear retention time correspondence function
can be used instead of the proposed monotonic constrained LOWESS/PCHIP approach.
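The correction step amounts to evaluating the monotonic alignment function at each sample retention time. The sketch below uses piecewise-linear interpolation between knots as a simple stand-in for PCHIP; the function name, the knot values and the linear extrapolation outside the knot range are assumptions.

```python
from bisect import bisect_right

def correct_retention_times(sample_rts, knots_sample, knots_reference):
    """Map sample retention times onto the reference time axis by
    piecewise-linear interpolation between the knots of a monotonic
    alignment function (knots_sample must be strictly increasing)."""
    last_segment = len(knots_sample) - 2
    corrected = []
    for rt in sample_rts:
        # Locate the segment containing rt; the end segments are reused
        # for values outside the knot range (linear extrapolation).
        i = max(0, min(bisect_right(knots_sample, rt) - 1, last_segment))
        x0, x1 = knots_sample[i], knots_sample[i + 1]
        y0, y1 = knots_reference[i], knots_reference[i + 1]
        corrected.append(y0 + (rt - x0) * (y1 - y0) / (x1 - x0))
    return corrected

# Hypothetical alignment function knots (minutes).
knots_sample = [0.0, 10.0, 20.0]
knots_reference = [0.0, 12.0, 20.0]
aligned = correct_retention_times([5.0, 15.0], knots_sample, knots_reference)
```

Because the function is strictly monotonic, swapping the two knot lists gives the inverse mapping, which reflects the statement that the choice of reference and sample chromatogram does not matter.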
Hardware and software environment
The Monte-Carlo simulation, the intensity-rank-based peak matching, the 2D-KDE and the 2D-KS
algorithms were written in the MATLAB scripting language using MathWorks MATLAB R2010b
(version 7.11.0.584, 64-bit Linux version) and were run on a desktop computer equipped with an
Intel Quad Q9300 CPU at 2.5 GHz, 8 GB RAM and the 64-bit Ubuntu Linux 10.04 operating
system. The source code is available at https://trac.nbic.nl/pre-alignment.
Peptide identification parameters
Peptide and protein identification was performed with the Phenyx database search
program (Geneva Bioinformatics, version 2.6, Geneva, Switzerland) using raw data in
mzData format. Datasets were searched against the UniProt database (version 57.4) and
against the reversed sequences of this database with the following parameters: taxonomy: Rattus
norvegicus; instrument types were selected according to the mass spectrometer used; FDR
rate: <1; scoring model: ESI-QTOF (QTOF) for QTOF data and CID_LTQ_scan_LTQ for
Orbitrap data; parent ion charge states: +2, +3, +4 (with trusted medium charge). The search
was performed in two subsequent cycles. The following search parameters were common
to both cycles: peptide AC score: ≥5; peptide length: ≥5; p-value: <0.0001; cleaving
enzyme: trypsin (KR); number of allowed missed cleavages: ≤ 1. The following search
parameters differed between the first and second search cycles: amino acid modifications
for cycle 1: Cys_CAM (carbamidomethylation, fixed), Oxidation_M (oxidation of
methionine, variable, ≤ 2); for cycle 2: Cys_CAM (fixed), Oxidation_M (variable, ≤ 2),
Oxidation_HW (oxidation of histidine and tryptophan, variable, ≤ 2), DEAMID (deamidation,
variable, ≤ 2), PHOS (phosphorylation, variable, ≤ 2). The cleavage mode was set to ‘normal’
for cycle 1 and to ‘half cleaved’ for cycle 2. The parent ion m/z tolerance was 600 ppm for the first
and 800 ppm for the second cycle for Orbitrap data, and 800 ppm for both cycles for the
QTOF data. Only MS/MS spectra of ions with intensities above the background noise
(50−100 counts) were considered for both search cycles. Source code, installation guide,
user manual and an example dataset are available at https://trac.nbic.nl/pre-alignment.
Parameters used for optimisation of intensity-based peak matching
Delta mass: 0.005, 0.01, 0.05, 0.1 and 0.3
Maximal number of most abundant peaks: 50, 100, 200, 500
Rank window fraction: 0.50, 0.60, 0.80, 0.90 and 1
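The three lists above define a full parameter grid. A minimal sketch of enumerating it is shown below; the dictionary keys and the function name are hypothetical, and the scoring of each combination (for example with the 2D-KS test) is not shown.

```python
from itertools import product

# Values taken from the three parameter lists above.
DELTA_MASS = [0.005, 0.01, 0.05, 0.1, 0.3]
MAX_ABUNDANT_PEAKS = [50, 100, 200, 500]
RANK_WINDOW_FRACTION = [0.50, 0.60, 0.80, 0.90, 1.0]

def parameter_grid():
    """All combinations of the three peak matching parameters."""
    return [
        {"delta_mass": dm, "max_peaks": n, "rank_fraction": rf}
        for dm, n, rf in product(DELTA_MASS, MAX_ABUNDANT_PEAKS, RANK_WINDOW_FRACTION)
    ]

grid = parameter_grid()  # 5 x 4 x 5 = 100 combinations
```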
Labels legend of G-score versus 2D-KS plot in Figure 2
Parameters corresponding to the labels of the G-score versus 2D-KS plot in Figure 2. In addition to
these parameters, the size of the marker indicates the ratio of trend peak pairs relative to
the total number of peak pairs (values: 0.00, 0.10, 0.20, 0.30, 0.40,
0.50, 0.75, 0.90 and 1.00). A larger marker indicates a higher percentage of trend
points used in the simulation.
[Legend plot: x axis G-score (0 to 1); y axis -log10(p value) (0 to 70). Labels cover N = 100, 250, 500, 750 and 1000 points, each with fluctuations of 0.05, 0.3, 1, 5 and 15 min.]
Supplementary figures for chapter 3
Figure S1. Uncertainty of determining peak maxima in single stage LC-MS peak detection (green double
arrow) and uncertainty of having an MS/MS event (red double arrow) during data dependent acquisition
demonstrated on extracted ion chromatogram of a chromatographic peak. The threshold of data-dependent
acquisition is presented with a red line.
[Plot: extracted ion chromatogram; x axis retention time (min), 164 to 168; y axis intensity, 0 to 10000.]
Figure S2. Schematic representation of intensity-rank-based peak matching using the sliding window
technique. The first and second sliding windows are represented by black dashed boxes in the two single stage
LC-MS peak lists. The length of the window in this case is 10. The window slides from the most abundant
to less abundant peaks until it reaches a limit rank provided as a parameter. The grey arrows show
corresponding peaks in the two single stage LC-MS peak lists within a certain mass tolerance (here ±0.1 Da),
which are included in the matched peak list. The resulting matched peak pairs are the outcome of the
procedure.
[Figure content: reference (Ref.) and sample (Samp.) peak lists with columns RT, M/Z and Intensity, sorted by decreasing intensity. Matched peak list shown in the figure:]
Ref. RT    Ref. M/Z   Ref. Intensity   Samp. RT   Samp. M/Z   Samp. Intensity
92.02026   633.8222   1.05E+08         187.6178   633.8263    312388
76.86034   575.3129   89315800         144.1663   575.3158    282728
57.61272   550.843    45829000         329.0283   550.9585    465810
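The sliding window idea in the caption of Figure S2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' algorithm: the window is assumed to cover the same rank range in both lists and to advance one rank at a time, each peak is matched at most once to the closest m/z within the tolerance, and the rank fraction parameter is not modelled. The function name and the toy peak lists are hypothetical.

```python
def match_peaks_by_intensity_rank(ref_peaks, sample_peaks, window=10, max_rank=50, mz_tol=0.1):
    """Peaks are (rt, mz, intensity) tuples; returns a list of (ref_peak, sample_peak)."""
    ref = sorted(ref_peaks, key=lambda p: p[2], reverse=True)[:max_rank]
    sam = sorted(sample_peaks, key=lambda p: p[2], reverse=True)[:max_rank]
    matched = []
    used_ref, used_sam = set(), set()
    last_start = max(max(len(ref), len(sam)) - window, 0)
    for start in range(last_start + 1):
        stop = start + window
        for i in range(start, min(stop, len(ref))):
            if i in used_ref:
                continue
            # Unused sample peaks in the same rank window within the m/z tolerance.
            candidates = [
                j for j in range(start, min(stop, len(sam)))
                if j not in used_sam and abs(ref[i][1] - sam[j][1]) <= mz_tol
            ]
            if candidates:
                j = min(candidates, key=lambda k: abs(ref[i][1] - sam[k][1]))
                matched.append((ref[i], sam[j]))
                used_ref.add(i)
                used_sam.add(j)
    return matched

# Toy peak lists (rt in min, m/z, intensity); values are illustrative only.
ref = [(10.0, 500.00, 1000.0), (20.0, 600.00, 900.0), (30.0, 700.00, 800.0)]
sam = [(21.0, 600.05, 50.0), (11.5, 500.02, 40.0), (40.0, 900.00, 30.0)]
pairs = match_peaks_by_intensity_rank(ref, sam, window=2, max_rank=3, mz_tol=0.1)
rt_pairs = [(r[0], s[0]) for r, s in pairs]
```

The retention times of the returned pairs are the coordinates that would enter the 2D-KDE and the alignment function.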
Figure S3. Effect of peak matching parameters (maximal m/z differences, length of sliding window and rank
fraction parameter; these parameters are shown at the bottom part of the plots) on peak pairing results
presented with scatter plots. Y axis is the retention time of matched peak in laboratory 2 and X axis is the
retention time of matched peak pairs in laboratory 1. The above scatter plots were obtained using NCI mouse
serum dataset and chromatograms of Lab1_GLY_LungEGFR_tumor and Lab2_GLY_LungEGFR_tumor.
[Four scatter plots, panels (a) to (d); x axis: retention time in Lab 1 (min), y axis: retention time in Lab 2 (min). Parameter labels extracted from the panels, in extraction order: 0.1, 500, 0.2; 0.5, 500, 0.3; 0.1, 500, 0.3; 0.1, 750, 0.3.]
Figure S4. Visualization of the most important steps of the time alignment quality control procedure using
simulated data with 500 matched peak pairs, 5 minutes of constant fluctuation of accurately matched peak
pairs and ratio of 0.75 of accurately matched peak pairs. The plot a) shows the initial data in a scatter plot with
retention time of the matched peak pairs in two chromatograms. Blue dots represent the randomly matched
peak pairs and red dots represent the accurately matched peak pairs following the main monotonic retention
time trend. Plot b) shows the corresponding 2D-KDE density image of peak pairs shown in plot a). Plot c)
shows the 2D-KDE density image of cross product of two 1D-KDE peak density of two LC-MS
chromatograms. This plot represents the peak density that is obtained with random pairing of peak in two
chromatograms. The 2D-KS test is performed by comparing the cumulative density distribution of plot b) and
c) when optimizing the parameters of intensity-based peak matching procedure. The plot d) shows the
histogram of the density values of plot b) with natural logarithm of counts within a histogram bins. The red line
shows the location of the threshold (p = 9.144·10^-29) selected automatically and corresponding to the location
where the positive first derivative of the histogram is the closest to its median. Plot e) shows the 2D-KDE
density image of plot b) enclosing the density region higher than the automatically selected threshold (white
contours). Plot f) shows the scatter plot presented in a) with the contour of high density regions (red contours)
indicating the peaks pairs selected and considered being accurately matched.
[Six panels, a) to f); axes of the scatter plots and density images: retention time chromatogram 1 (min) and retention time chromatogram 2 (min), 0 to 150; panel d) axes: density (×10^-3) and ln(counts).]
Figure S5. Calculation of orthogonal residuals and calculation of probability for peak elution order similarity.
Original retention time of chromatogram (left plots; chromatograms to be transformed are: sample 12 in
laboratory 1 and sample 6 in laboratory 2) of accurately matched peak pairs are transformed using the main
retention time correspondence function to the retention time space of the other chromatogram (right plots). In
this transformed scatter plot the main monotonic function is a diagonal line with 45° between the two
axes, from which the orthogonal distance can be calculated using right-angled triangle rules. The orthogonal
residual variance is then calculated for the two chromatograms of interest (upper right plot) and for a pair of
LC-MS chromatograms, which do not have peak elution order inversion, acquired generally within the same
batch and well controlled chromatographic parameters (lower right plot). It is preferable that one
chromatogram of the reference chromatogram pair with no peak elution order inversion is chosen from the
chromatogram that should be aligned (in this plot Sample6_Lab1). The F-test is calculated using these two
orthogonal residual variances. It is possible to perform this transformation for the two LC-MS chromatograms
(Sample6_Lab1 as it is presented here and Sample6_Lab2) and therefore perform two F-test calculation with
both chromatograms, which should be aligned. The F-test providing the lowest p-value is considered for the
final decision if the two LC-MS chromatograms of interest have or not the same elution order of common
peaks.
[Panels: scatter plots on the original axes (left) and on uniform axes after axis transformation (right) for the within-lab pair (Sample6_Lab1, Sample12_Lab1) and the between-labs pair (Sample6_Lab1, Sample6_Lab2), followed by the comparison with the F-test.]
Figure S6. Three-dimensional bar plots of specificity (top left), sensitivity (top right) and minus log of the 2D-KS test
probability (bottom left) obtained with Monte Carlo simulation, representing the complete studied parameter space. The
numbers of peak pairs were 100, 250, 500, 750 and 1000, the fluctuations of the accurately matched peaks were 0.05,
0.3, 1, 5, 10 and 15 minutes and the fractions of accurately matched peak pairs were 0.00, 0.10, 0.20, 0.30, 0.40, 0.50, 0.75, 0.90 and 1.00.
[Bar plots: sensitivity, specificity and -log10(p-value), each plotted against fluctuation and N and against the fraction of accurately matched peak pairs.]
Figure S7. Scatter plots of matched peak pairs obtained with the intensity-rank-based peak matching method from deisotoped
LC-MS peak lists (left), and after deisotoping and decharging the same two LC-MS peak lists (right). Decharging the peak
lists results in a lower number of matched peak pairs, but the peak pairs are richer in accurately matched peak pairs
indicating the retention time trend. Peak matching parameters for Lab1_GLY_LungEGFR_normal vs
Lab2_GLY_LungEGFR_normal were 500 as the window length, 0.1 Da as the maximal m/z difference and 0.9 as the rank
fraction. The two analysed LC-MS peak lists had the following pre-analytical parameters: LC-MS 1: laboratory
1, GLY depletion, Lung EGFR cancer type, without tumor; LC-MS 2: laboratory 2, GLY depletion, Lung EGFR cancer
type, without tumor.
[Two scatter plots; axes: retention time LC-MS 1 (min) and retention time LC-MS 2 (min).]
Figure S8. Peak with long tailing in the rat CSF dataset analysed in laboratory 2 (a) and histogram of peak width at half
peak height (35 bins, taking the 1000 most abundant peaks) of the 4 samples in two datasets acquired in different
laboratories (b) (the chromatograms used are the same as in the middle column of Figures 3 and S9). Plot a) was
prepared with the OpenDX (http://www.opendx.org/) visualization software tool.
[Panel (b): x axis peak width at half peak height (min), y axis counts; legend: rat CSF Lab1, rat CSF Lab2, rat serum Lab1, rat serum Lab2.]
Figure S9. Extracted ion chromatograms (EIC) of three peptides from the same sample (sample 6 from the
rat CSF dataset) in two laboratories using the original retention time values. Peptide LTLPQLEIR (green
arrows) is located on the monotonic retention time correspondence function, while the peptides DIAPTLTLYVGK
(red arrows) and VHQFFNVGLIQPGSVK (blue arrows) are located far from this function; Figure 5 shows
the location of these peaks after alignment of one chromatogram to the other. The locations of the three peaks are
shown in the scatter plot of Figure 4 with corresponding red, green and blue circles. The extracted ion
chromatograms are normalized to the highest peak; for that reason the Y axis represents ion counts relative
to the most abundant signal.
[Plot: x axis time (min), 0 to 400; y axis ion count (cts); traces: m/z 645.87, 541.83 and 590.66, each ±0.025 Da, original retention times, Lab1 and Lab2.]
Figure S10. Scatter plots of matched peaks between two LC-MS chromatograms with time alignment
functions. All chromatograms were obtained from LC-MS chromatograms of the National Cancer Institute’s
Mouse Proteomic Technology Initiative and originate from an experimental design study of mouse serum
analysis. The scatter plots in the middle column were obtained from two LC-MS chromatograms of the same
sample prepared in two laboratories, while the right and left columns were obtained with two LC-MS
chromatograms of the same laboratory, of which one was used in the middle scatter plot. Matched peak
pairs were obtained using peak lists obtained from single stage LC-MS data with an OpenMS workflow and using the
intensity-rank-based peak matching procedure. Peak pairs not selected as accurately matched peak pairs are
blue. The peak pairs selected as accurately matched are contoured with dashed red lines and are highlighted
with green circles. The main monotonic retention time correspondence function is shown as a solid red line.
Samples have the following factors in the experimental design: GLY depletion method, Lung EGFR cancer
type, tumor (middle plot) and tumor and healthy (side plots).
[Three scatter plots, panels (a) to (c), from single-stage MS peak lists with intensity-rank-based peak matching; panel titles: within laboratory (2 samples), interlaboratory (same sample), within laboratory (2 samples); axes: retention time in Lab 1 or Lab 2 (min).]
Figure S11. Overlaid plot of multiple main monotonic retention time correspondence function in 7
chromatogram pairs of the same rat CSF samples (a, b and c) and 24 chromatogram pairs of mouse serum
samples (d) measured in two laboratories. In a) pairs of LC-MS precursor ion peak lists were matched based
on agreement of identified peptide sequence and PTMs, in b) two LC-MS precursor ion peak lists were
matched using intensity-rank-based peak matching approach and in c) pairs of LC-MS single stage ion peak
lists were matched using intensity-rank-based peak matching algorithm. Overlaid plot in d) was obtained with
pairs of LC-MS single stage ion peak lists matched using intensity-rank-based peak matching algorithm and
the main monotonic function is colored according to the applied depletion method (in red GLY, in green MARS,
in blue M+CYS and in black NF). The high similarity of the main monotonic retention time correspondence
functions shows that method using sequence information to match precursor ion peak lists and
intensity-rank-based matched single stage LC-MS peak lists are robust with respect of biological variability and
that the two methods provide highly similar correction of retention time. Single stage LC-MS peak lists with
combination of intensity-rank-based peak matching is slightly less accurate, which is reflected by the larger
variability of the main monotonic retention time correspondence functions obtained with this method compared
with those obtained with precursor ion LC-MS peak lists matched using agreement of identified peptide
sequence and PTMs.
[Four overlaid plots, panels a) to d); x axis: retention time in laboratory 1 (min), y axis: retention time in laboratory 2 (min).]
Figure S12. Monotonic non-linear time alignment functions (solid red and green lines) determined with the two different orders of
two LC-MS/MS chromatograms as sample and reference chromatogram. Peak matching was performed using identified
peptide sequence and post-translational modification data, and blue dots show the matched peak pairs. The two
chromatograms were from sample 6 in laboratory 1 and laboratory 2. The two monotonic retention time correspondence
functions are highly similar, which shows that the time alignment procedure does not depend on the order of the
chromatograms.
[Plot: x axis retention time in minutes (laboratory 1), 0 to 400; y axis retention time in minutes (laboratory 2), 0 to 200.]
References
(1) Kendall MG, Buckland WR, Institute IS. A dictionary of statistical terms: Hafner Pub. Co.; 1971.
(2) Liu T, Qian WJ, Chen WN, Jacobs JM, Moore RJ, Anderson DJ, et al. Improved proteome coverage by using high efficiency cysteinyl peptide enrichment: the human mammary epithelial cell proteome. Proteomics. 2005;5:1263-73.
(3) Liu T, Qian WJ, Strittmatter EF, Camp DG, 2nd, Anderson GA, Thrall BD, et al. High-throughput comparative proteome analysis using a quantitative cysteinyl-peptide enrichment technology. Anal Chem. 2004;76:5345-53.
(4) Livesay EA, Tang K, Taylor BK, Buschbach MA, Hopkins DF, LaMarche BL, et al. Fully automated four-column capillary LC-MS system for maximizing throughput in proteomic analyses. Anal Chem. 2008;80:294-302.
(5) Kessner D, Chambers M, Burke R, Agus D, Mallick P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics. 2008;24:2534-6.
(6) Sturm M, Bertsch A, Gropl C, Hildebrandt A, Hussong R, Lange E, et al. OpenMS - an open-source software framework for mass spectrometry. BMC Bioinformatics. 2008;9:163.
(7) Scheltema RA, Kamleh A, Wildridge D, Ebikeme C, Watson DG, Barrett MP, et al. Increasing the mass accuracy of high-resolution LC-MS data using background ions: a case study on the LTQ-Orbitrap. Proteomics. 2008;8:4647-56.
(8) Palmblad M, van der Burgt YE, Dalebout H, Derks RJ, Schoenmaker B, Deelder AM. Improving mass measurement accuracy in mass spectrometry based proteomics by combining open source tools for chromatographic alignment and internal calibration. J Proteomics. 2009;72:722-4.
(9) Botev Z, Grotowski J, Kroese D. Kernel density estimation via diffusion. The Annals of Statistics. 2010;38:2916-57.
(10) Peacock JA. Two-dimensional goodness-of-fit testing in astronomy. Monthly Notices of the Royal Astronomical Society. 1983;202:23.
(11) Cleveland WS, Devlin SJ. Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting. Journal of the American Statistical Association. 1988;83:596-610.
(12) Breiman L. Bagging Predictors. Machine Learning. 1996;24:123-40.
(13) Fritsch FN, Carlson RE. Monotone Piecewise Cubic Interpolation. SIAM Journal on Numerical Analysis. 1980;17:238-46.
Supporting information for chapter 4
Xrea score calculation
The Xrea score measures the quality of MS/MS spectra using the ranked cumulative intensity
distribution of the fragment ions; this measure is independent of the identification status
of the fragment spectra. It is assumed that the intensities of the fragment ions in an MS/MS
spectrum that contains only noise are evenly distributed. In contrast, in an MS/MS spectrum
containing fragments from a compound, exclusively or mixed
with noise, the fragment ion intensity distribution is uneven, showing a difference in intensity
distribution between fragments originating from the compound and noise. The Xrea score can
take values between 0 (MS/MS spectrum contains only noise fragments) and 1 (MS/MS spectrum contains only
compound-derived fragments). Details of the Xrea calculation are given in Na et al. (1).
Supplementary figures for chapter 4
Figure S1. Example of MS1 isotopologue peaks with high and low quality MS/MS spectra. Panel a) shows an
extracted ion chromatogram of m/z 994.02 Da with a chromatographic peak at a retention time of 52.04 minutes.
The maximum ion intensity of this peak is 1.30·10^4 ion counts. This peak was submitted for MS/MS
fragmentation at the retention time indicated with an arrow, with a precursor intensity of 1.15·10^4 ion counts. The resulting
MS/MS spectrum (panel c) shows a random distribution of fragment ions with respect to ion intensity, indicating
low MS/MS spectral quality, and has an Xrea value of 0.14. This spectrum did not obtain a peptide sequence
annotation according to the applied search parameters and FDR settings. In contrast, the chromatographic peak
in the extracted ion chromatogram of m/z 625.31 Da at a retention time of 60.37 minutes (panel b) provided an
MS/MS spectrum (panel d) sampled at the top of the peak, at the retention time indicated with an arrow, with the precursor ion
intensity reaching 8.0·10^5 ion counts at the MS/MS precursor sampling time. Since the peptide feature was
submitted for MS/MS fragmentation at its highest intensity, the peptide feature ion intensity corresponded to
the precursor ion intensity of 8.0·10^5 ion counts. This MS/MS spectrum is of high quality, as shown by the uneven
fragment ion distribution, which differs from that of the MS/MS spectrum in panel c). The high quality of the MS/MS spectrum
is confirmed by the Xrea value of 0.85, and the spectrum obtained a successful PSM attributing the primary amino acid
sequence TTPPVLDSDGSFFLYSK (panel e). This figure was produced from the LC-MS/MS file obtained
from the 14th fraction of kidney using the bRP approach.
[Panel annotations: m/z 994.02, RT 52.04, charge +2, scan 6139, Xrea 0.140; m/z 625.31, RT 60.37, charge +3, scan number 7301, Xrea 0.854.]
Figure S2. Overview of the main steps of the identification transfer workflow. Peptide sequences are first
matched (based on amino acid sequences) and the common annotated peptide features are used to assess
orthogonality between the reference and sample datasets as described in Mitra et al.7 The retention time
coordinates of the common annotated peptide features are used to correct the retention time of the peptide
features in the sample dataset with a monotonic retention time correction function, followed by assessment of
orthogonality between the sample and reference chromatograms. Identification transfer is then performed
between peptide features of the reference dataset having PSMs and unannotated peptide features of the sample
dataset by matching peptide features on m/z and retention time with thresholds of 0.005 Da and 1 minute,
respectively. The identification transfer is performed using the LOOCV procedure shown in Figure S3 and
described in the section “4.2.4 Identification transfer” in the material and methods section of the manuscript.
[Figure S2 flow chart boxes: reference dataset with identified features (reference list); sample dataset with
unidentified features (sample list); match common peptide features between datasets and correct retention time
for monotonic shift in sample chromatogram; assessment of orthogonality; transfer peptide identification with
0.005 Da and ≤ 1 minute thresholds; annotated peptides in sample list.]
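The monotonic retention time correction step can be sketched as follows. This is a minimal illustration under stated assumptions, not the workflow's implementation: the thesis uses a nonlinear monotonic LOWESS regression, whereas this sketch swaps in a piecewise-linear interpolation through the common peptide features with a running maximum to enforce monotonicity. The function name `fit_monotonic_correction` and the shift-only extrapolation outside the anchor range are hypothetical choices.

```python
from bisect import bisect_left


def fit_monotonic_correction(sample_rt, reference_rt):
    """Return a function mapping sample retention times onto the reference scale.

    sample_rt / reference_rt: retention times of the same (common) peptide
    features in the sample and the reference dataset.
    Assumption: piecewise-linear stand-in for a monotonic LOWESS fit.
    """
    if not sample_rt or len(sample_rt) != len(reference_rt):
        raise ValueError("need equally long, non-empty anchor lists")
    xs, ys = [], []
    for x, y in sorted(zip(sample_rt, reference_rt)):
        if xs and x == xs[-1]:
            continue  # keep one anchor per sample retention time
        # Running maximum keeps the mapping non-decreasing.
        ys.append(max(y, ys[-1]) if ys else y)
        xs.append(x)

    def correct(rt):
        # Outside the anchor range, apply the shift of the nearest anchor.
        if rt <= xs[0]:
            return ys[0] + (rt - xs[0])
        if rt >= xs[-1]:
            return ys[-1] + (rt - xs[-1])
        i = bisect_left(xs, rt)  # xs[i - 1] < rt <= xs[i]
        x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (rt - x0) / (x1 - x0)

    return correct
```

After correction, the retention times of unannotated sample features can be compared directly with those of the reference features using the m/z and retention time thresholds given in the caption.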
Figure S3. Scheme showing the error rate assessment of the identification transfer method using leave-one-
out cross validation (LOOCV). In LOOCV the peptide features with the same annotation in the reference
and sample datasets (common peptide features) are divided into k (k = 5) subsets of equal size. Training
sets were constructed from 80% of the common peptide features (4 out of 5 subsets) and the test set
included the remaining 20% of the common peptide features (1 out of 5 subsets). The monotonic retention
time shift is corrected in the sample set with a nonlinear monotonic LOWESS regression function as described
in Mitra et al.2 using the common peptide features from the training set. Peak matching using all common
peptide features was performed with different tolerances for m/z (0.005, 0.05, 0.10, 0.15 or 0.20 Da) and
retention time (1.00, 3.25, 5.50, 7.75 or 10.00 minutes). Performance metrics, such as the FDR based on
agreement and disagreement of peptide identity and the difference between the predicted and measured
retention time of peptide features receiving annotation in the sample list, are calculated using the common
peptide features in the test set. This procedure is repeated 5 times so that every common peptide feature is
included in a test set once, and the complete training/test partitioning procedure was repeated 100 times.
[Figure S3 flow chart boxes (leave-one-out cross validation, repeated 100 times): reference and sample
datasets; select common annotated peptide features between datasets; each kth iteration: training set and
test set; retention time correction function; correct retention time of peptide features in sample dataset;
match all peptide features based on m/z and retention time coordinates; calculate FDR and retention time
prediction error for the test set.]
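The partitioning scheme described for Figure S3 (five subsets, each used once as test set, the whole partitioning repeated 100 times) and the agreement-based FDR can be sketched as below. This is an illustration under assumptions: the function names `repeated_kfold` and `transfer_fdr` are hypothetical, features are addressed by index, and the FDR is taken as the fraction of transferred annotations that disagree with the known peptide identity of the test features.

```python
import random


def repeated_kfold(n_features, k=5, repeats=100, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold partitioning."""
    rng = random.Random(seed)
    for r in range(repeats):
        order = list(range(n_features))
        rng.shuffle(order)
        # k interleaved subsets of (nearly) equal size.
        folds = [order[i::k] for i in range(k)]
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield r, f, train, test


def transfer_fdr(true_ids, transferred_ids):
    """Fraction of transferred annotations that disagree with the known identity.

    transferred_ids holds None for test features that received no annotation.
    """
    pairs = [(t, p) for t, p in zip(true_ids, transferred_ids) if p is not None]
    if not pairs:
        return 0.0
    wrong = sum(1 for t, p in pairs if t != p)
    return wrong / len(pairs)
```

In each iteration the retention time correction would be derived from the training indices only, and `transfer_fdr` evaluated on the test indices for every combination of m/z and retention time tolerance.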
Figure S4. Distributions (histogram) of Xrea scores of all MS/MS spectra (red trace) and MS/MS spectra that
have successful PSM annotation with PEAKS (blue trace).
[Figure S4 panels: kidney/bRP f14, f15 and f16; kidney/gel f14, f15 and f16; esophagus/bRP f14, f15 and f16;
esophagus/gel f14, f15 and f16. x axis: Xrea (0.2 to 1.0); y axis: Counts; legend: MS2 Xrea, PSMs Xrea.]
Figure S5. Scatter plot of Xrea scores and log10 precursor ion intensity of unidentified MS/MS spectra (green dots),
MS/MS spectra identified using PEAKS (blue dots) and MS/MS spectra that were not identified with PEAKS, but
obtained identification with identification transfer (red dots). The marginal distributions are obtained for Xrea scores
(top-panel) and log10 precursor ion intensity (right panel) for each precursor ion class as described above.
Supporting information for chapter 5
Materials and methods
LC-MS/MS (QTOF)
For the analysis of 40 depleted serum samples, the HPLC equipment and elution program were
identical to those of the LC-MS analyses on the iontrap (see section 1.4.1. in the main manuscript), while
MS/MS analysis was performed using a quadrupole time-of-flight mass spectrometer (QTOF, Agilent
6510). Data-dependent LC-MS/MS analysis was performed in 2 GHz extended dynamic range
mode with collection of 3 MS/MS spectra in one duty cycle under the following additional parameters:
mass range: 275-2000 m/z; acquisition duty cycle: 1 spectrum/sec; data storage: profile and centroid mode;
fragmentor: 175 V; skimmer: 65 V; OCT 1 RF Vpp: 750 V; spray voltage: ~1900 V; drying gas temperature:
325 °C; drying gas flow (N2): 6 L/min. Mass correction was performed during analysis using internal
standards of 371.31559 m/z (originating from the ubiquitous background ion dioctyl adipate, DOA,
a plasticizer) and 1221.990637 m/z (HP-1221 calibration standard, evaporating from a wetted wick
inside the spray chamber).
Data pre-processing and quantification
Figure 1 shows the workflow with the steps in the analysis of the experimental design
data. Raw iontrap single-stage LC-MS datasets were obtained in Bruker Daltonics HPLC-MS.dat
format and converted into the mzXML proteomics standard format using the msConvert
tool from the ProteoWizard toolset1,2. The Threshold-Avoiding Proteomics Pipeline (TAPP)3 was
used to extract chromatographic peaks from the raw data and for data pre-processing. Centroid data
were smoothed and reduced using a normalized two-dimensional Gaussian filter with a peak
resolution in the m/z dimension of σm/z = 0.3 m/z. This parameter was obtained by optimizing the peak
detection quality upon visual inspection of one chromatogram (see Figure S5). Smoothing low-resolution
single-stage iontrap data with σm/z = 0.3 m/z and σrt = 0.5 minutes using a two-dimensional
Gaussian filter results in one peak without isotopic resolution for each peptide isotope cluster of a
given charge state. The non-linear retention time shifts between LC-MS peak lists were corrected
using Warp2D, a tool based on Correlation Optimised Warping (COW). To find the best
reference chromatogram, all possible pairwise combinations of alignments were performed, resulting
in a total of 256 pairwise alignments, using distributed grid computing with the Data Analysis
Framework (DAF)4. The parameters for time alignment were as follows: retention time width: 0.5
minutes; m/z width: 0.3 Da; maximal retention time difference: 0.6 minutes; maximal m/z difference:
0.6 Da; window size: 50 points; slack parameter: 10 points; maximal number of peaks/segment: 50;
number of total time points: 2 000; constant retention time shift: 0 min. The raw data were analysed
for peaks with 100-1 500 m/z and 65-135 minutes of retention time. Every alignment combination
with Warp2D produced a quality score between 0 (no alignment and/or no peak list similarity) and 1
(perfect alignment and peak list similarity). The peak list with the highest geometric mean, over all
combinations, of the sum of overlapping peak volumes (normalized to the sum of peak volumes of the two
individual chromatograms after warping) was selected as the optimal reference (sample ID
16090525). All other peak lists were aligned and corrected to this reference and used for further
processing. Corresponding peaks in multiple chromatograms were matched with the MetaMatch
module of the TAPP pipeline using the following parameters: delta m/z: 0.3 Da; delta retention time:
0.5 minutes; minimal fraction of class occupancy of peaks: 0.50, meaning that a matched peak was
retained if it was found in at least 8 of the 16 analysed samples. Isolated peaks
that did not belong to a peak cluster were removed as “orphaned” peaks. Finally, a quantitative peak
matrix containing the intensity, average m/z ratio and average retention time of 2 559
common peaks was obtained and used in the subsequent statistical analysis. The resulting quantitative
peak matrix contained 13,492 zeros, corresponding to 32.95% of the total number of matrix entries
(2 559 peaks × 16 samples = 40 944). The intensity of the orphaned peaks was considered representative
of the noise level. Gaussian mixture curves were fitted to the natural logarithm of the intensity of the
orphaned peaks. This analysis resulted in a normal distribution for the noise, N(µ=6.042, σ=0.5391).
This distribution was used to noise-fill the zeros in the peak matrix.
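Two of the steps above, choosing the reference peak list by geometric mean of the pairwise quality scores and noise-filling the zeros, can be sketched as follows. This is a minimal illustration, not TAPP or Warp2D code. The names `select_reference` and `noise_fill` are hypothetical, the layout of the score table (`scores[i][j]` for the alignment of list j to list i, with the self-alignment excluded) is an assumption, and the sketch assumes the peak matrix holds raw intensities, so that draws from the fitted log-normal noise model are exponentiated.

```python
import math
import random


def select_reference(scores):
    """Index of the peak list with the highest geometric mean of pairwise scores.

    scores[i][j] in [0, 1]: quality score for aligning list j to list i
    (hypothetical layout); the diagonal is ignored.
    """
    best, best_gm = None, -1.0
    for i, row in enumerate(scores):
        others = [s for j, s in enumerate(row) if j != i]
        if any(s <= 0 for s in others):
            gm = 0.0  # one failed alignment drives the geometric mean to zero
        else:
            gm = math.exp(sum(math.log(s) for s in others) / len(others))
        if gm > best_gm:
            best, best_gm = i, gm
    return best


def noise_fill(matrix, mu=6.042, sigma=0.5391, seed=0):
    """Replace zeros with draws from the noise model fitted on ln(intensity)."""
    rng = random.Random(seed)
    return [
        [v if v > 0 else math.exp(rng.gauss(mu, sigma)) for v in row]
        for row in matrix
    ]
```

As a check on the reported occupancy, 13,492 zeros in a 2 559 × 16 matrix is 13,492 / 40,944, or about 32.95% of all entries.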
Annotation of the aligned peak matrix
All depleted human serum samples for the experimental design were analysed in single-stage (MS1)
mode, therefore no identification of peaks with MS/MS spectra was possible. In order to annotate
the quantitative peak matrix obtained with the TAPP pipeline, we used the data of 40 depleted human
serum samples obtained with the same sample preparation and analysed with a QTOF instrument
using the same LC columns and elution conditions. The obtained LC-MS/MS data were identified with
the PEAKS 7.5 database search tool5 with the following parameters: database: Uniprot (July 22,
2015); parent mass error tolerance: 50.0 ppm; fragment mass error tolerance: 0.05 Da; precursor
mass search type: monoisotopic; enzyme: trypsin; max missed cleavages: 2; non-specific cleavage:
one; variable modifications: Oxidation (M); max variable PTM per peptide: 3; searched entries: 295,778;
MS/MS quality filter: >0.65; FDR (Peptide-Spectrum Matches): 0.1%; FDR (Peptide Sequences):
1.0%; FDR (Protein): 0.0%. Identifications passing these FDR thresholds, determined with the reverse
decoy approach, were retained and used for annotation transfer. The dataset contained 356,633 MS/MS
scans, and the database search resulted in 174,331 peptide-spectrum matches, 1,884 identified unique
peptide sequences, 2,417 unique peptides of different charge and oxidation states, 106 protein groups
and 229 identified proteins. The FDR for PSMs, peptides and proteins was <1%. The iontrap data were
identified using the same parameters except for the following: parent mass error tolerance: 0.5 Da;
fragment mass error tolerance: 0.3 Da. The search of the iontrap data resulted in 334 peptide-spectrum
matches, 163 unique peptide sequences, 183 unique peptides of different charge and oxidation states,
33 protein groups and 113 proteins. Since the iontrap LC-MS data used for quantification were obtained
with a different instrument than the QTOF LC-MS/MS data used for peptide identification, it was
necessary to check whether the elution order of common peaks in the two analysis batches is the same6.
Two depleted human serum samples (sample IDs 30B and 29B) were analysed on the QTOF and on the iontrap
using MS/MS mode. A database search of these two analyses resulted in 181 unique peptides of different
charge and oxidation states identified in both QTOF and iontrap datasets. The analysis of two different
samples with the QTOF and iontrap instruments separately resulted in 470 and 27 MS/MS spectra with common
identification, respectively (plots 3 and 1 in Figure S6 in supporting information). Due to the low
number of identified common peptides it was not possible to apply our quality control method
assessing peak elution order inversion6; however, visual inspection of the scatter plot of the MS/MS
spectra and the calculated Dmax (Figure S6 in supporting information) showed slight orthogonality of
the separation despite the fact that the same liquid chromatography system, elution conditions and
column were used7. This setup allows the transfer of the peptide identifications from the 40 QTOF
MS/MS files, with a minimal error of 2 minutes as determined by Dmax, to annotate the peaks in the
single-stage MS profiles of the experimental design dataset; however, the results should be considered
with care. The scheme of the main steps of identification transfer is presented in Figure S7
(supporting information). Using the derived non-linear retention time correspondence function, the
retention times of the 2,417 unique peptides with different charge and oxidation states in the QTOF
datasets were aligned to the iontrap LC-MS dataset. The corrected retention time coordinates of the
QTOF unique peptides were used to annotate peaks corresponding to peptides in the pre-processed
iontrap experimental design LC-MS dataset. In the annotation we allowed differences of 0.85 m/z and
3.5 min between the m/z and retention time of a peptide identification and the peak position in the
matched single-stage experimental design dataset. The matching procedure resulted in 629 peaks
annotated with peptide and protein identification. For the annotation of the peaks most affected by
significant factors in Figure 3 and Table S3 we have used protein names corresponding to SwissProt
identifiers.
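The final annotation step can be sketched as a window search around each retention-time-corrected peptide identification, keeping the highest peak inside the window (as stated in the caption of Figure S7). This is a minimal illustration under assumptions: the function name `annotate_peaks`, the tuple layouts and the rule that the first identification wins when two identifications select the same peak are hypothetical.

```python
def annotate_peaks(peaks, identifications, mz_tol=0.85, rt_tol=3.5):
    """Assign peptide identifications to peaks of the quantitative peak matrix.

    peaks: list of (mz, rt, intensity) tuples from the single-stage dataset.
    identifications: list of (mz, corrected_rt, peptide) tuples.
    Returns {peak_index: peptide}.
    """
    annotations = {}
    for id_mz, id_rt, peptide in identifications:
        # All peaks inside the m/z and retention time window.
        candidates = [
            (intensity, index)
            for index, (mz, rt, intensity) in enumerate(peaks)
            if abs(mz - id_mz) <= mz_tol and abs(rt - id_rt) <= rt_tol
        ]
        if not candidates:
            continue
        _, best = max(candidates)  # highest peak in the window
        # Assumption: keep the first identification if a peak is hit twice.
        annotations.setdefault(best, peptide)
    return annotations
```

The wide tolerances (0.85 m/z, 3.5 min) reflect the low resolution of the smoothed iontrap data and the retention time uncertainty estimated from Dmax, which is why the text advises treating the annotations with care.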
Parameters of the simulated dataset
We simulated a data matrix X with the same dimensions as the data matrix X obtained from the
experimental design LC-MS dataset with the goal to assess the performance of ASCA in identifying
significant pre-analytical factors and the peaks affected by these factors, and to assess ASCA
performance with respect to peak selection using Volcano plot parameters (t-test p-value and fold
change).
The X matrix of dimension (2,559 × 16) was obtained as follows: of the seven pre-analytical factors,
three (factors 1, 3 and 5) were constructed to have a significant effect on 5%, 3.5% and 5% of
randomly chosen peaks with a mean difference in peak intensity between the two factor levels of 3,
4 and 6, respectively. Peaks not affected by any factor were sampled from the noise distribution found
in the experimental design data, a normal distribution N(µ=6.042, σ=0.539). The peak
intensities obtained with this approach for the seven factors were averaged and the outcome
was used as the simulated data matrix X for ASCA analysis.
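The construction of the simulated matrix can be sketched as below. Several details are assumptions made only for illustration: the two-level design for the 16 samples is not given in the text, so a 16-run fractional factorial (`two_level_design`) is used as a stand-in; the mean differences of 3, 4 and 6 are applied on the same (log-intensity) scale as the noise distribution; and the factor-specific matrices are averaged cell by cell. All function names are hypothetical.

```python
import itertools
import random


def two_level_design():
    """Hypothetical 16-run, 7-factor two-level design (levels coded 0/1)."""
    runs = []
    for a, b, c, d in itertools.product((-1, 1), repeat=4):
        signs = (a, b, c, d, a * b, a * c, b * c)
        runs.append([1 if s > 0 else 0 for s in signs])
    return runs  # runs[sample][factor]


def simulate_matrix(n_peaks=2559, effects=None, mu=6.042, sigma=0.539, seed=0):
    """Simulate a peak matrix; effects maps factor index -> (fraction, delta)."""
    if effects is None:
        # Factors 1, 3 and 5 of the text (zero-based indices 0, 2 and 4).
        effects = {0: (0.05, 3.0), 2: (0.035, 4.0), 4: (0.05, 6.0)}
    rng = random.Random(seed)
    design = two_level_design()
    n_samples, n_factors = len(design), len(design[0])
    total = [[0.0] * n_samples for _ in range(n_peaks)]
    affected = {}
    for f in range(n_factors):
        fraction, delta = effects.get(f, (0.0, 0.0))
        chosen = set(rng.sample(range(n_peaks), round(fraction * n_peaks)))
        affected[f] = chosen
        for p in range(n_peaks):
            for s in range(n_samples):
                value = rng.gauss(mu, sigma)  # noise for every cell
                if p in chosen and design[s][f] == 1:
                    value += delta  # shift the high level of affected peaks
                total[p][s] += value
    # Average the seven factor-specific matrices.
    matrix = [[v / n_factors for v in row] for row in total]
    return matrix, design, affected
```

Because the seven matrices are averaged, an effect of size delta appears in the final matrix as a level difference of delta / 7 in this sketch.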
To test for possible overfitting we simulated a completely random data matrix X 15 times with the
same dimensions as the experimental design dataset, where all factors were non-significant and
no peak was affected by any of the factors. In these simulations we analysed the
complete simulated data matrix and the matrix obtained after Volcano-based filtering using the same set
of thresholds as used during the assessment of ASCA performance with simulation. Figures from these
analyses are available in the file ParameterOptFactdesRandom.pdf submitted as supporting information.
Main steps requiring bioinformatics intervention
Three steps require bioinformatics intervention: 1) planning the experimental design, providing the level
distribution of the different factors, which can be performed with the MODDE software; 2) data pre-processing
of the LC-MS/MS data, resulting in a table that contains quantitative information on compounds for all samples
designed in point 1, which can be performed with any single-stage LC-MS/MS
processing workflow such as TAPP3, OpenMS6, MZmine7 or MaxQuant8; and 3) the ASCA analysis
(MATLAB script provided at https://github.com/vikrammitra/ASCA).
Supplementary figures for chapter 5
Figure S1. Volcano plots of the matched peak matrices using the low and high levels of the seven factors,
leading to 7 × 2,559 dots. a) simulated dataset, b) experimental design dataset. In the simulated dataset,
peaks sampled with different means between the high and low level of factors 1, 3 and 5 are shown with +,
while all other peaks are represented by dots. The factors are represented by differently colored symbols.
Peaks selected for ASCA analysis using thresholds of 2 and 0.05 for fold ratio change and t-test
significance, respectively, are encircled.
[Figure S1 panels: a) simulated dataset (legend: Factor 1 to Factor 7); b) experimental design dataset
(legend: blood collection tube, haemolysis, clotting time, freeze-thaw cycle, trypsin digestion, stopping
trypsin, sample stability). x axis: log2(fold change); y axis: -log10(p-value). Panel title: Peak selection
with -log10(p-value): 1.301 and log2(fold ratio): -1 and 1.]
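The Volcano-based peak selection used for Figure S1 can be sketched as a simple threshold test. A fold ratio threshold of 2 corresponds to |log2(fold ratio)| ≥ 1, and a p-value threshold of 0.05 corresponds to -log10(p-value) ≥ 1.301, matching the panel titles. The function name `volcano_select`, the precomputed inputs and the inclusive comparisons are assumptions made for illustration.

```python
import math


def volcano_select(fold_ratios, p_values, min_fold=2.0, alpha=0.05):
    """Return indices of peaks passing both Volcano plot thresholds.

    fold_ratios: high/low intensity ratio per peak (must be positive).
    p_values: t-test p-value per peak, computed elsewhere.
    """
    log2_cut = math.log2(min_fold)  # 1.0 for a fold ratio of 2
    selected = []
    for index, (ratio, p) in enumerate(zip(fold_ratios, p_values)):
        if ratio <= 0:
            continue  # log2 is undefined; skip such entries
        # p <= alpha is equivalent to -log10(p) >= -log10(alpha).
        if abs(math.log2(ratio)) >= log2_cut and p <= alpha:
            selected.append(index)
    return selected
```

Both up-regulated (ratio ≥ 2) and down-regulated (ratio ≤ 0.5) peaks pass the fold criterion, which is why the plots show two selection regions, left and right of zero.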
Figure S2. Surface plots showing the value of SSQ (z axis) as a function of log2(fold change) and -log10(p-value) t-test
significance thresholds for the main effects in the simulated data set. Factors 1, 3 and 5 contained peaks affected by
the factors, while the other factors have no effect on any of the peaks.
[Figure S2: panels a) to g).]
Figure S3. Surface plots showing the significance of the ASCA variance (SSQ) of each factor (z axis) as a
function of the log2(fold change) and -log10(p-value) t-test significance thresholds for the main effects in
the simulated data set. Factors 1, 3 and 5 contained peaks affected by the factors, while the other factors
have no effect on any of the peaks. SSQ significance values were set to 0.001 as the lowest value that
occurred in the permutation test.
[Figure S3: panels a) to g).]
Figure S4. Surface plots showing the recall (a), precision (b), g-score (c), f-score (d) and log10 number of
selected variables (e) as a function of the log2(fold change) and -log10(p-value) t-test significance
thresholds for a simulated data matrix.
[Figure S4: panels a) to e).]
Figure S5. Optimisation of the smoothing parameter for quantitative pre-processing using the TAPP pipeline.
The panels show the effect of the parameters σrt and σm/z (half the peak width at the inflection point of the
Gaussian distribution) of the 2-dimensional Gaussian smoothing procedure in the Grid module of the TAPP
pipeline on the noise content of single-stage MS data. The value of σ in the retention time dimension (σrt)
was 1 minute and the value of σ in the m/z dimension (σm/z) was varied over 0.1, 0.25, 0.3 and 0.5 Da. A σm/z
of 0.3 Da provided the optimal setting for peak detection, without missing peaks (too much smoothing) or peak
splitting (too much noise). These settings smoothed out the isotopic cluster of one peptide with one charge
state, resulting in one Gaussian peak in the retention time, m/z and ion count space.
[Figure S5 panels: 0.1 Da m/z, 1 min rt; 0.25 Da m/z, 1 min rt; 0.3 Da m/z, 1 min rt; 0.5 Da m/z, 1 min rt.]
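The effect of the two smoothing parameters can be explored with a small sketch of normalized two-dimensional Gaussian smoothing on a retention time by m/z grid. This is not the Grid module of TAPP; the function names `gaussian_kernel` and `smooth`, the grid step sizes and the truncation of the kernel at three standard deviations are assumptions for illustration.

```python
import math


def gaussian_kernel(sigma_rt, sigma_mz, step_rt, step_mz, n_sigma=3):
    """Normalized 2D Gaussian kernel sampled on the grid (rows: rt, columns: m/z)."""
    half_rt = max(1, math.ceil(n_sigma * sigma_rt / step_rt))
    half_mz = max(1, math.ceil(n_sigma * sigma_mz / step_mz))
    kernel = [
        [
            math.exp(-0.5 * ((i * step_rt / sigma_rt) ** 2
                             + (j * step_mz / sigma_mz) ** 2))
            for j in range(-half_mz, half_mz + 1)
        ]
        for i in range(-half_rt, half_rt + 1)
    ]
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]  # weights sum to 1


def smooth(grid, kernel):
    """Spread every grid cell over its neighbours with the kernel weights."""
    n_rt, n_mz = len(grid), len(grid[0])
    half_rt, half_mz = len(kernel) // 2, len(kernel[0]) // 2
    out = [[0.0] * n_mz for _ in range(n_rt)]
    for r in range(n_rt):
        for c in range(n_mz):
            value = grid[r][c]
            if value == 0:
                continue
            for i, kernel_row in enumerate(kernel):
                rr = r + i - half_rt
                if not 0 <= rr < n_rt:
                    continue  # contributions beyond the grid edge are dropped
                for j, weight in enumerate(kernel_row):
                    cc = c + j - half_mz
                    if 0 <= cc < n_mz:
                        out[rr][cc] += value * weight
    return out
```

A larger σm/z merges the isotopologue peaks of one charge state into a single smooth peak, at the cost of merging neighbouring features when chosen too large, which is the trade-off shown in the four panels.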
Figure S6. Scatter plots of identical MS/MS identifications in two chromatograms acquired with 1) iontrap
(samples 30A and 30B), 2) iontrap and QTOF (sample 30B) and 3) QTOF (samples 29B and 30B). The black lines
show the values of Dmax (0.50, 1.67 and 2.00, respectively), while the red lines correspond to the main
retention time correspondence trend6.
[Figure S6 panels (both axes: retention time in minutes, 40 to 160): 1) Iontrap sample 30B (x axis) versus
Iontrap sample 30A (y axis); 2) Qtof sample 30B (x axis) versus Iontrap sample 30B (y axis); 3) Qtof sample
29B (x axis) versus Qtof sample 30B (y axis).]
Figure S7. Main steps of the peptide identification transfer to annotate the quantitative peak matrix obtained
from the 16 LC-MS iontrap datasets acquired to study the effect of pre-analytical factors on the depleted human
serum peptide profile in the experimental design study. 2 iontrap and 40 QTOF LC-MS/MS files were subjected to
peptide-spectrum match identification using the PEAKS database search tool. These datasets were combined using
the retention time correspondence function obtained from the retention times of identical MS/MS spectra in the
two datasets, aligning the identifications of the 40 QTOF files to the aligned retention time domain of the 2
iontrap LC-MS/MS files. The 2 iontrap files were aligned to the best reference chromatogram of the 16 iontrap
LC-MS chromatograms, which allowed the transfer of the combined identifications from the 40 QTOF and 2 iontrap
LC-MS/MS files by finding the highest peaks within a window of 3.5 min retention time and 0.85 m/z.
Truth by Methods      Selected peaks         Not selected peaks
Affected peaks        True Positive (tp)     False Negative (fn)
Not affected peaks    False Positive (fp)    True Negative (tn)
Table S1. Confusion table. The columns correspond to features as predicted by a given method,
while the rows correspond to the actual class of the features. Adapted from Christin et al.9
[Figure S7 flow chart boxes: 40 QTOF LC-MS/MS analyses and 2 iontrap LC-MS/MS analyses (same sample, different
instrument; same instruments, different samples; 1% FDR); combined peptide identification; 16 LC-MS iontrap
datasets; data pre-processing (TAPP); identification transfer; annotated quantitative peak matrix used in
factorial design.]
Measure                                            Equation
Sensitivity = Recall = True Positive Rate (TPR)    $\frac{tp}{tp + fn}$
Precision                                          $\frac{tp}{tp + fp}$
Specificity = True Negative Rate (TNR)             $\frac{tn}{tn + fp}$
Geometric Mean Accuracy (g-score)                  $\sqrt{TPR \times TNR}$
f-score                                            $\frac{(1 + \beta^2) \times precision \times recall}{\beta^2 \times precision + recall}$
Table S2. Definition of the scores that were used to compare the performance of different feature
selection methods. In this manuscript the value of β in f-score calculation is 1. Adapted from
Christin et al.9
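The scores of Table S2 follow directly from the four counts of the confusion table (Table S1). A minimal sketch, with a hypothetical function name `selection_scores` and the convention (an assumption) that a score with a zero denominator is reported as 0:

```python
import math


def selection_scores(tp, fp, fn, tn, beta=1.0):
    """Compute the Table S2 scores from confusion table counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity, TPR
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0   # TNR
    g_score = math.sqrt(recall * specificity)          # geometric mean accuracy
    denominator = beta ** 2 * precision + recall
    f_score = (
        (1 + beta ** 2) * precision * recall / denominator if denominator else 0.0
    )
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "g_score": g_score,
        "f_score": f_score,
    }
```

For example, with tp = 8, fp = 2, fn = 2 and tn = 88, recall and precision are both 0.8, so the f-score with β = 1 is also 0.8, while the specificity is 88/90.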
Factor               Peak rank   Peptide sequence                 Protein name
Haemolysis           1           VADALTNAVAHVDDMPNALSALSDLHAHK    Hemoglobin subunit alpha (P69905)
Haemolysis           2           FFESFGDLSTPDAVMGNPK              Hemoglobin subunit beta (P68871)
Haemolysis           3           VLGAFSDGLAHLDNLK                 Hemoglobin subunit beta (P68871)
Haemolysis           4           VADALTNAVAHVDDMPNALSALSDLHAHK    Hemoglobin subunit alpha (P69905)
Haemolysis           5           VGFYESDVMGR                      Alpha-2-macroglobulin (P01023)
Haemolysis           6           AIGYLNTGYQR                      Alpha-2-macroglobulin (P01023)
Haemolysis           7           HVIILMTDGLHNM(Ox)GGDPITVIDEIR    Complement factor B precursor (P00751)
Haemolysis           9           FVTWIEGVM(Ox)R                   Plasminogen (P00747)
Haemolysis           10          FFESFGDLSTPDAVMGNPK              Hemoglobin subunit beta (P68871)
Trypsin digestion    1           KFPSGTFEQVSQLVK                  Vitamin D-binding protein (P02774)
Trypsin digestion    4           EQLGPVTQEF                       Apolipoprotein A-I (P02647)
Trypsin digestion    6           AEAESLYQSK                       Keratin, type II cytoskeletal 1 (P04264)
Trypsin digestion    7           FVELTMPYSVIR                     Alpha-2-macroglobulin (P01023)
Trypsin digestion    8           PSLVPASAENVNK                    Inter-alpha-trypsin inhibitor heavy chain H4 (Q14624)
Trypsin digestion    10          ILTVPGHLDEM(Ox)QLDIQAR           Complement C4-A and B (P0C0L5)
Stopping trypsin     3           VVNNSPQPQNVVFDVQIPK              Inter-alpha-trypsin inhibitor heavy chain H2 (P19823)
Stopping trypsin     7           YFKPGMPFDLMV                     Complement C3 (P01024)
Stopping trypsin     9           DFVQPPTK                         Kininogen-1 (P01042)
Table S3. Peptide sequences and protein names of the most discriminating, annotated peaks for the 3
factors that affect depleted human serum peptide profiles. The peak rank reflects the discriminating rank
of the peak according to the average absolute ASCA loadings obtained with 100 repetitions of the ASCA
analysis as displayed in the bar diagrams of Figure 3. The protein name reflects the occurrence of the
peptides in SwissProt entries.
References
(1) Holman, J. D.; Tabb, D. L.; Mallick, P. Current Protocols in Bioinformatics 2014, 46, 13.24.1-13.24.9.
(2) Kessner, D.; Chambers, M.; Burke, R.; Agus, D.; Mallick, P. Bioinformatics 2008, 24, 2534-2536.
(3) Suits, F.; Hoekman, B.; Rosenling, T.; Bischoff, R.; Horvatovich, P. Analytical Chemistry 2011, 83, 7786-7794.
(4) Ahmad, I.; Suits, F.; Hoekman, B.; Swertz, M. A.; Byelas, H.; Dijkstra, M.; Hooft, R.; Katsubo, D.; van Breukelen, B.; Bischoff, R.; Horvatovich, P. Bioinformatics 2011, 27, 1176-1178.
(5) Zhang, J.; Xin, L.; Shan, B.; Chen, W.; Xie, M.; Yuen, D.; Zhang, W.; Zhang, Z.; Lajoie, G. A.; Ma, B. Molecular & Cellular Proteomics 2012, 11, M111.010587.
(6) Zwanenburg, G.; Hoefsloot, H. C. J.; Westerhuis, J. A.; Jansen, J. J.; Smilde, A. K. Journal of Chemometrics 2011, 25, 561-567.
(7) Mitra, V.; Smilde, A.; Hoefsloot, H.; Suits, F.; Bischoff, R.; Horvatovich, P. Journal of Chromatography A 2014, 1373, 61-72.
(8) Cox, J.; Mann, M. Nature Biotechnology 2008, 26, 1367-1372.
(9) Christin, C.; Hoefsloot, H. C.; Smilde, A. K.; Hoekman, B.; Suits, F.; Bischoff, R.; Horvatovich, P. Molecular & Cellular Proteomics 2013, 12, 263-276.