UPTEC X 12 021
Examensarbete 30 hp
Oktober 2012

Building a standard operating procedure for the analysis of mass spectrometry data

Niklas Malmqvist



Bioinformatics Engineering Program

Uppsala University School of Engineering

UPTEC X 12 021

Date of issue 2012-09

Author

Niklas Malmqvist

Title (English)

Building a standard operating procedure for the analysis of mass spectrometry data

Title (Swedish)

Abstract

Mass spectrometry (MS) is used in peptidomics to find novel endogenous peptides that may lead to the discovery of new biomarkers. Identifying endogenous peptides from MS data is a time-consuming and challenging task; storing identified peptides in a database and comparing them against unknown peptides from other MS runs avoids re-doing identification. MS produces large amounts of data, making interpretation difficult. A platform for helping the identification of endogenous peptides was developed in this project, including a library application for storing peptide data. Machine learning methods were also used to try to find patterns in peptide abundance that could be correlated to a specific sample or treatment type, which can help focus the identification work on peptides of high interest.

Keywords

Mass spectrometry, database, spectra, peptide annotation, pattern recognition

Supervisors
Claes Andersson, Uppsala University

Scientific reviewer
Mats Gustafsson, Uppsala University

Project name

Sponsors

Language
English

Security

ISSN 1401-2138

Classification

Supplementary bibliographical information

Pages
68

Biology Education Centre, Biomedical Center, Husargatan 3, Uppsala. Box 592, S-751 24 Uppsala. Tel +46 (0)18 471 0000, Fax +46 (0)18 471 4687


Popular science summary

Mass spectrometry (MS) is used to determine the mass of molecules. A common application of MS is the analysis of the body's own (endogenous) peptides. A peptide is a part of a protein and acts as a signaling substance in the body, for example in hormone regulation. Using MS on peptides, one can also determine their composition of smaller building blocks: amino acids. This is called sequencing the peptides and is used to identify them. In this way, the understanding of the function and role of peptides in the body is increased. Such knowledge can be used to understand the course of diseases and to design drugs.

In this project, a method was developed to facilitate the identification of endogenous peptides. A database application was created to store information from MS and then search this information with new experimental data. This makes it possible to see which peptides have occurred in previous experiments and which amino acids they consist of. MS generates large amounts of data, which can make it difficult to get an overview of an experiment. One computer-aided way to analyze large amounts of data is pattern recognition. In this project, the method was used to investigate patterns in peptide occurrence for a certain tissue. With this information, the identification work can then be focused on exactly those peptides that appear to be particularly significant.

Degree project, 30 credits
Master's Programme in Bioinformatics Engineering

Uppsala University, June 2012


Contents

1 Preface
   1.1 Introduction
   1.2 Project goal

2 Background
   2.1 Mass spectrometry
       2.1.1 Ionization source
       2.1.2 Mass analyzer
       2.1.3 Detector
       2.1.4 Peptide identification
       2.1.5 File formats
   2.2 Existing tools and software
   2.3 Spectra processing and scoring
       2.3.1 Peak picking
       2.3.2 Spectral deconvolution
       2.3.3 Similarity score
       2.3.4 Significance measures
       2.3.5 Scoring in SpectraST
   2.4 Pattern recognition
       2.4.1 Classification
       2.4.2 Feature extraction
       2.4.3 Attribute importance
       2.4.4 Cross-validation
       2.4.5 Permutation test

3 Experimental setup and algorithmic solutions
   3.1 Platform layout
   3.2 Data set
   3.3 Peptide library
       3.3.1 Spectral library
       3.3.2 Annotation library
   3.4 Pattern recognition
       3.4.1 Data sub-setting and filtering
       3.4.2 Feature extraction - clustering
       3.4.3 Pattern recognition

4 Results
   4.1 Using the peptide library
       4.1.1 Importing spectral data
       4.1.2 Searching the library
   4.2 Performance
   4.3 Pattern recognition
       4.3.1 Data sub-setting and filtering
       4.3.2 Feature extraction, clustering
       4.3.3 Pattern recognition
       4.3.4 Attribute importance

5 Discussion
   5.1 Peptide library
   5.2 Pattern recognition

6 Future work

7 Acknowledgments

8 Bibliography

Appendix A Software setup
   A.1 Requirements
   A.2 Configure and build SpectraST

Appendix B Source code

Appendix C User manual


Glossary

Endogenous Exists within a living organism (in the present context)

LCMS Liquid Chromatography Mass Spectrometry

MGF Mascot Generic Format

MS Mass spectrometry

PCA Principal Component Analysis

Peptidomics The study of endogenous peptides: their discovery and identification

Proteomics The study of proteins and in particular their structure and function

PTM(s) Post-translational modification(s)

Snap-frozen Rapidly frozen (sample) moments after extraction

Stabilized Rapidly heated (sample) moments after extraction

Taxotere A cancer treatment drug

Trypsin A protease that exists in the digestive systems of many organisms


1 Preface

This is a master's thesis submitted to fulfill the requirements for a Master of Science degree in Bioinformatics Engineering from Uppsala University, Sweden. The project behind the thesis was conducted at Denator AB, Uppsala, Sweden under the supervision of Dr. Karl Sköld. The main supervisor was Dr. Claes Andersson at the Department of Medical Sciences, Uppsala University, and the scientific examiner was Prof. Mats Gustafsson at the same department.

1.1 Introduction

In peptidomics, research is focused on discovering and identifying endogenous peptides and their function. Peptides exist as signal and regulatory substances in living organisms. Peptides can also undergo PTMs, and their change and expression can be interpreted to gain knowledge about biological processes, for example diseases. Investigating and discovering endogenous peptides can therefore lead to new biomarkers for the diagnosis or treatment of disease. Mass spectrometry (MS) is an analysis tool for determining the molecular mass of sample analytes and is used to discover novel endogenous peptides [48]. A biological sample, for example a tissue sample, is analyzed by MS to obtain its peptide composition. It is crucial that the biochemical integrity of the sample is maintained in order to capture the true peptide contents. It has been shown that enzymatic activity in the samples leads to unwanted degradation products within one minute of extraction [43]. Denator provides a technology for stabilizing biological samples, the Stabilizor T1® (ST1) instrument [47]. Samples are stabilized prior to analysis by heat inactivation of all enzymes, effectively stopping the degradation of proteins and peptides. Previous studies have concluded that the ST1 is in fact essential for finding biomarkers related to disease [15]. Denator has recently launched a platform for peptidomics in collaboration with Gothenburg University. Customers can send samples to the facility for stabilization, preparation and liquid chromatography mass spectrometry (LCMS). LCMS can be used to identify peptides that differ in concentration between biological groups, for example patients or treatment groups.

Any peptide found in a sample needs to be identified, or annotated, in order to yield any kind of biological information. This is usually a time-consuming process which also requires experience and previous knowledge. The mass information gained from MS is used to find the amino acid composition of the peptide. The masses from the MS run are compared against the theoretical masses of amino acids; this is done using peptide search engines such as Mascot [24] or X!Tandem [49]. The identification process differs between peptides and proteins. Proteins are commonly identified via information from their cleavage products from enzymatic digestion, e.g. by trypsin in vivo. These products are biologically inactive peptides that have specific terminal amino acids. In contrast, endogenous peptides do not provide any information on terminal amino acids, which makes their identification more challenging [14]. In addition, post-translational modifications (PTMs) add a new dimension of difficulty to the identification of endogenous peptides.

It is always possible that the same peptide is detected in different MS runs; however, this is not obvious without knowing the identity of the peptide. Once a peptide from a specific MS run has


been annotated, it is desirable to be able to find that specific peptide in other MS runs without having to repeat the identification process. There is also no guarantee that an identity can be determined for a certain peptide, but it can still be of interest to observe an unknown peptide occurring in different MS runs, e.g. from different tissues. Finding annotations is time-consuming not only because of the long execution times of the search engines: ambiguous identifications are frequently found, which require detailed inspection of the results in order to confirm that the identity is likely to be correct.

While MS is useful in the identification of new biomarkers, it also generates information on thousands of peptides from a single sample. The large amount of data creates problems in interpreting the results, since it is not always obvious which peptides may be of interest.

In this project, a platform was developed to help the identification process of endogenous peptides and to find patterns in peptide abundance among different sample groups, in order to focus on potentially interesting peptides. The platform has two components: a database for storage of MS data and a machine learning component that investigates possible patterns in the data. Currently available tools for peptide analysis focus on known, identified peptides, while the idea in this project is to give a certain peptide an identity even if its amino acid sequence is unknown. The intention is to be able to discern between new peptides and peptides that have already been seen before, even if their exact identity is unknown. Available tools also use databases that contain annotated data, whether the database is public or local. An important aspect of this project is to be able to create a custom database that uses unannotated data.

1.2 Project goal

The project's goal is to construct a software platform that helps the identification of unknown peptides and is able to find a peptide-level signature, consisting of a certain combination of peptides, in order to discern between sample groups. A peptide-level signature is, for example, differences in a certain peptide's abundance between untreated and treated specimens. The platform has two components: the spectral library application, called Spectra Matching and Annotation Software Helper (SMASH), and a machine learning component that investigates possible patterns in the data.

The platform offers three key services:

1. Library construction: Build a library of peptides from mass spectrometry (MS) runs. These include both peptides annotated and unannotated by a protein search engine, e.g. Mascot. Unannotated peptides use their MS/MS spectra (peptide fragment spectra) as identifiers, rather than an amino acid sequence.

2. Peptide matching: Find peptides that have been present in previous experiments but may not have been identified by a search engine such as Mascot.

3. Pattern recognition: Use machine learning methods to find patterns in peptide levels that are specific for a sample group and in that way find a signature for e.g. a disease treatment effect.


2 Background

This section covers the theoretical background to the various aspects of the project.

2.1 Mass spectrometry

Mass spectrometry is used to analyze particle masses, elemental composition and the chemical structure of a target molecule. There are a number of different technologies used in MS and not all of them will be covered here, since the aim is to give a short introduction to MS. A mass spectrometer has three main components: the ionization source, the mass analyzer and the detector. In short, the technology works by first ionizing the molecules in a sample, separating them according to their mass-to-charge ratio (m/z) and detecting the separated ions. The detected ion signals are stored in a mass spectrum, where their m/z ratios are plotted versus their relative abundance [4].

2.1.1 Ionization source

The samples can be introduced into the mass spectrometer either directly or via some form of chromatography, depending on the type of ionization source and the nature of the sample. There are a number of ionization methods, of which the most common are Electrospray Ionization (ESI) and Matrix-Assisted Laser Desorption Ionization (MALDI) [4]. Liquid Chromatography Mass Spectrometry (LCMS) is most common in peptidomics and is the setup used by Denator. In LCMS, a Liquid Chromatography (LC) system is coupled to the mass spectrometer, thereby combining the separation capabilities of LC, based on physical properties, with the mass spectrometer.

ESI works by dissolving the sample in a polar, volatile solvent and pumping it through a narrow stainless steel capillary, where a high voltage (3-4 kV) is applied over the capillary tip. The sample forms an aerosol of highly charged droplets when passed through the electric field formed by the high voltage. This process is aided by introducing an inert gas, such as nitrogen, that helps direct the flow from the capillary tip into the mass spectrometer. The droplet size is reduced by using volatile solvents since they evaporate easily. Reduction of droplet size is also aided by adding warm gas, called drying gas, that speeds up solvent evaporation. Charged sample ions are separated from the droplets and passed through a sampling cone into an intermediate vacuum region, and subsequently through an opening into the high vacuum environment of the mass analyzer. See figures 1 and 2 for an overview of ESI and an illustration of the reduction in droplet size.

The idea behind MALDI is to apply laser light to a sample in order to ionize it. Sample preparation is done by mixing the sample in a volatile solvent, which is then mixed with a matrix. The compound of choice to be used as the matrix varies depending on the analyte. There are a few main criteria that need to be met: the matrix must be soluble with the analyte, it must not be chemically reactive with the analyte and it must strongly absorb light of the laser's wavelength. The role of the matrix is to transfer energy to the analyte in order to excite and thereby ionize it. In addition, the matrix protects the analyte from exposure to excessive energy from the laser that may otherwise cause sample degradation. The laser exposure takes place inside a vacuum chamber,


Figure 1 – Schematic overview of the Electrospray Ionization (ESI) technique. The sample is dissolved in a volatile solvent and passed through a small capillary with an electric field at the capillary tip. This forms an aerosol of charged sample ions which is passed into the mass spectrometer.

Figure 2 – Droplet evaporation during the Electrospray Ionization (ESI) process. The sample ions are released from the droplet as the solvent evaporates, thereby reducing the droplet size.


where an anode or cathode is present. As the energy transferred from the matrix eventually causes the analytes to evaporate, they can then be introduced into a mass analyzer. The most common mass analyzer used with MALDI is the time-of-flight (TOF) analyzer, a combination referred to as MALDI-TOF. The exact details behind the mechanism of MALDI are not entirely understood and are still subject to ongoing research [4, 27].

2.1.2 Mass analyzer

There exist a number of different mass analyzers, whose main function is to separate the ions formed in the ionization step by their m/z ratios. The most widely used mass analyzers include quadrupole ion traps and time-of-flight analyzers. Mass analyzers can be used in a series of two; the setup is then called tandem mass spectrometry (tandem MS) [4]. In peptidomics, the tandem MS setup is used to first detect the peptides in a sample and then split the peptides into peptide fragments. The peptide fragments are detected in the next MS run and give information on the amino acid sequence, which is needed to identify a peptide. Different peptides can have the same masses, and by splitting them into fragments it becomes possible to discern between them. Peptide fragment ions can exist as different types of ions: a, b, c or x, y, z, where the type is determined by where the peptide bond is cleaved and which fragment retains the charge. The a, b or c ions are formed when the charge is retained by the amino terminal and the x, y or z ions are formed when the charge is retained by the carboxy terminal. [46]

The TOF mass analyzer is based on the principle that the velocity of ionized analytes being accelerated by a homogeneous, constant electric field is directly related to their m/z ratio. It is therefore possible to determine the masses of the analytes based on their arrival time at the detector. The analytes accelerated by the electric field are sent into a vacuum chamber that has no electric field or other ways of applying force on the analytes. This chamber is referred to as the field-free region. The principle behind TOF is described by equation 1.

$$t = \left(\frac{2md}{eE}\right)^{\frac{1}{2}} + L\left(\frac{m}{2eV_0}\right)^{\frac{1}{2}} \qquad (1)$$

where m = mass of the particle, e = electronic charge, E = electric field applied in the source, d = length of the acceleration region, L = length of the field-free region and V0 = accelerating potential [28]. A simplified MALDI-TOF schematic can be seen in figure 3 [4].
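As a numerical sanity check on equation 1, the flight time can be evaluated directly. The following sketch uses made-up, dimensionless parameter values (not taken from the thesis); it only demonstrates that both terms scale with the square root of the mass, so heavier ions arrive at the detector later.

```python
import math

def tof_flight_time(m, e, E, d, L, V0):
    """Total flight time per eq. 1: time spent in the acceleration
    region plus drift time through the field-free region."""
    t_accel = math.sqrt(2.0 * m * d / (e * E))   # (2md/eE)^(1/2)
    t_drift = L * math.sqrt(m / (2.0 * e * V0))  # L(m/2eV0)^(1/2)
    return t_accel + t_drift

# Doubling the mass scales both terms, and hence the total, by sqrt(2).
t1 = tof_flight_time(m=1.0, e=1.0, E=1.0, d=0.01, L=1.0, V0=1.0)
t2 = tof_flight_time(m=2.0, e=1.0, E=1.0, d=0.01, L=1.0, V0=1.0)
print(t2 / t1)  # 1.4142135623730951, i.e. sqrt(2)
```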

The quadrupole mass analyzer consists of four parallel rods, where each opposing pair has a different electrical charge (fig. 4). An alternating voltage is applied between the rod pairs, as well as a direct current voltage. Ions produced by the ionization source are passed into the middle of the four rods, and their motion depends on the oscillating electric field caused by the alternating voltage. Consequently, only analytes of specific m/z ratios will have a stable trajectory and be able to pass through to the detector. The quadrupole technique therefore makes it possible to scan for ions within a specific m/z range. [9]


Figure 3 – Simplified schematic of MALDI-TOF mass spectrometry. The sample ions are released from the matrix and accelerated by an electric field into the field-free region, and subsequently registered by the detector. V0 = accelerating potential, d = length of the acceleration region, L = length of the field-free region.

Figure 4 – A simplified schematic of the quadrupole mass analyzer. The oscillating electric field between the rods causes a stable or an unstable trajectory of the ions depending on their m/z ratio. Only ions with stable trajectories can pass through to the detector.


Figure 5 – An illustrative example of a mass spectrum. The peaks can represent signals from either peptides or peptide fragments if a tandem MS setup is used.

2.1.3 Detector

The detector records the signals from the ion current in the mass analyzer and stores the data in the form of mass spectra. The results from a mass spectrometer can be viewed as a spectrum of peptide signals, where the intensity of each peptide's signal is plotted versus its m/z ratio - called a mass spectrum. The mass spectrum is also referred to as an MS spectrum or an MS/MS spectrum when representing peptides or peptide fragments, respectively. An example of a mass spectrum can be seen in figure 5. Common detectors include the photomultiplier, the electron multiplier and the micro-channel plate.

In peptidomics research, it is of interest to know the amounts of peptides from an MS run. This information is usually key to understanding the posed biological question, for example whether a certain peptide occurs in any meaningful amount after drug treatment. Such information is obtained by quantification of the MS data. Data from LC-MS is quantified by using the retention time, the time it takes for a sample to leave the LC column. Peptides are not released from the column momentarily, but elute over a small period of time, a retention window. The intensity signal for a given peptide over the length of its retention window is used to quantify that peptide. For this project, the quantification was done in the software Progenesis LC-MS [20], as it is the program used by Denator. Since it is a commercial program, details on the quantification process are not revealed. Other quantification software exists, such as the freely available MaxQuant [22].
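Because the commercial quantification procedure is undisclosed, a generic stand-in illustrates the principle described above: abundance is taken as the area under a peptide's intensity trace across its retention window, here approximated with the trapezoidal rule. The function name and the data are hypothetical examples, not output from any of the programs mentioned.

```python
def quantify(times, intensities):
    """Approximate peptide abundance as the area under the intensity
    curve across its retention window (trapezoidal rule)."""
    area = 0.0
    for i in range(1, len(times)):
        area += 0.5 * (intensities[i] + intensities[i - 1]) * (times[i] - times[i - 1])
    return area

# Intensity rises and falls as the peptide elutes over a 4-second window.
print(quantify([0, 1, 2, 3, 4], [0, 50, 100, 50, 0]))  # 200.0
```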


2.1.4 Peptide identification

Peptide fragment spectra, or MS/MS spectra, are normally sent to a proteomics search engine such as Mascot [24] or the open-source X!Tandem [49]. It is often impossible to match all peptides in an MS sample using existing search engines, for several reasons: variable data quality and the number of possible combinations of peptide fragments when it comes to size, charge and terminal amino acids [11]. Post-translational modifications (PTMs) make the matching procedure even more complicated, since they add to the number of possible amino acid variants.

Perhaps the most common type of peptides used in mass spectrometry is tryptic peptides, which originate from proteins that are cleaved using trypsin in vitro. This method yields peptides that in principle always end in specific amino acids: lysine (K) or arginine (R). The available peptide search engines work best when the peptides are more than seven amino acids long and have an ion charge of +2 or +3 [11]. Endogenous peptides do not have any specific terminal amino acids, because they are cleaved inside the body under unknown conditions. This leads to a higher number of possible amino acid combinations and thus makes identification more difficult.
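The tryptic cleavage rule can be sketched in a few lines: trypsin cuts C-terminally of K or R, and the commonly used in-silico rule also skips cleavage when the next residue is proline. The function name and example sequence below are illustrative and not taken from the thesis.

```python
def trypsin_digest(protein):
    """In-silico tryptic digestion: cleave after lysine (K) or arginine (R),
    except when the next residue is proline (P)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR" and (i + 1 == len(protein) or protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):  # C-terminal remainder, if any
        peptides.append(protein[start:])
    return peptides

print(trypsin_digest("MKWVTFISLLLLFSSAYSR"))  # ['MK', 'WVTFISLLLLFSSAYSR']
```

Note that every resulting peptide (except possibly the C-terminal remainder) ends in K or R, which is exactly the terminal-amino-acid information that endogenous peptides lack.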

2.1.5 File formats

There exist several different file formats for storing mass spectra, both open and proprietary. The different formats have been developed for different purposes by different organisations, which unfortunately has led to a lack of standardization.

A common format is mzXML [31], an XML (eXtensible Markup Language) format developed by the Seattle Proteomics Center (SPC) [39]. Markup languages work by enveloping properties in the file in tags, which makes the file human-readable and easy to organize. All the extra markup data in the file increases the storage space required, which is a downside of mzXML. Another XML format, similar to mzXML, is the mzData format, developed by the Human Proteome Organization (HUPO). In an attempt to create a standardized format and replace mzXML and mzData, the mzML format was created by SPC and HUPO. The mzML format was released in 2008, with revisions made in 2009, and has thus existed for a while [32]. Even though efforts have been made to create a standard format, older formats such as mzXML are still used and supported in current software. An example illustrating the mzXML format is depicted in figure 6.
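To make the tag structure concrete, the following sketch builds and parses a minimal mzXML-like document. In real mzXML, the peak list inside the "peaks" tag is base64-encoded binary data (here big-endian 32-bit floats alternating m/z and intensity); the scan attributes and peak values below are made up for illustration.

```python
import base64
import struct
import xml.etree.ElementTree as ET

# Encode two hypothetical (m/z, intensity) pairs the way mzXML stores peaks:
# base64 over big-endian 32-bit floats, alternating m/z and intensity.
peaks = [(445.12, 1200.0), (512.30, 850.0)]
flat = [v for pair in peaks for v in pair]
encoded = base64.b64encode(struct.pack(">%df" % len(flat), *flat)).decode()

doc = ('<msRun><scan num="1" msLevel="2">'
       f'<peaks precision="32">{encoded}</peaks>'
       '</scan></msRun>')

# Parse the document back: locate the scan, decode its peaks element.
scan = ET.fromstring(doc).find("scan")
raw = base64.b64decode(scan.find("peaks").text)
values = struct.unpack(">%df" % (len(raw) // 4), raw)
decoded = list(zip(values[::2], values[1::2]))
print(decoded)  # (m/z, intensity) pairs close to the originals
```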

An example of a more compact open file format is the Mascot Generic Format (MGF) [25], developed by Matrix Science to be used with the Mascot search engine. This format is not an XML format and therefore takes up less storage space. An example of the MGF format can be seen in figure 7.
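Because MGF is line-oriented plain text, it is straightforward to parse: each spectrum is delimited by BEGIN IONS and END IONS, headers such as TITLE and PEPMASS are KEY=VALUE lines, and the remaining lines are m/z-intensity pairs. A minimal reader, with hypothetical example data:

```python
def parse_mgf(text):
    """Parse spectra from Mascot Generic Format text into a list of
    dicts with 'params' (headers) and 'peaks' ((m/z, intensity) pairs)."""
    spectra, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            spectra.append(current)
            current = None
        elif current is not None and line:
            if "=" in line:  # header line, e.g. PEPMASS=896.4
                key, _, value = line.partition("=")
                current["params"][key] = value
            else:            # peak line: m/z and intensity
                mz, intensity = line.split()[:2]
                current["peaks"].append((float(mz), float(intensity)))
    return spectra

example = """BEGIN IONS
TITLE=example spectrum
PEPMASS=896.4
445.12 1200
512.30 850
END IONS"""
s = parse_mgf(example)[0]
print(s["params"]["PEPMASS"], len(s["peaks"]))  # 896.4 2
```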

2.2 Existing tools and software

There are currently a number of tools available for proteomics and peptidomics, many of which are mentioned in recent literature reviews [30]. The following is a selection of the software


Figure 6 – An illustrative example of the mzXML format. The data for each spectrum is encapsulated in the "scan" tag, which is an instance of the detector scanning for peptides during the MS run. The peak m/z ratios and corresponding intensities are encoded in the "peaks" tag under "contentType".

Figure 7 – An illustrative example of the Mascot Generic Format (MGF). The data related to the spectrum is encapsulated in the BEGIN IONS and END IONS statements. The TITLE and PEPMASS describe the title of the spectrum and the peptide mass in daltons. The following pairwise numbers represent the m/z ratios of the peaks and the corresponding intensities.


considered to be of most relevance for the background of this project.

Mascot [24], X!Tandem [49] and Sequest [7] are examples of search engines that identify peptides from peptide ion spectra - MS/MS spectra. These search engines use databases containing theoretical amino acid masses to match against the peak masses of the observed (query) MS/MS spectra. The database can be species specific, and X!Tandem also allows for user-customized databases. Peptides are cleaved artificially, in silico, which results in theoretical, calculated spectra that are matched against the query spectra. The quality of the match is determined by a significance measure, such as an expectation score; see section 2.3.4. The theoretical masses also include possible PTMs that may or may not be present on the amino acids. In the end this often results in a large number of possible combinations that need to be explored. The search parameters are configured by the user prior to initiating the search. These parameters include the possible PTMs to explore, the compound used to cleave the protein, the mass tolerance and whether peptides with certain charges should be ignored. By adjusting these settings, it is possible to narrow or widen the search space. The available search parameters vary depending on the search engine used.

Examples of tools that identify peptides by matching MS/MS spectra against a database of other MS/MS spectra are X!Hunter [6] and SpectraST [17]. The idea here is to create a local library of annotated peptide spectra that come from one of the search engines described above. These annotated spectra are then matched against raw, unannotated peptide spectra provided by the user. This enables an easy way to set up and use a custom reference database for annotations. However, the intention in this project is to store raw peptide spectra that might later become annotated, essentially using the spectrum itself as an identifier. X!Hunter and SpectraST do not provide a way to create a database that is able to hold raw spectra and add annotations over time. Although not the intended use of the software, it is possible to store raw spectra using SpectraST. It is however not possible to add annotations afterwards, but the software is able to match and score spectra using the methods described in sections 2.3.5 and 2.3.4.

2.3 Spectra processing and scoring

The processing of spectral data prior to matching has many aspects. Comparison of two spectra can be done qualitatively by a human observer. This is not possible for computers, as they need a way to quantify results in order to determine if something is similar or dissimilar. Therefore, a system that can process spectra and provide a measurement of similarity is required to do comparisons of large amounts of MS data. There are several aspects of such a system: peak picking, consensus spectrum creation, spectral deconvolution, a way to score spectrum similarity and a way to assess the significance of that score.

2.3.1 Peak picking

Mass spectra often contain more peaks than expected from the ionization process. This unwanted signal data is referred to as noise. The presence of noise can greatly reduce the chances of finding the correct peptide and also increase ambiguity in the search results. For these reasons, it is preferable to select only a number of peaks from a spectrum that give the strongest signals and provide an optimal signal-to-noise ratio. A common method is to pick the X peaks with the strongest intensities, in an effort to discard noise. Another method is to divide the spectrum into Y parts and then pick the X strongest peaks in each part. [34]
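As an illustration, both strategies can be sketched in a few lines of Python. The representation of a spectrum as a list of (m/z, intensity) pairs and the function names are choices made for this sketch, not part of any of the cited tools.

```python
def top_n_peaks(spectrum, n):
    """Keep the n peaks with the strongest intensities (spectrum: list of (m/z, intensity))."""
    return sorted(spectrum, key=lambda peak: peak[1], reverse=True)[:n]

def top_n_per_window(spectrum, n, n_windows):
    """Divide the m/z range into n_windows parts and keep the n strongest peaks in each part."""
    lo = min(mz for mz, _ in spectrum)
    hi = max(mz for mz, _ in spectrum)
    width = (hi - lo) / n_windows or 1.0
    kept = []
    for w in range(n_windows):
        start, end = lo + w * width, lo + (w + 1) * width
        last = w == n_windows - 1
        # the last window is closed on the right so the highest-m/z peak is not lost
        window = [p for p in spectrum if start <= p[0] < end or (last and p[0] == hi)]
        kept.extend(top_n_peaks(window, n))
    return sorted(kept)
```

The windowed variant avoids discarding genuinely informative but locally weak peaks that a global top-X selection would drop.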

2.3.2 Spectral deconvolution

Different isotopes of an element occur naturally in every sample. They differ by only a few daltons (Da) in mass, resulting in a cluster of peaks in the spectrum, where each peak represents a specific isotope and charge state. To deconvolve a spectrum means to group these peaks together into one single peak and determine its effective mass and charge. [19]

2.3.3 Similarity score

Intensity signals in MS/MS spectra from different MS runs can be vastly different in magnitude and need to be normalized in some way in order to make data from separate runs comparable. If a spectrum is represented by a vector whose elements are peak intensities, normalizing the spectrum intensities is often done by transforming the vector into a unit vector. The resulting vector will have a magnitude of one, regardless of the values of the original peak intensities. This is done by dividing every component $u_i$ of a vector $U$ by the vector's magnitude, $\|U\|$ (eq. 2).

$$\|U\| = \sqrt{\sum_{i=0}^{n} u_i^2} \qquad (2)$$

Scoring a comparison between two spectra can be done in different ways. A popular method, used by for example X!Tandem [8], is to calculate the dot product of the normalized intensity vectors (eq. 3). Since the vectors are normalized to unit vectors, the score can range from 0 to 1, where 1 is an identical match. [16]

$$D = \sum_{i=0}^{n} I_{\mathrm{library},i} \, I_{\mathrm{query},i} \qquad (3)$$

where $I_{\mathrm{library},i}$ and $I_{\mathrm{query},i}$ are the normalized intensities of the $i$th bin in the intensity vectors that represent the library spectrum and the query spectrum, respectively.

There are other scoring functions based on empirically observed rules (Spectrum Mill [1]) or statistically derived fragmentation frequencies (PHENYX [10]), but using the dot product as described above has proven to be useful. [30]
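The normalization and scoring of eqs. 2 and 3 can be sketched as follows, assuming the two intensity vectors have already been binned to the same length:

```python
import math

def normalize(intensities):
    """Scale an intensity vector to unit length (eq. 2)."""
    magnitude = math.sqrt(sum(i * i for i in intensities))
    if magnitude == 0:
        return list(intensities)
    return [i / magnitude for i in intensities]

def dot_score(library, query):
    """Dot product of two unit-normalized intensity vectors (eq. 3); 1.0 is an identical match."""
    a, b = normalize(library), normalize(query)
    return sum(x * y for x, y in zip(a, b))
```

Because both vectors are unit length, the score is the cosine of the angle between them, which is what makes runs with very different absolute signal levels comparable.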


2.3.4 Significance measures

A way to measure the significance of a match is to use an expectation value (e-value). The purpose of the e-value is to provide a measure of how likely a certain score is to have arisen by chance. The lower the e-value, the less likely a hit is to have arisen by chance. This kind of measure is used in other types of search engines, such as BLAST. [29]

X!Tandem uses a method based on a hypergeometric distribution of the dot product score to obtain an e-score. A hypergeometric distribution describes the probability of making a number of successful draws from a population of finite size without replacement. The resulting scoring scheme is called Hyperscore and essentially adds two factors to the dot product score:

$$\mathrm{Hyperscore} = \left( \sum_{i=0}^{n} I_i P_i \right) \cdot N_b! \cdot N_y! \qquad (4)$$

where $I_i$ is the intensity of peak $i$, $P_i$ is 0 or 1 depending on whether or not the peak exists in the theoretical spectrum, and $N_b$ and $N_y$ are the number of b and y ions, respectively. A spectrum is queried against all other spectra in the database and a distribution of the resulting Hyperscores is formed. It is assumed that the true match, if present in the database, will receive the highest Hyperscore. Hence, the set of all scores but the highest is a sample from the null distribution of scores between the query and non-matching spectra in the database, i.e. it reflects incorrect matches (fig. 8a). The sample is used to estimate the probability of obtaining a score at least as high as the highest-scoring match and to compute the e-score. The scores on the right side of the distribution are assumed to fall on a straight line when log-transformed, from the argument that incorrect results are random. The e-value is calculated from the intersection of the extrapolated straight line with the maximum Hyperscore, $i$ (fig. 8b). The expectation score is then $e^i$. [38]

The details of the actual implementation of the e-score in X!Tandem are not clearly stated [8], but the main concept of how it works as a significance measure is described above.
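The following sketch illustrates the general idea only, not X!Tandem's actual implementation: all scores below the top hit are histogrammed, the log-transformed right side of the histogram is fitted with a straight line, and the line is extrapolated to the top score.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def expectation_score(hyperscores):
    """Illustrative e-value: extrapolate the log-transformed right tail of the
    null score distribution (all scores but the highest) to the top score."""
    top = max(hyperscores)
    null = sorted(hyperscores)[:-1]                 # drop the (assumed correct) top hit
    counts = {}
    for s in null:                                  # histogram with unit-wide bins
        b = round(s)
        counts[b] = counts.get(b, 0) + 1
    mode = max(counts, key=counts.get)
    xs = sorted(b for b in counts if b >= mode)     # decreasing right side only
    ys = [math.log(counts[b]) for b in xs]
    a, b0 = fit_line(xs, ys)
    return math.exp(a * top + b0)                   # extrapolated (log) count at the top score
```

As expected for a significance measure, a top score further to the right of the null distribution yields a smaller e-value.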

2.3.5 Scoring in SpectraST

The dot score (eq. 3) is also used to score spectra in SpectraST. The peak intensities are placed into bins, 1 m/z unit wide, and a fraction of each intensity is spread out to neighboring bins in order to match equivalent but slightly m/z-shifted peaks. However, SpectraST uses a different approach to assess the significance of a hit. The dot score difference between the two top hits is compared, a measure called $\Delta D$ (eq. 5).

$$\Delta D = \frac{D_1 - D_2}{D_1} \qquad (5)$$

Figure 8 – (a) Score distribution for a given peptide matched against the entire database, where the top score is circled. (b) The expectation score is calculated from the logarithm of the scores: a straight line is extrapolated and its intersection with the top score (circle) is used to calculate the e-score. The figures are adapted from [38].

If $\Delta D$ is large, the top hit clearly stands out from the other hits and is thus more likely to be correct. Another metric used by SpectraST is the dot bias (DB), which indicates to what degree the dot product is dominated by just a few matching peaks (eq. 6). DB takes the value 1 if the dot product is due to a single peak, which is the case when all of the score comes from only one vector element in each vector. Let the only contributing vector elements be $x$ and $y$, respectively. The numerator of eq. 6 then consists of the square root of the squared product of the elements representing that peak in both spectra, $\sqrt{x^2 y^2} = xy$. The denominator is simply the product $xy$, resulting in $xy/xy = 1$. Conversely, DB takes the value $1/\sqrt{b} \approx 0$ if all peaks contribute equally to the dot score, where $b$ is the number of bins: if all contributing vector elements are of size $a$ and there are $b$ bins in the vector, the numerator has the value $\sqrt{b a^4}$ and the denominator $b a^2$, resulting in $DB = \sqrt{b a^4}/(b a^2) = 1/\sqrt{b} \approx 0$ for large values of $b$.

Very large or very small DB values are typical for uncertain matches where the dot score is inflated, either by a few dominating matching peaks or by many small matching peaks that are likely noise. [16]

$$DB = \frac{\sqrt{\sum_i I_{\mathrm{library},i}^2 \, I_{\mathrm{query},i}^2}}{D} \qquad (6)$$

The hits from SpectraST are ranked by using both the dot score and the dot bias to calculate a so-called discriminant function, F (eq. 7), which is the SpectraST equivalent of an e-value.

$$F = 0.6\,D + 0.4\,\Delta D - b \qquad (7)$$

where the penalty term b is determined by the DB (eq. 8) [16].


$$b = \begin{cases} 0.12 & \text{if } DB < 0.1 \\ 0.12 & \text{if } 0.36 < DB \le 0.4 \\ 0.18 & \text{if } 0.4 < DB \le 0.45 \\ 0.24 & \text{if } DB > 0.45 \\ 0 & \text{for all other values of } DB \end{cases} \qquad (8)$$

The authors of SpectraST derived the discriminant function through trial-and-error runs on a chosen test set. The parameters and form of the F function are assumed to be general enough for other data sets, and it is also possible to adjust them for the needs of a specific application. It is however unclear how to decide whether a given value of F is significant; this seems to be left for the user to decide. [16]
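Equations 5, 7 and 8 are simple enough to express directly. The sketch below assumes the dot scores of the two top hits, $D_1$ and $D_2$, and the dot bias have already been computed:

```python
def dot_bias_penalty(db):
    """Penalty term b from eq. 8, keyed on the dot bias DB."""
    if db < 0.1:
        return 0.12
    if 0.36 < db <= 0.4:
        return 0.12
    if 0.4 < db <= 0.45:
        return 0.18
    if db > 0.45:
        return 0.24
    return 0.0

def discriminant(d1, d2, db):
    """SpectraST-style discriminant F = 0.6*D1 + 0.4*deltaD - b (eqs. 5, 7, 8)."""
    delta = (d1 - d2) / d1 if d1 else 0.0
    return 0.6 * d1 + 0.4 * delta - dot_bias_penalty(db)
```

Note how a dot bias in the "safe" middle range incurs no penalty, while suspiciously peak-dominated (high DB) or noise-dominated (low DB) matches are pushed down the ranking.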

2.4 Pattern recognition

Pattern recognition is a machine learning approach used to find distinctive patterns in data in order to create a model that describes the significance of the features, or attributes, in the data. A specific algorithm is used to train the model, and this algorithm can be either supervised or unsupervised. Supervised learning means that the data is labeled, which implies that it is possible to evaluate the model's performance in terms of correctly or incorrectly classified examples. Unsupervised learning means that the data is unlabeled and there are no predetermined classification assumptions. [2]

Both methods are useful in different contexts. This project mainly used supervised learning, or classification, since the data set it was based on was categorized into tissue types or treatment methods.

2.4.1 Classification

Classification is a form of supervised learning where the data have discrete outputs or labels, called class labels. In the context of this project, the labels are discrete since they describe a tissue type or treatment method. A classifier algorithm is trained by providing it with a training data set, from which it forms a set of rules based on the features in that data. An unknown data example can then be given to the algorithm, which tries to determine which class it belongs to based on the rule set from the training. This is exemplified in figure 9.

There exist many algorithms for classification, some of which are: Decision Tree, Random Forest, Artificial Neural Networks (ANN) and Support Vector Machines (SVM). Each method has its strengths and weaknesses. Most tree-based methods produce a model that can be interpreted visually. This is a big advantage when it comes to understanding the biological meaning of the model, as the rules are shown as decision points in a tree structure.

Figure 9 – Schematic view of a classification procedure. A classifier C is trained by providing it with a set of training data, on which it forms rules that are used to predict the class Y of an unknown example X.

Although tree models have a visual advantage, they may not be suitable for data sets that have a very large number of features, or attributes. Tree models easily become overly adapted, or overfitted, to the data if every attribute is part of the decision algorithm. The resulting model becomes too complex and small changes in the data become amplified, leading to poor predictive performance. This project deals with data consisting of around 30 mass spectrometry runs in which about 43 000 peptides are detected; the resulting data set thus has 30 examples and 43 000 attributes. Even though tree methods may not be suitable for this project, they are useful for illustrating how a classifier works (fig. 10).

Random Forest was used as the classifier algorithm in this project. Although based on decision trees, it is an ensemble method: it grows many tree models instead of just one, which compensates for the downsides of decision trees mentioned above. Each tree is constructed using a subset of the attributes in the data set to create each split, i.e. a node in the decision tree (fig. 10), creating a forest of decision trees. A new example is classified by sending it to each tree model in the forest. Every model gives a classification, which counts as a vote for that class, and the final classification becomes the one that got the most votes from the forest. Random Forests are also fast to train on large data sets and can handle many attributes, making them a suitable choice for this project [18].
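The voting scheme can be illustrated with a deliberately miniature Random Forest built from decision stumps (one-split trees). A real implementation grows full trees, but the bootstrap sampling, random feature subsets and majority vote below are the same ideas:

```python
import random
from collections import Counter

def best_stump(X, y, feat_idx):
    """Pick the (feature, threshold, left label, right label) split with the
    fewest misclassifications, considering only the features in feat_idx."""
    best = None
    for f in feat_idx:
        for t in sorted(set(row[f] for row in X)):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:                         # degenerate bootstrap sample: vote the majority class
        maj = Counter(y).most_common(1)[0][0]
        return (feat_idx[0], float("inf"), maj, maj)
    return best[1:]

def train_forest(X, y, n_trees=25, rng=None):
    """Grow one stump per bootstrap sample, each on a random feature subset."""
    rng = rng or random.Random(0)
    n, m = len(X), len(X[0])
    k = max(1, int(m ** 0.5))                # common heuristic: sqrt of the attribute count
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample (with replacement)
        feats = rng.sample(range(m), k)              # random attribute subset
        forest.append(best_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Each stump votes; the class with the most votes wins."""
    votes = Counter(l if row[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]
```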

2.4.2 Feature extraction

Figure 10 – Simplified example of a tree classifier model where peptide abundances are used to determine which tissue type the peptides belong to. The green nodes are decision points that represent attributes in the data set, in this case peptides. The numbers on the arrows equal the mass value of the peptide in the adjacent node, which leads to the next decision point or to the classification into a tissue type.

A small number of examples compared to the number of attributes can be problematic; it means that there are few cases available to train the classifier on, and a large number of possible combinations of the features in the data. This leads to challenges in producing a robust model. Unfortunately, mass spectrometry analysis is both expensive and time-consuming, which means that the data sets will often have few examples compared to the number of attributes. Feature extraction is a way to reduce the number of attributes. Clustering is one example, in which similar attributes are grouped and together work as one single attribute rather than several individual attributes. Clustering can also be used on the examples in the data set in order to reveal groupings in the data, but is in this context used to find similar features instead. There are two main types of clustering: partitional and hierarchical. The principle of partitional clustering is to divide the objects in the data set into non-overlapping subsets, where each object exists in exactly one subset (fig. 11). Hierarchical clustering works by nesting several clusters into a hierarchy and organizing them in a dendrogram. More similar data points form their own small clusters, while less similar data points exist together in a larger cluster (fig. 12). [51]

K-means clustering is a clustering method that aims to place $n$ observations into $k$ clusters. A given set of observations $X = \{x_1, x_2, \ldots, x_n\}$ is divided into $k$ sets, $S = \{S_1, S_2, \ldots, S_k\}$, where $k \le n$, so that the within-cluster sum of squares is minimized:

$$\min_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \|x_j - \mu_i\|^2 \qquad (9)$$

where $\mu_i$ is the mean of the elements in $S_i$, also called the centroid. Each data point is assigned to the cluster with the closest centroid, the center point of the cluster. The centroids for each cluster are then re-computed and the process is repeated until the centroids do not change. [50, 51]
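A minimal sketch of this iterative procedure (Lloyd's algorithm) for points represented as tuples:

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(cluster):
    """Centroid of a non-empty cluster."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def kmeans(points, k, rng=None, max_iter=100):
    """Assign each point to the nearest centroid, recompute the centroids,
    and repeat until they stop moving (eq. 9)."""
    rng = rng or random.Random(0)
    centroids = rng.sample(points, k)                    # naive initialization
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[j].append(p)
        new = [mean_point(cl) if cl else centroids[j] for j, cl in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids, clusters
```

Note that the result depends on the random initialization; production implementations restart from several seeds and keep the solution with the lowest within-cluster sum of squares.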

Larger values of k mean that the data set becomes less compressed: the number of attributes gets larger, and the objective of feature extraction is to reduce that number. On the other hand, small values of k can result in large numbers of observations in the same clusters, observations that should be separated. In essence, this is a matter of choosing compactness versus retaining information in the data. An optimal value for k can be determined by using the so-called Elbow method [26].

Figure 11 – Example of partitional clustering: the data points are divided into non-overlapping subsets, where each data point exists in only one subset.

Figure 12 – Example of hierarchical clustering: the data points (D1-D4) exist in nested clusters that are represented in a dendrogram. With no criteria for similarity, all data points exist in the same cluster. As more stringent criteria for likeness are introduced, the clusters are split into smaller clusters containing only the data points that fulfill the criteria.

In short, the Elbow method uses the ratio of the sum of squared distances (SSD) between the k-means clusters and the total SSD in order to find a suitable value for k. This ratio is equivalent to the fraction of total variance explained for a given value of k. The goal is to retain as much information in the data as possible, thus explaining as much of the variance as possible, while reducing the number of attributes as much as possible. This means a high percentage of total variance explained and a low value for k. The total sum of squared distances, $T$, is described by:

$$T = \sum_i \|x_i - \mu_{c(i)}\|^2 + \sum_i \|\mu_{c(i)} - \mu\|^2 \qquad (10)$$

where the first term is the SSD within the clusters and the second term the SSD between the clusters. $\mu$ is the overall centroid, $c(i)$ refers to the cluster of example $i$, and $\mu_c = \frac{1}{N_c} \sum_{i \in I(c)} x_i$, where $N_c$ is the size of cluster $c$ and $I(c)$ indexes the examples in cluster $c$. The sum of variances, $V_{tot}$, can be expressed as:

$$V_{tot} = \frac{1}{N} T \qquad (11)$$

where $N$ is the total number of data points. Furthermore, the variance between the clusters, $V_{between}$, is described as:

$$V_{between} = \frac{1}{N} \sum_i \|\mu_{c(i)} - \mu\|^2 \qquad (12)$$


Figure 13 – An example illustrating the Elbow method, where the total variance explained is plotted versus the number of k-means clusters used. The blue circle marks the optimal choice for the value of k, since it gives the best trade-off between explaining as much of the variance as possible and keeping the number of clusters as low as possible.

Thus, the ratio of $V_{between}$ and $V_{tot}$ is

$$\frac{V_{between}}{V_{tot}} = \frac{\frac{1}{N} \sum_i \|\mu_{c(i)} - \mu\|^2}{\frac{1}{N} T}$$

which is equal to the ratio of the SSD between clusters and the total SSD. The Elbow method is further illustrated in figure 13, where the blue circle marks the "elbow" and thus the optimal value for k.
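The ratio used by the Elbow method can be computed directly from a clustering result, per eqs. 10-12; plotting this value for increasing k and picking the bend gives the "elbow". The helper names below are illustrative:

```python
def centroid(points):
    """Mean point of a non-empty list of tuples."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def variance_explained(clusters):
    """V_between / V_tot (eqs. 10-12): the share of the total sum of squared
    distances that lies between the cluster centroids rather than within them."""
    points = [p for cl in clusters for p in cl]
    mu = centroid(points)
    within = sum(sq_dist(p, centroid(cl)) for cl in clusters for p in cl)
    between = sum(sq_dist(centroid(cl), mu) for cl in clusters for p in cl)
    return between / (within + between)
```

Two tight, well-separated clusters give a ratio close to 1, while a single cluster explains none of the variance.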

Principal component analysis (PCA) is another method for reducing the number of attributes. The data set is transformed to reduce the number of dimensions, and the method can also be used as an unsupervised, exploratory way to find patterns in the data. The transformed data is described by the principal components, which form a set of linearly uncorrelated vectors. The goal of PCA is to find a basis that re-expresses the data in the best way. This can be described by the following equation:

$$PX = Y \qquad (13)$$

where $X$ is an $m \times n$ matrix representing the original data set and $Y$ is a representation of $X$ after a linear transformation $P$. The rows of $Y$ are called the principal components, and the rows of $P$ are a set of new basis vectors for representing the columns of $X$:


$$PX = \begin{bmatrix} p_1 \\ \vdots \\ p_m \end{bmatrix} \begin{bmatrix} x_1 & \cdots & x_n \end{bmatrix} = \begin{bmatrix} p_1 \cdot x_1 & \cdots & p_1 \cdot x_n \\ \vdots & \ddots & \vdots \\ p_m \cdot x_1 & \cdots & p_m \cdot x_n \end{bmatrix} = Y$$

The elements $p_i \cdot x_j$ represent the transformed values of a particular data point and are also called scores.

A covariance matrix expresses how much the dimensions in the data set vary with respect to each other, whereas the variance measure expresses the variation in one dimension independently. For a given data set, the aim of PCA includes minimizing the redundancy, given by the covariance, and maximizing the signal, given by the variance. A covariance matrix $C_x$ for the $m \times n$ matrix $X$ is computed as described in the following equation:

$$C_x = \frac{1}{n-1} X X^T \qquad (14)$$

where the values in $X$ are in mean deviation form, meaning that the mean value $\bar{x}$ of all data points in $X$ has been subtracted from each element $x_i$, or is zero. The diagonal elements of $C_x$ describe the variance in the data set and the off-diagonal elements describe the covariance. $C_x$ is diagonalized in order to minimize the covariance, forming the manipulated covariance matrix $C_y$. This means that the off-diagonal elements are zero and thus the redundancy in the data is minimized. PCA assumes that the directions in the $m$-dimensional data set $X$ with the largest variance are the most important, and that the basis vectors in $P$ are orthonormal. The direction giving the largest variance in $X$ is saved as the vector $p_1$, the first principal direction. Subsequently, the direction giving the next largest variance is saved in $p_2$, and so on. The search for each new principal component is restricted to directions that are perpendicular to all previously selected principal components, because of the assumed orthonormality constraint. The rows of $P$, $\{p_1, \ldots, p_m\}$, are in fact the eigenvectors of the matrix $C_x$. They are also called loadings and represent weight factors that are multiplied with the original data to calculate the principal component scores $p_i \cdot x_j$.

In short, the aim of PCA can be summarized as follows: find an orthonormal matrix $P$ where $Y = PX$, such that the matrix $C_y \equiv \frac{1}{n-1} Y Y^T$ is diagonal [42, 12, 44].

PCA provides a way of visualizing high-dimensional data. Since the peptide data sets in this project contain several thousand peptides, this is useful for gaining an overview of the data and hopefully observing any clear patterns or trends. This is shown in the results section 4.3.3, figures 27, 28 and 29.
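For illustration, the procedure above can be carried out by hand for two-dimensional data, where the eigendecomposition of the 2x2 covariance matrix has a closed form; real data sets would use a linear algebra library instead. The function name is invented for this sketch:

```python
import math

def pca_2d(data):
    """PCA for 2-D data: mean-center, form the covariance matrix (eq. 14),
    take its leading eigenvector as the first loading vector p1, and project
    the centered data onto it to obtain the scores."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    xs = [x - mx for x, _ in data]                 # mean deviation form
    ys = [y - my for _, y in data]
    cxx = sum(v * v for v in xs) / (n - 1)
    cyy = sum(v * v for v in ys) / (n - 1)
    cxy = sum(a * b for a, b in zip(xs, ys)) / (n - 1)
    # eigenvalues of [[cxx, cxy], [cxy, cyy]] via trace and determinant
    tr, det = cxx + cyy, cxx * cyy - cxy * cxy
    root = math.sqrt(max(tr * tr / 4 - det, 0.0))
    l1, l2 = tr / 2 + root, tr / 2 - root
    # eigenvector for the largest eigenvalue: (l1 - cyy, cxy) when cxy != 0
    if cxy != 0:
        v = (l1 - cyy, cxy)
    else:
        v = (1.0, 0.0) if cxx >= cyy else (0.0, 1.0)
    norm = math.hypot(*v)
    p1 = (v[0] / norm, v[1] / norm)                # first loading vector (unit length)
    scores = [xc * p1[0] + yc * p1[1] for xc, yc in zip(xs, ys)]
    return p1, (l1, l2), scores
```

For points lying exactly on a line, the second eigenvalue is zero: one principal component explains all of the variance.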

2.4.3 Attribute importance

Figure 14 – An example of 5-fold cross-validation showing the partitioning into training set and test set. The gray boxes represent elements in the training set and the white boxes represent elements in the test set.

Determining the importance of the attributes in the data set is key to understanding the most prominent features in the data. In the context of this project, this means finding the peptides that are typical for a certain tissue type or treatment effect. Attribute importance can be measured in different ways; a straightforward way is to measure the loss in classifier accuracy when a certain attribute is removed. For example: a data set contains three attributes, a, b and c. The classifier's accuracy is 90% when using all three attributes, i.e. it can classify a given example correctly 9 out of 10 times. If attribute c is removed, the classifier's performance drops to 80%, meaning that attribute c contributes 10% to the overall accuracy of the classifier. The attribute that causes the largest drop in accuracy when removed is considered the most important one. Attribute importance can also be measured by looking at the decrease in Gini Index. The Gini Index describes inequality in the distribution of the data. It can be thought of as a measure of impurity in the data with respect to the classes, so the Gini Index should be as low as possible [3]. A large decrease in Gini Index when an attribute is used is therefore desirable [5]. Another method for measuring the importance of attributes is to randomly permute an attribute's values and evaluate the difference in the classifier's accuracy.
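The accuracy-drop idea can be sketched as below, where `accuracy_fn` is a hypothetical callback that trains and evaluates a classifier on a given attribute subset (it is not part of any library named in this section):

```python
def drop_importance(accuracy_fn, attributes):
    """Importance of each attribute = accuracy with all attributes
    minus accuracy with that single attribute removed."""
    full = accuracy_fn(attributes)
    return {a: full - accuracy_fn([b for b in attributes if b != a])
            for a in attributes}
```

With the worked example from the text (90% with all attributes, 80% without c), `drop_importance` would assign c an importance of 0.10 and rank it accordingly.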

2.4.4 Cross-validation

The model produced by a classifier needs to be tested in some fashion to ensure its accuracy and robustness. A useful and common way to do this is cross-validation (CV). In k-fold CV, the data is split into k equally sized parts. All parts except one are used to train the model, and the remaining part is used to test the model: how well it can classify a given example based on the rules set up during the training. The training and testing procedure is repeated using different parts as training and test data sets until all k parts have been used [36]. An example of 5-fold CV is illustrated in figure 14.

A special case of CV is leave-one-out CV (LOOCV), in which the data set is partitioned into a number of parts equal to the number of examples in the data set. Consequently, one example is used to test the model each time, which is suitable for data sets where few examples are available. The motivation is that the model can be trained using as many examples as possible, while it can still be tested on "new" test data.
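The partitioning can be sketched as an index generator; with k equal to the number of examples it degenerates to LOOCV:

```python
def k_fold_splits(n_examples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.
    Every example appears in exactly one test fold."""
    indices = list(range(n_examples))
    fold_size, remainder = divmod(n_examples, k)
    start = 0
    for fold in range(k):
        size = fold_size + (1 if fold < remainder else 0)  # spread the remainder
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size
```

In practice the examples would be shuffled (or stratified by class) before splitting; the sketch keeps them in order for clarity.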


2.4.5 Permutation test

In order to draw any meaningful conclusions from the performance of a classifier, the statistical significance of that performance must be established. One method of doing this is a permutation test, where the idea is to establish whether the performance is reliable or due to chance. In the context of classification, this can be done by randomly shuffling the class labels in the data set, effectively destroying the relation between an example and the class it belongs to. The classifier is then trained on the data with scrambled class labels and its performance is measured. This is repeated many times to create a distribution of resulting performances. The null hypothesis states that there is no relation between an example and its class label. A p-value is used to assess the significance; it describes the probability, under the null hypothesis, of obtaining a performance at least as high as the one observed. The p-value is estimated as the fraction of permutations in which the performance with scrambled class labels is at least as high as the performance obtained with the original class labels. Hence, a low p-value signifies a reliable performance from the classifier. [21]
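A sketch of the procedure, where `accuracy_fn` is a hypothetical callback that trains and evaluates the classifier for a given label vector:

```python
import random

def permutation_p_value(accuracy_fn, labels, observed, n_permutations=1000, rng=None):
    """Shuffle the class labels repeatedly and count how often the accuracy on
    scrambled labels is at least as high as the observed accuracy."""
    rng = rng or random.Random(0)
    hits = 0
    for _ in range(n_permutations):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if accuracy_fn(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)   # add-one correction so p is never exactly 0
```

The add-one correction reflects that the estimated p-value from a finite number of permutations should never be reported as exactly zero.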

3 Experimental setup and algorithmic solutions

This section explains the software, methods and solutions used in the project to construct the plat-form. It also covers the description of the data set used in the pattern recognition and the overalllayout of the platform.

3.1 Platform layout

An overview of the platform’s layout is shown in figure 15.

Several pieces of software and scripts together make up the platform. The spectral library construction and spectral matching functions were provided by a modified version of SpectraST [40]. Annotation data are stored in an SQLite [45] database. File conversions are partly done by ProteoWizard's MsConvert [13] and quantification is done with Progenesis LC-MS [20]. SpectraST is part of the software suite Trans-Proteomic Pipeline (TPP) but has been used as a standalone version in this project. All except Progenesis LC-MS are open-source software.

The scripting language Perl was used to integrate the different pieces of software in the peptide library. Most of the integration has to do with file format conversions to enable communication between the programs. Perl is both suitable for this task and familiar to the author, and was therefore chosen as the scripting language.

SpectraST was originally developed to enable the construction of a local spectral library in which annotated MS/MS spectra are stored. The intention was to reduce search times and avoid rediscovering the same peptides. The software stores MS/MS spectra in a database and compares them directly against query spectra. This approach reduces search times since there is no in silico peptide cleavage, in contrast to e.g. X!Tandem. However, SpectraST does not support continuous addition of annotations to the stored MS/MS spectra. For this reason, the annotations are stored separately in an SQLite database, as mentioned above. SpectraST was nevertheless chosen as the method for creating the spectral library and performing the spectral matching because of its speed and the possibility to make customizations.

Some modifications and fixes were made to SpectraST to make it more suitable to the project's needs. In its original state, there seemed to be a bug in the SpectraST software that made it import one spectrum fewer than the number of spectra present in a file. This meant that zero spectra were imported when the file contained only one spectrum. The output format of a search result was slightly modified to also include the information under "Remarks" for a search entry. Tissue type or other categorical data is stored in the comments section of the database entry and is key to identifying the sample downstream; therefore it needed to be included in the search output. A full list of changes made in order to build SpectraST is available in appendix A.

SpectraST supports the import of spectra in several different formats. The mzXML format was chosen because of technical limitations in SpectraST, which resulted in import errors when trying to import raw spectra in any format other than mzXML. However, the spectra imported into the database are meant to be exported from Progenesis LC-MS, which does not support mzXML. The spectra are therefore exported from Progenesis LC-MS in MGF format and then converted to mzXML using ProteoWizard MsConvert.

3.2 Data set

The data set used for the pattern recognition in this project consisted of quantified mass spectrometry data from endogenous peptides. The data set contains the amount of a certain peptide, the peptide abundance, for each sample in the MS run. The quantification was done in the software Progenesis LC-MS.

This data set will be referred to as the "taxotere" data set and contains MS data from mouse brain tissue from 13 different mice. A sample from the left and right striatum was taken from each mouse. The samples in the data set were stabilized either by snap-freezing [35] or by heat inactivation using Denator's ST1 instrument. Furthermore, the samples were either treated or untreated with Taxotere. The resulting data set thus has four classes describing the sample's inactivation technique and whether or not it was treated with Taxotere:

• “SnapFroz Yes” - the sample was inactivated by snap freezing and treated with Taxotere.

• “SnapFroz No” - the sample was inactivated by snap freezing and not treated with Taxotere.

• “Stab Yes” - the sample was inactivated by heat stabilization and treated with Taxotere.

• “Stab No” - the sample was inactivated by heat stabilization and not treated with Taxotere.

The entire data set contains 26 examples (MS runs) and 43 450 attributes (peptide signals). All classes contained six examples, except "SnapFroz Yes", which contained eight examples. A short excerpt of the data set is shown in figure 16.


Figure 15 – The platform's layout can be thought of as two branches. One has to do with the peptide library, "searches and annotations". The other branch includes the machine learning part, "find interesting peptides", where the objective is to find which peptides are specific for a tissue or a treatment effect. The results from the machine learning branch can then be used to query the peptide library, as shown in the figure by an asterisk.

Figure 16 – An excerpt of the data set used for the pattern recognition. Each attribute represents a peptide's m/z ratio, retention time, mass and charge. For example, 709.65_27.16_4960.50_7 means an m/z ratio of 709.65, a retention time of 27.16 minutes, a mass of 4960.50 Da and a charge of +7. The class labels representing the different sample preparations and treatments are shown in the right-most column.


3.3 Peptide library

The peptide library constructed in this project comprises two databases. One database holds the raw spectral data, and one holds the annotation information: in which MS runs a certain peptide has occurred, and whether or not two spectra are similar enough to be assumed to represent the same peptide. SpectraST is designed to be used with annotated peptide data that is directly imported from e.g. Mascot or Sequest. In this project, the idea is to first store the spectra without annotations and mark them up in later stages with annotation data as the spectra are identified. To meet this need, a separate database is used to handle annotations. For the end user, this distinction will not be noticeable. The search results are presented in the form of a Hypertext Markup Language (HTML) table. Using HTML for this purpose provides an easy way to create a result view that is compatible with any web browser and thus reduces extra software dependencies.

Matched peptides get a ranking that reflects their likelihood of being a good hit, given by the F-value of the discriminant function F described above (eq. 7). A high F-value means a highly ranked and therefore likely match. A matching test was performed to try out the performance of the matching procedure. A data set consisting of neuropeptides from rat (the "rat" data set) was imported into the spectral library (see 3.3.1), and annotations for most of these spectra were imported into the annotation library (see 3.3.2). The annotations were obtained by searching all MS/MS spectra from the "rat" data set in X!Tandem against a database of confirmed or highly likely endogenous peptides from rat. All spectra from the "taxotere" data set described above were used as queries; none were previously annotated. The query spectra for the top eight hits from the test matching were sent to X!Tandem for annotation in order to compare with the annotations obtained from the "rat" data set. The following parameter settings were used in X!Tandem:

Modifications: C-terminal amidation and oxidation@M

This setting describes the allowed modifications of the peptide, in this case amidation of the C-terminal amino acid and a possible oxidation of methionine.

Refinement: Acetyl N-term, Deamidated@N, Deamidated@Q, Acetyl@K, Oxidation@M

The refinement parameter allows for further relaxation of the search constraints by allowing more possible modifications in a second-stage search, which expands the search space and thereby increases the chance of finding an annotation.

Spectrum: parent monoisotopic mass error: 10ppm, fragment monoisotopic mass error: 0.5 Da

The spectrum setting describes the tolerance in mass difference between the query spectra and the database spectra. The parent mass error reflects the mass of the peptide that was fragmented into the MS/MS spectrum, and the fragment mass error refers to the peak masses of the MS/MS spectrum.

3.3.1 Spectral library

The spectral library, or database, was constructed using a modified version of SpectraST. The library is stored in a binary file format specific to SpectraST, called splib, which is fast in terms of search speed; the downside is that there is no way to extract data directly from the database, so it should be thought of as a reference catalog only. A text version of the library is created when the library is built, but it is not required for the library to function, as the search process only uses the binary library file. Once peptide spectra are matched, they need to be extracted from the raw spectral data files for further analysis, such as sending them to Mascot or Sequest for annotation.

3.3.2 Annotation library

The annotation library was constructed using SQLite, a relational database engine that provides a light-weight database service and also takes up little storage space. An Entity Relationship (ER) diagram describing the database design is shown in figure 17. This type of diagram shows the tables in the database, which entity types (attributes) each table holds, and the data type of each entity. The entities in different tables are related to each other as described by a cardinality. The cardinalities express the number of instances of one entity that can, or must, be associated with the instances of another entity. There are three types of cardinalities: one-to-one, one-to-many and many-to-many relationships. [41]

Information from the annotation database is automatically fetched by a Perl script during a search in the peptide library. An SQL query is sent to the database asking for any available annotations for a given spectrum. For example, in order to find out if there are any peptides in the database similar to a given peptide with the name “Peptide_123”, the following SQL query would be sent:

select * from occurrence where id = (select similar_peptide from similar where id = 'Peptide_123');

If there are any peptides similar to “Peptide_123”, the names of these peptides and the MS runs they occur in will be returned.
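As an illustration, the lookup above can be reproduced against a toy SQLite database. The table and column names (occurrence, similar, similar_peptide) are taken from the query in the text; the full schema of figure 17 is not reproduced here, so this is a sketch of the idea rather than the actual database design.

```python
import sqlite3

# Minimal in-memory sketch of the annotation database. Only the tables
# mentioned by the example query are created; all names and rows are
# illustrative assumptions, not the real schema of figure 17.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE peptide    (id TEXT PRIMARY KEY);
CREATE TABLE occurrence (id TEXT REFERENCES peptide(id), ms_run TEXT);
CREATE TABLE similar    (id TEXT REFERENCES peptide(id),
                         similar_peptide TEXT REFERENCES peptide(id));
""")
cur.executemany("INSERT INTO peptide VALUES (?)",
                [("Peptide_123",), ("Peptide_456",)])
cur.execute("INSERT INTO similar VALUES (?, ?)",
            ("Peptide_123", "Peptide_456"))
cur.execute("INSERT INTO occurrence VALUES (?, ?)",
            ("Peptide_456", "run_02"))
conn.commit()

# The similarity lookup from the text: in which runs do peptides similar
# to "Peptide_123" occur?
rows = cur.execute(
    "SELECT * FROM occurrence WHERE id = "
    "(SELECT similar_peptide FROM similar WHERE id = ?)",
    ("Peptide_123",)).fetchall()
print(rows)  # [('Peptide_456', 'run_02')]
```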

Peptide information is imported into the annotation library at the same time as the peptide spectra are imported into the spectral library. At this stage there is no annotation data available in the sense of amino acid sequence or protein information. The available information consists of the peptide identity, whether a spectrum is similar to another spectrum, and in which run the peptide occurs. This means that there is enough information to be able to trace the occurrence of spectra to specific experiments, even if the peptide itself is not annotated.

3.4 Pattern recognition

This subsection will explain the methods behind the process of pattern recognition. The aim is to find patterns in the peptide data in an attempt to correlate peptide abundance to a specific tissue or treatment.


Figure 17 – Entity Relationship diagram that describes the design of the annotation database. Each box represents a table in the database, where the name of the table is listed at the top and the attribute names and corresponding data types are listed below. The labels on the arrows describe the cardinalities between entries in the tables. For example, the cardinality 1 to N between “Peptide” and “Annotation” means that one peptide can have several annotations.


3.4.1 Data sub-setting and filtering

The peptide abundance levels in the data set vary greatly in magnitude, from 0 up to around 65 000 000. The abundance values are dimensionless and reflect the relative abundance between all peptides in the data set. An abundance value of 0 means that no peptide signal was detected in that particular sample. However, the vast majority of the data is in the lower ranges, under 1000. The data set was filtered using a cutoff for the abundance value to create a subset of the original data set. The purpose was partly to investigate if there is a difference between low and high abundance peptides when it comes to prediction performance. Filtering and sub-setting were also performed to determine if peptides in low abundance are significant and not just noise in the data.

Abundance cutoff values were chosen arbitrarily from a visual inspection of the data distribution (fig. 25). Several values were tested: 150, 300, 500 and no cutoff. The filter works as follows:

• Go through all peptides in the data set.

• Check if the peptide occurs at a level higher than the given threshold in any MS run (any observation).

• If it does occur at a level higher than the threshold, save the peptide to the subset.

A more stringent cutoff was also used to explore any differences between high and low abundance peptides when it comes to discerning between sample groups. The hard criterion was that all observations must satisfy the threshold, where a cutoff value of 500 was used.
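The soft filter (any observation above the cutoff) and the hard filter (all observations above, or all below, the cutoff) can be sketched as follows. The peptide names and abundance values are invented for illustration; the project's actual filtering was done on the full abundance matrix.

```python
# Sketch of the two abundance filters described above. `data` maps each
# peptide to its abundance values across MS runs (illustrative numbers).
data = {
    "pep_A": [0, 120, 800],    # exceeds 500 in one run only
    "pep_B": [10, 40, 90],     # always below 500
    "pep_C": [600, 700, 900],  # exceeds 500 in every run
}

def soft_filter(data, cutoff):
    """Keep a peptide if ANY observation exceeds the cutoff."""
    return {p: v for p, v in data.items() if max(v) > cutoff}

def hard_filter(data, cutoff, above=True):
    """Keep a peptide only if ALL observations are above (or below) the cutoff."""
    if above:
        return {p: v for p, v in data.items() if min(v) > cutoff}
    return {p: v for p, v in data.items() if max(v) < cutoff}

print(sorted(soft_filter(data, 500)))               # ['pep_A', 'pep_C']
print(sorted(hard_filter(data, 500)))               # ['pep_C']
print(sorted(hard_filter(data, 500, above=False)))  # ['pep_B']
```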

3.4.2 Feature extraction - clustering

Feature extraction was done using a k-means method implemented in R (“kmeans” in package stats version 2.14.1) [33]. Since it is impossible to know the optimal number of clusters for a given data set beforehand, several values of k were tested to find the optimal ones: 5, 10, 20, 30, 40, 50, 100, 150, 200, 250, 500, 1000, 3000, 5000 and 10000.

The optimal number of clusters, k, was investigated using the elbow method.
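The project used R's kmeans from the stats package; the pure-Python sketch below only illustrates the idea behind the elbow method: the total within-cluster sum of squares drops sharply until the true number of clusters is reached and only marginally thereafter. The one-dimensional data and all parameters are synthetic assumptions.

```python
import random
random.seed(1)

def kmeans_wss(points, k, iters=50):
    """Lloyd's algorithm on 1-D data; returns within-cluster sum of squares."""
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[nearest].append(p)
        # keep the old center if a cluster ends up empty
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sum(min((p - c) ** 2 for c in centers) for p in points)

# Two well-separated groups: the "elbow" should appear at k = 2.
points = ([random.gauss(0, 1) for _ in range(50)] +
          [random.gauss(20, 1) for _ in range(50)])
wss = {k: kmeans_wss(points, k) for k in (1, 2, 3, 4)}
# The drop from k = 1 to k = 2 dominates; larger k gives marginal gains.
```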

3.4.3 Pattern recognition

Principal Component Analysis (PCA) was used as an exploratory way of finding distinctive patterns in the data. The scores of the two major principal components were plotted against each other in order to discover any clear distinction between groups in the data set.
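As a sketch of what the PCA scores are, the toy example below computes the first two principal components by power iteration on the sample covariance matrix. The project presumably used a standard R implementation; this illustration on invented two-group data only shows how PC1 scores separate groups that differ along a dominant direction.

```python
def pca_scores(X, n_pc=2, iters=200):
    """Minimal PCA via power iteration with deflation (illustrative only)."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    # sample covariance matrix
    C = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) / (n - 1)
          for b in range(d)] for a in range(d)]
    components = []
    for _ in range(n_pc):
        v = [1.0 + 0.01 * j for j in range(d)]  # slightly asymmetric start
        for _ in range(iters):
            w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        lam = sum(v[a] * sum(C[a][b] * v[b] for b in range(d))
                  for a in range(d))
        components.append(v)
        # deflate: remove the found component from the covariance matrix
        C = [[C[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    return [[sum(Xc[i][j] * v[j] for j in range(d)) for v in components]
            for i in range(n)]

# Two groups separated along a shared direction: PC1 should split them.
group_a = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2]]
group_b = [[10.0, 10.1], [10.2, 9.9], [9.9, 10.2], [10.1, 10.0]]
scores = pca_scores(group_a + group_b)
pc1 = [s[0] for s in scores]  # group A and group B get opposite signs
```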

Classification was done using Random Forest (“randomForest” in package randomForest version 4.6-6) as the classifier in R. A model was trained and tested by means of LOOCV.

The Random Forest model was trained and tested using 2000 trees and √p attributes for each split, where p is the number of attributes. The accuracy was calculated by taking the mean of the classification error for each class over all CV folds. Also, the confusion matrix for each CV fold was added into a total confusion matrix for the entire CV. A model was trained for all combinations of data filtering and the values of k that were determined as optimal from the elbow method (see 3.4.2).
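The scheme above — leave-one-out folds, averaging the per-class errors, and summing the fold-wise confusion matrices — can be sketched as follows. Random Forest itself is not reimplemented here; a trivial nearest-centroid classifier stands in for it, and the toy data are invented.

```python
from collections import defaultdict

def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in for Random Forest: classify by nearest class centroid."""
    centroids = {}
    for label in set(train_y):
        rows = [train_X[i] for i in range(len(train_y)) if train_y[i] == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2
                                   for a, b in zip(x, centroids[lab])))

def loocv(X, y):
    """Leave-one-out CV: accumulate a total confusion matrix over all folds."""
    confusion = defaultdict(int)  # (actual, predicted) -> count
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        pred = nearest_centroid_predict(train_X, train_y, X[i])
        confusion[(y[i], pred)] += 1
    # overall error = mean of the per-class error rates
    class_err = {}
    for c in set(y):
        total = sum(v for (a, _), v in confusion.items() if a == c)
        wrong = sum(v for (a, p), v in confusion.items() if a == c and p != c)
        class_err[c] = wrong / total
    overall = sum(class_err.values()) / len(class_err)
    return overall, dict(confusion)

# Toy, linearly separable example (two features, two classes).
X = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 8], [8, 9]]
y = ["A", "A", "A", "B", "B", "B"]
overall, confusion = loocv(X, y)
print(overall)                # 0.0 on this separable toy set
print(confusion[("A", "A")])  # 3
```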

A permutation analysis was done to assess the LOOCV performance estimates for the models that had the best predictive performance. The class labels of the data set were randomly shuffled so that they would not represent their true examples. The classifier was then trained and tested on the data set containing scrambled class labels. This was repeated 500 times for each combination of cutoff and value of k, and was done to ensure that the performance obtained in the model test was robust and did not occur by chance. A p-value was calculated from the permutation analysis to get a measure of the model’s statistical significance. The permutation analyses were focused on the models that gave the best performance for a given abundance cutoff and number of clusters.
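The permutation test can be sketched as below. With 500 permutations the smallest attainable p-value is 1/501 ≈ 0.002, which plausibly explains why the result tables report “< 0.002”. The add-one correction and the toy error function are illustrative assumptions, not the project's actual code.

```python
import random
random.seed(7)

def permutation_p_value(error_fn, X, y, observed_error, n_perm=500):
    """Fraction of label permutations scoring at least as well as observed.

    `error_fn(X, y)` returns a model's CV error for the given labels.
    A small p-value means the observed performance is unlikely by chance.
    """
    at_least_as_good = 0
    labels = list(y)
    for _ in range(n_perm):
        random.shuffle(labels)
        if error_fn(X, labels) <= observed_error:
            at_least_as_good += 1
    # add-one correction keeps the estimate away from an impossible p = 0
    return (at_least_as_good + 1) / (n_perm + 1)

# Toy error function: a majority-vote "model" whose error is the share of
# examples outside the majority class (it ignores X entirely).
def majority_error(X, y):
    best = max(set(y), key=y.count)
    return sum(1 for lab in y if lab != best) / len(y)

X = [[i] for i in range(8)]
y = ["A"] * 4 + ["B"] * 4
# Majority error is invariant to shuffling, so an "observed" error of 0.1
# is never matched by any permutation and the p-value hits its floor.
p = permutation_p_value(majority_error, X, y, observed_error=0.1, n_perm=500)
print(p)  # 1/501, the smallest attainable value with 500 permutations
```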

Overall attribute importance for the classifier was given by evaluating the mean decrease in Gini Index. This measure was chosen simply because it was the measure available in the R implementation of Random Forest for evaluating the overall attribute importance of the classifier. Class-specific attribute importance was given by evaluating the mean decrease in accuracy for the exclusion of an attribute, i.e. a cluster resulting from the feature extraction. The peptides in the clusters that contributed to the highest decrease in accuracy when removed from the model were looked at separately.

4 Results

This section explains the results of the project’s work. It covers the results of the pattern recognition and how the peptide library is designed and used.

4.1 Using the peptide library

The library is run by calling different Perl scripts. A Graphical User Interface (GUI) would be preferable, but the development of such an interface was outside the time frame of the project. A manual was written that explains the usage in more detail, see Appendix B.

4.1.1 Importing spectral data

The actual import process is done via a Perl script that reads all MGF files in a specified directory, extracts all spectra from the files and puts them into individual files, converts all the files to mzXML and finally imports them into SpectraST, which creates the library files in splib format and stores them in a standardized folder.

The spectra’s entry names in the spectral library are the same as the name of the file they were imported from. This means that all spectra in a file will be given the same name when imported. Each spectrum has a unique name which must be kept when the spectrum is imported into the library; otherwise it would be impossible to trace a match back to the correct spectrum. To make this


Figure 18 – Example of a search result.

Figure 19 – Example of annotation information presented in the search output.

possible, every spectrum in an MGF file is extracted and put into its own file, which has the same file name as the spectrum title. Consequently, the execution time of the import process becomes longer.
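The per-spectrum extraction step can be sketched as below: each BEGIN IONS…END IONS block of an MGF file is collected under its TITLE so that every spectrum keeps a unique, traceable name. The actual Perl script writes individual files and converts them to mzXML; this sketch simply returns the blocks in a dict instead, and the example MGF content is invented.

```python
def split_mgf(mgf_text):
    """Return {spectrum title: full MGF block} for each spectrum."""
    spectra = {}
    block, title = [], None
    for line in mgf_text.splitlines():
        line = line.strip()
        if line == "BEGIN IONS":
            block, title = [line], None
        elif line.startswith("TITLE="):
            title = line[len("TITLE="):]
            block.append(line)
        elif line == "END IONS":
            block.append(line)
            if title:
                spectra[title] = "\n".join(block)
        elif block:
            block.append(line)
    return spectra

example = """BEGIN IONS
TITLE=spectrum_001
PEPMASS=417.23
147.11 30.0
263.20 55.0
END IONS
BEGIN IONS
TITLE=spectrum_002
PEPMASS=530.28
401.30 100.0
END IONS"""

spectra = split_mgf(example)
print(sorted(spectra))  # ['spectrum_001', 'spectrum_002']
```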

4.1.2 Searching the library

Initiating a search is done by calling a Perl script where a query file containing at least one MS/MS spectrum is provided. More details on how to perform a search are available in the manual in Appendix B. The search output is presented to the user as an HTML table and contains various information about the peptide match. This includes the dot product score, ∆D, dot bias, the value of the discriminant function (F-value) and the number of annotations (fig. 18). The output also includes a plot of both the query spectrum and the database spectrum to give a visualization of the match (fig. 20). The text version of the library is used in this case in order to find the spectrum details. The annotation information is presented on a sub page, where information such as amino acid sequence, e-values and the search engine used to produce the results are shown (fig. 19). Here, it is also possible to get the output file from the search engine, which contains much more detailed information about a certain annotation. There is also information on whether or not the annotation has been manually checked, in order to help the user get a quick overview.

It is desirable to be able to mark annotations that have been manually verified in order to avoid doing the same work several times. This is shown in the annotation information under the column


Figure 20 – Example of a visualization of a spectrum match.

“Checked?”. Marking annotations as checked is done by passing a file containing peptide identities to a script that updates the information for those IDs accordingly. It is also possible to change this status if, for example, a mistake is made.

The spectra matching did not originally work as intended. SpectraST does some processing of the query spectra before doing the actual matching, which caused two identical spectra to not have a score equal to 1. The configurable search parameters were changed in a trial-and-error fashion until two identical spectra indeed got a perfect score (Appendix A.2).

The F-value cutoff for measuring similarity between spectra was set to 0.65 as the default value. This value was set by testing several values and visually comparing plots of the spectrum pairs. As mentioned above, this value can be changed by the user to change the stringency of the similarity criteria.

4.2 Performance

Importing spectra into the database is fairly fast, although probably limited by the transfer speed of the hard drive, since disk operations are a big part of the process. The process is slowed down by the fact that each spectrum needs to be imported from an individual file. Importing about 1 GB of MS/MS data in MGF format took about four hours on a computer with a mechanical hard drive operating at 5400 rpm with a SATA 3 Gb/s interface and a dual-core Intel Core i5 processor at 2.40 GHz.

Searching the spectrum database is fast in itself; however, the search process is followed by fetching annotations and creating plots for visualizing the peptide matches. This results in longer execution times the more search hits there are.


Table 1 – Subset sizes for all abundance cutoffs.

Abundance cutoff    Number of attributes
none                43450
150                 40652
300                 36786
500                 32868
only >500            2689
only <500           10582

The last two rows describe the subset sizes when the “hard” cutoff criterion was used, where “only >500” (“only <500”) refers to the case where all peptides must occur at an abundance value higher (lower) than 500 in all runs, see section 3.4.1.

The annotation search for the top eight hits from the test matching (section 3.3) resulted in the same annotations as their matches in the spectral library, in all cases. This showed that the spectra matching was able to correctly match different spectra that represent the same peptide, even cross-species peptides. The matching is also able to cope with the differences that occur between different mass spectrometer technologies; an Orbitrap [37] detector was used for the “taxotere” data and a Fourier Transform Ion Cyclotron Resonance (FTICR) [23] detector for the “rat” data.

4.3 Pattern recognition

This subsection will cover the results of the methods used to perform the pattern recognition on the peptide data. Various parameters were tried out in order to make the analysis as comprehensive as possible within the given time frame of the project. For brevity, only results that were considered relevant are presented here. The aim of this section is to show how peptides of interest and of possible biological significance can be found using machine learning methods. The goal is not to gain a deepened understanding of the biology surrounding these peptides but to present a method for pinpointing the interesting peptides for further analysis.

4.3.1 Data sub-setting and filtering

The abundance cutoff values resulted in subsets of different sizes, shown in table 1. The cutoff values were chosen arbitrarily in order to create data sets of varying size, without reducing the number of peptides by more than about 10000.

4.3.2 Feature extraction, clustering

The elbow method revealed that the difference in variance explained became negligible for numbers of clusters greater than 250 for all abundance cutoffs except < 500 (fig. 23). The “elbow” occurs at about the same number of clusters for all other abundance cutoff values (fig. 21 and 22). The


Figure 21 – Elbow plot: total variance explained over the number of clusters used in k-means clustering. The “elbow” is located where the curve bends off the most, in this case around k = 30-40. Data set: taxotere. The red line represents data with no abundance cutoff, the blue line represents data with abundance cutoff at 150.

resulting elbow plot for cutoff < 500 shows a much lower slope than for the other cutoffs. This is likely due to loss of variance in the data, as there are no abundance values above 500 in the data set.

For all values of k, there was a single, or a handful of, very large clusters in the data set. A representative example can be seen in figure 24. This cluster contained almost exclusively peptides of relatively low abundance (fig. 26) compared to the abundance levels in the entire data set (fig. 25).

Several values around the “elbow” were investigated for all abundance cutoffs in the classification procedure since they were possible candidates for an optimal value of k. The following values of k were investigated:

• No abundance cutoff and cutoff at 150: k = 30 and 40 (fig. 21).

• Cutoff at 300 and 500: k = 20, 30, 40, 50 (fig. 22).

• “Hard” cutoff at < 500 and > 500: k = 20, 30, 40, 50 and k = 250 respectively (fig. 23).

4.3.3 Pattern recognition

PCA was applied on the entire data set and the scores for the first two principal components were plotted, with the examples marked according to their group. The analysis revealed a clear


Figure 22 – Elbow plot: total variance explained over the number of clusters used in k-means clustering. The “elbow” is located where the curve bends off the most, in this case around k = 20-50. Data set: taxotere. The red line represents data with abundance cutoff at 300, the blue line represents data with abundance cutoff at 500.

Figure 23 – Elbow plot: total variance explained over the number of clusters used in k-means clustering. The “elbow” is located where the curve bends off the most, in this case around k = 20-50. Data set: taxotere. The red line represents data with hard abundance cutoff at > 500, the blue line represents data with hard abundance cutoff at < 500.


Figure 24 – Example showing the number of peptides per cluster, where the vast majority of peptides reside in the same cluster. This example represents k-means clustering with k = 50, abundance cutoff: 500.

Figure 25 – Abundance distribution for all peptides in the data set. The abundance levels range from 0 to 83970000 and the mean value is 12680. Values in the plot have been scaled by a log10 factor for better clarity and 0-values are ignored since log10(0) is undefined (220394 of 1129700 values are 0-values).


Figure 26 – Abundance distribution for all peptides in the largest cluster for k-means with k = 50 and abundance cutoff: 500. The abundance levels range from 0 to 2672102 and the mean value is 1427. Values in the plot have been scaled by a log10 factor for better clarity and 0-values are ignored since log10(0) is undefined (199520 of 987610 values are 0-values).

distinction between the samples that were snap-frozen and stabilized (fig. 27). However, it was not possible to discern between samples that were or were not treated with Taxotere.

A PCA was also performed on the data consisting of all peptides that resided in the large cluster after k-means clustering, as discussed above (see 4.6.1). This was done to see if there was any distinct pattern in the data or if the cluster simply formed because the peptides in it were left out of all other clusters. The results revealed that there was indeed a signal in the data that made it possible to separate snap-frozen and stabilized samples (fig. 28). The pattern was the same for all combinations of abundance cutoffs and k values that were chosen from the elbow plots. It was, however, not possible to see any clear separation when it comes to Taxotere treatment.

For the entire taxotere data set, the PCA revealed a few outlying examples (fig. 27). The PCA performed on the peptides in the largest cluster showed that there was a clear outlier that occurred in both PCA analyses, which can be seen in the top-right corner of figure 28. The outliers seen in figure 27 were removed and the PCA was redone without them. After removing the outliers, it was still possible to see a clear difference between snap-frozen and stabilized samples (fig. 29).

The results from the Random Forest classification indicate a difference in predictive accuracy depending on abundance cutoff and number of clusters, although not statistically ascertained. Abundance cutoffs at both 150 and 300 resulted in the lowest error rates of all models trained. For each cutoff, there were two models that had the same overall error rate (table 2).

The harder cutoff criterion, > 500, resulted in very poor predictive performance. The overall error


Figure 27 – The scores of the two most important principal components from a PCA analysis on the entire taxotere data set. Two outliers can be seen in the top right corner as well as one possible outlier in the bottom right corner. The examples are numbered as: 1 - “Stab No”, 2 - “SnapFroz No”, 3 - “Stab Yes”, 4 - “SnapFroz Yes”.

Figure 28 – The scores of the two most important principal components from a PCA analysis on the peptides residing in the largest cluster for k-means with k = 50. Abundance cutoff: 500. An outlier can be seen in the top right corner, the same outlier as seen in figure 27. The examples are numbered as: 1 - “Stab No”, 2 - “SnapFroz No”, 3 - “Stab Yes”, 4 - “SnapFroz Yes”.

Table 2 – Mean error rates for the Random Forest classifier, abundance cutoffs: 150, 300.

                        Error rate (%)
Overall   SnapFroz No   SnapFroz Yes   Stab No   Stab Yes    p-value   k    Abundance cutoff
30.8      33.3          0.0            50.0      50.0        < 0.002   20   150
34.6      33.3          0.0            50.0      66.7        < 0.002   20   300

The classifier was trained on 26 examples using LOOCV. The columns describe the error rate for the classifier overall and the error rates with respect to discerning a specific class. The p-value describes the probability of the model’s overall error rate having occurred by chance. Data set: Taxotere.


Figure 29 – The scores of the two most important principal components from a PCA analysis on the entire taxotere data set, after the outliers (fig. 27) have been removed. The examples are numbered as: 1 - “Stab No”, 2 - “SnapFroz No”, 3 - “Stab Yes”, 4 - “SnapFroz Yes”.

Table 3 – Mean error rates for the Random Forest classifier, abundance cutoff: < 500.

                        Error rate (%)
Overall   SnapFroz No   SnapFroz Yes   Stab No   Stab Yes    p-value   k
38.5      50.0          12.5           83.3      16.7        < 0.002   250

The classifier was trained on 26 examples using LOOCV. The columns describe the error rate for the classifier overall and the error rates with respect to discerning a specific class. The p-value describes the probability of the model’s overall error rate having occurred by chance. Data set: Taxotere.

rate is 50 % (table 4). Abundance cutoff < 500 resulted in a better model, with an overall error rate of about 38 % (table 3). This indicates that low-abundance peptides are significant for determining the sample class, and that there is no clear correlation between high abundance level and strong predictive power for a given peptide.

Permutation analysis was performed on the best-performing models to explore any differences in robustness and showed that all trained models were indeed better than guessing classes at random. All models for cutoffs 150 and 300 show the same overall error rate but have slightly different class-specific error rates. The class “SnapFroz Yes” is predicted correctly far more often than the other classes. The p-value describing the validity of the classifier’s performance was calculated from the distribution of error rates from the permutation analysis (fig. 30).

The classifier could not separate all classes equally well, but it did not make any mistakes in separating sample preparations: whether a sample was snap-frozen or stabilized (table 5). This indicated a significant difference in peptide abundances between these groups and also confirms the result from


Table 4 – Mean error rates for the Random Forest classifier, abundance cutoff: > 500.

                        Error rate (%)
Overall   SnapFroz No   SnapFroz Yes   Stab No   Stab Yes    k
50.0      50.0          25.0           66.7      66.7        30

The classifier was trained on 26 examples using LOOCV. The columns describe the error rate for the classifier overall and the error rates with respect to discerning a specific class. Data set: Taxotere.

Figure 30 – Error rate distribution for a permutation analysis where the class labels were randomized and the model was trained and tested using these scrambled labels 500 times. In this example, an abundance cutoff at 150 and a k value of 20 were used. Data set: taxotere.


Table 5 – Confusion matrix example for the Random Forest classifier.

                SnapFroz No   SnapFroz Yes   Stab No   Stab Yes
SnapFroz No     4             2              0         0
SnapFroz Yes    0             8              0         0
Stab No         0             0              3         3
Stab Yes        0             0              4         2

Value of k in k-means: 20, abundance cutoff: 300. The class names on the left hand side are the actual classes and the class names at the top are the predicted classes. For example, of the six “SnapFroz No” examples, four were predicted correctly and two were predicted as “SnapFroz Yes” - giving an error rate of 2/6, or 33.3%.

Table 6 – Mean error rates for the Random Forest classifier, classes describe Taxotere treatment.

     Error rate (%)
Overall   No     Yes      p-value   Abundance cutoff   k
34.6      41.7   28.6     0.002     150                20
34.6      41.7   28.6     < 0.002   300                20

The classes have been reduced to Taxotere treatment “yes” or “no”. The p-value describes the probability of the model’s overall error rate having occurred by chance. Abundance cutoffs: 150, 300. Data set: Taxotere.

the PCA analysis (fig. 27). However, the classifier was not able to separate groups with respect to Taxotere treatment. To further investigate the possibility of separating on Taxotere treatment, two new classes were formed instead of four: Taxotere-treated and non-Taxotere-treated samples. Consequently, these new classes only considered Taxotere treatment and not sample preparation.

Some of the models trained on the two-class data set showed promising results. The lowest error rates were obtained for abundance cutoffs 150 and 300 with k value 20 (table 6). Permutation tests were performed on these models and revealed that their performance is statistically significant and did not just occur by chance.

To investigate the possibility of separating on Taxotere treatment even further, the data set was split into two parts: snap-frozen samples and stabilized samples. Within each of these data sets there were two classes: treated with Taxotere and not treated with Taxotere. An abundance cutoff at 150 was used and the data sets were clustered using k-means as previously described. k values of 40 and 20 were optimal for snap-frozen and stabilized samples, respectively. The resulting classifier models showed an overall error rate of 42.9% for the snap-frozen samples and 66.7% for the stabilized samples. Permutation tests were performed on both models and showed that the performance for the snap-frozen samples was significant (p-value 0.002), while the performance for the stabilized samples was not significant (p-value 0.799) (table 7).


Table 7 – Mean error rates for the Random Forest classifier, data split by sample preparation.

     Error rate (%)
Overall   Yes    No       p-value   Subset        k
42.9      25.0   66.7     0.002     Snap-frozen   40
66.7      50.0   83.3     0.799     Stabilized    20

The data set has been split into only snap-frozen and only stabilized samples. Within each subset are the classes Taxotere treatment “yes” or “no”. The p-value describes the probability of the model’s overall error rate having occurred by chance. Abundance cutoff: 150. Data set: Taxotere.

4.3.4 Attribute importance

The attribute importance given by the classifier was plotted for the top ten most influential attributes (clusters). This was done for the top-performing models given by a k value of 20 for abundance cutoffs at 150 and 300, respectively. Attribute importance was given by evaluating the mean decrease in Gini Index [3] (see section 2.4.5). Clusters 7 and 14 were the most important attributes for the model trained on data with abundance cutoff 150 (fig. 31). Looking at the model for abundance cutoff 300, clusters 17 and 11 were the most important attributes (fig. 32). In total, these four clusters contained 106 peptides.

Class-specific attribute importance showed that different clusters had high importance for the classifier depending on the class (fig. 33, 34, 35, 36). The clusters (peptides) specific for snap-frozen and stabilized samples were looked into in more detail, since the classifier always managed to separate those sample types. This means that “SnapFroz Yes” and “SnapFroz No” were looked at as one class, and “Stab Yes” and “Stab No” as one class. The masses of both the overall important peptides and the class-wise important peptides were plotted and show some peptides common to the overall importance and the class-wise importance. The highest-ranking cluster for each cutoff and class was picked out and duplicates were removed. There are almost no common peptides that distinguish snap-frozen and stabilized samples, except for a few just below 5000 Da (fig. 37).

The attribute importance for separation based on Taxotere treatment showed that clusters 7 and 14 for abundance cutoff 150 were the most important ones, as well as clusters 14 and 17 for cutoff 300. These are the same clusters that had the highest importance in distinguishing snap-frozen samples, in particular “SnapFroz No” (fig. 33, 34).

5 Discussion

Some decisions and limitations had to be made throughout the project in order to focus the work towards the goal and try to finish on time. The overall goal of the project was to create a platform with the purpose of aiding the analysis of MS data (see section 1.2). While there is some polishing left to do and aspects to further look into, the overall goal of the project has been met. The spectral database works as intended and can be used for storage and query of both raw and annotated spectra.


Figure 31 – Overall attribute importance for the Random Forest model trained on a clustered data set where k = 20 and abundance cutoff 150. The higher the decrease in Gini Index, the more important the attribute.

The pattern recognition revealed some interesting results from quantified MS data and is able to pick out specific peptides that are of importance in categorizing a sample.

5.1 Peptide library

The dot product score (eq. 3) should intuitively become 1 if two identical spectra are matched. This was not the case when matching was initially done with SpectraST. Since SpectraST is meant to hold spectra that are calculated from a search engine, the spectra in the database are general for a certain peptide. The query spectra are processed by default, which led to a score that was not equal to 1 even though two identical spectra were matched. The SpectraST developers most likely intended to reduce noise in the spectra, which is why the spectra were processed. Therefore, the search parameters were changed to allow both smaller and more peaks to be included in the query spectra (see Appendix A.2), so that two identical spectra get a dot product score of 1. It is possible that the modifications to the filter are a bit too generous, in that they allow more bad peaks than what would be optimal. The parameters were tested on a limited number of spectra, which means that they may be overly adapted to those spectra in particular.

Integrating many different software packages has its downsides, which are most prominent when it comes to execution times. The feature that allows the user to visually inspect a match is helpful, but makes searches slower.

Figure 32 – Overall attribute importance for the Random Forest model trained on a clustered data set where k = 20 and abundance cutoff 300. The higher the Gini index decrease, the more important the attribute.

Figure 33 – Class-specific attribute importance for “SnapFroz No” in the Random Forest model trained on a clustered data set where k = 20. Abundance cutoffs 150 (left) and 300 (right).

Figure 34 – Class-specific attribute importance for “SnapFroz Yes” in the Random Forest model trained on a clustered data set where k = 20. Abundance cutoffs 150 (left) and 300 (right).

Figure 35 – Class-specific attribute importance for “Stab No” in the Random Forest model trained on a clustered data set where k = 20. Abundance cutoffs 150 (left) and 300 (right).

Figure 36 – Class-specific attribute importance for “Stab Yes” in the Random Forest model trained on a clustered data set where k = 20. Abundance cutoffs 150 (left) and 300 (right).

Figure 37 – Masses for the most important attributes for the Random Forest classifier’s overall performance and class-wise predictive performance (classes: snap-frozen and stabilized). The plots represent models trained on data sets with abundance cutoffs 150 and 300.

5.2 Pattern recognition

It also seemed possible that there actually is a signal in the data that makes it possible to discern between Taxotere-treated and untreated samples. When the class labels were replaced to only describe Taxotere treatment (yes or no), the classifier performed poorly (table 6). When dividing the data set into snap-frozen and stabilized, the predictive performance dropped even more (table 7). There seems to be a stronger signal for Taxotere treatment in the snap-frozen samples, as seen by the error rate and p-value in table 7. This signal is probably the reason why it is possible to discern between treated and untreated samples when using the entire data set, since both snap-frozen and stabilized samples were mixed and the only class-wise separation is in Taxotere treatment (table 6). The low performance was probably also partly due to the low number of examples in each data set after the split; there were only 12 and 14 examples in the stabilized and snap-frozen subsets, respectively.

There was a very big overlap in important attributes between snap-frozen samples and Taxotere-treated samples. This suggests that most of the signal for Taxotere treatment exists among the snap-frozen samples, a theory further supported by the results in table 7. This result is somewhat unexpected, as the higher level of degradation in snap-frozen samples compared to stabilized samples should result in a weaker Taxotere signal. However, the data set contains a few more examples of snap-frozen samples, which may be a reason for the stronger separation capability for Taxotere treatment within that group.

There seems to be a clear distinction between peptide masses for peptides that are specific for a certain class or sample (fig. 37). There were only a few common peptides around 4900 Da between snap-frozen and stabilized samples, and it seems that more peptides can be associated with snap-frozen samples. It is, however, hard to draw any conclusions based on the peptide masses alone. A more thorough investigation into these peptides should be conducted in order to gain any meaningful understanding; such an investigation is, however, outside the scope of this project.

6 Future work

The peptide library is run directly from the file system, which makes the HTML-based result view static in a sense. System commands, such as calling a script, cannot be performed via the HTML page, which leads to extra steps for certain tasks, such as marking annotations as checked. A suggestion for further development is to run the platform on a web server. Not only would this solve the above-mentioned problems, it would also provide a relatively easy way to construct a GUI. Additionally, a web server would allow remote access to the peptide library, enabling a centralized setup rather than storing it locally on a single computer.

It would also be beneficial to further develop the feature that allows the user to visually inspect a spectrum match, since it currently results in longer search times. Support for additional file formats would also be useful, since it would increase the flexibility of the peptide library and allow imports from other formats without having to manually convert them to MGF.

Very few peptides, in relation to the total number of peptides in the data set, contribute a lot to the predictive power of the classifier. This may suggest that the majority of peptides in the data set are not relevant for explaining the biological differences between the samples. However, more detailed investigations into this area need to be conducted before anything concrete can be said. It would also be interesting to perform machine learning on other data sets, where it might be possible to reach more conclusive results.

The permutation tests show that the classifier performance is indeed statistically significant (p-values of 0.002 and lower). The error rate in itself is, however, quite high. More examples are probably required in order to lower the error rate further; the small number of examples is most likely the main reason for the high error rates. It might be possible to get better results by using a different classifier model, but that investigation is outside the scope of this project. The highest performance was obtained for data sets where an abundance cutoff had been used, which indicates that there is some level of noise in the data and that filtering it out yields better predictive performance. It is unclear at exactly which level the abundance signal is noise; a more detailed investigation into this would be sensible, as it could help develop a more standardized procedure for pre-processing this kind of data.
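The permutation test used to obtain these p-values can be sketched as follows. This is an illustrative Python sketch, and `error_rate_fn` is a hypothetical placeholder for the cross-validated classifier evaluation actually used:

```python
import random

def permutation_p_value(error_rate_fn, features, labels, n_perm=500, seed=0):
    """Permutation test for classifier significance.

    error_rate_fn(features, labels) should return a (cross-validated)
    error rate; it is a hypothetical stand-in for the real evaluation.
    The class labels are shuffled n_perm times, and the p-value
    estimates how often a chance labelling does at least as well as
    the observed one; (hits + 1) / (n_perm + 1) is the usual
    conservative estimator.
    """
    rng = random.Random(seed)
    observed = error_rate_fn(features, labels)
    hits = 0
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        if error_rate_fn(features, shuffled) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

A small p-value means the classifier's error rate on the true labels is rarely matched by chance labellings, which is how a statistically significant but still fairly high error rate can arise on a small data set.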

In this project, only Random Forest was used for classification, which is just one of many available classifiers. Using other classifiers might yield different results and is worth looking into.
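For reference, the Gini index decrease reported in the attribute-importance figures measures how much a split on an attribute purifies the class distribution of the examples reaching a tree node. A minimal sketch of the quantity (illustrative only, not the implementation used in the project):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum_c p_c^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity decrease for one split of `parent` into `left`/`right`.

    Random Forest attribute importance accumulates this quantity over
    every split in the forest that uses the attribute; a larger total
    decrease means a more important attribute.
    """
    n = len(parent)
    return (gini_impurity(parent)
            - len(left) / n * gini_impurity(left)
            - len(right) / n * gini_impurity(right))
```

A perfectly class-separating split of a two-class node yields the maximum possible decrease (0.5 for balanced classes), while an uninformative split yields a decrease near zero.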

7 Acknowledgments

I would like to thank:

My supervisor, Claes Andersson, for his continuous support, patience and advice during the project.

Karl Sköld, Mats Borén, Marcus Söderquist and Beatrice Orback at Denator AB for the opportunity to conduct this degree project.

Mats Gustafsson for his time and efforts in reviewing this report.

Kim Kultima (Uppsala University) for his valuable insights into the peptide identification process.


Appendix A Software setup

SpectraST version 4.0, included in TPP version 4.5.2, was built as a standalone version using MinGW and a modified version of the included makefile in order to make it build in a Windows 7 environment. Other software used were: Internet Explorer version 9, Strawberry Perl version 5.12.3.0, R version 2.14.1, ProteoWizard release 3_0_3329 and SQLite (32-bit version).

A.1 Requirements

Operating system: Windows 7.

Web browser: Internet Explorer, with enabled Active X objects.

Software: Strawberry Perl, SQLite, R, ProteoWizard MsConvert

The following modules for Perl: DBI, XML::Simple and Getopt::Long.

A.2 Configure and build SpectraST

The following changes were made to SpectraST.

Source code:

SpectraSTPeaklist.cpp:

• at the top of the file: added #include <ctime>.

SpectraST_ramp.cpp:

• row 52: changed to #include "../../common/wglob.h".

SpectraST_Util.cpp:

• at the top of the file: added #include <io.h>.

SpectraSTMzXMLLibImporter.cpp:

• row 141: add one to getLastScan(), that is, change it to getLastScan()+1.

• row 166: loop from k < 0 to k < numScans.

SpectraST_cramp.cpp:

• row 147: change the last condition to arg < 0 instead of arg < 1.

Search parameters:

SpectraSTSearchParams.cpp:

peakScalingIntensityPower = 0.5

peakScalingUnassignedPeaks = 0.0

filterMaxPeaksUsed = 500

filterMaxDynamicRange = 10000.0
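The peakScalingIntensityPower parameter above raises every peak intensity to the given power before the dot product is computed. What that scaling does can be sketched as follows (an illustrative sketch, assuming peaks are represented as (m/z, intensity) pairs):

```python
def scale_intensities(peaks, intensity_power=0.5):
    """Raise each peak intensity to intensity_power.

    With the value 0.5 used above, intensities are square-rooted,
    which damps dominant peaks so that the dot product score is not
    ruled by a handful of high peaks (cf. the dot bias measure).
    Sketch only; assumes peaks as (mz, intensity) tuples.
    """
    return [(mz, intensity ** intensity_power) for mz, intensity in peaks]
```

Setting peakScalingUnassignedPeaks to 0.0, in the same spirit, removes the contribution of unassigned peaks entirely rather than merely damping them.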

Other:

Installed the package zlib (libz-dev) for MinGW.

Appendix B Source code

The SMASH software and source code are available on Google Code:

http://smash-library.googlecode.com/files/SMASH.rar

Appendix C User manual

SMASH

Spectra Matching and Annotation Software Helper

INSTRUCTION MANUAL

Written by Niklas Malmqvist.

Last updated: 2012-06-04

TABLE OF CONTENTS

Introduction
Folder structure and layout
Platform setup
Import spectra
Import annotations
Searching the library
Viewing the results
Mark annotations as checked
Extract spectra for further analysis

INTRODUCTION

This is a usage manual for the Spectra Matching and Annotation Software Helper (SMASH). It describes the layout and the various steps, from setting up the library from scratch to viewing search results.

The application is run by calling different Perl scripts from the Windows command line. All scripts are located in the folder called “Scripts”. Do not move any of the scripts to another folder – they won’t work if you do! Start the command line by pressing the Windows logo (“start” button), type “cmd” and press enter. Navigate to folders by typing “cd <folder name>”, e.g. cd C:/myfolder, and press enter. For more information on how to use the command line, type “help” and press enter.

Tip: you can auto-complete file paths and filenames by pressing the tab key after you’ve entered a partial file name or a partial file path.

FOLDER STRUCTURE AND LAYOUT

The following list explains the various subfolders and their content:

Annotations – Stores search output files from search engines such as The GPM or Mascot. Files are copied here automatically.
Backup – Contains backups of the database, in subfolders named after the date the backup was performed.
Javascript – Contains JavaScript library files.
Libimport – Contains the spectra library files.
Libsearch – Contains search output and results.
Logs – Contains log files.
pwiz – Contains ProteoWizard with MsConvert.
R – Contains scripts and plots for visualizing spectra matches.
Scripts – Contains the Perl scripts that run the platform.
SpectraST – Contains a modified version of SpectraST v4.0 Standalone.
SQLite – Contains the SQLite program and the annotation database file.

PLATFORM SETUP

Requirements:

Operating system: Windows 7.
Web browser: Internet Explorer, with ActiveX objects enabled.
Software: Strawberry Perl, SQLite, R, ProteoWizard MsConvert, RapidMiner 5.
The following Perl modules: DBI, XML::Simple, Config::Simple and Getopt::Long.
MinGW with the “libz-dev” package installed.

Unpack the file containing the platform into C:\.

Run the script setupdb.pl to create the annotation database; this creates a database called “pepdb.db”. Run the script with an input parameter that specifies the filename if you want to change it: perl setupdb.pl databasename

See “Configuration file” for more details.

CONFIGURATION FILE

This file holds a few parameters that configure the platform: directories to programs used by the platform, and certain parameters for similarity measures between spectra. It is important that this file has correctly set values, or the platform will not work.

The configuration file holds the following parameters.

SQLite database file – path to the annotation database file created by setupdb.pl. This does not need to be changed if setupdb.pl was run without specifying a database name. Should look similar to:

[sqlite] database="C:/Platform/SQLite/databasefile.db"

R installation path – the location of Rscript.exe included in the R installation. Should look similar to:

[r] installdir="C:/Program Files/R/R-2.14.1/bin/Rscript.exe"

Similarity – parameter that sets the F-value cutoff for determining the similarity between two spectra automatically (default: 0.65). Should look similar to:

[similarity] fval_cutoff=0.65

Note: the above file paths are just examples and may be different on your computer.

IMPORT SPECTRA

Run the script called Importer.pl with the following parameters:

-n Name of the library
-d Description, e.g. of the sample and the type of mass spectrometer used
-l Path to the folder containing files (in MGF format) with the spectra to be imported

Example: perl Importer.pl -n MyLibrary -d "Hypothalamus Orbitrap" -l C:\files\spectra
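For reference, an MGF file holds one block per spectrum; a minimal entry looks roughly like this (all values below are illustrative, not taken from real data):

```
BEGIN IONS
TITLE=example_spectrum_1
PEPMASS=842.5099
CHARGE=2+
204.0872 1834.2
347.1523 922.7
460.2364 3011.5
END IONS
```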

Note: if the library name, description or file path contains spaces, it must be enclosed in quotation marks (").

IMPORT ANNOTATIONS

There is currently support for search results from Mascot and The GPM. Annotations are imported by running the script ImpAnno.pl with the appropriate parameters:

-s Specifies the search engine. Options: m for Mascot, x for X!Tandem.
-r File containing a list of search hits. This file should be in pepXML for Mascot, and a tab-separated list from The GPM*.
-o Search engine output. A file that contains the detailed information displayed in the search engine.

* Important: this file must not be in any proprietary format such as .xlsx. Instead, open the list with e.g. Microsoft Excel and save it as a regular text file (.txt).

SEARCHING THE LIBRARY

Run the script LibSearch.pl with one parameter: the path to the file containing the query spectra.

Example: perl LibSearch.pl C:\files\myquery.mgf

The results will be put in C:\Platform\Libsearch\Results. The search output will be displayed in an HTML file named after the query file; in the example above it will be “myquery_results.html”. The search output will also be available as a tab-separated text file, which has the ending .xls and will be called “myquery_results.xls” in the example above.

Note: the HTML result page must be opened in Internet Explorer to function as intended. Plots are created for each match, so the search may take some time depending on the number of hits.

VIEWING THE RESULTS

The search results are presented as a table with information on the spectrum matches. This information includes:

Dot product score (D) – a measurement between 0 and 1 of how similar two spectra are.
Delta D – the difference in D between the top hit and the runner-up.
Dot bias – a measurement of how much D is dominated by a few peaks. Ranges between 0 and 1, where a value of 1 means that one peak makes up the entire score.
F-value – a measurement of the significance of the hit (see “<report.pdf>” for details).
m/z difference – precursor m/z difference between the query and the database spectrum.
Notes – notes on the experiment and/or instrument type, entered by the user when the spectra were imported.
Selected? – a checkbox for selecting spectra (see “Extract spectra for further analysis”).
Annotations – how many annotations (amino acid sequence, modifications etc.) the peptide has.
Spectra – a visualization of the match between each spectrum pair.

It is possible to sort the results by clicking inside a header in the table. The annotation information can be viewed by clicking on the number stating the number of annotations for a certain spectrum. Note that even if the number is 0, there can still be information regarding similarity to other peptide spectra and which mass spectrometry runs the peptide has appeared in.

MARK ANNOTATIONS AS CHECKED

Run the script called markchecked.pl. The input parameters are a text file containing the ids of the peptides you want to mark as checked, and whether to mark or unmark the annotations as checked. To mark them as checked, simply type “mark” as the second parameter, and “unmark” if you want to unmark them.

Example, marking annotations for the ids in id_list.txt as checked: perl markchecked.pl id_list.txt mark

The id list file should contain one id per row and look something like this:

Peptide_id_23512.42.123
Peptide_id_28512.12.163
Peptide_id_97126.26.525
…

EXTRACT SPECTRA FOR FURTHER ANALYSIS

In the search result view, mark the checkboxes for the spectra you are interested in and click “Confirm peptides”. Allow the ActiveX control to run if prompted by the browser (i.e. click “yes” in the pop-up window). This creates a file with the selected spectra ids, which is saved in Libsearch/Searches and is named after the query file plus “_spectralist”. For example, if a query is named “SpectFile3.mgf”, the file with the ids is called “SpectFile3_spectralist”.

Run the script PrepareQueryMGF.pl with the id file discussed above and the location of the mgf file that was used in the query. If the file “MySpectraFile123.mgf” is in C:\files\mgfdata, the following command would be given: perl PrepareQueryMGF.pl C:\files\mgfdata SpectFile3_spectralist

This creates an mgf file, called “SpectFile3_spectra.mgf” in this example, which is saved in Libsearch/Searches. The file contains the full spectrum information for each spectrum whose checkbox was marked in the search result view. This information consists of: precursor mass, charge, scans, retention time, as well as m/z and intensity for each peak.