

Navigation, visual exploration and curation of large-scale proteomics data with Proline

David Bouyssié1*, Anne-Marie Hesse2*, Emmanuelle Mouton-Barbosa1*, Magali Rompais3*, Christine Carapito3*, Véronique Dupierris2*, Alexandre Burel3*, Aymen Romdhani3*, Christophe Bruley2*

1 Institut de Pharmacologie et de Biologie Structurale, CNRS UMR5089, Université de Toulouse
2 Laboratoire Biologie à Grande Echelle, U1038 INSERM/CEA/UJF, iRTSV, CEA Grenoble
3 Laboratoire de Spectrométrie de Masse BioOrganique (LSMBO), IPHC-DSA, Université de Strasbourg, CNRS UMR7178
* ProFI, Proteomics French Infrastructure

Due to the intrinsic complexity of bottom-up proteomics experiments, inaccuracies and errors can occur throughout the data-processing pipeline. Accordingly, result reliability must be carefully assessed, not only by statistically controlled procedures, but also through examination of the underlying data by experts. The Proline software suite provides a unique environment that combines a structured representation of data and metadata, based on an underlying relational database, with an interactive graphical user interface. From this interface, users can navigate through large amounts of data, which allows human experts to examine and manually curate all results.

http://proline.profiproteomics.fr/

Workflow Steps

[Diagram label: Proline Datastore]

Import & Organize search results: Proline can import results from Mascot, OMSSA or X!Tandem. Search results and identification summaries can be merged to build a parent dataset that takes into account the peptides identified in all merged datasets.

Validate identifications: Validation can be performed at the peptide-spectrum match (PSM) and protein levels. A set of predefined filters can be applied to accept or reject a PSM or a protein, depending on user-specified threshold values applied to different properties. A target-decoy validation approach can also be performed by adjusting the false discovery rate (FDR) to a user-defined value at both levels.

LC-MS feature extraction: Proline first detects chromatographic peaks from raw MS data contained in mzDB files and assigns those LC-MS peaks to the validated PSMs.

Cross Assignment: The software then aligns the LC-MS maps representing identified signals to assign unidentified peaks to peptide ions across all runs.

Protein Quantitation: The ion abundances are stored in the database, and Proline can then reuse the quantitative data to summarize peptide ion measurements as protein abundances using various summarization methods. A change of method does not require the whole quantification process to be restarted.

Data exploration & Visualization

In Proline, results from each step of the workflow can be displayed and inspected, starting with the MS/MS spectra and their sequence interpretations. Validated and rejected PSMs can both be displayed and searched using different criteria.

Users can browse the content of identification results through a set of predefined views in Proline. The view represented here shows validated protein sets for an identification dataset.

Navigation within data is facilitated by allowing users to define their own navigation path through the data. A navigation path is materialized by a new window layout in which views are dynamically updated depending on the user’s selection.

In the case of LC-MS quantification, peptides measured across aligned MS runs are represented together. For each peptide, the quantified ions and their extracted elution peaks can be viewed, as well as the isotope elution peaks. Suspect profiles can be individually discarded by the user, and protein quantification can then be recomputed without the invalidated measurements (as exemplified by co-eluting peaks, in the above snapshot).

The alignment process is a critical step in label-free quantification. Proline provides an effective tool to check the quality of alignments by viewing the estimated retention time difference between runs.

Proline also offers a web graphical user interface to remotely access Proline Server via a web browser.

Performance was assessed with a proteomic standard dataset composed of an equimolar mixture of 48 human proteins (UPS1, Sigma) spiked at different concentrations (from 10 amol to 50 fmol) into a yeast cell lysate background.
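As a small worked example of this benchmark design, the expected (theoretical) log2 ratio between two spike-in levels follows directly from the spiked amounts. The sketch below assumes the three comparisons named in the figure legend (50 fmol versus 500 amol, 5 fmol and 25 fmol); the function name and the dictionary are illustrative only and are not part of Proline.

```python
import math


def theoretical_log2_ratio(numerator_fmol: float, denominator_fmol: float) -> float:
    """Expected log2 fold change between two UPS1 spike-in amounts (in fmol)."""
    return math.log2(numerator_fmol / denominator_fmol)


# Comparisons as labelled in the figure legend; 500 amol = 0.5 fmol.
comparisons = {
    "50f-500a": (50.0, 0.5),
    "50f-5f": (50.0, 5.0),
    "50f-25f": (50.0, 25.0),
}

# Map each comparison label to its expected log2 ratio.
expected = {
    name: theoretical_log2_ratio(high, low)
    for name, (high, low) in comparisons.items()
}

if __name__ == "__main__":
    for name, value in expected.items():
        print(f"{name}: {value:.2f}")
```

Under this design the yeast background is the same in every sample, so background proteins should ideally show a log2 ratio close to 0.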

[Figure: benchmark comparison of MaxQuant and Mascot + Proline. Top panels: -log10(p-value) versus Welch t-test difference. Bottom panels: experimental log2(ratio) versus theoretical log2(ratio), showing the expected linearity and the median ratio of the 48 proteins. Legend: Yeast, 50f-500a, 50f-5f, 50f-25f.]

In Proline, different summarization methods are proposed, ranging from simple summation to a MaxLFQ-like procedure called MRF (Median Ratio Fitting). Proteins of the mixed dataset were classified as variant after application of different p-value thresholds. Sensitivity (TPR = TP/144) was plotted as a function of false-discovery proportion (FDP = FP / (TP + FP)).
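The two metrics above can be computed as in the following minimal sketch. It assumes that each protein-level test is available as a p-value together with a flag saying whether the protein is a true variant; the denominator of 144 comes from the text and presumably corresponds to the 48 UPS1 proteins in three pairwise comparisons. The function name, the `<=` convention at the threshold and the toy data are assumptions made for illustration; they are neither Proline code nor poster results.

```python
from typing import Iterable, Tuple

TOTAL_TRUE_VARIANTS = 144  # denominator given in the text (TPR = TP/144)


def tpr_fdp(
    tests: Iterable[Tuple[float, bool]],
    p_threshold: float,
    total_positives: int = TOTAL_TRUE_VARIANTS,
) -> Tuple[float, float]:
    """Return (TPR, FDP) when tests with p <= p_threshold are called variant.

    Each test is a (p_value, is_true_variant) pair.
    """
    tp = fp = 0
    for p_value, is_true_variant in tests:
        if p_value <= p_threshold:
            if is_true_variant:
                tp += 1
            else:
                fp += 1
    tpr = tp / total_positives
    called = tp + fp
    # With no protein called variant, FDP is reported as 0 by convention.
    fdp = fp / called if called else 0.0
    return tpr, fdp


# Toy data, invented for illustration only (three true variants in total).
toy_tests = [(0.001, True), (0.004, True), (0.02, False), (0.03, True), (0.2, False)]

# One (threshold, TPR, FDP) point per p-value threshold.
toy_curve = [(t, *tpr_fdp(toy_tests, t, total_positives=3)) for t in (0.01, 0.05)]
```

Sweeping the threshold over a range of p-values and plotting TPR against FDP gives a curve of the kind described above, which makes it possible to compare pipelines at equal false-discovery proportion.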