Mass spectrometry and proteomics - Lecture 5...Matthias Trost, Newcastle University...
TRANSCRIPT
Previously
• Proteomics
• Sample prep
Lecture 5
• Quantitation techniques
• Search algorithms
• Proteomics software
Current limitations of MS-based Proteomics
Bantscheff et al, Anal Bioanal Chem, 2007
• Cellular proteins span a wide range of expression, and current mass spectrometric technologies typically sample only a fraction of all the proteins present in a sample.
• Due to limited data quality, only a fraction of all identified proteins can also be reliably quantified.
Limitations of Proteomics – concentration of proteins in plasma
Anderson & Anderson, MCP, 2002
Quantitation techniques
Label-free
• Ion intensity
• Spectral counting
Chemical isotopic labeling
• ICAT
• iTRAQ/TMT
• mTRAQ
• Formaldehyde label
• Enzymatic label
Metabolic isotopic labeling
• SILAC
• 15N
The three different spectral sources of quantitative information
Wilm, Proteomics, 2010
Quantitation methods
• Isotope label (SILAC, ICAT, dimethyl label etc.)
• Fragmentation-based label (iTRAQ)
• Label-free
(Figure labels: MS, MS/MS, X Da)
Quantitation strategies
Bantscheff et al, Anal Bioanal Chem, 2007
Characteristics of quantitative MS methods
Bantscheff et al, Anal Bioanal Chem, 2007
Label-free quantitation
Workflow figure labels (Condition A and Condition B):
• Peak detection (in triplicate) for each condition
• Hierarchical clustering
• MS/MS
• MASCOT
• Identification-driven peptide assignment
Label-free proteomics
Advantages and Disadvantages
+ Lower complexity
+ Lower cost
+ Primary tissue possible
(+) Repetitions increase identification rates
- High LC reproducibility necessary
- Good clustering dependent on high mass accuracy
- Several peptides required for reliable quantitation
Example (figure): peptide RLEIpSPDpSpSPER, Cond. A vs Cond. B
Stdev Cond. A: 0.089; Stdev Cond. B: 0.067; Ratio Cond. A/Cond. B: 0.49
Another label-free quantitation: Spectral counting
• The number of spectra matched to peptides from a protein is used as a surrogate measure of protein abundance.
• As the sampling of peptides in a mass spectrometer usually depends on the peptides’ intensities, spectral counting has reasonable statistical significance.
• Spectral counting is cheaper, easier to implement and does not require highly reproducible data.
• However, it still requires thorough computational and statistical analysis.
• Modern mass spectrometers are becoming too sensitive and too fast for this kind of quantitation.
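A minimal sketch of spectral counting, assuming the peptide-spectrum matches (PSMs) have already been assigned to proteins by a search engine. The function name `spectral_counts`, the input format, the protein names and the PSM lists are all hypothetical and made up for illustration.

```python
from collections import Counter

def spectral_counts(psms):
    """Count matched spectra per protein.

    psms: iterable of (spectrum_id, protein) pairs (hypothetical input format).
    """
    return Counter(protein for _spectrum_id, protein in psms)

# Made-up PSM assignments for two conditions.
condition_a = [(1, "Protein X"), (2, "Protein X"), (3, "Protein Y"), (4, "Protein X")]
condition_b = [(1, "Protein X"), (2, "Protein Y"), (3, "Protein Y")]

counts_a = spectral_counts(condition_a)
counts_b = spectral_counts(condition_b)

# Print the spectral count of every protein in both conditions.
for protein in sorted(set(counts_a) | set(counts_b)):
    print(protein, counts_a[protein], counts_b[protein])
```

In practice the raw counts are usually normalised (for example for protein length and total number of spectra) before conditions are compared; that step is left out here.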
Isobaric tag for relative and absolute quantitation (TMT or iTRAQ)
• Reacts with N-termini and other primary amines of peptides.
• Uses a reporter group for quantification that can be identified in MS/MS spectra.
• Another labeled group serves as a balancer.
https://www.thermofisher.com/
Isobaric tag for relative and absolute quantitation (TMT or iTRAQ)
• Quantification is done in MS/MS mode (low intensity!)
• Once labeled with TMT or iTRAQ, the 4/6/8/10 individual samples are pooled for further processing and analysis.
• During subsequent MS/MS of the peptides, each isobaric tag produces a unique reporter ion that identifies which sample the peptide originated from and reports its relative abundance.
Gingras et al, Nat Rev Mol Cell Biol, 2007
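As a rough illustration of how reporter ions translate into relative abundance, the sketch below divides each reporter intensity by the summed intensity of all channels in one MS/MS spectrum. The channel labels and intensities are made-up values, and `relative_abundance` is a hypothetical helper, not part of any vendor software.

```python
def relative_abundance(reporter_intensities):
    """Return each channel's share of the total reporter ion signal."""
    total = sum(reporter_intensities.values())
    return {channel: intensity / total
            for channel, intensity in reporter_intensities.items()}

# Hypothetical reporter ion intensities from a single MS/MS spectrum.
reporters = {"126": 1000.0, "127": 2000.0, "128": 1000.0, "129": 4000.0}
fractions = relative_abundance(reporters)

for channel, fraction in fractions.items():
    print(channel, fraction)
```

Real workflows additionally correct for isotopic impurities of the tags and normalise between channels; both are omitted in this sketch.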
Isobaric tag for relative and absolute quantitation (iTRAQ or TMT)
+ Up to 11 samples (11-plex) can be quantified at the same time.
+ Saves instrument time.
- Quite expensive.
- Low dynamic range.
- Cannot be performed on most ion-trap instruments, as they do not reach the low mass range of the reporter ions.
- Non-changing peptides are preferentially identified.
- Large mass addition to peptides.
- High ratios are suppressed by other co-eluting peptides.
www.thermo.com
Ratio compression in TMT experiments
Ow, J Prot Res, 2009; Ting et al, Nature Methods, 2011
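A small worked example of why co-isolated peptides compress ratios. It assumes a simple additive model in which the interfering peptides contribute the same reporter signal to both channels (i.e. they do not change between samples). All intensities are invented and `observed_ratio` is a hypothetical helper.

```python
def observed_ratio(target_a, target_b, interference_a, interference_b):
    """Reporter ratio A/B when interfering signal is added to both channels."""
    return (target_a + interference_a) / (target_b + interference_b)

# Target peptide: true 10:1 ratio between sample A and sample B.
true_ratio = 1000.0 / 100.0

# Co-isolated, non-changing peptides add 100 units to each reporter channel.
compressed_ratio = observed_ratio(1000.0, 100.0, 100.0, 100.0)

print(true_ratio, compressed_ratio)
```

Under these assumptions the measured ratio (1100/200) is pulled towards 1:1, which is the behaviour described as ratio compression.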
Reducing ratio compression by using Synchronous Precursor Selection (SPS)
Formaldehyde/dimethyl label
• Samples are labeled with heavy and light formaldehyde on their primary amines (N-termini, Lys)
• Relatively cheap and simple.
• Can be used on virtually any sample.
• Quite large mass difference between samples.
• Retention time shifts caused by deuterium are problematic in long LC runs.
Chen et al, Anal Chem, 2003; Boersema et al, Proteomics, 2008
Formaldehyde/dimethyl label
Chen et al, Anal Chem, 2003
Enzymatic isotope label
• Often incomplete incorporation of the label
• Further disadvantage: introduction of 18O at acidic side chains
Miyagi et al, Mass Spec Rev, 2006
Stable isotope labeling with amino acids in cell culture (SILAC)
• Cells are grown with “normal” and heavy isotope amino acids.
+ The isotopically labeled peptides are chemically (almost) identical (retention time etc.)
+ The different samples are mixed at a very early step during sample preparation.
- Labeled amino acids (Lys/Arg) might be metabolized to other amino acids.
- Expensive for large amounts of cells.
- Not for primary tissue.
- Increases complexity of the sample.
- Some cell types do not grow well in dialysed serum.
commons.wikimedia.org
Neutron encoding (NeuCode) SILAC
• Makes use of the subtle mass differences caused by nuclear binding energy variation in stable isotopes (“mass defect”).
• For example, labelling with lysine carrying 2H8 (+8.0502 Da) and lysine carrying 13C6 and 15N2 (+8.0142 Da) gives a mass difference of 36 mDa.
• Can only be resolved with very high resolution (>200,000).
• In a low-resolution (<15,000) MS/MS scan, the peaks overlap and are indistinguishable, so both peaks add to the intensity.
• Theoretically, up to 39 isotopologues of lysine are possible.
Herbert et al, Nature Methods, 2013; Rose et al, Anal Chem, 2013
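A back-of-the-envelope sketch of the resolving power involved, using the two lysine mass shifts quoted above. It assumes a peptide with a single lysine, a charge state of 2+ and the nominal definition R = (m/z) / Δ(m/z); the precursor m/z of 827 is borrowed from the figure caption, while the charge state and the helper name `nominal_resolving_power` are assumptions made for illustration.

```python
HEAVY_LYS_2H8 = 8.0502        # Da, mass shift quoted on the slide
HEAVY_LYS_13C6_15N2 = 8.0142  # Da, mass shift quoted on the slide

def nominal_resolving_power(mz, charge, n_lysines=1):
    """Nominal R = (m/z) / delta(m/z) for a NeuCode peptide pair."""
    delta_mass = (HEAVY_LYS_2H8 - HEAVY_LYS_13C6_15N2) * n_lysines
    delta_mz = delta_mass / charge  # spacing shrinks with charge state
    return mz / delta_mz

# Mass difference between the two labels, in mDa.
delta_mda = round((HEAVY_LYS_2H8 - HEAVY_LYS_13C6_15N2) * 1000, 1)

# Assumed example: doubly charged precursor at m/z 827 with one lysine.
r_nominal = nominal_resolving_power(827.0, charge=2)
print(delta_mda, round(r_nominal))
```

This nominal value (roughly 46,000 under these assumptions) is only a lower bound: it corresponds to peaks separated by about one peak width, so cleanly separated peaks for quantitation need considerably more resolving power.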
Neutron encoding (NeuCode) SILAC
Herbert et al, Nature Methods, 2013
(a) Mass calculations of the 39 isotopologues for a +8-Da lysine. Shown in solid black are the isotopologues used for the experiments presented here. (b) Theoretical calculations depicting the percentage of peptides that are resolved (full width at 1% maximum peak height) when spaced 12, 18 or 36 mDa apart for resolving powers (R) of 15,000–1,000,000. (c) Top, MS1 scan collected with typical 30,000 resolving power. Center, a selected precursor with m/z at 827 collected with 30,000 resolving power (black) and the signal recorded in a high-resolution MS1 scan (480,000 resolving power).
Protein Identification
• Either “de novo” (thus no database) or from genomic data.
• When genomic data is available, the software performs an in silico digestion of the whole database using the specific protease.
• The measured peptide mass and MS/MS spectrum are compared to the theoretical mass and the theoretical spectrum.
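The in silico digestion step can be sketched as follows, assuming trypsin as the protease (cleavage C-terminal to K or R, but not before P) and no missed cleavages. The sequence is the ubiquitin entry used later in this lecture; the residue masses are standard monoisotopic values, and the function names are hypothetical.

```python
# Standard monoisotopic residue masses (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # Da, added once per peptide

def digest_trypsin(sequence):
    """Cleave after K/R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):  # C-terminal peptide without K/R
        peptides.append(sequence[start:])
    return peptides

def monoisotopic_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER

UBIQUITIN = ("MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQR"
             "LIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG")

peptides = digest_trypsin(UBIQUITIN)
for peptide in peptides:
    print(peptide, round(monoisotopic_mass(peptide), 4))
```

A real search engine would also generate peptides with missed cleavages and with the fixed and variable modifications selected for the search.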
Search Engines
• Good search engines take common rules (high peaks after P) into account.
• The engine calculates a score from the number of matched peaks compared to the peaks present in the spectrum.
• This score is usually linked to a probability.
• Lately, search engines using spectral libraries have emerged. They are much faster and more accurate. However, good spectra are required for each peptide, ideally acquired on different kinds of instruments.
Peptide ID & matching
For large-scale proteomics, identification of peptides becomes a complex matching problem.
Figure: Proteome (UniProt database) → in silico digestion → peptide masses (Peptide A mass, Peptide B mass) → in silico fragmentation → fragment masses (Peptide A fragment masses, Peptide B fragment masses)
Database Search
Observed mass: 1000 ± 0.010 Da, with corresponding MS2 data (figure: intensity vs m/z)
The database search:
1. MS1 filter
2. MS2 scoring
3. Probabilistic analysis
Database Search – MS1 filter
Observed mass: 1000 ± 0.010 Da
Candidate peptide masses:
• Peptide A: 999.980
• Peptide B: 999.993
• Peptide C: 1000.005
• Peptide D: 1000.010
• Peptide E: 1000.025
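The MS1 filter can be sketched with the numbers from this slide: candidates are kept only if their theoretical mass lies within the tolerance of the observed mass. The function name `ms1_filter` is hypothetical, and the tiny `eps` is an assumption added so that a candidate sitting exactly on the tolerance boundary (Peptide D) is not lost to floating-point rounding.

```python
def ms1_filter(observed_mass, tolerance, candidates, eps=1e-9):
    """Keep candidates whose mass is within +/- tolerance of the observed mass."""
    return {
        name: mass
        for name, mass in candidates.items()
        if abs(mass - observed_mass) <= tolerance + eps
    }

# Candidate masses as shown on the slide (Da).
candidates = {
    "Peptide A": 999.980,
    "Peptide B": 999.993,
    "Peptide C": 1000.005,
    "Peptide D": 1000.010,
    "Peptide E": 1000.025,
}

kept = ms1_filter(observed_mass=1000.0, tolerance=0.010, candidates=candidates)
print(list(kept))
```

Search engines usually express this tolerance in ppm rather than in Da; the absolute window is used here only because the slide does.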
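Database Search – theoretical MS/MS spectra
Observed mass: 1000 ± 0.010 Da
Candidates remaining after the MS1 filter:
• Peptide B: 999.993
• Peptide C: 1000.005
• Peptide D: 1000.010
Theoretical MS/MS spectra of these candidates are compared with the observed spectra.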
Database Search – scoring
Observed mass: 1000 ± 0.010 Da
Theoretical spectra are scored against the observed spectra. Scores of the candidate peptides: 9, 80, 1
Peptide evidence: Peptide C (mass 1000.005), score 80
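A deliberately simple sketch of MS2 scoring: the fraction of observed peaks that match a theoretical fragment within a mass tolerance. This is a toy shared-peak score, not the scoring function of Mascot or any other engine, and all fragment m/z values, the tolerance and the function name are invented for illustration.

```python
def matched_peak_score(theoretical_mz, observed_mz, tolerance=0.02):
    """Fraction of observed peaks explained by a theoretical fragment."""
    matched = sum(
        1 for peak in observed_mz
        if any(abs(peak - fragment) <= tolerance for fragment in theoretical_mz)
    )
    return matched / len(observed_mz)

# Made-up observed spectrum and theoretical fragment lists.
observed = [175.119, 288.203, 387.271, 534.340, 647.424]
theoretical = {
    "Peptide B": [175.119, 300.100, 420.200, 560.300, 690.400],
    "Peptide C": [175.119, 288.203, 387.272, 534.340, 700.000],
    "Peptide D": [150.000, 260.100, 370.200, 480.300, 590.400],
}

scores = {name: matched_peak_score(fragments, observed)
          for name, fragments in theoretical.items()}
best_match = max(scores, key=scores.get)
print(scores, best_match)
```

Real engines go further by weighting peak intensities and by converting the raw score into a probability, as noted on the Search Engines slide.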
Search constraints
• “Classic”
– Peptide/precursor mass accuracy
– MS/MS/fragment mass accuracy
– Fixed and variable modifications
– Enzyme (specificity)
– Instrument/type of ions generated
• Proposed
– Retention time
Commonly used Search Engines
• Mascot
• Sequest
• OMSSA
• X!Tandem
• Andromeda (within MaxQuant)
• …
Decoy/target strategy to determine FDR
$\mathrm{PEP} = \dfrac{\#\ \text{hits in decoy database}}{\#\ \text{hits}}$ at a given score
• Probability that a match of score 100 is incorrect: ~0
• Probability that a match of score 10 is incorrect: ~90%
Example:
>Ubiquitin
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
Target database: MQIFVK
Decoy database: VFIQMK
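The decoy peptide on the slide (MQIFVK → VFIQMK) is consistent with reversing the sequence while keeping the C-terminal residue in place, so that the decoy still looks like a tryptic peptide. The sketch below assumes that rule; `reverse_decoy` is a hypothetical helper, and real tools may build decoys differently (for example by reversing whole proteins before digestion).

```python
def reverse_decoy(peptide):
    """Reverse a peptide but leave its C-terminal residue where it is."""
    return peptide[:-1][::-1] + peptide[-1]

# Target peptides taken from the ubiquitin example in this lecture.
target_peptides = ["MQIFVK", "TLSDYNIQK", "ESTLHLVLR", "EGIPPDQQR"]
decoy_peptides = [reverse_decoy(peptide) for peptide in target_peptides]

for target, decoy in zip(target_peptides, decoy_peptides):
    print(target, "->", decoy)
```

Because only the order of residues changes, each decoy has the same mass and amino acid composition as its target, which is what makes decoy hits a useful model for random matches.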
False-Discovery Rate
• Peptide/protein identification by mass spectrometry is a statistical analysis with false negatives and false positives.
• The false-discovery rate (FDR) is estimated by searching the data against a combined forward and reversed database. The number of hits from the reversed database is assumed to equal the number of false hits in the forward database.
• Please note that the FDR applies to the identification level only, not to the quantitation level.
• Commonly accepted FDRs are <1%.
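A minimal sketch of the target/decoy estimate described above: at a given score threshold, the number of decoy hits is taken as an estimate of the number of false target hits. The PSM scores below are invented, and `estimate_fdr` is a hypothetical helper using one common simple form of the estimate (decoy hits divided by target hits).

```python
def estimate_fdr(psms, threshold):
    """Estimate FDR as decoy hits / target hits at or above a score threshold.

    psms: list of (score, is_decoy) tuples (hypothetical input format).
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

# Made-up search results from a combined forward + reversed database.
psms = [
    (100, False), (80, False), (60, False), (45, False),
    (40, True), (20, False), (15, True), (10, True),
]

fdr_at_30 = estimate_fdr(psms, threshold=30)
fdr_at_50 = estimate_fdr(psms, threshold=50)
print(fdr_at_30, fdr_at_50)
```

Raising the threshold removes decoy hits faster than target hits, so in practice the score cutoff is chosen as the lowest value at which the estimated FDR stays below the accepted level.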
Considerations
• We accept that a very small proportion of peptide identifications (usually set to 1%) will likely be false discoveries.
• Hence, having multiple supporting peptides per protein is important for confident identification and quantitation.
Considerations
• FDR estimation is challenging when using small databases or when most of the database is identified. Always use bigger databases (for example, search a human database together with the bacterial database).
Considerations
• Choose your PTMs wisely
• Too many PTMs lead to combinatorial explosion and long database search times
• Common chemical modifications
– Deamidation (NQ)
– Gln → PyroGlu
– Oxidation (M)
– Carbamidomethylation (C)
– Acetyl (N-terminus)
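To make the combinatorial explosion concrete, the sketch below counts how many differently modified forms of one peptide a search engine would have to consider when every modifiable residue may or may not carry a variable modification. The helper name is hypothetical, and phosphorylation (STY) is added here as an assumed extra variable modification; it is not on the list above.

```python
from math import prod

def count_modified_forms(peptide, variable_mods):
    """Number of modification states of a peptide.

    variable_mods maps a modification name to the residues it can modify.
    Each residue is either unmodified or carries one applicable modification.
    """
    return prod(
        1 + sum(residue in targets for targets in variable_mods.values())
        for residue in peptide
    )

peptide = "TLSDYNIQK"  # tryptic peptide from the ubiquitin example

forms_oxidation = count_modified_forms(peptide, {"Oxidation": "M"})
forms_deamidation = count_modified_forms(peptide, {"Oxidation": "M", "Deamidation": "NQ"})
forms_with_phospho = count_modified_forms(
    peptide, {"Oxidation": "M", "Deamidation": "NQ", "Phospho": "STY"}
)
print(forms_oxidation, forms_deamidation, forms_with_phospho)
```

Each additional variable modification multiplies the number of candidates per peptide, and this happens for every peptide in the database, which is why search times grow so quickly.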
Considerations
The vast majority of MS identification and quantitation is performed on peptides; information on proteins is obtained through inference.
The peptide-to-protein relationship is a “many to many” match.
Figure: peptides VFIQMK, TLSDYNIQK, ESTLHLVLR, EGIPPDQQR and MQIFVK mapped to Protein A, Protein B and Protein C.
Considerations
Assigning non-unique peptides:
• “Occam’s Razor”: accept the simplest explanation that fits the observations.
• Non-unique peptides are assigned to the proteins that have the most unique peptides.
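The assignment rule can be sketched as below. The peptide-to-protein mapping is hypothetical (the figure's actual mapping is not recoverable from the transcript), `assign_shared_peptides` is an illustrative helper, and ties are broken arbitrarily by taking the first protein listed, which real software handles more carefully.

```python
def assign_shared_peptides(peptide_to_proteins):
    """Assign each peptide to the candidate protein with the most unique peptides."""
    # Count peptides that map to exactly one protein.
    unique_counts = {}
    for proteins in peptide_to_proteins.values():
        for protein in proteins:
            unique_counts.setdefault(protein, 0)
        if len(proteins) == 1:
            unique_counts[proteins[0]] += 1

    # Shared peptides go to the protein with the highest unique count.
    return {
        peptide: max(proteins, key=lambda protein: unique_counts[protein])
        for peptide, proteins in peptide_to_proteins.items()
    }

# Hypothetical mapping of identified peptides to database proteins.
peptide_to_proteins = {
    "MQIFVK": ["Protein A"],
    "TLSDYNIQK": ["Protein A"],
    "ESTLHLVLR": ["Protein A", "Protein B"],
    "EGIPPDQQR": ["Protein B", "Protein C"],
    "LIFAGK": ["Protein C"],
}

assignment = assign_shared_peptides(peptide_to_proteins)
for peptide, protein in assignment.items():
    print(peptide, "->", protein)
```

In this made-up example Protein B ends up with no peptides at all, so it would not be reported: the observations are fully explained by Proteins A and C.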
Check your data: histograms
• Evaluate distribution of data
• Normalise data
• Calculate standard deviation to set cutoffs
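One possible way to carry out these three steps on protein ratios, sketched with the standard library only. It assumes log2 transformation, median centring as the normalisation and a cutoff of two standard deviations; all three choices, the ratios and the function name are assumptions for illustration, not the lecture's prescribed method.

```python
import math
import statistics

def normalise_log2_ratios(ratios):
    """Log2-transform ratios and centre them on their median."""
    log_ratios = [math.log2(ratio) for ratio in ratios]
    median = statistics.median(log_ratios)
    return [value - median for value in log_ratios]

# Made-up protein ratios (Cond. A / Cond. B) with a systematic 2-fold shift.
ratios = [1.0, 2.0, 4.0, 2.0, 2.0]

normalised = normalise_log2_ratios(ratios)
spread = statistics.stdev(normalised)
cutoff = 2 * spread  # assumed rule of thumb: two standard deviations

changing = [value for value in normalised if abs(value) > cutoff]
print(normalised, round(spread, 3), round(cutoff, 3), changing)
```

With these invented numbers no protein exceeds the cutoff, which illustrates the point of the slide: what counts as a meaningful change depends on the spread of the data itself.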
Check your data: scatter plots
• Intensities vs intensities
• Reproducibility
Check your data: volcano plots
• Evaluate experimental reproducibility (0.05 is the usual p-value cutoff)
• The appropriate fold-change cutoff depends on the standard deviation
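A volcano plot shows log2 fold change on the x-axis against -log10(p-value) on the y-axis. The sketch below computes those coordinates and flags proteins that pass both cutoffs; it assumes the p-values were already obtained from a statistical test on replicates, and the protein names, values, fold-change cutoff and function name are invented.

```python
import math

def volcano_points(results, p_cutoff=0.05, log2fc_cutoff=1.0):
    """Return (name, log2 fold change, -log10 p, passes_cutoffs) per protein."""
    points = []
    for name, log2_fold_change, p_value in results:
        passes = p_value < p_cutoff and abs(log2_fold_change) >= log2fc_cutoff
        points.append((name, log2_fold_change, -math.log10(p_value), passes))
    return points

# Made-up results: (protein, log2 fold change, p-value).
results = [
    ("Protein 1", 2.0, 0.001),   # large and reproducible change
    ("Protein 2", 0.2, 0.0001),  # reproducible but tiny change
    ("Protein 3", -1.5, 0.01),   # reproducibly down
    ("Protein 4", 3.0, 0.2),     # large change, poor reproducibility
]

points = volcano_points(results)
hits = [name for name, _fc, _neg_log_p, passes in points if passes]
print(hits)
```

The fold-change cutoff of 1.0 (2-fold) is only a placeholder; as stated above, it should be derived from the standard deviation of the dataset.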
Databases
• UniProt databases are the standard for mouse, human and most other organisms.
• They should ideally be non-redundant.
• Can/should contain splice variants.
• The database should not be too small (a problem for bacteria), as the FDR calculation might otherwise be wrong.
• A common set of contaminants (keratin, BSA, milk proteins…) should be added to the searched database.
Software for MS ID and Quant
Software Platforms
• MaxQuant
• Trans Proteomic Pipeline (TPP)
• Proteome Discoverer
• PEAKS
• Scaffold
SRM/Targeted
• Skyline
ID only
• Mascot
• Sequest
• OMSSA
• Morpheus
de novo sequencing
• PEAKS
TMT quantitation
• COMPASS
• MaxQuant
• Proteome Discoverer